github.com/cosmos/cosmos-sdk@v0.50.10/docs/architecture/adr-027-deterministic-protobuf-serialization.md

# ADR 027: Deterministic Protobuf Serialization

## Changelog

* 2020-08-07: Initial Draft
* 2020-09-01: Further clarify rules

## Status

Proposed

## Abstract

Fully deterministic structure serialization, which works across many languages and clients,
is needed when signing messages. We need to be sure that whenever we serialize
a data structure, no matter in which supported language, the raw bytes
will stay the same.
[Protobuf](https://developers.google.com/protocol-buffers/docs/proto3)
serialization is not bijective (i.e. there exists a practically unlimited number of
valid binary representations for a given protobuf document)<sup>1</sup>.

This document describes a deterministic serialization scheme for
a subset of protobuf documents that covers this use case but can be reused in
other cases as well.

### Context

For signature verification in Cosmos SDK, the signer and verifier need to agree on
the same serialization of a `SignDoc` as defined in
[ADR-020](./adr-020-protobuf-transaction-encoding.md) without transmitting the
serialization.

Currently, for block signatures we are using a workaround: we create a new [TxRaw](https://github.com/cosmos/cosmos-sdk/blob/9e85e81e0e8140067dd893421290c191529c148c/proto/cosmos/tx/v1beta1/tx.proto#L30)
instance (as defined in [adr-020-protobuf-transaction-encoding](https://github.com/cosmos/cosmos-sdk/blob/main/docs/architecture/adr-020-protobuf-transaction-encoding.md#transactions))
by converting all [Tx](https://github.com/cosmos/cosmos-sdk/blob/9e85e81e0e8140067dd893421290c191529c148c/proto/cosmos/tx/v1beta1/tx.proto#L13)
fields to bytes on the client side. This adds an additional manual
step when sending and signing transactions.

### Decision

The following encoding scheme is to be used by other ADRs,
and in particular for `SignDoc` serialization.

## Specification

### Scope

This ADR defines a protobuf3 serializer. The output is a valid protobuf
serialization, such that every protobuf parser can parse it.

No maps are supported in version 1 due to the complexity of defining a
deterministic serialization. This might change in the future. Implementations must
reject documents containing maps as invalid input.

### Background - Protobuf3 Encoding

Most numeric types in protobuf3 are encoded as
[varints](https://developers.google.com/protocol-buffers/docs/encoding#varints).
Varints are at most 10 bytes, and since each varint byte has 7 bits of data,
varints are a representation of `uint70` (70-bit unsigned integer). When
encoding, numeric values are cast from their base type to `uint70`, and when
decoding, the parsed `uint70` is cast to the appropriate numeric type.

The maximum valid value for a varint that complies with protobuf3 is
`FF FF FF FF FF FF FF FF FF 7F` (i.e. `2**70 - 1`). If the field type is
`{,u,s}int64`, the highest 6 bits of the 70 are dropped during decoding,
introducing 6 bits of malleability. If the field type is `{,u,s}int32`, the
highest 38 bits of the 70 are dropped during decoding, introducing 38 bits of
malleability.
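
The malleability described above can be illustrated with a minimal Go sketch. `decodeLenient` is a hypothetical helper written for this example and is not part of any protobuf library. It assumes a permissive parser that keeps only the low 64 bits of the 70-bit value, so two different byte strings decode to the same `uint64`:

```go
package main

import "fmt"

// decodeLenient reads one varint the way a permissive parser might: it
// accumulates 7-bit groups into a uint64 and silently drops bits that do not
// fit. It returns the value and the number of bytes consumed (0 on error).
func decodeLenient(buf []byte) (uint64, int) {
	var v uint64
	for i, b := range buf {
		if i == 10 {
			return 0, 0 // longer than any protobuf varint
		}
		v |= uint64(b&0x7f) << (7 * uint(i)) // bits above bit 63 are shifted out
		if b < 0x80 {
			return v, i + 1
		}
	}
	return 0, 0 // truncated input
}

func main() {
	canonical := []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x01}
	malleated := []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f}
	a, _ := decodeLenient(canonical)
	b, _ := decodeLenient(malleated)
	fmt.Println(a == b) // compares the two decoded 64-bit values

	// For a 32-bit field the parser additionally drops bits 32..63.
	c, _ := decodeLenient([]byte{0x81, 0x80, 0x80, 0x80, 0x10})
	fmt.Println(uint32(c))
}
```

A signature over the first byte string would not cover the second one, although both carry the same logical value. Strict decoders such as Go's `binary.Uvarint` report an overflow for the second input.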

Among other sources of non-determinism, this ADR eliminates the possibility of
encoding malleability.

### Serialization rules

The serialization is based on the
[protobuf3 encoding](https://developers.google.com/protocol-buffers/docs/encoding)
with the following additions:

1. Fields must be serialized only once, in ascending order of field number
2. Extra fields or any extra data must not be added
3. [Default values](https://developers.google.com/protocol-buffers/docs/proto3#default)
   must be omitted
4. `repeated` fields of scalar numeric types must use
   [packed encoding](https://developers.google.com/protocol-buffers/docs/encoding#packed)
5. Varint encoding must not be longer than needed:
    * No trailing zero bytes (in little endian, i.e. no leading zeroes in big
      endian). Per rule 3 above, the default value of `0` must be omitted, so
      this rule does not apply in such cases.
    * The maximum value for a varint must be `FF FF FF FF FF FF FF FF FF 01`.
      In other words, when decoded, the highest 6 bits of the 70-bit unsigned
      integer must be `0`. (10-byte varints are 10 groups of 7 bits, i.e.
      70 bits, of which only the lowest 70-6=64 are useful.)
    * The maximum value for 32-bit values in varint encoding must be `FF FF FF FF 0F`
      with one exception (below). In other words, when decoded, the highest 38
      bits of the 70-bit unsigned integer must be `0`.
        * The one exception to the above is _negative_ `int32`, which must be
          encoded using the full 10 bytes for sign extension<sup>2</sup>.
    * The maximum value for Boolean values in varint encoding must be `01` (i.e.
      it must be `0` or `1`). Per rule 3 above, the default value of `0` must
      be omitted, so if a Boolean is included it must have a value of `1`.
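
Rule 5 can be checked mechanically while decoding. The following Go sketch shows one possible way to do so. The names `decodeCanonical`, `validInt32` and `errNonCanonical` are invented for this example, and the code only covers the generic 64-bit bound, the trailing-zero check and the `int32` exception:

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var errNonCanonical = errors.New("non-canonical varint")

// decodeCanonical reads one varint and rejects encodings that are longer than
// needed or that exceed FF FF FF FF FF FF FF FF FF 01.
func decodeCanonical(buf []byte) (uint64, int, error) {
	var v uint64
	for i, b := range buf {
		if i == 9 && b > 0x01 {
			// The 10th byte may only contribute bit 63 and must end the varint.
			return 0, 0, errNonCanonical
		}
		v |= uint64(b&0x7f) << (7 * uint(i))
		if b < 0x80 { // last byte of the varint
			if i > 0 && b == 0 {
				return 0, 0, errNonCanonical // trailing zero byte
			}
			return v, i + 1, nil
		}
	}
	return 0, 0, errors.New("truncated varint")
}

// validInt32 reports whether a decoded varint is an allowed int32 encoding:
// either a non-negative value within the int32 range, or a negative value
// sign-extended to 64 bits (which occupies the full 10 bytes on the wire).
func validInt32(v uint64) bool {
	s := int64(v)
	return s >= math.MinInt32 && s <= math.MaxInt32
}

func main() {
	// The value 1 padded with a trailing zero byte is rejected.
	_, _, err := decodeCanonical([]byte{0x81, 0x00})
	fmt.Println(err)

	// The largest allowed varint is accepted.
	v, n, err := decodeCanonical([]byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x01})
	fmt.Println(v == math.MaxUint64, n, err)
}
```

A Boolean field would add the check that the decoded value equals `1`, and an unsigned 32-bit field the check that it does not exceed `math.MaxUint32`.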

Rules 1 and 2 are straightforward and describe the
default behavior of all protobuf encoders the author is aware of; rule 3
is more interesting. After a protobuf3 deserialization you cannot differentiate
between unset fields and fields set to the default value<sup>3</sup>. At the
serialization level, however, it is possible either to write a field with an empty
value or to omit it entirely. This is a significant difference from e.g. JSON,
where a property can be empty (`""`, `0`), `null` or undefined, leading to 3
different documents.

Omitting fields set to default values is valid because the parser must assign
the default value to fields missing in the serialization<sup>4</sup>. For scalar
types, omitting defaults is required by the spec<sup>5</sup>. For `repeated`
fields, not serializing them is the only way to express empty lists. Enums must
have a first element of numeric value 0, which is the default<sup>6</sup>. And
message fields default to unset<sup>7</sup>.

Omitting defaults allows for some amount of forward compatibility: users of
newer versions of a protobuf schema produce the same serialization as users of
older versions as long as newly added fields are not used (i.e. are left at their
default value).
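
The forward-compatibility argument can be made concrete with a small Go sketch. `PostV1` and `PostV2` are hypothetical message types standing in for two versions of the same schema, and `appendString` is an illustrative helper that applies rule 3 to a singular string field:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// appendString appends a singular string field, or nothing at all when s is
// the default value "" (rule 3).
func appendString(buf []byte, fieldNum uint64, s string) []byte {
	if s == "" {
		return buf
	}
	buf = binary.AppendUvarint(buf, fieldNum<<3|2) // wire type 2: length-delimited
	buf = binary.AppendUvarint(buf, uint64(len(s)))
	return append(buf, s...)
}

// PostV1 is the old schema: a single title field.
type PostV1 struct{ Title string }

// PostV2 is the new schema: it adds a subtitle as field 2.
type PostV2 struct{ Title, Subtitle string }

func (p PostV1) Marshal() []byte { return appendString(nil, 1, p.Title) }

func (p PostV2) Marshal() []byte {
	buf := appendString(nil, 1, p.Title)
	return appendString(buf, 2, p.Subtitle)
}

func main() {
	oldBytes := PostV1{Title: "hi"}.Marshal()
	newBytes := PostV2{Title: "hi"}.Marshal() // Subtitle left at its default
	fmt.Println(bytes.Equal(oldBytes, newBytes))
}
```

As long as the new field stays at its default, both versions sign the same bytes. Once `Subtitle` is set, the serializations differ and a client on the old schema can no longer reproduce them.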

### Implementation

There are three main implementation strategies, ordered from the least to the
most custom development:

* **Use a protobuf serializer that follows the above rules by default.** E.g.
  [gogoproto](https://pkg.go.dev/github.com/cosmos/gogoproto/gogoproto) is known to
  be compliant in most cases, but not when certain annotations such as
  `nullable = false` are used. It might also be an option to configure an
  existing serializer accordingly.
* **Normalize default values before encoding them.** If your serializer follows
  rules 1 and 2 and allows you to explicitly unset fields for serialization,
  you can normalize default values to unset. This can be done when working with
  [protobuf.js](https://www.npmjs.com/package/protobufjs):

  ```js
  const bytes = SignDoc.encode({
    bodyBytes: body.length > 0 ? body : null, // normalize empty bytes to unset
    authInfoBytes: authInfo.length > 0 ? authInfo : null, // normalize empty bytes to unset
    chainId: chainId || null, // normalize "" to unset
    accountNumber: accountNumber || null, // normalize 0 to unset
    accountSequence: accountSequence || null, // normalize 0 to unset
  }).finish();
  ```

* **Use a hand-written serializer for the types you need.** If none of the above
  ways works for you, you can write a serializer yourself. For `SignDoc` this
  would look something like the following sketch in Go, which uses only the
  varint helpers from `encoding/binary`. The `SignDoc` struct here is a local
  stand-in for the generated type:

  ```go
  package signdoc

  import "encoding/binary"

  // SignDoc is a local stand-in that mirrors the fields used in this ADR.
  type SignDoc struct {
      BodyBytes       []byte
      AuthInfoBytes   []byte
      ChainID         string
      AccountNumber   uint64
      AccountSequence uint64
  }

  // appendBytes writes key, length and payload, or nothing if data is empty (rule 3).
  func appendBytes(buf []byte, key uint64, data []byte) []byte {
      if len(data) == 0 {
          return buf
      }
      buf = binary.AppendUvarint(buf, key)
      buf = binary.AppendUvarint(buf, uint64(len(data)))
      return append(buf, data...)
  }

  // appendUint64 writes key and value, or nothing if v is 0 (rule 3).
  func appendUint64(buf []byte, key, v uint64) []byte {
      if v == 0 {
          return buf
      }
      buf = binary.AppendUvarint(buf, key)
      return binary.AppendUvarint(buf, v)
  }

  // Marshal writes each field at most once, in ascending field order (rule 1).
  func (d SignDoc) Marshal() []byte {
      var buf []byte
      buf = appendBytes(buf, 0x0a, d.BodyBytes)          // wire type and field number for body_bytes
      buf = appendBytes(buf, 0x12, d.AuthInfoBytes)      // wire type and field number for auth_info_bytes
      buf = appendBytes(buf, 0x1a, []byte(d.ChainID))    // wire type and field number for chain_id
      buf = appendUint64(buf, 0x20, d.AccountNumber)     // wire type and field number for account_number
      buf = appendUint64(buf, 0x28, d.AccountSequence)   // wire type and field number for account_sequence
      return buf
  }
  ```

### Test vectors

Given the protobuf definition `Article.proto`

```protobuf
syntax = "proto3";
package blog;

enum Type {
  TYPE_UNSPECIFIED = 0;
  IMAGES = 1;
  NEWS = 2;
}

enum Review {
  REVIEW_UNSPECIFIED = 0;
  ACCEPTED = 1;
  REJECTED = 2;
}

message Article {
  string title = 1;
  string description = 2;
  uint64 created = 3;
  uint64 updated = 4;
  bool public = 5;
  bool promoted = 6;
  Type type = 7;
  Review review = 8;
  repeated string comments = 9;
  repeated string backlinks = 10;
}
```

serializing the values

```yaml
title: "The world needs change 🌳"
description: ""
created: 1596806111080
updated: 0
public: true
promoted: false
type: Type.NEWS
review: Review.REVIEW_UNSPECIFIED
comments: ["Nice one", "Thank you"]
backlinks: []
```

must result in the serialization

```text
0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75
```

When inspecting the serialized document, you see that every second field is
omitted because it is set to its default value:

```shell
$ echo 0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75 | xxd -r -p | protoc --decode_raw
1: "The world needs change \360\237\214\263"
3: 1596806111080
5: 1
7: 2
9: "Nice one"
9: "Thank you"
```
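
The test vector can be reproduced without a protobuf library. The Go sketch below is a hand-written encoder for the `Article` message that uses only the standard library. The struct and helper names are assumptions made for this example and do not come from generated code:

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// Article mirrors the fields of the Article message; enums are kept as numbers.
type Article struct {
	Title, Description  string
	Created, Updated    uint64
	Public, Promoted    bool
	Type, Review        int32
	Comments, Backlinks []string
}

// appendElem writes one length-delimited string, even if it is empty
// (elements of a repeated field are never dropped).
func appendElem(buf []byte, field uint64, s string) []byte {
	buf = binary.AppendUvarint(buf, field<<3|2)
	buf = binary.AppendUvarint(buf, uint64(len(s)))
	return append(buf, s...)
}

// appendString writes a singular string field unless it is the default "".
func appendString(buf []byte, field uint64, s string) []byte {
	if s == "" {
		return buf
	}
	return appendElem(buf, field, s)
}

// appendVarint writes a singular varint field unless it is the default 0.
func appendVarint(buf []byte, field, v uint64) []byte {
	if v == 0 {
		return buf
	}
	buf = binary.AppendUvarint(buf, field<<3|0)
	return binary.AppendUvarint(buf, v)
}

// appendBool writes a singular bool field only when it is true.
func appendBool(buf []byte, field uint64, b bool) []byte {
	if !b {
		return buf
	}
	return appendVarint(buf, field, 1)
}

// Marshal writes all fields in ascending field-number order.
func (a Article) Marshal() []byte {
	var buf []byte
	buf = appendString(buf, 1, a.Title)
	buf = appendString(buf, 2, a.Description)
	buf = appendVarint(buf, 3, a.Created)
	buf = appendVarint(buf, 4, a.Updated)
	buf = appendBool(buf, 5, a.Public)
	buf = appendBool(buf, 6, a.Promoted)
	buf = appendVarint(buf, 7, uint64(int64(a.Type))) // sign-extend like int32
	buf = appendVarint(buf, 8, uint64(int64(a.Review)))
	for _, c := range a.Comments {
		buf = appendElem(buf, 9, c)
	}
	for _, b := range a.Backlinks {
		buf = appendElem(buf, 10, b)
	}
	return buf
}

// Hex returns the serialization as lowercase hex.
func (a Article) Hex() string { return hex.EncodeToString(a.Marshal()) }

func main() {
	a := Article{
		Title:    "The world needs change 🌳",
		Created:  1596806111080,
		Public:   true,
		Type:     2, // NEWS
		Comments: []string{"Nice one", "Thank you"},
	}
	fmt.Println(a.Hex())
}
```

Comparing the printed hex string with the expected serialization above is a quick way to validate an independent implementation of the rules.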

## Consequences

Having such an encoding available allows us to get deterministic serialization
for all protobuf documents we need in the context of Cosmos SDK signing.

### Positive

* Well-defined rules that can be verified independently of a reference
  implementation
* Simple enough to keep the barrier to implement transaction signing low
* It allows us to continue to use 0 and other empty values in SignDoc, avoiding
  the need to work around 0 sequences. This does not imply that the change from
  https://github.com/cosmos/cosmos-sdk/pull/6949 should not be merged, but it is
  no longer as important.

### Negative

* When implementing transaction signing, the encoding rules above must be
  understood and implemented.
* The need for rule 3 adds some complexity to implementations.
* Some data structures may require custom code for serialization. Thus
  the code is not very portable - it will require additional work for each
  client implementing serialization to properly handle custom data structures.

### Neutral

### Usage in Cosmos SDK

For the reasons mentioned above ("Negative" section) we prefer to keep workarounds
for shared data structures. Example: the aforementioned `TxRaw` uses raw bytes
as a workaround. This allows clients to use any valid Protobuf library without
the need to implement a custom serializer that adheres to this standard (and the related risk of bugs).

## References

* <sup>1</sup> _When a message is serialized, there is no guaranteed order for
  how its known or unknown fields should be written. Serialization order is an
  implementation detail and the details of any particular implementation may
  change in the future. Therefore, protocol buffer parsers must be able to parse
  fields in any order._ from
  https://developers.google.com/protocol-buffers/docs/encoding#order
* <sup>2</sup> https://developers.google.com/protocol-buffers/docs/encoding#signed_integers
* <sup>3</sup> _Note that for scalar message fields, once a message is parsed
  there's no way of telling whether a field was explicitly set to the default
  value (for example whether a boolean was set to false) or just not set at all:
  you should bear this in mind when defining your message types. For example,
  don't have a boolean that switches on some behavior when set to false if you
  don't want that behavior to also happen by default._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* <sup>4</sup> _When a message is parsed, if the encoded message does not
  contain a particular singular element, the corresponding field in the parsed
  object is set to the default value for that field._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* <sup>5</sup> _Also note that if a scalar message field is set to its default,
  the value will not be serialized on the wire._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* <sup>6</sup> _For enums, the default value is the first defined enum value,
  which must be 0._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* <sup>7</sup> _For message fields, the field is not set. Its exact value is
  language-dependent._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* Encoding rules and parts of the reasoning taken from
  [canonical-proto3 Aaron Craelius](https://github.com/regen-network/canonical-proto3)