# ADR 027: Deterministic Protobuf Serialization

## Changelog

* 2020-08-07: Initial Draft
* 2020-09-01: Further clarify rules

## Status

Proposed

## Abstract

Fully deterministic structure serialization, which works across many languages and clients,
is needed when signing messages. We need to be sure that whenever we serialize
a data structure, no matter in which supported language, the raw bytes
will stay the same.
[Protobuf](https://developers.google.com/protocol-buffers/docs/proto3)
serialization is not bijective (i.e. there exists a practically unlimited number of
valid binary representations for a given protobuf document)<sup>1</sup>.

This document describes a deterministic serialization scheme for
a subset of protobuf documents that covers this use case but can be reused in
other cases as well.

### Context

For signature verification in Cosmos SDK, the signer and verifier need to agree on
the same serialization of a `SignDoc` as defined in
[ADR-020](./adr-020-protobuf-transaction-encoding.md) without transmitting the
serialization.

Currently, for block signatures we are using a workaround: we create a new [TxRaw](https://github.com/cosmos/cosmos-sdk/blob/9e85e81e0e8140067dd893421290c191529c148c/proto/cosmos/tx/v1beta1/tx.proto#L30)
instance (as defined in [adr-020-protobuf-transaction-encoding](https://github.com/cosmos/cosmos-sdk/blob/main/docs/architecture/adr-020-protobuf-transaction-encoding.md#transactions))
by converting all [Tx](https://github.com/cosmos/cosmos-sdk/blob/9e85e81e0e8140067dd893421290c191529c148c/proto/cosmos/tx/v1beta1/tx.proto#L13)
fields to bytes on the client side. This adds an additional manual
step when sending and signing transactions.
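
To make the non-uniqueness mentioned in the abstract concrete, the following is a minimal sketch using only the Go standard library, not a real protobuf parser. The helper `decodeVarintFields` is hypothetical and only understands varint (wire type 0) fields. It shows three different byte strings that decode to the same two-field document.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeVarintFields is a hypothetical helper that parses a message made up
// only of varint (wire type 0) fields into a field-number -> value map.
func decodeVarintFields(b []byte) (map[uint64]uint64, error) {
	fields := make(map[uint64]uint64)
	for len(b) > 0 {
		tag, n := binary.Uvarint(b)
		if n <= 0 {
			return nil, fmt.Errorf("invalid field tag")
		}
		b = b[n:]
		if tag&7 != 0 {
			return nil, fmt.Errorf("unsupported wire type %d", tag&7)
		}
		v, n := binary.Uvarint(b)
		if n <= 0 {
			return nil, fmt.Errorf("invalid varint value")
		}
		b = b[n:]
		fields[tag>>3] = v // a later occurrence overwrites an earlier one
	}
	return fields, nil
}

func main() {
	encodings := [][]byte{
		{0x08, 0x01, 0x10, 0x02},       // field 1 = 1, field 2 = 2, ascending order
		{0x10, 0x02, 0x08, 0x01},       // same fields, reversed order
		{0x08, 0x81, 0x00, 0x10, 0x02}, // field 1 padded to a two-byte varint
	}
	for _, enc := range encodings {
		got, err := decodeVarintFields(enc)
		fmt.Printf("%x -> %v (err=%v)\n", enc, got, err)
	}
}
```

A signer and a verifier that each pick a different one of these encodings would hash different bytes for the same logical document, which is the problem this ADR addresses.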

### Decision

The following encoding scheme is to be used by other ADRs,
and in particular for `SignDoc` serialization.

## Specification

### Scope

This ADR defines a protobuf3 serializer. The output is a valid protobuf
serialization, such that every protobuf parser can parse it.

No maps are supported in version 1 due to the complexity of defining a
deterministic serialization. This might change in the future. Implementations must
reject documents containing maps as invalid input.

### Background - Protobuf3 Encoding

Most numeric types in protobuf3 are encoded as
[varints](https://developers.google.com/protocol-buffers/docs/encoding#varints).
Varints are at most 10 bytes, and since each varint byte has 7 bits of data,
varints are a representation of `uint70` (70-bit unsigned integer). When
encoding, numeric values are cast from their base type to `uint70`, and when
decoding, the parsed `uint70` is cast to the appropriate numeric type.

The maximum valid value for a varint that complies with protobuf3 is
`FF FF FF FF FF FF FF FF FF 7F` (i.e. `2**70 - 1`). If the field type is
`{,u,s}int64`, the highest 6 bits of the 70 are dropped during decoding,
introducing 6 bits of malleability. If the field type is `{,u,s}int32`, the
highest 38 bits of the 70 are dropped during decoding, introducing 38 bits of
malleability.

Among other sources of non-determinism, this ADR eliminates the possibility of
encoding malleability.

### Serialization rules

The serialization is based on the
[protobuf3 encoding](https://developers.google.com/protocol-buffers/docs/encoding)
with the following additions:

1. Fields must be serialized only once in ascending order
2. Extra fields or any extra data must not be added
3. [Default values](https://developers.google.com/protocol-buffers/docs/proto3#default)
   must be omitted
4.
   `repeated` fields of scalar numeric types must use
   [packed encoding](https://developers.google.com/protocol-buffers/docs/encoding#packed)
5. Varint encoding must not be longer than needed:
   * No trailing zero bytes (in little endian, i.e. no leading zeroes in big
     endian). Per rule 3 above, the default value of `0` must be omitted, so
     this rule does not apply in such cases.
   * The maximum value for a varint must be `FF FF FF FF FF FF FF FF FF 01`.
     In other words, when decoded, the highest 6 bits of the 70-bit unsigned
     integer must be `0`. (10-byte varints are 10 groups of 7 bits, i.e.
     70 bits, of which only the lowest 70-6=64 are useful.)
   * The maximum value for 32-bit values in varint encoding must be `FF FF FF FF 0F`
     with one exception (below). In other words, when decoded, the highest 38
     bits of the 70-bit unsigned integer must be `0`.
       * The one exception to the above is _negative_ `int32`, which must be
         encoded using the full 10 bytes for sign extension<sup>2</sup>.
   * The maximum value for Boolean values in varint encoding must be `01` (i.e.
     it must be `0` or `1`). Per rule 3 above, the default value of `0` must
     be omitted, so if a Boolean is included it must have a value of `1`.

While rules 1 and 2 should be straightforward and describe the
default behavior of all protobuf encoders the author is aware of, the 3rd rule
is more interesting. After a protobuf3 deserialization you cannot differentiate
between unset fields and fields set to the default value<sup>3</sup>. At
the serialization level, however, it is possible to either set a field to an empty
value or omit it entirely. This is a significant difference from e.g. JSON,
where a property can be empty (`""`, `0`), `null` or undefined, leading to 3
different documents.
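
As an illustration of rule 5, the sketch below checks a single varint for minimal length by decoding it and re-encoding it with Go's standard `encoding/binary` package. The function name `decodeCanonicalUvarint` and the error values are assumptions made for this example. It covers only the "no trailing zero bytes" and 64-bit maximum parts of the rule; the tighter 32-bit and Boolean limits depend on the field type and would need schema information that this sketch does not have.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

var (
	errInvalidVarint = errors.New("truncated varint or value exceeds 64 bits")
	errNonCanonical  = errors.New("varint is longer than needed")
)

// decodeCanonicalUvarint reads one varint from the start of b. It returns the
// value and the number of bytes consumed, and rejects any encoding that
// differs from the shortest encoding of the same value.
func decodeCanonicalUvarint(b []byte) (uint64, int, error) {
	v, n := binary.Uvarint(b)
	if n <= 0 {
		// n == 0: input too short; n < 0: more than 64 bits of payload.
		return 0, 0, errInvalidVarint
	}
	// PutUvarint always produces the minimal encoding, so any mismatch means
	// the input carried redundant continuation bytes.
	var minimal [binary.MaxVarintLen64]byte
	m := binary.PutUvarint(minimal[:], v)
	if !bytes.Equal(b[:n], minimal[:m]) {
		return 0, 0, errNonCanonical
	}
	return v, n, nil
}

func main() {
	inputs := [][]byte{
		{0xac, 0x02}, // minimal encoding of 300
		{0x81, 0x00}, // the value 1 padded with a zero byte
		{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x01}, // 2**64 - 1
		{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f}, // 2**70 - 1
	}
	for _, in := range inputs {
		v, n, err := decodeCanonicalUvarint(in)
		fmt.Printf("% x: value=%d consumed=%d err=%v\n", in, v, n, err)
	}
}
```

A negative `int32` written with 10-byte sign extension passes this check, because viewed as an unsigned 64-bit value that encoding is already minimal.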

Omitting fields set to default values is valid because the parser must assign
the default value to fields missing in the serialization<sup>4</sup>. For scalar
types, omitting defaults is required by the spec<sup>5</sup>. For `repeated`
fields, not serializing them is the only way to express empty lists. Enums must
have a first element of numeric value 0, which is the default<sup>6</sup>. And
message fields default to unset<sup>7</sup>.

Omitting defaults allows for some amount of forward compatibility: users of
newer versions of a protobuf schema produce the same serialization as users of
older versions as long as newly added fields are not used (i.e. set to their
default value).

### Implementation

There are three main implementation strategies, ordered from the least to the
most custom development:

* **Use a protobuf serializer that follows the above rules by default.** E.g.
  [gogoproto](https://pkg.go.dev/github.com/cosmos/gogoproto/gogoproto) is known to
  be compliant in most cases, but not when certain annotations such as
  `nullable = false` are used. It might also be an option to configure an
  existing serializer accordingly.
* **Normalize default values before encoding them.** If your serializer follows
  rules 1 and 2 and allows you to explicitly unset fields for serialization,
  you can normalize default values to unset. This can be done when working with
  [protobuf.js](https://www.npmjs.com/package/protobufjs):

  ```js
  const bytes = SignDoc.encode({
    bodyBytes: body.length > 0 ? body : null, // normalize empty bytes to unset
    authInfoBytes: authInfo.length > 0 ?
      authInfo : null, // normalize empty bytes to unset
    chainId: chainId || null, // normalize "" to unset
    accountNumber: accountNumber || null, // normalize 0 to unset
    accountSequence: accountSequence || null, // normalize 0 to unset
  }).finish();
  ```

* **Use a hand-written serializer for the types you need.** If none of the above
  ways works for you, you can write a serializer yourself. For SignDoc this
  would look something like this in Go, building on existing protobuf utilities:

  ```go
  // Pseudocode: buf and the accessors on signDoc are illustrative, not a real API.
  if !signDoc.body_bytes.empty() {
      buf.WriteUVarInt64(0xA) // wire type and field number for body_bytes
      buf.WriteUVarInt64(signDoc.body_bytes.length())
      buf.WriteBytes(signDoc.body_bytes)
  }

  if !signDoc.auth_info.empty() {
      buf.WriteUVarInt64(0x12) // wire type and field number for auth_info
      buf.WriteUVarInt64(signDoc.auth_info.length())
      buf.WriteBytes(signDoc.auth_info)
  }

  if !signDoc.chain_id.empty() {
      buf.WriteUVarInt64(0x1a) // wire type and field number for chain_id
      buf.WriteUVarInt64(signDoc.chain_id.length())
      buf.WriteBytes(signDoc.chain_id)
  }

  if signDoc.account_number != 0 {
      buf.WriteUVarInt64(0x20) // wire type and field number for account_number
      buf.WriteUVarInt(signDoc.account_number)
  }

  if signDoc.account_sequence != 0 {
      buf.WriteUVarInt64(0x28) // wire type and field number for account_sequence
      buf.WriteUVarInt(signDoc.account_sequence)
  }
  ```

### Test vectors

Given the protobuf definition `Article.proto`

```protobuf
syntax = "proto3";
package blog;

enum Type {
  UNSPECIFIED = 0;
  IMAGES = 1;
  NEWS = 2;
};

enum Review {
  UNSPECIFIED = 0;
  ACCEPTED = 1;
  REJECTED = 2;
};

message Article {
  string title = 1;
  string description = 2;
  uint64 created = 3;
  uint64 updated = 4;
  bool public = 5;
  bool promoted = 6;
  Type type = 7;
  Review review = 8;
  repeated string comments = 9;
  repeated string backlinks = 10;
};
```

serializing the values

```yaml
title: "The world needs change 🌳"
description: ""
created: 1596806111080
updated: 0
public: true
promoted: false
type: Type.NEWS
review: Review.UNSPECIFIED
comments: ["Nice one", "Thank you"]
backlinks: []
```

must result in the serialization

```text
0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75
```

When inspecting the serialized document, you see that every second field is
omitted:

```shell
$ echo 0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75 | xxd -r -p | protoc --decode_raw
1: "The world needs change \360\237\214\263"
3: 1596806111080
5: 1
7: 2
9: "Nice one"
9: "Thank you"
```

## Consequences

Having such an encoding available allows us to get deterministic serialization
for all protobuf documents we need in the context of Cosmos SDK signing.

### Positive

* Well-defined rules that can be verified independently of a reference
  implementation
* Simple enough to keep the barrier to implement transaction signing low
* It allows us to continue to use 0 and other empty values in SignDoc, avoiding
  the need to work around 0 sequences. This does not imply that the change from
  https://github.com/cosmos/cosmos-sdk/pull/6949 should not be merged, but it is
  no longer as important.

### Negative

* When implementing transaction signing, the encoding rules above must be
  understood and implemented.
* Rule 3 adds some complexity to implementations.
* Some data structures may require custom code for serialization.
  Thus the code is not very portable: it will require additional work for each
  client implementing serialization to properly handle custom data structures.

### Neutral

### Usage in Cosmos SDK

For the reasons mentioned above ("Negative" section) we prefer to keep workarounds
for shared data structures. Example: the aforementioned `TxRaw` uses raw bytes
as a workaround. This allows clients to use any valid Protobuf library without
the need to implement a custom serializer that adheres to this standard (and the related risk of bugs).

## References

* <sup>1</sup> _When a message is serialized, there is no guaranteed order for
  how its known or unknown fields should be written. Serialization order is an
  implementation detail and the details of any particular implementation may
  change in the future. Therefore, protocol buffer parsers must be able to parse
  fields in any order._ from
  https://developers.google.com/protocol-buffers/docs/encoding#order
* <sup>2</sup> https://developers.google.com/protocol-buffers/docs/encoding#signed_integers
* <sup>3</sup> _Note that for scalar message fields, once a message is parsed
  there's no way of telling whether a field was explicitly set to the default
  value (for example whether a boolean was set to false) or just not set at all:
  you should bear this in mind when defining your message types.
  For example,
  don't have a boolean that switches on some behavior when set to false if you
  don't want that behavior to also happen by default._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* <sup>4</sup> _When a message is parsed, if the encoded message does not
  contain a particular singular element, the corresponding field in the parsed
  object is set to the default value for that field._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* <sup>5</sup> _Also note that if a scalar message field is set to its default,
  the value will not be serialized on the wire._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* <sup>6</sup> _For enums, the default value is the first defined enum value,
  which must be 0._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* <sup>7</sup> _For message fields, the field is not set. Its exact value is
  language-dependent._ from
  https://developers.google.com/protocol-buffers/docs/proto3#default
* Encoding rules and parts of the reasoning taken from
  [canonical-proto3 Aaron Craelius](https://github.com/regen-network/canonical-proto3)