# Protobuf Encoding

## Overview

This package contains the encoder/decoder for compressing streams of Protobuf messages matching a provided schema.
All compression is performed in a streaming manner such that the encoded stream is updated with each write; there is no internal buffering or batching during which multiple writes are gathered before performing encoding.

## Features

1. Lossless compression.
2. Compression of Protobuf message timestamps using [Gorilla-style delta-of-delta encoding](https://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
3. Compression of Protobuf message streams that match a provided schema by using different forms of compression for each field based on its type.
4. Changing Protobuf message schemas mid-stream.

## Supported Syntax

While this package strives to support the entire [proto3 language spec](https://developers.google.com/protocol-buffers/docs/proto3), only the following features have been tested:

1. [Scalar values](https://developers.google.com/protocol-buffers/docs/proto3#scalar)
2. Nested messages
3. Repeated fields
4. Map fields
5. Reserved fields

The following have not been tested, and thus are not currently officially supported:

1. `Any` fields
2. [`Oneof` fields](https://developers.google.com/protocol-buffers/docs/proto#oneof)
3. Options of any type
4. Custom field types

## Compression Techniques

This package compresses the timestamps for the Protobuf messages using [Gorilla-style delta-of-delta encoding](https://www.vldb.org/pvldb/vol8/p1816-teller.pdf).

Additionally, each field is compressed using a form of compression that is optimal for its type:

1. Floating point values are compressed using [Gorilla-style XOR compression](https://www.vldb.org/pvldb/vol8/p1816-teller.pdf) (see the sketch after this list).
2. Integer values (including fixed-width types) are compressed using M3TSZ Significant Digit Integer Compression (documentation forthcoming).
3. `bytes` and `string` values are compressed using LRU Dictionary Compression, which is described in further detail below.
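To make the first technique concrete, the following is a minimal Go sketch of Gorilla-style XOR compression. It illustrates the general technique, not this package's implementation: the `bitToken` type stands in for a real bit stream, and the 6- and 7-bit headers are assumptions (encoders such as the one in the Gorilla paper pack these counts more tightly).

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// bitToken is a simplified stand-in for a real bit stream: each entry is a
// (value, width) pair that would be appended to the stream.
type bitToken struct {
	value uint64
	width uint
}

// xorEncode sketches Gorilla-style XOR compression for a series of float64s.
// Each value is XORed against the previous one: identical values cost a
// single '0' bit, and similar values cost only their differing middle bits
// plus a small header describing the leading/trailing zero counts.
func xorEncode(values []float64) []bitToken {
	var out []bitToken
	var prev uint64
	for i, v := range values {
		cur := math.Float64bits(v)
		if i == 0 {
			// The first value is stored uncompressed.
			out = append(out, bitToken{cur, 64})
			prev = cur
			continue
		}
		xor := cur ^ prev
		if xor == 0 {
			out = append(out, bitToken{0, 1}) // control bit: unchanged
		} else {
			out = append(out, bitToken{1, 1}) // control bit: changed
			leading := uint(bits.LeadingZeros64(xor))
			trailing := uint(bits.TrailingZeros64(xor))
			meaningful := 64 - leading - trailing
			// Header: 6 bits of leading-zero count, 7 bits of length,
			// then the meaningful bits themselves.
			out = append(out, bitToken{uint64(leading), 6})
			out = append(out, bitToken{uint64(meaningful), 7})
			out = append(out, bitToken{xor >> trailing, meaningful})
		}
		prev = cur
	}
	return out
}

func main() {
	for _, t := range xorEncode([]float64{12.0, 12.0, 24.0}) {
		fmt.Printf("%2d bits: %b\n", t.width, t.value)
	}
}
```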
### LRU Dictionary Compression

LRU Dictionary Compression is a compression scheme that provides high levels of compression for `bytes` and `string` fields that meet either of the following criteria:

1. The value of the field changes infrequently.
2. The value of the field changes frequently, but tends to rotate among a small number of frequently used values which may evolve over time.

For example, the stream: `["value1", "value2", "value3", "value4", "value5", "value6", ...]` will compress poorly, but the stream: `["value1", "value1", "value2", "value1", "value3", "value2", ...]` will compress well.

Similar to `LZ77` and its variants, this compression strategy has an implicit assumption that patterns in the input data occur close together. Data streams that don't satisfy this assumption will compress poorly.

In the future, we may replace this simple algorithm with a more sophisticated dictionary compression scheme such as `LZ77`, `LZ78`, `LZW` or variants like `LZMW` or `LZAP`.

#### Algorithm

The encoder maintains a list of recently encoded strings in a per-field LRU cache.
Every time the encoder encounters a string, it checks the cache first.

If the string *is not* in the cache, then it encodes the string in its entirety and adds it to the cache (evicting the least recently encoded string if necessary).

If the string *is* in the cache, then it encodes the **index** of the string in the cache, which requires much less space.
For example, it only takes 2 bits to encode the position of a string in a cache with a maximum capacity of 4 strings.

For example, given a sequence of strings: `["foo", "bar", "baz", "bar"]` and an LRU cache of size 2, the algorithm performs the following operations:

1. Check the cache to see if it contains "foo", which it does not; write the full string "foo" into the stream and add "foo" to the cache -> `["foo"]`
2. Check the cache to see if it contains "bar", which it does not; write the full string "bar" into the stream and add "bar" to the cache -> `["foo", "bar"]`
3. Check the cache to see if it contains "baz", which it does not; write the full string "baz" into the stream, add "baz" to the cache, and evict "foo" -> `["bar", "baz"]`
4. Check the cache to see if it contains "bar", which it does; encode index 0 into the stream with a single bit (because "bar" was at index 0 in the cache as of the end of step 3), then update the cache to indicate that "bar" was the most recently encoded string -> `["baz", "bar"]`

This compression scheme works because the decoder can maintain an LRU cache (of the same maximum capacity) and apply the same operations in the same order while it's decompressing the stream.
As a result, when it encounters an encoded cache index, it can look up the corresponding string in its own LRU cache at the specified index.

##### Encoding

The LRU Dictionary Compression scheme uses 2 control bits to encode all the relevant information required to decode the stream. In order, they are:

1. **The "no change" control bit.** If this bit is set to `1`, the value is unchanged and no further encoding/decoding is required.
2. **The "size" control bit.** If this bit is set to `0`, the LRU cache capacity (N) is used to determine the number of remaining bits that need to be read and interpreted as a cache index that holds the compressed value; otherwise, the remaining bits are treated as a variable-width `length` and corresponding `bytes` pair.

Importantly, if the beginning of the `bytes` sequence is not byte-aligned, it is padded with zeroes up to the next byte boundary. While this isn't a strict requirement of the encoding scheme (in fact, it slightly lowers the compression ratio), it greatly simplifies the implementation because the encoder needs to reference previously-encoded bytes in order to check if the bytes currently being encoded are cached. Alternatively, the encoder could keep track of all the bytes that correspond to each cache entry in memory, but that would be a wasteful use of memory. Instead, it's more efficient if the cache stores offsets into the encoded stream for the beginning and end of the bytes that have already been encoded. These offsets are much easier to keep track of and compare against if they can be assumed to always correspond to the beginning of a byte boundary. In the future the implementation may be changed to favor wasting fewer bits in exchange for more complex logic.
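The following is a minimal Go sketch of the cache discipline described in the "Algorithm" section, mirroring the worked example above. It emits human-readable descriptions instead of actual bits, and it stores the strings themselves rather than stream offsets; both are simplifications of the real encoder.

```go
package main

import "fmt"

// lruDict sketches the per-field LRU cache described above. The real encoder
// stores stream offsets instead of the strings themselves; strings are used
// here to keep the example self-contained.
type lruDict struct {
	capacity int
	entries  []string // index 0 is the least recently encoded entry
}

// encode returns a description of what would be written to the bit stream
// for value, and updates the cache exactly as the decoder would.
func (d *lruDict) encode(value string) string {
	for i, cached := range d.entries {
		if cached == value {
			// Cache hit: emit the index, then mark value most recently used.
			d.entries = append(append(d.entries[:i:i], d.entries[i+1:]...), value)
			return fmt.Sprintf("index %d", i)
		}
	}
	// Cache miss: emit the full string and add it, evicting if necessary.
	if len(d.entries) == d.capacity {
		d.entries = d.entries[1:]
	}
	d.entries = append(d.entries, value)
	return fmt.Sprintf("literal %q", value)
}

func main() {
	d := &lruDict{capacity: 2}
	for _, s := range []string{"foo", "bar", "baz", "bar"} {
		fmt.Printf("write %-5q -> %-13s cache=%v\n", s, d.encode(s), d.entries)
	}
}
```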
### Compression Limitations

While this compression applies to all scalar types at the top level of a message, it does not apply to any data that is part of `repeated` fields, `map` fields, or nested messages.
The `nested message` restriction may be lifted in the future, but the `repeated` and `map` restrictions are unlikely to change due to the difficulty of compressing variably sized fields.

## Binary Format

At a high level, compressing Protobuf messages consists of the following:

1. Scanning each schema to identify which fields can be extracted and compressed as described in the "Compression Techniques" section.
2. Iterating over all fields (as well as the timestamp) for each message as it arrives and encoding the next value for that field into its corresponding compressed stream.

In practice, a Protobuf message can have any number of different fields; for performance reasons, it's impractical to maintain an independent stream at the field level.
Instead, multiple "logical" streams are interleaved on a per-write basis within one physical stream.
The remainder of this section outlines the binary format used to accomplish this interleaving.

The binary format begins with a stream header, and the remainder of the stream is a sequence of tuples in the form: `<per-write header, compressed timestamp, compressed custom encoded fields, Protobuf marshalled fields>`

### Stream Header

Every compressed stream begins with a header which includes the following information:

1. encoding scheme version (`varint`)
2. dictionary compression LRU cache size (`varint`)

In the future the dictionary compression LRU cache size may be moved to the per-write control bits section so that it can be updated mid-stream (as opposed to only being updatable at the beginning of a new stream).

### Per-Write Header

#### Per-Write Control Bits

Every write is prefixed with a header that contains at least one control bit.

If the control bit is set to `1`, then the stream contains another write that needs to be decoded, implying that the timestamp can be decoded as well.

If the control bit is set to `0`, then either **(a)** the end of the stream has been reached *or* **(b)** a time unit and/or schema change has been encountered.

This ambiguity is resolved by reading the next control bit, which will be `0` (if this is the end of the stream) or `1` (if a time unit and/or schema change has been encountered).

If the control bit is `1` (meaning this is not the end of the stream), then the next two bits should be interpreted as boolean control bits indicating whether there has been a time unit or schema change, respectively.

Time unit changes must be tracked manually (instead of deferring to the M3TSZ timestamp delta-of-delta encoder, which can handle this independently) because the M3TSZ encoder relies on a custom marker scheme to indicate time unit changes that is not suitable for the Protobuf encoding format.
The M3TSZ timestamp encoder avoids using a control bit for each write by using a marker which contains a prefix that could not be generated by any possible input, and the decoder frequently "looks ahead" for this marker to see if it needs to decode a time unit change.
The Protobuf encoding scheme has no equivalent "impossible bit combination", so it uses explicit control bits to indicate a time unit change instead.

The table below contains a summary of all the possible per-write control bit combinations. Note that `X` is used to denote control bits that will not be included; so even though the format may look inefficient because it requires a maximum of 4 control bits to encode only 6 different combinations, the most common scenario (where the stream contains at least one more write and neither the time unit nor the schema has changed) can be encoded with just a single bit.

| Combination | Control Bits | Meaning                                                                                      |
|-------------|--------------|----------------------------------------------------------------------------------------------|
| 1           | 1XXX         | The stream contains at least one more write.                                                 |
| 2           | 00XX         | End of stream.                                                                               |
| 3           | 0101         | The stream contains at least one more write and the schema has changed.                      |
| 4           | 0110         | The stream contains at least one more write and the time unit has changed.                   |
| 5           | 0111         | The stream contains at least one more write and both the schema and time unit have changed.  |
| 6           | 0100         | Impossible combination.                                                                      |

The header ends immediately after combinations #1 and #2, but combinations #3, #4, and #5 will be followed by an encoded time unit change and/or schema change.
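The following is a hedged Go sketch of how a decoder might walk the combinations in the table above. The one-bit-per-byte `bitReader` is a toy stand-in for the real packed bit stream.

```go
package main

import "fmt"

// writeHeader captures the outcome of decoding one per-write header.
type writeHeader struct {
	endOfStream     bool
	timeUnitChanged bool
	schemaChanged   bool
}

// bitReader is a toy stand-in for the real decoder's packed bit stream.
type bitReader struct {
	bits []byte // one entry per bit, for clarity
	pos  int
}

func (r *bitReader) read() byte {
	b := r.bits[r.pos]
	r.pos++
	return b
}

// decodePerWriteHeader follows the control-bit table above.
func decodePerWriteHeader(r *bitReader) writeHeader {
	if r.read() == 1 {
		// Combination 1: another write, no changes.
		return writeHeader{}
	}
	if r.read() == 0 {
		// Combination 2: end of stream.
		return writeHeader{endOfStream: true}
	}
	// Combinations 3-5: another write, preceded by two change flags.
	return writeHeader{
		timeUnitChanged: r.read() == 1,
		schemaChanged:   r.read() == 1,
	}
}

func main() {
	// 1 (one more write), 0111 (write + both changes), 00 (end of stream).
	r := &bitReader{bits: []byte{1, 0, 1, 1, 1, 0, 0}}
	for {
		h := decodePerWriteHeader(r)
		fmt.Printf("%+v\n", h)
		if h.endOfStream {
			break
		}
	}
}
```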
#### Time Unit Encoding

Time unit changes are encoded using a single byte such that every possible time unit has a unique value.

#### Schema Encoding

An encoded schema can be thought of as a sequence of `<fieldNum, fieldType>` tuples and is encoded as follows:

1. highest field number (`N`) that will be described (`varint`)
2. `N` sets of 3 bits, where each set corresponds to a "custom type", which is enough information to determine how the field should be compressed / decompressed. This is analogous to a Protobuf [`wire type`](https://developers.google.com/protocol-buffers/docs/encoding) in that it includes enough information to skip over the field if it's not present in the schema that is being used to decode the message.

Notably, the list only *explicitly* encodes the custom field type. *Implicitly*, the Protobuf field number is encoded by the position of the entry in the list.
In other words, the list of custom encoded fields can be thought of as a bitset, except that instead of using a single bit to encode the value at a given position, we use 3.

For example, given the following Protobuf schema:

```protobuf
message Foo {
  reserved 2, 3;
  string query = 1;
  int32 page_number = 4;
}
```

Encoding the list of custom compressed fields begins by encoding `4` as a `varint`, since that is the highest non-reserved field number.

Next, the field numbers and their types are encoded, 3 bits at a time, where each field number is implied by its position in the list (starting at index 1, since Protobuf field numbers start at 1), and the type is encoded in the 3-bit combination:

`string query = 1;` is encoded as the first value (indicating field number 1) with the bit combination `111`, indicating that it should be treated as `bytes` for compression purposes.

Next, `000` is encoded twice to indicate that no custom compression will be performed for fields `2` or `3`, since they are reserved.

Finally, `010` is encoded as the fourth item to indicate that field number `4` will be treated as a signed 32 bit integer.
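Tying the worked example together, here is a hedged Go sketch of the schema list encoding. The constants and the 3-bit-group representation are illustrative stand-ins for the real bit stream (the custom type values themselves are listed in the "Custom Types" section below).

```go
package main

import "fmt"

// A few custom types from the table below (assumed constants, for illustration).
const (
	typeNotCustom = 0b000
	typeSigned32  = 0b010
	typeBytes     = 0b111
)

// encodeSchema packs the <fieldNum, customType> list described above: a
// varint of the highest field number, then 3 bits per field where the field
// number is implied by position. A []byte of 3-bit groups stands in for a
// real bit stream.
func encodeSchema(types map[int32]byte) (highest int32, groups []byte) {
	for fieldNum := range types {
		if fieldNum > highest {
			highest = fieldNum
		}
	}
	for fieldNum := int32(1); fieldNum <= highest; fieldNum++ {
		// Fields with no entry (e.g. reserved fields) encode as 000,
		// which is the map's zero value and equals typeNotCustom.
		groups = append(groups, types[fieldNum])
	}
	return highest, groups
}

func main() {
	// message Foo { reserved 2, 3; string query = 1; int32 page_number = 4; }
	highest, groups := encodeSchema(map[int32]byte{
		1: typeBytes,    // string query = 1;
		4: typeSigned32, // int32 page_number = 4;
	})
	fmt.Printf("varint: %d, groups:", highest)
	for _, g := range groups {
		fmt.Printf(" %03b", g) // prints: 111 000 000 010
	}
	fmt.Println()
}
```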
Note that only fields that support custom encoding are included in the schema.
This is because the Protobuf encoding format will take care of schema changes for any non-custom-encoded fields as long as they are valid updates [according to the Protobuf specification](https://developers.google.com/protocol-buffers/docs/proto3#updating).

##### Custom Types

0. (`000`): Not custom encoded - This type indicates that no custom compression will be applied to this field; instead, the standard Protobuf encoding will be used.
1. (`001`): Signed 64 bit integer (`int64`, `sint64`)
2. (`010`): Signed 32 bit integer (`int32`, `sint32`, `enum`)
3. (`011`): Unsigned 64 bit integer (`uint64`, `fixed64`)
4. (`100`): Unsigned 32 bit integer (`uint32`, `fixed32`)
5. (`101`): 64 bit float (`double`)
6. (`110`): 32 bit float (`float`)
7. (`111`): bytes (`bytes`, `string`)

### Compressed Timestamp

The Protobuf compression scheme reuses the delta-of-delta timestamp encoding logic that is implemented in the M3TSZ package and described in the [Facebook Gorilla paper](https://www.vldb.org/pvldb/vol8/p1816-teller.pdf).

After encoding a control bit with a value of `1` (indicating that there is another write in the stream), the delta-of-delta of the current and previous timestamps is encoded.

Similarly, when decoding the stream, the inverse operation is performed to reconstruct the current timestamp based on both the previous timestamp and the delta-of-delta encoded into the stream.
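The following is a small, hedged sketch of the delta-of-delta arithmetic itself; the variable-width bit packing that M3TSZ applies to these values is omitted.

```go
package main

import "fmt"

// encodeDoD sketches delta-of-delta timestamp encoding: only the change in
// the spacing between consecutive timestamps is stored, so perfectly regular
// writes repeatedly encode zero, which costs a single bit in Gorilla-style
// schemes. Values are returned as int64s; the real encoder writes them as
// variable-width bit patterns.
func encodeDoD(timestamps []int64) (first int64, dods []int64) {
	first = timestamps[0]
	var prevDelta int64
	for i := 1; i < len(timestamps); i++ {
		delta := timestamps[i] - timestamps[i-1]
		dods = append(dods, delta-prevDelta)
		prevDelta = delta
	}
	return first, dods
}

// decodeDoD performs the inverse operation: each timestamp is rebuilt from
// the previous timestamp plus the running delta.
func decodeDoD(first int64, dods []int64) []int64 {
	out := []int64{first}
	var delta, cur int64 = 0, first
	for _, dod := range dods {
		delta += dod
		cur += delta
		out = append(out, cur)
	}
	return out
}

func main() {
	ts := []int64{1000, 1010, 1020, 1035}
	first, dods := encodeDoD(ts)
	fmt.Println(dods)                   // [10 0 5]
	fmt.Println(decodeDoD(first, dods)) // [1000 1010 1020 1035]
}
```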
### Compressed Protobuf Fields

Compressing the Protobuf fields is broken into two stages:

1. Custom compressed fields
2. Protobuf marshalled fields

In the first phase, any eligible custom fields are compressed as described in the "Compression Techniques" section.

In the second phase, the Protobuf marshalling format is used to encode and decode the data, with the caveat that fields are compared at the top level and re-encoding is avoided if they have not changed.

#### Custom Compressed Protobuf Fields

The custom fields' compressed values are encoded in a manner similar to how their types are encoded in the "Schema Encoding" section above.
In fact, they're even encoded in the same order, with the caveat that, unlike when encoding the types, there is no need to encode a null value for non-contiguous field numbers for which no compression is being performed.

All of the compressed values are encoded sequentially and no separator or control bits are placed between them, which means that each value must encode enough information for a decoder to determine where it ends and the next one begins.

The values themselves are encoded based on the field type and the compression technique that is being applied to it. For example, consider the sample Protobuf message from earlier:

```protobuf
message Foo {
  reserved 2, 3;
  string query = 1;
  int32 page_number = 4;
}
```

If the `string` field `query` had never been encoded before, the following control bits would be encoded: `1` (indicating that the value had changed since its previous empty value), followed by `1` again (indicating that the value was not found in the LRU cache and would be encoded in its entirety with a `varint` length prefix).

Next, 6 bits would be used to encode the number of significant digits in the delta between the current `page_number` and the previous `page_number`, followed by a control bit indicating whether the delta is positive or negative, and then finally the significant bits themselves.

Note that the values encoded for both fields are "self contained" in that they encode all the information required to determine when the end has been reached.

#### Protobuf Marshalled Fields (non custom encoded / compressed)

We recommend reading the [Protocol Buffers Encoding](https://developers.google.com/protocol-buffers/docs/encoding) section of the official documentation before reading this section.
Specifically, understanding how Protobuf messages are (basically) encoded as a stream of tuples in the form of `<field number, wire type, value>` will make understanding this section much easier.

The Protobuf marshalled fields section of the encoding scheme contains all the values that don't currently support custom compression.
For the most part, the output of this section is similar to the result of calling `Marshal()` on a message in which all the custom compressed fields have already been removed, and the only remaining fields are ones that Protobuf will encode directly.
This is possible because, as described in the Protobuf encoding section linked above, the Protobuf wire format does not encode **any** data for fields which are not set or are set to a default value, so by "clearing" the fields that have already been encoded, they can be omitted when marshalling the remainder of the Protobuf message.

While Protobuf's wire format is leaned upon heavily, specific attention is given to avoiding re-encoding fields that haven't changed since the previous value, where "haven't changed" is defined at the top-most level of the message.

For example, consider encoding messages with the following schema:

```protobuf
message Outer {
  message Nested {
    message NestedDeeper {
      int64 ival = 1;
      bool booly = 2;
    }
    int64 outer = 1;
    NestedDeeper deeper = 2;
  }

  Nested nested = 1;
}
```

If none of the values inside `nested` have changed since the previous message, the `nested` field doesn't need to be encoded at all.
However, if any of the fields have changed, like `nested.deeper.booly` for example, then the entire `nested` field must be re-encoded (including the `outer` field, even though only the `deeper` field changed).

This top-level "only if it has changed" delta encoding can be used because, when the stream is decoded later, the original message can be reconstructed by merging the previously-decoded message with the current delta message, which contains only fields that have changed since the previous message.
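The following is a hedged sketch of that decode-side merge rule using the `google.golang.org/protobuf` API. `structpb.Struct` stands in for a message generated from a user schema so that the example is self-contained; this illustrates the merge rule, not this package's actual decode path.

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

// reconstruct applies the decode-side rule described above: the current
// message is rebuilt by merging the delta (which contains only the fields
// that changed) on top of a copy of the previously decoded message.
func reconstruct(prev, delta proto.Message) proto.Message {
	cur := proto.Clone(prev)
	proto.Merge(cur, delta)
	return cur
}

func main() {
	// Errors ignored for brevity; these inputs are statically valid.
	prev, _ := structpb.NewStruct(map[string]any{"outer": 1, "booly": false})
	delta, _ := structpb.NewStruct(map[string]any{"booly": true})
	// Prints a message equivalent to {"outer": 1, "booly": true}.
	fmt.Println(reconstruct(prev, delta))
}
```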
Only marshalling the fields that have changed since the previous message works for the most part, but there is one important edge case: because the Protobuf wire format does not encode **any** data for fields that are set to a default value (zero for `integers` and `floats`, empty array for `bytes` and `strings`, etc.), using the standard Protobuf marshalling format with delta encoding works in every scenario *except* the case where a field changes from a non-default value to a default value (because it is not possible to express explicitly setting a field to its default value).

This issue is mitigated by encoding an additional optional bitset (as in, it is only encoded when necessary) which indicates any field numbers that were set to the default value of the field's type.

The bitset encoding is straightforward: it begins with a `varint` that encodes the length (number of bits) of the bitset, and then the remaining `N` bits are interpreted as a 1-indexed bitset (because field numbers start at 1, not 0) where a value of `1` indicates that the field was changed to its default value.

##### Protobuf Marshalled Fields Encoding Format

The Protobuf Marshalled Fields section of the encoding begins with a single control bit that indicates whether there have been any changes to the Protobuf encoded portion of the message at all.
If the control bit is set to `1`, then there have been changes and decoding must continue; if it is set to `0`, then there were no changes and the decoder can skip to the next write (or stop, if at the end of the stream).

If the previous control bit was set to `1`, indicating that there have been changes, then there will be another control bit which indicates whether any fields have been set to a default value.
If so, then its value will be `1` and the subsequent bits should be interpreted as a `varint` encoding the length of the bitset, followed by the actual bitset bits as discussed above.
If the value is `0`, then there is no bitset to decode.

At this point, if the stream is not byte-aligned, it is padded with zeros up to the next byte boundary. This reduces compression slightly (a maximum of 7 bits per message that contains non-custom-encoded fields), but significantly improves the speed at which large marshalled Protobuf fields can be encoded and decoded.

Finally, this portion of the encoding ends with a `varint` that encodes the length of the bytes generated by calling `Marshal()` on the message (in which any custom-encoded or unchanged fields were cleared), followed by the actual marshalled bytes themselves.
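To make the default-value bitset concrete, here is a hedged Go sketch of the encode side. The varint length prefix matches the description above, while the MSB-first bit packing and the choice of bit count are illustrative assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeDefaultBitset sketches the 1-indexed bitset described above: a varint
// length (in bits) followed by one bit per field number, where a set bit
// marks a field that changed to its type's default value. Bits are packed
// most significant bit first to keep the example readable.
func encodeDefaultBitset(defaultFields []int, numFields int) []byte {
	out := binary.AppendUvarint(nil, uint64(numFields))
	bits := make([]byte, (numFields+7)/8)
	for _, fieldNum := range defaultFields {
		idx := fieldNum - 1 // field numbers are 1-indexed
		bits[idx/8] |= 1 << (7 - idx%8)
	}
	return append(out, bits...)
}

func main() {
	// Fields 2 and 5 of a 5-field message were set back to their defaults.
	encoded := encodeDefaultBitset([]int{2, 5}, 5)
	fmt.Printf("% x\n", encoded) // varint 5, then bits 01001 (zero-padded)
}
```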