# Protobuf Encoding

## Overview

This package contains the encoder/decoder for compressing streams of Protobuf messages matching a provided schema.
All compression is performed in a streaming manner such that the encoded stream is updated with each write; there is no internal buffering or batching during which multiple writes are gathered before performing encoding.

## Features

1. Lossless compression.
2. Compression of Protobuf message timestamps using [Gorilla-style delta-of-delta encoding](https://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
3. Compression of Protobuf message streams that match a provided schema by using different forms of compression for each field based on its type.
4. Changing Protobuf message schemas mid-stream.

## Supported Syntax

While this package strives to support the entire [proto3 language spec](https://developers.google.com/protocol-buffers/docs/proto3), only the following features have been tested:

1. [Scalar values](https://developers.google.com/protocol-buffers/docs/proto3#scalar)
2. Nested messages
3. Repeated fields
4. Map fields
5. Reserved fields

The following have not been tested, and thus are not currently officially supported:

1. `Any` fields
2. [`Oneof` fields](https://developers.google.com/protocol-buffers/docs/proto#oneof)
3. Options of any type
4. Custom field types

## Compression Techniques

This package compresses the timestamps for the Protobuf messages using [Gorilla-style delta-of-delta encoding](https://www.vldb.org/pvldb/vol8/p1816-teller.pdf).

Additionally, each field is compressed using a different form of compression that is optimal for its type:

1. Floating point values are compressed using [Gorilla-style XOR compression](https://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
2. Integer values (including fixed-width types) are compressed using M3TSZ Significant Digit Integer Compression (documentation forthcoming).
3. `bytes` and `string` values are compressed using LRU Dictionary Compression, which is described in further detail below.

### LRU Dictionary Compression

LRU Dictionary Compression is a compression scheme that provides high levels of compression for `bytes` and `string` fields that meet any of the following criteria:

1. The value of the field changes infrequently.
2. The value of the field changes frequently, but tends to rotate among a small number of frequently used values which may evolve over time.

For example, the stream: `["value1", "value2", "value3", "value4", "value5", "value6", ...]` will compress poorly, but the stream: `["value1", "value1", "value2", "value1", "value3", "value2", ...]` will compress well.

Similar to `LZ77` and its variants, this compression strategy has an implicit assumption that patterns in the input data occur close together. Data streams that don't satisfy this assumption will compress poorly.

In the future, we may replace this simple algorithm with a more sophisticated dictionary compression scheme such as `LZ77`, `LZ78`, `LZW` or variants like `LZMW` or `LZAP`.

#### Algorithm

The encoder maintains a list of recently encoded strings in a per-field LRU cache.
Every time the encoder encounters a string, it checks the cache first.

If the string *is not* in the cache, then it encodes the string in its entirety and adds it to the cache (evicting the least recently encoded string if necessary).

If the string *is* in the cache, then it encodes the **index** of the string in the cache, which requires much less space.
For example, it only takes 2 bits to encode the position of the string in a cache with a maximum capacity of 4 strings.

For example, given a sequence of strings: `["foo", "bar", "baz", "bar"]` and an LRU cache of size 2, the algorithm performs the following operations:

1. Check the cache to see if it contains "foo", which it does not, and then write the full string "foo" into the stream and add "foo" to the cache -> `["foo"]`
2. Check the cache to see if it contains "bar", which it does not, and then write the full string "bar" into the stream and add "bar" to the cache -> `["foo", "bar"]`
3. Check the cache to see if it contains "baz", which it does not, and then write the full string "baz" into the stream, add "baz" to the cache, and evict "foo" from the cache -> `["bar", "baz"]`
4. Check the cache to see if it contains "bar", which it does, and then encode index 0 into the stream with a single bit (because "bar" was at index 0 in the cache as of the end of step 3) and update the cache to indicate that "bar" was the most recently encoded string -> `["baz", "bar"]`

This compression scheme works because the decoder can maintain an LRU cache (of the same maximum capacity) and apply the same operations in the same order when it's decompressing the stream.
As a result, when it encounters an encoded cache index it can look up the corresponding string in its own LRU cache at the specified index.
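The four steps above can be reproduced with a small simulation. This is a minimal sketch rather than the package's implementation: the `lruDict` type and its `encode` method are hypothetical names, the cache is a plain slice ordered from least to most recently encoded (matching the arrays shown in the walkthrough), and the "stream" is just a textual description of what would be written.

```go
package main

import "fmt"

// lruDict is a hypothetical slice-backed LRU cache. Entries are ordered from
// least recently encoded (index 0) to most recently encoded (last index).
// The capacity is assumed to be at least 1.
type lruDict struct {
	capacity int
	entries  []string
}

// encode returns a description of what would be written to the stream for s
// and then updates the cache so that s becomes the most recent entry.
func (d *lruDict) encode(s string) string {
	for i, e := range d.entries {
		if e == s {
			// Cache hit: emit the index, then move s to the most recent slot.
			d.entries = append(d.entries[:i], d.entries[i+1:]...)
			d.entries = append(d.entries, s)
			return fmt.Sprintf("index %d", i)
		}
	}
	// Cache miss: evict the least recently encoded entry if the cache is full.
	if len(d.entries) == d.capacity {
		d.entries = d.entries[1:]
	}
	d.entries = append(d.entries, s)
	return fmt.Sprintf("full string %q", s)
}

func main() {
	d := &lruDict{capacity: 2}
	for _, s := range []string{"foo", "bar", "baz", "bar"} {
		written := d.encode(s)
		fmt.Printf("%s: wrote %s, cache is now %q\n", s, written, d.entries)
	}
}
```

A decoder that holds an identical `lruDict` and applies the same updates after every value can resolve each index it reads, which is why only the index needs to be written on a hit.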

##### Encoding

The LRU Dictionary Compression scheme uses 2 control bits to encode all the relevant information required to decode the stream. In order, they are:

1. **The "no change" control bit.** If this bit is set to `0`, the value is unchanged and no further encoding/decoding is required; if it is set to `1`, the value has changed and the next control bit must be read.
2. **The "size" control bit.** If this bit is set to `0`, the size of the LRU cache capacity (N) is used to determine the number of remaining bits that need to be read and interpreted as a cache index that holds the compressed value; otherwise, the remaining bits are treated as a variable-width `length` and corresponding `bytes` pair. Importantly, if the beginning of the `bytes` sequence is not byte-aligned, it is padded with zeroes up to the next byte boundary. While this isn't a strict requirement of the encoding scheme (in fact, it slightly lowers the compression ratio), it greatly simplifies the implementation because the encoder needs to reference previously-encoded bytes in order to check if the bytes currently being encoded are cached. Alternatively, the encoder could keep track of all the bytes that correspond to each cache entry in memory, but that would be a wasteful use of memory. Instead, it's more efficient if the cache stores offsets into the encoded stream for the beginning and end of the bytes that have already been encoded. These offsets are much easier to keep track of and compare against if they can be assumed to always correspond to the beginning of a byte boundary. In the future the implementation may be changed to favor wasting fewer bits in exchange for more complex logic.
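To make the trade-off between a cache hit and a full write concrete, the sketch below estimates how many bits each case occupies. It is an illustration under stated assumptions, not the package's code: `indexBits`, `hitCostBits`, and `missCostBits` are hypothetical helpers, the index width is assumed to be the smallest number of bits that can address every cache slot, and the length prefix is assumed to be an unsigned `varint`.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// indexBits returns how many bits are needed to address any slot of a cache
// with the given capacity (for example, 2 bits for 4 slots).
func indexBits(capacity int) int {
	if capacity <= 1 {
		return 0
	}
	return bits.Len(uint(capacity - 1))
}

// hitCostBits is the cost of a cache hit: two control bits plus the index.
func hitCostBits(capacity int) int {
	return 2 + indexBits(capacity)
}

// missCostBits is the cost of writing a value in full when the write starts
// at bit position offset in the stream: two control bits, a varint length,
// zero padding up to the next byte boundary, and then the bytes themselves.
func missCostBits(offset, valueLen int) int {
	var buf [binary.MaxVarintLen64]byte
	lengthBits := 8 * binary.PutUvarint(buf[:], uint64(valueLen))
	unpadded := offset + 2 + lengthBits
	padding := (8 - unpadded%8) % 8
	return 2 + lengthBits + padding + 8*valueLen
}

func main() {
	fmt.Println("hit, capacity 4:", hitCostBits(4), "bits")
	fmt.Println("miss, 3-byte value at offset 0:", missCostBits(0, 3), "bits")
}
```

Under these assumptions a hit in a 4-entry cache costs 4 bits, while writing a 3-byte value in full from a byte-aligned position costs 40 bits, 6 of which are padding.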

### Compression Limitations

While this compression applies to all scalar types at the top level of a message, it does not apply to any data that is part of `repeated` fields, `map` fields, or nested messages.
The `nested message` restriction may be lifted in the future, but the `repeated` and `map` restrictions are unlikely to change due to the difficulty of compressing variably sized fields.

## Binary Format

At a high level, compressing Protobuf messages consists of the following:

1. Scanning each schema to identify which fields can be extracted and compressed as described in the "Compression Techniques" section.
2. Iterating over all fields (as well as the timestamp) for each message as it arrives and encoding the next value for that field into its corresponding compressed stream.

In practice, a Protobuf message can have any number of different fields; for performance reasons, it's impractical to maintain an independent stream at the field level.
Instead, multiple "logical" streams are interleaved on a per-write basis within one physical stream.
The remainder of this section outlines the binary format used to accomplish this interleaving.

The binary format begins with a stream header, and the remainder of the stream is a sequence of tuples in the form: `<per-write header, compressed timestamp, compressed custom encoded fields, Protobuf marshalled fields>`

### Stream Header

Every compressed stream begins with a header which includes the following information:

1. encoding scheme version (`varint`)
2. dictionary compression LRU cache size (`varint`)

In the future the dictionary compression LRU cache size may be moved to the per-write control bits section so that it can be updated mid-stream (as opposed to only being updatable at the beginning of a new stream).

### Per-Write Header

#### Per-Write Control Bits

Every write is prefixed with a header that contains at least one control bit.

If the control bit is set to `1`, then the stream contains another write that needs to be decoded, implying that the timestamp can be decoded as well.

If the control bit is set to `0`, then either **(a)** the end of the stream has been reached *or* **(b)** a time unit and/or schema change has been encountered.

This ambiguity is resolved by reading the next control bit, which will be `0` if this is the end of the stream or `1` if a time unit and/or schema change has been encountered.

If the second control bit is `1` (meaning this is not the end of the stream), then the next two bits should be interpreted as boolean control bits indicating if there has been a time unit or schema change respectively.

Time unit changes must be tracked manually (instead of deferring to the M3TSZ timestamp delta-of-delta encoder, which can handle this independently) because the M3TSZ encoder relies on a custom marker scheme to indicate time unit changes that is not suitable for the Protobuf encoding format.
The M3TSZ timestamp encoder avoids using a control bit for each write by using a marker which contains a prefix that could not be generated by any possible input, and the decoder frequently "looks ahead" for this marker to see if it needs to decode a time unit change.
The Protobuf encoding scheme has no equivalent "impossible bit combination", so it uses explicit control bits to indicate a time unit change instead.

The table below contains a summary of all the possible per-write control bit combinations. Note that `X` is used to denote control bits that will not be included; so even though the format may look inefficient because it requires a maximum of 4 control bits to encode only 6 different combinations, the most common scenario (where the stream contains at least one more write and neither the time unit nor the schema has changed) can be encoded with just a single bit.

| Combination | Control Bits | Meaning                                                                                     |
|-------------|--------------|---------------------------------------------------------------------------------------------|
| 1           | 1XXX         | The stream contains at least one more write.                                                |
| 2           | 00XX         | End of stream.                                                                              |
| 3           | 0101         | The stream contains at least one more write and the schema has changed.                     |
| 4           | 0110         | The stream contains at least one more write and the time unit has changed.                  |
| 5           | 0111         | The stream contains at least one more write and both the schema and time unit have changed. |
| 6           | 0100         | Impossible combination.                                                                     |

The header ends immediately after combinations #1 and #2, but combinations #3, #4, and #5 will be followed by an encoded time unit change and/or schema change.
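The table above translates directly into a small decoding routine. The following is a minimal sketch, not the package's decoder: `writeHeader` and `parseWriteHeader` are hypothetical names, and the control bits are passed as a string of `'0'` and `'1'` characters instead of being read from a real bit stream.

```go
package main

import "fmt"

// writeHeader is a hypothetical decoded form of the per-write control bits.
type writeHeader struct {
	endOfStream     bool
	timeUnitChanged bool
	schemaChanged   bool
}

// parseWriteHeader decodes the control bits at the start of bits and returns
// the header along with the number of bits that were consumed.
func parseWriteHeader(bits string) (writeHeader, int, error) {
	need := func(n int) error {
		if len(bits) < n {
			return fmt.Errorf("need %d control bits, have %d", n, len(bits))
		}
		return nil
	}
	if err := need(1); err != nil {
		return writeHeader{}, 0, err
	}
	if bits[0] == '1' {
		// Combination 1: another write follows and nothing has changed.
		return writeHeader{}, 1, nil
	}
	if err := need(2); err != nil {
		return writeHeader{}, 0, err
	}
	if bits[1] == '0' {
		// Combination 2: end of stream.
		return writeHeader{endOfStream: true}, 2, nil
	}
	if err := need(4); err != nil {
		return writeHeader{}, 0, err
	}
	// Combinations 3 to 5: the third bit flags a time unit change and the
	// fourth bit flags a schema change.
	h := writeHeader{timeUnitChanged: bits[2] == '1', schemaChanged: bits[3] == '1'}
	if !h.timeUnitChanged && !h.schemaChanged {
		// Combination 6 should never be produced by an encoder.
		return writeHeader{}, 0, fmt.Errorf("invalid control bits %q", bits[:4])
	}
	return h, 4, nil
}

func main() {
	for _, bits := range []string{"1", "00", "0101", "0110", "0111", "0100"} {
		h, n, err := parseWriteHeader(bits)
		fmt.Printf("%s -> %+v (consumed %d bits, err: %v)\n", bits, h, n, err)
	}
}
```

The early return on the first bit is what keeps the common case at a single bit per write.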

#### Time Unit Encoding

Time unit changes are encoded using a single byte such that every possible time unit has a unique value.

#### Schema Encoding

An encoded schema can be thought of as a sequence of `<fieldNum, fieldType>` tuples and is encoded as follows:

1. highest field number (`N`) that will be described (`varint`)
2. `N` sets of 3 bits where each set corresponds to the "custom type", which is enough information to determine how the field should be compressed / decompressed. This is analogous to a Protobuf [`wire type`](https://developers.google.com/protocol-buffers/docs/encoding) in that it includes enough information to skip over the field if it's not present in the schema that is being used to decode the message.

Notably, the list only *explicitly* encodes the custom field type. *Implicitly*, the Protobuf field number is encoded by the position of the entry in the list.
In other words, the list of custom encoded fields can be thought of as a bitset, except that instead of using a single bit to encode the value at a given position, we use 3.

For example, given the following Protobuf schema:

```protobuf
message Foo {
	reserved 2, 3;
	string query = 1;
	int32 page_number = 4;
}
```

Encoding the list of custom compressed fields begins by encoding `4` as a `varint`, since that is the highest non-reserved field number.

Next, the field numbers and their types are encoded, 3 bits at a time, where the field number is implied from their position in the list (starting at index 1 since Protobuf field numbers start at 1), and the type is encoded in the 3 bit combination:

`string query = 1;` is encoded as the first value (indicating field number 1) with the bit combination `111` indicating that it should be treated as `bytes` for compression purposes.

Next, `000` is encoded twice to indicate that no custom compression will be performed for fields `2` or `3` since they are reserved.

Finally, `010` is encoded as the fourth item to indicate that field number `4` will be treated as a signed 32 bit integer.

Note that only fields that support custom encoding are included in the schema. This is because the Protobuf encoding format will take care of schema changes for any non-custom-encoded fields as long as they are valid updates [according to the Protobuf specification](https://developers.google.com/protocol-buffers/docs/proto3#updating).

##### Custom Types

0. (`000`): Not custom encoded - This type indicates that no custom compression will be applied to this field; instead, the standard Protobuf encoding will be used.
1. (`001`): Signed 64 bit integer (`int64`, `sint64`)
2. (`010`): Signed 32 bit integer (`int32`, `sint32`, `enum`)
3. (`011`): Unsigned 64 bit integer (`uint64`, `fixed64`)
4. (`100`): Unsigned 32 bit integer (`uint32`, `fixed32`)
5. (`101`): 64 bit float (`double`)
6. (`110`): 32 bit float (`float`)
7. (`111`): bytes (`bytes`, `string`)
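Putting the schema layout and the custom type codes together, the `Foo` example can be encoded with a few lines of code. This is a sketch for illustration only: `customType`, its constants, and `encodeSchema` are hypothetical names, the highest field number is assumed to be written as an unsigned `varint`, and the 3-bit type codes are returned as a string of `'0'` and `'1'` characters instead of being packed into a bit stream.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

// customType mirrors the 3-bit codes in the list above (hypothetical names).
type customType uint8

const (
	notCustomEncoded customType = iota // 000
	signedInt64                        // 001
	signedInt32                        // 010
	unsignedInt64                      // 011
	unsignedInt32                      // 100
	float64Field                       // 101
	float32Field                       // 110
	bytesField                         // 111
)

// encodeSchema returns the varint-encoded highest field number and the 3-bit
// type codes for field numbers 1 through that highest number. Field numbers
// that are missing from the map are written as 000 (not custom encoded).
func encodeSchema(fields map[int]customType) ([]byte, string) {
	highest := 0
	for num := range fields {
		if num > highest {
			highest = num
		}
	}
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], uint64(highest))

	var typeBits strings.Builder
	for num := 1; num <= highest; num++ {
		fmt.Fprintf(&typeBits, "%03b", fields[num])
	}
	return buf[:n], typeBits.String()
}

func main() {
	// message Foo: string query = 1; int32 page_number = 4; 2 and 3 reserved.
	header, typeBits := encodeSchema(map[int]customType{1: bytesField, 4: signedInt32})
	fmt.Printf("highest field number varint: %v\n", header)
	fmt.Printf("type bits: %s\n", typeBits)
}
```

For `Foo` this produces the single varint byte `4` followed by the type bits `111 000 000 010`, matching the walkthrough in the "Schema Encoding" section.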

### Compressed Timestamp

The Protobuf compression scheme reuses the delta-of-delta timestamp encoding logic that is implemented in the M3TSZ package and described in the [Facebook Gorilla paper](https://www.vldb.org/pvldb/vol8/p1816-teller.pdf).

After encoding a control bit with a value of `1` (indicating that there is another write in the stream), the delta-of-delta of the current and previous timestamp is encoded.

Similarly, when decoding the stream, the inverse operation is performed to reconstruct the current timestamp based on both the previous timestamp and delta-of-delta encoded into the stream.
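The arithmetic behind the encode and decode directions is sketched below. This only illustrates the delta-of-delta calculation: the `dodState` type is a hypothetical name, both sides are assumed to start from the same known timestamp with a previous delta of zero, and the way M3TSZ packs the resulting value into bits is not shown.

```go
package main

import "fmt"

// dodState holds what the encoder and the decoder must each remember
// between writes.
type dodState struct {
	prevTime  int64
	prevDelta int64
}

// encode returns the delta-of-delta for timestamp t and advances the state.
func (s *dodState) encode(t int64) int64 {
	delta := t - s.prevTime
	dod := delta - s.prevDelta
	s.prevTime, s.prevDelta = t, delta
	return dod
}

// decode reconstructs the timestamp from a delta-of-delta and advances the
// state with the same update that encode performs.
func (s *dodState) decode(dod int64) int64 {
	delta := s.prevDelta + dod
	t := s.prevTime + delta
	s.prevTime, s.prevDelta = t, delta
	return t
}

func main() {
	enc := &dodState{prevTime: 1000}
	dec := &dodState{prevTime: 1000}
	for _, t := range []int64{1010, 1020, 1030, 1045} {
		dod := enc.encode(t)
		fmt.Printf("timestamp %d -> delta-of-delta %d -> decoded %d\n", t, dod, dec.decode(dod))
	}
}
```

Regularly spaced writes yield a delta-of-delta of zero, which is the case the Gorilla-style encoding is designed to make cheap.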

### Compressed Protobuf Fields

Compressing the Protobuf fields is broken into two stages:

1. Custom compressed fields
2. Protobuf marshalled fields

In the first stage, any eligible custom fields are compressed as described in the "Compression Techniques" section.

In the second stage, the Protobuf marshalling format is used to encode and decode the data, with the caveat that fields are compared at the top level and re-encoding is avoided if they have not changed.

#### Custom Compressed Protobuf Fields

The custom fields' compressed values are encoded similarly to how their types are encoded as described in the "Schema Encoding" section.
In fact, they're even encoded in the same order, with the caveat that unlike when we're encoding the types, we don't need to encode a null value for non-contiguous field numbers for which we're not performing any compression.

All of the compressed values are encoded sequentially and no separator or control bits are placed between them, which means that they must encode enough information such that a decoder can determine where each one ends and the next one begins.

The values themselves are encoded based on the field type and the compression technique that is being applied to it. For example, considering the sample Protobuf message from earlier:

```protobuf
message Foo {
	reserved 2, 3;
	string query = 1;
	int32 page_number = 4;
}
```

If the `string` field `query` had never been encoded before, the following control bits would be encoded: `1` (indicating that the value had changed since its previous empty value), followed by `1` again (indicating that the value was not found in the LRU cache and would be encoded in its entirety with a `varint` length prefix).

Next, 6 bits would be used to encode the number of significant digits in the delta between the current `page_number` and the previous `page_number`, followed by a control bit indicating if the delta is positive or negative, and then finally the significant bits themselves.

Note that the values encoded for both fields are "self contained" in that they encode all the information required to determine when the end has been reached.
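The `page_number` layout described above can be sketched as a function that returns the written bits as a string. This follows only the sentence above and is not the full M3TSZ integer scheme: `encodeIntDelta` is a hypothetical name, and the sign bit polarity (`0` for positive, `1` for negative) is an assumption.

```go
package main

import (
	"fmt"
	"math/bits"
)

// encodeIntDelta returns the bits written for the change from prev to cur:
// 6 bits holding the number of significant bits in the delta, one sign bit
// (assumed here to be 0 for positive and 1 for negative), and then the
// significant bits of the delta's magnitude.
func encodeIntDelta(prev, cur int32) string {
	// Widen to int64 so that the subtraction cannot overflow.
	delta := int64(cur) - int64(prev)
	sign, magnitude := "0", uint64(delta)
	if delta < 0 {
		sign, magnitude = "1", uint64(-delta)
	}
	numSig := bits.Len64(magnitude)
	out := fmt.Sprintf("%06b", numSig) + sign
	if numSig > 0 {
		out += fmt.Sprintf("%b", magnitude)
	}
	return out
}

func main() {
	fmt.Println(encodeIntDelta(0, 5)) // delta of +5
	fmt.Println(encodeIntDelta(5, 3)) // delta of -2
	fmt.Println(encodeIntDelta(7, 7)) // delta of 0
}
```

Because the first 6 bits state how many significant bits follow, a decoder knows exactly where the value ends without needing a separator, which is the "self contained" property noted above.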

#### Protobuf Marshalled Fields (non custom encoded / compressed)

We recommend reading the [Protocol Buffers Encoding](https://developers.google.com/protocol-buffers/docs/encoding) section of the official documentation before reading this section.
Specifically, understanding how Protobuf messages are (basically) encoded as a stream of tuples in the form of `<field number, wire type, value>` will make understanding this section much easier.

The Protobuf marshalled fields section of the encoding scheme contains all the values that don't currently support performing custom compression.
For the most part, the output of this section is similar to the result of calling `Marshal()` on a message in which all the custom compressed fields have already been removed, and the only remaining fields are ones that Protobuf will encode directly.
This is possible because, as described in the Protobuf encoding section linked above, the Protobuf wire format does not encode **any** data for fields which are not set or are set to a default value, so by "clearing" the fields that have already been encoded, they can be omitted when marshalling the remainder of the Protobuf message.

While Protobuf's wire format is leaned upon heavily, specific attention is given to avoiding re-encoding fields that haven't changed since the previous value, where "haven't changed" is defined at the top most level of the message.

For example, consider encoding messages with the following schema:

```protobuf
message Outer {
  message Nested {
    message NestedDeeper {
      int64 ival = 1;
      bool  booly = 2;
    }
    int64 outer = 1;
    NestedDeeper deeper = 2;
  }

  Nested nested = 1;
}
```

If none of the values inside `nested` have changed since the previous message, the `nested` field doesn't need to be encoded at all.
However, if any of the fields have changed, like `nested.deeper.booly` for example, then the entire `nested` field must be re-encoded (including the `outer` field, even though only the `deeper` field changed).

This top-level "only if it has changed" delta encoding can be used because, when the stream is decoded later, the original message can be reconstructed by merging the previously-decoded message with the current delta message, which contains only fields that have changed since the previous message.

Only marshalling the fields that have changed since the previous message works for the most part, but there is one important edge case: because the Protobuf wire format does not encode **any** data for fields that are set to a default value (zero for integers and floats, empty array for `bytes` and `string`, etc.), using the standard Protobuf marshalling format with delta encoding works in every scenario *except* for the case where a field is changed from a non-default value to a default value (because it is not possible to express explicitly setting a field to its default value).

This issue is mitigated by encoding an additional optional (as in it is only encoded when necessary) bitset which indicates any field numbers that were set to the default value of the field's type.

The bitset encoding is straightforward: it begins with a `varint` that encodes the length (number of bits) of the bitset, and then the remaining `N` bits are interpreted as a 1-indexed bitset (because field numbers start at 1 not 0) where a value of `1` indicates the field was changed to its default value.
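The edge case and the role of the bitset can be demonstrated with a toy model. This is a sketch under simplifying assumptions, not the package's logic: `message` is a hypothetical stand-in that maps top-level field numbers to integer values (zero meaning "default, absent from the wire"), `diff`, `defaultBitset`, and `merge` are hypothetical helpers, and the bitset length is assumed to equal the highest field number that was reset.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// message is a toy stand-in for a Protobuf message: field number -> value.
// A zero value means the field is at its default and is not marshalled.
type message map[int]int64

// diff returns the fields whose non-default value changed, plus the sorted
// field numbers that went from a non-default value back to the default.
func diff(prev, cur message) (changed message, defaulted []int) {
	changed = message{}
	for num, v := range cur {
		if v != 0 && prev[num] != v {
			changed[num] = v
		}
	}
	for num, v := range prev {
		if v != 0 && cur[num] == 0 {
			defaulted = append(defaulted, num)
		}
	}
	sort.Ints(defaulted)
	return changed, defaulted
}

// defaultBitset renders the 1-indexed bitset for the sorted field numbers.
func defaultBitset(defaulted []int) string {
	if len(defaulted) == 0 {
		return ""
	}
	b := []byte(strings.Repeat("0", defaulted[len(defaulted)-1]))
	for _, num := range defaulted {
		b[num-1] = '1'
	}
	return string(b)
}

// merge rebuilds the current message on the decoder side.
func merge(prev, changed message, defaulted []int) message {
	out := message{}
	for num, v := range prev {
		out[num] = v
	}
	for num, v := range changed {
		out[num] = v
	}
	for _, num := range defaulted {
		delete(out, num)
	}
	return out
}

func main() {
	prev := message{1: 10, 2: 20, 5: 7}
	cur := message{1: 11} // fields 2 and 5 were reset to their default
	changed, defaulted := diff(prev, cur)
	fmt.Println("delta:", changed, "bitset:", defaultBitset(defaulted))
	fmt.Println("with bitset:   ", merge(prev, changed, defaulted))
	fmt.Println("without bitset:", merge(prev, changed, nil))
}
```

Without the bitset, the merge has no way to tell that fields 2 and 5 were reset, so their stale values survive in the reconstructed message.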

##### Protobuf Marshalled Fields Encoding Format

The Protobuf Marshalled Fields section of the encoding begins with a single control bit that indicates whether there have been any changes to the Protobuf encoded portion of the message at all.
If the control bit is set to `1`, then there have been changes and decoding must continue; if it is set to `0`, then there were no changes and the decoder can skip to the next write (or stop, if at the end of the stream).

If the previous control bit was set to `1`, indicating that there have been changes, then there will be another control bit which indicates whether any fields have been set to a default value.
If so, then its value will be `1` and the subsequent bits should be interpreted as a `varint` encoding the length of the bitset followed by the actual bitset bits as discussed above.
If the value is `0`, then there is no bitset to decode.

At this point, if the stream is not byte-aligned, it is padded with zeros up to the next byte boundary. This reduces compression slightly (a maximum of 7 bits per message that contains non-custom encoded fields), but significantly improves the speed at which large marshalled Protobuf fields can be encoded and decoded.

Finally, this portion of the encoding will end with a `varint` that encodes the length of the bytes that would be generated by calling `Marshal()` on the message (where any custom-encoded or unchanged fields were cleared) followed by the actual marshalled bytes themselves.
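The layout of this section can be summarized with a function that appends it to a stream represented as a string of `'0'` and `'1'` characters. This is an illustrative sketch and not the package's encoder: `frame`, `bitsOf`, and `uvarint` are hypothetical helpers, both length prefixes are assumed to be unsigned `varint` values, and `marshalled` stands in for the output of `Marshal()`.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

// bitsOf renders bytes as a string of '0' and '1' characters.
func bitsOf(b []byte) string {
	var sb strings.Builder
	for _, x := range b {
		fmt.Fprintf(&sb, "%08b", x)
	}
	return sb.String()
}

// uvarint returns the unsigned varint encoding of n.
func uvarint(n int) []byte {
	var buf [binary.MaxVarintLen64]byte
	size := binary.PutUvarint(buf[:], uint64(n))
	return buf[:size]
}

// frame appends the marshalled fields section to stream. defaultBitset is
// the bitset of fields reset to their default value ("" if there are none).
func frame(stream string, changed bool, defaultBitset string, marshalled []byte) string {
	if !changed {
		// Nothing changed: a single 0 control bit and the section is done.
		return stream + "0"
	}
	stream += "1"
	if defaultBitset == "" {
		stream += "0"
	} else {
		stream += "1" + bitsOf(uvarint(len(defaultBitset))) + defaultBitset
	}
	// Pad with zeros so that the marshalled bytes start on a byte boundary.
	for len(stream)%8 != 0 {
		stream += "0"
	}
	return stream + bitsOf(uvarint(len(marshalled))) + bitsOf(marshalled)
}

func main() {
	// Three bits already in the stream, then two marshalled bytes, no bitset.
	fmt.Println(frame("101", true, "", []byte{0x08, 0x01}))
	// Nothing changed since the previous message.
	fmt.Println(frame("101", false, "", nil))
}
```

In the first call the two control bits bring the stream to 5 bits, so 3 padding bits are added before the length prefix and the marshalled bytes are written.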