// Copyright 2015 The LUCI Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// Package cmpbin provides binary serialization routines which ensure that
// serialized objects keep the sort order of the original inputs when compared
// bytewise (i.e. with memcmp). Additionally, serialized objects are
// concatenatable, and concatenated items compare as if they were compared
// field by field. For example, comparing two []string values element by
// element gives the same result as comparing the concatenations of their
// cmpbin-encoded strings. Simply concatenating the strings without encoding
// them does NOT retain this property: []string{"a", "aa"} and
// []string{"aa", "a"} both concatenate to "aaa" and cannot be distinguished.
// With cmpbin, these two unambiguously sort as ("a", "aa") < ("aa", "a").
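//
// The ambiguity of unencoded concatenation can be seen with the standard
// library alone. The sketch below does not use cmpbin; naiveKey is a
// hypothetical helper introduced only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveKey (hypothetical) joins the fields with no framing at all, so the
// boundaries between fields are lost.
func naiveKey(fields []string) string {
	return strings.Join(fields, "")
}

func main() {
	x := naiveKey([]string{"a", "aa"})
	y := naiveKey([]string{"aa", "a"})
	// Both keys are "aaa", so the two distinct inputs compare as equal.
	fmt.Println(x == y)
}
```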
//
// Notes on particular serialization schemes:
//
// - Numbers
// The number encoding is less efficient on average than Varint
// ("encoding/binary") for small numbers (it has a minimum encoded size of
// 2 bytes), but is more efficient for large numbers (it has a maximum encoded
// size of 9 bytes for a 64-bit int, whereas the largest Varint has a 10-byte
// representation).
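//
// For reference, the Varint sizes mentioned above can be checked directly
// against "encoding/binary". This is only a small sketch; varintLen is a
// hypothetical helper, not part of cmpbin.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// varintLen (hypothetical) returns how many bytes binary.PutVarint uses for v.
func varintLen(v int64) int {
	buf := make([]byte, binary.MaxVarintLen64)
	return binary.PutVarint(buf, v)
}

func main() {
	// Small values take a single byte; the extremes of int64 take
	// binary.MaxVarintLen64 bytes.
	for _, v := range []int64{0, 1, 63, 64, math.MaxInt64, math.MinInt64} {
		fmt.Printf("%d -> %d byte(s)\n", v, varintLen(v))
	}
}
```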
//
// Both signed and unsigned numbers are encoded with the same scheme, and will
// sort together as signed numbers. Decoding with the incorrect routine will
// result in an ErrOverflow/ErrUnderflow error if the actual value is out of
// range.
//
// The scheme works like:
//   - given a two's complement value V
//   - extract the sign (S) and magnitude (M) of V
//   - find the position of the highest set bit of M, minus 1 (P)
//   - write (bits): SPPPPPPP MMMMMMMM MM000000, where S is 1, the P's are
//     log2(M)-1, the M's are the magnitude of V, and the 0's are padding
//   - additionally, if the number is negative, invert the bits of all the
//     bytes (i.e. XOR each byte with 0xFF). This makes the sign bit S 0 for
//     negative numbers, and makes the ordering of the numbers correct when
//     compared bytewise.
//
// - Strings/[]byte
// Each byte in the encoded stream reserves the least significant bit as a stop
// bit (1 means that the string continues, 0 means that the string ends). The
// actual user data is shifted into the top 7 bits of every encoded byte. This
// spends 1 bit in every 8 on framing, an overhead of 12.5% of the encoded
// size, but the overhead is constant (it doesn't vary with the encoded
// content). Note that if space efficiency is very important and you are
// storing large strings on average, you could reduce the overhead by only
// placing the stop bit on every other byte or every 4th byte, etc. This would
// reduce the overhead to 6.25% or 3.125% respectively (but would cause every
// string to round out to 2 or 4 byte chunks), and it would make the algorithm
// implementation more complex. The current implementation was chosen as good
// enough in light of the fact that pre-compressing regular data could save
// more than 12.5% overall, and that for incompressible data a commonly used
// encoding scheme (base64) has a full 25% overhead (and a generally more
// complex implementation).
//
// - Floats
// Floats are tricky (really tricky) because they have lots of weird
// non-sortable special cases (like NaN). That said, for the majority of
// non-weird cases, the implementation here will sort real numbers the way that
// you would expect.
//
// The implementation is derived from http://stereopsis.com/radix.html, and full
// credit for the original algorithm goes to Michael Herf. The algorithm is
// essentially:
//
//   - if the number is positive, flip the top bit
//   - if the number is negative, flip all the bits
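//
// The two rules above can be sketched as follows for float64. sortableBits and
// encodeFloat are hypothetical names used only for this illustration; writing
// the result big-endian is an assumption made so that bytewise comparison
// matches the numeric comparison of the transformed bits.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// sortableBits (hypothetical) applies the bit flips to the IEEE 754
// representation of f.
func sortableBits(f float64) uint64 {
	b := math.Float64bits(f)
	if b>>63 == 0 {
		return b ^ (1 << 63) // sign bit clear: flip only the top bit
	}
	return ^b // sign bit set: flip every bit
}

// encodeFloat (hypothetical) writes the transformed bits big-endian.
func encodeFloat(f float64) []byte {
	out := make([]byte, 8)
	binary.BigEndian.PutUint64(out, sortableBits(f))
	return out
}

func main() {
	vals := []float64{
		math.Inf(1), math.MaxFloat64, math.SmallestNonzeroFloat64, 0,
		math.Copysign(0, -1), -math.MaxFloat64, math.Inf(-1),
	}
	for _, f := range vals {
		fmt.Printf("%-24v 0x%016X\n", f, sortableBits(f))
	}
}
```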
//
// Floats are not varint encoded, although you could varint encode the mantissa
// (significand). This is only a 52-bit section, meaning that it is normally
// encoded with 6.5 bytes (a nybble is stolen from the second exponent byte).
// Assuming you used the numerical encoding above, shifted left by 4 bits,
// discarding the sign bit (since it's already the MSb of the float), and then
// using 6 bits (instead of 7) to represent the number of significant bits in
// the mantissa (since there are only a maximum of 52), you could expect to see
// small-mantissa floats (of any characteristic) encoded in 3 bytes (this has
// 6 bits of mantissa), and the largest floats would have an encoded size of
// 9 bytes (with 2 wasted bits). However, the implementation complexity would
// be higher.
//
// The actual encoded values for special cases are (sorted high to low):
//   - QNaN                    - 0xFFF8000000000000
//   - SNaN                    - 0xFFF0000000000001
//     (note that golang doesn't seem to actually have SNaN?)
//   - +inf                    - 0xFFF0000000000000
//   - MaxFloat64              - 0xFFEFFFFFFFFFFFFF
//   - SmallestNonzeroFloat64  - 0x8000000000000001
//   - 0                       - 0x8000000000000000
//   - -0                      - 0x7FFFFFFFFFFFFFFF
//   - -SmallestNonzeroFloat64 - 0x7FFFFFFFFFFFFFFE
//   - -MaxFloat64             - 0x0010000000000000
//   - -inf                    - 0x000FFFFFFFFFFFFF
package cmpbin