// Copyright 2015 The LUCI Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// Package cmpbin provides binary serialization routines which ensure that the
// serialized objects maintain the same sort order of the original inputs when
// sorted bytewise (i.e. with memcmp). Additionally, serialized objects are
// concatenatable, and the concatenated items will behave as if they're compared
// field-to-field. So, for example, comparing each string in a []string would
// compare the same way as comparing the concatenation of those strings encoded
// with cmpbin. Simply concatenating the strings without encoding them will
// NOT retain this property, as you could not distinguish []string{"a", "aa"}
// from []string{"aa", "a"}. With cmpbin, these two would unambiguously sort as
// ("a", "aa") < ("aa", "a").
//
// Notes on particular serialization schemes:
//
//   - Numbers:
//     The number encoding is less efficient on average than Varint
//     ("encoding/binary") for small numbers (it has a minimum encoded size of
//     2 bytes), but is more efficient for large numbers (it has a maximum encoded
//     size of 9 bytes for a 64 bit int, unlike the largest Varint which has a
//     10 byte representation).
//
//     Both signed and unsigned numbers are encoded with the same scheme, and will
//     sort together as signed numbers. Decoding with the incorrect routine will
//     result in an ErrOverflow/ErrUnderflow error if the actual value is out of
//     range.
//
//     The scheme works like:
//       - given a two's complement value V
//       - extract the sign (S) and magnitude (M) of V
//       - find the position of the highest bit (P), minus 1.
//       - write (bits):
//           - SPPPPPPP MMMMMMMM MM000000
//           - S is 1
//           - P's are the log2(M)-1
//           - M's are the magnitude of V
//           - 0's are padding
//       - Additionally, if the number is negative, invert the bits of all the
//         bytes (i.e. XOR each byte with 0xFF). This makes the sign bit S 0 for
//         negative numbers, and makes the ordering of the numbers correct when
//         compared bytewise.
//
//   - Strings/[]byte
//     Each byte in the encoded stream reserves the least significant bit as a
//     stop bit (1 means that the string continues, 0 means that the string ends).
//     The actual user data is shifted into the top 7 bits of every encoded byte.
//     This results in an overhead of 12.5% of the encoded size (one bit per
//     encoded byte), but this overhead is constant (it doesn't vary by the
//     encoded content). Note that if space efficiency is very important and you
//     are storing large strings on average, you could reduce the overhead by only
//     placing the stop bit on every other byte or every 4th byte, etc. This would
//     reduce the overhead to 6.25% or 3.125% accordingly (but would cause every
//     string to round out to 2 or 4 byte chunks), and it would make the algorithm
//     implementation more complex. The current implementation was chosen as good
//     enough in light of the fact that pre-compressing regular data could save
//     more than 12.5% overall, and that for incompressible data a commonly used
//     encoding scheme (base64) has a full 25% overhead (and a generally more
//     complex implementation).
//
//   - Floats
//     Floats are tricky (really tricky) because they have lots of weird
//     non-sortable special cases (like NaN). That said, for the majority of
//     non-weird cases, the implementation here will sort real numbers the way
//     that you would expect.
//
//     The implementation is derived from http://stereopsis.com/radix.html, and
//     full credit for the original algorithm goes to Michael Herf. The algorithm
//     is essentially:
//
//       - if the number is positive, flip the top bit
//       - if the number is negative, flip all the bits
//
//     Floats are not varint encoded, although you could varint encode the
//     mantissa (significand). This is only a 52 bit section, meaning that it is
//     normally encoded with 6.5 bytes (a nybble is stolen from the second
//     exponent byte). Assuming you used the numerical encoding above, shifted
//     left by 4 bits, discarding the sign bit (since it's already the MSb of the
//     float), and then using 6 bits (instead of 7) to represent the number of
//     significant bits in the mantissa (since there are only a maximum of 52),
//     you could expect to see small-mantissa floats (of any characteristic)
//     encoded in 3 bytes (this has 6 bits of mantissa), and the largest floats
//     would have an encoded size of 9 bytes (with 2 wasted bits). However, the
//     implementation complexity would be higher.
//
//     The actual encoded values for special cases are (sorted high to low):
//       - QNaN                    - 0xFFF8000000000000
//         // note that golang doesn't seem to actually have SNaN?
//       - SNaN                    - 0xFFF0000000000001
//       - +inf                    - 0xFFF0000000000000
//       - MaxFloat64              - 0xFFEFFFFFFFFFFFFF
//       - SmallestNonzeroFloat64  - 0x8000000000000001
//       - 0                       - 0x8000000000000000
//       - -0                      - 0x7FFFFFFFFFFFFFFF
//       - -SmallestNonzeroFloat64 - 0x7FFFFFFFFFFFFFFE
//       - -MaxFloat64             - 0x0010000000000000
//       - -inf                    - 0x000FFFFFFFFFFFFF
package cmpbin