ZebraPack: Fast, friendly serialization
GolangDFW Meetup, 2017 February 16

Jason E. Aten, Ph.D.
Computer Scientist/Gopher
j.e.aten@gmail.com

* ZebraPack

- a data description language and serialization format. Like Gobs version 2.0.

- removes gray areas from the language bindings. Provides for declared schemas, sane data evolution, and more compact encoding.

- maintains easy compatibility with all the dynamic languages that already have msgpack2 support.

- a day's work to adapt an existing language binding to read zebrapack: the schemas are in msgpack2, and then one simply keeps a hashmap to translate between small integer <-> field name/type.

- MIT licensed. https://github.com/glycerine/zebrapack

* zebrapack: the main idea

.code structdef

* zebrapack: the main idea 2

.code transform


* motivation: why start with msgpack2 (http://msgpack.org)?

- msgpack2 is simple, fast, and extremely portable.

- It has an implementation in every language you've heard of, and some you haven't (some 50 libraries are available).

- It has a simple and short spec.

- msgpack2 is dynamic-language friendly because it is largely self-describing.

- most significantly: the existing library github.com/tinylib/msgp is extremely well tuned, and generates Go bindings by reading your Go source.

* Problems with msgpack2

- poorly defined language binding (signed/unsigned/bitwidth of integer?)

- a.k.a. insufficiently strong typing.

- weak support for data evolution, i.e. no conflict detection and no omitempty support in the prior libraries => they crash on unexpected fields.



* Problem example

- the widely emulated C-encoder for msgpack chooses to encode signed positive integers as unsigned integers.

- This causes crashes in readers that were expecting a signed integer

- which they may have originated themselves in the original struct.

- the existing practice for msgpack2 language bindings allows the data types to change as they are read and re-serialized.

- Simple copying of a serialized struct can change the types of data from signed to unsigned.

- This is horrible.



* Addressing the problems

- for language binding: strongly define the types of fields.

- simply parse from the Go source. No separate IDL, your Go code is your one source of truth.

- For efficiency and data evolution: adopt a new convention about how to encode the field names of structs. Use small integer fields.

* Addressing the problems II

- Structs are encoded in msgpack2 using maps, as usual.

- maps that represent structs are now keyed by integers.

- Rather than strings as keys

- these integers are associated with a field name and type in a (separable) schema.

- The schema is also defined and encoded in msgpack2.


* Result

- resulting binary encoding is very similar in style to protobufs/Thrift/Cap'n Proto.

- However it is much more friendly to dynamic languages; e.g. R, python, zygo

- Also it is screaming fast.

* Benchmarking Reads

.code readperf

* Benchmarking Writes

.code writeperf

* Advantages and advances: pulling the best ideas from other formats

- Once we have a schema, we can be very strongly typed, and be very efficient.

- We borrow the idea of field deprecation from FlatBuffers

- For conflicting update detection, we use Cap'n Proto's field numbering discipline (contiguous integers from 0..N-1).

- support for the `omitempty` tag

- in ZebraPack, all fields are `omitempty`

- If they are empty they won't be serialized on the wire.
  Like FlatBuffers and Protobufs, this lets one define a very large schema of possibilities, and then transmit over the wire only the small (efficient) portion that is currently relevant.

* Credit to Philip Hofer

Full credit: the ZebraPack code descends from the fantastic msgpack2 code generator https://github.com/tinylib/msgp by Philip Hofer.


* deprecating fields

.code depra1

* deprecating fields II

.code depra2

* Safety rules during data evolution

- Rules for safe data changes: to keep changes forwards/backwards compatible, you must *never remove a field* from a struct, once that field has been defined and used.

- In the example above, the `zid:"4"` tag must stay in place, to prevent someone else from ever using 4 again.

- This allows sane data forward evolution, without tears, fears, or crashing of servers.

- The fact that `struct{}` fields take up no space also means that there is no need to worry about loss of performance when deprecating.

- We retain all fields ever used for their zebra ids, and the compiled Go code wastes no extra space for the deprecated fields.

* schema details

- Precisely defined format

- see the repo for examples and details.

- https://github.com/glycerine/zebrapack

* `zebrapack -msgp` as a msgpack2 code-generator

* `msg:",omitempty"` tags on struct fields

If you're using `zebrapack -msgp` to generate msgpack2 serialization code, then you can use the `omitempty` tag on your struct fields.

In the following example,

  type Hedgehog struct {
      Furriness string `msg:",omitempty"`
  }

if Furriness is the empty string, the field will not be serialized, thus saving the space of the field name on the wire.
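To make the space saving concrete, here is a minimal sketch that hand-writes the msgpack bytes for the Hedgehog struct, once with the empty field kept and once with it omitted. It uses only the standard library and is not the code that `zebrapack -msgp` generates; the helpers `appendFixStr` and `encode` are hypothetical names introduced for this illustration, and only the msgpack fixmap and fixstr formats are assumed.

```go
package main

import "fmt"

// Hedgehog mirrors the struct on the slide. The tag is shown for context
// only; this sketch does not read it.
type Hedgehog struct {
	Furriness string `msg:",omitempty"`
}

// appendFixStr appends s as a msgpack fixstr (valid for len(s) <= 31).
// Hypothetical helper, not part of any library.
func appendFixStr(b []byte, s string) []byte {
	b = append(b, 0xa0|byte(len(s)))
	return append(b, s...)
}

// encode hand-writes the msgpack map for h, keyed by field name.
// With omitEmpty set, an empty Furriness is left out of the map entirely.
func encode(h Hedgehog, omitEmpty bool) []byte {
	if omitEmpty && h.Furriness == "" {
		return []byte{0x80} // fixmap with zero entries
	}
	b := []byte{0x81} // fixmap with one entry
	b = appendFixStr(b, "Furriness")
	return appendFixStr(b, h.Furriness)
}

func main() {
	fmt.Println("empty, field kept:   ", len(encode(Hedgehog{}, false)), "bytes")
	fmt.Println("empty, field omitted:", len(encode(Hedgehog{}, true)), "bytes")
	fmt.Println("non-empty:           ", len(encode(Hedgehog{Furriness: "very"}, true)), "bytes")
}
```

In this sketch most of the bytes for an empty value are the field name itself, which is the cost the `omitempty` tag avoids.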



* It is safe to re-use structs even with `omitempty`


* `addzid` utility

The `addzid` utility (in the cmd/addzid subdir) can help you
get started. Running `addzid mysource.go` on a .go source file
will add the `zid:"0"`... fields automatically. This makes adding ZebraPack
serialization to existing Go projects easy.

See https://github.com/glycerine/zebrapack/blob/master/cmd/addzid/README.md
for more detail.

* What's next. New ideas.

- microschema

- handle cycles in an object graph, by detecting (large) repeated references and encoding pointers as object IDs.

- your idea here.

- (One idea from meetup: optional bitmap to designate set/unset field, as in flatbuffers).
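
* Sketch: integer-keyed struct maps

The main idea from the earlier slides (struct maps keyed by small integers, with a separate schema that a reader keeps as a hashmap) can be sketched with the standard library alone. This is an illustration under stated assumptions, not ZebraPack's actual encoder or schema format: the `entry` type, the `names` table, and every function below are hypothetical, all values are short strings, and only the msgpack fixmap, fixstr, and positive fixint formats are assumed.

```go
package main

import "fmt"

// entry is one struct field on the wire: a small integer id and its value.
// Hypothetical type for this sketch.
type entry struct {
	zid   byte // must be 0..127 so it fits a msgpack positive fixint
	value string
}

// appendFixStr appends s as a msgpack fixstr (valid for len(s) <= 31).
func appendFixStr(b []byte, s string) []byte {
	b = append(b, 0xa0|byte(len(s)))
	return append(b, s...)
}

// encodeByZid writes a fixmap whose keys are small integers.
func encodeByZid(fields []entry) []byte {
	b := []byte{0x80 | byte(len(fields))} // fixmap header, up to 15 entries
	for _, f := range fields {
		b = append(b, f.zid) // positive fixint: the byte is the value
		b = appendFixStr(b, f.value)
	}
	return b
}

// encodeByName writes the same map keyed by field-name strings,
// as a conventional msgpack struct encoding would.
func encodeByName(fields []entry, names map[byte]string) []byte {
	b := []byte{0x80 | byte(len(fields))}
	for _, f := range fields {
		b = appendFixStr(b, names[f.zid])
		b = appendFixStr(b, f.value)
	}
	return b
}

// decodeByZid reads an integer-keyed map and translates each id back to a
// field name using the schema table. It assumes well-formed input.
func decodeByZid(b []byte, names map[byte]string) map[string]string {
	n := int(b[0] & 0x0f)
	out := make(map[string]string, n)
	i := 1
	for k := 0; k < n; k++ {
		zid := b[i]
		i++
		l := int(b[i] & 0x1f) // fixstr length
		i++
		out[names[zid]] = string(b[i : i+l])
		i += l
	}
	return out
}

func main() {
	// The schema: integer id -> field name (types omitted in this sketch).
	names := map[byte]string{0: "Name", 1: "Species"}
	fields := []entry{{0, "Sonic"}, {1, "hedgehog"}}

	byZid := encodeByZid(fields)
	byName := encodeByName(fields, names)
	fmt.Println("integer keys:", len(byZid), "bytes")
	fmt.Println("string keys: ", len(byName), "bytes")
	fmt.Println("decoded:     ", decodeByZid(byZid, names))
}
```

A reader in a dynamic language would need only the equivalent of `decodeByZid` plus the schema table, which is the adaptation step the opening slide estimates at about a day's work.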