github.com/varialus/godfly@v0.0.0-20130904042352-1934f9f095ab/doc/articles/gobs_of_data.html

<!--{
"Title": "Gobs of data",
"Template": true
}-->

<p>
To transmit a data structure across a network or to store it in a file, it must
be encoded and then decoded again. There are many encodings available, of
course: <a href="http://www.json.org/">JSON</a>,
<a href="http://www.w3.org/XML/">XML</a>, Google's
<a href="http://code.google.com/p/protobuf">protocol buffers</a>, and more.
And now there's another, provided by Go's <a href="/pkg/encoding/gob/">gob</a>
package.
</p>

<p>
Why define a new encoding? It's a lot of work and redundant at that. Why not
just use one of the existing formats? Well, for one thing, we do! Go has
<a href="/pkg/">packages</a> supporting all the encodings just mentioned (the
<a href="http://code.google.com/p/goprotobuf">protocol buffer package</a> is in
a separate repository but it's one of the most frequently downloaded). And for
many purposes, including communicating with tools and systems written in other
languages, they're the right choice.
</p>

<p>
But for a Go-specific environment, such as communicating between two servers
written in Go, there's an opportunity to build something much easier to use and
possibly more efficient.
</p>

<p>
Gobs work with the language in a way that an externally-defined,
language-independent encoding cannot. At the same time, there are lessons to be
learned from the existing systems.
</p>

<p>
<b>Goals</b>
</p>

<p>
The gob package was designed with a number of goals in mind.
</p>

<p>
First, and most obvious, it had to be very easy to use. Because Go has
reflection, there is no need for a separate interface definition language or
"protocol compiler". The data structure itself is all the package should need
to figure out how to encode and decode it. On the other hand, this approach
means that gobs will never work as well with other languages, but that's OK:
gobs are unashamedly Go-centric.
</p>

<p>
Efficiency is also important. Textual representations, exemplified by XML and
JSON, are too slow to put at the center of an efficient communications network.
A binary encoding is necessary.
</p>

<p>
Gob streams must be self-describing. Each gob stream, read from the beginning,
contains sufficient information that the entire stream can be parsed by an
agent that knows nothing a priori about its contents. This property means that
you will always be able to decode a gob stream stored in a file, even long
after you've forgotten what data it represents.
</p>

<p>
There were also some things to learn from our experiences with Google protocol
buffers.
</p>

<p>
<b>Protocol buffer misfeatures</b>
</p>

<p>
Protocol buffers had a major effect on the design of gobs, but have three
features that were deliberately avoided. (Leaving aside the property that
protocol buffers aren't self-describing: if you don't know the data definition
used to encode a protocol buffer, you might not be able to parse it.)
</p>

<p>
First, protocol buffers only work on the data type we call a struct in Go. You
can't encode an integer or array at the top level, only a struct with fields
inside it. That seems a pointless restriction, at least in Go. If all you want
to send is an array of integers, why should you have to put it into a
struct first?
</p>

<p>
Next, a protocol buffer definition may specify that fields <code>T.x</code> and
<code>T.y</code> are required to be present whenever a value of type
<code>T</code> is encoded or decoded. Although such required fields may seem
like a good idea, they are costly to implement because the codec must maintain a
separate data structure while encoding and decoding, to be able to report when
required fields are missing. They're also a maintenance problem. Over time, one
may want to modify the data definition to remove a required field, but that may
cause existing clients of the data to crash. It's better not to have them in the
encoding at all. (Protocol buffers also have optional fields. But if we don't
have required fields, all fields are optional and that's that. There will be
more to say about optional fields a little later.)
</p>

<p>
The third protocol buffer misfeature is default values. If a protocol buffer
omits the value for a "defaulted" field, then the decoded structure behaves as
if the field were set to that value. This idea works nicely when you have
getter and setter methods to control access to the field, but is harder to
handle cleanly when the container is just a plain idiomatic struct. Default
values are also tricky to implement: where does one define them, and what types
do they have? (Is text UTF-8? Uninterpreted bytes? How many bits in a float?)
Despite the apparent simplicity, there were a number of complications in their
design and implementation for protocol buffers. We decided to leave them out of
gobs and fall back to Go's trivial but effective defaulting rule: unless you
set something otherwise, it has the "zero value" for that type - and it doesn't
need to be transmitted.
</p>

<p>
So gobs end up looking like a sort of generalized, simplified protocol buffer.
How do they work?
</p>

<p>
<b>Values</b>
</p>

<p>
The encoded gob data isn't about <code>int8</code>s and <code>uint16</code>s.
Instead, somewhat analogous to constants in Go, its integer values are abstract,
sizeless numbers, either signed or unsigned. When you encode an
<code>int8</code>, its value is transmitted as an unsized, variable-length
integer. When you encode an <code>int64</code>, its value is also transmitted as
an unsized, variable-length integer. (Signed and unsigned are treated
distinctly, but the same unsized-ness applies to unsigned values too.) If both
have the value 7, the bits sent on the wire will be identical. When the receiver
decodes that value, it puts it into the receiver's variable, which may be of
arbitrary integer type. Thus an encoder may send a 7 that came from an
<code>int8</code>, but the receiver may store it in an <code>int64</code>. This
is fine: the value is an integer and as long as it fits, everything works. (If
it doesn't fit, an error results.) This decoupling from the size of the variable
gives some flexibility to the encoding: we can expand the type of the integer
variable as the software evolves, but still be able to decode old data.
</p>
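
<p>
The sizeless-integer behavior can be checked with a small program. This is a
minimal sketch, not code from the gob package: the helper names
<code>encodeBytes</code>, <code>sameWireBytes</code> and
<code>decodeInto8</code> are invented for illustration, and a
<code>bytes.Buffer</code> stands in for the network.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// encodeBytes gob-encodes v with a fresh encoder and returns the raw bytes.
// (Hypothetical helper, not part of encoding/gob.)
func encodeBytes(v interface{}) []byte {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(v); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// sameWireBytes reports whether int8(7) and int64(7) encode identically.
func sameWireBytes() bool {
	return bytes.Equal(encodeBytes(int8(7)), encodeBytes(int64(7)))
}

// decodeInto8 sends v as an int64 and tries to receive it in an int8.
func decodeInto8(v int64) (int8, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(v); err != nil {
		return 0, err
	}
	var small int8
	err := gob.NewDecoder(&buf).Decode(&small)
	return small, err
}

func main() {
	fmt.Println("identical bytes:", sameWireBytes())

	n, err := decodeInto8(7) // 7 fits in an int8
	fmt.Println(n, err)

	_, err = decodeInto8(300) // 300 does not fit, so Decode reports an error
	fmt.Println("overflow reported:", err != nil)
}
```

<p>
The same pattern should apply in the other direction, for example sending an
<code>int8</code> and receiving it in an <code>int64</code>, which can never
overflow.
</p>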

<p>
This flexibility also applies to pointers. Before transmission, all pointers are
flattened. Values of type <code>int8</code>, <code>*int8</code>,
<code>**int8</code>, <code>****int8</code>, etc. are all transmitted as an
integer value, which may then be stored in <code>int</code> of any size, or
<code>*int</code>, or <code>******int</code>, etc. Again, this allows for
flexibility.
</p>
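
<p>
As a rough illustration of pointer flattening, the sketch below sends a
<code>**int8</code> and receives it in a <code>*int64</code>. The function name
<code>roundTripPointers</code> is an assumption made for this example; only
<code>gob.NewEncoder</code>, <code>gob.NewDecoder</code>, <code>Encode</code>
and <code>Decode</code> come from the package.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// roundTripPointers encodes a **int8 and decodes the result into a *int64.
// Only the integer travels; the levels of indirection on each side are
// independent of one another.
func roundTripPointers(v int8) (int64, error) {
	p := &v
	pp := &p // **int8

	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(pp); err != nil {
		return 0, err
	}

	var out *int64 // starts nil; the decoder allocates the int64 it points to
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		return 0, err
	}
	return *out, nil
}

func main() {
	v, err := roundTripPointers(7)
	if err != nil {
		panic(err)
	}
	fmt.Println(v)
}
```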

<p>
Flexibility also happens because, when decoding a struct, only those fields
that are sent by the encoder are stored in the destination. Given the value
</p>

{{code "/doc/progs/gobs1.go" `/type T/` `/STOP/`}}

<p>
the encoding of <code>t</code> sends only the 7 and 8. Because it's zero, the
value of <code>Y</code> isn't even sent; there's no need to send a zero value.
</p>

<p>
The receiver could instead decode the value into this structure:
</p>

{{code "/doc/progs/gobs1.go" `/type U/` `/STOP/`}}

<p>
and acquire a value of <code>u</code> with only <code>X</code> set (to the
address of an <code>int8</code> variable set to 7); the <code>Z</code> field is
ignored - where would you put it? When decoding structs, fields are matched by
name and compatible type, and only fields that exist in both are affected. This
simple approach finesses the "optional field" problem: as the type
<code>T</code> evolves by adding fields, out of date receivers will still
function with the part of the type they recognize. Thus gobs provide the
important result of optional fields - extensibility - without any additional
mechanism or notation.
</p>
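
<p>
The field matching described above can be tried end to end. The definitions of
<code>T</code> and <code>U</code> below are assumptions reconstructed from the
prose (three <code>int</code> fields on the sending side, two
<code>*int8</code> fields on the receiving side); they are not copied from the
included program file, and <code>decodeTIntoU</code> is a hypothetical helper.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// T is the sender's type (assumed shape, inferred from the text).
type T struct{ X, Y, Z int }

// U is the receiver's type: no Z field, and pointers instead of values.
type U struct{ X, Y *int8 }

// decodeTIntoU encodes t and decodes the bytes into a U.
func decodeTIntoU(t T) (U, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(t); err != nil {
		return U{}, err
	}
	var u U
	err := gob.NewDecoder(&buf).Decode(&u)
	return u, err
}

func main() {
	u, err := decodeTIntoU(T{X: 7, Y: 0, Z: 8})
	if err != nil {
		panic(err)
	}
	// X was sent and matched by name, so u.X points at 7.
	// Y was zero and therefore never sent, so u.Y stays nil.
	// Z was sent but U has no such field, so it is dropped.
	fmt.Println(*u.X, u.Y == nil)
}
```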

<p>
From integers we can build all the other types: bytes, strings, arrays, slices,
maps, even floats. Floating-point values are represented by their IEEE 754
floating-point bit pattern, stored as an integer, which works fine as long as
you know their type, which we always do. By the way, that integer is sent in
byte-reversed order because common values of floating-point numbers, such as
small integers, have a lot of zeros at the low end that we can avoid
transmitting.
</p>
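
<p>
To see why byte reversal helps, it is enough to look at the bit pattern of a
small whole number. The sketch below mirrors the description in the text using
<code>math.Float64bits</code> and <code>bits.ReverseBytes64</code>; it is not
the gob package's own code, and <code>floatAsWireInt</code> is a name made up
for this example.
</p>

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// floatAsWireInt returns the unsigned integer that stands in for f:
// the IEEE 754 bit pattern of f with its eight bytes reversed.
func floatAsWireInt(f float64) uint64 {
	return bits.ReverseBytes64(math.Float64bits(f))
}

func main() {
	// The raw pattern of 17.0 has six zero bytes at the low end.
	fmt.Printf("%#x\n", math.Float64bits(17.0)) // 0x4031000000000000

	// Reversed, those zeros move to the high end, leaving a small
	// integer that a variable-length encoding can send in few bytes.
	fmt.Printf("%#x\n", floatAsWireInt(17.0)) // 0x3140
}
```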

<p>
One nice feature of gobs that Go makes possible is that they allow you to define
your own encoding by having your type satisfy the
<a href="/pkg/encoding/gob/#GobEncoder">GobEncoder</a> and
<a href="/pkg/encoding/gob/#GobDecoder">GobDecoder</a> interfaces, in a manner
analogous to the <a href="/pkg/encoding/json/">JSON</a> package's
<a href="/pkg/encoding/json/#Marshaler">Marshaler</a> and
<a href="/pkg/encoding/json/#Unmarshaler">Unmarshaler</a> and also to the
<a href="/pkg/fmt/#Stringer">Stringer</a> interface from
<a href="/pkg/fmt/">package fmt</a>. This facility makes it possible to
represent special features, enforce constraints, or hide secrets when you
transmit data. See the <a href="/pkg/encoding/gob/">documentation</a> for
details.
</p>
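
<p>
One way to "hide secrets" is sketched below. The <code>Account</code> type, its
fields and the <code>roundTrip</code> helper are hypothetical; the only parts
taken from the package are the method signatures
<code>GobEncode() ([]byte, error)</code> and
<code>GobDecode([]byte) error</code> that the two interfaces require.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Account is a hypothetical type whose password must never be transmitted.
type Account struct {
	Name     string
	password string
}

// GobEncode satisfies gob.GobEncoder and emits only the name.
func (a Account) GobEncode() ([]byte, error) {
	return []byte(a.Name), nil
}

// GobDecode satisfies gob.GobDecoder; the password is left empty.
func (a *Account) GobDecode(data []byte) error {
	a.Name = string(data)
	return nil
}

// roundTrip encodes a and decodes the bytes into a fresh Account.
func roundTrip(a Account) (Account, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(a); err != nil {
		return Account{}, err
	}
	var out Account
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	got, err := roundTrip(Account{Name: "gopher", password: "hunter2"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("name=%q password=%q\n", got.Name, got.password)
}
```

<p>
A real type would probably use a more structured byte format than a bare
string, but the division of labor is the same: the type decides what goes on
the wire, and the gob package carries the bytes.
</p>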

<p>
<b>Types on the wire</b>
</p>

<p>
The first time you send a given type, the gob package includes in the data
stream a description of that type. In fact, what happens is that the encoder is
used to encode, in the standard gob encoding format, an internal struct that
describes the type and gives it a unique number. (Basic types, plus the layout
of the type description structure, are predefined by the software for
bootstrapping.) After the type is described, it can be referenced by its type
number.
</p>

<p>
Thus when we send our first type <code>T</code>, the gob encoder sends a
description of <code>T</code> and tags it with a type number, say 127. All
values, including the first, are then prefixed by that number, so a stream of
<code>T</code> values looks like:
</p>

<pre>
("define type id" 127, definition of type T)(127, T value)(127, T value), ...
</pre>
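
<p>
The one-time cost of the type description is visible from the outside: on a
single encoder, the first <code>Encode</code> of a struct type writes more
bytes than the second. The sketch below measures this; the struct
<code>T</code> and the helper <code>messageSizes</code> are assumptions for
illustration, and the exact byte counts are deliberately not stated.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// T is an example struct type (assumed shape).
type T struct{ X, Y, Z int }

// messageSizes encodes the same value twice on one encoder and returns how
// many bytes each Encode call appended to the stream.
func messageSizes() (first, second int) {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	v := T{X: 7, Z: 8}

	if err := enc.Encode(v); err != nil {
		panic(err)
	}
	first = buf.Len() // type definition plus value

	if err := enc.Encode(v); err != nil {
		panic(err)
	}
	second = buf.Len() - first // value only, tagged with the type number
	return first, second
}

func main() {
	first, second := messageSizes()
	fmt.Println("first message:", first, "bytes")
	fmt.Println("second message:", second, "bytes")
}
```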

<p>
These type numbers make it possible to describe recursive types and send values
of those types. Thus gobs can encode types such as trees:
</p>

{{code "/doc/progs/gobs1.go" `/type Node/` `/STOP/`}}

<p>
(It's an exercise for the reader to discover how the zero-defaulting rule makes
this work, even though gobs don't represent pointers.)
</p>
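
<p>
A small tree can be sent and rebuilt as follows. The <code>Node</code>
definition here is an assumed shape (a value plus two child pointers), not a
copy of the included file, and <code>sum</code> and <code>roundTrip</code> are
hypothetical helpers. Note that this works for tree-shaped data; the sketch
does not attempt values that contain cycles.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Node is a recursive type: each node may point to further nodes.
type Node struct {
	Value       int
	Left, Right *Node
}

// sum adds up every Value in the tree rooted at n.
func sum(n *Node) int {
	if n == nil {
		return 0
	}
	return n.Value + sum(n.Left) + sum(n.Right)
}

// roundTrip encodes the tree and decodes it into a new root node.
func roundTrip(root *Node) (*Node, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(root); err != nil {
		return nil, err
	}
	var out Node
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

func main() {
	tree := &Node{
		Value: 1,
		Left:  &Node{Value: 2},
		Right: &Node{Value: 3, Left: &Node{Value: 4}},
	}
	got, err := roundTrip(tree)
	if err != nil {
		panic(err)
	}
	// Children that were nil on the sending side are nil after decoding.
	fmt.Println(sum(got), got.Left.Left == nil)
}
```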

<p>
With the type information, a gob stream is fully self-describing except for the
set of bootstrap types, which is a well-defined starting point.
</p>

<p>
<b>Compiling a machine</b>
</p>

<p>
The first time you encode a value of a given type, the gob package builds a
little interpreted machine specific to that data type. It uses reflection on
the type to construct that machine, but once the machine is built it does not
depend on reflection. The machine uses package unsafe and some trickery to
convert the data into the encoded bytes at high speed. It could use reflection
and avoid unsafe, but would be significantly slower. (A similar high-speed
approach is taken by the protocol buffer support for Go, whose design was
influenced by the implementation of gobs.) Subsequent values of the same type
use the already-compiled machine, so they can be encoded right away.
</p>

<p>
Decoding is similar but harder. When you decode a value, the gob package holds
a byte slice representing a value of a given encoder-defined type to decode,
plus a Go value into which to decode it. The gob package builds a machine for
that pair: the gob type sent on the wire crossed with the Go type provided for
decoding. Once that decoding machine is built, though, it's again a
reflectionless engine that uses unsafe methods to get maximum speed.
</p>

<p>
<b>Use</b>
</p>

<p>
There's a lot going on under the hood, but the result is an efficient,
easy-to-use encoding system for transmitting data. Here's a complete example
showing differing encoded and decoded types. Note how easy it is to send and
receive values; all you need to do is present values and variables to the
<a href="/pkg/encoding/gob/">gob package</a> and it does all the work.
</p>

{{code "/doc/progs/gobs2.go" `/package main/` `$`}}

<p>
You can compile and run this example code in the
<a href="http://play.golang.org/p/_-OJV-rwMq">Go Playground</a>.
</p>

<p>
The <a href="/pkg/net/rpc/">rpc package</a> builds on gobs to turn this
encode/decode automation into transport for method calls across the network.
That's a subject for another article.
</p>

<p>
<b>Details</b>
</p>

<p>
The <a href="/pkg/encoding/gob/">gob package documentation</a>, especially the
file <a href="/src/pkg/encoding/gob/doc.go">doc.go</a>, expands on many of the
details described here and includes a full worked example showing how the
encoding represents data. If you are interested in the innards of the gob
implementation, that's a good place to start.
</p>