# RecordIO

A recordio file stores a sequence of _items_, with optional compression and/or
encryption.  Recordio also allows an application to generate indices.

API documentation is available at
https://godoc.org/github.com/grailbio/base/recordio

## RecordIO file structure

The following picture shows the structure of a recordio file.

![recordio format](recordio.png)

A recordio file logically stores a list of *items*. Items are grouped into
*blocks*. Each block may be compressed or encrypted, then split into a sequence
of *chunks* and stored in the file.

There are three types of blocks: *header*, *body*, and *trailer*.
These block types have a common structure:

    block :=
      number of items (varint)
      item 0 size (varint)
      …
      item K-1 size (varint)
      item 0 body (bytes)
      …
      item K-1 body (bytes)

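The block layout above can be sketched in Go as follows. This is a minimal illustration rather than the library's implementation: `encodeBlock`, `decodeBlock`, and `appendUvarint` are hypothetical helper names, and the sketch assumes that the item count and item sizes are unsigned varints as produced by Go's `encoding/binary` package.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendUvarint appends v to buf as an unsigned varint.
func appendUvarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// encodeBlock writes the item count, then every item size, then every item
// body. Assumption: counts and sizes are unsigned varints.
func encodeBlock(items [][]byte) []byte {
	buf := appendUvarint(nil, uint64(len(items)))
	for _, item := range items {
		buf = appendUvarint(buf, uint64(len(item)))
	}
	for _, item := range items {
		buf = append(buf, item...)
	}
	return buf
}

// decodeBlock reverses encodeBlock. The returned items alias buf.
func decodeBlock(buf []byte) ([][]byte, error) {
	n, k := binary.Uvarint(buf)
	if k <= 0 || n > uint64(len(buf)) {
		return nil, errors.New("bad item count")
	}
	buf = buf[k:]
	sizes := make([]uint64, 0, n)
	for i := uint64(0); i < n; i++ {
		size, k := binary.Uvarint(buf)
		if k <= 0 {
			return nil, errors.New("bad item size")
		}
		sizes = append(sizes, size)
		buf = buf[k:]
	}
	items := make([][]byte, 0, n)
	for _, size := range sizes {
		if size > uint64(len(buf)) {
			return nil, errors.New("truncated item body")
		}
		items = append(items, buf[:size])
		buf = buf[size:]
	}
	return items, nil
}

func main() {
	block := encodeBlock([][]byte{[]byte("hello"), {}, []byte("recordio")})
	items, err := decodeBlock(block)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes, %d items, last=%q\n", len(block), len(items), items[2])
}
```

Storing all sizes before all bodies lets a reader locate any item in the block after parsing only the size table.
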
### Header block

The header block is the first block in the file. It contains one item, which
stores a flat key-value mapping of the following form:

    header item := List of (metakey, metavalue)
    metakey := value
    metavalue := value
    value := valuetype valuebytes
    valuetype := one byte, where
        1 if the valuebytes is a utf-8 string
        2 if the valuebytes is a signed varint
        3 if the valuebytes is an unsigned varint
        4 if the valuebytes is an IEEE float64 LE
    valuebytes :=
        For utf-8, length as uvarint, followed by contents.
        For other data types, just encode the data raw.

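As an illustration of the `value` grammar above, the following sketch encodes a single Go value. `encodeValue` and the `type*` constants are hypothetical names, not part of the recordio API. The sketch also assumes that "signed varint" and "unsigned varint" mean the encodings produced by `binary.PutVarint` and `binary.PutUvarint`.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Value type tags, as listed in the grammar.
const (
	typeString  = 1
	typeInt     = 2
	typeUint    = 3
	typeFloat64 = 4
)

// encodeValue returns the one-byte type tag followed by the value bytes.
func encodeValue(v interface{}) ([]byte, error) {
	var tmp [binary.MaxVarintLen64]byte
	switch x := v.(type) {
	case string:
		// utf-8 strings carry a uvarint length prefix.
		n := binary.PutUvarint(tmp[:], uint64(len(x)))
		out := append([]byte{typeString}, tmp[:n]...)
		return append(out, x...), nil
	case int64:
		// Assumption: Go's (zigzag) signed varint encoding.
		n := binary.PutVarint(tmp[:], x)
		return append([]byte{typeInt}, tmp[:n]...), nil
	case uint64:
		n := binary.PutUvarint(tmp[:], x)
		return append([]byte{typeUint}, tmp[:n]...), nil
	case float64:
		out := make([]byte, 9)
		out[0] = typeFloat64
		binary.LittleEndian.PutUint64(out[1:], math.Float64bits(x))
		return out, nil
	}
	return nil, fmt.Errorf("unsupported value type %T", v)
}

func main() {
	// One (metakey, metavalue) pair is simply two encoded values in a row.
	for _, v := range []interface{}{"transformer", "zstd"} {
		b, err := encodeValue(v)
		if err != nil {
			panic(err)
		}
		fmt.Printf("% x\n", b)
	}
}
```
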
Note: we could have defined the header as a protomessage, but we wanted to
avoid depending on the proto library, which would complicate cross-language
integration.

The user can add arbitrary (metakey, metavalue) pairs to the header, but a few
metakey values are reserved.

Key          | Value
------------ | -------------
trailer      | Bool. Whether the file contains a trailer block
transformer  | "flate", "zstd", etc.

TODO: Reserve keys for encryption.

### Body block

A body block contains the actual user data.

### Trailer block

The trailer block is optional. It contains a single arbitrary item.  Typically,
it stores an index in an application-specific format so that the application
can seek to an arbitrary item if needed.

The recordio library provides a way to read the trailer block in constant time.

## Structure of a block

At rest, a block is optionally compressed and encrypted. The resulting data is
then split into multiple _chunks_. The size of a chunk is fixed at 32KiB.  The
chunk structure allows an application to detect a corrupt chunk and skip to the
next chunk or block.

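The fixed chunk size makes the on-disk footprint of a block easy to compute. The sketch below works through that arithmetic using the 28-byte chunk header that this README specifies. `chunksFor` is a hypothetical helper, and treating an empty input as zero chunks is an assumption of this sketch.

```go
package main

import "fmt"

const (
	chunkSize       = 32 << 10                    // every chunk occupies 32KiB at rest
	chunkHeaderSize = 28                          // magic, CRC32, flag, size, totalChunks, index
	maxPayload      = chunkSize - chunkHeaderSize // 32740 bytes of block data per chunk
)

// chunksFor returns the number of chunks needed for a block of blockLen bytes
// (measured after compression and encryption) and the payload size of the
// final chunk.
func chunksFor(blockLen int) (total, lastPayload int) {
	if blockLen <= 0 {
		return 0, 0 // assumption: an empty input produces no chunks
	}
	total = (blockLen + maxPayload - 1) / maxPayload // round up
	lastPayload = blockLen - (total-1)*maxPayload
	return total, lastPayload
}

func main() {
	for _, n := range []int{1, maxPayload, maxPayload + 1, 100000} {
		total, last := chunksFor(n)
		fmt.Printf("block=%d bytes: %d chunk(s), final payload=%d, on disk=%d bytes\n",
			n, total, last, total*chunkSize)
	}
}
```
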
Each chunk starts with a 28-byte header, followed by the payload.

    chunk :=
        magic (8 bytes)
        CRC32 (4 bytes LE)
        flag (4 bytes LE)
        chunk payload size (4 bytes LE)
        totalChunks (4 bytes LE)
        chunk index (4 bytes LE)
        payload (bytes)

- The 8-byte magic header tells whether the chunk is part of a header, body, or
  trailer block.

  The current recordio format defines three magic numbers: MagicHeader,
  MagicPacked, and MagicTrailer.

- The chunk payload size is (32768 - 28), unless it is for the final chunk of a
  block. For the final chunk, the "chunk payload size" stores the size of the
  remaining block contents, and the chunk is filled with garbage to make it
  32KiB at rest.

- totalChunks is the number of chunks in the block. All the chunks in the same
  block store the same totalChunks value.

- Chunk index is 0 for the first chunk of the block, 1 for the second chunk of
  the block, and so on. The index resets to zero at the start of the next block.

- Flag is a 32-bit bitmap. It is not used currently.

- CRC is the IEEE CRC32 checksum of the rest of the chunk (flag, payload size,
  totalChunks, index, plus the payload).

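The chunk layout and the field descriptions above can be exercised with a short sketch. `buildChunk`, `parseChunk`, and the magic value used in `main` are hypothetical and are not the library's identifiers or constants. The sketch assumes that the CRC covers the header fields after the CRC plus exactly `size` payload bytes (not the padding), and it pads with zeros where the format allows arbitrary bytes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

const (
	chunkSize  = 32 << 10 // size of every chunk at rest
	headerSize = 28       // magic(8) crc(4) flag(4) size(4) total(4) index(4)
)

type chunkHeader struct {
	Magic                    [8]byte
	Flag, Size, Total, Index uint32
}

// checksum covers bytes 12..28 of the header plus size bytes of payload.
// Assumption: padding after the payload is not checksummed.
func checksum(chunk []byte, size uint32) uint32 {
	return crc32.ChecksumIEEE(chunk[12 : headerSize+int(size)])
}

// buildChunk lays out one 32KiB chunk holding payload.
func buildChunk(magic [8]byte, total, index uint32, payload []byte) ([]byte, error) {
	if len(payload) > chunkSize-headerSize {
		return nil, errors.New("payload does not fit in one chunk")
	}
	chunk := make([]byte, chunkSize) // zero padding stands in for "garbage"
	le := binary.LittleEndian
	copy(chunk[0:8], magic[:])
	le.PutUint32(chunk[12:16], 0) // flag: currently unused
	le.PutUint32(chunk[16:20], uint32(len(payload)))
	le.PutUint32(chunk[20:24], total)
	le.PutUint32(chunk[24:28], index)
	copy(chunk[headerSize:], payload)
	le.PutUint32(chunk[8:12], checksum(chunk, uint32(len(payload))))
	return chunk, nil
}

// parseChunk validates the size and CRC, then returns the header and payload.
func parseChunk(chunk []byte) (chunkHeader, []byte, error) {
	var h chunkHeader
	if len(chunk) != chunkSize {
		return h, nil, errors.New("chunk must be exactly 32KiB")
	}
	le := binary.LittleEndian
	copy(h.Magic[:], chunk[0:8])
	h.Flag = le.Uint32(chunk[12:16])
	h.Size = le.Uint32(chunk[16:20])
	h.Total = le.Uint32(chunk[20:24])
	h.Index = le.Uint32(chunk[24:28])
	if h.Size > chunkSize-headerSize {
		return h, nil, errors.New("payload size out of range")
	}
	if le.Uint32(chunk[8:12]) != checksum(chunk, h.Size) {
		return h, nil, errors.New("CRC mismatch")
	}
	return h, chunk[headerSize : headerSize+int(h.Size)], nil
}

func main() {
	// Placeholder magic, not one of the real recordio magic numbers.
	magic := [8]byte{'E', 'X', 'A', 'M', 'P', 'L', 'E', '!'}
	chunk, err := buildChunk(magic, 1, 0, []byte("hello"))
	if err != nil {
		panic(err)
	}
	h, payload, err := parseChunk(chunk)
	fmt.Println(h.Size, h.Total, h.Index, string(payload), err)

	chunk[headerSize] ^= 0xff // corrupt the first payload byte
	_, _, err = parseChunk(chunk)
	fmt.Println(err)
}
```

A reader that hits a CRC mismatch can step forward 32KiB to the next chunk boundary, which is the recovery behavior the fixed chunk size is meant to enable.
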
# Compression and encryption

A block can be optionally compressed and/or encrypted using _transformers_.  The
following example demonstrates the use of flate compression.

https://github.com/grailbio/base/tree/master/recordio/example_basic_test.go

The recordio library provides a few standard transformers:

- flate (https://github.com/grailbio/base/tree/master/recordio/recordioflate)
- zstd (https://github.com/grailbio/base/tree/master/recordio/recordiozstd)

To register zstd, for example, call

    recordiozstd.Init()

somewhere before writing or reading the recordio file. Then, when writing, set
the transformer "zstd" in `WriterOpts.Transformers`. The transformer name is
recorded in the recordio header block. The recordio reader reads the header,
discovers the transformer name, and automatically creates a matching reverse
transformer function.

You can also register your own transformers.  To do that, add transformer
factories when the application starts, using `RegisterTransformer`. See the
recordioflate and recordiozstd source code for examples.

# Indexing

An application can arrange for a callback function to be run when items are
written to storage. Such a callback can be used to build an index in a format of
the application's choice.  The following example demonstrates indexing.

  https://github.com/grailbio/base/tree/master/recordio/example_indexing_test.go

The index is typically written in the trailer block of the recordio file. The
recordio scanner provides a feature to read the trailer block.

# Legacy file format

The recordio package supports a _legacy_ file format that was in use before
2018-03. recordio.Scanner supports both the current and the legacy file formats
transparently. A legacy file can still be produced using the
`deprecated/LegacyWriter` type, but we discourage its use; support for it may be
completely removed in the future.

The legacy file format has the following structure:

    <header 0><record 0>
    <header 1><record 1>
    ...

Each header is:

    8 bytes: magic number
    8 bytes: 64 bit length of payload, little endian
    4 bytes: IEEE CRC32 of the length, little endian
    <record>: length bytes

The magic number is included to allow for the possibility of scanning to
the next record in the case of a corrupted file.

For the packed format, each record (i.e. payload above) is as follows:

    uint32 little endian: IEEE CRC32 of all the varints that follow.
    uint32 varint: number of items in the record (n)
    uint32 varint: size of <item 0>
    uint32 varint: size of <item 1>
    ...
    uint32 varint: size of <item n-1>

    <item 0>
    <item 1>
    ...
    <item n-1>

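The packed record layout can be sketched as below. `encodePacked` and `decodePacked` are hypothetical helpers. The sketch assumes that the varints are unsigned varints from `encoding/binary` and that the CRC covers only the encoded count and sizes, not the item bodies.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// encodePacked lays out: CRC32 of the varints, item count, item sizes, items.
func encodePacked(items [][]byte) []byte {
	var tmp [binary.MaxVarintLen64]byte
	var varints []byte
	put := func(v uint64) {
		n := binary.PutUvarint(tmp[:], v)
		varints = append(varints, tmp[:n]...)
	}
	put(uint64(len(items)))
	for _, item := range items {
		put(uint64(len(item)))
	}
	rec := make([]byte, 4, 4+len(varints))
	binary.LittleEndian.PutUint32(rec, crc32.ChecksumIEEE(varints))
	rec = append(rec, varints...)
	for _, item := range items {
		rec = append(rec, item...)
	}
	return rec
}

// decodePacked verifies the varint checksum and returns the items, which
// alias rec.
func decodePacked(rec []byte) ([][]byte, error) {
	if len(rec) < 4 {
		return nil, errors.New("record too short")
	}
	want := binary.LittleEndian.Uint32(rec)
	pos := 4
	next := func() (uint64, bool) {
		v, k := binary.Uvarint(rec[pos:])
		if k <= 0 {
			return 0, false
		}
		pos += k
		return v, true
	}
	n, ok := next()
	if !ok || n > uint64(len(rec)) {
		return nil, errors.New("bad item count")
	}
	sizes := make([]uint64, 0, n)
	for i := uint64(0); i < n; i++ {
		size, ok := next()
		if !ok {
			return nil, errors.New("bad item size")
		}
		sizes = append(sizes, size)
	}
	if crc32.ChecksumIEEE(rec[4:pos]) != want {
		return nil, errors.New("varint checksum mismatch")
	}
	items := make([][]byte, 0, n)
	for _, size := range sizes {
		if size > uint64(len(rec)-pos) {
			return nil, errors.New("truncated item")
		}
		items = append(items, rec[pos:pos+int(size)])
		pos += int(size)
	}
	return items, nil
}

func main() {
	rec := encodePacked([][]byte{[]byte("ab"), []byte("cde")})
	items, err := decodePacked(rec)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes, items=%q\n", len(rec), items)
}
```
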
For the simple recordio format (not packed), indexing is supported via
the Index callback, which is called whenever a new record is written:

    Index func(offset, length uint64, v interface{}, p []byte) error

    offset: the absolute offset in the stream that the record is
            written at, including its header
    length: the size of the record being written, including the header.
    v:      the object marshaled if Marshal was used to write an object,
            nil otherwise
    p:      the byte slice being written

The intended use is to instantiate a new Scanner at the specified offset
in the underlying file/stream.

For the packed format, indexing is more involved due to the need to
identify the start of each item as well as the record. To this end,
the Index callback is called in two ways, and a second Flush callback
is also provided.

At the start of a record:

    offset: the absolute offset, including the recordio header
    length: the size of the entire record being written (the sum of the
            sizes of the items and associated metadata), including
            the recordio header.
    v:      nil
    p:      nil

For each item written to a single record:

    offset: the offset from the start of the data portion of the record
            that contains this item
    length: the size of the item
    v:      the object marshaled if Marshal was used to write an object,
            nil otherwise
    p:      the byte slice being written