github.com/cockroachdb/pebble@v1.1.2/docs/RFCS/20220112_pebble_sstable_format_versions.md

     1  - Feature Name: Pebble SSTable Format Versions
     2  - Status: completed
     3  - Start Date: 2022-01-12
     4  - Authors: Nick Travers
     5  - RFC PR: https://github.com/cockroachdb/pebble/pull/1450
     6  - Pebble Issues:
     7    https://github.com/cockroachdb/pebble/issues/1409
     8    https://github.com/cockroachdb/pebble/issues/1339
     9  - Cockroach Issues:
    10  
    11  # Summary
    12  
    13  To safely support changes to the SSTable structure, a new versioning scheme
    14  under a Pebble magic number is proposed.
    15  
    16  This RFC also outlines the relationship between the SSTable format version and
    17  the existing Pebble format major version, in addition to how the two are to
    18  be used in Cockroach for safely enabling new table format versions.
    19  
    20  # Motivation
    21  
    22  Pebble currently uses a "format major version" scheme for the store (or DB)
    23  that indicates which Pebble features should be enabled when the store is first
    24  opened, before any SSTables are opened. The versions indicate points of
    25  backwards incompatibility for a store. For example, the introduction of the
    26  `SetWithDelete` key kind is gated behind a version, as is block property
    27  collection. This format major version scheme was introduced in
    28  [#1227](https://github.com/cockroachdb/pebble/issues/1227).
    29  
    30  While Pebble can use the format major version to infer how to load and
    31  interpret data in the LSM, the SSTables that make up the store itself have
    32  their own notion of a "version". This "SSTable version" (also referred to as a
    33  "table format") is written to the footer (or trailing section) of each SSTable
    34  file and determines how the file is to be interpreted by Pebble. As of the time
    35  of writing, Pebble supports two table formats - LevelDB's format, and RocksDB's
    36  v2 format. Pebble inherited the latter as the default table format as it was
    37  the version that RocksDB used at the time Pebble was being developed, and
    38  remained the default to allow for a simpler migration path from Cockroach
    39  clusters that were originally using RocksDB as the storage engine. The
    40  RocksDBv2 table format adds various features on top of the LevelDB format,
    41  including a two-level index, configurable checksum algorithms, and an explicit
    42  versioning scheme to allow for the introduction of changes, amongst other
    43  features.
    44  
    45  While the RocksDBv2 SSTable format has been sufficient for Pebble's needs since
    46  inception, new Pebble features and potential backports from RocksDB itself
    47  require that the SSTable format evolve over time and therefore that the table
    48  format be updated. As the majority of new features added over time will be
    49  specific to Pebble, it does not make sense to repurpose the RocksDB format
    50  versions that exist upstream for use with Pebble features (at the time of
    51  writing, RocksDB had added versions 3 and 4 on top of the version 2 in use by
    52  Pebble). A new Pebble-specific table format scheme is proposed.
    53  
    54  In the context of a distributed system such as Cockroach, certain SSTable
    55  features are backwards incompatible (e.g. the block property collection and
    56  filtering feature extends the RocksDBv2 SSTable block index format to encode
    57  various block properties, which is a breaking change). Participants must
    58  _first_ ensure that their stores have the code-level features available to read
    59  and write these newer SSTables (indicated by Pebble's format major version).
    60  Once all stores agree that they are running the minimum Pebble format major
    61  version and will not roll back (e.g. Cockroach cluster version finalization),
    62  SSTables can be written and read using more recent table formats. The Pebble
    63  "format major version" and "table format version" are therefore no longer
    64  independent - the former implies an upper bound on the latter.
    65  
    66  Additionally, certain SSTable generation operations are independent of a
    67  specific Pebble instance. For example, SSTable construction for the purposes of
    68  backup and restore generates SSTables that are stored external to a specific
    69  Pebble store (e.g. in cloud storage) and can be used at a later point in time to
    70  restore a store. SSTables constructed for such purposes must be carefully
    71  versioned to ensure compatibility with existing clusters that may run with a
    72  mixture of Pebble versions.
    73  
    74  As a real-world example of the need for the above, consider two Cockroach nodes
    75  each with a Pebble store, one at version A, the other at version B (version A
    76  (newer) > B (older)). Store A constructs an SSTable for an external backup
    77  containing a newer block index format (for block property collection). This
    78  SSTable is then imported into store B. Store B fails to read the SSTable as it
    79  is not running with a format major version recent enough to make sense of the
    80  newer index format. The two stores require a method for agreeing on a minimum
    81  supported table format.
    82  
    83  The remainder of this document outlines a new table format for Pebble. This new
    84  table format will be used for new table-level features such as block properties
    85  and range keys (see
    86  [#1339](https://github.com/cockroachdb/pebble/issues/1339)), but also for
    87  backporting table-level features from RocksDB that would be useful to Pebble
    88  (e.g. version 3 avoids encoding sequence numbers in the index, and version 4
    89  uses delta encoding for the block offsets in the index, both of which are
    90  useful for Pebble).
    91  
    92  # Technical design
    93  
    94  ## Pebble magic number
    95  
    96  The last 8 bytes of an SSTable are referred to as the "magic number".
    97  
    98  LevelDB uses the first 8 bytes of the SHA1 hash of the string
    99  `http://code.google.com/p/leveldb/` for the magic number.
   100  
   101  RocksDB uses its own magic number, which indicates the use of a slightly
   102  different table layout - the footer (the name for the end of an SSTable) is
   103  slightly larger to accommodate a 32-bit version number and 8 bits for a
   104  checksum type to be used for all blocks in the SSTable.
   105  
   106  A new 8-byte magic number will be introduced for Pebble:
   107  
   108  ```
   109  \xf0\x9f\xaa\xb3\xf0\x9f\xaa\xb3 // 🪳🪳
   110  ```
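
The escaped bytes above are the UTF-8 encoding of two U+1FAB3 code points. As a minimal sketch of how a reader might test for the trailing magic number, assuming the illustrative names `pebbleMagic` and `hasPebbleMagic` (neither is taken from the Pebble source):

```go
package main

import (
	"bytes"
	"fmt"
)

// pebbleMagic is the 8-byte magic number proposed above. The constant name
// is illustrative only.
const pebbleMagic = "\xf0\x9f\xaa\xb3\xf0\x9f\xaa\xb3"

// hasPebbleMagic reports whether the last 8 bytes of a table match the
// proposed Pebble magic number. Hypothetical helper for this sketch.
func hasPebbleMagic(table []byte) bool {
	return bytes.HasSuffix(table, []byte(pebbleMagic))
}

func main() {
	// The escaped form and the emoji literal are the same 8 bytes.
	fmt.Println(pebbleMagic == "🪳🪳", len(pebbleMagic))

	// A fake "table" whose trailing bytes carry the magic number.
	table := append([]byte("payload"), pebbleMagic...)
	fmt.Println(hasPebbleMagic(table))
}
```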
   111  
   112  ## Pebble version scheme
   113  
   114  Tables with a Pebble magic number will use a dedicated versioning scheme,
   115  starting with version `1`. No versions other than version `2` will be
   116  supported for tables containing the RocksDB magic number.
   117  
   118  The choice of switching to a Pebble versioning scheme starting at `1` simplifies
   119  the implementation. Essentially all existing Pebble stores are managed via
   120  Cockroach, and were either previously using RocksDB and migrated to Pebble, or
   121  were created with Pebble stores. In both situations the table format used is
   122  RocksDB v2.
   123  
   124  Given that Pebble has not needed (and likely will not need) to support other
   125  RocksDB table formats, it is reasonable to introduce a new magic number for
   126  Pebble and reset the version counter to v1.
   127  
   128  The initial versions will correspond to the following new Pebble
   129  features, which have yet to be introduced to Cockroach clusters as of the time
   130  of writing:
   131  
   132  - Version 1: block property collectors (block properties are encoded into the
   133    block index)
   134  - Version 2: range keys (a new block is present in the table for range keys).
   135  
   136  Subsequent alterations to the SSTable format should only increment the _Pebble
   137  version number_. It should be noted that backported RocksDB table format
   138  features (e.g. RocksDB versions 3 and 4) would use a different version number,
   139  within the Pebble version sequence. While possibly confusing, the RocksDB
   140  features are being "adopted" by Pebble, rather than directly ported, so a
   141  Pebble specific version number is appropriate.
   142  
   143  An alternative would be to allow RocksDB table format features to be backported
   144  into Pebble under their existing RocksDB magic number, _alongside_
   145  Pebble-specific features. The complexity required to determine the set of
   146  characteristics to read and write to each SSTable would increase with such a
   147  scheme, compared to the simpler "linear history" approach described above,
   148  where new features simply ratchet the Pebble table format version number.
   149  
   150  ## Footer format
   151  
   152  The footer format for SSTables with Pebble magic numbers _will remain the same_
   153  as the RocksDB footer format - specifically, the trailing 53 bytes of the
   154  SSTable consisting of the following fields with the given indices,
   155  little-endian encoded:
   156  
   157  - `0`: Checksum type
   158  - `1-20`: Meta-index block handle
   159  - `21-40`: Index block handle
   160  - `41-44`: Version number
   161  - `45-52`: Magic number
   162  
   163  ## Changes / additions to `sstable.TableFormat`
   164  
   165  The `sstable.TableFormat` enum is a `uint32` representation of the tuple
   166  `(magic number, format version)`. The current values are:
   167  
   168  ```go
   169  type TableFormat uint32
   170  
   171  const (
   172    TableFormatRocksDBv2 TableFormat = iota
   173    TableFormatLevelDB
   174  )
   175  ```
   176  
   177  It should be noted that this enum is _not_ persisted in the SSTable. It is
   178  purely an internal type that represents the tuple that simplifies a number of
   179  version checks when reading / writing an SSTable. The values are free to
   180  change, provided care is taken with default values and existing usage.
   181  
   182  The existing `sstable.TableFormat` will be altered to reflect the "linear"
   183  nature of the version history. New versions will be added with the next value
   184  in the sequence.
   185  
   186  ```go
   187  const (
   188    TableFormatUnspecified TableFormat = iota
   189    TableFormatLevelDB   // The original LevelDB table format.
   190    TableFormatRocksDBv2 // The current default table format.
   191    TableFormatPebblev1  // Block properties.
   192    TableFormatPebblev2  // Range keys.
   193    // ...
   194    TableFormatPebblevN
   195  )
   196  ```
   197  
   198  The introduction of `TableFormatUnspecified` can be used to ensure that where a
   199  `sstable.TableFormat` is _not_ specified, Pebble can select a suitable default
   200  for writing the table (most likely based on the format major version in use by
   201  the store; more in the next section).
   202  
   203  ## Interaction with the format major version
   204  
   205  The `FormatMajorVersion` type is used to determine the set of features the
   206  store supports.
   207  
   208  A Pebble store may be read-from / written-to by a Pebble binary that supports
   209  newer features, with more recent Pebble format major versions. These newer
   210  features could include the ability to read and write more recent SSTables.
   211  While the store _could_ read and write SSTables at the most recent version the
   212  binary supports, it is not safe to do so, for reasons outlined earlier.
   213  
   214  The format major version will have a "maximum table format version" associated
   215  with it that indicates the maximum `sstable.TableFormat` that can be safely
   216  handled by the store.
   217  
   218  When introducing a new _table format_ version, it should be gated behind an
   219  associated `FormatMajorVersion` that has the new table format as its "maximum
   220  table format version".
   221  
   222  For example:
   223  
   224  ```go
   225  // Existing versions.
   226  FormatDefault.MaxTableFormat()                       // sstable.TableFormatRocksDBv2
   227  ...
   228  FormatSetWithDelete.MaxTableFormat()                 // sstable.TableFormatRocksDBv2
   229  // Proposed versions with Pebble version scheme.
   230  FormatBlockPropertyCollector.MaxTableFormat()        // sstable.TableFormatPebblev1
   231  FormatRangeKeys.MaxTableFormat()                     // sstable.TableFormatPebblev2
   232  ```
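
One way the mapping above might be implemented, together with the `TableFormatUnspecified` default described in the previous section, is sketched below. The types are redeclared locally so the example is self-contained. The numeric values of the format major versions are placeholders, versions between `FormatDefault` and `FormatSetWithDelete` are omitted, and `resolveTableFormat` is a hypothetical helper rather than part of the proposal.

```go
package main

import "fmt"

type TableFormat uint32

const (
	TableFormatUnspecified TableFormat = iota
	TableFormatLevelDB
	TableFormatRocksDBv2
	TableFormatPebblev1
	TableFormatPebblev2
)

type FormatMajorVersion uint64

// Only the versions named in this RFC are listed; values are placeholders.
const (
	FormatDefault FormatMajorVersion = iota
	FormatSetWithDelete
	FormatBlockPropertyCollector
	FormatRangeKeys
)

// MaxTableFormat returns the newest table format a store at version v may
// safely read and write.
func (v FormatMajorVersion) MaxTableFormat() TableFormat {
	switch {
	case v >= FormatRangeKeys:
		return TableFormatPebblev2
	case v >= FormatBlockPropertyCollector:
		return TableFormatPebblev1
	default:
		return TableFormatRocksDBv2
	}
}

// resolveTableFormat (hypothetical) picks the format to write with. An
// unspecified request defaults to the store's maximum; a request newer than
// the maximum is rejected.
func resolveTableFormat(requested TableFormat, v FormatMajorVersion) (TableFormat, error) {
	limit := v.MaxTableFormat()
	if requested == TableFormatUnspecified {
		return limit, nil
	}
	if requested > limit {
		return TableFormatUnspecified, fmt.Errorf(
			"table format %d exceeds maximum %d for format major version %d",
			requested, limit, v)
	}
	return requested, nil
}

func main() {
	fmt.Println(resolveTableFormat(TableFormatUnspecified, FormatSetWithDelete))
	fmt.Println(resolveTableFormat(TableFormatPebblev2, FormatBlockPropertyCollector))
}
```

Because the enum is ordered linearly, the upper-bound check reduces to a single integer comparison.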
   233  
   234  ## Usage in Cockroach
   235  
   236  The introduction of new SSTable format versions needs to be carefully
   237  coordinated between stores to ensure there are no incompatibilities (i.e. newer
   238  store writes an SSTable that cannot be understood by other stores).
   239  
   240  It is only safe to use a new table format when all nodes in a cluster have been
   241  finalized. A newer Cockroach node, with newer Pebble code, should continue to
   242  write SSTables with a table format version equal to or less than the smallest
   243  table format version across all nodes in the cluster. Once the cluster version
   244  has been finalized, and `(*DB).RatchetFormatMajorVersion(FormatMajorVersion)`
   245  has been called, nodes are free to write SSTables at newer table format
   246  versions.
   247  
   248  At runtime, Pebble exposes a `(*DB).FormatMajorVersion()` method, which may be
   249  used to determine the current format major version of the store, and hence, the
   250  associated table format version.
   251  
   252  In addition to the above, there are situations where SSTables are created for
   253  consumption at a later point in time, independent of any Pebble store -
   254  specifically backup and restore. Currently, Cockroach uses two functions in
   255  `pkg/storage` to construct SSTables for both ingestion and backup
   256  ([here](https://github.com/cockroachdb/cockroach/blob/20eaf0b415f1df361246804e5d1d80c7a20a8eb6/pkg/storage/sst_writer.go#L57)
   257  and
   258  [here](https://github.com/cockroachdb/cockroach/blob/20eaf0b415f1df361246804e5d1d80c7a20a8eb6/pkg/storage/sst_writer.go#L78)).
   259  Both will need to be updated to take into account the cluster version to ensure
   260  that SSTables with newer versions are only written once the cluster version has
   261  been finalized.
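
As a rough sketch of what such an update could look like, the writer's table format can be derived from which cluster versions have been finalized. Everything here is an assumption for illustration: `clusterFeatures` stands in for Cockroach's cluster-version checks, and `externalSSTFormat` is not an existing function in either codebase.

```go
package main

import "fmt"

type TableFormat uint32

const (
	TableFormatUnspecified TableFormat = iota
	TableFormatLevelDB
	TableFormatRocksDBv2
	TableFormatPebblev1
	TableFormatPebblev2
)

// clusterFeatures is a hypothetical stand-in for cluster-version checks.
// Each field is true only once the corresponding cluster version has been
// finalized on every node.
type clusterFeatures struct {
	blockPropertiesFinalized bool
	rangeKeysFinalized       bool
}

// externalSSTFormat returns the table format to use for an SSTable that
// leaves the local store (backup or ingestion). It never picks a format
// newer than what every node is guaranteed to understand.
func externalSSTFormat(f clusterFeatures) TableFormat {
	switch {
	case f.rangeKeysFinalized:
		return TableFormatPebblev2
	case f.blockPropertiesFinalized:
		return TableFormatPebblev1
	default:
		return TableFormatRocksDBv2
	}
}

func main() {
	// Before finalization the writer stays on the RocksDBv2 format.
	fmt.Println(externalSSTFormat(clusterFeatures{}) == TableFormatRocksDBv2)
	// After the block-property version is finalized, Pebblev1 is allowed.
	fmt.Println(externalSSTFormat(clusterFeatures{blockPropertiesFinalized: true}) == TableFormatPebblev1)
}
```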
   262  
   263  ### Cluster version migration sequencing
   264  
   265  Cockroach uses cluster versions as a guarantee that all nodes in a cluster are
   266  running at a particular binary version, with a particular set of features
   267  enabled. The Pebble store is ratcheted as the cluster version passes certain
   268  versions that correspond to new Pebble functionality. Care must be taken to
   269  prevent subtle race conditions while the cluster version is being updated
   270  across all nodes in a cluster.
   271  
   272  Consider a cluster at cluster version `n-1` with corresponding Pebble format
   273  major version `A`. A new cluster version `n` introduces a new Pebble format
   274  major version `B` with new table level features. One by one, nodes will bump
   275  their format major versions from `A` to `B` as they are upgraded to cluster
   276  version `n`. There exists a period of time where nodes in a cluster are split
   277  between cluster versions `n-1` and `n`, and Pebble format major versions `A`
   278  and `B`. If version `B` introduces SSTable level features that nodes with
   279  stores at format major version `A` do not yet understand, there exists the risk
   280  for runtime incompatibilities.
   281  
   282  To guard against the window of incompatibility, _two_ cluster versions are
   283  employed when bumping Pebble format major versions that correspond to new
   284  SSTable level features. The first cluster version is used to synchronize all
   285  stores at the same Pebble format major version (and therefore table format
   286  version). The second cluster version is used as a feature gate that enables
   287  Cockroach nodes to make use of the newer table format, relying on the guarantee
   288  that if a node is at version `n + 1`, then all other nodes in the cluster must
   289  be at least at version `n`, and therefore have Pebble stores at format
   290  major version `B`.
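
The argument for two cluster versions can be checked with a toy model. This sketch assumes placeholder version numbers and hypothetical names (`node`, `safe`, `versionN`, `versionNPlus1`). A node can read the new table format once it reaches version `n`, and the `writeGate` parameter controls the version at which it starts writing that format.

```go
package main

import "fmt"

const (
	versionN      = 10 // placeholder: ratchets the store to format major version B
	versionNPlus1 = 11 // placeholder: feature gate for writing the new table format
)

// node models a Cockroach node by its current cluster version only.
type node struct{ clusterVersion int }

// safe reports whether every table that some node may write can be read by
// every node. writeGate is the cluster version at which writing begins.
func safe(cluster []node, writeGate int) bool {
	anyWritesNew, allReadNew := false, true
	for _, nd := range cluster {
		anyWritesNew = anyWritesNew || nd.clusterVersion >= writeGate
		allReadNew = allReadNew && nd.clusterVersion >= versionN
	}
	return !anyWritesNew || allReadNew
}

func main() {
	mixedOld := []node{{versionN - 1}, {versionN}}  // mid-upgrade to n
	mixedNew := []node{{versionN}, {versionNPlus1}} // mid-upgrade to n+1

	// A single version gating both reading and writing is unsafe mid-upgrade.
	fmt.Println(safe(mixedOld, versionN))

	// With a separate write gate at n+1, both mixed states are safe.
	fmt.Println(safe(mixedOld, versionNPlus1), safe(mixedNew, versionNPlus1))
}
```

The state in which one node is at `n + 1` while another is still at `n - 1` is not modelled as reachable, since the cluster version guarantee described above rules it out.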