- Feature Name: Pebble SSTable Format Versions
- Status: completed
- Start Date: 2022-01-12
- Authors: Nick Travers
- RFC PR: https://github.com/cockroachdb/pebble/pull/1450
- Pebble Issues:
  https://github.com/cockroachdb/pebble/issues/1409
  https://github.com/cockroachdb/pebble/issues/1339
- Cockroach Issues:

# Summary

To safely support changes to the SSTable structure, a new versioning scheme
under a Pebble magic number is proposed.

This RFC also outlines the relationship between the SSTable format version and
the existing Pebble format major version, in addition to how the two are to
be used in Cockroach for safely enabling new table format versions.

# Motivation

Pebble currently uses a "format major version" scheme for the store (or DB)
that indicates which Pebble features should be enabled when the store is first
opened, before any SSTables are opened. The versions indicate points of
backwards incompatibility for a store. For example, the introduction of the
`SetWithDelete` key kind is gated behind a version, as is block property
collection. This format major version scheme was introduced in
[#1227](https://github.com/cockroachdb/pebble/issues/1227).

While Pebble can use the format major version to infer how to load and
interpret data in the LSM, the SSTables that make up the store itself have
their own notion of a "version". This "SSTable version" (also referred to as a
"table format") is written to the footer (or trailing section) of each SSTable
file and determines how the file is to be interpreted by Pebble. As of the time
of writing, Pebble supports two table formats - LevelDB's format, and RocksDB's
v2 format.
Pebble inherited the latter as the default table format as it was
the version that RocksDB used at the time Pebble was being developed, and it
remained the default to allow for a simpler migration path from Cockroach
clusters that were originally using RocksDB as the storage engine. The
RocksDBv2 table format adds various features on top of the LevelDB format,
including a two-level index, configurable checksum algorithms, and an explicit
versioning scheme to allow for the introduction of changes, amongst other
features.

While the RocksDBv2 SSTable format has been sufficient for Pebble's needs since
inception, new Pebble features and potential backports from RocksDB itself
require that the SSTable format evolve over time and therefore that the table
format be updated. As the majority of new features added over time will be
specific to Pebble, it does not make sense to repurpose the RocksDB format
versions that exist upstream for use with Pebble features (at the time of
writing, RocksDB had added versions 3 and 4 on top of the version 2 in use by
Pebble). A new Pebble-specific table format scheme is proposed.

In the context of a distributed system such as Cockroach, certain SSTable
features are backwards incompatible (e.g. the block property collection and
filtering feature extends the RocksDBv2 SSTable block index format to encode
various block properties, which is a breaking change). Participants must
_first_ ensure that their stores have the code-level features available to read
and write these newer SSTables (indicated by Pebble's format major version).
Once all stores agree that they are running the minimum Pebble format major
version and will not roll back (e.g. Cockroach cluster version finalization),
SSTables can be written and read using more recent table formats.
The Pebble "format major version" and "table format version" are therefore no
longer independent - the former implies an upper bound on the latter.

Additionally, certain SSTable generation operations are independent of a
specific Pebble instance. For example, SSTable construction for the purposes of
backup and restore generates SSTables that are stored external to a specific
Pebble store (e.g. in cloud storage) and can be used at a later point in time
to restore a store. SSTables constructed for such purposes must be carefully
versioned to ensure compatibility with existing clusters that may run with a
mixture of Pebble versions.

As a real-world example of the need for the above, consider two Cockroach nodes
each with a Pebble store, one at version A, the other at version B (version A
(newer) > B (older)). Store A constructs an SSTable for an external backup
containing a newer block index format (for block property collection). This
SSTable is then imported into store B. Store B fails to read the SSTable as it
is not running with a format major version recent enough to make sense of the
newer index format. The two stores require a method for agreeing on a minimum
supported table format.

The remainder of this document outlines a new table format for Pebble. This new
table format will be used for new table-level features such as block properties
and range keys (see
[#1339](https://github.com/cockroachdb/pebble/issues/1339)), but also for
backporting table-level features from RocksDB that would be useful to Pebble
(e.g. version 3 avoids encoding sequence numbers in the index, and version 4
uses delta encoding for the block offsets in the index, both of which are
useful for Pebble).

# Technical design

## Pebble magic number

The last 8 bytes of an SSTable are referred to as the "magic number".
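
As a rough illustration of how a reader might use the magic number, the sketch below extracts the trailing 8 bytes of an in-memory table image and compares them with the Pebble magic number that this section goes on to propose. The helper names `trailingMagic` and `hasPebbleMagic` are hypothetical and are not part of Pebble's API; a real reader would also need to recognize the LevelDB and RocksDB magic numbers.

```go
package main

import (
	"bytes"
	"fmt"
)

// magicLen is the length of the trailing magic number, per this RFC.
const magicLen = 8

// pebbleMagic is the Pebble magic number proposed in this RFC
// (the UTF-8 encoding of two cockroach emoji).
var pebbleMagic = []byte("\xf0\x9f\xaa\xb3\xf0\x9f\xaa\xb3")

// trailingMagic returns the last 8 bytes of an SSTable image.
// Hypothetical helper, for illustration only.
func trailingMagic(sst []byte) ([]byte, error) {
	if len(sst) < magicLen {
		return nil, fmt.Errorf("sstable too short: %d bytes", len(sst))
	}
	return sst[len(sst)-magicLen:], nil
}

// hasPebbleMagic reports whether the table ends with the Pebble magic number.
// Hypothetical helper, for illustration only.
func hasPebbleMagic(sst []byte) bool {
	m, err := trailingMagic(sst)
	return err == nil && bytes.Equal(m, pebbleMagic)
}

func main() {
	// A fake "table": an arbitrary payload followed by the Pebble magic number.
	table := append([]byte("payload"), pebbleMagic...)
	fmt.Println(hasPebbleMagic(table))           // true
	fmt.Println(hasPebbleMagic([]byte("short"))) // false
}
```

Checking the magic number first lets a reader decide which footer layout and which version sequence apply before it interprets any other footer field.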

LevelDB uses the first 8 bytes of the SHA1 hash of the string
`http://code.google.com/p/leveldb/` for the magic number.

RocksDB uses its own magic number, which indicates the use of a slightly
different table layout - the footer (the name for the end of an SSTable) is
slightly larger to accommodate a 32-bit version number and 8 bits for a
checksum type to be used for all blocks in the SSTable.

A new 8-byte magic number will be introduced for Pebble:

```
\xf0\x9f\xaa\xb3\xf0\x9f\xaa\xb3 // 🪳🪳
```

## Pebble version scheme

Tables with a Pebble magic number will use a dedicated versioning scheme,
starting with version `1`. No new versions other than version `2` will be
supported for tables containing the RocksDB magic number.

The choice of switching to a Pebble versioning scheme starting at `1`
simplifies the implementation. Essentially all existing Pebble stores are
managed via Cockroach, and were either previously using RocksDB and migrated to
Pebble, or were created with Pebble stores. In both situations the table format
used is RocksDB v2.

Given that Pebble has not needed (and likely will not need) to support other
RocksDB table formats, it is reasonable to introduce a new magic number for
Pebble and reset the version counter to v1.

The initial versions will correspond to the following new Pebble features,
which have yet to be introduced to Cockroach clusters as of the time of
writing:

- Version 1: block property collectors (block properties are encoded into the
  block index)
- Version 2: range keys (a new block is present in the table for range keys).

Subsequent alterations to the SSTable format should only increment the _Pebble
version number_. It should be noted that backported RocksDB table format
features (e.g.
RocksDB versions 3 and 4) would use a different version number,
within the Pebble version sequence. While possibly confusing, the RocksDB
features are being "adopted" by Pebble, rather than directly ported, so a
Pebble-specific version number is appropriate.

An alternative would be to allow RocksDB table format features to be backported
into Pebble under their existing RocksDB magic number, _alongside_
Pebble-specific features. The complexity required to determine the set of
characteristics to read and write to each SSTable would increase with such a
scheme, compared to the simpler "linear history" approach described above,
where new features simply ratchet the Pebble table format version number.

## Footer format

The footer format for SSTables with Pebble magic numbers _will remain the same_
as the RocksDB footer format - specifically, the trailing 53 bytes of the
SSTable consisting of the following fields with the given indices,
little-endian encoded:

- `0`: Checksum type
- `1-20`: Meta-index block handle
- `21-40`: Index block handle
- `41-44`: Version number
- `45-52`: Magic number

## Changes / additions to `sstable.TableFormat`

The `sstable.TableFormat` enum is a `uint32` representation of the tuple
`(magic number, format version)`. The current values are:

```go
type TableFormat uint32

const (
	TableFormatRocksDBv2 TableFormat = iota
	TableFormatLevelDB
)
```

It should be noted that this enum is _not_ persisted in the SSTable. It is
purely an internal type that represents the tuple and simplifies a number of
version checks when reading / writing an SSTable. The values are free to
change, provided care is taken with default values and existing usage.

The existing `sstable.TableFormat` will be altered to reflect the "linear"
nature of the version history.
New versions will be added with the next value
in the sequence.

```go
const (
	TableFormatUnspecified TableFormat = iota
	TableFormatLevelDB   // The original LevelDB table format.
	TableFormatRocksDBv2 // The current default table format.
	TableFormatPebblev1  // Block properties.
	TableFormatPebblev2  // Range keys.
	// ...
	TableFormatPebblevN
)
```

The introduction of `TableFormatUnspecified` can be used to ensure that where a
`sstable.TableFormat` is _not_ specified, Pebble can select a suitable default
for writing the table (most likely based on the format major version in use by
the store; more in the next section).

## Interaction with the format major version

The `FormatMajorVersion` type is used to determine the set of features the
store supports.

A Pebble store may be read from / written to by a Pebble binary that supports
newer features, with more recent Pebble format major versions. These newer
features could include the ability to read and write more recent SSTables.
While the store _could_ read and write SSTables at the most recent version the
binary supports, it is not safe to do so, for reasons outlined earlier.

The format major version will have a "maximum table format version" associated
with it that indicates the maximum `sstable.TableFormat` that can be safely
handled by the store.

When introducing a new _table format_ version, it should be gated behind an
associated `FormatMajorVersion` that has the new table format as its "maximum
table format version".

For example:

```go
// Existing versions.
FormatDefault.MaxTableFormat() // sstable.TableFormatRocksDBv2
// ...
FormatSetWithDelete.MaxTableFormat() // sstable.TableFormatRocksDBv2
// Proposed versions with Pebble version scheme.
FormatBlockPropertyCollector.MaxTableFormat() // sstable.TableFormatPebblev1
FormatRangeKeys.MaxTableFormat() // sstable.TableFormatPebblev2
```

## Usage in Cockroach

The introduction of new SSTable format versions needs to be carefully
coordinated between stores to ensure there are no incompatibilities (i.e. a
newer store writes an SSTable that cannot be understood by other stores).

It is only safe to use a new table format when all nodes in a cluster have been
finalized. A newer Cockroach node, with newer Pebble code, should continue to
write SSTables with a table format version equal to or less than the smallest
table format version across all nodes in the cluster. Once the cluster version
has been finalized, and `(*DB).RatchetFormatMajorVersion(FormatMajorVersion)`
has been called, nodes are free to write SSTables at newer table format
versions.

At runtime, Pebble exposes a `(*DB).FormatMajorVersion()` method, which may be
used to determine the current format major version of the store, and hence, the
associated table format version.

In addition to the above, there are situations where SSTables are created for
consumption at a later point in time, independent of any Pebble store -
specifically backup and restore. Currently, Cockroach uses two functions in
`pkg/storage` to construct SSTables for both ingestion and backup
([here](https://github.com/cockroachdb/cockroach/blob/20eaf0b415f1df361246804e5d1d80c7a20a8eb6/pkg/storage/sst_writer.go#L57)
and
[here](https://github.com/cockroachdb/cockroach/blob/20eaf0b415f1df361246804e5d1d80c7a20a8eb6/pkg/storage/sst_writer.go#L78)).
Both will need to be updated to take into account the cluster version to ensure
that SSTables with newer versions are only written once the cluster version has
been finalized.
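
The relationship between the format major version and the table format chosen by a writer can be sketched as follows. This is a self-contained illustration, not Pebble's implementation: `TableFormat` and `FormatMajorVersion` are local stand-ins that reuse the names from this RFC, the numeric values are illustrative, and `resolveTableFormat` is a hypothetical helper. It assumes that an unspecified format defaults to the store's maximum, which the RFC only describes as the most likely behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// TableFormat is a local stand-in for the linear enum proposed in this RFC.
type TableFormat uint32

const (
	TableFormatUnspecified TableFormat = iota
	TableFormatLevelDB
	TableFormatRocksDBv2
	TableFormatPebblev1
	TableFormatPebblev2
)

// FormatMajorVersion is a local stand-in; the values are illustrative only.
type FormatMajorVersion uint64

const (
	FormatDefault FormatMajorVersion = iota
	FormatSetWithDelete
	FormatBlockPropertyCollector
	FormatRangeKeys
)

// MaxTableFormat returns the newest table format that a store at
// format major version v may safely write.
func (v FormatMajorVersion) MaxTableFormat() TableFormat {
	switch {
	case v >= FormatRangeKeys:
		return TableFormatPebblev2
	case v >= FormatBlockPropertyCollector:
		return TableFormatPebblev1
	default:
		return TableFormatRocksDBv2
	}
}

var errFormatTooNew = errors.New("requested table format exceeds store maximum")

// resolveTableFormat (hypothetical) picks the format for a new SSTable:
// an unspecified request defaults to the store's maximum, and a request
// newer than the maximum is rejected.
func resolveTableFormat(requested TableFormat, v FormatMajorVersion) (TableFormat, error) {
	limit := v.MaxTableFormat()
	if requested == TableFormatUnspecified {
		return limit, nil
	}
	if requested > limit {
		return TableFormatUnspecified, errFormatTooNew
	}
	return requested, nil
}

func main() {
	f, err := resolveTableFormat(TableFormatUnspecified, FormatSetWithDelete)
	fmt.Println(f == TableFormatRocksDBv2, err)

	_, err = resolveTableFormat(TableFormatPebblev2, FormatBlockPropertyCollector)
	fmt.Println(err)
}
```

Because the enum is linear, the safety check reduces to a single integer comparison against the store's maximum.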

### Cluster version migration sequencing

Cockroach uses cluster versions as a guarantee that all nodes in a cluster are
running at a particular binary version, with a particular set of features
enabled. The Pebble store is ratcheted as the cluster version passes certain
versions that correspond to new Pebble functionality. Care must be taken to
prevent subtle race conditions while the cluster version is being updated
across all nodes in a cluster.

Consider a cluster at cluster version `n-1` with corresponding Pebble format
major version `A`. A new cluster version `n` introduces a new Pebble format
major version `B` with new table-level features. One by one, nodes will bump
their format major versions from `A` to `B` as they are upgraded to cluster
version `n`. There exists a period of time where nodes in a cluster are split
between cluster versions `n-1` and `n`, and Pebble format major versions `A`
and `B`. If version `B` introduces SSTable-level features that nodes with
stores at format major version `A` do not yet understand, there exists the risk
of runtime incompatibilities.

To guard against the window of incompatibility, _two_ cluster versions are
employed when bumping Pebble format major versions that correspond to new
SSTable-level features. The first cluster version is used to synchronize all
stores at the same Pebble format major version (and therefore table format
version). The second cluster version is used as a feature gate that enables
Cockroach nodes to make use of the newer table format, relying on the guarantee
that if a node is at version `n + 1`, then all other nodes in the cluster must
be at least at version `n`, and therefore have Pebble stores at format
major version `B`.
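
The two-version sequencing described above can be sketched as a pair of gates. Everything in this example is hypothetical: `ClusterVersion` is a simplified integer stand-in for a Cockroach cluster version, and the constants `versionRatchet` (playing the role of `n`) and `versionEnable` (playing the role of `n + 1`) are arbitrary illustrative values.

```go
package main

import "fmt"

// ClusterVersion is a hypothetical, simplified stand-in for a Cockroach
// cluster version.
type ClusterVersion int

const (
	// versionRatchet plays the role of version n: once it is active, a node
	// ratchets its Pebble store to format major version B.
	versionRatchet ClusterVersion = 10
	// versionEnable plays the role of version n+1: once it is active, every
	// other node is at least at version n, so the new table format may be written.
	versionEnable ClusterVersion = 11
)

// shouldRatchetStore reports whether a node at the given active cluster
// version should have ratcheted its store to format major version B.
func shouldRatchetStore(active ClusterVersion) bool {
	return active >= versionRatchet
}

// mayWriteNewTableFormat reports whether a node at the given active cluster
// version may write SSTables using the newer table format.
func mayWriteNewTableFormat(active ClusterVersion) bool {
	return active >= versionEnable
}

func main() {
	for _, v := range []ClusterVersion{9, 10, 11} {
		fmt.Printf("cluster version %d: ratchet=%t writeNew=%t\n",
			v, shouldRatchetStore(v), mayWriteNewTableFormat(v))
	}
}
```

At the intermediate version the store has been ratcheted but the new format is not yet written, which is the window in which the remaining nodes catch up.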