# Pebble vs RocksDB: Implementation Differences

RocksDB is a key-value store implemented using a Log-Structured Merge-Tree (LSM). This document is not a primer on LSMs. There exist some decent [introductions](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/) on the web, or try chapter 3 of [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321).

Pebble inherits the RocksDB file formats, has a similar API, and shares many implementation details, but it also has many differences that improve performance, reduce implementation complexity, or extend functionality. This document highlights some of the more important differences.

* [Internal Keys](#internal-keys)
* [Indexed Batches](#indexed-batches)
* [Large Batches](#large-batches)
* [Commit Pipeline](#commit-pipeline)
* [Range Deletions](#range-deletions)
* [Flush and Compaction Pacing](#flush-and-compaction-pacing)
* [Write Throttling](#write-throttling)
* [Other Differences](#other-differences)

## Internal Keys

The external RocksDB API accepts keys and values. Due to the LSM structure, keys are never updated in place, but overwritten with new versions. Inside RocksDB, these versioned keys are known as Internal Keys. An Internal Key is composed of the user-specified key, a sequence number, and a kind. On disk, sstables always store Internal Keys.

```
+-------------+------------+----------+
| UserKey (N) | SeqNum (7) | Kind (1) |
+-------------+------------+----------+
```

The `Kind` field indicates the type of key: set, merge, delete, etc.

While Pebble inherits the Internal Key encoding for format compatibility, it diverges from RocksDB in how it manages Internal Keys in its implementation. In RocksDB, Internal Keys are represented either in encoded form (as a string) or as a `ParsedInternalKey`. The latter is a struct with the components of the Internal Key as three separate fields.

```c++
struct ParsedInternalKey {
  Slice user_key;
  uint64 seqnum;
  uint8 kind;
};
```

The component format is convenient: changing the `SeqNum` or `Kind` is a field assignment, and extracting the `UserKey` is a field reference. However, RocksDB tends to only use `ParsedInternalKey` locally. The major internal APIs, such as `InternalIterator`, operate using encoded internal keys (i.e. strings) for parameters and return values.

To give a concrete example of the overhead this causes, consider `Iterator::Seek(user_key)`. The external `Iterator` is implemented on top of an `InternalIterator`. `Iterator::Seek` ends up calling `InternalIterator::Seek`. Both Seek methods take a key, but `InternalIterator::Seek` expects an encoded Internal Key. This is both error-prone and expensive: the key passed to `Iterator::Seek` needs to be copied into a temporary string in order to append the `SeqNum` and `Kind`. In Pebble, Internal Keys are represented in memory using an `InternalKey` struct that is the analog of `ParsedInternalKey`. All internal APIs use `InternalKey`s, with the exception of the lowest-level routines for decoding data from sstables. Because the Pebble interfaces all take and return the `InternalKey` struct, Pebble does not need to allocate to construct an Internal Key from a User Key, whereas RocksDB sometimes needs to allocate and encode (i.e. make a copy). The use of the encoded form also causes RocksDB to pass encoded keys to the comparator routines, sometimes decoding the keys multiple times during the course of processing.
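
To make the struct-based representation concrete, the following Go sketch shows the general shape of the idea. It is a simplified illustration rather than Pebble's exact definitions (the real ones live in Pebble's `internal/base` package), and the helper name `makeSearchKey` is illustrative:

```go
// InternalKeyKind corresponds to the Kind byte: set, merge, delete, etc.
type InternalKeyKind uint8

// InternalKey is the in-memory analog of ParsedInternalKey: the user key is
// kept separate from the packed sequence number and kind.
type InternalKey struct {
	UserKey []byte
	Trailer uint64 // SeqNum occupies the upper 56 bits, Kind the lower 8 bits
}

// makeSearchKey builds the Internal Key for a Seek. The user key is aliased,
// not copied, so no allocation or re-encoding is needed.
func makeSearchKey(userKey []byte, seqNum uint64, kind InternalKeyKind) InternalKey {
	return InternalKey{
		UserKey: userKey,
		Trailer: (seqNum << 8) | uint64(kind),
	}
}

func (k InternalKey) SeqNum() uint64        { return k.Trailer >> 8 }
func (k InternalKey) Kind() InternalKeyKind { return InternalKeyKind(k.Trailer & 0xff) }
```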
## Indexed Batches

In RocksDB, a batch is the unit for all write operations. Even writing a single key is transformed internally into a batch. The batch internal representation is a contiguous byte buffer with a fixed 12-byte header, followed by a series of records.

```
+------------+-----------+--- ... ---+
| SeqNum (8) | Count (4) |  Entries  |
+------------+-----------+--- ... ---+
```

Each record has a 1-byte kind tag prefix, followed by 1 or 2 length-prefixed strings (varstring):

```
+----------+-----------------+-------------------+
| Kind (1) | Key (varstring) | Value (varstring) |
+----------+-----------------+-------------------+
```

(The `Kind` indicates whether there are 1 or 2 varstrings. `Set`, `Merge`, and `DeleteRange` have 2 varstrings, while `Delete` has 1.)

Adding a mutation to a batch involves appending a new record to the buffer. This format is extremely fast for writes, but the lack of indexing makes it untenable to use directly for reads. In order to support iteration, a separate indexing structure is created. Both RocksDB and Pebble use a skiplist for the indexing structure, but with a clever twist. Rather than the skiplist storing a copy of the key, it simply stores the offset of the record within the mutation buffer. The result is that the skiplist acts as a multi-map (i.e. a map that can have duplicate entries for a given key). The iteration order for this map is constructed so that records sort on key, and for equal keys they sort on descending offset. Newer records for the same key appear before older records.
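
Because the index stores offsets rather than keys, the skiplist comparator must decode the key from the batch buffer on every comparison. The following sketch shows that ordering; the `decodeKey` helper and its layout assumptions are illustrative rather than the actual batch code:

```go
import (
	"bytes"
	"encoding/binary"
)

// batchIndexCompare orders two batch records identified by their offsets into
// the batch buffer: ascending by user key, and for equal keys descending by
// offset so that newer records sort before older ones.
func batchIndexCompare(buf []byte, aOffset, bOffset uint32) int {
	aKey := decodeKey(buf, aOffset)
	bKey := decodeKey(buf, bOffset)
	if c := bytes.Compare(aKey, bKey); c != 0 {
		return c
	}
	// Equal keys: the record at the larger offset was added later and is newer.
	switch {
	case aOffset > bOffset:
		return -1
	case aOffset < bOffset:
		return 1
	default:
		return 0
	}
}

// decodeKey reads the key varstring of the record starting at offset: a 1-byte
// kind tag followed by a uvarint length prefix and the key bytes.
func decodeKey(buf []byte, offset uint32) []byte {
	p := buf[offset+1:] // skip the kind tag
	n, m := binary.Uvarint(p)
	return p[m : m+int(n)]
}
```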
While the indexing structure for batches is nearly identical between RocksDB and Pebble, how the index structure is used is completely different. In RocksDB, a batch is indexed using the `WriteBatchWithIndex` class. The `WriteBatchWithIndex` class provides a `NewIteratorWithBase` method that allows iteration over the merged view of the batch contents and an underlying "base" iterator created from the database. `BaseDeltaIterator` contains logic to iterate over the batch entries and the base iterator in parallel, which allows us to perform reads on a snapshot of the database as though the batch had been applied to it. On the surface this sounds reasonable, yet the implementation is incomplete: Merge and DeleteRange operations are not supported. They are not supported because handling them is complex and would require duplicating logic that already exists inside RocksDB for normal iterator processing.

Pebble takes a different approach to iterating over a merged view of a batch's contents and the underlying database: it treats the batch as another level in the LSM. Recall that an LSM is composed of zero or more memtable layers and zero or more sstable layers. Internally, both RocksDB and Pebble contain a `MergingIterator` that knows how to merge the operations from different levels, including processing overwritten keys, merge operations, and delete range operations. The challenge with treating the batch as another level to be used by a `MergingIterator` is that the records in a batch do not have a sequence number. The sequence number in the batch header is not assigned until the batch is committed. The solution is to give the batch records temporary sequence numbers. We need these temporary sequence numbers to be larger than any other sequence number in the database so that the records in the batch are considered newer than any committed record. This is accomplished by reserving the high bit of the 56-bit sequence number as a marker for batch sequence numbers. The sequence number for a record in an uncommitted batch is:

```
RecordOffset | (1<<55)
```

Newer records in a given batch will have a larger sequence number than older records in the batch. And all of the records in a batch will have larger sequence numbers than any committed record in the database.

The end result is that Pebble's batch iterators support all of the functionality of regular database iterators with minimal additional code.

## Large Batches

The size of a batch is limited only by available memory, yet the required memory is not just the batch representation. When a batch is committed, the commit operation iterates over the records in the batch from oldest to newest and inserts them into the current memtable. The memtable is an in-memory structure that buffers mutations that have been committed (written to the Write Ahead Log), but not yet written to an sstable. Internally, a memtable uses a skiplist to index records. Each skiplist entry has overhead for the index links and other metadata that is a dozen bytes at minimum. A large batch composed of many small records can require twice as much memory when inserted into a memtable as it required in the batch. And note that this causes a temporary increase in memory requirements, because the batch memory is not freed until it is completely committed.

A non-obvious implementation restriction present in both RocksDB and Pebble is that there is a one-to-one correspondence between WAL files and memtables. That is, a given WAL file has a single memtable associated with it and vice versa. While this restriction could be removed, doing so is onerous and intricate. It should also be noted that committing a batch involves writing it to a single WAL file. The combination of restrictions results in a batch needing to be written entirely to a single memtable.

What happens if a batch is too large to fit in a memtable? Memtables are generally considered to have a fixed size, yet this is not actually true in RocksDB. In RocksDB, the memtable skiplist is implemented on top of an arena structure. An arena is composed of a list of fixed-size chunks, with no upper limit set for the number of chunks that can be associated with an arena. So RocksDB handles large batches by allowing a memtable to grow beyond its configured size. Concretely, while RocksDB may be configured with a 64MB memtable size, a 1GB batch will cause the memtable to grow to accommodate it. Functionally, this is good, though there is a practical problem: a large batch is first written to the WAL, and then added to the memtable. Adding the large batch to the memtable may consume so much memory that the system runs out of memory and is killed by the kernel. This can result in a death loop, because upon restarting the batch is read from the WAL and applied to the memtable again.

In Pebble, the memtable is also implemented using a skiplist on top of an arena. Significantly, the Pebble arena is a fixed size. While the RocksDB skiplist uses pointers, the Pebble skiplist uses offsets from the start of the arena. The fixed-size arena means that the Pebble memtable cannot expand arbitrarily. A batch that is too large to fit in the memtable causes the current mutable memtable to be marked as immutable, and the batch is wrapped in a `flushableBatch` structure and added to the list of immutable memtables. Because the `flushableBatch` is readable as another layer in the LSM, the batch commit can return as soon as the `flushableBatch` has been added to the immutable memtable list.

Internally, a `flushableBatch` provides iterator support by sorting the batch contents (the batch is sorted once, when it is added to the memtable list). Sorting the batch contents and insertion of the contents into a memtable have the same big-O time, but the constant factor dominates here. Sorting is significantly faster and uses significantly less memory due to not having to copy the batch records.
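
The idea can be sketched as sorting the record offsets rather than the records themselves, reusing the hypothetical `decodeKey` helper from the earlier batch sketch (again illustrative, not the actual `flushableBatch` code):

```go
import (
	"bytes"
	"sort"
)

// sortBatchOffsets sorts the offsets of a batch's records by user key, and by
// descending offset for equal keys, so that an iterator can walk the batch in
// key order without copying any record into a skiplist.
func sortBatchOffsets(buf []byte, offsets []uint32) {
	sort.Slice(offsets, func(i, j int) bool {
		a, b := offsets[i], offsets[j]
		if c := bytes.Compare(decodeKey(buf, a), decodeKey(buf, b)); c != 0 {
			return c < 0
		}
		return a > b // for equal keys, the newer record (larger offset) sorts first
	})
}
```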
Note that an effect of this large batch support is that Pebble can be configured as an efficient on-disk sorter: specify a small memtable size, disable the WAL, and set a large L0 compaction threshold. In order to sort a large amount of data, create batches that are larger than the memtable size and commit them. When committed, these batches will not be inserted into a memtable, but instead sorted and then written out to L0. The fully sorted data can later be read and the normal merging process will take care of the final ordering.

## Commit Pipeline

The commit pipeline is the component which manages the steps in committing write batches, such as writing the batch to the WAL and applying its contents to the memtable. While simple conceptually, the commit pipeline is crucial for high performance. In the absence of concurrency, commit performance is limited by how fast a batch can be written (and synced) to the WAL and then added to the memtable, both of which are outside the purview of the commit pipeline.

To understand the challenge here, it is useful to have a conception of the WAL (write-ahead log). The WAL contains a record of all of the batches that have been committed to the database. As a record is written to the WAL it is added to the memtable. Each record is assigned a sequence number which is used to distinguish newer updates from older ones. Conceptually the WAL looks like:

```
+--------------------------------------+
| Batch(SeqNum=1,Count=9,Records=...)  |
+--------------------------------------+
| Batch(SeqNum=10,Count=5,Records=...) |
+--------------------------------------+
| Batch(SeqNum=15,Count=7,Records=...) |
+--------------------------------------+
| ...                                  |
+--------------------------------------+
```

Note that each WAL entry is precisely the batch representation described earlier in the [Indexed Batches](#indexed-batches) section. The monotonically increasing sequence numbers are a critical component in allowing RocksDB and Pebble to provide fast snapshot views of the database for reads.
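
As a small illustration of that correspondence, committing a batch only requires stamping the assigned sequence number and record count into the batch's fixed 12-byte header before the buffer is handed to the WAL. The sketch below is illustrative (it assumes the little-endian fixed-width encoding used for the header fields):

```go
import "encoding/binary"

// stampHeader fills in the 12-byte batch header (SeqNum followed by Count)
// once the sequence number has been assigned at commit time. Because a WAL
// entry is just the batch representation, the same buffer can then be
// appended to the WAL without any re-encoding.
func stampHeader(batch []byte, seqNum uint64, count uint32) {
	binary.LittleEndian.PutUint64(batch[0:8], seqNum)
	binary.LittleEndian.PutUint32(batch[8:12], count)
}
```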
If concurrent performance were not a concern, the commit pipeline could simply be a mutex which serialized writes to the WAL and application of the batch records to the memtable. Concurrent performance is a concern, though.

The primary challenge in concurrent performance in the commit pipeline is maintaining two invariants:

1. Batches need to be written to the WAL in sequence number order.
2. Batches need to be made visible for reads in sequence number order. This invariant arises from the use of a single sequence number which indicates which mutations are visible.

The second invariant deserves explanation. RocksDB and Pebble both keep track of a visible sequence number. This is the sequence number for which records in the database are visible during reads. The visible sequence number exists because committing a batch is an atomic operation, yet adding records to the memtable is done without an exclusive lock (the skiplists used by both Pebble and RocksDB are lock-free). When the records from a batch are being added to the memtable, a concurrent read operation may see those records, but will skip over them because they are newer than the visible sequence number. Once all of the records in the batch have been added to the memtable, the visible sequence number is atomically incremented.
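
A sketch of the visible sequence number mechanics, with illustrative names and simplified atomics (not the actual RocksDB or Pebble code):

```go
import "sync/atomic"

// visibleSeqNum is the highest sequence number that is visible to readers.
var visibleSeqNum atomic.Uint64

// A reader snapshots the visible sequence number once at the start of a read
// and skips any record that is newer than the snapshot, even if that record
// has already been inserted into the memtable skiplist.
func readSnapshot() uint64 {
	return visibleSeqNum.Load()
}

func visible(recordSeqNum, snapshotSeqNum uint64) bool {
	return recordSeqNum <= snapshotSeqNum
}

// Once every record in a batch has been added to the memtable, the committer
// publishes the batch by ratcheting the visible sequence number past it.
func publish(lastSeqNumInBatch uint64) {
	visibleSeqNum.Store(lastSeqNumInBatch)
}
```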
So we have four steps in committing a write batch:

1. Write the batch to the WAL
2. Apply the mutations in the batch to the memtable
3. Bump the visible sequence number
4. (Optionally) sync the WAL

Writing the batch to the WAL is actually very fast as it is just a memory copy. Applying the mutations in the batch to the memtable is by far the most CPU-intensive part of the commit pipeline. Syncing the WAL is the most expensive from a wall clock perspective.

With that background out of the way, let's examine how RocksDB commits batches. This description is of the traditional commit pipeline in RocksDB (i.e. the one used by CockroachDB).

RocksDB achieves concurrency in the commit pipeline by grouping concurrently committed batches into a batch group. Each group is assigned a "leader" which is the first batch to be added to the group. The batch group is written atomically to the WAL by the leader thread, and then the individual batches making up the group are concurrently applied to the memtable. Lastly, the visible sequence number is bumped such that all of the batches in the group become visible in a single atomic step. While a batch group is being applied, other concurrent commits are added to a waiting list. When the group commit finishes, the waiting commits form the next group.

There are two criticisms of the batch grouping approach. The first is that forming a batch group involves copying batch contents. RocksDB partially alleviates this for large batches by placing a limit on the total size of a group. A large batch will end up in its own group and not be copied, but the criticism still applies for small batches. Note that there are actually two copies here: the batch contents are concatenated together to form the group, and then the group contents are written into an in-memory buffer for the WAL before being written to disk.

The second criticism is about the thread synchronization points. Let's consider what happens to a commit which becomes the leader:

1. Lock commit mutex
2. Wait to become leader
3. Form (concatenate) batch group and write to the WAL
4. Notify followers to apply their batch to the memtable
5. Apply own batch to memtable
6. Wait for followers to finish
7. Bump visible sequence number
8. Unlock commit mutex
9. Notify followers that the commit is complete

The follower's set of operations looks like:

1. Lock commit mutex
2. Wait to become follower
3. Wait to be notified that it is time to apply batch
4. Unlock commit mutex
5. Apply batch to memtable
6. Wait to be notified that commit is complete

The thread synchronization points (all of the waits and notifies) are overhead. Reducing that overhead can improve performance.

The Pebble commit pipeline addresses both criticisms. The main innovation is a commit queue that mirrors the commit order. The Pebble commit pipeline looks like:

1. Lock commit mutex
   * Add batch to commit queue
   * Assign batch sequence number
   * Write batch to the WAL
2. Unlock commit mutex
3. Apply batch to memtable (concurrently)
4. Publish batch sequence number

Pebble does not use the concept of a batch group. Each batch is individually written to the WAL, but note that the WAL write is just a memory copy into an internal buffer in the WAL.

Step 4 deserves further scrutiny as it is where the invariant on the visible batch sequence number is maintained. Publishing the batch sequence number cannot simply bump the visible sequence number, because batches with earlier sequence numbers may still be applying to the memtable. If we were to ratchet the visible sequence number without waiting for those applies to finish, a concurrent reader could see partial batch contents. Note that RocksDB has experimented with allowing these semantics with its unordered writes option.

We want to retain the atomic visibility of batch commits. The publish batch sequence number step needs to ensure that we don't ratchet the visible sequence number until all batches with earlier sequence numbers have applied. Enter the commit queue: a lock-free single-producer, multi-consumer queue. Batches are added to the commit queue with the commit mutex held, ensuring the same order as the sequence number assignment. After a batch finishes applying to the memtable, the committer atomically marks the batch as applied. It then removes the prefix of applied batches from the commit queue, bumping the visible sequence number and marking each removed batch as committed (via a `sync.WaitGroup`). If the first batch in the commit queue has not been applied, we wait for our batch to be committed, relying on another concurrent committer to perform the visible sequence number ratcheting for our batch. We know a concurrent commit is taking place because, if there were only one batch committing, it would be at the head of the commit queue.
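
The following sketch shows the shape of the publish step. Pebble's actual commit queue is a lock-free single-producer, multi-consumer ring buffer; the sketch below uses a mutex-protected slice to keep the logic visible, and all names are illustrative:

```go
import (
	"sync"
	"sync/atomic"
)

// commitBatch is a stand-in for a committing batch. Its commit WaitGroup is
// Add(1)'d when the batch is appended to the queue under the commit mutex.
type commitBatch struct {
	seqNum  uint64 // first sequence number assigned to the batch
	count   uint64 // number of records in the batch
	applied atomic.Bool
	commit  sync.WaitGroup
}

type commitPipeline struct {
	visibleSeqNum atomic.Uint64
	mu            sync.Mutex
	queue         []*commitBatch // batches in sequence number order
}

// publish is called by a committer after applying its batch to the memtable.
// It pops the prefix of applied batches off the commit queue, ratcheting the
// visible sequence number for each one, and then waits until its own batch
// has been published (possibly by a concurrent committer doing the popping).
func (p *commitPipeline) publish(b *commitBatch) {
	b.applied.Store(true)
	p.mu.Lock()
	for len(p.queue) > 0 && p.queue[0].applied.Load() {
		head := p.queue[0]
		p.queue = p.queue[1:]
		// Make every record in head visible, and wake its committer.
		p.visibleSeqNum.Store(head.seqNum + head.count - 1)
		head.commit.Done()
	}
	p.mu.Unlock()
	b.commit.Wait()
}
```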
There are two possibilities when publishing a sequence number. The first is that there is an unapplied batch at the head of the queue. Consider the following scenario where we're trying to publish the sequence number for batch `B`:

```
+---------------+-------------+---------------+-----+
| A (unapplied) | B (applied) | C (unapplied) | ... |
+---------------+-------------+---------------+-----+
```

The publish routine will see that `A` is unapplied and then simply wait for `B`'s done `sync.WaitGroup` to be signalled. This is safe because `A` must still be committing. And if `A` has concurrently been marked as applied, the goroutine publishing `A` will then publish `B`. What happens when `A` publishes its sequence number? The commit queue state becomes:

```
+-------------+-------------+---------------+-----+
| A (applied) | B (applied) | C (unapplied) | ... |
+-------------+-------------+---------------+-----+
```

The publish routine pops `A` from the queue, ratchets the sequence number, then pops `B` and ratchets the sequence number again, and then finds `C` and stops. An important detail to notice is that the committer for batch `B` didn't have to do any more work. An alternative approach would be to have `B` wake up and ratchet its own sequence number, but that would serialize the remainder of the commit queue behind that goroutine waking up.

The commit queue reduces the number of thread synchronization operations required to commit a batch. There is no leader to notify and no followers to wait for. A commit either publishes its own sequence number, or performs one synchronization operation to wait for a concurrent committer to publish its sequence number.

## Range Deletions

Deletion of an individual key in RocksDB and Pebble is accomplished by writing a deletion tombstone. A deletion tombstone shadows an existing value for a key, causing reads to treat the key as not present. The deletion tombstone mechanism works well for deleting small sets of keys, but what happens if you want to delete all of the keys within a range of keys that might number in the thousands or millions? A range deletion is an operation which deletes an entire range of keys with a single record. In contrast to a point deletion tombstone, which specifies a single key, a range deletion tombstone (a.k.a. range tombstone) specifies a start key (inclusive) and an end key (exclusive). This single record is much faster to write than thousands or millions of point deletion tombstones, and can be done blindly -- without iterating over the keys that need to be deleted. The downside to range tombstones is that they require additional processing during reads. How the processing of range tombstones is done significantly affects both the complexity of the implementation and the efficiency of read operations in the presence of range tombstones.

A range tombstone is composed of a start key, end key, and sequence number. Any key that falls within the range is considered deleted if the key's sequence number is less than the range tombstone's sequence number. RocksDB stores range tombstones segregated from point operations in a special range deletion block within each sstable. Conceptually, the range tombstones stored within an sstable are truncated to the boundaries of the sstable, though there are complexities that cause this to not actually be physically true.
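
A minimal sketch of a range tombstone and the coverage check just described (illustrative types, not the actual RocksDB or Pebble definitions):

```go
import "bytes"

// RangeTombstone deletes every key in [Start, End) that was written with a
// sequence number lower than SeqNum.
type RangeTombstone struct {
	Start, End []byte // Start is inclusive, End is exclusive
	SeqNum     uint64
}

// Deletes reports whether the tombstone deletes a key written at keySeqNum.
func (t RangeTombstone) Deletes(key []byte, keySeqNum uint64) bool {
	if bytes.Compare(key, t.Start) < 0 || bytes.Compare(key, t.End) >= 0 {
		return false // key is outside [Start, End)
	}
	return keySeqNum < t.SeqNum
}
```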
In RocksDB, the main structure implementing range tombstone processing is the `RangeDelAggregator`. Each read operation and iterator has its own `RangeDelAggregator` configured for the sequence number the read is taking place at. The initial implementation of `RangeDelAggregator` built up a "skyline" for the range tombstones visible at the read sequence number.

```
10     +---+
 9     |   |
 8     |   |
 7     |   +----+
 6     |        |
 5   +-+        |  +----+
 4   | |        |  |    |
 3   | |        |  |    +---+
 2   | |        |  |        |
 1   | |        |  |        |
 0   | |        |  |        |
    abcdefghijklmnopqrstuvwxyz
```

The above diagram shows the skyline created for the range tombstones `[b,j)#5`, `[d,h)#10`, `[f,m)#7`, `[p,u)#5`, and `[t,y)#3`. The skyline is queried for each key read to see if the key should be considered deleted or not. The skyline structure is stored in a binary tree, making the queries an O(log n) operation in the number of tombstones, though there is an optimization to make this O(1) for `next`/`prev` iteration. Note that the skyline representation loses information about the range tombstones. This requires the structure to be rebuilt on every read, which has a significant performance impact.

The initial skyline range tombstone implementation has since been replaced with a more efficient lookup structure. See the [DeleteRange](https://rocksdb.org/blog/2018/11/21/delete-range.html) blog post for a good description of both the original implementation and the new (v2) implementation. The key change in the new implementation is to "fragment" the range tombstones that are stored in an sstable. The fragmented range tombstones provide the same benefit as the skyline representation: the ability to binary search the fragments in order to find the tombstone covering a key. But unlike the skyline approach, the fragmented tombstones can be cached on a per-sstable basis. In the v2 approach, `RangeDelAggregator` keeps track of the fragmented range tombstones for each sstable encountered during a read or iterator, and logically merges them together.

Fragmenting range tombstones involves splitting range tombstones at overlap points. Let's consider the tombstones in the skyline example above:

```
10:    d---h
 7:      f------m
 5:  b-------j     p----u
 3:                    t----y
```

Fragmenting the range tombstones at the overlap points creates a larger number of range tombstones:

```
10:    d-f-h
 7:      f-h-j--m
 5:  b-d-f-h-j     p---tu
 3:                    tu---y
```

While the number of tombstones is larger, there is a significant advantage: we can order the tombstones by their start key and then binary search to find the set of tombstones overlapping a particular point. This is possible because, due to the fragmenting, all the tombstones that overlap a range of keys have the same start and end key. The v2 `RangeDelAggregator` and associated classes perform fragmentation of the range tombstones stored in each sstable, and those fragmented tombstones are then cached.

In summary, in RocksDB `RangeDelAggregator` acts as an oracle for answering whether a key is deleted at a particular sequence number. Due to the caching of fragmented tombstones, the v2 implementation of `RangeDelAggregator` is significantly faster to populate than v1, yet the overall approach to processing range tombstones remains similar.
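
The fragmentation step itself can be sketched directly: split the tombstones at every boundary key so that any two output fragments either have identical bounds or do not overlap at all. The code below is illustrative (and quadratic for clarity) rather than the actual fragmenter:

```go
import (
	"bytes"
	"sort"
)

type tombstone struct {
	start, end []byte
	seqNum     uint64
}

// fragment splits the input tombstones at every start/end boundary. In the
// output, any two fragments that overlap have exactly the same bounds, which
// is what allows binary searching fragments by start key.
func fragment(tombstones []tombstone) []tombstone {
	// Collect and sort the boundary keys.
	var bounds [][]byte
	for _, t := range tombstones {
		bounds = append(bounds, t.start, t.end)
	}
	sort.Slice(bounds, func(i, j int) bool { return bytes.Compare(bounds[i], bounds[j]) < 0 })

	var frags []tombstone
	for i := 0; i+1 < len(bounds); i++ {
		lo, hi := bounds[i], bounds[i+1]
		if bytes.Equal(lo, hi) {
			continue // duplicate boundary
		}
		// Emit a fragment for every tombstone that covers [lo, hi).
		for _, t := range tombstones {
			if bytes.Compare(t.start, lo) <= 0 && bytes.Compare(t.end, hi) >= 0 {
				frags = append(frags, tombstone{start: lo, end: hi, seqNum: t.seqNum})
			}
		}
	}
	return frags
}
```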
Pebble takes a different approach: it integrates range tombstone processing directly into the `mergingIter` structure. `mergingIter` is the internal structure which provides a merged view of the levels in an LSM. RocksDB has a similar class named `MergingIterator`. Internally, `mergingIter` maintains a heap over the levels in the LSM (note that each memtable and L0 table is a separate "level" in `mergingIter`). In RocksDB, `MergingIterator` knows nothing about range tombstones, and it is thus up to higher-level code to process range tombstones using `RangeDelAggregator`.

While the separation of `MergingIterator` and range tombstones seems reasonable at first glance, there is an optimization that RocksDB does not perform which is awkward with the `RangeDelAggregator` approach: skipping swaths of deleted keys. A range tombstone often shadows more than one key. Rather than iterating over the deleted keys, it is much quicker to seek to the end point of the range tombstone. The challenge in implementing this optimization is that a key might be newer than the range tombstone and thus shouldn't be skipped. An insight to be utilized is that the level structure itself provides sufficient information: a range tombstone at `Ln` is guaranteed to be newer than any key it overlaps in `Ln+1`.

Pebble utilizes the insight above to integrate range deletion processing with `mergingIter`. A `mergingIter` maintains a point iterator and a range deletion iterator per level in the LSM. In this context, every L0 table is a separate level, as is every memtable. Within a level, when a range deletion contains a point operation the sequence numbers must be checked to determine if the point operation is newer or older than the range deletion tombstone. The `mergingIter` maintains the invariant that the range deletion iterators for all levels newer than the current iteration key are positioned at the next (or previous, during reverse iteration) range deletion tombstone. We know those levels don't contain a range deletion tombstone that covers the current key, because if they did the current key would be deleted. The range deletion iterator for the current key's level is positioned at a range tombstone covering or past the current key. The position of all of the other range deletion iterators is unspecified. Whenever a key from those levels becomes the current key, their range deletion iterators need to be positioned. This lazy positioning avoids seeking the range deletion iterators for keys that are never considered.

For a full example, consider the following setup:

```
p0:               o
r0:             m---q

p1:              n p
r1:       g---k

p2:  b d    i
r2: a---e           q----v

p3:     e
r3:
```

The diagram above shows 4 levels, with `pX` indicating the point operations in a level and `rX` indicating the range tombstones.
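
Before walking through the example, the core check implied by the invariant above can be sketched as follows, reusing the `RangeTombstone` type from the earlier sketch. Level 0 is the newest level, and only the levels at or newer than the current key's level need to be consulted; the names are illustrative:

```go
import "bytes"

// levelIter bundles a level's point iterator with its range deletion iterator.
// Per the invariant above, rangeDel for levels newer than the current key's
// level points at the next tombstone at or past the current key; for the
// key's own level it points at a tombstone covering or past the key.
type levelIter struct {
	rangeDel *RangeTombstone // current tombstone for the level, or nil if none
}

// keyShadowed reports whether the current key, which came from
// levels[keyLevel] with sequence number keySeqNum, is deleted by a range
// tombstone.
func keyShadowed(levels []levelIter, keyLevel int, key []byte, keySeqNum uint64) bool {
	for i := 0; i <= keyLevel; i++ {
		t := levels[i].rangeDel
		if t == nil || bytes.Compare(key, t.Start) < 0 || bytes.Compare(key, t.End) >= 0 {
			continue // no tombstone in this level covers the key
		}
		if i < keyLevel {
			// A tombstone in a newer level is always newer than the key.
			return true
		}
		// Same level: compare sequence numbers.
		return keySeqNum < t.SeqNum
	}
	return false
}
```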
If we start iterating from the beginning, the first key we encounter is `b` in `p2`. When the `mergingIter` is pointing at a valid entry, the range deletion iterators for all of the levels less than the current key's level are positioned at the next range tombstone past the current key. So `r0` will point at `[m,q)` and `r1` at `[g,k)`. When the key `b` is encountered, we check to see if the current tombstone for `r0` or `r1` contains it, and whether the tombstone for `r2`, `[a,e)`, contains and is newer than `b`.

Advancing the iterator finds the next key at `d`. This is in the same level as the previous key `b`, so we don't have to reposition any of the range deletion iterators, but merely check whether `d` is now contained by any of the range tombstones at higher levels or has stepped past the range tombstone in its own level. In this case, there is nothing to be done.

Advancing the iterator again finds `e`. Since `e` comes from `p3`, we have to position the `r3` range deletion iterator, which is empty. `e` is past the `r2` tombstone of `[a,e)`, so we need to advance the `r2` range deletion iterator to `[q,v)`.

The next key is `i`. Because this key is in `p2`, a level above `e`, we don't have to reposition any range deletion iterators and instead see that `i` is covered by the range tombstone `[g,k)`. The iterator is immediately advanced to `n`, which is covered by the range tombstone `[m,q)`, causing the iterator to advance to `o`, which is visible.

## Flush and Compaction Pacing

Flushes and compactions in LSM trees are problematic because they contend with foreground traffic, resulting in write and read latency spikes. Without throttling the rate of flushes and compactions, they occur "as fast as possible" (which is not entirely true, since we have a `bytes_per_sync` option). This instantaneous usage of CPU and disk IO results in potentially huge latency spikes for writes and reads which occur in parallel to the flushes and compactions.

RocksDB attempts to solve this issue by offering an option to limit the speed of flushes and compactions. A maximum `bytes/sec` can be specified through the options, and background IO usage will be limited to the specified amount. Flushes are given priority over compactions, but they still use the same rate limiter. Though simple to implement and understand, this option is fragile for various reasons:

1) If the rate limit is configured too low, the DB will stall and write throughput will be affected.
2) If the rate limit is configured too high, the write and read latency spikes will persist.
3) A different configuration is needed per system depending on the speed of the storage device.
4) Write rates typically do not stay the same throughout the lifetime of the DB (higher throughput during certain times of the day, etc.), but the rate limit cannot be reconfigured at runtime.

RocksDB also offers an ["auto-tuned" rate limiter](https://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html) which uses a simple multiplicative-increase, multiplicative-decrease algorithm to dynamically adjust the background IO rate limit depending on how much of the rate limiter has been exhausted in an interval. This solves the problem of having a static rate limit, but Pebble attempts to improve on this with a different pacing mechanism.
Pebble's pacing mechanism uses separate rate limiters for flushes and compactions. Both the flush and compaction pacing mechanisms work by attempting to flush and compact only as fast as needed and no faster. This is achieved differently for flushes versus compactions.

For flush pacing, Pebble keeps the rate at which the memtable is flushed in line with the rate of user writes. This ensures that the disk IO used by flushes remains steady. When a mutable memtable becomes full and is marked immutable, it is typically flushed as fast as possible. Instead of flushing as fast as possible, Pebble looks at the total number of bytes in all the memtables (mutable + queue of immutables) and subtracts the number of bytes that have been flushed in the current flush. This number gives us the total number of bytes which remain to be flushed. If we keep this number steady at a constant level, we have the invariant that the flush rate is equal to the write rate.

When the number of bytes remaining to be flushed falls below our target level, we slow down the speed of flushing. We keep a minimum rate at which the memtable is flushed so that flushes proceed even if writes have stopped. When the number of bytes remaining to be flushed goes above our target level, we allow the flush to proceed as fast as possible, without applying any rate limiting. However, note that the second case would indicate that writes are occurring faster than the memtable can flush, which would be an unsustainable rate. The LSM would soon hit the memtable count stall condition and writes would be completely stopped.

For compaction pacing, Pebble uses an estimation of compaction debt, which is the number of bytes which need to be compacted before no further compactions are needed. This estimation is calculated by looking at the number of bytes that have been flushed by the current flush routine, adding those bytes to the size of the level 0 sstables, then seeing how many bytes exceed the target number of bytes for the level 0 sstables. We multiply the number of bytes exceeded by the level ratio and add that number to the compaction debt estimate. We repeat this process until the final level, which gives us a final compaction debt estimate for the entire LSM tree.
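
A sketch of that calculation, generalized down the LSM under the assumption that the excess bytes at each level spill into the next one; the function and parameter names are illustrative and the real accounting differs in detail:

```go
// estimateCompactionDebt estimates the number of bytes that must be compacted
// before no further compactions are needed. flushingBytes is the number of
// bytes flushed so far by the in-progress flush, levelSizes[i] and
// levelTargets[i] are the current and target sizes of level i, and levelRatio
// is the size ratio between adjacent levels.
func estimateCompactionDebt(flushingBytes int64, levelSizes, levelTargets []int64, levelRatio int64) int64 {
	var debt int64
	// Bytes arriving at L0: the in-progress flush plus the existing L0 sstables.
	incoming := flushingBytes
	for level := 0; level < len(levelSizes); level++ {
		incoming += levelSizes[level]
		excess := incoming - levelTargets[level]
		if excess <= 0 {
			break // this level is within its target; no further debt accrues
		}
		// Compacting the excess into the next level rewrites roughly
		// excess * levelRatio bytes.
		debt += excess * levelRatio
		// The excess bytes spill into the next level and are counted again there.
		incoming = excess
	}
	return debt
}
```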
Like with flush pacing, we want to keep the compaction debt at a constant level. This ensures that compactions occur only as fast as needed and no faster. If the compaction debt estimate falls below our target level, we slow down compactions. We maintain a minimum compaction rate so that compactions proceed even if flushes have stopped. If the compaction debt goes above our target level, we let compactions proceed as fast as possible without any rate limiting. Just like with flush pacing, this would indicate that writes are occurring faster than the background compactions can keep up with, which is an unsustainable rate. The LSM's read amplification would increase and the L0 file count stall condition would be hit.

With the combined flush and compaction pacing mechanisms, flushes and compactions only occur as fast as needed and no faster, which reduces latency spikes for user read and write operations.

## Write Throttling

RocksDB adds artificial delays to user writes when certain thresholds are met, such as `l0_slowdown_writes_threshold`. These artificial delays are applied when the system is close to stalling, in order to lessen the write pressure so that flushing and compactions can catch up. On the surface this seems good, since write stalls would seemingly be eliminated and replaced with gradual slowdowns. Closed-loop write latency benchmarks would show the elimination of abrupt write stalls, which seems desirable.

However, this doesn't do anything to improve latencies in an open-loop model, which is the model more likely to resemble real-world use cases. Artificial delays increase write latencies without a clear benefit. Write stalls in an open-loop system would indicate that writes are generated faster than the system could possibly handle, which adding artificial delays won't solve.

For this reason, Pebble doesn't add artificial delays to user writes, and writes are served as quickly as possible.

## Other Differences

* `internalIterator` API which minimizes indirect (virtual) function calls
* Previous pointers in the memtable and indexed batch skiplists
* Elision of per-key lower/upper bound checks in long range scans
* Improved `Iterator` API
  + `SeekPrefixGE` for prefix iteration
  + `SetBounds` for adjusting the bounds on an existing `Iterator`
* Simpler `Get` implementation