# Pebble vs RocksDB: Implementation Differences

RocksDB is a key-value store implemented using a Log-Structured Merge-Tree (LSM). This document is not a primer on LSMs. There exist some decent [introductions](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/) on the web, or try chapter 3 of [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321).

Pebble inherits the RocksDB file formats, has a similar API, and shares many implementation details, but it also has many differences that improve performance, reduce implementation complexity, or extend functionality. This document highlights some of the more important differences.

* [Internal Keys](#internal-keys)
* [Indexed Batches](#indexed-batches)
* [Large Batches](#large-batches)
* [Commit Pipeline](#commit-pipeline)
* [Range Deletions](#range-deletions)
* [Flush and Compaction Pacing](#flush-and-compaction-pacing)
* [Write Throttling](#write-throttling)
* [Other Differences](#other-differences)

## Internal Keys

The external RocksDB API accepts keys and values. Due to the LSM structure, keys are never updated in place, but overwritten with new versions. Inside RocksDB, these versioned keys are known as Internal Keys. An Internal Key is composed of the user-specified key, a sequence number, and a kind. On disk, sstables always store Internal Keys.

```
+-------------+------------+----------+
| UserKey (N) | SeqNum (7) | Kind (1) |
+-------------+------------+----------+
```

The `Kind` field indicates the type of key: set, merge, delete, etc.

While Pebble inherits the Internal Key encoding for format compatibility, it diverges from RocksDB in how it manages Internal Keys in its implementation. In RocksDB, Internal Keys are represented either in encoded form (as a string) or as a `ParsedInternalKey`. The latter is a struct with the components of the Internal Key as three separate fields.

```c++
struct ParsedInternalKey {
  Slice  user_key;
  uint64 seqnum;
  uint8  kind;
};
```

The component format is convenient: changing the `SeqNum` or `Kind` is a field assignment. Extracting the `UserKey` is a field reference. However, RocksDB tends to only use `ParsedInternalKey` locally. The major internal APIs, such as `InternalIterator`, operate using encoded internal keys (i.e. strings) for parameters and return values.

To give a concrete example of the overhead this causes, consider `Iterator::Seek(user_key)`. The external `Iterator` is implemented on top of an `InternalIterator`. `Iterator::Seek` ends up calling `InternalIterator::Seek`. Both Seek methods take a key, but `InternalIterator::Seek` expects an encoded Internal Key. This is both error-prone and expensive. The key passed to `Iterator::Seek` needs to be copied into a temporary string in order to append the `SeqNum` and `Kind`. In Pebble, Internal Keys are represented in memory using an `InternalKey` struct that is the analog of `ParsedInternalKey`. All internal APIs use `InternalKeys`, with the exception of the lowest level routines for decoding data from sstables.
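For illustration, the Pebble struct looks roughly like the following Go sketch (the actual definition in the Pebble source may differ in detail; the point is that the user key stays a plain byte slice while the sequence number and kind are packed into a single trailer field):

```go
// InternalKey is a sketch of Pebble's in-memory internal key representation.
type InternalKey struct {
	UserKey []byte
	Trailer uint64 // sequence number (upper 56 bits) and kind (lower 8 bits)
}

// SeqNum returns the sequence number component of the trailer.
func (k InternalKey) SeqNum() uint64 { return k.Trailer >> 8 }

// Kind returns the kind component of the trailer.
func (k InternalKey) Kind() uint8 { return uint8(k.Trailer & 0xff) }
```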
In Pebble, since the interfaces all take and return the `InternalKey` struct, we don't need to allocate to construct the Internal Key from the User Key, but RocksDB sometimes needs to allocate and encode (i.e. make a copy). The use of the encoded form also causes RocksDB to pass encoded keys to the comparator routines, sometimes decoding the keys multiple times during the course of processing.

## Indexed Batches

In RocksDB, a batch is the unit for all write operations. Even writing a single key is transformed internally to a batch. The batch internal representation is a contiguous byte buffer with a fixed 12-byte header, followed by a series of records.

```
+------------+-----------+--- ... ---+
| SeqNum (8) | Count (4) |  Entries  |
+------------+-----------+--- ... ---+
```

Each record has a 1-byte kind tag prefix, followed by 1 or 2 length-prefixed strings (varstring):

```
+----------+-----------------+-------------------+
| Kind (1) | Key (varstring) | Value (varstring) |
+----------+-----------------+-------------------+
```

(The `Kind` indicates whether there are 1 or 2 varstrings. `Set`, `Merge`, and `DeleteRange` have 2 varstrings, while `Delete` has 1.)

Adding a mutation to a batch involves appending a new record to the buffer. This format is extremely fast for writes, but the lack of indexing makes it untenable to use directly for reads. In order to support iteration, a separate indexing structure is created. Both RocksDB and Pebble use a skiplist for the indexing structure, but with a clever twist. Rather than the skiplist storing a copy of the key, it simply stores the offset of the record within the mutation buffer. The result is that the skiplist acts as a multi-map (i.e. a map that can have duplicate entries for a given key). The iteration order for this map is constructed so that records sort on key, and for equal keys they sort on descending offset. Newer records for the same key appear before older records.
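To make the offset-based index concrete, the comparison it performs can be sketched as follows. This is an illustration rather than the actual implementation: `decodeKey` is a hypothetical helper that parses the record at a given offset and returns its user key, and `bytes.Compare` is the standard library byte-slice comparison.

```go
// compare is a sketch of how a batch index orders two entries that are
// stored as offsets into the batch's byte buffer.
func compare(data []byte, aOffset, bOffset uint32) int {
	aKey := decodeKey(data, aOffset) // hypothetical helper
	bKey := decodeKey(data, bOffset)
	if c := bytes.Compare(aKey, bKey); c != 0 {
		return c // records sort on user key first
	}
	// Equal user keys sort on descending offset: the record added most
	// recently (largest offset) appears first during iteration.
	switch {
	case aOffset > bOffset:
		return -1
	case aOffset < bOffset:
		return 1
	default:
		return 0
	}
}
```

Decoding the keys on demand keeps the index small: it stores only offsets, never copies of the keys themselves.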
While the indexing structure for batches is nearly identical between RocksDB and Pebble, how the index structure is used is completely different. In RocksDB, a batch is indexed using the `WriteBatchWithIndex` class. The `WriteBatchWithIndex` class provides a `NewIteratorWithBase` method that allows iteration over the merged view of the batch contents and an underlying "base" iterator created from the database. `BaseDeltaIterator` contains logic to iterate over the batch entries and the base iterator in parallel which allows us to perform reads on a snapshot of the database as though the batch had been applied to it. On the surface this sounds reasonable, yet the implementation is incomplete. Merge and DeleteRange operations are not supported, because handling them is complex and requires duplicating logic that already exists inside RocksDB for normal iterator processing.

Pebble takes a different approach to iterating over a merged view of a batch's contents and the underlying database: it treats the batch as another level in the LSM. Recall that an LSM is composed of zero or more memtable layers and zero or more sstable layers. Internally, both RocksDB and Pebble contain a `MergingIterator` that knows how to merge the operations from different levels, including processing overwritten keys, merge operations, and delete range operations. The challenge with treating the batch as another level to be used by a `MergingIterator` is that the records in a batch do not have a sequence number. The sequence number in the batch header is not assigned until the batch is committed. The solution is to give the batch records temporary sequence numbers. We need these temporary sequence numbers to be larger than any other sequence number in the database so that the records in the batch are considered newer than any committed record. This is accomplished by reserving the high bit of the 56-bit sequence number as a marker for batch sequence numbers. The sequence number for a record in an uncommitted batch is:

```
RecordOffset | (1<<55)
```

Newer records in a given batch will have a larger sequence number than older records in the batch. And all of the records in a batch will have larger sequence numbers than any committed record in the database.

The end result is that Pebble's batch iterators support all of the functionality of regular database iterators with minimal additional code.

## Large Batches

The size of a batch is limited only by available memory, yet the required memory is not just the batch representation. When a batch is committed, the commit operation iterates over the records in the batch from oldest to newest and inserts them into the current memtable. The memtable is an in-memory structure that buffers mutations that have been committed (written to the Write Ahead Log), but not yet written to an sstable. Internally, a memtable uses a skiplist to index records. Each skiplist entry has overhead for the index links and other metadata that is a dozen bytes at minimum. A large batch composed of many small records can require twice as much memory when inserted into a memtable as it required in the batch. And note that this causes a temporary increase in memory requirements because the batch memory is not freed until it is completely committed.

A non-obvious implementation restriction present in both RocksDB and Pebble is that there is a one-to-one correspondence between WAL files and memtables. That is, a given WAL file has a single memtable associated with it and vice-versa. While this restriction could be removed, doing so is onerous and intricate. It should also be noted that committing a batch involves writing it to a single WAL file. The combination of restrictions results in a batch needing to be written entirely to a single memtable.

What happens if a batch is too large to fit in a memtable? Memtables are generally considered to have a fixed size, yet this is not actually true in RocksDB. In RocksDB, the memtable skiplist is implemented on top of an arena structure. An arena is composed of a list of fixed size chunks, with no upper limit set for the number of chunks that can be associated with an arena. So RocksDB handles large batches by allowing a memtable to grow beyond its configured size. Concretely, while RocksDB may be configured with a 64MB memtable size, a 1GB batch will cause the memtable to grow to accommodate it.
Functionally, this is good, though there is a practical problem: a large batch is first written to the WAL, and then added to the memtable. Adding the large batch to the memtable may consume so much memory that the system runs out of memory and is killed by the kernel. This can result in a death loop, because upon restarting the batch is read from the WAL and applied to the memtable again.

In Pebble, the memtable is also implemented using a skiplist on top of an arena. Significantly, the Pebble arena is a fixed size. While the RocksDB skiplist uses pointers, the Pebble skiplist uses offsets from the start of the arena. The fixed size arena means that the Pebble memtable cannot expand arbitrarily. A batch that is too large to fit in the memtable causes the current mutable memtable to be marked as immutable and the batch is wrapped in a `flushableBatch` structure and added to the list of immutable memtables. Because the `flushableBatch` is readable as another layer in the LSM, the batch commit can return as soon as the `flushableBatch` has been added to the immutable memtable list.

Internally, a `flushableBatch` provides iterator support by sorting the batch contents (the batch is sorted once, when it is added to the memtable list). Sorting the batch contents and insertion of the contents into a memtable have the same big-O time, but the constant factor dominates here. Sorting is significantly faster and uses significantly less memory due to not having to copy the batch records.

Note that an effect of this large batch support is that Pebble can be configured as an efficient on-disk sorter: specify a small memtable size, disable the WAL, and set a large L0 compaction threshold. In order to sort a large amount of data, create batches that are larger than the memtable size and commit them. When committed, these batches will not be inserted into a memtable, but instead sorted and then written out to L0. The fully sorted data can later be read and the normal merging process will take care of the final ordering.

## Commit Pipeline

The commit pipeline is the component which manages the steps in committing write batches, such as writing the batch to the WAL and applying its contents to the memtable. While simple conceptually, the commit pipeline is crucial for high performance. In the absence of concurrency, commit performance is limited by how fast a batch can be written (and synced) to the WAL and then added to the memtable, both of which are outside of the purview of the commit pipeline.

To understand the challenge here, it is useful to have a conception of the WAL (write-ahead log). The WAL contains a record of all of the batches that have been committed to the database. As a record is written to the WAL it is added to the memtable. Each record is assigned a sequence number which is used to distinguish newer updates from older ones. Conceptually the WAL looks like:

```
+--------------------------------------+
| Batch(SeqNum=1,Count=9,Records=...)  |
+--------------------------------------+
| Batch(SeqNum=10,Count=5,Records=...) |
+--------------------------------------+
| Batch(SeqNum=15,Count=7,Records=...) |
+--------------------------------------+
| ...                                  |
+--------------------------------------+
```
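Each batch consumes `Count` sequence numbers, one per record, which is why the sequence numbers in the diagram advance by the count of the preceding batch. Assigning them can be sketched as a single atomic addition (an illustration using Go's `sync/atomic`, not the exact code; `logSeqNum` is a hypothetical counter of the next unassigned sequence number):

```go
// assignSeqNum reserves `count` consecutive sequence numbers for a batch and
// returns the first one. The batch's records use seqNum, seqNum+1, ...
func assignSeqNum(logSeqNum *uint64, count uint32) (seqNum uint64) {
	n := uint64(count)
	return atomic.AddUint64(logSeqNum, n) - n
}
```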
Note that each WAL entry is precisely the batch representation described earlier in the [Indexed Batches](#indexed-batches) section. The monotonically increasing sequence numbers are a critical component in allowing RocksDB and Pebble to provide fast snapshot views of the database for reads.

If concurrent performance were not a concern, the commit pipeline could simply be a mutex which serialized writes to the WAL and application of the batch records to the memtable. Concurrent performance is a concern, though.

The primary challenge in concurrent performance in the commit pipeline is maintaining two invariants:

1. Batches need to be written to the WAL in sequence number order.
2. Batches need to be made visible for reads in sequence number order. This invariant arises from the use of a single sequence number which indicates which mutations are visible.

The second invariant deserves explanation. RocksDB and Pebble both keep track of a visible sequence number. This is the sequence number for which records in the database are visible during reads. The visible sequence number exists because committing a batch is an atomic operation, yet adding records to the memtable is done without an exclusive lock (the skiplists used by both Pebble and RocksDB are lock-free). When the records from a batch are being added to the memtable, a concurrent read operation may see those records, but will skip over them because they are newer than the visible sequence number. Once all of the records in the batch have been added to the memtable, the visible sequence number is atomically incremented.

So we have four steps in committing a write batch:

1. Write the batch to the WAL
2. Apply the mutations in the batch to the memtable
3. Bump the visible sequence number
4. (Optionally) sync the WAL

Writing the batch to the WAL is actually very fast as it is just a memory copy. Applying the mutations in the batch to the memtable is by far the most CPU-intensive part of the commit pipeline. Syncing the WAL is the most expensive from a wall clock perspective.

With that background out of the way, let's examine how RocksDB commits batches. This description is of the traditional commit pipeline in RocksDB (i.e. the one used by CockroachDB).

RocksDB achieves concurrency in the commit pipeline by grouping concurrently committed batches into a batch group. Each group is assigned a "leader" which is the first batch to be added to the group. The batch group is written atomically to the WAL by the leader thread, and then the individual batches making up the group are concurrently applied to the memtable. Lastly, the visible sequence number is bumped such that all of the batches in the group become visible in a single atomic step. While a batch group is being applied, other concurrent commits are added to a waiting list. When the group commit finishes, the waiting commits form the next group.

There are two criticisms of the batch grouping approach. The first is that forming a batch group involves copying batch contents. RocksDB partially alleviates this for large batches by placing a limit on the total size of a group. A large batch will end up in its own group and not be copied, but the criticism still applies for small batches.
Note that there are actually two copies here. The batch contents are concatenated together to form the group, and then the group contents are written into an in-memory buffer for the WAL before being written to disk.

The second criticism is about the thread synchronization points. Let's consider what happens to a commit which becomes the leader:

1. Lock commit mutex
2. Wait to become leader
3. Form (concatenate) batch group and write to the WAL
4. Notify followers to apply their batch to the memtable
5. Apply own batch to memtable
6. Wait for followers to finish
7. Bump visible sequence number
8. Unlock commit mutex
9. Notify followers that the commit is complete

The follower's set of operations looks like:

1. Lock commit mutex
2. Wait to become follower
3. Wait to be notified that it is time to apply batch
4. Unlock commit mutex
5. Apply batch to memtable
6. Wait to be notified that commit is complete

The thread synchronization points (all of the waits and notifies) are overhead. Reducing that overhead can improve performance.

The Pebble commit pipeline addresses both criticisms. The main innovation is a commit queue that mirrors the commit order. The Pebble commit pipeline looks like:

1. Lock commit mutex
   * Add batch to commit queue
   * Assign batch sequence number
   * Write batch to the WAL
2. Unlock commit mutex
3. Apply batch to memtable (concurrently)
4. Publish batch sequence number

Pebble does not use the concept of a batch group. Each batch is individually written to the WAL, but note that the WAL write is just a memory copy into an internal buffer in the WAL.

Step 4 deserves further scrutiny as it is where the invariant on the visible batch sequence number is maintained. Publishing the batch sequence number cannot simply bump the visible sequence number because batches with earlier sequence numbers may still be applying to the memtable. If we were to ratchet the visible sequence number without waiting for those applies to finish, a concurrent reader could see partial batch contents. Note that RocksDB has experimented with allowing these semantics with its unordered writes option.

We want to retain the atomic visibility of batch commits. The publish batch sequence number step needs to ensure that we don't ratchet the visible sequence number until all batches with earlier sequence numbers have applied. Enter the commit queue: a lock-free single-producer, multi-consumer queue. Batches are added to the commit queue with the commit mutex held, ensuring the same order as the sequence number assignment. After a batch finishes applying to the memtable, the committer atomically marks the batch as applied. It then removes the prefix of applied batches from the commit queue, bumping the visible sequence number and marking each removed batch as committed (via a `sync.WaitGroup`). If the first batch in the commit queue has not been applied, we wait for our batch to be committed, relying on another concurrent committer to perform the visible sequence number ratcheting for our batch. We know a concurrent commit is taking place because if there was only one batch committing it would be at the head of the commit queue.
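A minimal sketch of this publish step is shown below. The names and structure are illustrative rather than Pebble's actual code: `commitQueue` (with `peek` and `dequeue`) stands in for the lock-free queue, and the handling of concurrent dequeuers that a real lock-free queue requires is elided.

```go
// Batch is an illustrative stand-in for a batch in the commit pipeline.
type Batch struct {
	seqNum  uint64
	count   uint32
	applied atomic.Bool    // set once the batch has been applied to the memtable
	commit  sync.WaitGroup // Add(1) at commit start, Done() once published
}

// publish marks b as applied, then pops the prefix of applied batches off
// the commit queue, ratcheting the visible sequence number and signalling
// each popped batch's committer. commitQueue is a hypothetical stand-in for
// Pebble's lock-free queue: peek returns the head without removing it, and
// dequeue removes it.
func publish(queue *commitQueue, visibleSeqNum *uint64, b *Batch) {
	b.applied.Store(true)
	for {
		head := queue.peek()
		if head == nil || !head.applied.Load() {
			// An earlier batch is still applying to the memtable; its
			// committer will publish on our behalf when it finishes.
			break
		}
		queue.dequeue()
		// Advance the visible sequence number past head's records, then
		// signal head's committer that its commit is complete.
		atomic.StoreUint64(visibleSeqNum, head.seqNum+uint64(head.count))
		head.commit.Done()
	}
	// Wait until our batch has been published, possibly by us above.
	b.commit.Wait()
}
```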
There are two possibilities when publishing a sequence number. The first is that there is an unapplied batch at the head of the queue. Consider the following scenario where we're trying to publish the sequence number for batch `B`:

```
+---------------+-------------+---------------+-----+
| A (unapplied) | B (applied) | C (unapplied) | ... |
+---------------+-------------+---------------+-----+
```

The publish routine will see that `A` is unapplied and then simply wait for `B`'s done `sync.WaitGroup` to be signalled. This is safe because `A` must still be committing. And if `A` has concurrently been marked as applied, the goroutine publishing `A` will then publish `B`. What happens when `A` publishes its sequence number? The commit queue state becomes:

```
+-------------+-------------+---------------+-----+
| A (applied) | B (applied) | C (unapplied) | ... |
+-------------+-------------+---------------+-----+
```

The publish routine pops `A` from the queue, ratchets the sequence number, then pops `B` and ratchets the sequence number again, and then finds `C` and stops. A detail that is important to notice is that the committer for batch `B` didn't have to do any more work. An alternative approach would be to have `B` wake up and ratchet its own sequence number, but that would serialize the remainder of the commit queue behind that goroutine waking up.

The commit queue reduces the number of thread synchronization operations required to commit a batch. There is no leader to notify, or followers to wait for. A commit either publishes its own sequence number, or performs one synchronization operation to wait for a concurrent committer to publish its sequence number.

## Range Deletions

Deletion of an individual key in RocksDB and Pebble is accomplished by writing a deletion tombstone. A deletion tombstone shadows an existing value for a key, causing reads to treat the key as not present. The deletion tombstone mechanism works well for deleting small sets of keys, but what happens if you want to delete all of the keys within a range of keys that might number in the thousands or millions? A range deletion is an operation which deletes an entire range of keys with a single record. In contrast to a point deletion tombstone which specifies a single key, a range deletion tombstone (a.k.a. range tombstone) specifies a start key (inclusive) and an end key (exclusive). This single record is much faster to write than thousands or millions of point deletion tombstones, and can be done blindly -- without iterating over the keys that need to be deleted. The downside to range tombstones is that they require additional processing during reads. How the processing of range tombstones is done significantly affects both the complexity of the implementation, and the efficiency of read operations in the presence of range tombstones.

A range tombstone is composed of a start key, end key, and sequence number. Any key that falls within the range is considered deleted if the key's sequence number is less than or equal to the range tombstone's sequence number.
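In Go, that check can be sketched as follows (an illustration, not library code; `cmp` stands in for the user-key comparator):

```go
// Tombstone is an illustrative range tombstone covering [Start, End) at SeqNum.
type Tombstone struct {
	Start, End []byte
	SeqNum     uint64
}

// covers reports whether the tombstone deletes userKey as observed at seqNum:
// the key must fall within [Start, End) and be no newer than the tombstone.
func (t Tombstone) covers(cmp func(a, b []byte) int, userKey []byte, seqNum uint64) bool {
	return cmp(t.Start, userKey) <= 0 &&
		cmp(userKey, t.End) < 0 &&
		seqNum <= t.SeqNum
}
```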
RocksDB stores range tombstones segregated from point operations in a special range deletion block within each sstable. Conceptually, the range tombstones stored within an sstable are truncated to the boundaries of the sstable, though there are complexities that cause this to not actually be physically true.

In RocksDB, the main structure implementing range tombstone processing is the `RangeDelAggregator`. Each read operation and iterator has its own `RangeDelAggregator` configured for the sequence number the read is taking place at. The initial implementation of `RangeDelAggregator` built up a "skyline" for the range tombstones visible at the read sequence number.

```
10    +---+
 9    |   |
 8    |   |
 7    |   +----+
 6    |        |
 5  +-+        |  +----+
 4  |          |  |    |
 3  |          |  |    +---+
 2  |          |  |        |
 1  |          |  |        |
 0  |          |  |        |
   abcdefghijklmnopqrstuvwxyz
```

The above diagram shows the skyline created for the range tombstones `[b,j)#5`, `[d,h)#10`, `[f,m)#7`, `[p,u)#5`, and `[t,y)#3`. The skyline is queried for each key read to see if the key should be considered deleted or not. The skyline structure is stored in a binary tree, making the queries an O(log n) operation in the number of tombstones, though there is an optimization to make this O(1) for `next`/`prev` iteration. Note that the skyline representation loses information about the range tombstones. This requires the structure to be rebuilt on every read which has a significant performance impact.

The initial skyline range tombstone implementation has since been replaced with a more efficient lookup structure. See the [DeleteRange](https://rocksdb.org/blog/2018/11/21/delete-range.html) blog post for a good description of both the original implementation and the new (v2) implementation. The key change in the new implementation is to "fragment" the range tombstones that are stored in an sstable. The fragmented range tombstones provide the same benefit as the skyline representation: the ability to binary search the fragments in order to find the tombstone covering a key. But unlike the skyline approach, the fragmented tombstones can be cached on a per-sstable basis. In the v2 approach, `RangeDelAggregator` keeps track of the fragmented range tombstones for each sstable encountered during a read or iterator, and logically merges them together.

Fragmenting range tombstones involves splitting range tombstones at overlap points. Let's consider the tombstones in the skyline example above:

```
10:    d---h
 7:      f------m
 5:  b-------j     p----u
 3:                    t----y
```

Fragmenting the range tombstones at the overlap points creates a larger number of range tombstones:

```
10:    d-f-h
 7:      f-h-j--m
 5:  b-d-f-h-j     p---tu
 3:                    tu---y
```

While the number of tombstones is larger there is a significant advantage: we can order the tombstones by their start key and then binary search to find the set of tombstones overlapping a particular point. This is possible because, due to the fragmenting, all the tombstones that overlap a range of keys will have the same start and end key. The v2 `RangeDelAggregator` and associated classes perform fragmentation of range tombstones stored in each sstable and those fragmented tombstones are then cached.
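As a sketch of the lookup that fragmentation enables (illustrative code, not RocksDB's implementation; it reuses the `Tombstone` type and `cmp` comparator from the earlier sketch, plus `sort.Search` from the standard library):

```go
// overlapping binary searches a slice of fragmented tombstones, sorted by
// start key, for the group of fragments that contains userKey. Because
// fragments sharing a start key also share an end key, the covering
// fragments are exactly the run of fragments with that start key (one per
// sequence number).
func overlapping(cmp func(a, b []byte) int, fragments []Tombstone, userKey []byte) []Tombstone {
	// i is the index of the first fragment whose start key is > userKey.
	i := sort.Search(len(fragments), func(j int) bool {
		return cmp(fragments[j].Start, userKey) > 0
	})
	if i == 0 || cmp(userKey, fragments[i-1].End) >= 0 {
		return nil // no fragment contains userKey
	}
	// Walk back to the beginning of the run sharing fragments[i-1]'s start key.
	j := i - 1
	for j > 0 && cmp(fragments[j-1].Start, fragments[i-1].Start) == 0 {
		j--
	}
	return fragments[j:i]
}
```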
In summary, in RocksDB `RangeDelAggregator` acts as an oracle for answering whether a key is deleted at a particular sequence number. Due to caching of fragmented tombstones, the v2 implementation of `RangeDelAggregator` is significantly faster to populate than v1, yet the overall approach to processing range tombstones remains similar.

Pebble takes a different approach: it integrates range tombstone processing directly into the `mergingIter` structure. `mergingIter` is the internal structure which provides a merged view of the levels in an LSM. RocksDB has a similar class named `MergingIterator`. Internally, `mergingIter` maintains a heap over the levels in the LSM (note that each memtable and L0 table is a separate "level" in `mergingIter`). In RocksDB, `MergingIterator` knows nothing about range tombstones, and it is thus up to higher-level code to process range tombstones using `RangeDelAggregator`.

While the separation of `MergingIterator` and range tombstones seems reasonable at first glance, there is an optimization that RocksDB does not perform which is awkward with the `RangeDelAggregator` approach: skipping swaths of deleted keys. A range tombstone often shadows more than one key. Rather than iterating over the deleted keys, it is much quicker to seek to the end point of the range tombstone. The challenge in implementing this optimization is that a key might be newer than the range tombstone and thus shouldn't be skipped. An insight to be utilized is that the level structure itself provides sufficient information. A range tombstone at `Ln` is guaranteed to be newer than any key it overlaps in `Ln+1`.

Pebble utilizes the insight above to integrate range deletion processing with `mergingIter`. A `mergingIter` maintains a point iterator and a range deletion iterator per level in the LSM. In this context, every L0 table is a separate level, as is every memtable. Within a level, when a range deletion contains a point operation the sequence numbers must be checked to determine if the point operation is newer or older than the range deletion tombstone. The `mergingIter` maintains the invariant that the range deletion iterators for all levels newer than the current iteration key are positioned at the next (or previous during reverse iteration) range deletion tombstone. We know those levels don't contain a range deletion tombstone that covers the current key because if they did the current key would be deleted. The range deletion iterator for the current key's level is positioned at a range tombstone covering or past the current key. The position of all of the other range deletion iterators is unspecified. Whenever a key from those levels becomes the current key, their range deletion iterators need to be positioned. This lazy positioning avoids seeking the range deletion iterators for keys that are never considered.

For a full example, consider the following setup:

```
p0:                o
r0:              m---q

p1:               n p
r1:        g---k

p2:   b d    i
r2:  a---e           q----v

p3:      e
r3:
```

The diagram above shows 4 levels, with `pX` indicating the point operations in a level and `rX` indicating the range tombstones.

If we start iterating from the beginning, the first key we encounter is `b` in `p2`.
When the mergingIter is pointing at a valid entry, the range deletion iterators for all of the levels less than the current key's level are positioned at the next range tombstone past the current key. So `r0` will point at `[m,q)` and `r1` at `[g,k)`. When the key `b` is encountered, we check to see if the current tombstone for `r0` or `r1` contains it, and whether the tombstone for `r2`, `[a,e)`, contains and is newer than `b`.

Advancing the iterator finds the next key at `d`. This is in the same level as the previous key `b` so we don't have to reposition any of the range deletion iterators, but merely check whether `d` is now contained by any of the range tombstones at higher levels or has stepped past the range tombstone in its own level. In this case, there is nothing to be done.

Advancing the iterator again finds `e`. Since `e` comes from `p3`, we have to position the `r3` range deletion iterator, which is empty. `e` is past the `r2` tombstone of `[a,e)` so we need to advance the `r2` range deletion iterator to `[q,v)`.

The next key is `i`. Because this key is in `p2`, a level above `e`, we don't have to reposition any range deletion iterators and instead see that `i` is covered by the range tombstone `[g,k)`. The iterator is immediately advanced to `n` which is covered by the range tombstone `[m,q)` causing the iterator to advance to `o` which is visible.

## Flush and Compaction Pacing

Flushes and compactions in LSM trees are problematic because they contend with foreground traffic, resulting in write and read latency spikes. Without throttling the rate of flushes and compactions, they occur "as fast as possible" (which is not entirely true, since we have a `bytes_per_sync` option). This instantaneous usage of CPU and disk IO results in potentially huge latency spikes for writes and reads which occur in parallel to the flushes and compactions.

RocksDB attempts to solve this issue by offering an option to limit the speed of flushes and compactions. A maximum `bytes/sec` can be specified through the options, and background IO usage will be limited to the specified amount. Flushes are given priority over compactions, but they still use the same rate limiter. Though simple to implement and understand, this option is fragile for various reasons:

1) If the rate limit is configured too low, the DB will stall and write throughput will be affected.
2) If the rate limit is configured too high, the write and read latency spikes will persist.
3) A different configuration is needed per system depending on the speed of the storage device.
4) Write rates typically do not stay the same throughout the lifetime of the DB (higher throughput during certain times of the day, etc.), but the rate limit cannot be changed at runtime.

RocksDB also offers an ["auto-tuned" rate limiter](https://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html) which uses a simple multiplicative-increase, multiplicative-decrease algorithm to dynamically adjust the background IO rate limit depending on how much of the rate limiter has been exhausted in an interval. This solves the problem of having a static rate limit, but Pebble attempts to improve on this with a different pacing mechanism.
Pebble's pacing mechanism uses separate rate limiters for flushes and compactions. Both the flush and compaction pacing mechanisms work by attempting to flush and compact only as fast as needed and no faster. This is achieved differently for flushes versus compactions.

For flush pacing, Pebble keeps the rate at which the memtable is flushed in line with the rate of user writes. This ensures that the disk IO used by flushes remains steady. When a mutable memtable becomes full and is marked immutable, it would typically be flushed as fast as possible. Instead of flushing as fast as possible, we look at the total number of bytes in all the memtables (mutable + queue of immutables) and subtract the number of bytes that have been flushed in the current flush. This number gives us the total number of bytes which remain to be flushed. If we keep this number steady at a constant level, we have the invariant that the flush rate is equal to the write rate.

When the number of bytes remaining to be flushed falls below our target level, we slow down the speed of flushing. We keep a minimum rate at which the memtable is flushed so that flushes proceed even if writes have stopped. When the number of bytes remaining to be flushed goes above our target level, we allow the flush to proceed as fast as possible, without applying any rate limiting. However, note that the second case would indicate that writes are occurring faster than the memtable can flush, which would be an unsustainable rate. The LSM would soon hit the memtable count stall condition and writes would be completely stopped.

For compaction pacing, Pebble uses an estimation of compaction debt, which is the number of bytes which need to be compacted before no further compactions are needed. This estimation is calculated by looking at the number of bytes that have been flushed by the current flush routine, adding those bytes to the size of the level 0 sstables, then seeing how many bytes exceed the target number of bytes for the level 0 sstables. We multiply the number of bytes exceeded by the level ratio and add that number to the compaction debt estimate. We repeat this process until the final level, which gives us a final compaction debt estimate for the entire LSM tree.

Like with flush pacing, we want to keep the compaction debt at a constant level. This ensures that compactions occur only as fast as needed and no faster. If the compaction debt estimate falls below our target level, we slow down compactions. We maintain a minimum compaction rate so that compactions proceed even if flushes have stopped. If the compaction debt goes above our target level, we let compactions proceed as fast as possible without any rate limiting. Just like with flush pacing, this would indicate that writes are occurring faster than the background compactions can keep up with, which is an unsustainable rate. The LSM's read amplification would increase and the L0 file count stall condition would be hit.

With the combined flush and compaction pacing mechanisms, flushes and compactions only occur as fast as needed and no faster, which reduces latency spikes for user read and write operations.
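For illustration, the compaction debt estimate described above can be sketched as follows. This is a simplification, not Pebble's actual code: `levelSize`, `levelTarget`, and `levelRatio` are assumed inputs, and excess bytes are simply carried into the next level.

```go
// estimateCompactionDebt is a sketch of the compaction debt estimate: bytes
// in excess of a level's target must be compacted into the next (roughly
// levelRatio times larger) level, and the process cascades down the tree.
func estimateCompactionDebt(flushedBytes uint64, levelSize, levelTarget []uint64, levelRatio uint64) uint64 {
	var debt uint64
	carry := flushedBytes // bytes on their way into the next level
	for i := range levelSize {
		size := levelSize[i] + carry
		if size <= levelTarget[i] {
			carry = 0
			continue
		}
		excess := size - levelTarget[i]
		// Compacting the excess into the next level rewrites roughly
		// levelRatio times as many bytes.
		debt += excess * levelRatio
		carry = excess
	}
	return debt
}
```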
## Write Throttling

RocksDB adds artificial delays to user writes when certain thresholds are met, such as `l0_slowdown_writes_threshold`. These artificial delays occur when the system is close to stalling, to lessen the write pressure so that flushing and compactions can catch up. On the surface this seems good, since write stalls would seemingly be eliminated and replaced with gradual slowdowns. Closed loop write latency benchmarks would show the elimination of abrupt write stalls, which seems desirable.

However, this doesn't do anything to improve latencies in an open loop model, which is the model more likely to resemble real world use cases. Artificial delays increase write latencies without a clear benefit. Write stalls in an open loop system would indicate that writes are generated faster than the system could possibly handle, which adding artificial delays won't solve.

For this reason, Pebble doesn't add artificial delays to user writes and writes are served as quickly as possible.

## Other Differences

* `internalIterator` API which minimizes indirect (virtual) function calls
* Previous pointers in the memtable and indexed batch skiplists
* Elision of per-key lower/upper bound checks in long range scans
* Weak cache references remove the need to pin index and filter blocks in memory
* Improved `Iterator` API
  + `SeekPrefixGE` for prefix iteration
  + `SetBounds` for adjusting the bounds on an existing `Iterator`
* Simpler `Get` implementation