
     1  # Pebble vs RocksDB: Implementation Differences
     2  
     3  RocksDB is a key-value store implemented using a Log-Structured
     4  Merge-Tree (LSM). This document is not a primer on LSMs. There exist
     5  some decent
     6  [introductions](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/)
     7  on the web, or try chapter 3 of [Designing Data-Intensive
     8  Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321).
     9  
    10  Pebble inherits the RocksDB file formats, has a similar API, and
    11  shares many implementation details, but it also has many differences
    12  that improve performance, reduce implementation complexity, or extend
    13  functionality. This document highlights some of the more important
    14  differences.
    15  
    16  * [Internal Keys](#internal-keys)
    17  * [Indexed Batches](#indexed-batches)
    18  * [Large Batches](#large-batches)
    19  * [Commit Pipeline](#commit-pipeline)
    20  * [Range Deletions](#range-deletions)
    21  * [Flush and Compaction Pacing](#flush-and-compaction-pacing)
    22  * [Write Throttling](#write-throttling)
    23  * [Other Differences](#other-differences)
    24  
    25  ## Internal Keys
    26  
    27  The external RocksDB API accepts keys and values. Due to the LSM
    28  structure, keys are never updated in place, but overwritten with new
    29  versions. Inside RocksDB, these versioned keys are known as Internal
Keys. An Internal Key is composed of the user-specified key, a
sequence number, and a kind. On disk, sstables always store Internal
    32  Keys.
    33  
    34  ```
    35    +-------------+------------+----------+
    36    | UserKey (N) | SeqNum (7) | Kind (1) |
    37    +-------------+------------+----------+
    38  ```
    39  
    40  The `Kind` field indicates the type of key: set, merge, delete, etc.
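
To make the layout concrete, here is a minimal Go sketch of the
encoding, assuming the trailer packs the sequence number into the upper
56 bits and the kind into the low byte of a little-endian 64-bit value
appended to the user key. The kind constants are illustrative
placeholders, not the exact values used by the file format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Illustrative kind values; the real on-disk numbering is defined by the
// RocksDB/Pebble file format and is not reproduced here.
const (
	kindDelete uint8 = 0
	kindSet    uint8 = 1
	kindMerge  uint8 = 2
)

// encodeInternalKey appends an 8-byte trailer to the user key, with the
// sequence number in the upper 56 bits and the kind in the low byte,
// mirroring the UserKey/SeqNum/Kind layout in the diagram above.
func encodeInternalKey(userKey []byte, seqNum uint64, kind uint8) []byte {
	buf := make([]byte, 0, len(userKey)+8)
	buf = append(buf, userKey...)
	var trailer [8]byte
	binary.LittleEndian.PutUint64(trailer[:], seqNum<<8|uint64(kind))
	return append(buf, trailer[:]...)
}

func main() {
	ik := encodeInternalKey([]byte("user-key"), 42, kindSet)
	fmt.Printf("%d bytes: % x\n", len(ik), ik)
}
```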
    41  
    42  While Pebble inherits the Internal Key encoding for format
    43  compatibility, it diverges from RocksDB in how it manages Internal
    44  Keys in its implementation. In RocksDB, Internal Keys are represented
    45  either in encoded form (as a string) or as a `ParsedInternalKey`. The
    46  latter is a struct with the components of the Internal Key as three
    47  separate fields.
    48  
    49  ```c++
    50  struct ParsedInternalKey {
    51    Slice  user_key;
    52    uint64 seqnum;
    53    uint8  kind;
    54  };
    55  ```
    56  
The component format is convenient: changing the `SeqNum` or `Kind` is a
field assignment, and extracting the `UserKey` is a field
    59  reference. However, RocksDB tends to only use `ParsedInternalKey`
    60  locally. The major internal APIs, such as `InternalIterator`, operate
    61  using encoded internal keys (i.e. strings) for parameters and return
    62  values.
    63  
    64  To give a concrete example of the overhead this causes, consider
    65  `Iterator::Seek(user_key)`. The external `Iterator` is implemented on
    66  top of an `InternalIterator`. `Iterator::Seek` ends up calling
    67  `InternalIterator::Seek`. Both Seek methods take a key, but
    68  `InternalIterator::Seek` expects an encoded Internal Key. This is both
    69  error prone and expensive. The key passed to `Iterator::Seek` needs to
    70  be copied into a temporary string in order to append the `SeqNum` and
    71  `Kind`. In Pebble, Internal Keys are represented in memory using an
    72  `InternalKey` struct that is the analog of `ParsedInternalKey`. All
    73  internal APIs use `InternalKeys`, with the exception of the lowest
    74  level routines for decoding data from sstables. In Pebble, since the
    75  interfaces all take and return the `InternalKey` struct, we don’t need
    76  to allocate to construct the Internal Key from the User Key, but
    77  RocksDB sometimes needs to allocate, and encode (i.e. make a
    78  copy). The use of the encoded form also causes RocksDB to pass encoded
    79  keys to the comparator routines, sometimes decoding the keys multiple
    80  times during the course of processing.
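
The following Go sketch shows why the struct form is cheaper. The names
are illustrative rather than Pebble's exact API: building a seek key is
just a struct literal wrapping the caller's user key, and the comparator
compares the components directly instead of repeatedly decoding an
encoded key.

```go
package main

import (
	"bytes"
	"fmt"
)

// Rough analog of Pebble's InternalKey / RocksDB's ParsedInternalKey.
type internalKey struct {
	UserKey []byte // points at caller-owned bytes; no copy required
	SeqNum  uint64
	Kind    uint8
}

// compareInternalKeys orders internal keys the way the LSM requires:
// ascending by user key and, for equal user keys, descending by
// (sequence number, kind) so that newer entries sort first.
func compareInternalKeys(a, b internalKey) int {
	if c := bytes.Compare(a.UserKey, b.UserKey); c != 0 {
		return c
	}
	at := a.SeqNum<<8 | uint64(a.Kind)
	bt := b.SeqNum<<8 | uint64(b.Kind)
	switch {
	case at > bt:
		return -1 // newer sorts first
	case at < bt:
		return 1
	}
	return 0
}

func main() {
	// A seek key for user key "foo" at sequence number 10: no allocation
	// or encoding, just a struct literal.
	seek := internalKey{UserKey: []byte("foo"), SeqNum: 10}
	old := internalKey{UserKey: []byte("foo"), SeqNum: 5, Kind: 1}
	fmt.Println(compareInternalKeys(seek, old)) // -1: seek sorts before the older entry
}
```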
    81  
    82  ## Indexed Batches
    83  
    84  In RocksDB, a batch is the unit for all write operations. Even writing
    85  a single key is transformed internally to a batch. The batch internal
    86  representation is a contiguous byte buffer with a fixed 12-byte
    87  header, followed by a series of records.
    88  
    89  ```
    90    +------------+-----------+--- ... ---+
    91    | SeqNum (8) | Count (4) |  Entries  |
    92    +------------+-----------+--- ... ---+
    93  ```
    94  
    95  Each record has a 1-byte kind tag prefix, followed by 1 or 2 length
    96  prefixed strings (varstring):
    97  
    98  ```
    99    +----------+-----------------+-------------------+
   100    | Kind (1) | Key (varstring) | Value (varstring) |
   101    +----------+-----------------+-------------------+
   102  ```
   103  
   104  (The `Kind` indicates if there are 1 or 2 varstrings. `Set`, `Merge`,
   105  and `DeleteRange` have 2 varstrings, while `Delete` has 1.)
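
As a sketch of how cheap this format is to produce, appending a `Set`
record is just a handful of appends. The kind value is a placeholder and
the varint encoding is simplified.

```go
package batchsketch

import "encoding/binary"

// appendSetRecord appends a Set record to a batch buffer using the
// layout in the diagram above: a kind byte followed by a varint
// length-prefixed key and value.
func appendSetRecord(buf, key, value []byte) []byte {
	const kindSet = 1 // illustrative placeholder
	buf = append(buf, kindSet)
	buf = binary.AppendUvarint(buf, uint64(len(key)))
	buf = append(buf, key...)
	buf = binary.AppendUvarint(buf, uint64(len(value)))
	return append(buf, value...)
}
```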
   106  
   107  Adding a mutation to a batch involves appending a new record to the
   108  buffer. This format is extremely fast for writes, but the lack of
   109  indexing makes it untenable to use directly for reads. In order to
   110  support iteration, a separate indexing structure is created. Both
   111  RocksDB and Pebble use a skiplist for the indexing structure, but with
   112  a clever twist. Rather than the skiplist storing a copy of the key, it
   113  simply stores the offset of the record within the mutation buffer. The
result is that the skiplist acts as a multi-map (i.e. a map that can have
   115  duplicate entries for a given key). The iteration order for this map
   116  is constructed so that records sort on key, and for equal keys they
   117  sort on descending offset. Newer records for the same key appear
   118  before older records.
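
A rough Go sketch of that ordering, assuming a `keyAt` helper that
decodes the user key of the record starting at a given offset:

```go
package batchsketch

import "bytes"

// batchIndexLess sketches the ordering used by the batch index skiplist:
// the skiplist stores offsets into the batch buffer rather than keys,
// keys are decoded on demand, and ties are broken by descending offset
// so that newer records for the same key sort first.
func batchIndexLess(data []byte, aOff, bOff uint32,
	keyAt func(data []byte, off uint32) []byte) bool {
	if c := bytes.Compare(keyAt(data, aOff), keyAt(data, bOff)); c != 0 {
		return c < 0
	}
	return aOff > bOff // equal keys: the newer (larger) offset sorts first
}
```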
   119  
   120  While the indexing structure for batches is nearly identical between
   121  RocksDB and Pebble, how the index structure is used is completely
   122  different. In RocksDB, a batch is indexed using the
   123  `WriteBatchWithIndex` class. The `WriteBatchWithIndex` class provides
   124  a `NewIteratorWithBase` method that allows iteration over the merged
   125  view of the batch contents and an underlying "base" iterator created
   126  from the database. `BaseDeltaIterator` contains logic to iterate over
the batch entries and the base iterator in parallel, which allows us to
   128  perform reads on a snapshot of the database as though the batch had
   129  been applied to it. On the surface this sounds reasonable, yet the
   130  implementation is incomplete. Merge and DeleteRange operations are not
supported. The reason they are not supported is that handling them
   132  is complex and requires duplicating logic that already exists inside
   133  RocksDB for normal iterator processing.
   134  
   135  Pebble takes a different approach to iterating over a merged view of a
   136  batch's contents and the underlying database: it treats the batch as
   137  another level in the LSM. Recall that an LSM is composed of zero or
   138  more memtable layers and zero or more sstable layers. Internally, both
   139  RocksDB and Pebble contain a `MergingIterator` that knows how to merge
   140  the operations from different levels, including processing overwritten
   141  keys, merge operations, and delete range operations. The challenge
   142  with treating the batch as another level to be used by a
   143  `MergingIterator` is that the records in a batch do not have a
   144  sequence number. The sequence number in the batch header is not
   145  assigned until the batch is committed. The solution is to give the
   146  batch records temporary sequence numbers. We need these temporary
   147  sequence numbers to be larger than any other sequence number in the
   148  database so that the records in the batch are considered newer than
   149  any committed record. This is accomplished by reserving the high-bit
   150  in the 56-bit sequence number for use as a marker for batch sequence
   151  numbers. The sequence number for a record in an uncommitted batch is:
   152  
   153  ```
   154    RecordOffset | (1<<55)
   155  ```
   156  
   157  Newer records in a given batch will have a larger sequence number than
   158  older records in the batch. And all of the records in a batch will
   159  have larger sequence numbers than any committed record in the
   160  database.
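
Expressed in Go, the temporary sequence numbers are a direct
transcription of the formula above:

```go
package batchsketch

// The high bit of the 56-bit sequence number space marks uncommitted
// batch records.
const batchSeqNumBit = uint64(1) << 55

// batchSeqNum returns the temporary sequence number for an uncommitted
// batch record at the given offset within the batch buffer.
func batchSeqNum(recordOffset uint64) uint64 {
	return recordOffset | batchSeqNumBit
}

// isBatchSeqNum reports whether a sequence number refers to an
// uncommitted batch record rather than a committed record.
func isBatchSeqNum(seqNum uint64) bool {
	return seqNum&batchSeqNumBit != 0
}
```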
   161  
   162  The end result is that Pebble's batch iterators support all of the
   163  functionality of regular database iterators with minimal additional
   164  code.
   165  
   166  ## Large Batches
   167  
   168  The size of a batch is limited only by available memory, yet the
   169  required memory is not just the batch representation. When a batch is
   170  committed, the commit operation iterates over the records in the batch
   171  from oldest to newest and inserts them into the current memtable. The
   172  memtable is an in-memory structure that buffers mutations that have
   173  been committed (written to the Write Ahead Log), but not yet written
   174  to an sstable. Internally, a memtable uses a skiplist to index
   175  records. Each skiplist entry has overhead for the index links and
   176  other metadata that is a dozen bytes at minimum. A large batch
   177  composed of many small records can require twice as much memory when
inserted into a memtable as it required in the batch. And note that
   179  this causes a temporary increase in memory requirements because the
   180  batch memory is not freed until it is completely committed.
   181  
   182  A non-obvious implementation restriction present in both RocksDB and
   183  Pebble is that there is a one-to-one correspondence between WAL files
   184  and memtables. That is, a given WAL file has a single memtable
   185  associated with it and vice-versa. While this restriction could be
   186  removed, doing so is onerous and intricate. It should also be noted
   187  that committing a batch involves writing it to a single WAL file. The
   188  combination of restrictions results in a batch needing to be written
   189  entirely to a single memtable.
   190  
   191  What happens if a batch is too large to fit in a memtable?  Memtables
   192  are generally considered to have a fixed size, yet this is not
   193  actually true in RocksDB. In RocksDB, the memtable skiplist is
   194  implemented on top of an arena structure. An arena is composed of a
   195  list of fixed size chunks, with no upper limit set for the number of
   196  chunks that can be associated with an arena. So RocksDB handles large
   197  batches by allowing a memtable to grow beyond its configured
   198  size. Concretely, while RocksDB may be configured with a 64MB memtable
size, a 1GB batch will cause the memtable to grow to accommodate
   200  it. Functionally, this is good, though there is a practical problem: a
   201  large batch is first written to the WAL, and then added to the
   202  memtable. Adding the large batch to the memtable may consume so much
   203  memory that the system runs out of memory and is killed by the
kernel. This can result in a death loop because, upon restarting, the
batch is read from the WAL and applied to the memtable again.
   206  
   207  In Pebble, the memtable is also implemented using a skiplist on top of
   208  an arena. Significantly, the Pebble arena is a fixed size. While the
   209  RocksDB skiplist uses pointers, the Pebble skiplist uses offsets from
   210  the start of the arena. The fixed size arena means that the Pebble
   211  memtable cannot expand arbitrarily. A batch that is too large to fit
   212  in the memtable causes the current mutable memtable to be marked as
   213  immutable and the batch is wrapped in a `flushableBatch` structure and
   214  added to the list of immutable memtables. Because the `flushableBatch`
   215  is readable as another layer in the LSM, the batch commit can return
   216  as soon as the `flushableBatch` has been added to the immutable
   217  memtable list.
   218  
   219  Internally, a `flushableBatch` provides iterator support by sorting
   220  the batch contents (the batch is sorted once, when it is added to the
   221  memtable list). Sorting the batch contents and insertion of the
   222  contents into a memtable have the same big-O time, but the constant
   223  factor dominates here. Sorting is significantly faster and uses
   224  significantly less memory due to not having to copy the batch records.
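
A rough sketch of the idea, with an illustrative layout and an assumed
`keyAt` helper (this is not Pebble's actual implementation): the offsets
of the records are sorted by the keys they point at, so keys and values
are never copied out of the batch buffer.

```go
package batchsketch

import (
	"bytes"
	"sort"
)

// A flushable batch indexes its contents by recording the offset of each
// record and sorting the offsets by the keys they reference.
type flushableBatchSketch struct {
	data    []byte   // the raw batch representation
	offsets []uint32 // one entry per record, sorted by key
}

func newFlushableBatchSketch(data []byte, offsets []uint32,
	keyAt func(data []byte, off uint32) []byte) *flushableBatchSketch {
	b := &flushableBatchSketch{data: data, offsets: offsets}
	sort.Slice(b.offsets, func(i, j int) bool {
		c := bytes.Compare(keyAt(data, b.offsets[i]), keyAt(data, b.offsets[j]))
		if c != 0 {
			return c < 0
		}
		return b.offsets[i] > b.offsets[j] // newer records first for equal keys
	})
	return b
}
```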
   225  
   226  Note that an effect of this large batch support is that Pebble can be
   227  configured as an efficient on-disk sorter: specify a small memtable
   228  size, disable the WAL, and set a large L0 compaction threshold. In
   229  order to sort a large amount of data, create batches that are larger
   230  than the memtable size and commit them. When committed these batches
   231  will not be inserted into a memtable, but instead sorted and then
   232  written out to L0. The fully sorted data can later be read and the
   233  normal merging process will take care of the final ordering.
   234  
   235  ## Commit Pipeline
   236  
   237  The commit pipeline is the component which manages the steps in
   238  committing write batches, such as writing the batch to the WAL and
   239  applying its contents to the memtable. While simple conceptually, the
   240  commit pipeline is crucial for high performance. In the absence of
   241  concurrency, commit performance is limited by how fast a batch can be
   242  written (and synced) to the WAL and then added to the memtable, both
   243  of which are outside of the purview of the commit pipeline.
   244  
   245  To understand the challenge here, it is useful to have a conception of
   246  the WAL (write-ahead log). The WAL contains a record of all of the
   247  batches that have been committed to the database. As a record is
   248  written to the WAL it is added to the memtable. Each record is
   249  assigned a sequence number which is used to distinguish newer updates
   250  from older ones. Conceptually the WAL looks like:
   251  
   252  ```
   253  +--------------------------------------+
   254  | Batch(SeqNum=1,Count=9,Records=...)  |
   255  +--------------------------------------+
   256  | Batch(SeqNum=10,Count=5,Records=...) |
   257  +--------------------------------------+
   258  | Batch(SeqNum=15,Count=7,Records...)  |
   259  +--------------------------------------+
   260  | ...                                  |
   261  +--------------------------------------+
   262  ```
   263  
   264  Note that each WAL entry is precisely the batch representation
   265  described earlier in the [Indexed Batches](#indexed-batches)
   266  section. The monotonically increasing sequence numbers are a critical
   267  component in allowing RocksDB and Pebble to provide fast snapshot
   268  views of the database for reads.
   269  
If concurrent performance were not a concern, the commit pipeline could
   271  simply be a mutex which serialized writes to the WAL and application
   272  of the batch records to the memtable. Concurrent performance is a
   273  concern, though.
   274  
   275  The primary challenge in concurrent performance in the commit pipeline
   276  is maintaining two invariants:
   277  
   278  1. Batches need to be written to the WAL in sequence number order.
   279  2. Batches need to be made visible for reads in sequence number
   280     order. This invariant arises from the use of a single sequence
   281     number which indicates which mutations are visible.
   282  
   283  The second invariant deserves explanation. RocksDB and Pebble both
   284  keep track of a visible sequence number. This is the sequence number
   285  for which records in the database are visible during reads. The
   286  visible sequence number exists because committing a batch is an atomic
   287  operation, yet adding records to the memtable is done without an
   288  exclusive lock (the skiplists used by both Pebble and RocksDB are
   289  lock-free). When the records from a batch are being added to the
   290  memtable, a concurrent read operation may see those records, but will
   291  skip over them because they are newer than the visible sequence
   292  number. Once all of the records in the batch have been added to the
   293  memtable, the visible sequence number is atomically incremented.
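
In code, the read-time visibility check is a single comparison, assuming
the published value is the last committed sequence number:

```go
package commitsketch

// visible reports whether a record can be seen by a read: it must not be
// newer than the published visible sequence number. Uncommitted batch
// records, which have the high bit set, are newer than any committed
// sequence number and are therefore skipped.
func visible(recordSeqNum, visibleSeqNum uint64) bool {
	return recordSeqNum <= visibleSeqNum
}
```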
   294  
   295  So we have four steps in committing a write batch:
   296  
   297  1. Write the batch to the WAL
   298  2. Apply the mutations in the batch to the memtable
   299  3. Bump the visible sequence number
   300  4. (Optionally) sync the WAL
   301  
   302  Writing the batch to the WAL is actually very fast as it is just a
   303  memory copy. Applying the mutations in the batch to the memtable is by
   304  far the most CPU intensive part of the commit pipeline. Syncing the
   305  WAL is the most expensive from a wall clock perspective.
   306  
   307  With that background out of the way, let's examine how RocksDB commits
   308  batches. This description is of the traditional commit pipeline in
   309  RocksDB (i.e. the one used by CockroachDB).
   310  
   311  RocksDB achieves concurrency in the commit pipeline by grouping
   312  concurrently committed batches into a batch group. Each group is
   313  assigned a "leader" which is the first batch to be added to the
   314  group. The batch group is written atomically to the WAL by the leader
   315  thread, and then the individual batches making up the group are
   316  concurrently applied to the memtable. Lastly, the visible sequence
   317  number is bumped such that all of the batches in the group become
   318  visible in a single atomic step. While a batch group is being applied,
   319  other concurrent commits are added to a waiting list. When the group
   320  commit finishes, the waiting commits form the next group.
   321  
   322  There are two criticisms of the batch grouping approach. The first is
   323  that forming a batch group involves copying batch contents. RocksDB
   324  partially alleviates this for large batches by placing a limit on the
   325  total size of a group. A large batch will end up in its own group and
   326  not be copied, but the criticism still applies for small batches. Note
   327  that there are actually two copies here. The batch contents are
   328  concatenated together to form the group, and then the group contents
are written into an in-memory buffer for the WAL before being written
   330  to disk.
   331  
   332  The second criticism is about the thread synchronization points. Let's
   333  consider what happens to a commit which becomes the leader:
   334  
   335  1. Lock commit mutex
   336  2. Wait to become leader
   337  3. Form (concatenate) batch group and write to the WAL
   338  4. Notify followers to apply their batch to the memtable
   339  5. Apply own batch to memtable
   340  6. Wait for followers to finish
   341  7. Bump visible sequence number
   342  8. Unlock commit mutex
   343  9. Notify followers that the commit is complete
   344  
   345  The follower's set of operations looks like:
   346  
   347  1. Lock commit mutex
   348  2. Wait to become follower
   349  3. Wait to be notified that it is time to apply batch
   350  4. Unlock commit mutex
   351  5. Apply batch to memtable
   352  6. Wait to be notified that commit is complete
   353  
   354  The thread synchronization points (all of the waits and notifies) are
   355  overhead. Reducing that overhead can improve performance.
   356  
   357  The Pebble commit pipeline addresses both criticisms. The main
   358  innovation is a commit queue that mirrors the commit order. The Pebble
   359  commit pipeline looks like:
   360  
   361  1. Lock commit mutex
   362    * Add batch to commit queue
   363    * Assign batch sequence number
   364    * Write batch to the WAL
   365  2. Unlock commit mutex
   366  3. Apply batch to memtable (concurrently)
   367  4. Publish batch sequence number
   368  
   369  Pebble does not use the concept of a batch group. Each batch is
   370  individually written to the WAL, but note that the WAL write is just a
   371  memory copy into an internal buffer in the WAL.
   372  
   373  Step 4 deserves further scrutiny as it is where the invariant on the
   374  visible batch sequence number is maintained. Publishing the batch
   375  sequence number cannot simply bump the visible sequence number because
   376  batches with earlier sequence numbers may still be applying to the
   377  memtable. If we were to ratchet the visible sequence number without
   378  waiting for those applies to finish, a concurrent reader could see
   379  partial batch contents. Note that RocksDB has experimented with
   380  allowing these semantics with its unordered writes option.
   381  
   382  We want to retain the atomic visibility of batch commits. The publish
   383  batch sequence number step needs to ensure that we don't ratchet the
   384  visible sequence number until all batches with earlier sequence
   385  numbers have applied. Enter the commit queue: a lock-free
   386  single-producer, multi-consumer queue. Batches are added to the commit
   387  queue with the commit mutex held, ensuring the same order as the
   388  sequence number assignment. After a batch finishes applying to the
memtable, the committer atomically marks its batch as applied. It then
removes the prefix of applied batches from the commit queue, bumps the
visible sequence number, and marks those batches as committed (via a
`sync.WaitGroup`). If the first batch in the commit queue has not been
applied, we wait for our batch to be committed, relying on another
   394  concurrent committer to perform the visible sequence ratcheting for
   395  our batch. We know a concurrent commit is taking place because if
   396  there was only one batch committing it would be at the head of the
   397  commit queue.
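
To make this concrete, here is a simplified Go sketch of the publish
step under the scheme described above. The names are illustrative, and a
mutex-protected slice stands in for Pebble's lock-free queue to keep the
sketch short.

```go
package commitsketch

import (
	"sync"
	"sync/atomic"
)

// commitBatch is a batch moving through the commit pipeline. done is
// signalled once the batch's sequence number has been published (i.e.
// the batch is visible to reads).
type commitBatch struct {
	seqNum  uint64 // first sequence number assigned to the batch
	count   uint64 // number of records in the batch
	applied atomic.Bool
	done    sync.WaitGroup
}

// commitQueue mirrors commit (WAL) order.
type commitQueue struct {
	mu      sync.Mutex
	pending []*commitBatch
	visible atomic.Uint64 // last published sequence number
}

// enqueue is called with the commit mutex held, so batches enter the
// queue in the same order their sequence numbers were assigned.
func (q *commitQueue) enqueue(b *commitBatch) {
	b.done.Add(1)
	q.mu.Lock()
	q.pending = append(q.pending, b)
	q.mu.Unlock()
}

// publish is called after b has been applied to the memtable. It pops
// the prefix of applied batches, ratcheting the visible sequence number
// for each one. If the head of the queue is still unapplied, that
// batch's committer will eventually pop b for us, so we simply wait.
func (q *commitQueue) publish(b *commitBatch) {
	b.applied.Store(true)
	q.mu.Lock()
	for len(q.pending) > 0 && q.pending[0].applied.Load() {
		head := q.pending[0]
		q.pending = q.pending[1:]
		q.visible.Store(head.seqNum + head.count - 1)
		head.done.Done()
	}
	q.mu.Unlock()
	b.done.Wait()
}
```

Whichever committer finds the head of the queue applied ratchets the
sequence number for the entire applied prefix, so a committer never
waits on more than one synchronization point.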
   398  
   399  There are two possibilities when publishing a sequence number. The
   400  first is that there is an unapplied batch at the head of the
   401  queue. Consider the following scenario where we're trying to publish
   402  the sequence number for batch `B`.
   403  
   404  ```
   405    +---------------+-------------+---------------+-----+
   406    | A (unapplied) | B (applied) | C (unapplied) | ... |
   407    +---------------+-------------+---------------+-----+
   408  ```
   409  
   410  The publish routine will see that `A` is unapplied and then simply
   411  wait for `B's` done `sync.WaitGroup` to be signalled. This is safe
   412  because `A` must still be committing. And if `A` has concurrently been
   413  marked as applied, the goroutine publishing `A` will then publish
   414  `B`. What happens when `A` publishes its sequence number? The commit
   415  queue state becomes:
   416  
   417  ```
   418    +-------------+-------------+---------------+-----+
   419    | A (applied) | B (applied) | C (unapplied) | ... |
   420    +-------------+-------------+---------------+-----+
   421  ```
   422  
   423  The publish routine pops `A` from the queue, ratchets the sequence
   424  number, then pops `B` and ratchets the sequence number again, and then
finds `C` and stops. An important detail to notice is that
   426  the committer for batch `B` didn't have to do any more work. An
   427  alternative approach would be to have `B` wakeup and ratchet its own
   428  sequence number, but that would serialize the remainder of the commit
   429  queue behind that goroutine waking up.
   430  
   431  The commit queue reduces the number of thread synchronization
   432  operations required to commit a batch. There is no leader to notify,
   433  or followers to wait for. A commit either publishes its own sequence
   434  number, or performs one synchronization operation to wait for a
   435  concurrent committer to publish its sequence number.
   436  
   437  ## Range Deletions
   438  
   439  Deletion of an individual key in RocksDB and Pebble is accomplished by
   440  writing a deletion tombstone. A deletion tombstone shadows an existing
   441  value for a key, causing reads to treat the key as not present. The
   442  deletion tombstone mechanism works well for deleting small sets of
keys, but what happens if you want to delete all of the keys within a range
   444  of keys that might number in the thousands or millions? A range
   445  deletion is an operation which deletes an entire range of keys with a
   446  single record. In contrast to a point deletion tombstone which
   447  specifies a single key, a range deletion tombstone (a.k.a. range
   448  tombstone) specifies a start key (inclusive) and an end key
   449  (exclusive). This single record is much faster to write than thousands
   450  or millions of point deletion tombstones, and can be done blindly --
   451  without iterating over the keys that need to be deleted. The downside
   452  to range tombstones is that they require additional processing during
   453  reads. How the processing of range tombstones is done significantly
   454  affects both the complexity of the implementation, and the efficiency
   455  of read operations in the presence of range tombstones.
   456  
   457  A range tombstone is composed of a start key, end key, and sequence
   458  number. Any key that falls within the range is considered deleted if
   459  the key's sequence number is less than or equal to the range
   460  tombstone's sequence number. RocksDB stores range tombstones
   461  segregated from point operations in a special range deletion block
   462  within each sstable. Conceptually, the range tombstones stored within
   463  an sstable are truncated to the boundaries of the sstable, though
   464  there are complexities that cause this to not actually be physically
   465  true.
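
As a small sketch, a range tombstone and the deletion rule just
described can be written as follows (the layout is illustrative):

```go
package rangedelsketch

import "bytes"

// tombstone is a range deletion covering [start, end) at a sequence
// number.
type tombstone struct {
	start, end []byte
	seqNum     uint64
}

// deletes reports whether the tombstone deletes a key at the given
// sequence number: the key must fall within [start, end) and must not
// be newer than the tombstone.
func (t tombstone) deletes(key []byte, keySeqNum uint64) bool {
	return bytes.Compare(key, t.start) >= 0 &&
		bytes.Compare(key, t.end) < 0 &&
		keySeqNum <= t.seqNum
}
```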
   466  
   467  In RocksDB, the main structure implementing range tombstone processing
   468  is the `RangeDelAggregator`. Each read operation and iterator has its
   469  own `RangeDelAggregator` configured for the sequence number the read
   470  is taking place at. The initial implementation of `RangeDelAggregator`
   471  built up a "skyline" for the range tombstones visible at the read
   472  sequence number.
   473  
   474  ```
   475  10   +---+
   476   9   |   |
   477   8   |   |
   478   7   |   +----+
   479   6   |        |
   480   5 +-+        |  +----+
   481   4 |          |  |    |
   482   3 |          |  |    +---+
   483   2 |          |  |        |
   484   1 |          |  |        |
   485   0 |          |  |        |
   486    abcdefghijklmnopqrstuvwxyz
   487  ```
   488  
   489  The above diagram shows the skyline created for the range tombstones
   490  `[b,j)#5`, `[d,h)#10`, `[f,m)#7`, `[p,u)#5`, and `[t,y)#3`. The
   491  skyline is queried for each key read to see if the key should be
   492  considered deleted or not. The skyline structure is stored in a binary
tree, making the queries an O(log n) operation in the number of
   494  tombstones, though there is an optimization to make this O(1) for
   495  `next`/`prev` iteration. Note that the skyline representation loses
   496  information about the range tombstones. This requires the structure to
   497  be rebuilt on every read which has a significant performance impact.
   498  
   499  The initial skyline range tombstone implementation has since been
   500  replaced with a more efficient lookup structure. See the
   501  [DeleteRange](https://rocksdb.org/blog/2018/11/21/delete-range.html)
   502  blog post for a good description of both the original implementation
   503  and the new (v2) implementation. The key change in the new
   504  implementation is to "fragment" the range tombstones that are stored
   505  in an sstable. The fragmented range tombstones provide the same
   506  benefit as the skyline representation: the ability to binary search
   507  the fragments in order to find the tombstone covering a key. But
   508  unlike the skyline approach, the fragmented tombstones can be cached
   509  on a per-sstable basis. In the v2 approach, `RangeDelAggregator` keeps
   510  track of the fragmented range tombstones for each sstable encountered
   511  during a read or iterator, and logically merges them together.
   512  
   513  Fragmenting range tombstones involves splitting range tombstones at
   514  overlap points. Let's consider the tombstones in the skyline example
   515  above:
   516  
   517  ```
   518  10:   d---h
   519   7:     f------m
   520   5: b-------j     p----u
   521   3:                   t----y
   522  ```
   523  
   524  Fragmenting the range tombstones at the overlap points creates a
   525  larger number of range tombstones:
   526  
   527  ```
   528  10:   d-f-h
   529   7:     f-h-j--m
   530   5: b-d-f-h-j     p---tu
   531   3:                   tu---y
   532  ```
   533  
   534  While the number of tombstones is larger there is a significant
   535  advantage: we can order the tombstones by their start key and then
   536  binary search to find the set of tombstones overlapping a particular
point. This is possible because, due to the fragmenting, all of the
tombstones that overlap a given point will have the same start and
   539  end key. The v2 `RangeDelAggregator` and associated classes perform
   540  fragmentation of range tombstones stored in each sstable and those
   541  fragmented tombstones are then cached.
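
The following Go sketch shows fragmentation in a simplified, batch form
(the real implementations fragment incrementally as tombstones are
encountered), reusing the `tombstone` type from the earlier sketch.
Applied to the five tombstones above, it produces the fragments shown in
the second diagram.

```go
package rangedelsketch

import (
	"bytes"
	"sort"
)

// fragment splits a set of range tombstones at all of their mutual start
// and end keys, so that any two output fragments either share exactly
// the same bounds or do not overlap at all.
func fragment(tombs []tombstone) []tombstone {
	// Every start and end key is a potential split point.
	var bounds [][]byte
	for _, t := range tombs {
		bounds = append(bounds, t.start, t.end)
	}
	sort.Slice(bounds, func(i, j int) bool {
		return bytes.Compare(bounds[i], bounds[j]) < 0
	})
	uniq := bounds[:0]
	for _, b := range bounds {
		if len(uniq) == 0 || !bytes.Equal(uniq[len(uniq)-1], b) {
			uniq = append(uniq, b)
		}
	}
	// Emit a fragment for every (bound, next bound) span covered by a
	// tombstone.
	var out []tombstone
	for i := 0; i+1 < len(uniq); i++ {
		lo, hi := uniq[i], uniq[i+1]
		for _, t := range tombs {
			if bytes.Compare(t.start, lo) <= 0 && bytes.Compare(t.end, hi) >= 0 {
				out = append(out, tombstone{start: lo, end: hi, seqNum: t.seqNum})
			}
		}
	}
	return out
}
```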
   542  
   543  In summary, in RocksDB `RangeDelAggregator` acts as an oracle for
   544  answering whether a key is deleted at a particular sequence
   545  number. Due to caching of fragmented tombstones, the v2 implementation
of `RangeDelAggregator` is significantly faster to
   547  populate than v1, yet the overall approach to processing range
   548  tombstones remains similar.
   549  
Pebble takes a different approach: it integrates range tombstone
   551  processing directly into the `mergingIter` structure. `mergingIter` is
   552  the internal structure which provides a merged view of the levels in
   553  an LSM. RocksDB has a similar class named
   554  `MergingIterator`. Internally, `mergingIter` maintains a heap over the
   555  levels in the LSM (note that each memtable and L0 table is a separate
   556  "level" in `mergingIter`). In RocksDB, `MergingIterator` knows nothing
   557  about range tombstones, and it is thus up to higher-level code to
   558  process range tombstones using `RangeDelAggregator`.
   559  
   560  While the separation of `MergingIterator` and range tombstones seems
   561  reasonable at first glance, there is an optimization that RocksDB does
   562  not perform which is awkward with the `RangeDelAggregator` approach:
   563  skipping swaths of deleted keys. A range tombstone often shadows more
   564  than one key. Rather than iterating over the deleted keys, it is much
   565  quicker to seek to the end point of the range tombstone. The challenge
   566  in implementing this optimization is that a key might be newer than
   567  the range tombstone and thus shouldn't be skipped. An insight to be
   568  utilized is that the level structure itself provides sufficient
   569  information. A range tombstone at `Ln` is guaranteed to be newer than
   570  any key it overlaps in `Ln+1`.
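
A Go sketch of the skip optimization built on that insight, using
illustrative iterator interfaces (not Pebble's actual API) and the
`tombstone` type from the earlier sketch; visibility of the tombstone at
the read's sequence number is ignored for brevity.

```go
package rangedelsketch

import "bytes"

// pointIter and rangeDelIter are illustrative interfaces.
type pointIter interface {
	Valid() bool
	Key() []byte
	SeekGE(key []byte)
}

type rangeDelIter interface {
	// SeekGE returns the first tombstone whose end key is after key, or
	// ok=false if there is none.
	SeekGE(key []byte) (t tombstone, ok bool)
}

// skipDeleted advances an older level's point iterator past any swath of
// keys covered by a tombstone from a strictly newer level. Because the
// tombstone comes from a newer level it is guaranteed to be newer than
// every key it overlaps here, so no sequence number check is needed and
// the deleted keys are skipped with a single seek.
func skipDeleted(points pointIter, newerRangeDels rangeDelIter) {
	for points.Valid() {
		t, ok := newerRangeDels.SeekGE(points.Key())
		if !ok || bytes.Compare(points.Key(), t.start) < 0 {
			return // the current key is not covered by a newer tombstone
		}
		points.SeekGE(t.end) // skip the entire deleted range
	}
}
```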
   571  
   572  Pebble utilizes the insight above to integrate range deletion
   573  processing with `mergingIter`. A `mergingIter` maintains a point
   574  iterator and a range deletion iterator per level in the LSM. In this
   575  context, every L0 table is a separate level, as is every
   576  memtable. Within a level, when a range deletion contains a point
   577  operation the sequence numbers must be checked to determine if the
   578  point operation is newer or older than the range deletion
   579  tombstone. The `mergingIter` maintains the invariant that the range
deletion iterators for all levels newer than the current iteration key
   581  are positioned at the next (or previous during reverse iteration)
   582  range deletion tombstone. We know those levels don't contain a range
   583  deletion tombstone that covers the current key because if they did the
   584  current key would be deleted. The range deletion iterator for the
   585  current key's level is positioned at a range tombstone covering or
past the current key. The position of all other range deletion
   587  iterators is unspecified. Whenever a key from those levels becomes the
   588  current key, their range deletion iterators need to be
   589  positioned. This lazy positioning avoids seeking the range deletion
   590  iterators for keys that are never considered.
   591  
   592  For a full example, consider the following setup:
   593  
   594  ```
   595    p0:               o
   596    r0:             m---q
   597  
   598    p1:              n p
   599    r1:       g---k
   600  
   601    p2:  b d    i
   602    r2: a---e           q----v
   603  
   604    p3:     e
   605    r3:
   606  ```
   607  
The diagram above shows 4 levels, with `pX` indicating the
   609  point operations in a level and `rX` indicating the range tombstones.
   610  
   611  If we start iterating from the beginning, the first key we encounter
is `b` in `p2`. When the `mergingIter` is pointing at a valid entry, the
range deletion iterators for all of the levels less than the current
   614  key's level are positioned at the next range tombstone past the
   615  current key. So `r0` will point at `[m,q)` and `r1` at `[g,k)`. When
   616  the key `b` is encountered, we check to see if the current tombstone
   617  for `r0` or `r1` contains it, and whether the tombstone for `r2`,
   618  `[a,e)`, contains and is newer than `b`.
   619  
   620  Advancing the iterator finds the next key at `d`. This is in the same
   621  level as the previous key `b` so we don't have to reposition any of
   622  the range deletion iterators, but merely check whether `d` is now
   623  contained by any of the range tombstones at higher levels or has
   624  stepped past the range tombstone in its own level. In this case, there
   625  is nothing to be done.
   626  
   627  Advancing the iterator again finds `e`. Since `e` comes from `p3`, we
   628  have to position the `r3` range deletion iterator, which is empty. `e`
   629  is past the `r2` tombstone of `[a,e)` so we need to advance the `r2`
   630  range deletion iterator to `[q,v)`.
   631  
   632  The next key is `i`. Because this key is in `p2`, a level above `e`,
   633  we don't have to reposition any range deletion iterators and instead
   634  see that `i` is covered by the range tombstone `[g,k)`. The iterator
   635  is immediately advanced to `n` which is covered by the range tombstone
   636  `[m,q)` causing the iterator to advance to `o` which is visible.
   637  
   638  ## Flush and Compaction Pacing
   639  
   640  Flushes and compactions in LSM trees are problematic because they
   641  contend with foreground traffic, resulting in write and read latency
   642  spikes. Without throttling the rate of flushes and compactions, they
   643  occur "as fast as possible" (which is not entirely true, since we
   644  have a `bytes_per_sync` option). This instantaneous usage of CPU and
   645  disk IO results in potentially huge latency spikes for writes and
   646  reads which occur in parallel to the flushes and compactions.
   647  
   648  RocksDB attempts to solve this issue by offering an option to limit
   649  the speed of flushes and compactions. A maximum `bytes/sec` can be
   650  specified through the options, and background IO usage will be limited
   651  to the specified amount. Flushes are given priority over compactions,
   652  but they still use the same rate limiter. Though simple to implement
   653  and understand, this option is fragile for various reasons.
   654  
   655  1) If the rate limit is configured too low, the DB will stall and
   656  write throughput will be affected.
   657  2) If the rate limit is configured too high, the write and read
   658  latency spikes will persist.
   659  3) A different configuration is needed per system depending on the
   660  speed of the storage device.
   661  4) Write rates typically do not stay the same throughout the lifetime
of the DB (higher throughput during certain times of the day, etc.), but
   663  the rate limit cannot be configured during runtime.
   664  
   665  RocksDB also offers an
   666  ["auto-tuned" rate limiter](https://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html)
   667  which uses a simple multiplicative-increase, multiplicative-decrease
   668  algorithm to dynamically adjust the background IO rate limit depending
   669  on how much of the rate limiter has been exhausted in an interval.
   670  This solves the problem of having a static rate limit, but Pebble
   671  attempts to improve on this with a different pacing mechanism.
   672  
   673  Pebble's pacing mechanism uses separate rate limiters for flushes and
   674  compactions. Both the flush and compaction pacing mechanisms work by
   675  attempting to flush and compact only as fast as needed and no faster.
   676  This is achieved differently for flushes versus compactions.
   677  
   678  For flush pacing, Pebble keeps the rate at which the memtable is
   679  flushed at the same rate as user writes. This ensures that disk IO
   680  used by flushes remains steady. When a mutable memtable becomes full
and is marked immutable, it would typically be flushed as fast as
possible. Instead, we look at the
   683  total number of bytes in all the memtables (mutable + queue of
   684  immutables) and subtract the number of bytes that have been flushed in
   685  the current flush. This number gives us the total number of bytes
   686  which remain to be flushed. If we keep this number steady at a constant
   687  level, we have the invariant that the flush rate is equal to the write
   688  rate.
   689  
   690  When the number of bytes remaining to be flushed falls below our
   691  target level, we slow down the speed of flushing. We keep a minimum
   692  rate at which the memtable is flushed so that flushes proceed even if
   693  writes have stopped. When the number of bytes remaining to be flushed
   694  goes above our target level, we allow the flush to proceed as fast as
   695  possible, without applying any rate limiting. However, note that the
   696  second case would indicate that writes are occurring faster than the
   697  memtable can flush, which would be an unsustainable rate. The LSM
   698  would soon hit the memtable count stall condition and writes would be
   699  completely stopped.
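
A sketch of the flush-pacing decision described above, with illustrative
names; the linear slowdown below the target is one simple choice rather
than necessarily the exact curve Pebble uses.

```go
package pacingsketch

// flushBytesPerSec returns a flush rate given the pacing state. The
// quantity held steady is the number of bytes still waiting to be
// flushed: at or above the target the flush runs unthrottled, below it
// the rate scales down, but never below a minimum so the flush finishes
// even if writes stop.
func flushBytesPerSec(totalMemtableBytes, bytesFlushedSoFar int64,
	targetBytes, minRate, maxRate int64) int64 {
	remaining := totalMemtableBytes - bytesFlushedSoFar
	if remaining >= targetBytes {
		return maxRate // writes are outpacing the flush; do not throttle
	}
	rate := maxRate * remaining / targetBytes
	if rate < minRate {
		rate = minRate
	}
	return rate
}
```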
   700  
   701  For compaction pacing, Pebble uses an estimation of compaction debt,
   702  which is the number of bytes which need to be compacted before no
   703  further compactions are needed. This estimation is calculated by
   704  looking at the number of bytes that have been flushed by the current
   705  flush routine, adding those bytes to the size of the level 0 sstables,
   706  then seeing how many bytes exceed the target number of bytes for the
   707  level 0 sstables. We multiply the number of bytes exceeded by the
level ratio and add that number to the compaction debt estimate.
   709  We repeat this process until the final level, which gives us a final
   710  compaction debt estimate for the entire LSM tree.
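
One way to express the compaction debt estimate described above in Go,
with illustrative names and accounting:

```go
package pacingsketch

// estimateCompactionDebt walks the levels from L0 down. Bytes destined
// for a level (existing data plus whatever spilled down from above) that
// exceed the level's target must be compacted into the next level; each
// such compaction rewrites roughly excess * levelRatio bytes, and the
// excess then counts against the next level's target, and so on.
func estimateCompactionDebt(bytesFlushed int64, levelSizes, levelTargets []int64,
	levelRatio int64) int64 {
	var debt int64
	carry := bytesFlushed // bytes about to land in L0 from the current flush
	for level := range levelSizes {
		excess := levelSizes[level] + carry - levelTargets[level]
		if excess <= 0 {
			break
		}
		debt += excess * levelRatio
		carry = excess // the excess moves down to the next level
	}
	return debt
}
```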
   711  
   712  Like with flush pacing, we want to keep the compaction debt at a
   713  constant level. This ensures that compactions occur only as fast as
   714  needed and no faster. If the compaction debt estimate falls below our
   715  target level, we slow down compactions. We maintain a minimum
   716  compaction rate so that compactions proceed even if flushes have
   717  stopped. If the compaction debt goes above our target level, we let
   718  compactions proceed as fast as possible without any rate limiting.
   719  Just like with flush pacing, this would indicate that writes are
   720  occurring faster than the background compactions can keep up with,
   721  which is an unsustainable rate. The LSM's read amplification would
   722  increase and the L0 file count stall condition would be hit.
   723  
   724  With the combined flush and compaction pacing mechanisms, flushes and
   725  compactions only occur as fast as needed and no faster, which reduces
   726  latency spikes for user read and write operations.
   727  
## Write Throttling
   729  
   730  RocksDB adds artificial delays to user writes when certain thresholds
   731  are met, such as `l0_slowdown_writes_threshold`. These artificial
   732  delays occur when the system is close to stalling to lessen the write
   733  pressure so that flushing and compactions can catch up. On the surface
   734  this seems good, since write stalls would seemingly be eliminated and
   735  replaced with gradual slowdowns. Closed loop write latency benchmarks
   736  would show the elimination of abrupt write stalls, which seems
   737  desirable.
   738  
   739  However, this doesn't do anything to improve latencies in an open loop
   740  model, which is the model more likely to resemble real world use
   741  cases. Artificial delays increase write latencies without a clear
benefit. Write stalls in an open loop system would indicate that
   743  writes are generated faster than the system could possibly handle,
   744  which adding artificial delays won't solve.
   745  
   746  For this reason, Pebble doesn't add artificial delays to user writes
   747  and writes are served as quickly as possible.
   748  
## Other Differences
   750  
   751  * `internalIterator` API which minimizes indirect (virtual) function
   752    calls
   753  * Previous pointers in the memtable and indexed batch skiplists
   754  * Elision of per-key lower/upper bound checks in long range scans
   755  * Weak cache references remove the need to pin index and filter blocks
   756    in memory
   757  * Improved `Iterator` API
   758    + `SeekPrefixGE` for prefix iteration
   759    + `SetBounds` for adjusting the bounds on an existing `Iterator`
   760  * Simpler `Get` implementation