# Pebble vs RocksDB: Implementation Differences

RocksDB is a key-value store implemented using a Log-Structured Merge-Tree (LSM). This document is not a primer on LSMs. There exist some decent [introductions](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/) on the web, or try chapter 3 of [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321).

Pebble inherits the RocksDB file formats, has a similar API, and shares many implementation details, but it also has many differences that improve performance, reduce implementation complexity, or extend functionality. This document highlights some of the more important differences.

* [Internal Keys](#internal-keys)
* [Indexed Batches](#indexed-batches)
* [Large Batches](#large-batches)
* [Commit Pipeline](#commit-pipeline)
* [Range Deletions](#range-deletions)
* [Flush and Compaction Pacing](#flush-and-compaction-pacing)
* [Write Throttling](#write-throttling)
* [Other Differences](#other-differences)

## Internal Keys

The external RocksDB API accepts keys and values. Due to the LSM structure, keys are never updated in place, but overwritten with new versions. Inside RocksDB, these versioned keys are known as Internal Keys. An Internal Key is composed of the user-specified key, a sequence number, and a kind. On disk, sstables always store Internal Keys.

```
+-------------+------------+----------+
| UserKey (N) | SeqNum (7) | Kind (1) |
+-------------+------------+----------+
```

The `Kind` field indicates the type of key: set, merge, delete, etc.

While Pebble inherits the Internal Key encoding for format compatibility, it diverges from RocksDB in how it manages Internal Keys in its implementation. In RocksDB, Internal Keys are represented either in encoded form (as a string) or as a `ParsedInternalKey`. The latter is a struct with the components of the Internal Key as three separate fields.

```c++
struct ParsedInternalKey {
  Slice user_key;
  uint64 seqnum;
  uint8 kind;
};
```

The component format is convenient: changing the `SeqNum` or `Kind` is a field assignment, and extracting the `UserKey` is a field reference. However, RocksDB tends to only use `ParsedInternalKey` locally. The major internal APIs, such as `InternalIterator`, operate using encoded internal keys (i.e. strings) for parameters and return values.

To give a concrete example of the overhead this causes, consider `Iterator::Seek(user_key)`. The external `Iterator` is implemented on top of an `InternalIterator`. `Iterator::Seek` ends up calling `InternalIterator::Seek`. Both Seek methods take a key, but `InternalIterator::Seek` expects an encoded Internal Key. This is both error-prone and expensive: the key passed to `Iterator::Seek` needs to be copied into a temporary string in order to append the `SeqNum` and `Kind`. In Pebble, Internal Keys are represented in memory using an `InternalKey` struct that is the analog of `ParsedInternalKey`. All internal APIs use `InternalKey`s, with the exception of the lowest-level routines for decoding data from sstables. Because the Pebble interfaces all take and return the `InternalKey` struct, Pebble does not need to allocate to construct an Internal Key from a User Key, whereas RocksDB sometimes needs to allocate and encode (i.e. make a copy). The use of the encoded form also causes RocksDB to pass encoded keys to the comparator routines, sometimes decoding the keys multiple times during the course of processing.
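
To make the struct-based representation concrete, the following Go sketch shows the general shape of the idea. It is a simplified illustration rather than Pebble's exact definitions (the real ones live in Pebble's `internal/base` package), and the helper name `makeSearchKey` is illustrative:

```go
// InternalKeyKind corresponds to the Kind byte: set, merge, delete, etc.
type InternalKeyKind uint8

// InternalKey is the in-memory analog of ParsedInternalKey: the user key is
// kept separate from the packed sequence number and kind.
type InternalKey struct {
	UserKey []byte
	Trailer uint64 // SeqNum occupies the upper 56 bits, Kind the lower 8 bits
}

// makeSearchKey builds the Internal Key for a Seek. The user key is aliased,
// not copied, so no allocation or re-encoding is needed.
func makeSearchKey(userKey []byte, seqNum uint64, kind InternalKeyKind) InternalKey {
	return InternalKey{
		UserKey: userKey,
		Trailer: (seqNum << 8) | uint64(kind),
	}
}

func (k InternalKey) SeqNum() uint64        { return k.Trailer >> 8 }
func (k InternalKey) Kind() InternalKeyKind { return InternalKeyKind(k.Trailer & 0xff) }
```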
## Indexed Batches

In RocksDB, a batch is the unit for all write operations. Even writing a single key is transformed internally into a batch. The batch internal representation is a contiguous byte buffer with a fixed 12-byte header, followed by a series of records.

```
+------------+-----------+--- ... ---+
| SeqNum (8) | Count (4) |  Entries  |
+------------+-----------+--- ... ---+
```

Each record has a 1-byte kind tag prefix, followed by 1 or 2 length-prefixed strings (varstring):

```
+----------+-----------------+-------------------+
| Kind (1) | Key (varstring) | Value (varstring) |
+----------+-----------------+-------------------+
```

(The `Kind` indicates whether there are 1 or 2 varstrings. `Set`, `Merge`, and `DeleteRange` have 2 varstrings, while `Delete` has 1.)

Adding a mutation to a batch involves appending a new record to the buffer. This format is extremely fast for writes, but the lack of indexing makes it untenable to use directly for reads. In order to support iteration, a separate indexing structure is created. Both RocksDB and Pebble use a skiplist for the indexing structure, but with a clever twist. Rather than the skiplist storing a copy of the key, it simply stores the offset of the record within the mutation buffer. The result is that the skiplist acts as a multi-map (i.e. a map that can have duplicate entries for a given key). The iteration order for this map is constructed so that records sort on key, and for equal keys they sort on descending offset. Newer records for the same key appear before older records.
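
Because the index stores offsets rather than keys, the skiplist comparator must decode the key from the batch buffer on every comparison. The following sketch shows that ordering; the `decodeKey` helper and its layout assumptions are illustrative rather than the actual batch code:

```go
import (
	"bytes"
	"encoding/binary"
)

// batchIndexCompare orders two batch records identified by their offsets into
// the batch buffer: ascending by user key, and for equal keys descending by
// offset so that newer records sort before older ones.
func batchIndexCompare(buf []byte, aOffset, bOffset uint32) int {
	aKey := decodeKey(buf, aOffset)
	bKey := decodeKey(buf, bOffset)
	if c := bytes.Compare(aKey, bKey); c != 0 {
		return c
	}
	// Equal keys: the record at the larger offset was added later and is newer.
	switch {
	case aOffset > bOffset:
		return -1
	case aOffset < bOffset:
		return 1
	default:
		return 0
	}
}

// decodeKey reads the key varstring of the record starting at offset: a 1-byte
// kind tag followed by a uvarint length prefix and the key bytes.
func decodeKey(buf []byte, offset uint32) []byte {
	p := buf[offset+1:] // skip the kind tag
	n, m := binary.Uvarint(p)
	return p[m : m+int(n)]
}
```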
While the indexing structure for batches is nearly identical between RocksDB and Pebble, how the index structure is used is completely different. In RocksDB, a batch is indexed using the `WriteBatchWithIndex` class. The `WriteBatchWithIndex` class provides a `NewIteratorWithBase` method that allows iteration over the merged view of the batch contents and an underlying "base" iterator created from the database. `BaseDeltaIterator` contains logic to iterate over the batch entries and the base iterator in parallel, which allows us to perform reads on a snapshot of the database as though the batch had been applied to it. On the surface this sounds reasonable, yet the implementation is incomplete: Merge and DeleteRange operations are not supported. They are not supported because handling them is complex and would require duplicating logic that already exists inside RocksDB for normal iterator processing.

Pebble takes a different approach to iterating over a merged view of a batch's contents and the underlying database: it treats the batch as another level in the LSM. Recall that an LSM is composed of zero or more memtable layers and zero or more sstable layers. Internally, both RocksDB and Pebble contain a `MergingIterator` that knows how to merge the operations from different levels, including processing overwritten keys, merge operations, and delete range operations. The challenge with treating the batch as another level to be used by a `MergingIterator` is that the records in a batch do not have a sequence number. The sequence number in the batch header is not assigned until the batch is committed. The solution is to give the batch records temporary sequence numbers. We need these temporary sequence numbers to be larger than any other sequence number in the database so that the records in the batch are considered newer than any committed record. This is accomplished by reserving the high bit of the 56-bit sequence number as a marker for batch sequence numbers. The sequence number for a record in an uncommitted batch is:

```
RecordOffset | (1<<55)
```

Newer records in a given batch will have a larger sequence number than older records in the batch. And all of the records in a batch will have larger sequence numbers than any committed record in the database.

The end result is that Pebble's batch iterators support all of the functionality of regular database iterators with minimal additional code.

## Large Batches

The size of a batch is limited only by available memory, yet the required memory is not just the batch representation. When a batch is committed, the commit operation iterates over the records in the batch from oldest to newest and inserts them into the current memtable. The memtable is an in-memory structure that buffers mutations that have been committed (written to the Write Ahead Log), but not yet written to an sstable. Internally, a memtable uses a skiplist to index records. Each skiplist entry has overhead for the index links and other metadata that is a dozen bytes at minimum. A large batch composed of many small records can require twice as much memory when inserted into a memtable as it required in the batch. And note that this causes a temporary increase in memory requirements, because the batch memory is not freed until it is completely committed.

A non-obvious implementation restriction present in both RocksDB and Pebble is that there is a one-to-one correspondence between WAL files and memtables. That is, a given WAL file has a single memtable associated with it and vice versa. While this restriction could be removed, doing so is onerous and intricate. It should also be noted that committing a batch involves writing it to a single WAL file. The combination of restrictions results in a batch needing to be written entirely to a single memtable.

What happens if a batch is too large to fit in a memtable? Memtables are generally considered to have a fixed size, yet this is not actually true in RocksDB. In RocksDB, the memtable skiplist is implemented on top of an arena structure. An arena is composed of a list of fixed-size chunks, with no upper limit set for the number of chunks that can be associated with an arena. So RocksDB handles large batches by allowing a memtable to grow beyond its configured size. Concretely, while RocksDB may be configured with a 64MB memtable size, a 1GB batch will cause the memtable to grow to accommodate it. Functionally, this is good, though there is a practical problem: a large batch is first written to the WAL, and then added to the memtable. Adding the large batch to the memtable may consume so much memory that the system runs out of memory and is killed by the kernel. This can result in a death loop, because upon restarting the batch is read from the WAL and applied to the memtable again.

In Pebble, the memtable is also implemented using a skiplist on top of an arena. Significantly, the Pebble arena is a fixed size. While the RocksDB skiplist uses pointers, the Pebble skiplist uses offsets from the start of the arena. The fixed-size arena means that the Pebble memtable cannot expand arbitrarily. A batch that is too large to fit in the memtable causes the current mutable memtable to be marked as immutable, and the batch is wrapped in a `flushableBatch` structure and added to the list of immutable memtables. Because the `flushableBatch` is readable as another layer in the LSM, the batch commit can return as soon as the `flushableBatch` has been added to the immutable memtable list.

Internally, a `flushableBatch` provides iterator support by sorting the batch contents (the batch is sorted once, when it is added to the memtable list). Sorting the batch contents and insertion of the contents into a memtable have the same big-O time, but the constant factor dominates here. Sorting is significantly faster and uses significantly less memory due to not having to copy the batch records.
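
The idea can be sketched as sorting the record offsets rather than the records themselves, reusing the hypothetical `decodeKey` helper from the earlier batch sketch (again illustrative, not the actual `flushableBatch` code):

```go
import (
	"bytes"
	"sort"
)

// sortBatchOffsets sorts the offsets of a batch's records by user key, and by
// descending offset for equal keys, so that an iterator can walk the batch in
// key order without copying any record into a skiplist.
func sortBatchOffsets(buf []byte, offsets []uint32) {
	sort.Slice(offsets, func(i, j int) bool {
		a, b := offsets[i], offsets[j]
		if c := bytes.Compare(decodeKey(buf, a), decodeKey(buf, b)); c != 0 {
			return c < 0
		}
		return a > b // for equal keys, the newer record (larger offset) sorts first
	})
}
```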
Note that an effect of this large batch support is that Pebble can be configured as an efficient on-disk sorter: specify a small memtable size, disable the WAL, and set a large L0 compaction threshold. In order to sort a large amount of data, create batches that are larger than the memtable size and commit them. When committed, these batches will not be inserted into a memtable, but instead sorted and then written out to L0. The fully sorted data can later be read and the normal merging process will take care of the final ordering.

## Commit Pipeline

The commit pipeline is the component which manages the steps in committing write batches, such as writing the batch to the WAL and applying its contents to the memtable. While simple conceptually, the commit pipeline is crucial for high performance. In the absence of concurrency, commit performance is limited by how fast a batch can be written (and synced) to the WAL and then added to the memtable, both of which are outside the purview of the commit pipeline.

To understand the challenge here, it is useful to have a conception of the WAL (write-ahead log). The WAL contains a record of all of the batches that have been committed to the database. As a record is written to the WAL it is added to the memtable. Each record is assigned a sequence number which is used to distinguish newer updates from older ones. Conceptually the WAL looks like:

```
+--------------------------------------+
| Batch(SeqNum=1,Count=9,Records=...)  |
+--------------------------------------+
| Batch(SeqNum=10,Count=5,Records=...) |
+--------------------------------------+
| Batch(SeqNum=15,Count=7,Records=...) |
+--------------------------------------+
| ...                                  |
+--------------------------------------+
```

Note that each WAL entry is precisely the batch representation described earlier in the [Indexed Batches](#indexed-batches) section. The monotonically increasing sequence numbers are a critical component in allowing RocksDB and Pebble to provide fast snapshot views of the database for reads.
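
As a small illustration of that correspondence, committing a batch only requires stamping the assigned sequence number and record count into the batch's fixed 12-byte header before the buffer is handed to the WAL. The sketch below is illustrative (it assumes the little-endian fixed-width encoding used for the header fields):

```go
import "encoding/binary"

// stampHeader fills in the 12-byte batch header (SeqNum followed by Count)
// once the sequence number has been assigned at commit time. Because a WAL
// entry is just the batch representation, the same buffer can then be
// appended to the WAL without any re-encoding.
func stampHeader(batch []byte, seqNum uint64, count uint32) {
	binary.LittleEndian.PutUint64(batch[0:8], seqNum)
	binary.LittleEndian.PutUint32(batch[8:12], count)
}
```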
If concurrent performance were not a concern, the commit pipeline could simply be a mutex which serialized writes to the WAL and application of the batch records to the memtable. Concurrent performance is a concern, though.

The primary challenge in concurrent performance in the commit pipeline is maintaining two invariants:

1. Batches need to be written to the WAL in sequence number order.
2. Batches need to be made visible for reads in sequence number order. This invariant arises from the use of a single sequence number which indicates which mutations are visible.

The second invariant deserves explanation. RocksDB and Pebble both keep track of a visible sequence number. This is the sequence number for which records in the database are visible during reads. The visible sequence number exists because committing a batch is an atomic operation, yet adding records to the memtable is done without an exclusive lock (the skiplists used by both Pebble and RocksDB are lock-free). When the records from a batch are being added to the memtable, a concurrent read operation may see those records, but will skip over them because they are newer than the visible sequence number. Once all of the records in the batch have been added to the memtable, the visible sequence number is atomically incremented.
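
A sketch of the visible sequence number mechanics, with illustrative names and simplified atomics (not the actual RocksDB or Pebble code):

```go
import "sync/atomic"

// visibleSeqNum is the highest sequence number that is visible to readers.
var visibleSeqNum atomic.Uint64

// A reader snapshots the visible sequence number once at the start of a read
// and skips any record that is newer than the snapshot, even if that record
// has already been inserted into the memtable skiplist.
func readSnapshot() uint64 {
	return visibleSeqNum.Load()
}

func visible(recordSeqNum, snapshotSeqNum uint64) bool {
	return recordSeqNum <= snapshotSeqNum
}

// Once every record in a batch has been added to the memtable, the committer
// publishes the batch by ratcheting the visible sequence number past it.
func publish(lastSeqNumInBatch uint64) {
	visibleSeqNum.Store(lastSeqNumInBatch)
}
```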
So we have four steps in committing a write batch:

1. Write the batch to the WAL
2. Apply the mutations in the batch to the memtable
3. Bump the visible sequence number
4. (Optionally) sync the WAL

Writing the batch to the WAL is actually very fast as it is just a memory copy. Applying the mutations in the batch to the memtable is by far the most CPU-intensive part of the commit pipeline. Syncing the WAL is the most expensive from a wall clock perspective.

With that background out of the way, let's examine how RocksDB commits batches. This description is of the traditional commit pipeline in RocksDB (i.e. the one used by CockroachDB).

RocksDB achieves concurrency in the commit pipeline by grouping concurrently committed batches into a batch group. Each group is assigned a "leader" which is the first batch to be added to the group. The batch group is written atomically to the WAL by the leader thread, and then the individual batches making up the group are concurrently applied to the memtable. Lastly, the visible sequence number is bumped such that all of the batches in the group become visible in a single atomic step. While a batch group is being applied, other concurrent commits are added to a waiting list. When the group commit finishes, the waiting commits form the next group.

There are two criticisms of the batch grouping approach. The first is that forming a batch group involves copying batch contents. RocksDB partially alleviates this for large batches by placing a limit on the total size of a group. A large batch will end up in its own group and not be copied, but the criticism still applies for small batches. Note that there are actually two copies here: the batch contents are concatenated together to form the group, and then the group contents are written into an in-memory buffer for the WAL before being written to disk.

The second criticism is about the thread synchronization points. Let's consider what happens to a commit which becomes the leader:

1. Lock commit mutex
2. Wait to become leader
3. Form (concatenate) batch group and write to the WAL
4. Notify followers to apply their batch to the memtable
5. Apply own batch to memtable
6. Wait for followers to finish
7. Bump visible sequence number
8. Unlock commit mutex
9. Notify followers that the commit is complete

The follower's set of operations looks like:

1. Lock commit mutex
2. Wait to become follower
3. Wait to be notified that it is time to apply batch
4. Unlock commit mutex
5. Apply batch to memtable
6. Wait to be notified that commit is complete

The thread synchronization points (all of the waits and notifies) are overhead. Reducing that overhead can improve performance.

The Pebble commit pipeline addresses both criticisms. The main innovation is a commit queue that mirrors the commit order. The Pebble commit pipeline looks like:

1. Lock commit mutex
   * Add batch to commit queue
   * Assign batch sequence number
   * Write batch to the WAL
2. Unlock commit mutex
3. Apply batch to memtable (concurrently)
4. Publish batch sequence number

Pebble does not use the concept of a batch group. Each batch is individually written to the WAL, but note that the WAL write is just a memory copy into an internal buffer in the WAL.

Step 4 deserves further scrutiny as it is where the invariant on the visible batch sequence number is maintained. Publishing the batch sequence number cannot simply bump the visible sequence number, because batches with earlier sequence numbers may still be applying to the memtable. If we were to ratchet the visible sequence number without waiting for those applies to finish, a concurrent reader could see partial batch contents. Note that RocksDB has experimented with allowing these semantics with its unordered writes option.

We want to retain the atomic visibility of batch commits. The publish batch sequence number step needs to ensure that we don't ratchet the visible sequence number until all batches with earlier sequence numbers have applied. Enter the commit queue: a lock-free single-producer, multi-consumer queue. Batches are added to the commit queue with the commit mutex held, ensuring the same order as the sequence number assignment. After a batch finishes applying to the memtable, the committer atomically marks the batch as applied. It then removes the prefix of applied batches from the commit queue, bumping the visible sequence number and marking each removed batch as committed (via a `sync.WaitGroup`). If the first batch in the commit queue has not been applied, we wait for our batch to be committed, relying on another concurrent committer to perform the visible sequence number ratcheting for our batch. We know a concurrent commit is taking place because, if there were only one batch committing, it would be at the head of the commit queue.
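
The following sketch shows the shape of the publish step. Pebble's actual commit queue is a lock-free single-producer, multi-consumer ring buffer; the sketch below uses a mutex-protected slice to keep the logic visible, and all names are illustrative:

```go
import (
	"sync"
	"sync/atomic"
)

// commitBatch is a stand-in for a committing batch. Its commit WaitGroup is
// Add(1)'d when the batch is appended to the queue under the commit mutex.
type commitBatch struct {
	seqNum  uint64 // first sequence number assigned to the batch
	count   uint64 // number of records in the batch
	applied atomic.Bool
	commit  sync.WaitGroup
}

type commitPipeline struct {
	visibleSeqNum atomic.Uint64
	mu            sync.Mutex
	queue         []*commitBatch // batches in sequence number order
}

// publish is called by a committer after applying its batch to the memtable.
// It pops the prefix of applied batches off the commit queue, ratcheting the
// visible sequence number for each one, and then waits until its own batch
// has been published (possibly by a concurrent committer doing the popping).
func (p *commitPipeline) publish(b *commitBatch) {
	b.applied.Store(true)
	p.mu.Lock()
	for len(p.queue) > 0 && p.queue[0].applied.Load() {
		head := p.queue[0]
		p.queue = p.queue[1:]
		// Make every record in head visible, and wake its committer.
		p.visibleSeqNum.Store(head.seqNum + head.count - 1)
		head.commit.Done()
	}
	p.mu.Unlock()
	b.commit.Wait()
}
```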
There are two possibilities when publishing a sequence number. The first is that there is an unapplied batch at the head of the queue. Consider the following scenario where we're trying to publish the sequence number for batch `B`:

```
+---------------+-------------+---------------+-----+
| A (unapplied) | B (applied) | C (unapplied) | ... |
+---------------+-------------+---------------+-----+
```

The publish routine will see that `A` is unapplied and then simply wait for `B`'s done `sync.WaitGroup` to be signalled. This is safe because `A` must still be committing. And if `A` has concurrently been marked as applied, the goroutine publishing `A` will then publish `B`. What happens when `A` publishes its sequence number? The commit queue state becomes:

```
+-------------+-------------+---------------+-----+
| A (applied) | B (applied) | C (unapplied) | ... |
+-------------+-------------+---------------+-----+
```

The publish routine pops `A` from the queue, ratchets the sequence number, then pops `B` and ratchets the sequence number again, and then finds `C` and stops. An important detail to notice is that the committer for batch `B` didn't have to do any more work. An alternative approach would be to have `B` wake up and ratchet its own sequence number, but that would serialize the remainder of the commit queue behind that goroutine waking up.

The commit queue reduces the number of thread synchronization operations required to commit a batch. There is no leader to notify and no followers to wait for. A commit either publishes its own sequence number, or performs one synchronization operation to wait for a concurrent committer to publish its sequence number.

## Range Deletions

Deletion of an individual key in RocksDB and Pebble is accomplished by writing a deletion tombstone. A deletion tombstone shadows an existing value for a key, causing reads to treat the key as not present. The deletion tombstone mechanism works well for deleting small sets of keys, but what happens if you want to delete all of the keys within a range of keys that might number in the thousands or millions? A range deletion is an operation which deletes an entire range of keys with a single record. In contrast to a point deletion tombstone, which specifies a single key, a range deletion tombstone (a.k.a. range tombstone) specifies a start key (inclusive) and an end key (exclusive). This single record is much faster to write than thousands or millions of point deletion tombstones, and can be done blindly -- without iterating over the keys that need to be deleted. The downside to range tombstones is that they require additional processing during reads. How the processing of range tombstones is done significantly affects both the complexity of the implementation and the efficiency of read operations in the presence of range tombstones.

A range tombstone is composed of a start key, end key, and sequence number. Any key that falls within the range is considered deleted if the key's sequence number is less than the range tombstone's sequence number. RocksDB stores range tombstones segregated from point operations in a special range deletion block within each sstable. Conceptually, the range tombstones stored within an sstable are truncated to the boundaries of the sstable, though there are complexities that cause this to not actually be physically true.
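
A minimal sketch of a range tombstone and the coverage check just described (illustrative types, not the actual RocksDB or Pebble definitions):

```go
import "bytes"

// RangeTombstone deletes every key in [Start, End) that was written with a
// sequence number lower than SeqNum.
type RangeTombstone struct {
	Start, End []byte // Start is inclusive, End is exclusive
	SeqNum     uint64
}

// Deletes reports whether the tombstone deletes a key written at keySeqNum.
func (t RangeTombstone) Deletes(key []byte, keySeqNum uint64) bool {
	if bytes.Compare(key, t.Start) < 0 || bytes.Compare(key, t.End) >= 0 {
		return false // key is outside [Start, End)
	}
	return keySeqNum < t.SeqNum
}
```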
In RocksDB, the main structure implementing range tombstone processing is the `RangeDelAggregator`. Each read operation and iterator has its own `RangeDelAggregator` configured for the sequence number the read is taking place at. The initial implementation of `RangeDelAggregator` built up a "skyline" for the range tombstones visible at the read sequence number.

```
10     +---+
 9     |   |
 8     |   |
 7     |   +----+
 6     |        |
 5   +-+        |  +----+
 4   | |        |  |    |
 3   | |        |  |    +---+
 2   | |        |  |        |
 1   | |        |  |        |
 0   | |        |  |        |
    abcdefghijklmnopqrstuvwxyz
```

The above diagram shows the skyline created for the range tombstones `[b,j)#5`, `[d,h)#10`, `[f,m)#7`, `[p,u)#5`, and `[t,y)#3`. The skyline is queried for each key read to see if the key should be considered deleted or not. The skyline structure is stored in a binary tree, making the queries an O(log n) operation in the number of tombstones, though there is an optimization to make this O(1) for `next`/`prev` iteration. Note that the skyline representation loses information about the range tombstones. This requires the structure to be rebuilt on every read, which has a significant performance impact.

The initial skyline range tombstone implementation has since been replaced with a more efficient lookup structure. See the [DeleteRange](https://rocksdb.org/blog/2018/11/21/delete-range.html) blog post for a good description of both the original implementation and the new (v2) implementation. The key change in the new implementation is to "fragment" the range tombstones that are stored in an sstable. The fragmented range tombstones provide the same benefit as the skyline representation: the ability to binary search the fragments in order to find the tombstone covering a key. But unlike the skyline approach, the fragmented tombstones can be cached on a per-sstable basis. In the v2 approach, `RangeDelAggregator` keeps track of the fragmented range tombstones for each sstable encountered during a read or iterator, and logically merges them together.

Fragmenting range tombstones involves splitting range tombstones at overlap points. Let's consider the tombstones in the skyline example above:

```
10:    d---h
 7:      f------m
 5:  b-------j     p----u
 3:                    t----y
```

Fragmenting the range tombstones at the overlap points creates a larger number of range tombstones:

```
10:    d-f-h
 7:      f-h-j--m
 5:  b-d-f-h-j     p---tu
 3:                    tu---y
```

While the number of tombstones is larger, there is a significant advantage: we can order the tombstones by their start key and then binary search to find the set of tombstones overlapping a particular point. This is possible because, due to the fragmenting, all the tombstones that overlap a range of keys have the same start and end key. The v2 `RangeDelAggregator` and associated classes perform fragmentation of the range tombstones stored in each sstable, and those fragmented tombstones are then cached.

In summary, in RocksDB `RangeDelAggregator` acts as an oracle for answering whether a key is deleted at a particular sequence number. Due to the caching of fragmented tombstones, the v2 implementation of `RangeDelAggregator` is significantly faster to populate than v1, yet the overall approach to processing range tombstones remains similar.
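
The fragmentation step itself can be sketched directly: split the tombstones at every boundary key so that any two output fragments either have identical bounds or do not overlap at all. The code below is illustrative (and quadratic for clarity) rather than the actual fragmenter:

```go
import (
	"bytes"
	"sort"
)

type tombstone struct {
	start, end []byte
	seqNum     uint64
}

// fragment splits the input tombstones at every start/end boundary. In the
// output, any two fragments that overlap have exactly the same bounds, which
// is what allows binary searching fragments by start key.
func fragment(tombstones []tombstone) []tombstone {
	// Collect and sort the boundary keys.
	var bounds [][]byte
	for _, t := range tombstones {
		bounds = append(bounds, t.start, t.end)
	}
	sort.Slice(bounds, func(i, j int) bool { return bytes.Compare(bounds[i], bounds[j]) < 0 })

	var frags []tombstone
	for i := 0; i+1 < len(bounds); i++ {
		lo, hi := bounds[i], bounds[i+1]
		if bytes.Equal(lo, hi) {
			continue // duplicate boundary
		}
		// Emit a fragment for every tombstone that covers [lo, hi).
		for _, t := range tombstones {
			if bytes.Compare(t.start, lo) <= 0 && bytes.Compare(t.end, hi) >= 0 {
				frags = append(frags, tombstone{start: lo, end: hi, seqNum: t.seqNum})
			}
		}
	}
	return frags
}
```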
Pebble takes a different approach: it integrates range tombstone processing directly into the `mergingIter` structure. `mergingIter` is the internal structure which provides a merged view of the levels in an LSM. RocksDB has a similar class named `MergingIterator`. Internally, `mergingIter` maintains a heap over the levels in the LSM (note that each memtable and L0 table is a separate "level" in `mergingIter`). In RocksDB, `MergingIterator` knows nothing about range tombstones, and it is thus up to higher-level code to process range tombstones using `RangeDelAggregator`.

While the separation of `MergingIterator` and range tombstones seems reasonable at first glance, there is an optimization that RocksDB does not perform which is awkward with the `RangeDelAggregator` approach: skipping swaths of deleted keys. A range tombstone often shadows more than one key. Rather than iterating over the deleted keys, it is much quicker to seek to the end point of the range tombstone. The challenge in implementing this optimization is that a key might be newer than the range tombstone and thus shouldn't be skipped. An insight to be utilized is that the level structure itself provides sufficient information: a range tombstone at `Ln` is guaranteed to be newer than any key it overlaps in `Ln+1`.

Pebble utilizes the insight above to integrate range deletion processing with `mergingIter`. A `mergingIter` maintains a point iterator and a range deletion iterator per level in the LSM. In this context, every L0 table is a separate level, as is every memtable. Within a level, when a range deletion contains a point operation the sequence numbers must be checked to determine if the point operation is newer or older than the range deletion tombstone. The `mergingIter` maintains the invariant that the range deletion iterators for all levels newer than the current iteration key are positioned at the next (or previous, during reverse iteration) range deletion tombstone. We know those levels don't contain a range deletion tombstone that covers the current key, because if they did the current key would be deleted. The range deletion iterator for the current key's level is positioned at a range tombstone covering or past the current key. The position of all of the other range deletion iterators is unspecified. Whenever a key from those levels becomes the current key, their range deletion iterators need to be positioned. This lazy positioning avoids seeking the range deletion iterators for keys that are never considered.

For a full example, consider the following setup:

```
p0:               o
r0:             m---q

p1:              n p
r1:       g---k

p2:  b d    i
r2: a---e           q----v

p3:     e
r3:
```

The diagram above shows 4 levels, with `pX` indicating the point operations in a level and `rX` indicating the range tombstones.
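
Before walking through the example, the core check implied by the invariant above can be sketched as follows, reusing the `RangeTombstone` type from the earlier sketch. Level 0 is the newest level, and only the levels at or newer than the current key's level need to be consulted; the names are illustrative:

```go
import "bytes"

// levelIter bundles a level's point iterator with its range deletion iterator.
// Per the invariant above, rangeDel for levels newer than the current key's
// level points at the next tombstone at or past the current key; for the
// key's own level it points at a tombstone covering or past the key.
type levelIter struct {
	rangeDel *RangeTombstone // current tombstone for the level, or nil if none
}

// keyShadowed reports whether the current key, which came from
// levels[keyLevel] with sequence number keySeqNum, is deleted by a range
// tombstone.
func keyShadowed(levels []levelIter, keyLevel int, key []byte, keySeqNum uint64) bool {
	for i := 0; i <= keyLevel; i++ {
		t := levels[i].rangeDel
		if t == nil || bytes.Compare(key, t.Start) < 0 || bytes.Compare(key, t.End) >= 0 {
			continue // no tombstone in this level covers the key
		}
		if i < keyLevel {
			// A tombstone in a newer level is always newer than the key.
			return true
		}
		// Same level: compare sequence numbers.
		return keySeqNum < t.SeqNum
	}
	return false
}
```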
If we start iterating from the beginning, the first key we encounter is `b` in `p2`. When the `mergingIter` is pointing at a valid entry, the range deletion iterators for all of the levels less than the current key's level are positioned at the next range tombstone past the current key. So `r0` will point at `[m,q)` and `r1` at `[g,k)`. When the key `b` is encountered, we check to see if the current tombstone for `r0` or `r1` contains it, and whether the tombstone for `r2`, `[a,e)`, contains and is newer than `b`.

Advancing the iterator finds the next key at `d`. This is in the same level as the previous key `b`, so we don't have to reposition any of the range deletion iterators, but merely check whether `d` is now contained by any of the range tombstones at higher levels or has stepped past the range tombstone in its own level. In this case, there is nothing to be done.

Advancing the iterator again finds `e`. Since `e` comes from `p3`, we have to position the `r3` range deletion iterator, which is empty. `e` is past the `r2` tombstone of `[a,e)`, so we need to advance the `r2` range deletion iterator to `[q,v)`.

The next key is `i`. Because this key is in `p2`, a level above `e`, we don't have to reposition any range deletion iterators and instead see that `i` is covered by the range tombstone `[g,k)`. The iterator is immediately advanced to `n`, which is covered by the range tombstone `[m,q)`, causing the iterator to advance to `o`, which is visible.

## Flush and Compaction Pacing

Flushes and compactions in LSM trees are problematic because they contend with foreground traffic, resulting in write and read latency spikes. Without throttling the rate of flushes and compactions, they occur "as fast as possible" (which is not entirely true, since we have a `bytes_per_sync` option). This instantaneous usage of CPU and disk IO results in potentially huge latency spikes for writes and reads which occur in parallel to the flushes and compactions.

RocksDB attempts to solve this issue by offering an option to limit the speed of flushes and compactions. A maximum `bytes/sec` can be specified through the options, and background IO usage will be limited to the specified amount. Flushes are given priority over compactions, but they still use the same rate limiter. Though simple to implement and understand, this option is fragile for various reasons:

1) If the rate limit is configured too low, the DB will stall and write throughput will be affected.
2) If the rate limit is configured too high, the write and read latency spikes will persist.
3) A different configuration is needed per system depending on the speed of the storage device.
4) Write rates typically do not stay the same throughout the lifetime of the DB (higher throughput during certain times of the day, etc.), but the rate limit cannot be reconfigured at runtime.

RocksDB also offers an ["auto-tuned" rate limiter](https://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html) which uses a simple multiplicative-increase, multiplicative-decrease algorithm to dynamically adjust the background IO rate limit depending on how much of the rate limiter has been exhausted in an interval. This solves the problem of having a static rate limit, but Pebble attempts to improve on this with a different pacing mechanism.
Pebble's pacing mechanism uses separate rate limiters for flushes and compactions. Both the flush and compaction pacing mechanisms work by attempting to flush and compact only as fast as needed and no faster. This is achieved differently for flushes versus compactions.

For flush pacing, Pebble keeps the rate at which the memtable is flushed in line with the rate of user writes. This ensures that the disk IO used by flushes remains steady. When a mutable memtable becomes full and is marked immutable, it is typically flushed as fast as possible. Instead of flushing as fast as possible, Pebble looks at the total number of bytes in all the memtables (mutable + queue of immutables) and subtracts the number of bytes that have been flushed in the current flush. This number gives us the total number of bytes which remain to be flushed. If we keep this number steady at a constant level, we have the invariant that the flush rate is equal to the write rate.

When the number of bytes remaining to be flushed falls below our target level, we slow down the speed of flushing. We keep a minimum rate at which the memtable is flushed so that flushes proceed even if writes have stopped. When the number of bytes remaining to be flushed goes above our target level, we allow the flush to proceed as fast as possible, without applying any rate limiting. However, note that the second case would indicate that writes are occurring faster than the memtable can flush, which would be an unsustainable rate. The LSM would soon hit the memtable count stall condition and writes would be completely stopped.

For compaction pacing, Pebble uses an estimation of compaction debt, which is the number of bytes which need to be compacted before no further compactions are needed. This estimation is calculated by looking at the number of bytes that have been flushed by the current flush routine, adding those bytes to the size of the level 0 sstables, then seeing how many bytes exceed the target number of bytes for the level 0 sstables. We multiply the number of bytes exceeded by the level ratio and add that number to the compaction debt estimate. We repeat this process until the final level, which gives us a final compaction debt estimate for the entire LSM tree.
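
A sketch of that calculation, generalized down the LSM under the assumption that the excess bytes at each level spill into the next one; the function and parameter names are illustrative and the real accounting differs in detail:

```go
// estimateCompactionDebt estimates the number of bytes that must be compacted
// before no further compactions are needed. flushingBytes is the number of
// bytes flushed so far by the in-progress flush, levelSizes[i] and
// levelTargets[i] are the current and target sizes of level i, and levelRatio
// is the size ratio between adjacent levels.
func estimateCompactionDebt(flushingBytes int64, levelSizes, levelTargets []int64, levelRatio int64) int64 {
	var debt int64
	// Bytes arriving at L0: the in-progress flush plus the existing L0 sstables.
	incoming := flushingBytes
	for level := 0; level < len(levelSizes); level++ {
		incoming += levelSizes[level]
		excess := incoming - levelTargets[level]
		if excess <= 0 {
			break // this level is within its target; no further debt accrues
		}
		// Compacting the excess into the next level rewrites roughly
		// excess * levelRatio bytes.
		debt += excess * levelRatio
		// The excess bytes spill into the next level and are counted again there.
		incoming = excess
	}
	return debt
}
```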
Like with flush pacing, we want to keep the compaction debt at a constant level. This ensures that compactions occur only as fast as needed and no faster. If the compaction debt estimate falls below our target level, we slow down compactions. We maintain a minimum compaction rate so that compactions proceed even if flushes have stopped. If the compaction debt goes above our target level, we let compactions proceed as fast as possible without any rate limiting. Just like with flush pacing, this would indicate that writes are occurring faster than the background compactions can keep up with, which is an unsustainable rate. The LSM's read amplification would increase and the L0 file count stall condition would be hit.

With the combined flush and compaction pacing mechanisms, flushes and compactions only occur as fast as needed and no faster, which reduces latency spikes for user read and write operations.

## Write Throttling

RocksDB adds artificial delays to user writes when certain thresholds are met, such as `l0_slowdown_writes_threshold`. These artificial delays are applied when the system is close to stalling, in order to lessen the write pressure so that flushing and compactions can catch up. On the surface this seems good, since write stalls would seemingly be eliminated and replaced with gradual slowdowns. Closed-loop write latency benchmarks would show the elimination of abrupt write stalls, which seems desirable.

However, this doesn't do anything to improve latencies in an open-loop model, which is the model more likely to resemble real-world use cases. Artificial delays increase write latencies without a clear benefit. Write stalls in an open-loop system would indicate that writes are generated faster than the system could possibly handle, which adding artificial delays won't solve.

For this reason, Pebble doesn't add artificial delays to user writes, and writes are served as quickly as possible.

## Other Differences

* `internalIterator` API which minimizes indirect (virtual) function calls
* Previous pointers in the memtable and indexed batch skiplists
* Elision of per-key lower/upper bound checks in long range scans
* Improved `Iterator` API
  + `SeekPrefixGE` for prefix iteration
  + `SetBounds` for adjusting the bounds on an existing `Iterator`
* Simpler `Get` implementation