     1  # Pebble vs RocksDB: Implementation Differences
     2  
     3  RocksDB is a key-value store implemented using a Log-Structured
     4  Merge-Tree (LSM). This document is not a primer on LSMs. There exist
     5  some decent
     6  [introductions](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/)
     7  on the web, or try chapter 3 of [Designing Data-Intensive
     8  Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321).
     9  
    10  Pebble inherits the RocksDB file formats, has a similar API, and
    11  shares many implementation details, but it also has many differences
    12  that improve performance, reduce implementation complexity, or extend
    13  functionality. This document highlights some of the more important
    14  differences.
    15  
    16  * [Internal Keys](#internal-keys)
    17  * [Indexed Batches](#indexed-batches)
    18  * [Large Batches](#large-batches)
    19  * [Commit Pipeline](#commit-pipeline)
    20  * [Range Deletions](#range-deletions)
    21  * [Flush and Compaction Pacing](#flush-and-compaction-pacing)
    22  * [Write Throttling](#write-throttling)
    23  * [Other Differences](#other-differences)
    24  
    25  ## Internal Keys
    26  
    27  The external RocksDB API accepts keys and values. Due to the LSM
    28  structure, keys are never updated in place, but overwritten with new
    29  versions. Inside RocksDB, these versioned keys are known as Internal
    30  Keys. An Internal Key is composed of the user specified key, a
    31  sequence number and a kind. On disk, sstables always store Internal
    32  Keys.
    33  
    34  ```
    35    +-------------+------------+----------+
    36    | UserKey (N) | SeqNum (7) | Kind (1) |
    37    +-------------+------------+----------+
    38  ```
    39  
    40  The `Kind` field indicates the type of key: set, merge, delete, etc.
    41  
    42  While Pebble inherits the Internal Key encoding for format
    43  compatibility, it diverges from RocksDB in how it manages Internal
    44  Keys in its implementation. In RocksDB, Internal Keys are represented
    45  either in encoded form (as a string) or as a `ParsedInternalKey`. The
    46  latter is a struct with the components of the Internal Key as three
    47  separate fields.
    48  
    49  ```c++
    50  struct ParsedInternalKey {
    51    Slice  user_key;
    52    uint64 seqnum;
    53    uint8  kind;
};
    55  ```
    56  
    57  The component format is convenient: changing the `SeqNum` or `Kind` is
    58  field assignment. Extracting the `UserKey` is a field
    59  reference. However, RocksDB tends to only use `ParsedInternalKey`
    60  locally. The major internal APIs, such as `InternalIterator`, operate
    61  using encoded internal keys (i.e. strings) for parameters and return
    62  values.
    63  
    64  To give a concrete example of the overhead this causes, consider
    65  `Iterator::Seek(user_key)`. The external `Iterator` is implemented on
    66  top of an `InternalIterator`. `Iterator::Seek` ends up calling
    67  `InternalIterator::Seek`. Both Seek methods take a key, but
    68  `InternalIterator::Seek` expects an encoded Internal Key. This is both
    69  error prone and expensive. The key passed to `Iterator::Seek` needs to
    70  be copied into a temporary string in order to append the `SeqNum` and
    71  `Kind`. In Pebble, Internal Keys are represented in memory using an
    72  `InternalKey` struct that is the analog of `ParsedInternalKey`. All
    73  internal APIs use `InternalKeys`, with the exception of the lowest
    74  level routines for decoding data from sstables. In Pebble, since the
    75  interfaces all take and return the `InternalKey` struct, we don’t need
    76  to allocate to construct the Internal Key from the User Key, but
    77  RocksDB sometimes needs to allocate, and encode (i.e. make a
    78  copy). The use of the encoded form also causes RocksDB to pass encoded
    79  keys to the comparator routines, sometimes decoding the keys multiple
    80  times during the course of processing.
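
To make the trailer encoding concrete, the following Go sketch (illustrative only, not Pebble's actual declarations) packs the 56-bit sequence number and 1-byte kind and appends them after the user key:

```go
package keysketch

import "encoding/binary"

// internalKey is a simplified analog of Pebble's InternalKey: the user key
// plus a trailer that packs the 56-bit sequence number and the 1-byte kind.
type internalKey struct {
	userKey []byte
	trailer uint64 // (seqNum << 8) | kind
}

func makeInternalKey(userKey []byte, seqNum uint64, kind uint8) internalKey {
	return internalKey{userKey: userKey, trailer: seqNum<<8 | uint64(kind)}
}

func (k internalKey) seqNum() uint64 { return k.trailer >> 8 }
func (k internalKey) kind() uint8    { return uint8(k.trailer) }

// encode appends the encoded form: the user key followed by the 8-byte
// little-endian trailer.
func (k internalKey) encode(buf []byte) []byte {
	buf = append(buf, k.userKey...)
	var tmp [8]byte
	binary.LittleEndian.PutUint64(tmp[:], k.trailer)
	return append(buf, tmp[:]...)
}
```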
    81  
    82  ## Indexed Batches
    83  
    84  In RocksDB, a batch is the unit for all write operations. Even writing
    85  a single key is transformed internally to a batch. The batch internal
    86  representation is a contiguous byte buffer with a fixed 12-byte
    87  header, followed by a series of records.
    88  
    89  ```
    90    +------------+-----------+--- ... ---+
    91    | SeqNum (8) | Count (4) |  Entries  |
    92    +------------+-----------+--- ... ---+
    93  ```
    94  
    95  Each record has a 1-byte kind tag prefix, followed by 1 or 2 length
    96  prefixed strings (varstring):
    97  
    98  ```
    99    +----------+-----------------+-------------------+
   100    | Kind (1) | Key (varstring) | Value (varstring) |
   101    +----------+-----------------+-------------------+
   102  ```
   103  
   104  (The `Kind` indicates if there are 1 or 2 varstrings. `Set`, `Merge`,
   105  and `DeleteRange` have 2 varstrings, while `Delete` has 1.)
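
To make the record format concrete, here is a rough Go sketch of appending records in this encoding (the length prefixes are varints; the kind tag values are illustrative assumptions rather than the authoritative constants):

```go
package batchsketch

import "encoding/binary"

// Illustrative kind tags; the actual values are fixed by the batch format.
const (
	kindDelete byte = 0
	kindSet    byte = 1
)

// appendVarstring appends a uvarint length prefix followed by the bytes.
func appendVarstring(buf, s []byte) []byte {
	buf = binary.AppendUvarint(buf, uint64(len(s)))
	return append(buf, s...)
}

// appendSet appends a Set record: kind tag, key varstring, value varstring.
func appendSet(buf, key, value []byte) []byte {
	buf = append(buf, kindSet)
	buf = appendVarstring(buf, key)
	return appendVarstring(buf, value)
}

// appendDelete appends a Delete record: kind tag and key varstring only.
func appendDelete(buf, key []byte) []byte {
	buf = append(buf, kindDelete)
	return appendVarstring(buf, key)
}
```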
   106  
   107  Adding a mutation to a batch involves appending a new record to the
   108  buffer. This format is extremely fast for writes, but the lack of
   109  indexing makes it untenable to use directly for reads. In order to
   110  support iteration, a separate indexing structure is created. Both
   111  RocksDB and Pebble use a skiplist for the indexing structure, but with
   112  a clever twist. Rather than the skiplist storing a copy of the key, it
   113  simply stores the offset of the record within the mutation buffer. The
result is that the skiplist acts as a multi-map (i.e. a map that can have
   115  duplicate entries for a given key). The iteration order for this map
   116  is constructed so that records sort on key, and for equal keys they
   117  sort on descending offset. Newer records for the same key appear
   118  before older records.
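
A sketch of the comparison such an offset-based index might use is shown below (illustrative; Pebble's actual implementation lives in its `batchskl` package and differs in detail):

```go
package batchsketch

import (
	"bytes"
	"encoding/binary"
)

// keyAtOffset returns the user key of the record starting at off: it skips
// the 1-byte kind tag, reads the uvarint key length, and slices out the key.
func keyAtOffset(data []byte, off uint32) []byte {
	i := int(off) + 1
	n, w := binary.Uvarint(data[i:])
	i += w
	return data[i : i+int(n)]
}

// compareOffsets orders batch records by user key and, for equal keys, by
// descending offset so that newer records for a key sort first.
func compareOffsets(data []byte, a, b uint32) int {
	if c := bytes.Compare(keyAtOffset(data, a), keyAtOffset(data, b)); c != 0 {
		return c
	}
	switch {
	case a > b:
		return -1 // newer record (larger offset) sorts first
	case a < b:
		return 1
	default:
		return 0
	}
}
```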
   119  
   120  While the indexing structure for batches is nearly identical between
   121  RocksDB and Pebble, how the index structure is used is completely
   122  different. In RocksDB, a batch is indexed using the
   123  `WriteBatchWithIndex` class. The `WriteBatchWithIndex` class provides
   124  a `NewIteratorWithBase` method that allows iteration over the merged
   125  view of the batch contents and an underlying "base" iterator created
   126  from the database. `BaseDeltaIterator` contains logic to iterate over
the batch entries and the base iterator in parallel, which allows us to
   128  perform reads on a snapshot of the database as though the batch had
   129  been applied to it. On the surface this sounds reasonable, yet the
   130  implementation is incomplete. Merge and DeleteRange operations are not
   131  supported. The reason they are not supported is because handling them
   132  is complex and requires duplicating logic that already exists inside
   133  RocksDB for normal iterator processing.
   134  
   135  Pebble takes a different approach to iterating over a merged view of a
   136  batch's contents and the underlying database: it treats the batch as
   137  another level in the LSM. Recall that an LSM is composed of zero or
   138  more memtable layers and zero or more sstable layers. Internally, both
   139  RocksDB and Pebble contain a `MergingIterator` that knows how to merge
   140  the operations from different levels, including processing overwritten
   141  keys, merge operations, and delete range operations. The challenge
   142  with treating the batch as another level to be used by a
   143  `MergingIterator` is that the records in a batch do not have a
   144  sequence number. The sequence number in the batch header is not
   145  assigned until the batch is committed. The solution is to give the
   146  batch records temporary sequence numbers. We need these temporary
   147  sequence numbers to be larger than any other sequence number in the
   148  database so that the records in the batch are considered newer than
   149  any committed record. This is accomplished by reserving the high-bit
   150  in the 56-bit sequence number for use as a marker for batch sequence
   151  numbers. The sequence number for a record in an uncommitted batch is:
   152  
   153  ```
   154    RecordOffset | (1<<55)
   155  ```
   156  
   157  Newer records in a given batch will have a larger sequence number than
   158  older records in the batch. And all of the records in a batch will
   159  have larger sequence numbers than any committed record in the
   160  database.
   161  
   162  The end result is that Pebble's batch iterators support all of the
   163  functionality of regular database iterators with minimal additional
   164  code.
   165  
   166  ## Large Batches
   167  
   168  The size of a batch is limited only by available memory, yet the
   169  required memory is not just the batch representation. When a batch is
   170  committed, the commit operation iterates over the records in the batch
   171  from oldest to newest and inserts them into the current memtable. The
   172  memtable is an in-memory structure that buffers mutations that have
   173  been committed (written to the Write Ahead Log), but not yet written
   174  to an sstable. Internally, a memtable uses a skiplist to index
   175  records. Each skiplist entry has overhead for the index links and
   176  other metadata that is a dozen bytes at minimum. A large batch
   177  composed of many small records can require twice as much memory when
inserted into a memtable as it required in the batch. And note that
   179  this causes a temporary increase in memory requirements because the
   180  batch memory is not freed until it is completely committed.
   181  
   182  A non-obvious implementation restriction present in both RocksDB and
   183  Pebble is that there is a one-to-one correspondence between WAL files
   184  and memtables. That is, a given WAL file has a single memtable
   185  associated with it and vice-versa. While this restriction could be
   186  removed, doing so is onerous and intricate. It should also be noted
   187  that committing a batch involves writing it to a single WAL file. The
   188  combination of restrictions results in a batch needing to be written
   189  entirely to a single memtable.
   190  
   191  What happens if a batch is too large to fit in a memtable?  Memtables
   192  are generally considered to have a fixed size, yet this is not
   193  actually true in RocksDB. In RocksDB, the memtable skiplist is
   194  implemented on top of an arena structure. An arena is composed of a
   195  list of fixed size chunks, with no upper limit set for the number of
   196  chunks that can be associated with an arena. So RocksDB handles large
   197  batches by allowing a memtable to grow beyond its configured
   198  size. Concretely, while RocksDB may be configured with a 64MB memtable
size, a 1GB batch will cause the memtable to grow to accommodate
   200  it. Functionally, this is good, though there is a practical problem: a
   201  large batch is first written to the WAL, and then added to the
   202  memtable. Adding the large batch to the memtable may consume so much
   203  memory that the system runs out of memory and is killed by the
kernel. This can result in a death loop because, upon restarting, the
batch is read from the WAL and applied to the memtable again.
   206  
   207  In Pebble, the memtable is also implemented using a skiplist on top of
   208  an arena. Significantly, the Pebble arena is a fixed size. While the
   209  RocksDB skiplist uses pointers, the Pebble skiplist uses offsets from
   210  the start of the arena. The fixed size arena means that the Pebble
   211  memtable cannot expand arbitrarily. A batch that is too large to fit
   212  in the memtable causes the current mutable memtable to be marked as
   213  immutable and the batch is wrapped in a `flushableBatch` structure and
   214  added to the list of immutable memtables. Because the `flushableBatch`
   215  is readable as another layer in the LSM, the batch commit can return
   216  as soon as the `flushableBatch` has been added to the immutable
   217  memtable list.
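
The routing decision is, roughly, a size check (a sketch only; the threshold below is illustrative, while Pebble derives the actual cutoff from the configured memtable size):

```go
package commitsketch

// tooLargeForMemtable sketches the check that routes a batch around the
// memtable: if the batch representation cannot reasonably fit in the
// fixed-size arena, it is wrapped in a flushableBatch and queued as an
// immutable memtable instead of being inserted record by record.
func tooLargeForMemtable(batchLen, memTableSize uint64) bool {
	return batchLen >= memTableSize/2
}
```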
   218  
   219  Internally, a `flushableBatch` provides iterator support by sorting
   220  the batch contents (the batch is sorted once, when it is added to the
   221  memtable list). Sorting the batch contents and insertion of the
   222  contents into a memtable have the same big-O time, but the constant
   223  factor dominates here. Sorting is significantly faster and uses
   224  significantly less memory due to not having to copy the batch records.
   225  
   226  Note that an effect of this large batch support is that Pebble can be
   227  configured as an efficient on-disk sorter: specify a small memtable
   228  size, disable the WAL, and set a large L0 compaction threshold. In
   229  order to sort a large amount of data, create batches that are larger
   230  than the memtable size and commit them. When committed these batches
   231  will not be inserted into a memtable, but instead sorted and then
   232  written out to L0. The fully sorted data can later be read and the
   233  normal merging process will take care of the final ordering.
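
A rough sketch of such a configuration using Pebble's public API (error handling trimmed; the directory and the specific values are illustrative, not recommendations):

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Small memtable, no WAL, and a high L0 compaction threshold: committed
	// batches that exceed the memtable size are sorted and written directly
	// to L0 rather than inserted into a memtable.
	db, err := pebble.Open("/tmp/sort-scratch", &pebble.Options{
		MemTableSize:          8 << 20, // 8 MB
		DisableWAL:            true,
		L0CompactionThreshold: 1000,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Commit a batch that is larger than the memtable size.
	b := db.NewBatch()
	for i := 0; i < 1_000_000; i++ {
		if err := b.Set([]byte(fmt.Sprintf("key-%09d", i)), nil, nil); err != nil {
			log.Fatal(err)
		}
	}
	if err := db.Apply(b, pebble.NoSync); err != nil {
		log.Fatal(err)
	}
	// The data can later be read back in sorted order with a regular
	// iterator; the normal merging process handles the final ordering.
}
```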
   234  
   235  ## Commit Pipeline
   236  
   237  The commit pipeline is the component which manages the steps in
   238  committing write batches, such as writing the batch to the WAL and
   239  applying its contents to the memtable. While simple conceptually, the
   240  commit pipeline is crucial for high performance. In the absence of
   241  concurrency, commit performance is limited by how fast a batch can be
   242  written (and synced) to the WAL and then added to the memtable, both
   243  of which are outside of the purview of the commit pipeline.
   244  
   245  To understand the challenge here, it is useful to have a conception of
   246  the WAL (write-ahead log). The WAL contains a record of all of the
   247  batches that have been committed to the database. As a record is
   248  written to the WAL it is added to the memtable. Each record is
   249  assigned a sequence number which is used to distinguish newer updates
   250  from older ones. Conceptually the WAL looks like:
   251  
   252  ```
   253  +--------------------------------------+
   254  | Batch(SeqNum=1,Count=9,Records=...)  |
   255  +--------------------------------------+
   256  | Batch(SeqNum=10,Count=5,Records=...) |
   257  +--------------------------------------+
   258  | Batch(SeqNum=15,Count=7,Records...)  |
   259  +--------------------------------------+
   260  | ...                                  |
   261  +--------------------------------------+
   262  ```
   263  
   264  Note that each WAL entry is precisely the batch representation
   265  described earlier in the [Indexed Batches](#indexed-batches)
   266  section. The monotonically increasing sequence numbers are a critical
   267  component in allowing RocksDB and Pebble to provide fast snapshot
   268  views of the database for reads.
   269  
   270  If concurrent performance was not a concern, the commit pipeline could
   271  simply be a mutex which serialized writes to the WAL and application
   272  of the batch records to the memtable. Concurrent performance is a
   273  concern, though.
   274  
   275  The primary challenge in concurrent performance in the commit pipeline
   276  is maintaining two invariants:
   277  
   278  1. Batches need to be written to the WAL in sequence number order.
   279  2. Batches need to be made visible for reads in sequence number
   280     order. This invariant arises from the use of a single sequence
   281     number which indicates which mutations are visible.
   282  
   283  The second invariant deserves explanation. RocksDB and Pebble both
   284  keep track of a visible sequence number. This is the sequence number
   285  for which records in the database are visible during reads. The
   286  visible sequence number exists because committing a batch is an atomic
   287  operation, yet adding records to the memtable is done without an
   288  exclusive lock (the skiplists used by both Pebble and RocksDB are
   289  lock-free). When the records from a batch are being added to the
   290  memtable, a concurrent read operation may see those records, but will
   291  skip over them because they are newer than the visible sequence
   292  number. Once all of the records in the batch have been added to the
   293  memtable, the visible sequence number is atomically incremented.
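
In sketch form, the check a read performs while scanning the memtable is simply the following (ignoring the batch sequence number marker described in the [Indexed Batches](#indexed-batches) section):

```go
package commitsketch

// visible reports whether a record may be returned by a read whose snapshot
// is the visible sequence number captured when the read began. Records from
// a batch that is still being applied carry larger sequence numbers and are
// skipped.
func visible(recordSeqNum, readSeqNum uint64) bool {
	return recordSeqNum < readSeqNum
}
```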
   294  
   295  So we have four steps in committing a write batch:
   296  
   297  1. Write the batch to the WAL
   298  2. Apply the mutations in the batch to the memtable
   299  3. Bump the visible sequence number
   300  4. (Optionally) sync the WAL
   301  
   302  Writing the batch to the WAL is actually very fast as it is just a
   303  memory copy. Applying the mutations in the batch to the memtable is by
   304  far the most CPU intensive part of the commit pipeline. Syncing the
   305  WAL is the most expensive from a wall clock perspective.
   306  
   307  With that background out of the way, let's examine how RocksDB commits
   308  batches. This description is of the traditional commit pipeline in
   309  RocksDB (i.e. the one used by CockroachDB).
   310  
   311  RocksDB achieves concurrency in the commit pipeline by grouping
   312  concurrently committed batches into a batch group. Each group is
   313  assigned a "leader" which is the first batch to be added to the
   314  group. The batch group is written atomically to the WAL by the leader
   315  thread, and then the individual batches making up the group are
   316  concurrently applied to the memtable. Lastly, the visible sequence
   317  number is bumped such that all of the batches in the group become
   318  visible in a single atomic step. While a batch group is being applied,
   319  other concurrent commits are added to a waiting list. When the group
   320  commit finishes, the waiting commits form the next group.
   321  
   322  There are two criticisms of the batch grouping approach. The first is
   323  that forming a batch group involves copying batch contents. RocksDB
   324  partially alleviates this for large batches by placing a limit on the
   325  total size of a group. A large batch will end up in its own group and
   326  not be copied, but the criticism still applies for small batches. Note
   327  that there are actually two copies here. The batch contents are
   328  concatenated together to form the group, and then the group contents
are written into an in-memory buffer for the WAL before being written
   330  to disk.
   331  
   332  The second criticism is about the thread synchronization points. Let's
   333  consider what happens to a commit which becomes the leader:
   334  
   335  1. Lock commit mutex
   336  2. Wait to become leader
   337  3. Form (concatenate) batch group and write to the WAL
   338  4. Notify followers to apply their batch to the memtable
   339  5. Apply own batch to memtable
   340  6. Wait for followers to finish
   341  7. Bump visible sequence number
   342  8. Unlock commit mutex
   343  9. Notify followers that the commit is complete
   344  
   345  The follower's set of operations looks like:
   346  
   347  1. Lock commit mutex
   348  2. Wait to become follower
   349  3. Wait to be notified that it is time to apply batch
   350  4. Unlock commit mutex
   351  5. Apply batch to memtable
   352  6. Wait to be notified that commit is complete
   353  
   354  The thread synchronization points (all of the waits and notifies) are
   355  overhead. Reducing that overhead can improve performance.
   356  
   357  The Pebble commit pipeline addresses both criticisms. The main
   358  innovation is a commit queue that mirrors the commit order. The Pebble
   359  commit pipeline looks like:
   360  
   361  1. Lock commit mutex
   362    * Add batch to commit queue
   363    * Assign batch sequence number
   364    * Write batch to the WAL
   365  2. Unlock commit mutex
   366  3. Apply batch to memtable (concurrently)
   367  4. Publish batch sequence number
   368  
   369  Pebble does not use the concept of a batch group. Each batch is
   370  individually written to the WAL, but note that the WAL write is just a
   371  memory copy into an internal buffer in the WAL.
   372  
   373  Step 4 deserves further scrutiny as it is where the invariant on the
   374  visible batch sequence number is maintained. Publishing the batch
   375  sequence number cannot simply bump the visible sequence number because
   376  batches with earlier sequence numbers may still be applying to the
   377  memtable. If we were to ratchet the visible sequence number without
   378  waiting for those applies to finish, a concurrent reader could see
   379  partial batch contents. Note that RocksDB has experimented with
   380  allowing these semantics with its unordered writes option.
   381  
   382  We want to retain the atomic visibility of batch commits. The publish
   383  batch sequence number step needs to ensure that we don't ratchet the
   384  visible sequence number until all batches with earlier sequence
   385  numbers have applied. Enter the commit queue: a lock-free
   386  single-producer, multi-consumer queue. Batches are added to the commit
   387  queue with the commit mutex held, ensuring the same order as the
sequence number assignment. After a batch finishes applying to the
memtable, its committer atomically marks the batch as applied. It then
removes the prefix of applied batches from the commit queue, bumping
the visible sequence number and marking each removed batch as
committed (via a `sync.WaitGroup`). If the first batch in the commit
queue has not been applied, we wait for our own batch to be committed,
relying on another concurrent committer to perform the visible
sequence number ratcheting for our batch. We know a concurrent commit
is taking place because if there were only one batch committing it
would be at the head of the commit queue.
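
The following is a heavily simplified sketch of the publish step. A mutex-guarded slice stands in for the lock-free commit queue, and the names are illustrative rather than Pebble's actual ones:

```go
package commitsketch

import (
	"sync"
	"sync/atomic"
)

type commitBatch struct {
	seqNum    uint64
	count     uint64      // number of records in the batch
	applied   atomic.Bool // set once the batch is fully in the memtable
	committed sync.WaitGroup
}

type commitPipeline struct {
	mu            sync.Mutex
	queue         []*commitBatch // in sequence number (commit) order
	visibleSeqNum atomic.Uint64
}

// enqueue runs with the commit mutex held, in the same critical section that
// assigns the batch its sequence number and copies it into the WAL buffer.
func (p *commitPipeline) enqueue(b *commitBatch) {
	b.committed.Add(1)
	p.queue = append(p.queue, b)
}

// publish is called after b has been applied to the memtable.
func (p *commitPipeline) publish(b *commitBatch) {
	b.applied.Store(true)

	// Pop the prefix of applied batches, ratcheting the visible sequence
	// number and signalling each popped batch as committed. If the head of
	// the queue is unapplied, its committer will eventually do this for us.
	p.mu.Lock()
	for len(p.queue) > 0 && p.queue[0].applied.Load() {
		head := p.queue[0]
		p.queue = p.queue[1:]
		p.visibleSeqNum.Store(head.seqNum + head.count)
		head.committed.Done()
	}
	p.mu.Unlock()

	// Wait until some committer (possibly this one, above) has published
	// our batch.
	b.committed.Wait()
}
```

The real implementation uses a lock-free ring buffer and is careful about concurrent dequeuers, but the ordering argument is the same as in the sketch.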
   398  
   399  There are two possibilities when publishing a sequence number. The
   400  first is that there is an unapplied batch at the head of the
   401  queue. Consider the following scenario where we're trying to publish
   402  the sequence number for batch `B`.
   403  
   404  ```
   405    +---------------+-------------+---------------+-----+
   406    | A (unapplied) | B (applied) | C (unapplied) | ... |
   407    +---------------+-------------+---------------+-----+
   408  ```
   409  
   410  The publish routine will see that `A` is unapplied and then simply
wait for `B`'s done `sync.WaitGroup` to be signalled. This is safe
   412  because `A` must still be committing. And if `A` has concurrently been
   413  marked as applied, the goroutine publishing `A` will then publish
   414  `B`. What happens when `A` publishes its sequence number? The commit
   415  queue state becomes:
   416  
   417  ```
   418    +-------------+-------------+---------------+-----+
   419    | A (applied) | B (applied) | C (unapplied) | ... |
   420    +-------------+-------------+---------------+-----+
   421  ```
   422  
   423  The publish routine pops `A` from the queue, ratchets the sequence
   424  number, then pops `B` and ratchets the sequence number again, and then
finds `C` and stops. An important detail to notice is that
   426  the committer for batch `B` didn't have to do any more work. An
   427  alternative approach would be to have `B` wakeup and ratchet its own
   428  sequence number, but that would serialize the remainder of the commit
   429  queue behind that goroutine waking up.
   430  
   431  The commit queue reduces the number of thread synchronization
   432  operations required to commit a batch. There is no leader to notify,
   433  or followers to wait for. A commit either publishes its own sequence
   434  number, or performs one synchronization operation to wait for a
   435  concurrent committer to publish its sequence number.
   436  
   437  ## Range Deletions
   438  
   439  Deletion of an individual key in RocksDB and Pebble is accomplished by
   440  writing a deletion tombstone. A deletion tombstone shadows an existing
   441  value for a key, causing reads to treat the key as not present. The
   442  deletion tombstone mechanism works well for deleting small sets of
keys, but what happens if you want to delete all of the keys within a range
   444  of keys that might number in the thousands or millions? A range
   445  deletion is an operation which deletes an entire range of keys with a
   446  single record. In contrast to a point deletion tombstone which
   447  specifies a single key, a range deletion tombstone (a.k.a. range
   448  tombstone) specifies a start key (inclusive) and an end key
   449  (exclusive). This single record is much faster to write than thousands
   450  or millions of point deletion tombstones, and can be done blindly --
   451  without iterating over the keys that need to be deleted. The downside
   452  to range tombstones is that they require additional processing during
   453  reads. How the processing of range tombstones is done significantly
   454  affects both the complexity of the implementation, and the efficiency
   455  of read operations in the presence of range tombstones.
   456  
   457  A range tombstone is composed of a start key, end key, and sequence
   458  number. Any key that falls within the range is considered deleted if
   459  the key's sequence number is less than the range tombstone's sequence
   460  number. RocksDB stores range tombstones segregated from point
   461  operations in a special range deletion block within each sstable.
   462  Conceptually, the range tombstones stored within an sstable are
   463  truncated to the boundaries of the sstable, though there are
   464  complexities that cause this to not actually be physically true.
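
In sketch form (keys compared bytewise; the types are illustrative):

```go
package rangedelsketch

import "bytes"

// rangeTombstone deletes every key in [start, end) written at a sequence
// number lower than its own.
type rangeTombstone struct {
	start, end []byte
	seqNum     uint64
}

// deletes reports whether the tombstone shadows a key written at keySeqNum.
func (t rangeTombstone) deletes(key []byte, keySeqNum uint64) bool {
	return bytes.Compare(t.start, key) <= 0 &&
		bytes.Compare(key, t.end) < 0 &&
		keySeqNum < t.seqNum
}
```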
   465  
   466  In RocksDB, the main structure implementing range tombstone processing
   467  is the `RangeDelAggregator`. Each read operation and iterator has its
   468  own `RangeDelAggregator` configured for the sequence number the read
   469  is taking place at. The initial implementation of `RangeDelAggregator`
   470  built up a "skyline" for the range tombstones visible at the read
   471  sequence number.
   472  
   473  ```
   474  10   +---+
   475   9   |   |
   476   8   |   |
   477   7   |   +----+
   478   6   |        |
   479   5 +-+        |  +----+
   480   4 |          |  |    |
   481   3 |          |  |    +---+
   482   2 |          |  |        |
   483   1 |          |  |        |
   484   0 |          |  |        |
   485    abcdefghijklmnopqrstuvwxyz
   486  ```
   487  
   488  The above diagram shows the skyline created for the range tombstones
   489  `[b,j)#5`, `[d,h)#10`, `[f,m)#7`, `[p,u)#5`, and `[t,y)#3`. The
   490  skyline is queried for each key read to see if the key should be
   491  considered deleted or not. The skyline structure is stored in a binary
tree, making the queries an O(log n) operation in the number of
   493  tombstones, though there is an optimization to make this O(1) for
   494  `next`/`prev` iteration. Note that the skyline representation loses
   495  information about the range tombstones. This requires the structure to
be rebuilt on every read, which has a significant performance impact.
   497  
   498  The initial skyline range tombstone implementation has since been
   499  replaced with a more efficient lookup structure. See the
   500  [DeleteRange](https://rocksdb.org/blog/2018/11/21/delete-range.html)
   501  blog post for a good description of both the original implementation
   502  and the new (v2) implementation. The key change in the new
   503  implementation is to "fragment" the range tombstones that are stored
   504  in an sstable. The fragmented range tombstones provide the same
   505  benefit as the skyline representation: the ability to binary search
   506  the fragments in order to find the tombstone covering a key. But
   507  unlike the skyline approach, the fragmented tombstones can be cached
   508  on a per-sstable basis. In the v2 approach, `RangeDelAggregator` keeps
   509  track of the fragmented range tombstones for each sstable encountered
   510  during a read or iterator, and logically merges them together.
   511  
   512  Fragmenting range tombstones involves splitting range tombstones at
   513  overlap points. Let's consider the tombstones in the skyline example
   514  above:
   515  
   516  ```
   517  10:   d---h
   518   7:     f------m
   519   5: b-------j     p----u
   520   3:                   t----y
   521  ```
   522  
   523  Fragmenting the range tombstones at the overlap points creates a
   524  larger number of range tombstones:
   525  
   526  ```
   527  10:   d-f-h
   528   7:     f-h-j--m
   529   5: b-d-f-h-j     p---tu
   530   3:                   tu---y
   531  ```
   532  
   533  While the number of tombstones is larger there is a significant
   534  advantage: we can order the tombstones by their start key and then
   535  binary search to find the set of tombstones overlapping a particular
point. This is possible because, due to the fragmenting, all of the
tombstones that overlap a given point will have the same start and
   538  end key. The v2 `RangeDelAggregator` and associated classes perform
   539  fragmentation of range tombstones stored in each sstable and those
   540  fragmented tombstones are then cached.
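
A sketch of the lookup that fragmentation enables, given fragments sorted by start key (illustrative; the real fragmenters and cached structures carry more state):

```go
package rangedelsketch

import (
	"bytes"
	"sort"
)

// fragment is a fragmented range tombstone. After fragmentation, any two
// fragments either share both boundaries or do not overlap at all.
type fragment struct {
	start, end []byte
	seqNum     uint64
}

// coveringSeqNum returns the sequence number of the newest fragment visible
// at readSeqNum that contains key, or 0 if none does. A point key is deleted
// if this value is greater than the key's own sequence number.
func coveringSeqNum(frags []fragment, key []byte, readSeqNum uint64) uint64 {
	// Binary search for the first fragment whose start key is beyond key;
	// only the run of fragments sharing the previous start key can contain it.
	i := sort.Search(len(frags), func(i int) bool {
		return bytes.Compare(frags[i].start, key) > 0
	})
	var newest uint64
	for j := i - 1; j >= 0 && bytes.Equal(frags[j].start, frags[i-1].start); j-- {
		f := frags[j]
		if bytes.Compare(key, f.end) < 0 && f.seqNum < readSeqNum && f.seqNum > newest {
			newest = f.seqNum
		}
	}
	return newest
}
```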
   541  
   542  In summary, in RocksDB `RangeDelAggregator` acts as an oracle for
   543  answering whether a key is deleted at a particular sequence
   544  number. Due to caching of fragmented tombstones, the v2 implementation
of `RangeDelAggregator` is significantly faster to
   546  populate than v1, yet the overall approach to processing range
   547  tombstones remains similar.
   548  
Pebble takes a different approach: it integrates range tombstone
   550  processing directly into the `mergingIter` structure. `mergingIter` is
   551  the internal structure which provides a merged view of the levels in
   552  an LSM. RocksDB has a similar class named
   553  `MergingIterator`. Internally, `mergingIter` maintains a heap over the
   554  levels in the LSM (note that each memtable and L0 table is a separate
   555  "level" in `mergingIter`). In RocksDB, `MergingIterator` knows nothing
   556  about range tombstones, and it is thus up to higher-level code to
   557  process range tombstones using `RangeDelAggregator`.
   558  
   559  While the separation of `MergingIterator` and range tombstones seems
   560  reasonable at first glance, there is an optimization that RocksDB does
   561  not perform which is awkward with the `RangeDelAggregator` approach:
   562  skipping swaths of deleted keys. A range tombstone often shadows more
   563  than one key. Rather than iterating over the deleted keys, it is much
   564  quicker to seek to the end point of the range tombstone. The challenge
   565  in implementing this optimization is that a key might be newer than
the range tombstone and thus shouldn't be skipped. The key insight is
that the level structure itself provides sufficient
   568  information. A range tombstone at `Ln` is guaranteed to be newer than
   569  any key it overlaps in `Ln+1`.
   570  
   571  Pebble utilizes the insight above to integrate range deletion
   572  processing with `mergingIter`. A `mergingIter` maintains a point
   573  iterator and a range deletion iterator per level in the LSM. In this
   574  context, every L0 table is a separate level, as is every
   575  memtable. Within a level, when a range deletion contains a point
operation, the sequence numbers must be checked to determine if the
   577  point operation is newer or older than the range deletion
   578  tombstone. The `mergingIter` maintains the invariant that the range
deletion iterators for all levels newer than the current iteration key's level
   580  are positioned at the next (or previous during reverse iteration)
   581  range deletion tombstone. We know those levels don't contain a range
   582  deletion tombstone that covers the current key because if they did the
   583  current key would be deleted. The range deletion iterator for the
   584  current key's level is positioned at a range tombstone covering or
past the current key. The position of all of the other range deletion
   586  iterators is unspecified. Whenever a key from those levels becomes the
   587  current key, their range deletion iterators need to be
   588  positioned. This lazy positioning avoids seeking the range deletion
   589  iterators for keys that are never considered.
   590  
   591  For a full example, consider the following setup:
   592  
   593  ```
   594    p0:               o
   595    r0:             m---q
   596  
   597    p1:              n p
   598    r1:       g---k
   599  
   600    p2:  b d    i
   601    r2: a---e           q----v
   602  
   603    p3:     e
   604    r3:
   605  ```
   606  
The diagram above shows 4 levels, with `pX` indicating the
   608  point operations in a level and `rX` indicating the range tombstones.
   609  
   610  If we start iterating from the beginning, the first key we encounter
is `b` in `p2`. When the `mergingIter` is pointing at a valid entry, the
range deletion iterators for all of the levels less than the current
   613  key's level are positioned at the next range tombstone past the
   614  current key. So `r0` will point at `[m,q)` and `r1` at `[g,k)`. When
   615  the key `b` is encountered, we check to see if the current tombstone
   616  for `r0` or `r1` contains it, and whether the tombstone for `r2`,
   617  `[a,e)`, contains and is newer than `b`.
   618  
   619  Advancing the iterator finds the next key at `d`. This is in the same
   620  level as the previous key `b` so we don't have to reposition any of
   621  the range deletion iterators, but merely check whether `d` is now
   622  contained by any of the range tombstones at higher levels or has
   623  stepped past the range tombstone in its own level. In this case, there
   624  is nothing to be done.
   625  
   626  Advancing the iterator again finds `e`. Since `e` comes from `p3`, we
   627  have to position the `r3` range deletion iterator, which is empty. `e`
   628  is past the `r2` tombstone of `[a,e)` so we need to advance the `r2`
   629  range deletion iterator to `[q,v)`.
   630  
The next key is `i`. Because this key is in `p2`, a level above the
one `e` came from, we don't have to reposition any range deletion
iterators and instead
   633  see that `i` is covered by the range tombstone `[g,k)`. The iterator
   634  is immediately advanced to `n` which is covered by the range tombstone
   635  `[m,q)` causing the iterator to advance to `o` which is visible.
   636  
   637  ## Flush and Compaction Pacing
   638  
   639  Flushes and compactions in LSM trees are problematic because they
   640  contend with foreground traffic, resulting in write and read latency
   641  spikes. Without throttling the rate of flushes and compactions, they
   642  occur "as fast as possible" (which is not entirely true, since we
   643  have a `bytes_per_sync` option). This instantaneous usage of CPU and
   644  disk IO results in potentially huge latency spikes for writes and
   645  reads which occur in parallel to the flushes and compactions.
   646  
   647  RocksDB attempts to solve this issue by offering an option to limit
   648  the speed of flushes and compactions. A maximum `bytes/sec` can be
   649  specified through the options, and background IO usage will be limited
   650  to the specified amount. Flushes are given priority over compactions,
   651  but they still use the same rate limiter. Though simple to implement
   652  and understand, this option is fragile for various reasons.
   653  
   654  1) If the rate limit is configured too low, the DB will stall and
   655  write throughput will be affected.
   656  2) If the rate limit is configured too high, the write and read
   657  latency spikes will persist.
   658  3) A different configuration is needed per system depending on the
   659  speed of the storage device.
   660  4) Write rates typically do not stay the same throughout the lifetime
   661  of the DB (higher throughput during certain times of the day, etc) but
   662  the rate limit cannot be configured during runtime.
   663  
   664  RocksDB also offers an
   665  ["auto-tuned" rate limiter](https://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html)
   666  which uses a simple multiplicative-increase, multiplicative-decrease
   667  algorithm to dynamically adjust the background IO rate limit depending
   668  on how much of the rate limiter has been exhausted in an interval.
   669  This solves the problem of having a static rate limit, but Pebble
   670  attempts to improve on this with a different pacing mechanism.
   671  
   672  Pebble's pacing mechanism uses separate rate limiters for flushes and
   673  compactions. Both the flush and compaction pacing mechanisms work by
   674  attempting to flush and compact only as fast as needed and no faster.
   675  This is achieved differently for flushes versus compactions.
   676  
   677  For flush pacing, Pebble keeps the rate at which the memtable is
   678  flushed at the same rate as user writes. This ensures that disk IO
   679  used by flushes remains steady. When a mutable memtable becomes full
   680  and is marked immutable, it is typically flushed as fast as possible.
Instead of flushing as fast as possible, we look at the
   682  total number of bytes in all the memtables (mutable + queue of
   683  immutables) and subtract the number of bytes that have been flushed in
   684  the current flush. This number gives us the total number of bytes
   685  which remain to be flushed. If we keep this number steady at a constant
   686  level, we have the invariant that the flush rate is equal to the write
   687  rate.
   688  
   689  When the number of bytes remaining to be flushed falls below our
   690  target level, we slow down the speed of flushing. We keep a minimum
   691  rate at which the memtable is flushed so that flushes proceed even if
   692  writes have stopped. When the number of bytes remaining to be flushed
   693  goes above our target level, we allow the flush to proceed as fast as
   694  possible, without applying any rate limiting. However, note that the
   695  second case would indicate that writes are occurring faster than the
   696  memtable can flush, which would be an unsustainable rate. The LSM
   697  would soon hit the memtable count stall condition and writes would be
   698  completely stopped.
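
A sketch of the resulting control decision (names are illustrative; Pebble's actual pacing code is structured around a rate limiter rather than a single function):

```go
package pacingsketch

// flushRate returns the pacing decision for the current flush based on how
// many bytes remain to be flushed across all memtables (mutable plus the
// queue of immutables), minus the bytes already written by this flush.
//
// Above the target the flush runs unthrottled (writes are outpacing
// flushing, which is unsustainable anyway); below the target it falls back
// to a slow minimum rate so flushing still progresses when writes stop.
func flushRate(totalMemtableBytes, bytesFlushed, targetBytes, minBytesPerSec, maxBytesPerSec uint64) uint64 {
	remaining := totalMemtableBytes - bytesFlushed
	if remaining >= targetBytes {
		return maxBytesPerSec // effectively no rate limiting
	}
	return minBytesPerSec
}
```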
   699  
   700  For compaction pacing, Pebble uses an estimation of compaction debt,
   701  which is the number of bytes which need to be compacted before no
   702  further compactions are needed. This estimation is calculated by
   703  looking at the number of bytes that have been flushed by the current
   704  flush routine, adding those bytes to the size of the level 0 sstables,
   705  then seeing how many bytes exceed the target number of bytes for the
   706  level 0 sstables. We multiply the number of bytes exceeded by the
   707  level ratio and add that number to the compaction debt estimate.
   708  We repeat this process until the final level, which gives us a final
   709  compaction debt estimate for the entire LSM tree.
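
A rough sketch of that estimate, paraphrasing the description above (not Pebble's actual code):

```go
package pacingsketch

// estimateCompactionDebt estimates how many bytes of compaction work remain.
// Bytes written by the in-progress flush are counted as if already in L0;
// at each level, bytes in excess of the level's target must be compacted
// into the next level, costing roughly excess * levelRatio and counting
// toward the next level's size.
func estimateCompactionDebt(levelBytes, levelTarget []uint64, bytesFlushed, levelRatio uint64) uint64 {
	var debt uint64
	carry := bytesFlushed
	for level := range levelBytes {
		size := levelBytes[level] + carry
		carry = 0
		if size <= levelTarget[level] {
			continue
		}
		excess := size - levelTarget[level]
		debt += excess * levelRatio
		carry = excess
	}
	return debt
}
```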
   710  
   711  Like with flush pacing, we want to keep the compaction debt at a
   712  constant level. This ensures that compactions occur only as fast as
   713  needed and no faster. If the compaction debt estimate falls below our
   714  target level, we slow down compactions. We maintain a minimum
   715  compaction rate so that compactions proceed even if flushes have
   716  stopped. If the compaction debt goes above our target level, we let
   717  compactions proceed as fast as possible without any rate limiting.
   718  Just like with flush pacing, this would indicate that writes are
occurring faster than background compactions can keep up,
   720  which is an unsustainable rate. The LSM's read amplification would
   721  increase and the L0 file count stall condition would be hit.
   722  
   723  With the combined flush and compaction pacing mechanisms, flushes and
   724  compactions only occur as fast as needed and no faster, which reduces
   725  latency spikes for user read and write operations.
   726  
## Write Throttling
   728  
   729  RocksDB adds artificial delays to user writes when certain thresholds
   730  are met, such as `l0_slowdown_writes_threshold`. These artificial
   731  delays occur when the system is close to stalling to lessen the write
   732  pressure so that flushing and compactions can catch up. On the surface
   733  this seems good, since write stalls would seemingly be eliminated and
   734  replaced with gradual slowdowns. Closed loop write latency benchmarks
   735  would show the elimination of abrupt write stalls, which seems
   736  desirable.
   737  
   738  However, this doesn't do anything to improve latencies in an open loop
   739  model, which is the model more likely to resemble real world use
   740  cases. Artificial delays increase write latencies without a clear
benefit. Write stalls in an open loop system would indicate that
   742  writes are generated faster than the system could possibly handle,
   743  which adding artificial delays won't solve.
   744  
   745  For this reason, Pebble doesn't add artificial delays to user writes
   746  and writes are served as quickly as possible.
   747  
## Other Differences
   749  
   750  * `internalIterator` API which minimizes indirect (virtual) function
   751    calls
   752  * Previous pointers in the memtable and indexed batch skiplists
   753  * Elision of per-key lower/upper bound checks in long range scans
   754  * Improved `Iterator` API
   755    + `SeekPrefixGE` for prefix iteration
   756    + `SetBounds` for adjusting the bounds on an existing `Iterator`
   757  * Simpler `Get` implementation