
     1  # Pebble vs RocksDB: Implementation Differences
     2  
     3  RocksDB is a key-value store implemented using a Log-Structured
     4  Merge-Tree (LSM). This document is not a primer on LSMs. There exist
     5  some decent
     6  [introductions](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/)
     7  on the web, or try chapter 3 of [Designing Data-Intensive
     8  Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321).
     9  
    10  Pebble inherits the RocksDB file formats, has a similar API, and
    11  shares many implementation details, but it also has many differences
    12  that improve performance, reduce implementation complexity, or extend
    13  functionality. This document highlights some of the more important
    14  differences.
    15  
    16  * [Internal Keys](#internal-keys)
    17  * [Indexed Batches](#indexed-batches)
    18  * [Large Batches](#large-batches)
    19  * [Commit Pipeline](#commit-pipeline)
    20  * [Range Deletions](#range-deletions)
    21  * [Flush and Compaction Pacing](#flush-and-compaction-pacing)
    22  * [Write Throttling](#write-throttling)
    23  * [Other Differences](#other-differences)
    24  
    25  ## Internal Keys
    26  
    27  The external RocksDB API accepts keys and values. Due to the LSM
    28  structure, keys are never updated in place, but overwritten with new
    29  versions. Inside RocksDB, these versioned keys are known as Internal
Keys. An Internal Key is composed of the user-specified key, a
sequence number, and a kind. On disk, sstables always store Internal
    32  Keys.
    33  
    34  ```
    35    +-------------+------------+----------+
    36    | UserKey (N) | SeqNum (7) | Kind (1) |
    37    +-------------+------------+----------+
    38  ```
    39  
    40  The `Kind` field indicates the type of key: set, merge, delete, etc.
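
To make the layout concrete, here is a minimal Go sketch of the
encoding, assuming the trailer packs the sequence number into the upper
56 bits and the kind into the low byte of a little-endian 64-bit value
appended to the user key. The kind constants are illustrative
placeholders, not the exact values used by the file format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Illustrative kind values; the real on-disk numbering is defined by the
// RocksDB/Pebble file format and is not reproduced here.
const (
	kindDelete uint8 = 0
	kindSet    uint8 = 1
	kindMerge  uint8 = 2
)

// encodeInternalKey appends an 8-byte trailer to the user key, with the
// sequence number in the upper 56 bits and the kind in the low byte,
// mirroring the UserKey/SeqNum/Kind layout in the diagram above.
func encodeInternalKey(userKey []byte, seqNum uint64, kind uint8) []byte {
	buf := make([]byte, 0, len(userKey)+8)
	buf = append(buf, userKey...)
	var trailer [8]byte
	binary.LittleEndian.PutUint64(trailer[:], seqNum<<8|uint64(kind))
	return append(buf, trailer[:]...)
}

func main() {
	ik := encodeInternalKey([]byte("user-key"), 42, kindSet)
	fmt.Printf("%d bytes: % x\n", len(ik), ik)
}
```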
    41  
    42  While Pebble inherits the Internal Key encoding for format
    43  compatibility, it diverges from RocksDB in how it manages Internal
    44  Keys in its implementation. In RocksDB, Internal Keys are represented
    45  either in encoded form (as a string) or as a `ParsedInternalKey`. The
    46  latter is a struct with the components of the Internal Key as three
    47  separate fields.
    48  
    49  ```c++
    50  struct ParsedInternalKey {
    51    Slice  user_key;
    52    uint64 seqnum;
    53    uint8  kind;
    54  };
    55  ```
    56  
The component format is convenient: changing the `SeqNum` or `Kind` is a
field assignment, and extracting the `UserKey` is a field
    59  reference. However, RocksDB tends to only use `ParsedInternalKey`
    60  locally. The major internal APIs, such as `InternalIterator`, operate
    61  using encoded internal keys (i.e. strings) for parameters and return
    62  values.
    63  
    64  To give a concrete example of the overhead this causes, consider
    65  `Iterator::Seek(user_key)`. The external `Iterator` is implemented on
    66  top of an `InternalIterator`. `Iterator::Seek` ends up calling
    67  `InternalIterator::Seek`. Both Seek methods take a key, but
    68  `InternalIterator::Seek` expects an encoded Internal Key. This is both
    69  error prone and expensive. The key passed to `Iterator::Seek` needs to
    70  be copied into a temporary string in order to append the `SeqNum` and
    71  `Kind`. In Pebble, Internal Keys are represented in memory using an
    72  `InternalKey` struct that is the analog of `ParsedInternalKey`. All
    73  internal APIs use `InternalKeys`, with the exception of the lowest
    74  level routines for decoding data from sstables. In Pebble, since the
    75  interfaces all take and return the `InternalKey` struct, we don’t need
    76  to allocate to construct the Internal Key from the User Key, but
    77  RocksDB sometimes needs to allocate, and encode (i.e. make a
    78  copy). The use of the encoded form also causes RocksDB to pass encoded
    79  keys to the comparator routines, sometimes decoding the keys multiple
    80  times during the course of processing.
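
The following Go sketch shows why the struct form is cheaper. The names
are illustrative rather than Pebble's exact API: building a seek key is
just a struct literal wrapping the caller's user key, and the comparator
compares the components directly instead of repeatedly decoding an
encoded key.

```go
package main

import (
	"bytes"
	"fmt"
)

// Rough analog of Pebble's InternalKey / RocksDB's ParsedInternalKey.
type internalKey struct {
	UserKey []byte // points at caller-owned bytes; no copy required
	SeqNum  uint64
	Kind    uint8
}

// compareInternalKeys orders internal keys the way the LSM requires:
// ascending by user key and, for equal user keys, descending by
// (sequence number, kind) so that newer entries sort first.
func compareInternalKeys(a, b internalKey) int {
	if c := bytes.Compare(a.UserKey, b.UserKey); c != 0 {
		return c
	}
	at := a.SeqNum<<8 | uint64(a.Kind)
	bt := b.SeqNum<<8 | uint64(b.Kind)
	switch {
	case at > bt:
		return -1 // newer sorts first
	case at < bt:
		return 1
	}
	return 0
}

func main() {
	// A seek key for user key "foo" at sequence number 10: no allocation
	// or encoding, just a struct literal.
	seek := internalKey{UserKey: []byte("foo"), SeqNum: 10}
	old := internalKey{UserKey: []byte("foo"), SeqNum: 5, Kind: 1}
	fmt.Println(compareInternalKeys(seek, old)) // -1: seek sorts before the older entry
}
```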
    81  
    82  ## Indexed Batches
    83  
    84  In RocksDB, a batch is the unit for all write operations. Even writing
    85  a single key is transformed internally to a batch. The batch internal
    86  representation is a contiguous byte buffer with a fixed 12-byte
    87  header, followed by a series of records.
    88  
    89  ```
    90    +------------+-----------+--- ... ---+
    91    | SeqNum (8) | Count (4) |  Entries  |
    92    +------------+-----------+--- ... ---+
    93  ```
    94  
    95  Each record has a 1-byte kind tag prefix, followed by 1 or 2 length
    96  prefixed strings (varstring):
    97  
    98  ```
    99    +----------+-----------------+-------------------+
   100    | Kind (1) | Key (varstring) | Value (varstring) |
   101    +----------+-----------------+-------------------+
   102  ```
   103  
   104  (The `Kind` indicates if there are 1 or 2 varstrings. `Set`, `Merge`,
   105  and `DeleteRange` have 2 varstrings, while `Delete` has 1.)
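
As a sketch of how cheap this format is to produce, appending a `Set`
record is just a handful of appends. The kind value is a placeholder and
the varint encoding is simplified.

```go
package batchsketch

import "encoding/binary"

// appendSetRecord appends a Set record to a batch buffer using the
// layout in the diagram above: a kind byte followed by a varint
// length-prefixed key and value.
func appendSetRecord(buf, key, value []byte) []byte {
	const kindSet = 1 // illustrative placeholder
	buf = append(buf, kindSet)
	buf = binary.AppendUvarint(buf, uint64(len(key)))
	buf = append(buf, key...)
	buf = binary.AppendUvarint(buf, uint64(len(value)))
	return append(buf, value...)
}
```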
   106  
   107  Adding a mutation to a batch involves appending a new record to the
   108  buffer. This format is extremely fast for writes, but the lack of
   109  indexing makes it untenable to use directly for reads. In order to
   110  support iteration, a separate indexing structure is created. Both
   111  RocksDB and Pebble use a skiplist for the indexing structure, but with
   112  a clever twist. Rather than the skiplist storing a copy of the key, it
   113  simply stores the offset of the record within the mutation buffer. The
result is that the skiplist acts as a multi-map (i.e. a map that can have
   115  duplicate entries for a given key). The iteration order for this map
   116  is constructed so that records sort on key, and for equal keys they
   117  sort on descending offset. Newer records for the same key appear
   118  before older records.
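
A rough Go sketch of that ordering, assuming a `keyAt` helper that
decodes the user key of the record starting at a given offset:

```go
package batchsketch

import "bytes"

// batchIndexLess sketches the ordering used by the batch index skiplist:
// the skiplist stores offsets into the batch buffer rather than keys,
// keys are decoded on demand, and ties are broken by descending offset
// so that newer records for the same key sort first.
func batchIndexLess(data []byte, aOff, bOff uint32,
	keyAt func(data []byte, off uint32) []byte) bool {
	if c := bytes.Compare(keyAt(data, aOff), keyAt(data, bOff)); c != 0 {
		return c < 0
	}
	return aOff > bOff // equal keys: the newer (larger) offset sorts first
}
```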
   119  
   120  While the indexing structure for batches is nearly identical between
   121  RocksDB and Pebble, how the index structure is used is completely
   122  different. In RocksDB, a batch is indexed using the
   123  `WriteBatchWithIndex` class. The `WriteBatchWithIndex` class provides
   124  a `NewIteratorWithBase` method that allows iteration over the merged
   125  view of the batch contents and an underlying "base" iterator created
   126  from the database. `BaseDeltaIterator` contains logic to iterate over
the batch entries and the base iterator in parallel, which allows us to
   128  perform reads on a snapshot of the database as though the batch had
   129  been applied to it. On the surface this sounds reasonable, yet the
   130  implementation is incomplete. Merge and DeleteRange operations are not
supported. The reason they are not supported is that handling them
   132  is complex and requires duplicating logic that already exists inside
   133  RocksDB for normal iterator processing.
   134  
   135  Pebble takes a different approach to iterating over a merged view of a
   136  batch's contents and the underlying database: it treats the batch as
   137  another level in the LSM. Recall that an LSM is composed of zero or
   138  more memtable layers and zero or more sstable layers. Internally, both
   139  RocksDB and Pebble contain a `MergingIterator` that knows how to merge
   140  the operations from different levels, including processing overwritten
   141  keys, merge operations, and delete range operations. The challenge
   142  with treating the batch as another level to be used by a
   143  `MergingIterator` is that the records in a batch do not have a
   144  sequence number. The sequence number in the batch header is not
   145  assigned until the batch is committed. The solution is to give the
   146  batch records temporary sequence numbers. We need these temporary
   147  sequence numbers to be larger than any other sequence number in the
   148  database so that the records in the batch are considered newer than
   149  any committed record. This is accomplished by reserving the high-bit
   150  in the 56-bit sequence number for use as a marker for batch sequence
   151  numbers. The sequence number for a record in an uncommitted batch is:
   152  
   153  ```
   154    RecordOffset | (1<<55)
   155  ```
   156  
   157  Newer records in a given batch will have a larger sequence number than
   158  older records in the batch. And all of the records in a batch will
   159  have larger sequence numbers than any committed record in the
   160  database.
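
Expressed in Go, the temporary sequence numbers are a direct
transcription of the formula above:

```go
package batchsketch

// The high bit of the 56-bit sequence number space marks uncommitted
// batch records.
const batchSeqNumBit = uint64(1) << 55

// batchSeqNum returns the temporary sequence number for an uncommitted
// batch record at the given offset within the batch buffer.
func batchSeqNum(recordOffset uint64) uint64 {
	return recordOffset | batchSeqNumBit
}

// isBatchSeqNum reports whether a sequence number refers to an
// uncommitted batch record rather than a committed record.
func isBatchSeqNum(seqNum uint64) bool {
	return seqNum&batchSeqNumBit != 0
}
```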
   161  
   162  The end result is that Pebble's batch iterators support all of the
   163  functionality of regular database iterators with minimal additional
   164  code.
   165  
   166  ## Large Batches
   167  
   168  The size of a batch is limited only by available memory, yet the
   169  required memory is not just the batch representation. When a batch is
   170  committed, the commit operation iterates over the records in the batch
   171  from oldest to newest and inserts them into the current memtable. The
   172  memtable is an in-memory structure that buffers mutations that have
   173  been committed (written to the Write Ahead Log), but not yet written
   174  to an sstable. Internally, a memtable uses a skiplist to index
   175  records. Each skiplist entry has overhead for the index links and
   176  other metadata that is a dozen bytes at minimum. A large batch
   177  composed of many small records can require twice as much memory when
inserted into a memtable as it required in the batch. And note that
   179  this causes a temporary increase in memory requirements because the
   180  batch memory is not freed until it is completely committed.
   181  
   182  A non-obvious implementation restriction present in both RocksDB and
   183  Pebble is that there is a one-to-one correspondence between WAL files
   184  and memtables. That is, a given WAL file has a single memtable
   185  associated with it and vice-versa. While this restriction could be
   186  removed, doing so is onerous and intricate. It should also be noted
   187  that committing a batch involves writing it to a single WAL file. The
   188  combination of restrictions results in a batch needing to be written
   189  entirely to a single memtable.
   190  
   191  What happens if a batch is too large to fit in a memtable?  Memtables
   192  are generally considered to have a fixed size, yet this is not
   193  actually true in RocksDB. In RocksDB, the memtable skiplist is
   194  implemented on top of an arena structure. An arena is composed of a
   195  list of fixed size chunks, with no upper limit set for the number of
   196  chunks that can be associated with an arena. So RocksDB handles large
   197  batches by allowing a memtable to grow beyond its configured
   198  size. Concretely, while RocksDB may be configured with a 64MB memtable
size, a 1GB batch will cause the memtable to grow to accommodate
   200  it. Functionally, this is good, though there is a practical problem: a
   201  large batch is first written to the WAL, and then added to the
   202  memtable. Adding the large batch to the memtable may consume so much
   203  memory that the system runs out of memory and is killed by the
kernel. This can result in a death loop because, upon restarting, the
batch is read from the WAL and applied to the memtable again.
   206  
   207  In Pebble, the memtable is also implemented using a skiplist on top of
   208  an arena. Significantly, the Pebble arena is a fixed size. While the
   209  RocksDB skiplist uses pointers, the Pebble skiplist uses offsets from
   210  the start of the arena. The fixed size arena means that the Pebble
   211  memtable cannot expand arbitrarily. A batch that is too large to fit
   212  in the memtable causes the current mutable memtable to be marked as
   213  immutable and the batch is wrapped in a `flushableBatch` structure and
   214  added to the list of immutable memtables. Because the `flushableBatch`
   215  is readable as another layer in the LSM, the batch commit can return
   216  as soon as the `flushableBatch` has been added to the immutable
   217  memtable list.
   218  
   219  Internally, a `flushableBatch` provides iterator support by sorting
   220  the batch contents (the batch is sorted once, when it is added to the
   221  memtable list). Sorting the batch contents and insertion of the
   222  contents into a memtable have the same big-O time, but the constant
   223  factor dominates here. Sorting is significantly faster and uses
   224  significantly less memory due to not having to copy the batch records.
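
A rough sketch of the idea, with an illustrative layout and an assumed
`keyAt` helper (this is not Pebble's actual implementation): the offsets
of the records are sorted by the keys they point at, so keys and values
are never copied out of the batch buffer.

```go
package batchsketch

import (
	"bytes"
	"sort"
)

// A flushable batch indexes its contents by recording the offset of each
// record and sorting the offsets by the keys they reference.
type flushableBatchSketch struct {
	data    []byte   // the raw batch representation
	offsets []uint32 // one entry per record, sorted by key
}

func newFlushableBatchSketch(data []byte, offsets []uint32,
	keyAt func(data []byte, off uint32) []byte) *flushableBatchSketch {
	b := &flushableBatchSketch{data: data, offsets: offsets}
	sort.Slice(b.offsets, func(i, j int) bool {
		c := bytes.Compare(keyAt(data, b.offsets[i]), keyAt(data, b.offsets[j]))
		if c != 0 {
			return c < 0
		}
		return b.offsets[i] > b.offsets[j] // newer records first for equal keys
	})
	return b
}
```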
   225  
   226  Note that an effect of this large batch support is that Pebble can be
   227  configured as an efficient on-disk sorter: specify a small memtable
   228  size, disable the WAL, and set a large L0 compaction threshold. In
   229  order to sort a large amount of data, create batches that are larger
   230  than the memtable size and commit them. When committed these batches
   231  will not be inserted into a memtable, but instead sorted and then
   232  written out to L0. The fully sorted data can later be read and the
   233  normal merging process will take care of the final ordering.
   234  
   235  ## Commit Pipeline
   236  
   237  The commit pipeline is the component which manages the steps in
   238  committing write batches, such as writing the batch to the WAL and
   239  applying its contents to the memtable. While simple conceptually, the
   240  commit pipeline is crucial for high performance. In the absence of
   241  concurrency, commit performance is limited by how fast a batch can be
   242  written (and synced) to the WAL and then added to the memtable, both
   243  of which are outside of the purview of the commit pipeline.
   244  
   245  To understand the challenge here, it is useful to have a conception of
   246  the WAL (write-ahead log). The WAL contains a record of all of the
   247  batches that have been committed to the database. As a record is
   248  written to the WAL it is added to the memtable. Each record is
   249  assigned a sequence number which is used to distinguish newer updates
   250  from older ones. Conceptually the WAL looks like:
   251  
   252  ```
   253  +--------------------------------------+
   254  | Batch(SeqNum=1,Count=9,Records=...)  |
   255  +--------------------------------------+
   256  | Batch(SeqNum=10,Count=5,Records=...) |
   257  +--------------------------------------+
   258  | Batch(SeqNum=15,Count=7,Records...)  |
   259  +--------------------------------------+
   260  | ...                                  |
   261  +--------------------------------------+
   262  ```
   263  
   264  Note that each WAL entry is precisely the batch representation
   265  described earlier in the [Indexed Batches](#indexed-batches)
   266  section. The monotonically increasing sequence numbers are a critical
   267  component in allowing RocksDB and Pebble to provide fast snapshot
   268  views of the database for reads.
   269  
If concurrent performance were not a concern, the commit pipeline could
   271  simply be a mutex which serialized writes to the WAL and application
   272  of the batch records to the memtable. Concurrent performance is a
   273  concern, though.
   274  
   275  The primary challenge in concurrent performance in the commit pipeline
   276  is maintaining two invariants:
   277  
   278  1. Batches need to be written to the WAL in sequence number order.
   279  2. Batches need to be made visible for reads in sequence number
   280     order. This invariant arises from the use of a single sequence
   281     number which indicates which mutations are visible.
   282  
   283  The second invariant deserves explanation. RocksDB and Pebble both
   284  keep track of a visible sequence number. This is the sequence number
   285  for which records in the database are visible during reads. The
   286  visible sequence number exists because committing a batch is an atomic
   287  operation, yet adding records to the memtable is done without an
   288  exclusive lock (the skiplists used by both Pebble and RocksDB are
   289  lock-free). When the records from a batch are being added to the
   290  memtable, a concurrent read operation may see those records, but will
   291  skip over them because they are newer than the visible sequence
   292  number. Once all of the records in the batch have been added to the
   293  memtable, the visible sequence number is atomically incremented.
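
In code, the read-time visibility check is a single comparison, assuming
the published value is the last committed sequence number:

```go
package commitsketch

// visible reports whether a record can be seen by a read: it must not be
// newer than the published visible sequence number. Uncommitted batch
// records, which have the high bit set, are newer than any committed
// sequence number and are therefore skipped.
func visible(recordSeqNum, visibleSeqNum uint64) bool {
	return recordSeqNum <= visibleSeqNum
}
```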
   294  
   295  So we have four steps in committing a write batch:
   296  
   297  1. Write the batch to the WAL
   298  2. Apply the mutations in the batch to the memtable
   299  3. Bump the visible sequence number
   300  4. (Optionally) sync the WAL
   301  
   302  Writing the batch to the WAL is actually very fast as it is just a
   303  memory copy. Applying the mutations in the batch to the memtable is by
   304  far the most CPU intensive part of the commit pipeline. Syncing the
   305  WAL is the most expensive from a wall clock perspective.
   306  
   307  With that background out of the way, let's examine how RocksDB commits
   308  batches. This description is of the traditional commit pipeline in
   309  RocksDB (i.e. the one used by CockroachDB).
   310  
   311  RocksDB achieves concurrency in the commit pipeline by grouping
   312  concurrently committed batches into a batch group. Each group is
   313  assigned a "leader" which is the first batch to be added to the
   314  group. The batch group is written atomically to the WAL by the leader
   315  thread, and then the individual batches making up the group are
   316  concurrently applied to the memtable. Lastly, the visible sequence
   317  number is bumped such that all of the batches in the group become
   318  visible in a single atomic step. While a batch group is being applied,
   319  other concurrent commits are added to a waiting list. When the group
   320  commit finishes, the waiting commits form the next group.
   321  
   322  There are two criticisms of the batch grouping approach. The first is
   323  that forming a batch group involves copying batch contents. RocksDB
   324  partially alleviates this for large batches by placing a limit on the
   325  total size of a group. A large batch will end up in its own group and
   326  not be copied, but the criticism still applies for small batches. Note
   327  that there are actually two copies here. The batch contents are
   328  concatenated together to form the group, and then the group contents
are written into an in-memory buffer for the WAL before being written
   330  to disk.
   331  
   332  The second criticism is about the thread synchronization points. Let's
   333  consider what happens to a commit which becomes the leader:
   334  
   335  1. Lock commit mutex
   336  2. Wait to become leader
   337  3. Form (concatenate) batch group and write to the WAL
   338  4. Notify followers to apply their batch to the memtable
   339  5. Apply own batch to memtable
   340  6. Wait for followers to finish
   341  7. Bump visible sequence number
   342  8. Unlock commit mutex
   343  9. Notify followers that the commit is complete
   344  
   345  The follower's set of operations looks like:
   346  
   347  1. Lock commit mutex
   348  2. Wait to become follower
   349  3. Wait to be notified that it is time to apply batch
   350  4. Unlock commit mutex
   351  5. Apply batch to memtable
   352  6. Wait to be notified that commit is complete
   353  
   354  The thread synchronization points (all of the waits and notifies) are
   355  overhead. Reducing that overhead can improve performance.
   356  
   357  The Pebble commit pipeline addresses both criticisms. The main
   358  innovation is a commit queue that mirrors the commit order. The Pebble
   359  commit pipeline looks like:
   360  
   361  1. Lock commit mutex
   362    * Add batch to commit queue
   363    * Assign batch sequence number
   364    * Write batch to the WAL
   365  2. Unlock commit mutex
   366  3. Apply batch to memtable (concurrently)
   367  4. Publish batch sequence number
   368  
   369  Pebble does not use the concept of a batch group. Each batch is
   370  individually written to the WAL, but note that the WAL write is just a
   371  memory copy into an internal buffer in the WAL.
   372  
   373  Step 4 deserves further scrutiny as it is where the invariant on the
   374  visible batch sequence number is maintained. Publishing the batch
   375  sequence number cannot simply bump the visible sequence number because
   376  batches with earlier sequence numbers may still be applying to the
   377  memtable. If we were to ratchet the visible sequence number without
   378  waiting for those applies to finish, a concurrent reader could see
   379  partial batch contents. Note that RocksDB has experimented with
   380  allowing these semantics with its unordered writes option.
   381  
   382  We want to retain the atomic visibility of batch commits. The publish
   383  batch sequence number step needs to ensure that we don't ratchet the
   384  visible sequence number until all batches with earlier sequence
   385  numbers have applied. Enter the commit queue: a lock-free
   386  single-producer, multi-consumer queue. Batches are added to the commit
   387  queue with the commit mutex held, ensuring the same order as the
   388  sequence number assignment. After a batch finishes applying to the
memtable, the committer atomically marks its batch as applied. It then
removes the prefix of applied batches from the commit queue, bumps the
visible sequence number, and marks those batches as committed (via a
`sync.WaitGroup`). If the first batch in the commit queue has not been
applied, we wait for our batch to be committed, relying on another
   394  concurrent committer to perform the visible sequence ratcheting for
   395  our batch. We know a concurrent commit is taking place because if
   396  there was only one batch committing it would be at the head of the
   397  commit queue.
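
To make this concrete, here is a simplified Go sketch of the publish
step under the scheme described above. The names are illustrative, and a
mutex-protected slice stands in for Pebble's lock-free queue to keep the
sketch short.

```go
package commitsketch

import (
	"sync"
	"sync/atomic"
)

// commitBatch is a batch moving through the commit pipeline. done is
// signalled once the batch's sequence number has been published (i.e.
// the batch is visible to reads).
type commitBatch struct {
	seqNum  uint64 // first sequence number assigned to the batch
	count   uint64 // number of records in the batch
	applied atomic.Bool
	done    sync.WaitGroup
}

// commitQueue mirrors commit (WAL) order.
type commitQueue struct {
	mu      sync.Mutex
	pending []*commitBatch
	visible atomic.Uint64 // last published sequence number
}

// enqueue is called with the commit mutex held, so batches enter the
// queue in the same order their sequence numbers were assigned.
func (q *commitQueue) enqueue(b *commitBatch) {
	b.done.Add(1)
	q.mu.Lock()
	q.pending = append(q.pending, b)
	q.mu.Unlock()
}

// publish is called after b has been applied to the memtable. It pops
// the prefix of applied batches, ratcheting the visible sequence number
// for each one. If the head of the queue is still unapplied, that
// batch's committer will eventually pop b for us, so we simply wait.
func (q *commitQueue) publish(b *commitBatch) {
	b.applied.Store(true)
	q.mu.Lock()
	for len(q.pending) > 0 && q.pending[0].applied.Load() {
		head := q.pending[0]
		q.pending = q.pending[1:]
		q.visible.Store(head.seqNum + head.count - 1)
		head.done.Done()
	}
	q.mu.Unlock()
	b.done.Wait()
}
```

Whichever committer finds the head of the queue applied ratchets the
sequence number for the entire applied prefix, so a committer never
waits on more than one synchronization point.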
   398  
   399  There are two possibilities when publishing a sequence number. The
   400  first is that there is an unapplied batch at the head of the
   401  queue. Consider the following scenario where we're trying to publish
   402  the sequence number for batch `B`.
   403  
   404  ```
   405    +---------------+-------------+---------------+-----+
   406    | A (unapplied) | B (applied) | C (unapplied) | ... |
   407    +---------------+-------------+---------------+-----+
   408  ```
   409  
   410  The publish routine will see that `A` is unapplied and then simply
   411  wait for `B's` done `sync.WaitGroup` to be signalled. This is safe
   412  because `A` must still be committing. And if `A` has concurrently been
   413  marked as applied, the goroutine publishing `A` will then publish
   414  `B`. What happens when `A` publishes its sequence number? The commit
   415  queue state becomes:
   416  
   417  ```
   418    +-------------+-------------+---------------+-----+
   419    | A (applied) | B (applied) | C (unapplied) | ... |
   420    +-------------+-------------+---------------+-----+
   421  ```
   422  
   423  The publish routine pops `A` from the queue, ratchets the sequence
   424  number, then pops `B` and ratchets the sequence number again, and then
finds `C` and stops. An important detail to notice is that
   426  the committer for batch `B` didn't have to do any more work. An
   427  alternative approach would be to have `B` wakeup and ratchet its own
   428  sequence number, but that would serialize the remainder of the commit
   429  queue behind that goroutine waking up.
   430  
   431  The commit queue reduces the number of thread synchronization
   432  operations required to commit a batch. There is no leader to notify,
   433  or followers to wait for. A commit either publishes its own sequence
   434  number, or performs one synchronization operation to wait for a
   435  concurrent committer to publish its sequence number.
   436  
   437  ## Range Deletions
   438  
   439  Deletion of an individual key in RocksDB and Pebble is accomplished by
   440  writing a deletion tombstone. A deletion tombstone shadows an existing
   441  value for a key, causing reads to treat the key as not present. The
   442  deletion tombstone mechanism works well for deleting small sets of
keys, but what happens if you want to delete all of the keys within a range
   444  of keys that might number in the thousands or millions? A range
   445  deletion is an operation which deletes an entire range of keys with a
   446  single record. In contrast to a point deletion tombstone which
   447  specifies a single key, a range deletion tombstone (a.k.a. range
   448  tombstone) specifies a start key (inclusive) and an end key
   449  (exclusive). This single record is much faster to write than thousands
   450  or millions of point deletion tombstones, and can be done blindly --
   451  without iterating over the keys that need to be deleted. The downside
   452  to range tombstones is that they require additional processing during
   453  reads. How the processing of range tombstones is done significantly
   454  affects both the complexity of the implementation, and the efficiency
   455  of read operations in the presence of range tombstones.
   456  
   457  A range tombstone is composed of a start key, end key, and sequence
   458  number. Any key that falls within the range is considered deleted if
   459  the key's sequence number is less than or equal to the range
   460  tombstone's sequence number. RocksDB stores range tombstones
   461  segregated from point operations in a special range deletion block
   462  within each sstable. Conceptually, the range tombstones stored within
   463  an sstable are truncated to the boundaries of the sstable, though
   464  there are complexities that cause this to not actually be physically
   465  true.
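
As a small sketch, a range tombstone and the deletion rule just
described can be written as follows (the layout is illustrative):

```go
package rangedelsketch

import "bytes"

// tombstone is a range deletion covering [start, end) at a sequence
// number.
type tombstone struct {
	start, end []byte
	seqNum     uint64
}

// deletes reports whether the tombstone deletes a key at the given
// sequence number: the key must fall within [start, end) and must not
// be newer than the tombstone.
func (t tombstone) deletes(key []byte, keySeqNum uint64) bool {
	return bytes.Compare(key, t.start) >= 0 &&
		bytes.Compare(key, t.end) < 0 &&
		keySeqNum <= t.seqNum
}
```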
   466  
   467  In RocksDB, the main structure implementing range tombstone processing
   468  is the `RangeDelAggregator`. Each read operation and iterator has its
   469  own `RangeDelAggregator` configured for the sequence number the read
   470  is taking place at. The initial implementation of `RangeDelAggregator`
   471  built up a "skyline" for the range tombstones visible at the read
   472  sequence number.
   473  
   474  ```
   475  10   +---+
   476   9   |   |
   477   8   |   |
   478   7   |   +----+
   479   6   |        |
   480   5 +-+        |  +----+
   481   4 |          |  |    |
   482   3 |          |  |    +---+
   483   2 |          |  |        |
   484   1 |          |  |        |
   485   0 |          |  |        |
   486    abcdefghijklmnopqrstuvwxyz
   487  ```
   488  
   489  The above diagram shows the skyline created for the range tombstones
   490  `[b,j)#5`, `[d,h)#10`, `[f,m)#7`, `[p,u)#5`, and `[t,y)#3`. The
   491  skyline is queried for each key read to see if the key should be
   492  considered deleted or not. The skyline structure is stored in a binary
tree, making the queries an O(log n) operation in the number of
   494  tombstones, though there is an optimization to make this O(1) for
   495  `next`/`prev` iteration. Note that the skyline representation loses
   496  information about the range tombstones. This requires the structure to
   497  be rebuilt on every read which has a significant performance impact.
   498  
   499  The initial skyline range tombstone implementation has since been
   500  replaced with a more efficient lookup structure. See the
   501  [DeleteRange](https://rocksdb.org/blog/2018/11/21/delete-range.html)
   502  blog post for a good description of both the original implementation
   503  and the new (v2) implementation. The key change in the new
   504  implementation is to "fragment" the range tombstones that are stored
   505  in an sstable. The fragmented range tombstones provide the same
   506  benefit as the skyline representation: the ability to binary search
   507  the fragments in order to find the tombstone covering a key. But
   508  unlike the skyline approach, the fragmented tombstones can be cached
   509  on a per-sstable basis. In the v2 approach, `RangeDelAggregator` keeps
   510  track of the fragmented range tombstones for each sstable encountered
   511  during a read or iterator, and logically merges them together.
   512  
   513  Fragmenting range tombstones involves splitting range tombstones at
   514  overlap points. Let's consider the tombstones in the skyline example
   515  above:
   516  
   517  ```
   518  10:   d---h
   519   7:     f------m
   520   5: b-------j     p----u
   521   3:                   t----y
   522  ```
   523  
   524  Fragmenting the range tombstones at the overlap points creates a
   525  larger number of range tombstones:
   526  
   527  ```
   528  10:   d-f-h
   529   7:     f-h-j--m
   530   5: b-d-f-h-j     p---tu
   531   3:                   tu---y
   532  ```
   533  
   534  While the number of tombstones is larger there is a significant
   535  advantage: we can order the tombstones by their start key and then
   536  binary search to find the set of tombstones overlapping a particular
point. This is possible because, due to the fragmenting, all of the
tombstones that overlap a given point will have the same start and
   539  end key. The v2 `RangeDelAggregator` and associated classes perform
   540  fragmentation of range tombstones stored in each sstable and those
   541  fragmented tombstones are then cached.
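
The following Go sketch shows fragmentation in a simplified, batch form
(the real implementations fragment incrementally as tombstones are
encountered), reusing the `tombstone` type from the earlier sketch.
Applied to the five tombstones above, it produces the fragments shown in
the second diagram.

```go
package rangedelsketch

import (
	"bytes"
	"sort"
)

// fragment splits a set of range tombstones at all of their mutual start
// and end keys, so that any two output fragments either share exactly
// the same bounds or do not overlap at all.
func fragment(tombs []tombstone) []tombstone {
	// Every start and end key is a potential split point.
	var bounds [][]byte
	for _, t := range tombs {
		bounds = append(bounds, t.start, t.end)
	}
	sort.Slice(bounds, func(i, j int) bool {
		return bytes.Compare(bounds[i], bounds[j]) < 0
	})
	uniq := bounds[:0]
	for _, b := range bounds {
		if len(uniq) == 0 || !bytes.Equal(uniq[len(uniq)-1], b) {
			uniq = append(uniq, b)
		}
	}
	// Emit a fragment for every (bound, next bound) span covered by a
	// tombstone.
	var out []tombstone
	for i := 0; i+1 < len(uniq); i++ {
		lo, hi := uniq[i], uniq[i+1]
		for _, t := range tombs {
			if bytes.Compare(t.start, lo) <= 0 && bytes.Compare(t.end, hi) >= 0 {
				out = append(out, tombstone{start: lo, end: hi, seqNum: t.seqNum})
			}
		}
	}
	return out
}
```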
   542  
   543  In summary, in RocksDB `RangeDelAggregator` acts as an oracle for
   544  answering whether a key is deleted at a particular sequence
   545  number. Due to caching of fragmented tombstones, the v2 implementation
of `RangeDelAggregator` is significantly faster to
   547  populate than v1, yet the overall approach to processing range
   548  tombstones remains similar.
   549  
Pebble takes a different approach: it integrates range tombstone
   551  processing directly into the `mergingIter` structure. `mergingIter` is
   552  the internal structure which provides a merged view of the levels in
   553  an LSM. RocksDB has a similar class named
   554  `MergingIterator`. Internally, `mergingIter` maintains a heap over the
   555  levels in the LSM (note that each memtable and L0 table is a separate
   556  "level" in `mergingIter`). In RocksDB, `MergingIterator` knows nothing
   557  about range tombstones, and it is thus up to higher-level code to
   558  process range tombstones using `RangeDelAggregator`.
   559  
   560  While the separation of `MergingIterator` and range tombstones seems
   561  reasonable at first glance, there is an optimization that RocksDB does
   562  not perform which is awkward with the `RangeDelAggregator` approach:
   563  skipping swaths of deleted keys. A range tombstone often shadows more
   564  than one key. Rather than iterating over the deleted keys, it is much
   565  quicker to seek to the end point of the range tombstone. The challenge
   566  in implementing this optimization is that a key might be newer than
   567  the range tombstone and thus shouldn't be skipped. An insight to be
   568  utilized is that the level structure itself provides sufficient
   569  information. A range tombstone at `Ln` is guaranteed to be newer than
   570  any key it overlaps in `Ln+1`.
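
A Go sketch of the skip optimization built on that insight, using
illustrative iterator interfaces (not Pebble's actual API) and the
`tombstone` type from the earlier sketch; visibility of the tombstone at
the read's sequence number is ignored for brevity.

```go
package rangedelsketch

import "bytes"

// pointIter and rangeDelIter are illustrative interfaces.
type pointIter interface {
	Valid() bool
	Key() []byte
	SeekGE(key []byte)
}

type rangeDelIter interface {
	// SeekGE returns the first tombstone whose end key is after key, or
	// ok=false if there is none.
	SeekGE(key []byte) (t tombstone, ok bool)
}

// skipDeleted advances an older level's point iterator past any swath of
// keys covered by a tombstone from a strictly newer level. Because the
// tombstone comes from a newer level it is guaranteed to be newer than
// every key it overlaps here, so no sequence number check is needed and
// the deleted keys are skipped with a single seek.
func skipDeleted(points pointIter, newerRangeDels rangeDelIter) {
	for points.Valid() {
		t, ok := newerRangeDels.SeekGE(points.Key())
		if !ok || bytes.Compare(points.Key(), t.start) < 0 {
			return // the current key is not covered by a newer tombstone
		}
		points.SeekGE(t.end) // skip the entire deleted range
	}
}
```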
   571  
   572  Pebble utilizes the insight above to integrate range deletion
   573  processing with `mergingIter`. A `mergingIter` maintains a point
   574  iterator and a range deletion iterator per level in the LSM. In this
   575  context, every L0 table is a separate level, as is every
   576  memtable. Within a level, when a range deletion contains a point
   577  operation the sequence numbers must be checked to determine if the
   578  point operation is newer or older than the range deletion
   579  tombstone. The `mergingIter` maintains the invariant that the range
deletion iterators for all levels newer than the current iteration key
   581  are positioned at the next (or previous during reverse iteration)
   582  range deletion tombstone. We know those levels don't contain a range
   583  deletion tombstone that covers the current key because if they did the
   584  current key would be deleted. The range deletion iterator for the
   585  current key's level is positioned at a range tombstone covering or
past the current key. The position of all other range deletion
   587  iterators is unspecified. Whenever a key from those levels becomes the
   588  current key, their range deletion iterators need to be
   589  positioned. This lazy positioning avoids seeking the range deletion
   590  iterators for keys that are never considered.
   591  
   592  For a full example, consider the following setup:
   593  
   594  ```
   595    p0:               o
   596    r0:             m---q
   597  
   598    p1:              n p
   599    r1:       g---k
   600  
   601    p2:  b d    i
   602    r2: a---e           q----v
   603  
   604    p3:     e
   605    r3:
   606  ```
   607  
The diagram above shows 4 levels, with `pX` indicating the
   609  point operations in a level and `rX` indicating the range tombstones.
   610  
   611  If we start iterating from the beginning, the first key we encounter
is `b` in `p2`. When the `mergingIter` is pointing at a valid entry, the
range deletion iterators for all of the levels less than the current
   614  key's level are positioned at the next range tombstone past the
   615  current key. So `r0` will point at `[m,q)` and `r1` at `[g,k)`. When
   616  the key `b` is encountered, we check to see if the current tombstone
   617  for `r0` or `r1` contains it, and whether the tombstone for `r2`,
   618  `[a,e)`, contains and is newer than `b`.
   619  
   620  Advancing the iterator finds the next key at `d`. This is in the same
   621  level as the previous key `b` so we don't have to reposition any of
   622  the range deletion iterators, but merely check whether `d` is now
   623  contained by any of the range tombstones at higher levels or has
   624  stepped past the range tombstone in its own level. In this case, there
   625  is nothing to be done.
   626  
   627  Advancing the iterator again finds `e`. Since `e` comes from `p3`, we
   628  have to position the `r3` range deletion iterator, which is empty. `e`
   629  is past the `r2` tombstone of `[a,e)` so we need to advance the `r2`
   630  range deletion iterator to `[q,v)`.
   631  
   632  The next key is `i`. Because this key is in `p2`, a level above `e`,
   633  we don't have to reposition any range deletion iterators and instead
   634  see that `i` is covered by the range tombstone `[g,k)`. The iterator
   635  is immediately advanced to `n` which is covered by the range tombstone
   636  `[m,q)` causing the iterator to advance to `o` which is visible.
   637  
   638  ## Flush and Compaction Pacing
   639  
   640  Flushes and compactions in LSM trees are problematic because they
   641  contend with foreground traffic, resulting in write and read latency
   642  spikes. Without throttling the rate of flushes and compactions, they
   643  occur "as fast as possible" (which is not entirely true, since we
   644  have a `bytes_per_sync` option). This instantaneous usage of CPU and
   645  disk IO results in potentially huge latency spikes for writes and
   646  reads which occur in parallel to the flushes and compactions.
   647  
   648  RocksDB attempts to solve this issue by offering an option to limit
   649  the speed of flushes and compactions. A maximum `bytes/sec` can be
   650  specified through the options, and background IO usage will be limited
   651  to the specified amount. Flushes are given priority over compactions,
   652  but they still use the same rate limiter. Though simple to implement
   653  and understand, this option is fragile for various reasons.
   654  
   655  1) If the rate limit is configured too low, the DB will stall and
   656  write throughput will be affected.
   657  2) If the rate limit is configured too high, the write and read
   658  latency spikes will persist.
   659  3) A different configuration is needed per system depending on the
   660  speed of the storage device.
   661  4) Write rates typically do not stay the same throughout the lifetime
of the DB (higher throughput during certain times of the day, etc.), but
   663  the rate limit cannot be configured during runtime.
   664  
   665  RocksDB also offers an
   666  ["auto-tuned" rate limiter](https://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html)
   667  which uses a simple multiplicative-increase, multiplicative-decrease
   668  algorithm to dynamically adjust the background IO rate limit depending
   669  on how much of the rate limiter has been exhausted in an interval.
   670  This solves the problem of having a static rate limit, but Pebble
   671  attempts to improve on this with a different pacing mechanism.
   672  
   673  Pebble's pacing mechanism uses separate rate limiters for flushes and
   674  compactions. Both the flush and compaction pacing mechanisms work by
   675  attempting to flush and compact only as fast as needed and no faster.
   676  This is achieved differently for flushes versus compactions.
   677  
   678  For flush pacing, Pebble keeps the rate at which the memtable is
   679  flushed at the same rate as user writes. This ensures that disk IO
   680  used by flushes remains steady. When a mutable memtable becomes full
and is marked immutable, it would typically be flushed as fast as
possible. Instead, we look at the
   683  total number of bytes in all the memtables (mutable + queue of
   684  immutables) and subtract the number of bytes that have been flushed in
   685  the current flush. This number gives us the total number of bytes
   686  which remain to be flushed. If we keep this number steady at a constant
   687  level, we have the invariant that the flush rate is equal to the write
   688  rate.
   689  
   690  When the number of bytes remaining to be flushed falls below our
   691  target level, we slow down the speed of flushing. We keep a minimum
   692  rate at which the memtable is flushed so that flushes proceed even if
   693  writes have stopped. When the number of bytes remaining to be flushed
   694  goes above our target level, we allow the flush to proceed as fast as
   695  possible, without applying any rate limiting. However, note that the
   696  second case would indicate that writes are occurring faster than the
   697  memtable can flush, which would be an unsustainable rate. The LSM
   698  would soon hit the memtable count stall condition and writes would be
   699  completely stopped.
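
A sketch of the flush-pacing decision described above, with illustrative
names; the linear slowdown below the target is one simple choice rather
than necessarily the exact curve Pebble uses.

```go
package pacingsketch

// flushBytesPerSec returns a flush rate given the pacing state. The
// quantity held steady is the number of bytes still waiting to be
// flushed: at or above the target the flush runs unthrottled, below it
// the rate scales down, but never below a minimum so the flush finishes
// even if writes stop.
func flushBytesPerSec(totalMemtableBytes, bytesFlushedSoFar int64,
	targetBytes, minRate, maxRate int64) int64 {
	remaining := totalMemtableBytes - bytesFlushedSoFar
	if remaining >= targetBytes {
		return maxRate // writes are outpacing the flush; do not throttle
	}
	rate := maxRate * remaining / targetBytes
	if rate < minRate {
		rate = minRate
	}
	return rate
}
```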
   700  
   701  For compaction pacing, Pebble uses an estimation of compaction debt,
   702  which is the number of bytes which need to be compacted before no
   703  further compactions are needed. This estimation is calculated by
   704  looking at the number of bytes that have been flushed by the current
   705  flush routine, adding those bytes to the size of the level 0 sstables,
   706  then seeing how many bytes exceed the target number of bytes for the
   707  level 0 sstables. We multiply the number of bytes exceeded by the
level ratio and add that number to the compaction debt estimate.
   709  We repeat this process until the final level, which gives us a final
   710  compaction debt estimate for the entire LSM tree.
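
One way to express the compaction debt estimate described above in Go,
with illustrative names and accounting:

```go
package pacingsketch

// estimateCompactionDebt walks the levels from L0 down. Bytes destined
// for a level (existing data plus whatever spilled down from above) that
// exceed the level's target must be compacted into the next level; each
// such compaction rewrites roughly excess * levelRatio bytes, and the
// excess then counts against the next level's target, and so on.
func estimateCompactionDebt(bytesFlushed int64, levelSizes, levelTargets []int64,
	levelRatio int64) int64 {
	var debt int64
	carry := bytesFlushed // bytes about to land in L0 from the current flush
	for level := range levelSizes {
		excess := levelSizes[level] + carry - levelTargets[level]
		if excess <= 0 {
			break
		}
		debt += excess * levelRatio
		carry = excess // the excess moves down to the next level
	}
	return debt
}
```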
   711  
   712  Like with flush pacing, we want to keep the compaction debt at a
   713  constant level. This ensures that compactions occur only as fast as
   714  needed and no faster. If the compaction debt estimate falls below our
   715  target level, we slow down compactions. We maintain a minimum
   716  compaction rate so that compactions proceed even if flushes have
   717  stopped. If the compaction debt goes above our target level, we let
   718  compactions proceed as fast as possible without any rate limiting.
   719  Just like with flush pacing, this would indicate that writes are
   720  occurring faster than the background compactions can keep up with,
   721  which is an unsustainable rate. The LSM's read amplification would
   722  increase and the L0 file count stall condition would be hit.
   723  
   724  With the combined flush and compaction pacing mechanisms, flushes and
   725  compactions only occur as fast as needed and no faster, which reduces
   726  latency spikes for user read and write operations.
   727  
## Write Throttling
   729  
   730  RocksDB adds artificial delays to user writes when certain thresholds
   731  are met, such as `l0_slowdown_writes_threshold`. These artificial
   732  delays occur when the system is close to stalling to lessen the write
   733  pressure so that flushing and compactions can catch up. On the surface
   734  this seems good, since write stalls would seemingly be eliminated and
   735  replaced with gradual slowdowns. Closed loop write latency benchmarks
   736  would show the elimination of abrupt write stalls, which seems
   737  desirable.
   738  
   739  However, this doesn't do anything to improve latencies in an open loop
   740  model, which is the model more likely to resemble real world use
   741  cases. Artificial delays increase write latencies without a clear
benefit. Write stalls in an open loop system would indicate that
   743  writes are generated faster than the system could possibly handle,
   744  which adding artificial delays won't solve.
   745  
   746  For this reason, Pebble doesn't add artificial delays to user writes
   747  and writes are served as quickly as possible.
   748  
## Other Differences
   750  
   751  * `internalIterator` API which minimizes indirect (virtual) function
   752    calls
   753  * Previous pointers in the memtable and indexed batch skiplists
   754  * Elision of per-key lower/upper bound checks in long range scans
   755  * Weak cache references remove the need to pin index and filter blocks
   756    in memory
   757  * Improved `Iterator` API
   758    + `SeekPrefixGE` for prefix iteration
   759    + `SetBounds` for adjusting the bounds on an existing `Iterator`
   760  * Simpler `Get` implementation