github.com/cockroachdb/pebble@v1.1.2/docs/RFCS/20211018_range_keys.md

     1  - Feature Name: Range Keys
     2  - Status: draft
     3  - Start Date: 2021-10-18
     4  - Authors: Sumeer Bhola, Jackson Owens
     5  - RFC PR: #1341
     6  - Pebble Issues:
     7    https://github.com/cockroachdb/pebble/issues/1339
     8  - Cockroach Issues:
     9    https://github.com/cockroachdb/cockroach/issues/70429
    10    https://github.com/cockroachdb/cockroach/issues/70412
    11  
    12  **Design Draft**
    13  
    14  # Summary
    15  
    16  An ongoing effort within CockroachDB to preserve MVCC history across all SQL
    17  operations (see cockroachdb/cockroach#69380) requires a more efficient method of
    18  deleting ranges of MVCC history.
    19  
    20  This document describes an extension to Pebble introducing first-class support
    21  for range keys. Range keys map a range of keyspace to a value. Optionally, the
    22  key range may include a suffix encoding a version (eg, MVCC timestamp). Pebble
    23  iterators may be configured to surface range keys during iteration, or to mask
    24  point keys at lower MVCC timestamps covered by range keys.
    25  
    26  CockroachDB will make use of these range keys to enable history-preserving
    27  removal of contiguous ranges of MVCC keys with a constant number of writes, and
    28  efficient iteration past deleted versions.
    29  
    30  # Background
    31  
    32  A previous CockroachDB RFC cockroachdb/cockroach#69380 describes the motivation
    33  for the larger project of migrating MVCC-noncompliant operations into MVCC
    34  compliance. Implemented with the existing MVCC primitives, some operations like
    35  removal of an index or table would require performing writes linearly
    36  proportional to the size of the table. Dropping a large table using existing
    37  MVCC point-delete primitives would be prohibitively expensive. The desire for a
    38  sublinear delete of an MVCC range motivates this work.
    39  
    40  The detailed design for MVCC compliant bulk operations ([high-level
    41  description](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210825_mvcc_bulk_ops.md);
    42  detailed design draft for DeleteRange in internal
    43  [doc](https://docs.google.com/document/d/1ItxpitNwuaEnwv95RJORLCGuOczuS2y_GoM2ckJCnFs/edit#heading=h.x6oktstoeb9t)),
    44  ran into complexity by placing range operations above the Pebble layer, such
    45  that Pebble sees these as points. This complexity has several causes: (a) which
    46  key (start or end) to anchor this range on, when represented as a point (there
    47  are performance consequences), (b) rewriting on CockroachDB range splits (and
    48  concerns about rewrite volume), (c) fragmentation on writes and complexity
    49  thereof (and performance concerns for reads when not fragmenting), (d) inability
    50  to efficiently skip older MVCC versions that are masked by a `[k1,k2)@ts` (where
    51  ts is the MVCC timestamp).
    52  
    53  Pebble currently has only one kind of key that is associated with a range:
    54  `RANGEDEL [k1, k2)#seq`, where [k1, k2) is supplied by the caller, and is used
    55  to efficiently remove a set of point keys.
    56  
    57  First-class support for range keys in Pebble eliminates all these issues.
    58  Additionally, it allows for future extensions like efficient transactional range
    59  operations. This document describes how this feature would work from the
    60  perspective of a user of Pebble (like CockroachDB), and sketches some
    61  implementation details.
    62  
    63  # Design
    64  
    65  ## Interface
    66  
    67  ### New `Comparer` requirements
    68  
    69  The Pebble `Comparer` type allows users to optionally specify a `Split` function
    70  that splits a user key into a prefix and a suffix. This Split allows users
    71  implementing MVCC (Multi-Version Concurrency Control) to inform Pebble which
    72  part of the key encodes the user key and which part of the key encodes the
    73  version (eg, a timestamp). Pebble does not dictate the encoding of an MVCC
    74  version, only that the version form a suffix on keys.
    75  
    76  The range keys design described in this RFC introduces stricter requirements for
    77  user-provided `Split` implementations and the ordering of keys:
    78  
    79  1. The user key consisting of just a key prefix `k` must sort before all
    80     other user keys containing that prefix. Specifically
    81     `Compare(k[:Split(k)], k) < 0` where `Split(k) < len(k)`.
    82  2. A key consisting of a bare suffix must be a valid key and comparable. The
    83     ordering of the empty key prefix with any suffixes must be consistent with
    84     the ordering of those same suffixes applied to any other key prefix.
    85     Specifically `Compare(k[Split(k):], k2[Split(k2):]) == Compare(k, k2)` where
    86     `Compare(k[:Split(k)], k2[:Split(k2)]) == 0`.
    87  
    88  The details of why these new requirements are necessary are explained in the
    89  implementation section.
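
The two requirements can be checked mechanically against a candidate `Comparer`. The following is a minimal sketch, assuming a toy key encoding in which the suffix is `@<number>` and higher numbers sort first; the functions `split`, `compare`, `prefixSortsFirst` and `suffixOrderConsistent` are hypothetical stand-ins written for this example, not Pebble APIs.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// split returns the prefix length of k. The "@<number>" suffix encoding is an
// assumption for illustration; Pebble does not dictate an encoding.
func split(k []byte) int {
	if i := bytes.IndexByte(k, '@'); i >= 0 {
		return i
	}
	return len(k)
}

// timestamp parses the number out of a non-empty suffix such as "@7".
// Malformed suffixes are treated as timestamp 0 to keep the sketch short.
func timestamp(suffix []byte) int {
	n, _ := strconv.Atoi(string(suffix[1:]))
	return n
}

// compare orders keys by prefix bytes, then places a bare prefix first,
// then orders suffixes with higher (newer) timestamps first.
func compare(a, b []byte) int {
	as, bs := split(a), split(b)
	if c := bytes.Compare(a[:as], b[:bs]); c != 0 {
		return c
	}
	sa, sb := a[as:], b[bs:]
	switch {
	case len(sa) == 0 && len(sb) == 0:
		return 0
	case len(sa) == 0:
		return -1
	case len(sb) == 0:
		return 1
	}
	ta, tb := timestamp(sa), timestamp(sb)
	switch {
	case ta > tb:
		return -1
	case ta < tb:
		return 1
	}
	return 0
}

// prefixSortsFirst checks requirement 1 for a single key.
func prefixSortsFirst(k []byte) bool {
	n := split(k)
	return n == len(k) || compare(k[:n], k) < 0
}

// suffixOrderConsistent checks requirement 2 for a pair of keys.
func suffixOrderConsistent(k, k2 []byte) bool {
	n, n2 := split(k), split(k2)
	if compare(k[:n], k2[:n2]) != 0 {
		return true // the requirement only constrains keys with equal prefixes
	}
	return compare(k[n:], k2[n2:]) == compare(k, k2)
}

func main() {
	fmt.Println(prefixSortsFirst([]byte("b@2")))
	fmt.Println(suffixOrderConsistent([]byte("a@7"), []byte("a@3")))
	fmt.Println(suffixOrderConsistent([]byte("a"), []byte("a@3")))
}
```

Note that requirement 2 forces `compare` to accept bare suffixes such as `@7` as keys with an empty prefix, which this toy comparer handles because `split` returns 0 for them.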
    90  
    91  ### Writes
    92  
    93  This design introduces three new write operations:
    94  
    95  - `RangeKeySet([k1, k2), [optional suffix], <value>)`: This represents the
    96    mapping `[k1, k2)@suffix => value`. Keys `k1` and `k2` must not contain a
    97    suffix (i.e., `Split(k1)==len(k1)` and `Split(k2)==len(k2)`).
    98  
    99  - `RangeKeyUnset([k1, k2), [optional suffix])`: This removes a mapping
   100    previously applied by `RangeKeySet`. The unset may use a smaller key range
   101    than the original `RangeKeySet`, in which case only part of the range is
   102    deleted. The unset only applies to range keys with a matching optional suffix.
   103    If the optional suffix is absent in both the RangeKeySet and RangeKeyUnset,
   104    they are considered matching.
   105  
   106  - `RangeKeyDelete([k1, k2))`: This removes all range keys within the provided
   107    key span. It behaves like an `Unset` unencumbered by suffix restrictions.
   108  
   109  For example, consider `RangeKeySet([a,d), foo)` (i.e., no suffix). If
   110  there is a later call `RangeKeyUnset([b,c))`, the resulting state seen by
   111  a reader is `[a,b) => foo`, `[c,d) => foo`. Note that the value is not
   112  modified when the key is fragmented.
   113  
   114  Partially overlapping `RangeKeySet`s with the same suffix overwrite one
   115  another.  For example, consider `RangeKeySet([a,d), foo)`, followed by
   116  `RangeKeySet([c,e), bar)`.  The resulting state is `[a,c) => foo`, `[c,e)
   117  => bar`.
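
The logical outcome of these operations can be modelled with a deliberately naive sketch. It assumes a keyspace of single lowercase letters and tracks, for every letter, which suffixes currently map to which value. The `store` type and its methods are hypothetical and say nothing about how Pebble stores range keys; they only reproduce the state a reader would observe in the two examples above.

```go
package main

import "fmt"

// store maps each single-letter user key to its live range keys
// (suffix -> value). Real range keys are not stored per key; this
// only models the logical state a reader observes.
type store map[byte]map[string]string

// set models RangeKeySet([start, end), suffix, value).
func (s store) set(start, end byte, suffix, value string) {
	for k := start; k < end; k++ {
		if s[k] == nil {
			s[k] = map[string]string{}
		}
		s[k][suffix] = value
	}
}

// unset models RangeKeyUnset([start, end), suffix): only the matching
// suffix is removed.
func (s store) unset(start, end byte, suffix string) {
	for k := start; k < end; k++ {
		delete(s[k], suffix)
	}
}

// deleteAll models RangeKeyDelete([start, end)): every suffix is removed.
func (s store) deleteAll(start, end byte) {
	for k := start; k < end; k++ {
		delete(s, k)
	}
}

// spans coalesces adjacent letters carrying the same value at suffix.
func (s store) spans(suffix string) []string {
	var out []string
	for k := byte('a'); k <= 'z'; {
		v, ok := s[k][suffix]
		if !ok {
			k++
			continue
		}
		start := k
		for k <= 'z' {
			if w, ok := s[k][suffix]; !ok || w != v {
				break
			}
			k++
		}
		out = append(out, fmt.Sprintf("[%c,%c) => %s", start, k, v))
	}
	return out
}

func main() {
	s := store{}
	s.set('a', 'd', "", "foo")
	s.unset('b', 'c', "")
	fmt.Println(s.spans("")) // the unset example

	t := store{}
	t.set('a', 'd', "", "foo")
	t.set('c', 'e', "", "bar")
	fmt.Println(t.spans("")) // the overwrite example
}
```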
   118  
   119  Point keys (eg, traditional keys defined at a singular byte string key) and
   120  range keys do not overwrite one another. They have a parallel existence. Point
   121  deletes only apply to points. Range unsets only apply to range keys. However,
   122  users may configure iterators to mask point keys covered by newer range keys.
   123  This masking behavior is explicitly requested by the user in the context of the
   124  iteration. Masking is described in more detail below.
   125  
   126  There exist separate range delete operations for point keys and range keys. A
   127  `RangeKeyDelete` may remove part of a range key, just like the new
   128  `RangeKeyUnset` operation introduced earlier. `RangeKeyDelete`s differ from
   129  `RangeKeyUnset`s, because the latter requires that the suffix matches and
   130  applies only to range keys. `RangeKeyDelete`s completely clear all existing
   131  range keys within their span at all suffix values.
   132  
   133  The optional suffix in `RangeKeySet` and `RangeKeyUnset` operations is related
   134  to the Pebble `Comparer.Split` operation which is explicitly documented as being
   135  for [MVCC
   136  keys](https://github.com/cockroachdb/pebble/blob/e95e73745ce8a85d605ef311d29a6574db8ed3bf/internal/base/comparer.go#L69-L88),
   137  without mandating exactly how the versions are represented. `RangeKeySet` and
   138  `RangeKeyUnset` keys with different suffixes do not interact logically, although
   139  Pebble will observably fragment ranges at intersection points.
   140  
   141  ### Iteration
   142  
   143  A user iterating over a key interval [k1,k2) can request:
   144  
   145  - **[I1]** An iterator over only point keys.
   146  
   147  - **[I2]** A combined iterator over point and range keys. This is what
   148    we mainly discuss below in the implementation discussion.
   149  
   150  - **[I3]** An iterator over only range keys. In the CockroachDB use
   151      case, range keys will need to be subject to MVCC GC just like
   152      point keys — this iterator may be useful for that purpose.
   153  
   154  The `pebble.Iterator` type will be extended to provide accessors for
   155  range keys for use in the combined and exclusively range iteration
   156  modes.
   157  
   158  ```
   159  // HasPointAndRange indicates whether there exists a point key, a range key or
   160  // both at the current iterator position.
   161  HasPointAndRange() (hasPoint, hasRange bool)
   162  
   163  // RangeKeyChanged indicates whether the most recent iterator positioning
   164  // operation resulted in the iterator stepping into or out of a new range key.
   165  // If true, previously returned range key bounds and data have been invalidated.
   166  // If false, previously obtained range key bounds, suffix and value slices are
   167  // still valid and may continue to be read.
   168  RangeKeyChanged() bool
   169  
   170  // Key returns the key of the current key/value pair, or nil if done. If
   171  // positioned at an iterator position that only holds a range key, Key()
   172  // always returns the start bound of the range key. Otherwise, it returns
   173  // the point key's key.
   174  Key() []byte
   175  
   176  // RangeBounds returns the start (inclusive) and end (exclusive) bounds of the
   177  // range key covering the current iterator position. RangeBounds returns nil
   178  // bounds if there is no range key covering the current iterator position, or
   179  // the iterator is not configured to surface range keys.
   180  //
   181  // If valid, the returned start bound is less than or equal to Key() and the
   182  // returned end bound is greater than Key().
   183  RangeBounds() (start, end []byte)
   184  
   185  // Value returns the value of the current key/value pair, or nil if done.
   186  // The caller should not modify the contents of the returned slice, and
   187  // its contents may change on the next call to Next.
   188  //
   189  // Only valid if HasPointAndRange() returns true for hasPoint.
   190  Value() []byte
   191  
   192  // RangeKeys returns the range key values and their suffixes covering the
   193  // current iterator position. The range bounds may be retrieved separately
   194  // through RangeBounds().
   195  RangeKeys() []RangeKey
   196  
   197  type RangeKey struct {
   198      Suffix []byte
   199      Value  []byte
   200  }
   201  ```
   202  
   203  When a combined iterator exposes range keys, it exposes all the range
   204  keys covering `Key`. During iteration with a combined iterator, an
   205  iteration position may surface just a point key, just a range key or
   206  both at the currently-positioned `Key`.
   207  
   208  Described another way, a Pebble combined iterator guarantees that it
   209  will stop at all positions within the keyspace where:
   210  1. There exists a point key at that position.
   211  2. There exists a range key that logically begins at that position.
   212  
   213  In addition to the above positions, a Pebble iterator may also stop at keys
   214  in-between the above positions due to fragmentation. Range keys are defined over
   215  continuous spans of keyspace. Range keys with different suffix values may
   216  overlap each other arbitrarily. To surface these arbitrarily overlapping spans
   217  in an understandable and efficient way, the Pebble iterator surfaces range keys
   218  fragmented at intersection points. Consider the following sequence of writes:
   219  
   220  ```
   221      RangeKeySet([a,z), @1, 'apple')
   222      RangeKeySet([c,e), @3, 'banana')
   223      RangeKeySet([e,m), @5, 'orange')
   224      RangeKeySet([b,k), @7, 'kiwi')
   225  ```
   226  
   227  This yields a database containing overlapping range keys:
   228  ```
   229    @7 → kiwi     |-----------------)
   230    @5 → orange         |---------------)
   231    @3 → banana     |---)
   232    @1 → apple  |-------------------------------------------------)
   233                a b c d e f g h i j k l m n o p q r s t u v w x y z
   234  ```
   235  
   236  During iteration, these range keys are surfaced using the bounds of their
   237  intersection points. For example, a scan across the keyspace containing only
   238  these range keys would observe the following iterator positions:
   239  
   240  ```
   241    Key() = a   RangeBounds() = [a,b)   RangeKeys() = {(@1,apple)}
   242    Key() = b   RangeBounds() = [b,c)   RangeKeys() = {(@7,kiwi), (@1,apple)}
   243    Key() = c   RangeBounds() = [c,e)   RangeKeys() = {(@7,kiwi), (@3,banana), (@1,apple)}
   244    Key() = e   RangeBounds() = [e,k)   RangeKeys() = {(@7,kiwi), (@5,orange), (@1,apple)}
   245    Key() = k   RangeBounds() = [k,m)   RangeKeys() = {(@5,orange), (@1,apple)}
   246    Key() = m   RangeBounds() = [m,z)   RangeKeys() = {(@1,apple)}
   247  ```
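
The iterator positions listed above can be reproduced with a small sketch that fragments range keys at every start and end bound. It assumes all range keys have distinct suffixes (so no overwrites need resolving) and uses plain string comparison on bare prefixes. The `rangeKey`, `fragment` and `fragments` names are hypothetical and chosen for this example only.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeKey is one logical RangeKeySet([start,end), @ts, value).
type rangeKey struct {
	start, end string
	ts         int // numeric part of the "@<ts>" suffix
	value      string
}

// fragment is one surfaced span with the range keys covering it,
// rendered as "(@ts,value)" and ordered newest first.
type fragment struct {
	start, end string
	keys       []string
}

// fragments splits the keyspace at every range key bound and reports, for
// each resulting span, the range keys that cover it.
func fragments(rks []rangeKey) []fragment {
	seen := map[string]bool{}
	var bounds []string
	for _, rk := range rks {
		for _, b := range []string{rk.start, rk.end} {
			if !seen[b] {
				seen[b] = true
				bounds = append(bounds, b)
			}
		}
	}
	sort.Strings(bounds)

	var out []fragment
	for i := 0; i+1 < len(bounds); i++ {
		lo, hi := bounds[i], bounds[i+1]
		var covering []rangeKey
		for _, rk := range rks {
			if rk.start <= lo && hi <= rk.end {
				covering = append(covering, rk)
			}
		}
		if len(covering) == 0 {
			continue // a gap between range keys surfaces nothing
		}
		sort.Slice(covering, func(a, b int) bool {
			return covering[a].ts > covering[b].ts
		})
		f := fragment{start: lo, end: hi}
		for _, rk := range covering {
			f.keys = append(f.keys, fmt.Sprintf("(@%d,%s)", rk.ts, rk.value))
		}
		out = append(out, f)
	}
	return out
}

// example returns the four writes from the text.
func example() []rangeKey {
	return []rangeKey{
		{"a", "z", 1, "apple"},
		{"c", "e", 3, "banana"},
		{"e", "m", 5, "orange"},
		{"b", "k", 7, "kiwi"},
	}
}

func main() {
	for _, f := range fragments(example()) {
		fmt.Printf("[%s,%s) %v\n", f.start, f.end, f.keys)
	}
}
```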
   248  
   249  This fragmentation produces a more understandable interface, and avoids forcing
   250  iterators to read all range keys within the bounds of the broadest range key.
   251  Consider this example:
   252  
   253  ```
   254                     iterator pos          [ ] - sstable bounds
   255                           |
   256  L1:         [a----v1@t2--|-h]     [l-----unset@t1----u]
   257  L2:                 [e---|------v1@t1----------r]
   258               a b c d e f g h i j k l m n o p q r s t u v w x y z
   259  ```
   260  
   261  If the iterator is positioned at a point key `g`, there are two overlapping
   262  physical range keys: `[a,h)@t2→v1` and `[e,r)@t1→v1`.
   263  
   264  However, the `RangeKeyUnset([l,u), @t1)` removes part of the `[e,r)@t1→v1` range
   265  key, truncating it to the bounds `[e,l)`. The iterator must return the truncated
   266  bounds that correctly respect the `RangeKeyUnset`. However, when the range keys
   267  are stored within a log-structured merge tree like Pebble, the `RangeKeyUnset`
   268  may not be contained within the level's sstable that overlaps the current point
   269  key. Searching for the unset could require reading an unbounded number of
   270  sstables, losing the log-structured merge tree's property that bounds read
   271  amplification to the number of levels in the tree.
   272  
   273  Fragmenting range keys to intersection points avoids this problem. The iterator
   274  positioned at `g` only surfaces range key state with the bounds `[e,h)`, the
   275  widest bounds in which it can guarantee t2→v1 and t1→v1 without loading
   276  additional sstables.
   277  
   278  #### Iteration order
   279  
   280  Recall that the user-provided `Comparer.Split(k)` function divides all user keys
   281  into a prefix and a suffix, such that the prefix is `k[:Split(k)]`, and the
   282  suffix is `k[Split(k):]`. If a key does not contain a suffix, the key equals the
   283  prefix.
   284  
   285  An iterator that is configured to surface range keys alongside point keys will
   286  surface all range keys covering the current `Key()` position. Revisiting an
   287  earlier example with the addition of three new point key-value pairs:
   288  a→artichoke, b@2→beet and t@3→turnip. Consider '@<number>' to form the suffix
   289  where present, with `<number>` denoting a MVCC timestamp. Higher, more-recent
   290  timestamps sort before lower, older timestamps.
   291  
   292  ```
   293                .                                                   a   → artichoke
   294    @7 → kiwi     |-----------------)
   295    @5 → orange         |---------------)
   296                  . b@2                                             b@2 → beet
   297    @3 → banana     |---)                             . t@3         t@3 → turnip
   298    @1 → apple  |-------------------------------------------------)
   299                a b c d e f g h i j k l m n o p q r s t u v w x y z
   300  ```
   301  
   302  An iterator configured to surface both point and range keys will visit the
   303  following iterator positions during forward iteration:
   304  
   305  ```
   306    Key()   HasPointAndRange()   Value()      RangeBounds()       RangeKeys()
   307    a       (true,  true)        artichoke    [a,b)               {(@1,apple)}
   308    b       (false, true)        -            [b,c)               {(@7,kiwi), (@1,apple)}
   309    b@2     (true,  true)        beet         [b,c)               {(@7,kiwi), (@1,apple)}
   310    c       (false, true)        -            [c,e)               {(@7,kiwi), (@3,banana), (@1,apple)}
   311    e       (false, true)        -            [e,k)               {(@7,kiwi), (@5,orange), (@1,apple)}
   312    k       (false, true)        -            [k,m)               {(@5,orange), (@1,apple)}
   313    m       (false, true)        -            [m,z)               {(@1,apple)}
   314    t@3     (true,  true)        turnip       [m,z)               {(@1,apple)}
   315  ```
   316  
   317  Note that:
   318  
   319  - While positioned over a point key (eg, Key() = 'a', 'b@2' or 't@3'), the
   320    iterator exposes both the point key's value through Value() and the
   321    overlapping range keys' values through `RangeKeys()`.
   322  
   323  - There can be multiple range keys covering a `Key()`, each with a different
   324    suffix.
   325  
   326  - There cannot be multiple range keys covering a `Key()` with the same suffix,
   327    since the most-recently committed one (eg, the one with the highest sequence
   328    number) will win, just like for point keys.
   329  
   330  - If the iterator has configured lower and/or upper bounds, they will truncate
   331    the range key to those bounds. For example, if the above iterator had an upper
   332    bound 'y', the `[m,z)` range key would be surfaced with the bounds `[m,y)`
   333    instead.
   334  
   335  #### Masking
   336  
   337  Range key masking provides additional, optional functionality designed
   338  specifically for the use case of implementing a MVCC-compatible delete range.
   339  
   340  When constructing an iterator that iterates over both point and range keys, a
   341  user may request that range keys mask point keys. Masking is configured with a
   342  suffix parameter that determines which range keys may mask point keys. Only
   343  range keys with suffixes that sort at or after the mask's suffix mask point
   344  keys. A range key that meets this condition only masks points with suffixes
   345  that sort after the range key's suffix.
   346  
   347  ```
   348  type IterOptions struct {
   349      // ...
   350      RangeKeyMasking RangeKeyMasking
   351  }
   352  
   353  // RangeKeyMasking configures automatic hiding of point keys by range keys.
   354  // A non-nil Suffix enables range-key masking. When enabled, range keys with
   355  // suffixes ≥ Suffix behave as masks. All point keys that are contained within
   356  // a masking range key's bounds and have suffixes greater than the range key's
   357  // suffix are automatically skipped.
   358  //
   359  // Specifically, when configured with a RangeKeyMasking.Suffix _s_, and there
   360  // exists a range key with suffix _r_ covering a point key with suffix _p_, and
   361  //
   362  //   _s_ ≤ _r_ < _p_
   363  //
   364  // then the point key is elided.
   365  //
   366  // Range-key masking may only be used when iterating over both point keys and
   367  // range keys.
   368  type RangeKeyMasking struct {
   369  	// Suffix configures which range keys may mask point keys. Only range keys
   370  	// that are defined at suffixes greater than or equal to Suffix will mask
   371  	// point keys.
   372  	Suffix []byte
   373  	// Filter is an optional field that may be used to improve performance of
   374  	// range-key masking through a block-property filter defined over key
   375  	// suffixes. If non-nil, Filter is called by Pebble to construct a
   376  	// block-property filter mask at iterator creation. The filter is used to
   377  	// skip whole point-key blocks containing point keys with suffixes greater
   378  	// than a covering range-key's suffix.
   379  	//
   380  	// To use this functionality, the caller must create and configure (through
   381  	// Options.BlockPropertyCollectors) a block-property collector that records
   382  	// the maximum suffix contained within a block. The caller then must write
   383  	// and provide a BlockPropertyFilterMask implementation on that same
   384  	// property. See the BlockPropertyFilterMask type for more information.
   385  	Filter func() BlockPropertyFilterMask
   386  }
   387  ```
   388  
   389  Example: A user may construct an iterator with `RangeKeyMasking.Suffix` set to
   390  `@50`. The range key `[a, c)@60` would mask nothing, because `@60` is a more
   391  recent timestamp than `@50`. However a range key `[a,c)@30` would mask `a@20`
   392  and `apple@10` but not `apple@40`. A range key can only mask keys with MVCC
   393  timestamps older than the range key's own timestamp. Only range keys with
   394  suffixes (eg, MVCC timestamps) may mask anything at all.
   395  
   396  The Pebble Iterator surfaces all range keys when masking is enabled. Only point
   397  keys are ever skipped, and only when they are contained within the bounds of a
   398  range key with a more-recent suffix, and the range key's suffix is no more
   399  recent than the timestamp encoded in `RangeKeyMasking.Suffix`.
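
The masking rule reduces to a small predicate. The sketch below assumes suffixes are integer MVCC timestamps where a higher timestamp is more recent and therefore sorts earlier, so the condition _s_ ≤ _r_ < _p_ in suffix order becomes `maskTS >= rangeTS > pointTS` on timestamps. It also assumes every point key carries a timestamp. The `rangeKey`, `pointKey` and `masks` names are hypothetical.

```go
package main

import "fmt"

// rangeKey is a range key [start,end) at MVCC timestamp ts.
type rangeKey struct {
	start, end string
	ts         int
}

// pointKey is a point key prefix at MVCC timestamp ts.
type pointKey struct {
	prefix string
	ts     int
}

// masks reports whether rk hides p for an iterator configured with a
// masking suffix of maskTS. Higher timestamps are more recent.
func masks(maskTS int, rk rangeKey, p pointKey) bool {
	covered := rk.start <= p.prefix && p.prefix < rk.end
	// s <= r < p in suffix sort order, expressed on timestamps.
	return covered && maskTS >= rk.ts && rk.ts > p.ts
}

func main() {
	const mask = 50
	newer := rangeKey{"a", "c", 60}
	older := rangeKey{"a", "c", 30}

	fmt.Println(masks(mask, newer, pointKey{"a", 20}))     // [a,c)@60 vs a@20
	fmt.Println(masks(mask, older, pointKey{"a", 20}))     // [a,c)@30 vs a@20
	fmt.Println(masks(mask, older, pointKey{"apple", 10})) // [a,c)@30 vs apple@10
	fmt.Println(masks(mask, older, pointKey{"apple", 40})) // [a,c)@30 vs apple@40
}
```

Sequence numbers do not appear in the predicate at all: masking is decided purely by suffix order and key bounds.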
   400  
   401  ## Implementation
   402  
   403  ### Write operations
   404  
   405  This design introduces three new Pebble write operations: `RangeKeySet`,
   406  `RangeKeyUnset` and `RangeKeyDelete`. Internally, these operations are
   407  represented as internal keys with new corresponding key kinds encoded as a part
   408  of the key trailer. These keys are stored within special range key blocks
   409  separate from point keys, but within the same sstable. The range key blocks hold
   410  `RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys, but do not hold keys
   411  of any other kind.  Within the memtables, these range keys are stored in a
   412  separate skip list.
   413  
   414  - `RangeKeySet([k1,k2), @suffix, value)` is encoded as a `k1.RANGEKEYSET` key
   415    with a value encoding the tuple `(k2,@suffix,value)`.
   416  - `RangeKeyUnset([k1,k2), @suffix)` is encoded as a `k1.RANGEKEYUNSET` key
   417    with a value encoding the tuple `(k2,@suffix)`.
   418  - `RangeKeyDelete([k1,k2))` is encoded as a `k1.RANGEKEYDELETE` key with a value
   419    encoding `k2`.
   420  
   421  Range keys are physically fragmented as an artifact of the log-structured merge
   422  tree structure and internal sstable boundaries. This fragmentation is essential
   423  for preserving the performance characteristics of a log-structured merge tree.
   424  Although the public interface operations for `RangeKeySet` and `RangeKeyUnset`
   425  require both boundary keys `[k1,k2)` to always be bare prefixes (eg, to not have
   426  a suffix), internally these keys may be fragmented to bounds containing
   427  suffixes.
   428  
   429  Example: If a user attempts to write `RangeKeySet([a@v1, c@v2), @v3, value)`,
   430  Pebble will return an error to the user. If a user writes `RangeKeySet([a, d),
   431  @v3, value)`, Pebble will allow the write and may later internally fragment the
   432  `RangeKeySet` into three internal keys:
   433   - `RangeKeySet([a, a@v1), @v3, value)`
   434   - `RangeKeySet([a@v1, c@v2), @v3, value)`
   435   - `RangeKeySet([c@v2, d), @v3, value)`
   436  
   437  This fragmentation preserves log-structured merge tree performance
   438  characteristics because it allows a range key to be split across many sstables,
   439  while preserving locality between range keys and point keys. Consider a
   440  `RangeKeySet([a,z), @1, foo)` on a database that contains millions of point keys
   441  in the range [a,z). If the [a,z) range key was not permitted to be fragmented
   442  internally, it would either need to be stored completely separately from the
   443  point keys in a separate sstable or in a single intractably large sstable
   444  containing all the overlapping point keys. Fragmentation allows locality,
   445  ensuring point keys and range keys in the same region of the keyspace can be
   446  stored in the same sstable.
   447  
   448  `RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys are assigned sequence
   449  numbers, like other internal keys. Log-structured merge tree level invariants
   450  hold among range keys, among point keys, and between the two. That is:
   451  
   452    1. The point key `k1#s2` cannot be at a lower level than `k2#s1` where
   453       `k1==k2` and `s1 < s2`. This is the invariant implemented by all LSMs.
   454    2. `RangeKeySet([k1,k2))#s2` cannot be at a lower level than
   455       `RangeKeySet([k3,k4))#s1` where `[k1,k2)` overlaps `[k3,k4)` and `s1 < s2`.
   456    3. `RangeKeySet([k1,k2))#s2` cannot be at a lower level than a point key
   457       `k3#s1` where `k3 \in [k1,k2)` and `s1 < s2`.
   458  
   459  Like other tombstones, the `RangeKeyUnset` and `RangeKeyDelete` keys are elided
   460  when they fall to the bottommost level of the LSM and there is no snapshot
   461  preventing their elision. There is no additional garbage collection problem
   462  introduced by these keys.
   463  
   464  There is no Merge operation that affects range keys.
   465  
   466  #### Physical representation
   467  
   468  `RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys are keyed by their
   469  start key. This poses an obstacle. We must be able to support multiple range
   470  keys at the same sequence number, because all keys within an ingested sstable
   471  adopt the same sequence number. Duplicate internal keys (keys with equal user
   472  keys, sequence numbers and kinds) are prohibited within Pebble. To resolve this
   473  issue, fragments with the same bounds are merged within snapshot stripes into a
   474  single physical key-value, representing multiple logical key-value pairs:
   475  
   476  ```
   477  k1.RangeKeySet#s2 → (k2,[(@t2,v2),(@t1,v1)])
   478  ```
   479  
   480  Within a physical key-value pair, suffix-value pairs are stored sorted by
   481  suffix, descending. This has a minor advantage of reducing iteration-time
   482  user-key comparisons when there exist multiple range keys in a table.
   483  
   484  Unlike other Pebble keys, the `RangeKeySet` and `RangeKeyUnset` keys have values
   485  that encode fields of data known to Pebble. The value that the user sets in a
   486  call to `RangeKeySet` is opaque to Pebble, but the physical representation of
   487  the `RangeKeySet`'s value is known. This encoding is a sequence of fields:
   488  
   489  * End key, `varstring`, encodes the end user key of the fragment.
   490  * A series of (suffix, value) tuples representing the logical range keys that
   491    were merged into this one physical `RangeKeySet` key:
   492    * Suffix, `varstring`
   493    * Value, `varstring`
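
As a rough illustration of this field layout, the sketch below encodes and decodes a `RangeKeySet` value. It assumes that a `varstring` is a uvarint length prefix followed by the raw bytes; the exact on-disk format is defined by Pebble and may differ. The function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// suffixValue is one logical range key merged into a physical RangeKeySet.
type suffixValue struct {
	suffix, value []byte
}

// appendVarstring appends a uvarint length prefix followed by s.
// This definition of "varstring" is an assumption of the sketch.
func appendVarstring(dst, s []byte) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(s)))
	dst = append(dst, tmp[:n]...)
	return append(dst, s...)
}

// readVarstring reads one varstring from b and returns the remainder.
func readVarstring(b []byte) (s, rest []byte, err error) {
	n, w := binary.Uvarint(b)
	if w <= 0 || uint64(len(b)-w) < n {
		return nil, nil, errors.New("corrupt varstring")
	}
	return b[w : w+int(n)], b[w+int(n):], nil
}

// encodeSetValue lays out the end key followed by (suffix, value) tuples.
func encodeSetValue(endKey []byte, svs []suffixValue) []byte {
	out := appendVarstring(nil, endKey)
	for _, sv := range svs {
		out = appendVarstring(out, sv.suffix)
		out = appendVarstring(out, sv.value)
	}
	return out
}

// decodeSetValue reverses encodeSetValue.
func decodeSetValue(b []byte) (endKey []byte, svs []suffixValue, err error) {
	endKey, b, err = readVarstring(b)
	if err != nil {
		return nil, nil, err
	}
	for len(b) > 0 {
		var sv suffixValue
		if sv.suffix, b, err = readVarstring(b); err != nil {
			return nil, nil, err
		}
		if sv.value, b, err = readVarstring(b); err != nil {
			return nil, nil, err
		}
		svs = append(svs, sv)
	}
	return endKey, svs, nil
}

func main() {
	enc := encodeSetValue([]byte("k2"), []suffixValue{
		{[]byte("@t2"), []byte("v2")},
		{[]byte("@t1"), []byte("v1")},
	})
	end, svs, err := decodeSetValue(enc)
	fmt.Println(string(end), len(svs), err)
}
```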
   494  
   495  Similarly, `RangeKeyUnset` keys are merged within snapshot stripes and have a
   496  physical representation like:
   497  
   498  ```
   499  k1.RangeKeyUnset#s2 → (k2,[(@t2),(@t1)])
   500  ```
   501  
   502  A `RangeKeyUnset` key's value is encoded as:
   503  * End key, `varstring`, encodes the end user key of the fragment.
   504  * A series of suffix `varstring`s.
   505  
   506  When `RangeKeySet` and `RangeKeyUnset` fragments with identical bounds meet
   507  within the same snapshot stripe within a compaction, any of the
   508  `RangeKeyUnset`'s suffixes that exist within the `RangeKeySet` key are removed.
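
A minimal sketch of this step follows, assuming the `RangeKeyUnset` fragment is the newer of the two and that both fragments already share identical bounds and a snapshot stripe. The `applyUnset` helper is hypothetical and ignores sequence numbers entirely.

```go
package main

import "fmt"

// suffixValue is one (suffix, value) tuple held by a RangeKeySet fragment.
type suffixValue struct {
	suffix, value string
}

// applyUnset returns the tuples of a RangeKeySet fragment that survive a
// newer RangeKeyUnset fragment with the same bounds. Order is preserved.
func applyUnset(set []suffixValue, unset []string) []suffixValue {
	drop := make(map[string]bool, len(unset))
	for _, s := range unset {
		drop[s] = true
	}
	var kept []suffixValue
	for _, sv := range set {
		if !drop[sv.suffix] {
			kept = append(kept, sv)
		}
	}
	return kept
}

func main() {
	set := []suffixValue{{"@t2", "v2"}, {"@t1", "v1"}}
	fmt.Println(applyUnset(set, []string{"@t1"}))
}
```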
   509  
   510  A `RangeKeyDelete` key has no additional data beyond its end key, which is
   511  encoded directly in the value.
   512  
   513  NB: `RangeKeySet` and `RangeKeyUnset` keys are not merged within batches or the
   514  memtable. That's okay, because batches are append-only and indexed batches will
   515  refragment and merge the range keys on-demand. In the memtable, every key is
   516  guaranteed to have a unique sequence number.
   517  
   518  ### Sequence numbers
   519  
   520  Like all Pebble keys, `RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` are
   521  assigned sequence numbers when committed. As described above, overlapping
   522  `RangeKeySet`s and `RangeKeyUnset`s are fragmented to have matching start and
   523  end bounds. Then the resulting exactly-overlapping range key fragments are
   524  merged into a single internal key-value pair, within the same snapshot stripe
   525  and sstable. The original, unmerged internal keys each have their own sequence
   526  numbers, indicating the moment they were committed within the history of all
   527  write operations.
   528  
   529  Recall that sequence numbers are used within Pebble to determine which keys
   530  appear live to which iterators. When an iterator is constructed, it takes note
   531  of the current _visible sequence number_, and for the lifetime of the iterator,
   532  only surfaces keys with lower sequence numbers. Similarly, snapshots read the
   533  current _visible sequence number_, remember it, but also leave a note asking
   534  compactions to preserve history at that sequence number. The space between
   535  snapshotted sequence numbers is referred to as a _snapshot stripe_, and
   536  operations cannot drop or otherwise mutate keys unless they fall within the same
   537  _snapshot stripe_. For example a `k.MERGE#5` key may not be merged with a
   538  `k.MERGE#1` operation if there's an open snapshot at `#3`.
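
The stripe rule can be sketched as a lookup into the sorted list of open snapshot sequence numbers. The sketch assumes, following the "less than" visibility rule above, that a key belongs to the stripe indexed by the first snapshot whose sequence number is greater than the key's. The `stripe` and `sameStripe` helpers are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// stripe returns the index of the snapshot stripe containing seq, given the
// open snapshot sequence numbers in ascending order. It is the index of the
// first snapshot that can see a key written at seq.
func stripe(seq uint64, snapshots []uint64) int {
	return sort.Search(len(snapshots), func(i int) bool {
		return snapshots[i] > seq
	})
}

// sameStripe reports whether two keys may be merged or may shadow one
// another during a compaction.
func sameStripe(a, b uint64, snapshots []uint64) bool {
	return stripe(a, snapshots) == stripe(b, snapshots)
}

func main() {
	snapshots := []uint64{3}
	fmt.Println(sameStripe(5, 1, snapshots)) // k.MERGE#5 and k.MERGE#1
	fmt.Println(sameStripe(5, 4, snapshots)) // both written after the snapshot
	fmt.Println(sameStripe(5, 1, nil))       // no open snapshots
}
```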
   539  
   540  The new `RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys behave
   541  similarly. Overlapping range keys won't be merged if there's an open snapshot
   542  separating them. Consider a range key `[a,z)` written at sequence number `#1` and
   543  a point key `d.SET#2`. A combined point-and-range iterator using a sequence
   544  number `#3` and positioned at `d` will surface both the range key `[a,z)` and the
   545  point key `d`.
   546  
   547  In the context of masking, the suffix-based masking of range keys can cause
   548  potentially unexpected behavior. A range key `[a,z)@10` may be committed as
   549  sequence number `#1`. Afterwards, a point key `d@5#2` may be committed. An
   550  iterator that is configured with range-key masking with suffix `@20` would mask
   551  the point key `d@5#2` because although `d@5#2`'s sequence number is higher,
   552  range-key masking uses suffixes to impose order, not sequence numbers.
   553  
   554  ### Boundaries for sstables
   555  
   556  Range keys follow the same relationship to sstable boundaries as the existing
   557  `RANGEDEL` tombstones. The bounds of an internal range key are user keys. Every
   558  range key is limited by its containing sstable's bounds.
   559  
   560  Consider these keys, annotated with sequence numbers:
   561  
   562  ```
   563  Point keys: a#50, b#70, b#49, b#48, c#47, d#46, e#45, f#44
   564  Range key: [a,e)#60
   565  ```
   566  
   567  We have created three versions of `b` in this example. In previous versions,
   568  Pebble could split output sstables during a compaction such that the different
   569  `b` versions span more than one sstable. This creates problems for `RANGEDEL`s
   570  which span these two sstables which are discussed in the section on [improperly
   571  truncated RANGEDELS](https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md#improperly-truncated-range-deletes).
   572  We manage to tolerate this for `RANGEDEL`s since their semantics are defined by
   573  the system, which is not true for these range keys where the actual semantics
   574  are up to the user.
   575  
   576  Pebble now disallows such sstable split points. In this example, by postponing
   577  the sstable split point to the user key c, we can cleanly split the range key
   578  into `[a,c)#60` and `[c,e)#60`. The sstable end bound for the first sstable
   579  (sstable bounds are inclusive) will be c#inf (where inf is the largest possible
   580  seqnum, which is unused except for these cases), and sstable start bound for the
   581  second sstable will be c#60.
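A minimal sketch of the split-point rule for the unsuffixed example above, assuming plain string user keys. The types and the helpers `nextSplitIndex` and `splitAt` are hypothetical names introduced for illustration and do not correspond to Pebble's compaction code.

```go
package main

import "fmt"

type pointKey struct {
	userKey string
	seqNum  uint64
}

type rangeKey struct {
	start, end string // covers [start, end)
	seqNum     uint64
}

// nextSplitIndex advances a proposed split index i until keys[i] starts a new
// user key, so that all versions of one user key stay in the same sstable.
func nextSplitIndex(keys []pointKey, i int) int {
	for i > 0 && i < len(keys) && keys[i].userKey == keys[i-1].userKey {
		i++
	}
	return i
}

// splitAt cuts rk at the user key at. It returns ok=false when at does not
// fall strictly inside the range key's bounds.
func splitAt(rk rangeKey, at string) (left, right rangeKey, ok bool) {
	if at <= rk.start || at >= rk.end {
		return rangeKey{}, rangeKey{}, false
	}
	left = rangeKey{start: rk.start, end: at, seqNum: rk.seqNum}
	right = rangeKey{start: at, end: rk.end, seqNum: rk.seqNum}
	return left, right, true
}

func main() {
	keys := []pointKey{
		{"a", 50}, {"b", 70}, {"b", 49}, {"b", 48},
		{"c", 47}, {"d", 46}, {"e", 45}, {"f", 44},
	}
	// A split proposed between b#70 and b#49 is postponed to the first c key.
	i := nextSplitIndex(keys, 2)
	left, right, ok := splitAt(rangeKey{start: "a", end: "e", seqNum: 60}, keys[i].userKey)
	fmt.Println(keys[i].userKey, left, right, ok)
}
```

The sketch leaves out the sstable bound bookkeeping (`c#inf` and `c#60`); it only shows that the split lands on a user-key boundary and that both fragments keep the original sequence number.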
   582  
   583  The above example deals exclusively with point and range keys without suffixes.
   584  Consider this example with suffixed keys, and compaction outputs split in the
   585  middle of the `b` prefix:
   586  
   587  ```
   588  first sstable: points: a@100, a@30, b@100, b@40 ranges: [a,c)@50
   589  second sstable: points: b@30, c@40, d@40, e@30, ranges: [c,e)@50
   590  ```
   591  
   592  When the compaction code decides to defer `b@30` to the next sstable and finish
   593  the first sstable, the range key `[a,c)@50` is sitting in the fragmenter. The
   594  compaction must split the range key at the bounds determined by the user key.
   595  The compaction uses the first point key of the next sstable, in this case
   596  `b@30`, to truncate the range key. The compaction flushes the fragment
   597  `[a,b@30)@50` to the first sstable and updates the existing fragment to begin at
   598  `b@30`.
   599  
   600  If a range key extends into the next file, the range key's truncated end is used
   601  for the purposes of determining the sstable end boundary. The first sstable's
   602  end boundary becomes `b@30#inf`, signifying the range key does not cover `b@30`.
   603  The second sstable's start boundary is `b@30`.
   604  
   605  ### Block property collectors
   606  
   607  Separate block property collectors may be configured to collect separate
   608  properties about range keys. This is necessary for CockroachDB's MVCC block
   609  property collectors to ensure the sstable-level properties are correct.
   610  
   611  ### Iteration
   612  
   613  This design extends the `*pebble.Iterator` with the ability to iterate over
   614  exclusively range keys, range keys and point keys together or exclusively point
   615  keys (the previous behavior).
   616  
   617  - Pebble already requires that the prefix `k` follows the same key validity
   618    rules as `k@suffix`.
   619  
   620  - Previously, Pebble did not require that a user key consisting of just a prefix
   621    `k` sort before the same prefix with a non-empty suffix. CockroachDB has
   622    adopted this behavior since it results in the following clean behavior:
   623    `RANGEDEL` over [k1, k2) deletes all versioned keys which have prefixes in the
   624    interval [k1, k2). Pebble will now require this behavior for all users using
   625    MVCC keys. Specifically, it must hold that `Compare(k[:Split(k)], k) < 0` if
   626    `Split(k) < len(k)`.
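The ordering requirement can be checked against a toy comparator. The sketch below assumes a made-up key encoding `prefix@timestamp` in which newer timestamps sort first; `split`, `compare` and `prefixSortsFirst` are hypothetical stand-ins for a user's `Split` and `Compare`, not Pebble or CockroachDB code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// split returns the length of the prefix of k (everything before the '@').
func split(k string) int {
	if i := strings.LastIndexByte(k, '@'); i >= 0 {
		return i
	}
	return len(k)
}

// compare orders keys by prefix, then places the bare prefix first, then
// orders suffixes by descending timestamp. Malformed timestamps are treated
// as 0 in this toy encoding.
func compare(a, b string) int {
	ai, bi := split(a), split(b)
	if c := strings.Compare(a[:ai], b[:bi]); c != 0 {
		return c
	}
	as, bs := a[ai:], b[bi:]
	switch {
	case as == bs:
		return 0
	case as == "":
		return -1
	case bs == "":
		return 1
	}
	at, _ := strconv.Atoi(as[1:])
	bt, _ := strconv.Atoi(bs[1:])
	switch {
	case at > bt:
		return -1
	case at < bt:
		return 1
	}
	return 0
}

// prefixSortsFirst checks Compare(k[:Split(k)], k) < 0 whenever Split(k) < len(k).
func prefixSortsFirst(k string) bool {
	n := split(k)
	return n == len(k) || compare(k[:n], k) < 0
}

func main() {
	for _, k := range []string{"k@10", "b@30", "k"} {
		fmt.Println(k, prefixSortsFirst(k))
	}
}
```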
   627  
   628  # TKTK: Discuss merging iterator
   629  
   630  #### Determinism
   631  
   632  Range keys will be split based on boundaries of sstables in an LSM. Users of an
   633  LSM typically expect that two different LSMs with different sstable settings
   634  that receive the same writes should output the same key-value pairs when
   635  iterating. To provide this behavior, the iterator implementation may be
   636  configured to defragment range keys during iteration time. The defragmentation
   637  behavior would be:
   638  
   639  - Two visible ranges `[k1,k2)@suffix1=>val1`, `[k2,k3)@suffix2=>val2` are
   640    defragmented if suffix1==suffix2 and val1==val2, and become [k1,k3).
   641  
   642  - Defragmentation during user iteration does not consider the sequence number.
   643    This is necessary since LSM state can be exported to another LSM via the use
   644    of sstable ingestion, which can collapse different seqnums to the same seqnum.
   645    We would like both LSMs to look identical to the user when iterating.
   646  
   647  The above defragmentation is conceptually simple, but hard to implement
   648  efficiently, since it requires stepping ahead from the current position to
   649  defragment range keys. This stepping ahead could switch sstables while there are
   650  still points to be consumed in a previous sstable. This determinism is useful
   651  for testing and verification purposes:
   652  
   653  - Randomized and metamorphic testing is used extensively to reliably test
   654    software including Pebble and CockroachDB. Defragmentation provides
   655    the determinism necessary for this form of testing.
   656  
   657  - CockroachDB's replica divergence detector requires a consistent view of the
   658    database on each replica.
   659  
   660  In order to provide determinism, Pebble constructs an internal range key
   661  iterator stack that's separate from the point iterator stack, even when
   662  performing combined iteration over both range and point keys. The separate range
   663  key iterator allows the internal range key iterator to move independently of the
   664  point key iterator. This allows the range key iterator to independently visit
   665  adjacent sstables in order to defragment their range keys if necessary, without
   666  repositioning the point iterator.
   667  
   668  Two spans [k1,k2) and [k3,k4) of range keys are defragmented if their bounds
   669  abut and their user-observable state is identical. That is, `k2==k3` and each
   670  span contains exactly the same set of range key (<suffix>, <value>) pairs. In
   671  order to support `RangeKeyUnset` and `RangeKeyDelete`, defragmentation must be
   672  applied _after_ resolving unsets and deletes.
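The defragmentation rule can be sketched as a single pass over sorted spans. This is an illustration under stated assumptions, not the iterator implementation: the input spans are sorted and non-overlapping, unsets and deletes have already been resolved, and each span lists its (suffix, value) pairs in the same order. The names `span`, `suffixValue` and `defragment` are hypothetical.

```go
package main

import "fmt"

type suffixValue struct {
	suffix, value string
}

type span struct {
	start, end string        // covers [start, end)
	keys       []suffixValue // resolved pairs, in a consistent order
}

// sameKeys reports whether two spans expose identical (suffix, value) pairs.
func sameKeys(a, b []suffixValue) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// defragment joins adjacent spans whose bounds abut and whose keys match.
// Sequence numbers are deliberately not part of the comparison.
func defragment(spans []span) []span {
	var out []span
	for _, s := range spans {
		if n := len(out); n > 0 && out[n-1].end == s.start && sameKeys(out[n-1].keys, s.keys) {
			out[n-1].end = s.end
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	x := []suffixValue{{"@5", "x"}}
	y := []suffixValue{{"@5", "y"}}
	in := []span{{"a", "c", x}, {"c", "f", x}, {"f", "h", y}, {"k", "m", y}}
	for _, s := range defragment(in) {
		fmt.Printf("[%s,%s) %v\n", s.start, s.end, s.keys)
	}
}
```

In this example `[a,c)` and `[c,f)` are joined into `[a,f)`, `[f,h)` stays separate because its value differs, and `[k,m)` stays separate because its bounds do not abut `[f,h)`.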
   673  
   674  #### Merging iteration
   675  
   676  Recall that range keys are stored in the same sstables as point keys. In a
   677  log-structured merge tree, these sstables are distributed across levels. Within
   678  a level, sstables are non-overlapping but between levels sstables may overlap
   679  arbitrarily. During iteration, keys across levels must be merged together. For
   680  point keys, this is typically done with a heap.
   681  
   682  Range keys too must be merged across levels, and the earlier described
   683  fragmentation at intersection boundaries must be applied. To implement this, a
   684  range key merging iterator is defined.
   685  
   686  A merging iterator is initialized with an arbitrary number of child iterators
   687  over fragmented spans. Each child iterator exposes fragmented range keys, such
   688  that overlapping range keys are surfaced in a single span with a single set of
   689  bounds. Range keys from one child iterator may overlap key spans from another
   690  child iterator arbitrarily. The high-level algorithm is:
   691  
   692  1. Initialize a heap with bound keys from child iterators' range keys.
   693  2. Find the next [or previous, if in reverse] two unique user keys from bounds.
   694  3. Consider the span formed between the two unique user keys a candidate span.
   695  4. Determine if any of the child iterators' spans overlap the candidate span.
   696    4a. If any of the child iterators' current bounds are end keys (during
   697        forward iteration) or start keys (during reverse iteration), then all the
   698        spans with that bound overlap the candidate span.
   699    4b. If no spans overlap, forget the smallest (forward iteration) or largest
   700        (reverse iteration) unique user key and advance the iterators to the next
   701        unique user key. Start again from 3.
   702  
   703  Consider the example:
   704  
   705  ```
   706         i0:     b---d e-----h
   707         i1:   a---c         h-----k
   708         i2:   a------------------------------p
   709  
   710  fragments:   a-b-c-d-e-----h-----k----------p
   711  ```
   712  
   713  None of the individual child iterators contain a span with the exact bounds
   714  [c,d), but the merging iterator must produce a span [c,d). To accomplish this,
   715  the merging iterator visits every span between unique boundary user keys. In the
   716  above example, this is:
   717  
   718  ```
   719  [a,b), [b,c), [c,d), [d,e), [e, h), [h, k), [k, p)
   720  ```
   721  
   722  The merging iterator first initializes the heap to prepare for iteration. The
   723  description below discusses the mechanics of forward iteration after a call to
   724  First, but the mechanics are similar for reverse iteration and other positioning
   725  methods.
   726  
   727  During a call to First, the heap is initialized by seeking every level to the
   728  first bound of the first fragment. In the above example, this seeks the child
   729  iterators to:
   730  
   731  ```
   732  i0: (b, boundKindStart, [ [b,d) ])
   733  i1: (a, boundKindStart, [ [a,c) ])
   734  i2: (a, boundKindStart, [ [a,p) ])
   735  ```
   736  
   737  After fixing up the heap, the root of the heap is the bound with the smallest
   738  user key ('a' in the example). During forward iteration, the root of the heap's
   739  user key is the start key of the next merged span. The merging iterator records this
   740  key as the start key. The heap may contain other levels with range keys that
   741  also have the same user key as a bound of a range key, so the merging iterator
   742  pulls from the heap until it finds the first bound greater than the recorded
   743  start key.
   744  
   745  In the above example, this results in the bounds `[a,b)` and child iterators in
   746  the following positions:
   747  
   748  ```
   749  i0: (b, boundKindStart, [ [b,d) ])
   750  i1: (c, boundKindEnd,   [ [a,c) ])
   751  i2: (p, boundKindEnd,   [ [a,p) ])
   752  ```
   753  
   754  With the user key bounds of the next merged span established, the merging
   755  iterator must determine which, if any, of the range keys overlap the span.
   756  During forward iteration any child iterator that is now positioned at an end
   757  boundary has an overlapping span. (Justification: The child iterator's end
   758  boundary is ≥ the new end bound. The child iterator's range key's corresponding
   759  start boundary must be ≤ the new start bound since there were no other user keys
   760  between the new span's bounds. So the fragments associated with the iterator's
   761  current end boundary have start and end bounds such that start ≤ <new start
   762  bound> < <new end bound> ≤ end).
   763  
   764  The merging iterator iterates over the levels, collecting keys from any child
   765  iterators positioned at end boundaries. In the above example, i1 and i2 are
   766  positioned at end boundaries, so the merging iterator collects the keys of [a,c)
   767  and [a,p). These spans contain the merging iterator's [a,b) span, but they may
   768  also extend beyond the new span's start and end. The merging iterator returns
   769  the keys with the new start and end bounds, preserving the underlying keys'
   770  sequence numbers, key kinds and values.
   771  
   772  It may be the case that the merging iterator finds no levels positioned at span
   773  end boundaries, in which case the span overlaps with nothing. In this case the
   774  merging iterator loops, repeating the above process again until it finds a span
   775  that does contain keys.
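The result of forward iteration over the example can be reproduced with a simplified sketch. It does not use a heap or incremental iteration as described above; instead it sorts all unique bounds up front and, for each adjacent pair, collects the levels whose spans contain the candidate span. All names (`span`, `fragment`, `mergeFragments`) are hypothetical, and keys are plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

type span struct {
	start, end string // covers [start, end)
}

type fragment struct {
	start, end string
	levels     []int // indexes of child iterators with an overlapping span
}

// mergeFragments visits every span between adjacent unique bounds and keeps
// those that are covered by at least one level. Spans within a level are
// assumed to be non-overlapping.
func mergeFragments(levels [][]span) []fragment {
	seen := map[string]bool{}
	var bounds []string
	for _, lvl := range levels {
		for _, s := range lvl {
			for _, b := range []string{s.start, s.end} {
				if !seen[b] {
					seen[b] = true
					bounds = append(bounds, b)
				}
			}
		}
	}
	sort.Strings(bounds)

	var out []fragment
	for i := 0; i+1 < len(bounds); i++ {
		f := fragment{start: bounds[i], end: bounds[i+1]}
		for li, lvl := range levels {
			for _, s := range lvl {
				if s.start <= f.start && f.end <= s.end {
					f.levels = append(f.levels, li)
				}
			}
		}
		if len(f.levels) > 0 { // skip candidate spans that overlap nothing
			out = append(out, f)
		}
	}
	return out
}

func main() {
	levels := [][]span{
		{{"b", "d"}, {"e", "h"}}, // i0
		{{"a", "c"}, {"h", "k"}}, // i1
		{{"a", "p"}},             // i2
	}
	for _, f := range mergeFragments(levels) {
		fmt.Printf("[%s,%s) levels=%v\n", f.start, f.end, f.levels)
	}
}
```

For the example this yields the seven spans listed earlier; `[c,d)` is reported as covered by i0 and i2 even though neither has a span with exactly those bounds.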
   776  
   777  #### Efficient masking
   778  
   779  Recall that in the earlier example from the iteration interface, during
   780  forward iteration an iterator would output the following keys:
   781  
   782  ```
   783    Key()   HasPointAndRange()   Value()      RangeKeyBounds()    RangeKeys()
   784    a       (true,  true)        artichoke    [a,b)               {(@1,apple)}
   785    b       (false, true)        -            [b,c)               {(@7,kiwi), (@1,apple)}
   786    b@2     (true,  true)        beet         [b,c)               {(@7,kiwi), (@1,apple)}
   787    c       (false, true)        -            [c,e)               {(@7,kiwi), (@3,banana), (@1,apple)}
   788    e       (false, true)        -            [e,k)               {(@7,kiwi), (@5,orange), (@1,apple)}
   789    k       (false, true)        -            [k,m)               {(@5,orange), (@1,apple)}
   790    m       (false, true)        -            [m,z)               {(@1,apple)}
   791    t@3     (true,  true)        turnip       [m,z)               {(@1,apple)}
   792  ```
   793  
   794  When implementing an MVCC "soft delete range" operation using range keys, the
   795  range key `[b,k)@7→kiwi` may represent that all keys within the range [b,k) are
   796  deleted at MVCC timestamp @7. During iteration, it would be desirable if the
   797  caller could indicate that it does not want to observe any "soft deleted" point
   798  keys, and the iterator can safely skip them. Note that in a MVCC system, whether
   799  or not a key is soft deleted depends on the timestamp at which the database is
   800  read.
   801  
   802  This is implemented through "range key masking," where a range key may act as a
   803  mask, hiding point keys with MVCC timestamps beneath the range key. This
   804  iterator option requires that the client configure the iterator with a MVCC
   805  timestamp `suffix` representing the timestamp at which history should be read.
   806  All range keys with suffixes (MVCC timestamps) less than or equal to the
   807  configured suffix serve as masks. All point keys with suffixes (MVCC timestamps)
   808  less than a covering, masking range key's suffix are hidden.
   809  
   810  Specifically, when configured with a `RangeKeyMasking.Suffix` _s_, and there
   811  exists a range key with suffix _r_ covering a point key with suffix _p_, and
   812  _s_ ≤ _r_ < _p_ in suffix sort order (more recent timestamps sort first), then the point key is elided.
   813  
   814  In the above example, if `RangeKeyMasking.Suffix` is set to `@7`, every range
   815  key serves as a mask and the point key `b@2` is hidden during iteration because
   816  it's contained within the masking `[b,k)@7→kiwi` range key. Note that `t@3`
   817  would _not_ be masked, because its timestamp `@3` is more recent than the only
   818  range key that covers it (`[a,z)@1→apple`).
   819  
   820  If `RangeKeyMasking.Suffix` were set to `@6` (a historical, point-in-time read),
   821  the `[b,k)@7→kiwi` range key would no longer serve as a mask, and `b@2` would be
   822  visible.
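The masking rule can be written as a small predicate. The sketch below models suffixes as integer MVCC timestamps and user key prefixes as strings; `rangeKey` and `masked` are hypothetical names, and only the two range keys named in the text (`[b,k)@7` and `[a,z)@1`) are included.

```go
package main

import "fmt"

type rangeKey struct {
	start, end string // covers prefixes in [start, end)
	ts         int    // MVCC timestamp encoded in the suffix
}

// masked reports whether a point key with the given prefix and timestamp is
// hidden when the iterator is configured to read at readTS. A range key acts
// as a mask only if its timestamp is at or below readTS, and it hides only
// covered point keys with strictly older timestamps. Sequence numbers play
// no role.
func masked(prefix string, pointTS, readTS int, rks []rangeKey) bool {
	for _, rk := range rks {
		covers := rk.start <= prefix && prefix < rk.end
		if covers && rk.ts <= readTS && pointTS < rk.ts {
			return true
		}
	}
	return false
}

func main() {
	rks := []rangeKey{{"b", "k", 7}, {"a", "z", 1}}

	fmt.Println(masked("b", 2, 7, rks)) // true: b@2 is beneath [b,k)@7
	fmt.Println(masked("t", 3, 7, rks)) // false: t@3 is newer than [a,z)@1
	fmt.Println(masked("b", 2, 6, rks)) // false: [b,k)@7 is not a mask at @6
}
```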
   823  
   824  To efficiently implement masking, we cannot rely on the LSM invariant since
   825  `b@100` can be at a lower level than `[a,e)@50`. Instead, we build on
   826  block-property filters, supporting special use of a MVCC timestamp block
   827  property in order to skip blocks wholly containing point keys that are masked by
   828  a range key. The client may configure a block-property collector to record the
   829  highest MVCC timestamps of point keys within blocks.
   830  
   831  During read time, when positioned within a range key with a suffix ≤
   832  `RangeKeyMasking.Suffix`, the iterator configures sstable readers to use a
   833  block-property filter to skip any blocks for which the highest MVCC timestamp is
   834  less than the range key's suffix. Additionally, these iterators must consult index
   835  block bounds to ensure the block-property filter is not applied beyond the
   836  bounds of the masking range key.
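A rough sketch of the per-block decision, under simplifying assumptions: suffixes are integer timestamps, each block is described by inclusive prefix bounds plus the highest point-key timestamp recorded by the collector, and `block`, `rangeKey` and `canSkipBlock` are hypothetical names rather than Pebble's block-property filter interface.

```go
package main

import "fmt"

type rangeKey struct {
	start, end string // covers prefixes in [start, end)
	ts         int
}

type block struct {
	smallest, largest string // inclusive prefix bounds, taken from the index
	maxTS             int    // highest MVCC timestamp of any point key in the block
}

// canSkipBlock reports whether every point key in b is masked by rk when
// reading at readTS: rk must act as a mask, the block must lie entirely
// within rk's bounds, and every timestamp in the block must be older than rk.
func canSkipBlock(b block, rk rangeKey, readTS int) bool {
	if rk.ts > readTS {
		return false // rk is not a mask at this read timestamp
	}
	within := rk.start <= b.smallest && b.largest < rk.end
	return within && b.maxTS < rk.ts
}

func main() {
	mask := rangeKey{start: "b", end: "k", ts: 7}

	fmt.Println(canSkipBlock(block{"c", "f", 5}, mask, 7)) // true
	fmt.Println(canSkipBlock(block{"c", "f", 9}, mask, 7)) // false: block holds newer keys
	fmt.Println(canSkipBlock(block{"h", "m", 5}, mask, 7)) // false: block extends past k
}
```

The bounds check is what prevents the filter from hiding unmasked keys that sit just outside the range key.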
   837  
   838  ### CockroachDB use
   839  
   840  CockroachDB initially will only use range keys to represent MVCC range
   841  tombstones. See the MVCC range tombstones tech note for more details:
   842  
   843  https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/mvcc-range-tombstones.md
   844  
   845  ### Alternatives
   846  
   847  #### A1. Automatic elision of range keys that don't cover keys
   848  
   849  We could decide that range keys:
   850  
   851  - Don't contribute to `MVCCStats` themselves.
   852  - May be elided by Pebble when they cover zero point keys.
   853  
   854  This means that CockroachDB garbage collection does not need to explicitly
   855  remove the range keys, only the point keys they deleted. This option is clean
   856  when paired with `RANGEDEL`s dropping both point and range keys. CockroachDB can
   857  issue `RANGEDEL`s whenever it wants to drop a contiguous swath of points, and
   858  not worry about the fact that it might also need to update the MVCC stats for
   859  overlapping range keys.
   860  
   861  However, this option makes deterministic iteration over defragmented range keys
   862  for replica divergence detection challenging, because internal fragmentation may
   863  elide regions of a range key at any point.  Producing a normalized form would
   864  require storing state in the value (ie, the original start key) and
   865  recalculating the smallest and largest extant covered point keys within the
   866  range key and replica bounds. This would require maintaining _O_(range-keys)
   867  state during the `storage.ComputeStatsForRange` pass over a replica's combined
   868  point and range iterator.
   869  
   870  This likely forces replica divergence detection to use other means (eg, altering
   871  the checksum of covered points) to incorporate MVCC range tombstone state.
   872  
   873  This option is also highly tailored to the MVCC Delete Range use case.  Other
   874  range key usages, like ranged intents, would not want this behavior, so we don't
   875  consider it further.
   876  
   877  #### A2. Separate LSM of range keys
   878  
   879  There are two viable options for where to store range keys. They may be encoded
   880  within the same sstables as points in separate blocks, or in separate sstables
   881  forming a parallel range-key LSM. We examine the tradeoffs between storing range
   882  keys in the same sstable in different blocks ("shared sstables") or separate
   883  sstables forming a parallel LSM ("separate sstables"):
   884  
   885  - Storing range keys in separate sstables is possible because the only
   886    interactions between range keys and point keys happen at a global level.
   887    Masking is defined over suffixes. It may be extended to be defined over
   888    sequence numbers too (see the 'Sequence number masking' section below), but that is
   889    optional. Unlike range deletion tombstones, range keys have no effect on point
   890    keys during compactions.
   891  
   892  - With separate sstables, reads may need to open additional sstable(s) and read
   893    additional blocks. The number of additional sstables is the number of nonempty
   894    levels in the range-key LSM, so it grows logarithmically with the number of
   895    range keys. For each sstable, a read must read the index block and a data
   896    block.
   897  
   898  - With our expectation of few range keys, the range-key LSM is expected to be
   899    small, with one or two levels. Heuristics around sstable boundaries may
   900    prevent unnecessary range-key reads when there is no covering range key. Range
   901    key sstables and blocks are expected to have much higher table and block cache
   902    hit rates, since they are orders of magnitude less dense. Reads in any
   903    overlapping point sstables all access the same range key sstables.
   904  
   905  - With shared sstables, `SeekPrefixGE` cannot use bloom filters to entirely
   906    eliminate sstables that contain range keys. Pebble does not always use bloom
   907    filters in L6, so once a range key is compacted into L6 its impact on
   908    `SeekPrefixGE` is lessened. With separate sstables, `SeekPrefixGE` can always
   909    use bloom filters for point-key sstables. If there are any overlapping
   910    range-key sstables, the read must read them.
   911  
   912  - With shared sstables, range keys create dense sstable boundaries. A range key
   913    spanning an sstable boundary leaves no gap between the sstables' bounds. This
   914    can force ingested sstables into higher levels of the LSM, even if the
   915    sstables' point key spans don't overlap. This problem was previously observed
   916    with wide `RANGEDEL` tombstones and was mitigated by prioritizing compaction
   917    of sstables that contain `RANGEDEL` keys. We could do the same with range
   918    keys, but the write amplification is expected to be much worse. The `RANGEDEL`
   919    tombstones drop keys and eventually are dropped themselves as long as there is
   920    not an open snapshot. Range keys do not drop data and are expected to persist
   921    in L6 for long durations, always requiring ingested sstables to be inserted
   922    into L5 or above.
   923  
   924  - With separate sstables, compaction logic is separate, which helps avoid
   925    complexity of tricky sstable boundary conditions. Because there are expected
   926    to be an order of magnitude fewer range keys, we could impose the constraint
   927    that a prefix cannot be split across multiple range key sstables. The
   928    simplified compaction logic comes at the cost of higher levels, iterators, etc
   929    all needing to deal with the concept of two parallel LSMs.
   930  
   931  - With shared sstables, the LSM invariant is maintained between range keys and
   932    point keys. For example, if the point key `b@20` is committed, and
   933    subsequently a range key `RangeKey([a,c), @25, ...)` is committed, the range
   934    key will never fall below the covered point `b@20` within the LSM.
   935  
   936  We decide to share sstables, because preserving the LSM invariant between range
   937  keys and point keys is expected to be useful in the long-term.
   938  
   939  #### A3. Sequence number masking
   940  
   941  In the CockroachDB MVCC range tombstone use case, a point key should never be
   942  written below an existing range key with a higher timestamp. The MVCC range
   943  tombstone use case would allow us to dictate that an overlapping range key with
   944  a higher sequence number always masks point keys with lower sequence numbers.
   945  Adding this additional masking scope would avoid the comparatively costly suffix
   946  comparison when a point key _is_ masked by a range key. We need to consider how
   947  sequence number masking might be affected by the merging of range keys within
   948  snapshot stripes.
   949  
   950  Consider the committing of range key `[a,z)@{t1}#10`, followed by point keys
   951  `d@t2#11` and `m@t2#11`, followed by range key `[j,z)@{t3}#12`.  This sequencing
   952  respects the expected timestamp, sequence number relationship in CockroachDB's
   953  use case. If all keys are flushed within the same sstable, fragmentation and
   954  merging overlapping fragments yields range keys `[a,j)@{t1}#10`,
   955  `[j,z)@{t3,t1}#12`. The key `d@t2#11` must not be masked because it's not
   956  covered by the new range key, and indeed that's the case because the covering
   957  range key's fragment is unchanged `[a,j)@{t1}#10`.
   958  
   959  For now we defer this optimization, with the expectation that we may not be able
   960  to preserve this relationship between sequence numbers and suffixes in all range
   961  key use cases.