github.com/cockroachdb/pebble@v1.1.1-0.20240513155919-3622ade60459/docs/range_deletions.md

     1  # Range Deletions
     2  
     3  TODO: The following explanation of range deletions does not take into account
     4  the recent change to prohibit splitting of a user key between sstables. This
     5  change simplifies the logic, removing 'improperly truncated range tombstones.'
     6  
     7  TODO: The following explanation of range deletions ignores the
     8  kind/trailer that appears at the end of keys after the sequence
     9  number. This should be harmless, but we need to add a justification for why
    10  it is harmless.
    11  
    12  ## Background and Notation
    13  
    14  Range deletions are represented as `[start, end)#seqnum`. Points
    15  (set/merge/...) are represented as `key#seqnum`. A range delete `[s, e)#n1`
    16  deletes every point `k#n2` where `k \in [s, e)` and `n2 < n1`.
    17  The inequality `n2 < n1` is to handle the case where a range delete and
    18  a point have the same sequence number -- this happens during sstable
    19  ingestion where the whole sstable is assigned a single sequence number
    20  that applies to all the data in it.
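
The coverage rule above can be sketched in Go as follows. This is a minimal illustration that assumes user keys compare bytewise as strings; the `rangeDel` and `point` types and the `covers` function are hypothetical names, not Pebble's internal types.

```go
package main

import "fmt"

// rangeDel models a range deletion [Start, End)#Seq.
// Hypothetical type for illustration only.
type rangeDel struct {
	Start, End string
	Seq        uint64
}

// point models a point key#seq (set/merge/...).
type point struct {
	Key string
	Seq uint64
}

// covers reports whether rd deletes p: the key must lie in [Start, End)
// and the point's sequence number must be strictly smaller than rd's.
func covers(rd rangeDel, p point) bool {
	return rd.Start <= p.Key && p.Key < rd.End && p.Seq < rd.Seq
}

func main() {
	rd := rangeDel{Start: "c", End: "h", Seq: 10}
	fmt.Println(covers(rd, point{"d", 9}))  // true
	fmt.Println(covers(rd, point{"d", 10})) // false: equal seqnum, as in an ingested sstable
	fmt.Println(covers(rd, point{"h", 1}))  // false: the end key is exclusive
}
```

The strict inequality is what keeps a point and a range delete that share a sequence number from interfering with each other.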
    21  
    22  There is additionally an infinity sequence number, represented as
    23  `inf`, which is not used for any point, that we can use for reasoning
    24  about range deletes.
    25  
    26  It has been asked why range deletes use an exclusive end key instead
    27  of an inclusive end key. For string keys, one can convert a desired
    28  range delete on `[s, e]` into a range delete on `[s, ImmediateSuccessor(e))`.
    29  For strings, the immediate successor of a key
    30  is that key with a \0 appended to it. However one cannot go in the
    31  other direction: if one could represent only inclusive end keys in a
    32  range delete and one desires to delete a range with an exclusive end
    33  key `[s, e)#n`, one needs to compute `ImmediatePredecessor(e)` which
    34  is an infinite length string. For example,
    35  `ImmediatePredecessor("ab")` is `"aa\xff\xff...."`. Additionally,
    36  regardless of user needs, the exclusive end key helps with splitting a
    37  range delete as we will see later. 
    38  
    39  We will sometimes use ImmediatePredecessor and ImmediateSuccessor in
    40  the following for illustrating an idea, but we do not rely on them as
    41  something that is viable to produce for a particular kind of key. And
    42  even if viable, these functions are not currently provided to
    43  RocksDB/Pebble.
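
For byte-string keys under bytewise ordering, the inclusive-to-exclusive conversion described above can be sketched as below. `immediateSuccessor` and `inclusiveToExclusive` are hypothetical helpers written for this illustration; as noted, no such function is provided to RocksDB/Pebble.

```go
package main

import "fmt"

// immediateSuccessor returns the smallest byte string that is strictly
// greater than key under bytewise ordering: key with one 0x00 byte appended.
// Hypothetical helper, valid only for plain byte-string keys.
func immediateSuccessor(key []byte) []byte {
	out := make([]byte, len(key)+1) // the extra byte is zero-initialized
	copy(out, key)
	return out
}

// inclusiveToExclusive converts the inclusive range [s, e] into the
// equivalent exclusive-end range [s, immediateSuccessor(e)).
func inclusiveToExclusive(s, e []byte) (start, end []byte) {
	return s, immediateSuccessor(e)
}

func main() {
	_, end := inclusiveToExclusive([]byte("a"), []byte("ab"))
	fmt.Printf("%q\n", end) // "ab\x00"
}
```

No finite-length counterpart exists for `ImmediatePredecessor`, which is why the reverse conversion is not possible.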
    44  
    45  ### Visualization
    46  
    47  If we consider a 2 dimensional space with increasing keys on the X
    48  axis (with every possible user key represented) and increasing
    49  sequence numbers on the Y axis, range deletes apply to a rectangle
    50  whose bottom edge sits on the X axis.
    51  
    52  The actual space represented by the ordering in our sstables is a one
    53  dimensional space where `k1#n1` is less than `k2#n2` if either of the
    54  following holds:
    55  
    56  - k1 < k2
    57  
    58  - k1 = k2 and n1 > n2 (under the assumption that no two points with
    59  the same key have the same sequence number).
    60  
    61  ```
    62    ^
    63    |   .       > .        > .        > yy
    64    |   .      >  .       >  .       >  .
    65    |   .     >   .      >   .      >   .
    66  n |   V    >    xx    >    .     >    V
    67    |   .   >     x.   >     x.   >     . 
    68    |   .  >      x.  >      x.  >      .
    69    |   . >       x. >       x. >       .
    70    |   .>        x.>        x.>        .
    71    ------------------------------------------>
    72                  k        IS(k)    IS(IS(k))
    73  ```
    74  
    75  The above figure uses `.` to represent points and the X axis is dense in
    76  that it represents all possible keys. `xx` represents the start of a
    77  range delete and `x.` are the points which it deletes. The arrows `V` and
    78  `>` represent the ordering of the points in the one dimensional space.
    79  `IS` is shorthand for `ImmediateSuccessor` and the range delete represented
    80  there is `[k, IS(IS(k)))#n`. Ignore `yy` for now.
    81  
    82  The one dimensional space works fine in a world with only points. But
    83  issues arise when storing range deletes, which represent an action in 2
    84  dimensional space, into this one dimensional space.
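
The one dimensional ordering defined at the start of this section can be sketched as a comparison function. The `ikey` type, `less`, and the `seqInf` constant are hypothetical names; `seqInf` is just the largest `uint64` standing in for `inf`, and Pebble's real key encoding (which also carries the kind/trailer) is not modeled here.

```go
package main

import "fmt"

// seqInf stands in for the document's `inf` sequence number (assumption:
// any value larger than every real sequence number would do).
const seqInf = ^uint64(0)

// ikey models key#seqnum. Hypothetical type for illustration only.
type ikey struct {
	Key string
	Seq uint64
}

// less implements the ordering: smaller user key first; for equal user
// keys, the larger sequence number sorts first.
func less(a, b ikey) bool {
	if a.Key != b.Key {
		return a.Key < b.Key
	}
	return a.Seq > b.Seq
}

func main() {
	fmt.Println(less(ikey{"a", 7}, ikey{"a", 1}))      // true: same key, newer first
	fmt.Println(less(ikey{"a", 1}, ikey{"b", 9}))      // true: smaller user key first
	fmt.Println(less(ikey{"b", seqInf}, ikey{"b", 3})) // true: inf sorts before every point of "b"
}
```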
    85  
    86  ## Range Delete Boundaries and the Simplest World
    87  
    88  RocksDB/Pebble store the inclusive bounds of each sstable in one dimensional
    89  space. The range delete's two dimensional behavior and exclusive end key need
    90  to be adapted to this requirement. For a range delete `[s, e)#n`,
    91  the smallest key it acts on is `s#(n-1)` and the largest key it
    92  acts on is `ImmediatePredecessor(e)#0`. So if we position the range delete
    93  immediately before the smallest key it acts on and immediately after
    94  the largest key it acts on we can give it a tight inclusive bound of
    95  `[s#n, e#inf]`.  
    96  
    97  Note again that this range delete does not delete everything in its
    98  inclusive bound. For example, range delete `["c", "h")#10` has a tight
    99  inclusive bound of `["c"#10, "h"#inf]` but does not delete `"d"#11`
   100  which lies in that bound. Going back to our earlier diagram, the one
   101  dimensional inclusive bounds go from the `xx` to `yy` but there are
   102  `.`s in between, in the one dimensional order, that are not deleted.
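
The tight bound and the `"d"#11` example above can be checked with a small sketch. All names here (`ikey`, `rangeDel`, `tightBounds`, `inBounds`, `covers`, `seqInf`) are hypothetical and assume bytewise string keys.

```go
package main

import "fmt"

const seqInf = ^uint64(0) // stand-in for `inf` (assumption)

type ikey struct {
	Key string
	Seq uint64
}

type rangeDel struct {
	Start, End string
	Seq        uint64
}

// less orders by user key ascending, then sequence number descending.
func less(a, b ikey) bool {
	if a.Key != b.Key {
		return a.Key < b.Key
	}
	return a.Seq > b.Seq
}

// tightBounds returns the inclusive one dimensional bound [s#n, e#inf].
func tightBounds(rd rangeDel) (lo, hi ikey) {
	return ikey{rd.Start, rd.Seq}, ikey{rd.End, seqInf}
}

// inBounds reports whether lo <= k <= hi in the one dimensional order.
func inBounds(k, lo, hi ikey) bool {
	return !less(k, lo) && !less(hi, k)
}

// covers reports whether rd actually deletes the point p.
func covers(rd rangeDel, p ikey) bool {
	return rd.Start <= p.Key && p.Key < rd.End && p.Seq < rd.Seq
}

func main() {
	rd := rangeDel{"c", "h", 10}
	lo, hi := tightBounds(rd)
	p := ikey{"d", 11}
	// "d"#11 lies inside the bound but is not deleted.
	fmt.Println(inBounds(p, lo, hi), covers(rd, p)) // true false
}
```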
   103  
   104  This is the reason why one cannot in general
   105  use a range delete to seek over all points within its bounds. The one
   106  exception to this seeking behaviour is that when we can order sstables
   107  from new to old, one can "blindly" use this range delete in a newer
   108  sstable to seek to `"h"` in all older sstables since we know those
   109  older sstables must only have point keys with sequence numbers `< 10`
   110  for the keys in interval `["c", "h")`. This partial order across
   111  sstables exists in RocksDB/Pebble between memtable, L0 sstables (where
   112  it is a total order) and across sstables in different levels.
   113  
   114  Coming back to the inclusive bounds of the range delete, `[s#n, e#inf]`:
   115  these bounds participate in deciding the bounds of the
   116  sstable. In this world, one can read all the entries in an sstable and
   117  compute its bounds. However being able to construct these bounds by
   118  reading an sstable is not essential -- RocksDB/Pebble store these
   119  bounds in the `MANIFEST`. This latter fact has been exploited to
   120  construct a real world (later section) where the bounds of an sstable
   121  are not computable by reading all its keys.
   122  
   123  If we had a system with one sstable per level, for each level lower
   124  than L0, we are effectively done. We have represented the tight bounds
   125  of each range delete and it is within the bounds of the sstable. This
   126  works even with L0 => L0 compactions assuming they output exactly one
   127  sstable.
   128  
   129  ## The Mostly Simple World
   130  
   131  Here we have multiple files for levels lower than L0 that are
   132  non-overlapping in their file bounds. These multiple files occur because
   133  compactions produce multiple files. This introduces the need to split a
   134  range delete across the files being produced by a compaction.
   135  
   136  There is a clean way to split a range delete `[s, e)#n` into 2 parts
   137  (which can be recursively applied to split into arbitrarily many
   138  parts): split into `[s, m)#n` and `[m, e)#n`. These range deletes
   139  apply to non-overlapping points and their tight bounds are `[s#n,
   140  m#inf]` and `[m#n, e#inf]`, which are also non-overlapping.
   141  
   142  Consider the following example of an input range delete `["c", "h")#10` and
   143  the following two output files from a compaction:
   144  
   145  ```
   146            sst1            sst2
   147  last point is "e"#7 | first point is "f"#20
   148  ```
   149  
   150  The range delete can be split into `["c", "f")#10` and `["f",
   151  "h")#10`, by using the first point key of sst2 as the split
   152  point. Then the bounds of sst1 and sst2 will be `[..., "f"#inf]` and
   153  `["f"#20, ...]` which are non-overlapping. It is still possible to compute
   154  the sstable bounds by looking at all the entries in the sstable.
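
The clean split used in this example can be sketched as follows. `split`, `rangeDel`, `ikey`, `less`, and `seqInf` are hypothetical names for this illustration; the actual fragmentation logic in Pebble is more involved.

```go
package main

import "fmt"

const seqInf = ^uint64(0) // stand-in for `inf` (assumption)

type ikey struct {
	Key string
	Seq uint64
}

type rangeDel struct {
	Start, End string
	Seq        uint64
}

// less orders by user key ascending, then sequence number descending.
func less(a, b ikey) bool {
	if a.Key != b.Key {
		return a.Key < b.Key
	}
	return a.Seq > b.Seq
}

// split divides [s, e)#n at m into [s, m)#n and [m, e)#n.
// It reports false if m is not strictly inside (s, e).
func split(rd rangeDel, m string) (left, right rangeDel, ok bool) {
	if !(rd.Start < m && m < rd.End) {
		return rangeDel{}, rangeDel{}, false
	}
	return rangeDel{rd.Start, m, rd.Seq}, rangeDel{m, rd.End, rd.Seq}, true
}

func main() {
	left, right, ok := split(rangeDel{"c", "h", 10}, "f")
	fmt.Println(left, right, ok) // {c f 10} {f h 10} true

	// sst1 ends at "f"#inf, which sorts before sst2's first key "f"#20,
	// so the two sstable bounds do not overlap.
	fmt.Println(less(ikey{"f", seqInf}, ikey{"f", 20})) // true
}
```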
   155  
   156  ## The Real World
   157  
   158  Continuing with the same range delete `["c", "h")#10`, we can have the
   159  following sstables produced during a compaction:
   160  
   161  ```
   162           sst1       sst2         sst3        sst4     sst5
   163  points: "e"#7 | "f"#12 "f"#7 | "f"#4 "f"#3 | "f"#1 | "g"#15
   164  ```
   165  
   166  The range deletes written to these ssts are
   167  
   168  ```
   169        sst1           sst2            sst3           sst4          sst5
   170  ["c", "h")#10 | ["f", "h")#10 | ["f", "h")#10 | ["f", "h")#10 | ["g", "h")#10
   171  ```
   172  
   173  The Pebble code that achieves this effect is in
   174  `rangedel.Fragmenter`. It is a code structuring artifact that sst1
   175  does not contain a range delete equal to `["c", "f")#10` and sst4 does
   176  not contain `["f", "g")#10`. However for the range deletes in sst2 and
   177  sst3 we cannot do any better because we don't know what the key
   178  following "f" will be (the compaction cannot look ahead) and because
   179  we don't have an `ImmediateSuccessor` function (otherwise we could
   180  have written `["f", ImmediateSuccessor("f"))#10` to sst2, sst3). But
   181  the code artifacts are not the ones introducing the real complexity.
   182  
   183  The range delete bounds are
   184  
   185  ```
   186        sst1        sst2, sst3, sst4          sst5
   187  ["c"#10, "h"#inf] ["f"#10, "h"#inf]   ["g"#10, "h"#inf]
   188  
   189  ```
   190  
   191  We note the following:
   192  
   193  - The bounds of range deletes are overlapping since we have been
   194    unable to split the range deletes. If these decide the sstable
   195    bounds, the sstables will have overlapping bounds. This is not
   196    permissible.
   197  
   198  - The range deletes included in each sstable result in that sstable
   199    being "self-sufficient" wrt having the range delete that deletes
   200    some of the points in the sstable (let us assume that the points in
   201    this example have not been dropped from that sstable because of a
   202    snapshot).
   203  
   204  - The transitions from sst1 to sst2 and sst4 to sst5 are **clean** in
   205    that we can pretend that the range deletes in those files are actually:
   206  
   207  ```
   208        sst1           sst2            sst3           sst4          sst5
   209  ["c", "f")#10 | ["f", "g")#10 | ["f", "g")#10 | ["f", "g")#10 | ["g", "h")#10
   210  ```
   211  
   212  We could achieve some of these **clean** transitions (but not all) with a
   213  code change. Also note that these better range deletes maintain the
   214  "self-sufficient" property.
   215  
   216  ### Making Non-overlapping SSTable bounds
   217  
   218  We force the sstable bounds to be non-overlapping by setting them to:
   219  
   220  ```
   221        sst1              sst2           sst3            sst4              sst5
   222  ["c"#10, "f"#inf] ["f"#12, "f"#7] ["f"#4, "f"#3] ["f"#1, "g"#inf] ["g"#15, "h"#inf]
   223  ```
   224  
   225  Note that for sst1...sst4 the sstable bounds are smaller than the
   226  bounds of the range deletes contained in them. The code that
   227  accomplishes this in Pebble is in `compaction.go` -- we will not discuss the
   228  details of that logic here but note that it is placing an `inf`
   229  sequence number for a clean transition and for an unclean transition
   230  it is using the point keys as the bounds.
   231  
   232  Associated with these narrower bounds, we add the following
   233  requirement: a range delete in an sstable must **act-within** the bounds of
   234  the sstable it is contained in. In the above example:
   235  
   236  - sst1: range delete `["c", "h")#10` must act-within the bound `["c"#10, "f"#inf]`
   237  
   238  - sst2: range delete `["f", "h")#10` must act-within the bound `["f"#12, "f"#7]`
   239  
   240  - sst3: range delete `["f", "h")#10` must act-within the bound `["f"#4, "f"#3]`
   241  
   242  - sst4: range delete `["f", "h")#10` must act-within the bound `["f"#1, "g"#inf]`
   243  
   244  - And so on.
   245  
   246  The intuitive reason for the **act-within** requirement is that 
   247  sst5 can be compacted and moved down to a lower level independent of
   248  sst1-sst4, since it was at a **clean** boundary. We do not want the
   249  range delete `["f", "h")#10` sitting in sst1...sst4 at the higher
   250  level to act on `"g"#15` that has been moved to the lower level. Note
   251  that this incorrect action can happen due to 2 reasons:
   252    
   253  1. The invariant that lower levels have older data for keys
   254     that also exist in higher levels means we can (a) seek a lower level
   255     sstable to the end of a range delete from a higher level, (b) for a key
   256     lookup, stop searching in lower levels once a range delete is encountered
   257     for that key in a higher level.
   258
   259  2. Sequence number zeroing logic can change the sequence number of
   260     `"g"#15` to `"g"#0` (for better compression) once it realizes that
   261     there are no older versions of `"g"`. It would be incorrect for this
   262     `"g"#0` to be deleted.
   263  
   264  
   265  #### Loss of Power
   266  
   267  This act-within behavior introduces some "loss of power" for
   268  the original range delete `["c", "h")#10`. By acting within sst2...sst4
   269  it can no longer delete keys `"f"#6`, `"f"#5`, `"f"#2`.
   270  
   271  Luckily for us, this is harmless since these keys cannot have existed
   272  in the system due to the levelling behavior: we cannot be writing
   273  sst2...sst4 to level `i` if versions of `"f"` younger than `"f"#4` are
   274  already in level `i` or versions older than `"f"#7` have been left in
   275  level `i - 1`. There is some trickery possible to prevent this "loss of
   276  power" for queries (see the "Putting it together" section), but given
   277  the history of bugs in this area, we should be cautious.
   278  
   279  ### Improperly truncated Range Deletes
   280  
   281  We refer to range deletes that have experienced this "loss of power"
   282  as **improper**. In the above example the range deletions in sst2, sst3, sst4
   283  are improper. The problem with improper range deletions occurs
   284  when they need to participate in a future compaction: even though we
   285  have restricted them to act-within their current sstable boundary, we
   286  don't have a way of **"writing"** this restriction to a new sstable,
   287  since they still need to be written in the `[s, e)#n` format.
   288  
   289  For example, sst2 has delete `["f", "h")#10` that must act-within
   290  the bound `["f"#12, "f"#7]`. If sst2 was compacted down to the next
   291  level into a new sstable (whose bounds we cannot predict because they
   292  depend on other data written to that sstable) we need to be able to
   293  write a range delete entry that follows the original restriction. But
   294  the narrowest we can write is `["f", ImmediateSuccessor("f"))#10`. This
   295  is an expansion of the act-within restriction with potentially
   296  unintended consequences. In this case the expansion happened in the suffix.
   297  For sst4, the range deletion `["f", "h")#10` must act-within `["f"#1, "g"#inf]`,
   298  and we can precisely represent the constraint on the suffix by writing
   299  `["f", "g")#10` but it does not precisely represent that this range delete
   300  should not apply to `"f"#9`...`"f"#2`.
   301  
   302  In comparison, the sst1 range delete `["c", "h")#10` that must act-within
   303  the bound `["c"#10, "f"#inf]` is not improper. This restriction can
   304  be applied precisely to get a range delete `["c", "f")#10`. 
   305  
   306  The solution to this is to note that while individual sstables have
   307  improper range deletes, if we look at a collection of sstables we
   308  can restore the improper range deletes spread across them to their proper self
   309  (and their full power). To accurately find these improper range
   310  deletes would require looking into the contents of a file, which is
   311  expensive. But we can construct a pessimistic set based on
   312  looking at the sequence of all files in a level and partitioning them:
   313  adjacent files `f1`, `f2` with largest and smallest bound `k1#n1`,
   314  `k2#n2` must be in the same partition if
   315  
   316  ```
   317  k1 = k2 and n1 != inf
   318  ```
   319  
   320  In the above example sst2, sst3, sst4 are one partition. The
   321  **spanning bound** of this partition is `["f"#12, "g"#inf]` and the
   322  range delete `["f", "h")#10` when constrained to act-within this
   323  spanning bound is precisely the range delete `["f",
   324  "g")#10`. Intuitively, the "loss of power" of this range delete has
   325  been restored for the sake of making it proper, so it can be
   326  accurately "written" in the output of the compaction (it may be
   327  improperly fragmented again in the output, but we have already
   328  discussed that). Such partitions are called "atomic compaction groups"
   329  and must participate as a whole in a compaction (and a
   330  compaction can use multiple atomic compaction groups as input).
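
The pessimistic partitioning rule above (`k1 = k2 and n1 != inf`) can be sketched as a single pass over the files of a level. The `fileMeta` type, `atomicGroups`, and `seqInf` are hypothetical names for this illustration and do not mirror Pebble's actual metadata structures; the input is assumed to be sorted and non-overlapping.

```go
package main

import "fmt"

const seqInf = ^uint64(0) // stand-in for `inf` (assumption)

type ikey struct {
	Key string
	Seq uint64
}

// fileMeta holds the sstable bounds as they would be recorded in the MANIFEST.
type fileMeta struct {
	Name              string
	Smallest, Largest ikey
}

// atomicGroups partitions the files of one level so that adjacent files
// f1, f2 share a group when f1's largest user key equals f2's smallest
// user key and f1's largest sequence number is not inf.
func atomicGroups(files []fileMeta) [][]fileMeta {
	var groups [][]fileMeta
	for i, f := range files {
		if i > 0 {
			prev := files[i-1]
			if prev.Largest.Key == f.Smallest.Key && prev.Largest.Seq != seqInf {
				last := len(groups) - 1
				groups[last] = append(groups[last], f)
				continue
			}
		}
		groups = append(groups, []fileMeta{f})
	}
	return groups
}

func main() {
	level := []fileMeta{
		{"sst1", ikey{"c", 10}, ikey{"f", seqInf}},
		{"sst2", ikey{"f", 12}, ikey{"f", 7}},
		{"sst3", ikey{"f", 4}, ikey{"f", 3}},
		{"sst4", ikey{"f", 1}, ikey{"g", seqInf}},
		{"sst5", ikey{"g", 15}, ikey{"h", seqInf}},
	}
	for _, g := range atomicGroups(level) {
		names := make([]string, 0, len(g))
		for _, f := range g {
			names = append(names, f.Name)
		}
		fmt.Println(names) // [sst1], then [sst2 sst3 sst4], then [sst5]
	}
}
```

Only the recorded bounds are consulted, which is why the grouping is pessimistic: it never needs to open a file to look for improper range deletes.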
   331  
   332  Consider another example:
   333  
   334  ```
   335            sst1              sst2
   336  points:  "e"#12         |  "e"#10
   337  delete: ["c", "g")#8    | ["c", "g")#8
   338  bounds  ["c"#8, "e"#12] | ["e"#10, "g"#inf]
   339  ```
   340  
   341  sst1, sst2 are an atomic compaction group. Say we violated the
   342  requirement that both be inputs in a compaction and only compacted
   343  sst2 down to level `i + 1` and then down to level `i + 2`. Then we add
   344  sst3 with bounds `["h"#10, "j"#5]` to level `i` and sst1 and sst3 are
   345  compacted to level `i + 1` into a single sstable. This new sstable
   346  will have bounds `["c"#8, "j"#5]` so these bounds do not help with the
   347  original act-within constraint on `["c", "g")#8` (that it should
   348  act-within `["c"#8, "e"#12]`). The narrowest we can construct (if we had
   349  `ImmediateSuccessor`) would be `["c", ImmediateSuccessor("e"))#8`.  Now we
   350  can incorrectly apply this range delete that is in level `i + 1` to `"e"#10`
   351  sitting in level `i + 2`. Note that this example can be made worse using
   352  sequence number zeroing -- `"e"#10` may have been rewritten to `"e"#0`.  
   353  
   354  If a range delete `[s, e)#n` is in an atomic compaction group with
   355  spanning bounds `[k1#n1, k2#n2]` our construction above guarantees the
   356  following properties:
   357  
   358  - `k1#n1 <= s#n`, so the bounds do not constrain the start of the
   359    range delete.
   360  
   361  - `k2 >= e` or `n2 = inf`, so if `k2` is constraining the range delete
   362    it will properly truncate the range delete.
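
These two properties are what make truncation against the spanning bounds expressible in the `[s, e)#n` format. A minimal sketch, with hypothetical names (`truncateToSpan`, `rangeDel`, `ikey`, `seqInf`) and assuming the first property already holds for the start key:

```go
package main

import "fmt"

const seqInf = ^uint64(0) // stand-in for `inf` (assumption)

type ikey struct {
	Key string
	Seq uint64
}

type rangeDel struct {
	Start, End string
	Seq        uint64
}

// truncateToSpan constrains rd to act within an upper bound k2#n2.
// It reports false when the bound cuts into the versions of a single key
// (k2 < e and n2 != inf), in which case the result cannot be written as
// a [s, e)#n range delete.
func truncateToSpan(rd rangeDel, largest ikey) (rangeDel, bool) {
	if largest.Key >= rd.End {
		return rd, true // the bound does not constrain the range delete
	}
	if largest.Seq != seqInf {
		return rangeDel{}, false // improper truncation
	}
	return rangeDel{Start: rd.Start, End: largest.Key, Seq: rd.Seq}, true
}

func main() {
	rd := rangeDel{"f", "h", 10}

	// Spanning bound of the group sst2..sst4 ends at "g"#inf.
	proper, ok := truncateToSpan(rd, ikey{"g", seqInf})
	fmt.Println(proper, ok) // {f g 10} true

	// The bound of sst2 alone ends at "f"#7: not representable.
	_, ok = truncateToSpan(rd, ikey{"f", 7})
	fmt.Println(ok) // false
}
```

With whole atomic compaction groups as input, the second branch is never taken, which is the guarantee stated above.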
   363  
   364  
   365  #### New sstable at sequence number 0
   366  
   367  A new sstable can be assigned sequence number 0 (and be written to L0)
   368  if the keys in the sstable are not in any other sstable. This
   369  comparison uses the keys and not key#seqnum, so the loss and
   370  restoration of power does not cause problems since that occurs within
   371  the versions of a single key.
   372  
   373  #### Flawed optimizations
   374  
   375  For the case where the atomic compaction group corresponds to the lower
   376  level of a compaction, it may initially seem to be correct to use only
   377  a prefix or suffix of that group in a compaction. In this case the
   378  prefix (suffix) will correspond to the largest key (smallest key) in
   379  the input sstables in the compaction and so can continue to constrain
   380  the range delete.  For example, sst1 and sst2 are in the same atomic
   381  compaction group:
   382  
   383  ```
   384            sst1               sst2
   385  points: "c"#10 "e"#12    |  "e"#10
   386  delete: ["c", "g")#8     | ["c", "g")#8
   387  bounds  ["c"#10, "e"#12] | ["e"#10, "g"#inf]
   388  ```
   389  
   390  and this is the lower level of a compaction with
   391  
   392  ```
   393            sst3
   394  points: "a"#14 "d"#15
   395  bounds  ["a"#14, "d"#15]
   396  ```
   397  
   398  we could allow for a compaction involving sst1 and sst3 which would produce
   399  
   400  ```
   401            sst4
   402  points: "a"#14 "c"#10 "d"#15 "e"#12
   403  delete: ["c", "g")#8
   404  bounds  ["a"#14, "e"#12]
   405  ```
   406  
   407  and the range delete is still improper but its act-within constraint has
   408  not expanded.
   409  
   410  But we have to be very careful to not have a more significant loss of power
   411  of this range delete. Consider a situation where sst3 had a single delete
   412  `"e"#16`. It still does not overlap in bounds with sst2 and we again pick
   413  sst1 and sst3 for compaction. This single delete will cause `"e"#12` to be deleted
   414  and sst4 bounds would be (unless we had complicated code preventing it):
   415  
   416  ```
   417            sst4
   418  points: "a"#14 "c"#10 "d"#15
   419  delete: ["c", "g")#8
   420  bounds  ["a"#14, "d"#15]
   421  ```
   422  
   423  Now this delete cannot delete `"dd"#6` and we have lost the ability to know
   424  that sst4 and sst2 are in the same atomic compaction group.
   425  
   426  
   427  ### Putting it together
   428  
   429  Summarizing the above, we have:
   430  
   431  - SSTable bounds logic that ensures sstables are not
   432  overlapping. These sstables contain range deletes that extend outside
   433  these bounds. But these range deletes should **act-within** the
   434  sstable bounds.
   435  
   436  - Compactions: they need to constrain the range deletes in the inputs
   437  to **act-within**, but this can create problems with **writing** the
   438  **improper** range deletes. The solution is to include the full
   439  **atomic compaction group** in a compaction so we can restore the
   440  **improper** range deletes to their **proper** self and then apply the
   441  constraints of the atomic compaction group.
   442  
   443  - Queries: We need to act-within the file bound constraint on the range delete.
   444    Say the range delete is `[s, e)#n` and the file bound is `[b1#n1,
   445    b2#n2]`. We are guaranteed that `b1#n1 <= s#n` so the only
   446    constraint can come from `b2#n2`.
   447    
   448    - Deciding whether a range delete covers a key in the same or lower levels.
   449  
   450      - `b2 >= e`: there is no act-within constraint.
   451      - `b2 < e`: to be precise we cannot let it delete `b2#n2-1` or
   452        later keys. But it is likely that allowing it to delete up to
   453        `b2#0` would be ok due to the atomic compaction group. This
   454        would prevent the so-called "loss of power" discussed earlier if
   455        one also includes the argument that the gap in the file bounds
   456        that also represents the loss of power is harmless (the gap
   457        exists within versions of a key, and anyone doing a query for that
   458        key will start from the sstable to the left of the gap). But it
   459        may be better to be cautious.
   460  
   461    - For using the range delete to seek sstables at lower levels.
   462      - `b2 >= e`: seek to `e` since there is no act-within constraint.
   463      - `b2 < e`: seek to `b2`. We are ignoring that this range delete
   464        is allowed to delete some versions of `b2` since this is just a
   465        performance optimization.
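
The query-side rules in the last bullet can be sketched as two small functions. This follows the cautious reading, in which a range delete never deletes keys after `b2#n2`. The names (`deletesWithin`, `seekTarget`, `rangeDel`, `ikey`) are hypothetical and this is not Pebble's iterator code.

```go
package main

import "fmt"

type ikey struct {
	Key string
	Seq uint64
}

type rangeDel struct {
	Start, End string
	Seq        uint64
}

// less orders by user key ascending, then sequence number descending.
func less(a, b ikey) bool {
	if a.Key != b.Key {
		return a.Key < b.Key
	}
	return a.Seq > b.Seq
}

// deletesWithin reports whether rd, constrained to act within a file whose
// largest bound is b2#n2, deletes point p. Only the upper bound matters
// because the file's smallest bound never constrains the start key.
func deletesWithin(rd rangeDel, largest, p ikey) bool {
	covered := rd.Start <= p.Key && p.Key < rd.End && p.Seq < rd.Seq
	return covered && !less(largest, p)
}

// seekTarget returns the user key that lower-level sstables can be seeked
// to: e when the file bound does not constrain rd, otherwise b2.
func seekTarget(rd rangeDel, largest ikey) string {
	if largest.Key >= rd.End {
		return rd.End
	}
	return largest.Key
}

func main() {
	rd := rangeDel{"f", "h", 10}
	sst2Largest := ikey{"f", 7}

	fmt.Println(deletesWithin(rd, sst2Largest, ikey{"f", 8})) // true
	fmt.Println(deletesWithin(rd, sst2Largest, ikey{"f", 6})) // false: outside the file bound
	fmt.Println(seekTarget(rd, sst2Largest))                  // f
}
```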
   466  
   467  
   468  
   469  
   470  
   471