# Range Deletions

TODO: The following explanation of range deletions does not take into account
the recent change to prohibit splitting of a user key between sstables. This
change simplifies the logic, removing 'improperly truncated range tombstones.'

TODO: The following explanation of range deletions ignores the
kind/trailer that appears at the end of keys after the sequence
number. This should be harmless, but we need to add a justification of why
it is harmless.

## Background and Notation

Range deletions are represented as `[start, end)#seqnum`. Points
(set/merge/...) are represented as `key#seqnum`. A range delete `[s, e)#n1`
deletes every point `k#n2` where `k \in [s, e)` and `n2 < n1`.
The inequality `n2 < n1` is strict to handle the case where a range delete and
a point have the same sequence number -- this happens during sstable
ingestion where the whole sstable is assigned a single sequence number
that applies to all the data in it.

There is additionally an infinity sequence number, represented as
`inf`, which is not used for any point, that we can use for reasoning
about range deletes.

It has been asked why range deletes use an exclusive end key instead
of an inclusive end key. For string keys, one can convert a desired
range delete on `[s, e]` into a range delete on `[s, ImmediateSuccessor(e))`.
For strings, the immediate successor of a key
is that key with a `\0` appended to it. However one cannot go in the
other direction: if one could represent only inclusive end keys in a
range delete and one desires to delete a range with an exclusive end
key `[s, e)#n`, one needs to compute `ImmediatePredecessor(e)` which
is an infinite length string. For example,
`ImmediatePredecessor("ab")` is `"aa\xff\xff...."`.
Additionally,
regardless of user needs, the exclusive end key helps with splitting a
range delete as we will see later.

We will sometimes use `ImmediatePredecessor` and `ImmediateSuccessor` in
the following for illustrating an idea, but we do not rely on them as
something that is viable to produce for a particular kind of key. And
even if viable, these functions are not currently provided to
RocksDB/Pebble.

### Visualization

If we consider a 2 dimensional space with increasing keys on the X
axis (with every possible user key represented) and increasing
sequence numbers on the Y axis, range deletes apply to a rectangle
whose bottom edge sits on the X axis.

The actual space represented by the ordering in our sstables is a one
dimensional space where `k1#n1` is less than `k2#n2` if either of the
following holds:

- k1 < k2

- k1 = k2 and n1 > n2 (under the assumption that no two points with
  the same key have the same sequence number).

```
     ^
     |   .       >       .        >      .        >      yy
     |   .      >        .       >       .       >       .
     |   .     >         .      >        .      >        .
n    |   V    >          xx    >         .     >         V
     |   .   >           x.   >          x.   >          .
     |   .  >            x.  >           x.  >           .
     |   . >             x. >            x. >            .
     |   .>              x.>             x.>             .
     ------------------------------------------------------------>
         k               IS(k)           IS(IS(k))
```

The above figure uses `.` to represent points and the X axis is dense in
that it represents all possible keys. `xx` represents the start of a
range delete and `x.` are the points which it deletes. The arrows `V` and
`>` represent the ordering of the points in the one dimensional space.
`IS` is shorthand for `ImmediateSuccessor` and the range delete represented
there is `[k, IS(IS(k)))#n`. Ignore `yy` for now.

The one dimensional space works fine in a world with only points. But
issues arise when storing range deletes, that represent an action in 2
dimensional space, into this one dimensional space.
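
The ordering and the deletion rule above can be illustrated with a small Python sketch. This is not Pebble code: the tuple representation of `key#seqnum` and the helper names `sort_key` and `deletes` are assumptions made for illustration only.

```python
import math

INF = math.inf  # stands in for the "inf" sequence number; never used by a point


def sort_key(key, seqnum):
    """Position of key#seqnum in the one dimensional sstable order."""
    # Keys ascend; for equal keys, larger sequence numbers sort first.
    return (key, -seqnum)


def deletes(start, end, n1, key, n2):
    """True if the range delete [start, end)#n1 deletes the point key#n2."""
    return start <= key < end and n2 < n1


# A hypothetical range delete ["b", "d")#10 and a few points.
points = [("c", 11), ("b", 9), ("d", 3), ("c", 5)]
ordered = sorted(points, key=lambda p: sort_key(*p))
deleted = [p for p in ordered if deletes("b", "d", 10, *p)]

print(ordered)
print(deleted)
```

In this example `"c"#11` sits between the two deleted points `"b"#9` and `"c"#5` in the one dimensional order but is not itself deleted, which is the mismatch between the two spaces that the rest of this document deals with.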

## Range Delete Boundaries and the Simplest World

RocksDB/Pebble store the inclusive bounds of each sstable in one dimensional
space. The range delete's two dimensional behavior and exclusive end key need
to be adapted to this requirement. For a range delete `[s, e)#n`,
the smallest key it acts on is `s#(n-1)` and the largest key it
acts on is `ImmediatePredecessor(e)#0`. So if we position the range delete
immediately before the smallest key it acts on and immediately after
the largest key it acts on we can give it a tight inclusive bound of
`[s#n, e#inf]`.

Note again that this range delete does not delete everything in its
inclusive bound. For example, range delete `["c", "h")#10` has a tight
inclusive bound of `["c"#10, "h"#inf]` but does not delete `"d"#11`
which lies in that bound. Going back to our earlier diagram, the one
dimensional inclusive bounds go from the `xx` to `yy` but there are
`.`s in between, in the one dimensional order, that are not deleted.

This is the reason why one cannot in general
use a range delete to seek over all points within its bounds. The one
exception to this seeking behaviour is that when we can order sstables
from new to old, one can "blindly" use this range delete in a newer
sstable to seek to `"h"` in all older sstables since we know those
older sstables must only have point keys with sequence numbers `< 10`
for the keys in interval `["c", "h")`. This partial order across
sstables exists in RocksDB/Pebble between memtable, L0 sstables (where
it is a total order) and across sstables in different levels.

Coming back to the inclusive bounds of the range delete, `[s#n, e#inf]`:
these bounds participate in deciding the bounds of the
sstable. In this world, one can read all the entries in an sstable and
compute its bounds.
However being able to construct these bounds by
reading an sstable is not essential -- RocksDB/Pebble store these
bounds in the `MANIFEST`. This latter fact has been exploited to
construct a real world (later section) where the bounds of an sstable
are not computable by reading all its keys.

If we had a system with one sstable per level, for each level lower
than L0, we are effectively done. We have represented the tight bounds
of each range delete and it is within the bounds of the sstable. This
works even with L0 => L0 compactions assuming they output exactly one
sstable.

## The Mostly Simple World

Here we have multiple files for levels lower than L0 that are non
overlapping in the file bounds. These multiple files occur because
compactions produce multiple files. This introduces the need to split a
range delete across the files being produced by a compaction.

There is a clean way to split a range delete `[s, e)#n` into 2 parts
(which can be recursively applied to split into arbitrarily many
parts): split into `[s, m)#n` and `[m, e)#n`. These range deletes
apply to non-overlapping points and their tight bounds are `[s#n,
m#inf]`, `[m#n, e#inf]` which are also non-overlapping.

Consider the following example of an input range delete `["c", "h")#10` and
the following two output files from a compaction:

```
         sst1                     sst2
last point is "e"#7  |  first point is "f"#20
```

The range delete can be split into `["c", "f")#10` and `["f",
"h")#10`, by using the first point key of sst2 as the split
point. Then the bounds of sst1 and sst2 will be `[..., "f"#inf]` and
`["f"#20, ...]` which are non-overlapping. It is still possible to compute
the sstable bounds by looking at all the entries in the sstable.
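
As a rough illustration of the clean split and the resulting tight bounds, here is a minimal Python sketch. The helpers `tight_bound`, `split`, and `before` are hypothetical names introduced here and do not correspond to Pebble functions; internal keys are modelled as `(key, seqnum)` tuples.

```python
import math

INF = math.inf  # stands in for the "inf" sequence number


def tight_bound(start, end, n):
    """Tight inclusive bound [start#n, end#inf] of the range delete [start, end)#n."""
    return ((start, n), (end, INF))


def split(start, end, n, m):
    """Split [start, end)#n at user key m into [start, m)#n and [m, end)#n."""
    if not start < m < end:
        raise ValueError("split key must lie strictly inside the range")
    return (start, m, n), (m, end, n)


def before(a, b):
    """True if internal key a sorts before b (keys ascending, seqnums descending)."""
    return (a[0], -a[1]) < (b[0], -b[1])


# Split ["c", "h")#10 at "f", the first point key of sst2 in the example.
left, right = split("c", "h", 10, "f")
left_bound = tight_bound(*left)    # ["c"#10, "f"#inf]
right_bound = tight_bound(*right)  # ["f"#10, "h"#inf]

# "f"#inf sorts before "f"#10, so the two tight bounds do not overlap.
print(before(left_bound[1], right_bound[0]))
```

The split works cleanly because the end key is exclusive: `m` can serve both as the end of the left part and the start of the right part without any point being covered twice or missed.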

## The Real World

Continuing with the same range delete `["c", "h")#10`, we can have the
following sstables produced during a compaction:

```
          sst1         sst2              sst3          sst4      sst5
points:  "e"#7  |  "f"#12 "f"#7  |  "f"#4 "f"#3  |  "f"#1  |  "g"#15
```

The range deletes written to these ssts are

```
     sst1              sst2              sst3              sst4              sst5
["c", "h")#10  |  ["f", "h")#10  |  ["f", "h")#10  |  ["f", "h")#10  |  ["g", "h")#10
```

The Pebble code that achieves this effect is in
`rangedel.Fragmenter`. It is a code structuring artifact that sst1
does not contain a range delete equal to `["c", "f")#10` and sst4 does
not contain `["f", "g")#10`. However for the range deletes in sst2 and
sst3 we cannot do any better because we don't know what the key
following "f" will be (the compaction cannot look ahead) and because
we don't have an `ImmediateSuccessor` function (otherwise we could
have written `["f", ImmediateSuccessor("f"))#10` to sst2, sst3). But
the code artifacts are not the ones introducing the real complexity.

The range delete bounds are

```
       sst1             sst2, sst3, sst4             sst5
["c"#10, "h"#inf]      ["f"#10, "h"#inf]      ["g"#10, "h"#inf]
```

We note the following:

- The bounds of range deletes are overlapping since we have been
  unable to split the range deletes. If these decide the sstable
  bounds, the sstables will have overlapping bounds. This is not
  permissible.

- The range deletes included in each sstable result in that sstable
  being "self-sufficient" wrt having the range delete that deletes
  some of the points in the sstable (let us assume that the points in
  this example have not been dropped from that sstable because of a
  snapshot).

- The transitions from sst1 to sst2 and sst4 to sst5 are **clean** in
  that we can pretend that the range deletes in those files are actually:

```
     sst1              sst2              sst3              sst4              sst5
["c", "f")#10  |  ["f", "g")#10  |  ["f", "g")#10  |  ["f", "g")#10  |  ["g", "h")#10
```

We could achieve some of these **clean** transitions (but not all) with a
code change. Also note that these better range deletes maintain the
"self-sufficient" property.

### Making Non-overlapping SSTable bounds

We force the sstable bounds to be non-overlapping by setting them to:

```
       sst1               sst2              sst3              sst4                sst5
["c"#10, "f"#inf]   ["f"#12, "f"#7]   ["f"#4, "f"#3]   ["f"#1, "g"#inf]   ["g"#15, "h"#inf]
```

Note that for sst1...sst4 the sstable bounds are smaller than the
bounds of the range deletes contained in them. The code that
accomplishes this in Pebble is in `compaction.go` -- we will not discuss the
details of that logic here but note that it is placing an `inf`
sequence number for a clean transition and for an unclean transition
it is using the point keys as the bounds.

Associated with these narrower bounds, we add the following
requirement: a range delete in an sstable must **act-within** the bounds of
the sstable it is contained in. In the above example:

- sst1: range delete `["c", "h")#10` must act-within the bound `["c"#10, "f"#inf]`

- sst2: range delete `["f", "h")#10` must act-within the bound `["f"#12, "f"#7]`

- sst3: range delete `["f", "h")#10` must act-within the bound `["f"#4, "f"#3]`

- sst4: range delete `["f", "h")#10` must act-within the bound `["f"#1, "g"#inf]`

- And so on.

The intuitive reason for the **act-within** requirement is that
sst5 can be compacted and moved down to a lower level independent of
sst1-sst4, since it was at a **clean** boundary.
We do not want the
range delete `["f", "h")#10` sitting in sst1...sst4 at the higher
level to act on `"g"#15` that has been moved to the lower level. Note
that this incorrect action can happen due to 2 reasons:

1. The invariant that lower levels have older data for keys
   that also exist in higher levels means we can (a) seek a lower level
   sstable to the end of a range delete from a higher level, (b) for a key
   lookup, stop searching in lower levels once a range delete is encountered
   for that key in a higher level.

2. Sequence number zeroing logic can change the sequence number of
   `"g"#15` to `"g"#0` (for better compression) once it realizes that
   there are no older versions of `"g"`. It would be incorrect for this
   `"g"#0` to be deleted.

#### Loss of Power

This act-within behavior introduces some "loss of power" for
the original range delete `["c", "h")#10`. By acting within sst2...sst4
it can no longer delete keys `"f"#6`, `"f"#5`, `"f"#2`.

Luckily for us, this is harmless since these keys cannot have existed
in the system due to the levelling behavior: we cannot be writing
sst2...sst4 to level `i` if versions of `"f"` younger than `"f"#4` are
already in level `i` or versions older than `"f"#7` have been left in
level `i - 1`. There is some trickery possible to prevent this "loss of
power" for queries (see the "Putting it together" section), but given
the history of bugs in this area, we should be cautious.

### Improperly truncated Range Deletes

We refer to range deletes that have experienced this "loss of power"
as **improper**. In the above example the range deletions in sst2, sst3, sst4
are improper.
The problem with improper range deletions occurs
when they need to participate in a future compaction: even though we
have restricted them to act-within their current sstable boundary, we
don't have a way of **"writing"** this restriction to a new sstable,
since they still need to be written in the `[s, e)#n` format.

For example, sst2 has delete `["f", "h")#10` that must act-within
the bound `["f"#12, "f"#7]`. If sst2 was compacted down to the next
level into a new sstable (whose bounds we cannot predict because they
depend on other data written to that sstable) we need to be able to
write a range delete entry that follows the original restriction. But
the narrowest we can write is `["f", ImmediateSuccessor("f"))#10`. This
is an expansion of the act-within restriction with potentially
unintended consequences. In this case the expansion happened in the suffix.
For sst4, the range deletion `["f", "h")#10` must act-within `["f"#1, "g"#inf]`,
and we can precisely represent the constraint on the suffix by writing
`["f", "g")#10` but it does not precisely represent that this range delete
should not apply to `"f"#9`...`"f"#2`.

In comparison, the sst1 range delete `["c", "h")#10` that must act-within
the bound `["c"#10, "f"#inf]` is not improper. This restriction can
be applied precisely to get a range delete `["c", "f")#10`.

The solution to this is to note that while individual sstables have
improper range deletes, if we look at a collection of sstables we
can restore the improper range deletes spread across them to their proper self
(and their full power). To accurately find these improper range
deletes would require looking into the contents of a file, which is
expensive.
But we can construct a pessimistic set based on
looking at the sequence of all files in a level and partitioning them:
adjacent files `f1`, `f2` with largest and smallest bound `k1#n1`,
`k2#n2` must be in the same partition if

```
k1 = k2 and n1 != inf
```

In the above example sst2, sst3, sst4 are one partition. The
**spanning bound** of this partition is `["f"#12, "g"#inf]` and the
range delete `["f", "h")#10` when constrained to act-within this
spanning bound is precisely the range delete `["f",
"g")#10`. Intuitively, the "loss of power" of this range delete has
been restored for the sake of making it proper, so it can be
accurately "written" in the output of the compaction (it may be
improperly fragmented again in the output, but we have already
discussed that). Such partitions are called "atomic compaction groups"
and must participate as a whole in a compaction (and a
compaction can use multiple atomic compaction groups as input).

Consider another example:

```
               sst1                  sst2
points:  "e"#12             |  "e"#10
delete:  ["c", "g")#8       |  ["c", "g")#8
bounds:  ["c"#8, "e"#12]    |  ["e"#10, "g"#inf]
```

sst1, sst2 are an atomic compaction group. Say we violated the
requirement that both be inputs in a compaction and only compacted
sst2 down to level `i + 1` and then down to level `i + 2`. Then we add
sst3 with bounds `["h"#10, "j"#5]` to level `i` and sst1 and sst3 are
compacted to level `i + 1` into a single sstable. This new sstable
will have bounds `["c"#8, "j"#5]` so these bounds do not help with the
original act-within constraint on `["c", "g")#8` (that it should
act-within `["c"#8, "e"#12]`). The narrowest we can construct (if we had
`ImmediateSuccessor`) would be `["c", ImmediateSuccessor("e"))#8`. Now we
can incorrectly apply this range delete that is in level `i + 1` to `"e"#10`
sitting in level `i + 2`.
Note that this example can be made worse using
sequence number zeroing -- `"e"#10` may have been rewritten to `"e"#0`.

If a range delete `[s, e)#n` is in an atomic compaction group with
spanning bounds `[k1#n1, k2#n2]` our construction above guarantees the
following properties:

- `k1#n1 <= s#n`, so the bounds do not constrain the start of the
  range delete.

- `k2 >= e` or `n2 = inf`, so if `k2` is constraining the range delete
  it will properly truncate the range delete.

#### New sstable at sequence number 0

A new sstable can be assigned sequence number 0 (and be written to L0)
if the keys in the sstable are not in any other sstable. This
comparison uses the keys and not key#seqnum, so the loss and
restoration of power does not cause problems since that occurs within
the versions of a single key.

#### Flawed optimizations

For the case where the atomic compaction group corresponds to the lower
level of a compaction, it may initially seem to be correct to use only
a prefix or suffix of that group in a compaction. In this case the
prefix (suffix) will correspond to the largest key (smallest key) in
the input sstables in the compaction and so can continue to constrain
the range delete. For example, sst1 and sst2 are in the same atomic
compaction group

```
               sst1                  sst2
points:  "c"#10 "e"#12      |  "e"#10
delete:  ["c", "g")#8       |  ["c", "g")#8
bounds:  ["c"#10, "e"#12]   |  ["e"#10, "g"#inf]
```

and this is the lower level of a compaction with

```
               sst3
points:  "a"#14 "d"#15
bounds:  ["a"#14, "d"#15]
```

we could allow for a compaction involving sst1 and sst3 which would produce

```
               sst4
points:  "a"#14 "c"#10 "d"#15 "e"#12
delete:  ["c", "g")#8
bounds:  ["a"#14, "e"#12]
```

and the range delete is still improper but its act-within constraint has
not expanded.

But we have to be very careful to not have a more significant loss of power
of this range delete. Consider a situation where sst3 had a single delete
`"e"#16`. It still does not overlap in bounds with sst2 and we again pick
sst1 and sst3 for compaction. This single delete will cause `"e"#12` to be deleted
and sst4 bounds would be (unless we had complicated code preventing it):

```
               sst4
points:  "a"#14 "c"#10 "d"#15
delete:  ["c", "g")#8
bounds:  ["a"#14, "d"#15]
```

Now this delete cannot delete `"dd"#6` and we have lost the ability to know
that sst4 and sst2 are in the same atomic compaction group.

### Putting it together

Summarizing the above, we have:

- SSTable bounds logic that ensures sstables are not
  overlapping. These sstables contain range deletes that extend outside
  these bounds. But these range deletes should **act-within** the
  sstable bounds.

- Compactions: they need to constrain the range deletes in the inputs
  to **act-within**, but this can create problems with **writing** the
  **improper** range deletes. The solution is to include the full
  **atomic compaction group** in a compaction so we can restore the
  **improper** range deletes to their **proper** self and then apply the
  constraints of the atomic compaction group.

- Queries: We need to act-within the file bound constraint on the range delete.
  Say the range delete is `[s, e)#n` and the file bound is `[b1#n1,
  b2#n2]`. We are guaranteed that `b1#n1 <= s#n` so the only
  constraint can come from `b2#n2`.

  - Deciding whether a range delete covers a key in the same or lower levels.

    - `b2 >= e`: there is no act-within constraint.
    - `b2 < e`: to be precise we cannot let it delete `b2#(n2-1)` or
      later keys. But it is likely that allowing it to delete up to
      `b2#0` would be ok due to the atomic compaction group.
      This
      would prevent the so-called "loss of power" discussed earlier if
      one also includes the argument that the gap in the file bounds
      that also represents the loss of power is harmless (the gap
      exists within versions of key, and anyone doing a query for that
      key will start from the sstable to the left of the gap). But it
      may be better to be cautious.

  - For using the range delete to seek sstables at lower levels.
    - `b2 >= e`: seek to `e` since there is no act-within constraint.
    - `b2 < e`: seek to `b2`. We are ignoring that this range delete
      is allowed to delete some versions of `b2` since this is just a
      performance optimization.
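
The compaction and query rules summarized above can be sketched in Python as follows. This is an illustrative model only, not Pebble's implementation: `atomic_groups`, `covers`, and `seek_target` are hypothetical helpers, internal keys are `(key, seqnum)` tuples, and `covers` implements the cautious variant (nothing after `b2#n2` is deleted). Following the Queries bullet, it assumes only the largest file bound can constrain the range delete.

```python
import math

INF = math.inf  # stands in for the "inf" sequence number


def atomic_groups(bounds):
    """Partition a level's files, given as (smallest, largest) internal-key
    bounds in level order, into atomic compaction groups of file indexes."""
    groups = [[0]] if bounds else []
    for i in range(1, len(bounds)):
        (k1, n1), (k2, _n2) = bounds[i - 1][1], bounds[i][0]
        if k1 == k2 and n1 != INF:
            groups[-1].append(i)  # unclean transition: same group
        else:
            groups.append([i])    # clean transition: new group
    return groups


def covers(rd, largest, point):
    """True if range delete rd = (s, e, n), acting within a bound whose
    largest key is b2#n2, deletes point = (k, m)."""
    (s, e, n), (b2, n2), (k, m) = rd, largest, point
    if not (s <= k < e and m < n):
        return False
    if b2 >= e:
        return True  # no act-within constraint
    return (k, -m) <= (b2, -n2)  # the point must not sort after b2#n2


def seek_target(rd, largest):
    """User key to which lower-level sstables can be sought."""
    e, b2 = rd[1], largest[0]
    return e if b2 >= e else b2


# File bounds of sst1...sst5 from the "Real World" example.
level = [(("c", 10), ("f", INF)), (("f", 12), ("f", 7)), (("f", 4), ("f", 3)),
         (("f", 1), ("g", INF)), (("g", 15), ("h", INF))]
groups = atomic_groups(level)

rd = ("f", "h", 10)
in_sst2 = covers(rd, level[1][1], ("f", 6))   # constrained by "f"#7
in_group = covers(rd, level[3][1], ("f", 6))  # spanning bound ends at "g"#inf
print(groups, in_sst2, in_group, seek_target(("c", "h", 10), level[0][1]))
```

With the bounds of sst2 alone, `["f", "h")#10` cannot reach `"f"#6`; using the largest key of the spanning bound of the group sst2...sst4 restores that power while still keeping the delete away from keys at or after `"g"`.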