- Feature Name: Range Keys
- Status: draft
- Start Date: 2021-10-18
- Authors: Sumeer Bhola, Jackson Owens
- RFC PR: #1341
- Pebble Issues:
  https://github.com/cockroachdb/pebble/issues/1339
- Cockroach Issues:
  https://github.com/cockroachdb/cockroach/issues/70429
  https://github.com/cockroachdb/cockroach/issues/70412

**Design Draft**

# Summary

An ongoing effort within CockroachDB to preserve MVCC history across all SQL
operations (see cockroachdb/cockroach#69380) requires a more efficient method of
deleting ranges of MVCC history.

This document describes an extension to Pebble introducing first-class support
for range keys. Range keys map a range of keyspace to a value. Optionally, the
key range may include a suffix encoding a version (eg, MVCC timestamp). Pebble
iterators may be configured to surface range keys during iteration, or to mask
point keys at lower MVCC timestamps covered by range keys.

CockroachDB will make use of these range keys to enable history-preserving
removal of contiguous ranges of MVCC keys with constant writes, and efficient
iteration past deleted versions.

# Background

A previous CockroachDB RFC cockroachdb/cockroach#69380 describes the motivation
for the larger project of migrating MVCC-noncompliant operations into MVCC
compliance. Implemented with the existing MVCC primitives, some operations like
removal of an index or table would require performing writes linearly
proportional to the size of the table. Dropping a large table using existing
MVCC point-delete primitives would be prohibitively expensive. The desire for a
sublinear delete of an MVCC range motivates this work.

The detailed design for MVCC compliant bulk operations ([high-level
description](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210825_mvcc_bulk_ops.md);
detailed design draft for DeleteRange in internal
[doc](https://docs.google.com/document/d/1ItxpitNwuaEnwv95RJORLCGuOczuS2y_GoM2ckJCnFs/edit#heading=h.x6oktstoeb9t)),
ran into complexity by placing range operations above the Pebble layer, such
that Pebble sees these as points. The causes of this complexity are various:
(a) which key (start or end) to anchor this range on, when represented as a
point (there are performance consequences), (b) rewriting on CockroachDB range
splits (and concerns about rewrite volume), (c) fragmentation on writes and
complexity thereof (and performance concerns for reads when not fragmenting),
(d) inability to efficiently skip older MVCC versions that are masked by a
`[k1,k2)@ts` (where ts is the MVCC timestamp).

Pebble currently has only one kind of key that is associated with a range:
`RANGEDEL [k1, k2)#seq`, where [k1, k2) is supplied by the caller, and is used
to efficiently remove a set of point keys.

First-class support for range keys in Pebble eliminates all these issues.
Additionally, it allows for future extensions like efficient transactional range
operations. This issue describes how this feature would work from the
perspective of a user of Pebble (like CockroachDB), and sketches some
implementation details.

# Design

## Interface

### New `Comparer` requirements

The Pebble `Comparer` type allows users to optionally specify a `Split` function
that splits a user key into a prefix and a suffix. This Split allows users
implementing MVCC (Multi-Version Concurrency Control) to inform Pebble which
part of the key encodes the user key and which part of the key encodes the
version (eg, a timestamp).
Pebble does not dictate the encoding of an MVCC
version, only that the version form a suffix on keys.

The range keys design described in this RFC introduces stricter requirements for
user-provided `Split` implementations and the ordering of keys:

1. The user key consisting of just a key prefix `k` must sort before all
   other user keys containing that prefix. Specifically
   `Compare(k[:Split(k)], k) < 0` where `Split(k) < len(k)`.
2. A key consisting of a bare suffix must be a valid key and comparable. The
   ordering of the empty key prefix with any suffixes must be consistent with
   the ordering of those same suffixes applied to any other key prefix.
   Specifically `Compare(k[Split(k):], k2[Split(k2):]) == Compare(k, k2)` where
   `Compare(k[:Split(k)], k2[:Split(k2)]) == 0`.

The details of why these new requirements are necessary are explained in the
implementation section.

### Writes

This design introduces three new write operations:

- `RangeKeySet([k1, k2), [optional suffix], <value>)`: This represents the
  mapping `[k1, k2)@suffix => value`. Keys `k1` and `k2` must not contain a
  suffix (i.e., `Split(k1)==len(k1)` and `Split(k2)==len(k2)`).

- `RangeKeyUnset([k1, k2), [optional suffix])`: This removes a mapping
  previously applied by `RangeKeySet`. The unset may use a smaller key range
  than the original `RangeKeySet`, in which case only part of the range is
  deleted. The unset only applies to range keys with a matching optional suffix.
  If the optional suffix is absent in both the RangeKeySet and RangeKeyUnset,
  they are considered matching.

- `RangeKeyDelete([k1, k2))`: This removes all range keys within the provided
  key span. It behaves like an `Unset` unencumbered by suffix restrictions.

For example, consider `RangeKeySet([a,d), foo)` (i.e., no suffix).
If there is a later call `RangeKeyUnset([b,c))`, the resulting state seen by
a reader is `[a,b) => foo`, `[c,d) => foo`. Note that the value is not
modified when the key is fragmented.

Partially overlapping `RangeKeySet`s with the same suffix overwrite one
another. For example, consider `RangeKeySet([a,d), foo)`, followed by
`RangeKeySet([c,e), bar)`. The resulting state is `[a,c) => foo`, `[c,e)
=> bar`.

Point keys (eg, traditional keys defined at a singular byte string key) and
range keys do not overwrite one another. They have a parallel existence. Point
deletes only apply to points. Range unsets only apply to range keys. However,
users may configure iterators to mask point keys covered by newer range keys.
This masking behavior is explicitly requested by the user in the context of the
iteration. Masking is described in more detail below.

There exist separate range delete operations for point keys and range keys. A
`RangeKeyDelete` may remove part of a range key, just like the new
`RangeKeyUnset` operation introduced earlier. `RangeKeyDelete`s differ from
`RangeKeyUnset`s, because the latter requires that the suffix matches and
applies only to range keys. `RangeKeyDelete`s completely clear all existing
range keys within their span at all suffix values.

The optional suffix in `RangeKeySet` and `RangeKeyUnset` operations is related
to the pebble `Comparer.Split` operation which is explicitly documented as being
for [MVCC
keys](https://github.com/cockroachdb/pebble/blob/e95e73745ce8a85d605ef311d29a6574db8ed3bf/internal/base/comparer.go#L69-L88),
without mandating exactly how the versions are represented. `RangeKeySet` and
`RangeKeyUnset` keys with different suffixes do not interact logically, although
Pebble will observably fragment ranges at intersection points.

### Iteration

A user iterating over a key interval [k1,k2) can request:

- **[I1]** An iterator over only point keys.

- **[I2]** A combined iterator over point and range keys. This is what
  we mainly discuss below in the implementation discussion.

- **[I3]** An iterator over only range keys. In the CockroachDB use
  case, range keys will need to be subject to MVCC GC just like
  point keys; this iterator may be useful for that purpose.

The `pebble.Iterator` type will be extended to provide accessors for
range keys for use in the combined and exclusively range iteration
modes.

```
// HasPointAndRange indicates whether there exists a point key, a range key or
// both at the current iterator position.
HasPointAndRange() (hasPoint, hasRange bool)

// RangeKeyChanged indicates whether the most recent iterator positioning
// operation resulted in the iterator stepping into or out of a new range key.
// If true, previously returned range key bounds and data have been invalidated.
// If false, previously obtained range key bounds, suffix and value slices are
// still valid and may continue to be read.
RangeKeyChanged() bool

// Key returns the key of the current key/value pair, or nil if done. If
// positioned at an iterator position that only holds a range key, Key()
// always returns the start bound of the range key. Otherwise, it returns
// the point key's key.
Key() []byte

// RangeBounds returns the start (inclusive) and end (exclusive) bounds of the
// range key covering the current iterator position. RangeBounds returns nil
// bounds if there is no range key covering the current iterator position, or
// the iterator is not configured to surface range keys.
//
// If valid, the returned start bound is less than or equal to Key() and the
// returned end bound is greater than Key().
RangeBounds() (start, end []byte)

// Value returns the value of the current key/value pair, or nil if done.
// The caller should not modify the contents of the returned slice, and
// its contents may change on the next call to Next.
//
// Only valid if HasPointAndRange() returns true for hasPoint.
Value() []byte

// RangeKeys returns the range key values and their suffixes covering the
// current iterator position. The range bounds may be retrieved separately
// through RangeBounds().
RangeKeys() []RangeKey

type RangeKey struct {
	Suffix []byte
	Value  []byte
}
```

When a combined iterator exposes range keys, it exposes all the range
keys covering `Key`. During iteration with a combined iterator, an
iteration position may surface just a point key, just a range key or
both at the currently-positioned `Key`.

Described another way, a Pebble combined iterator guarantees that it
will stop at all positions within the keyspace where:
1. There exists a point key at that position.
2. There exists a range key that logically begins at that position.

In addition to the above positions, a Pebble iterator may also stop at keys
in-between the above positions due to fragmentation. Range keys are defined over
continuous spans of keyspace. Range keys with different suffix values may
overlap each other arbitrarily. To surface these arbitrarily overlapping spans
in an understandable and efficient way, the Pebble iterator surfaces range keys
fragmented at intersection points.
Consider the following sequence of writes:

```
  RangeKeySet([a,z), @1, 'apple')
  RangeKeySet([c,e), @3, 'banana')
  RangeKeySet([e,m), @5, 'orange')
  RangeKeySet([b,k), @7, 'kiwi')
```

This yields a database containing overlapping range keys:
```
 @7 →  kiwi        |-----------------)
 @5 →  orange            |---------------)
 @3 →  banana        |---)
 @1 →  apple     |-------------------------------------------------)
                 a b c d e f g h i j k l m n o p q r s t u v w x y z
```

During iteration, these range keys are surfaced using the bounds of their
intersection points. For example, a scan across the keyspace containing only
these range keys would observe the following iterator positions:

```
  Key() = a   RangeBounds() = [a,b)   RangeKeys() = {(@1,apple)}
  Key() = b   RangeBounds() = [b,c)   RangeKeys() = {(@7,kiwi), (@1,apple)}
  Key() = c   RangeBounds() = [c,e)   RangeKeys() = {(@7,kiwi), (@3,banana), (@1,apple)}
  Key() = e   RangeBounds() = [e,k)   RangeKeys() = {(@7,kiwi), (@5,orange), (@1,apple)}
  Key() = k   RangeBounds() = [k,m)   RangeKeys() = {(@5,orange), (@1,apple)}
  Key() = m   RangeBounds() = [m,z)   RangeKeys() = {(@1,apple)}
```

This fragmentation produces a more understandable interface, and avoids forcing
iterators to read all range keys within the bounds of the broadest range key.
Consider this example:

```
                   iterator pos      [ ] - sstable bounds
                         |
L1:         [a----v1@t2--|-h]     [l-----unset@t1----u]
L2:                 [e---|------v1@t1----------r]
             a b c d e f g h i j k l m n o p q r s t u v w x y z
```

If the iterator is positioned at a point key `g`, there are two overlapping
physical range keys: `[a,h)@t2→v1` and `[e,r)@t1→v1`.

However, the `RangeKeyUnset([l,u), @t1)` removes part of the `[e,r)@t1→v1` range
key, truncating it to the bounds `[e,l)`. The iterator must return the truncated
bounds that correctly respect the `RangeKeyUnset`.
However, when the range keys
are stored within a log-structured merge tree like Pebble, the `RangeKeyUnset`
may not be contained within the level's sstable that overlaps the current point
key. Searching for the unset could require reading an unbounded number of
sstables, losing the log-structured merge tree's property that bounds read
amplification to the number of levels in the tree.

Fragmenting range keys to intersection points avoids this problem. The iterator
positioned at `g` only surfaces range key state with the bounds `[e,h)`, the
widest bounds in which it can guarantee t2→v1 and t1→v1 without loading
additional sstables.

#### Iteration order

Recall that the user-provided `Comparer.Split(k)` function divides all user keys
into a prefix and a suffix, such that the prefix is `k[:Split(k)]`, and the
suffix is `k[Split(k):]`. If a key does not contain a suffix, the key equals the
prefix.

An iterator that is configured to surface range keys alongside point keys will
surface all range keys covering the current `Key()` position. Revisiting an
earlier example with the addition of three new point key-value pairs:
a→artichoke, b@2→beet and t@3→turnip. Consider '@<number>' to form the suffix
where present, with `<number>` denoting a MVCC timestamp. Higher, more-recent
timestamps sort before lower, older timestamps.

```
                 .                                             a → artichoke
 @7 →  kiwi        |-----------------)
 @5 →  orange            |---------------)
                   . b@2                                       b@2 → beet
 @3 →  banana        |---)                             . t@3   t@3 → turnip
 @1 →  apple     |-------------------------------------------------)
                 a b c d e f g h i j k l m n o p q r s t u v w x y z
```

An iterator configured to surface both point and range keys will visit the
following iterator positions during forward iteration:

```
  Key()   HasPointAndRange()   Value()     RangeBounds()   RangeKeys()
  a       (true, true)         artichoke   [a,b)           {(@1,apple)}
  b       (false, true)        -           [b,c)           {(@7,kiwi), (@1,apple)}
  b@2     (true, true)         beet        [b,c)           {(@7,kiwi), (@1,apple)}
  c       (false, true)        -           [c,e)           {(@7,kiwi), (@3,banana), (@1,apple)}
  e       (false, true)        -           [e,k)           {(@7,kiwi), (@5,orange), (@1,apple)}
  k       (false, true)        -           [k,m)           {(@5,orange), (@1,apple)}
  m       (false, true)        -           [m,z)           {(@1,apple)}
  t@3     (true, true)         turnip      [m,z)           {(@1,apple)}
```

Note that:

- While positioned over a point key (eg, Key() = 'a', 'b@2' or 't@3'), the
  iterator exposes both the point key's value through Value() and the
  overlapping range keys' values through `RangeKeys()`.

- There can be multiple range keys covering a `Key()`, each with a different
  suffix.

- There cannot be multiple range keys covering a `Key()` with the same suffix,
  since the most-recently committed one (eg, the one with the highest sequence
  number) will win, just like for point keys.

- If the iterator has configured lower and/or upper bounds, they will truncate
  the range key to those bounds. For example, if the above iterator had an upper
  bound 'y', the `[m,z)` range key would be surfaced with the bounds `[m,y)`
  instead.

#### Masking

Range key masking provides additional, optional functionality designed
specifically for the use case of implementing a MVCC-compatible delete range.

When constructing an iterator that iterates over both point and range keys, a
user may request that range keys mask point keys.
Masking is configured with a
suffix parameter that determines which range keys may mask point keys. Only
range keys with suffixes that sort at or after the mask's suffix mask point
keys. A range key that meets this condition only masks points with suffixes
that sort after the range key's suffix.

```
type IterOptions struct {
	// ...
	RangeKeyMasking RangeKeyMasking
}

// RangeKeyMasking configures automatic hiding of point keys by range keys.
// A non-nil Suffix enables range-key masking. When enabled, range keys with
// suffixes ≥ Suffix behave as masks. All point keys that are contained within
// a masking range key's bounds and have suffixes greater than the range key's
// suffix are automatically skipped.
//
// Specifically, when configured with a RangeKeyMasking.Suffix _s_, and there
// exists a range key with suffix _r_ covering a point key with suffix _p_, and
//
//     _s_ ≤ _r_ < _p_
//
// then the point key is elided.
//
// Range-key masking may only be used when iterating over both point keys and
// range keys.
type RangeKeyMasking struct {
	// Suffix configures which range keys may mask point keys. Only range keys
	// that are defined at suffixes greater than or equal to Suffix will mask
	// point keys.
	Suffix []byte
	// Filter is an optional field that may be used to improve performance of
	// range-key masking through a block-property filter defined over key
	// suffixes. If non-nil, Filter is called by Pebble to construct a
	// block-property filter mask at iterator creation. The filter is used to
	// skip whole point-key blocks containing point keys with suffixes greater
	// than a covering range-key's suffix.
	//
	// To use this functionality, the caller must create and configure (through
	// Options.BlockPropertyCollectors) a block-property collector that records
	// the maximum suffix contained within a block. The caller then must write
	// and provide a BlockPropertyFilterMask implementation on that same
	// property. See the BlockPropertyFilterMask type for more information.
	Filter func() BlockPropertyFilterMask
}
```

Example: A user may construct an iterator with `RangeKeyMasking.Suffix` set to
`@50`. The range key `[a, c)@60` would mask nothing, because `@60` is a more
recent timestamp than `@50`. However a range key `[a,c)@30` would mask `a@20`
and `apple@10` but not `apple@40`. A range key can only mask keys with MVCC
timestamps older than the range key's own timestamp. Only range keys with
suffixes (eg, MVCC timestamps) may mask anything at all.

The pebble Iterator surfaces all range keys when masking is enabled. Only point
keys are ever skipped, and only when they are contained within the bounds of a
range key with a more-recent suffix, and the range key's suffix is not more
recent than the timestamp encoded in `RangeKeyMasking.Suffix`.

## Implementation

### Write operations

This design introduces three new Pebble write operations: `RangeKeySet`,
`RangeKeyUnset` and `RangeKeyDelete`. Internally, these operations are
represented as internal keys with new corresponding key kinds encoded as a part
of the key trailer. These keys are stored within special range key blocks
separate from point keys, but within the same sstable. The range key blocks hold
`RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys, but do not hold keys
of any other kind. Within the memtables, these range keys are stored in a
separate skip list.

- `RangeKeySet([k1,k2), @suffix, value)` is encoded as a `k1.RANGEKEYSET` key
  with a value encoding the tuple `(k2,@suffix,value)`.
- `RangeKeyUnset([k1,k2), @suffix)` is encoded as a `k1.RANGEKEYUNSET` key
  with a value encoding the tuple `(k2,@suffix)`.
- `RangeKeyDelete([k1,k2))` is encoded as a `k1.RANGEKEYDELETE` key with a value
  encoding `k2`.

Range keys are physically fragmented as an artifact of the log-structured merge
tree structure and internal sstable boundaries. This fragmentation is essential
for preserving the performance characteristics of a log-structured merge tree.
Although the public interface operations for `RangeKeySet` and `RangeKeyUnset`
require both boundary keys `[k1,k2)` to always be bare prefixes (eg, to not have
a suffix), internally these keys may be fragmented to bounds containing
suffixes.

Example: If a user attempts to write `RangeKeySet([a@v1, c@v2), @v3, value)`,
Pebble will return an error to the user. If a user writes `RangeKeySet([a, c),
@v3, value)`, Pebble will allow the write and may later internally fragment the
`RangeKeySet` into three internal keys:
  - `RangeKeySet([a, a@v1), @v3, value)`
  - `RangeKeySet([a@v1, c@v2), @v3, value)`
  - `RangeKeySet([c@v2, c), @v3, value)`

This fragmentation preserves log-structured merge tree performance
characteristics because it allows a range key to be split across many sstables,
while preserving locality between range keys and point keys. Consider a
`RangeKeySet([a,z), @1, foo)` on a database that contains millions of point keys
in the range [a,z). If the [a,z) range key was not permitted to be fragmented
internally, it would either need to be stored completely separately from the
point keys in a separate sstable or in a single intractably large sstable
containing all the overlapping point keys. Fragmentation allows locality,
ensuring point keys and range keys in the same region of the keyspace can be
stored in the same sstable.

`RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys are assigned sequence
numbers, like other internal keys.
Log-structured merge tree level invariants
are valid across range keys, point keys and between the two. That is:

  1. The point key `k1#s2` cannot be at a lower level than `k2#s1` where
     `k1==k2` and `s1 < s2`. This is the invariant implemented by all LSMs.
  2. `RangeKeySet([k1,k2))#s2` cannot be at a lower level than
     `RangeKeySet([k3,k4))#s1` where `[k1,k2)` overlaps `[k3,k4)` and `s1 < s2`.
  3. `RangeKeySet([k1,k2))#s2` cannot be at a lower level than a point key
     `k3#s1` where `k3 \in [k1,k2)` and `s1 < s2`.

Like other tombstones, the `RangeKeyUnset` and `RangeKeyDelete` keys are elided
when they fall to the bottommost level of the LSM and there is no snapshot
preventing their elision. There is no additional garbage collection problem
introduced by these keys.

There is no Merge operation that affects range keys.

#### Physical representation

`RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys are keyed by their
start key. This poses an obstacle. We must be able to support multiple range
keys at the same sequence number, because all keys within an ingested sstable
adopt the same sequence number. Duplicate internal keys (keys with equal user
keys, sequence numbers and kinds) are prohibited within Pebble. To resolve this
issue, fragments with the same bounds are merged within snapshot stripes into a
single physical key-value, representing multiple logical key-value pairs:

```
k1.RangeKeySet#s2 → (k2,[(@t2,v2),(@t1,v1)])
```

Within a physical key-value pair, suffix-value pairs are stored sorted by
suffix, descending. This has a minor advantage of reducing iteration-time
user-key comparisons when there exist multiple range keys in a table.

Unlike other Pebble keys, the `RangeKeySet` and `RangeKeyUnset` keys have values
that encode fields of data known to Pebble.
The value that the user sets in a
call to `RangeKeySet` is opaque to Pebble, but the physical representation of
the `RangeKeySet`'s value is known. This encoding is a sequence of fields:

* End key, `varstring`, encodes the end user key of the fragment.
* A series of (suffix, value) tuples representing the logical range keys that
  were merged into this one physical `RangeKeySet` key:
  * Suffix, `varstring`
  * Value, `varstring`

Similarly, `RangeKeyUnset` keys are merged within snapshot stripes and have a
physical representation like:

```
k1.RangeKeyUnset#s2 → (k2,[(@t2),(@t1)])
```

A `RangeKeyUnset` key's value is encoded as:
* End key, `varstring`, encodes the end user key of the fragment.
* A series of suffix `varstring`s.

When `RangeKeySet` and `RangeKeyUnset` fragments with identical bounds meet
within the same snapshot stripe within a compaction, any of the
`RangeKeyUnset`'s suffixes that exist within the `RangeKeySet` key are removed.

A `RangeKeyDelete` key has no additional data beyond its end key, which is
encoded directly in the value.

NB: `RangeKeySet` and `RangeKeyUnset` keys are not merged within batches or the
memtable. That's okay, because batches are append-only and indexed batches will
refragment and merge the range keys on-demand. In the memtable, every key is
guaranteed to have a unique sequence number.

### Sequence numbers

Like all Pebble keys, `RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` are
assigned sequence numbers when committed. As described above, overlapping
`RangeKeySet`s and `RangeKeyUnset`s are fragmented to have matching start and
end bounds. Then the resulting exactly-overlapping range key fragments are
merged into a single internal key-value pair, within the same snapshot stripe
and sstable.
The original, unmerged internal keys each have their own sequence
numbers, indicating the moment they were committed within the history of all
write operations.

Recall that sequence numbers are used within Pebble to determine which keys
appear live to which iterators. When an iterator is constructed, it takes note
of the current _visible sequence number_, and for the lifetime of the iterator,
only surfaces keys with sequence numbers less than that sequence number.
Similarly, snapshots read the current _visible sequence number_, remember it,
but also leave a note asking compactions to preserve history at that sequence
number. The space between snapshotted sequence numbers is referred to as a
_snapshot stripe_, and operations cannot drop or otherwise mutate keys unless
they fall within the same _snapshot stripe_. For example a `k.MERGE#5` key may
not be merged with a `k.MERGE#1` operation if there's an open snapshot at `#3`.

The new `RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys behave
similarly. Overlapping range keys won't be merged if there's an open snapshot
separating them. Consider a range key `a-z` written at sequence number `#1` and
a point key `d.SET#2`. A combined point-and-range iterator using a sequence
number `#3` and positioned at `d` will surface both the range key `a-z` and the
point key `d`.

In the context of masking, the suffix-based masking of range keys can cause
potentially unexpected behavior. A range key `[a,z)@10` may be committed as
sequence number `#1`. Afterwards, a point key `d@5#2` may be committed. An
iterator that is configured with range-key masking with suffix `@20` would mask
the point key `d@5#2` because although `d@5#2`'s sequence number is higher,
range-key masking uses suffixes to impose order, not sequence numbers.
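
The suffix-only masking rule `s ≤ r < p` can be written down as a small
predicate. The sketch below is illustrative only: `ts`, `compareSuffix` and
`masked` are hypothetical helpers, suffixes are assumed to have the form
`@<number>`, and the ordering places the empty suffix first and higher
timestamps before lower ones, as in the examples of this RFC. Sequence numbers
deliberately do not appear anywhere in it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ts parses a suffix of the assumed form "@<n>" into its timestamp.
func ts(suffix string) int {
	n, err := strconv.Atoi(strings.TrimPrefix(suffix, "@"))
	if err != nil {
		panic("malformed suffix: " + suffix)
	}
	return n
}

// compareSuffix orders suffixes the way this RFC's examples do: the empty
// suffix sorts first, and higher (more recent) timestamps sort before lower
// (older) ones.
func compareSuffix(a, b string) int {
	switch {
	case a == b:
		return 0
	case a == "":
		return -1
	case b == "":
		return 1
	}
	ta, tb := ts(a), ts(b)
	switch {
	case ta > tb:
		return -1
	case ta < tb:
		return 1
	}
	return 0
}

// masked reports whether a point key with suffix point, already known to lie
// within the bounds of a range key with suffix rangeKey, is hidden when the
// iterator's masking suffix is mask. It implements mask <= rangeKey < point
// in suffix order. An empty mask means masking is disabled, and range keys
// without a suffix never mask anything.
func masked(mask, rangeKey, point string) bool {
	if mask == "" || rangeKey == "" {
		return false
	}
	return compareSuffix(mask, rangeKey) <= 0 && compareSuffix(rangeKey, point) < 0
}

func main() {
	// Masking suffix @50, as in the example from the Masking section.
	fmt.Println(masked("@50", "@60", "@20")) // range key too recent to act as a mask
	fmt.Println(masked("@50", "@30", "@20")) // older point under a masking range key
	fmt.Println(masked("@50", "@30", "@40")) // point newer than the range key
	// The sequence-number example: [a,z)@10 masks d@5 under mask @20.
	fmt.Println(masked("@20", "@10", "@5"))
}
```

Because the predicate takes only suffixes, the point key `d@5#2` from the
example above is masked by `[a,z)@10#1` regardless of which was committed first.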

### Boundaries for sstables

Range keys follow the same relationship to sstable boundaries as the existing
`RANGEDEL` tombstones. The bounds of an internal range key are user keys. Every
range key is limited by its containing sstable's bounds.

Consider these keys, annotated with sequence numbers:

```
Point keys: a#50, b#70, b#49, b#48, c#47, d#46, e#45, f#44
Range key: [a,e)#60
```

We have created three versions of `b` in this example. In previous versions,
Pebble could split output sstables during a compaction such that the different
`b` versions span more than one sstable. This creates problems for `RANGEDEL`s
which span these two sstables which are discussed in the section on [improperly
truncated RANGEDELS](https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md#improperly-truncated-range-deletes).
We manage to tolerate this for `RANGEDEL`s since their semantics are defined by
the system, which is not true for these range keys where the actual semantics
are up to the user.

Pebble now disallows such sstable split points. In this example, by postponing
the sstable split point to the user key c, we can cleanly split the range key
into `[a,c)#60` and `[c,e)#60`. The sstable end bound for the first sstable
(sstable bounds are inclusive) will be c#inf (where inf is the largest possible
seqnum, which is unused except for these cases), and sstable start bound for the
second sstable will be c#60.

The above example deals exclusively with point and range keys without suffixes.
Consider this example with suffixed keys, and compaction outputs split in the
middle of the `b` prefix:

```
first sstable: points: a@100, a@30, b@100, b@40 ranges: [a,c)@50
second sstable: points: b@30, c@40, d@40, e@30, ranges: [c,e)@50
```

When the compaction code decides to defer `b@30` to the next sstable and finish
the first sstable, the range key `[a,c)@50` is sitting in the fragmenter. The
compaction must split the range key at the bounds determined by the user key.
The compaction uses the first point key of the next sstable, in this case
`b@30`, to truncate the range key. The compaction flushes the fragment
`[a,b@30)@50` to the first sstable and updates the existing fragment to begin at
`b@30`.

If a range key extends into the next file, the range key's truncated end is used
for the purposes of determining the sstable end boundary. The first sstable's
end boundary becomes `b@30#inf`, signifying the range key does not cover `b@30`.
The second sstable's start boundary is `b@30`.

### Block property collectors

Separate block property collectors may be configured to collect separate
properties about range keys. This is necessary for CockroachDB's MVCC block
property collectors to ensure the sstable-level properties are correct.

### Iteration

This design extends the `*pebble.Iterator` with the ability to iterate over
exclusively range keys, range keys and point keys together or exclusively point
keys (the previous behavior).

- Pebble already requires that the prefix `k` follows the same key validity
  rules as `k@suffix`.

- Previously, Pebble did not require that a user key consisting of just a prefix
  `k` sort before the same prefix with a non-empty suffix.
  CockroachDB has
  adopted this behavior since it results in the following clean behavior:
  `RANGEDEL` over [k1, k2) deletes all versioned keys which have prefixes in the
  interval [k1, k2). Pebble will now require this behavior for all users using
  MVCC keys. Specifically, it must hold that `Compare(k[:Split(k)], k) < 0` if
  `Split(k) < len(k)`.

# TKTK: Discuss merging iterator

#### Determinism

Range keys will be split based on boundaries of sstables in an LSM. Users of an
LSM typically expect that two different LSMs with different sstable settings
that receive the same writes should output the same key-value pairs when
iterating. To provide this behavior, the iterator implementation may be
configured to defragment range keys during iteration time. The defragmentation
behavior would be:

- Two visible ranges `[k1,k2)@suffix1=>val1`, `[k2,k3)@suffix2=>val2` are
  defragmented if suffix1==suffix2 and val1==val2, and become [k1,k3).

- Defragmentation during user iteration does not consider the sequence number.
  This is necessary since LSM state can be exported to another LSM via the use
  of sstable ingestion, which can collapse different seqnums to the same seqnum.
  We would like both LSMs to look identical to the user when iterating.

The above defragmentation is conceptually simple, but hard to implement
efficiently, since it requires stepping ahead from the current position to
defragment range keys. This stepping ahead could switch sstables while there are
still points to be consumed in a previous sstable. This determinism is useful
for testing and verification purposes:

- Randomized and metamorphic testing is used extensively to reliably test
  software including Pebble and CockroachDB. Defragmentation provides
  the determinism necessary for this form of testing.

- CockroachDB's replica divergence detector requires a consistent view of the
  database on each replica.

In order to provide determinism, Pebble constructs an internal range key
iterator stack that's separate from the point iterator stack, even when
performing combined iteration over both range and point keys. The separate range
key iterator allows the internal range key iterator to move independently of the
point key iterator. This allows the range key iterator to independently visit
adjacent sstables in order to defragment their range keys if necessary, without
repositioning the point iterator.

Two spans [k1,k2) and [k3,k4) of range keys are defragmented if their bounds
abut and their user-observable state is identical. That is, `k2==k3` and each
span contains exactly the same set of range key `(<suffix>, <tuple>)` pairs. In
order to support `RangeKeyUnset` and `RangeKeyDelete`, defragmentation must be
applied _after_ resolving unsets and deletes.

#### Merging iteration

Recall that range keys are stored in the same sstables as point keys. In a
log-structured merge tree, these sstables are distributed across levels. Within
a level, sstables are non-overlapping but between levels sstables may overlap
arbitrarily. During iteration, keys across levels must be merged together. For
point keys, this is typically done with a heap.

Range keys too must be merged across levels, and the earlier described
fragmentation at intersection boundaries must be applied. To implement this, a
range key merging iterator is defined.

A merging iterator is initialized with an arbitrary number of child iterators
over fragmented spans. Each child iterator exposes fragmented range keys, such
that overlapping range keys are surfaced in a single span with a single set of
bounds.
Range keys from one child iterator may overlap key spans from another
child iterator arbitrarily. The high-level algorithm is:

1. Initialize a heap with bound keys from child iterators' range keys.
2. Find the next [or previous, if in reverse] two unique user keys from bounds.
3. Consider the span formed between the two unique user keys a candidate span.
4. Determine if any of the child iterators' spans overlap the candidate span.
    4a. If any of the child iterators' current bounds are end keys (during
        forward iteration) or start keys (during reverse iteration), then all the
        spans with that bound overlap the candidate span.
    4b. If no spans overlap, forget the smallest (forward iteration) or largest
        (reverse iteration) unique user key and advance the iterators to the next
        unique user key. Start again from 3.

Consider the example:

```
i0:          b---d e-----h
i1:        a---c         h-----k
i2:        a------------------------------p

fragments: a-b-c-d-e-----h-----k----------p
```

None of the individual child iterators contain a span with the exact bounds
[c,d), but the merging iterator must produce a span [c,d). To accomplish this,
the merging iterator visits every span between unique boundary user keys. In the
above example, this is:

```
[a,b), [b,c), [c,d), [d,e), [e,h), [h,k), [k,p)
```

The merging iterator first initializes the heap to prepare for iteration. The
description below discusses the mechanics of forward iteration after a call to
First, but the mechanics are similar for reverse iteration and other positioning
methods.

During a call to First, the heap is initialized by seeking every level to the
first bound of the first fragment.
In the above example, this seeks the child
iterators to:

```
i0: (b, boundKindStart, [ [b,d) ])
i1: (a, boundKindStart, [ [a,c) ])
i2: (a, boundKindStart, [ [a,p) ])
```

After fixing up the heap, the root of the heap is the bound with the smallest
user key ('a' in the example). During forward iteration, the root of the heap's
user key is the start key of the next merged span. The merging iterator records
this key as the start key. The heap may contain other levels with range keys
that also have the same user key as a bound of a range key, so the merging
iterator pulls from the heap until it finds the first bound greater than the
recorded start key.

In the above example, this results in the bounds `[a,b)` and child iterators in
the following positions:

```
i0: (b, boundKindStart, [ [b,d) ])
i1: (c, boundKindEnd,   [ [a,c) ])
i2: (p, boundKindEnd,   [ [a,p) ])
```

With the user key bounds of the next merged span established, the merging
iterator must determine which, if any, of the range keys overlap the span.
During forward iteration any child iterator that is now positioned at an end
boundary has an overlapping span. (Justification: The child iterator's end
boundary is ≥ the new end bound. The child iterator's range key's corresponding
start boundary must be ≤ the new start bound since there were no other user keys
between the new span's bounds. So the fragments associated with the iterator's
current end boundary have start and end bounds such that
`start ≤ <new start bound> < <new end bound> ≤ end`).

The merging iterator iterates over the levels, collecting keys from any child
iterators positioned at end boundaries. In the above example, i1 and i2 are
positioned at end boundaries, so the merging iterator collects the keys of [a,c)
and [a,p).
These spans contain the merging iterator's [a,b) span, but they may
also extend beyond the new span's start and end. The merging iterator returns
the keys with the new start and end bounds, preserving the underlying keys'
sequence numbers, key kinds and values.

It may be the case that the merging iterator finds no levels positioned at span
end boundaries, in which case the span overlaps with nothing. In this case the
merging iterator loops, repeating the above process again until it finds a span
that does contain keys.

#### Efficient masking

Recall that in the earlier example from the iteration interface, during
forward iteration an iterator would output the following keys:

```
Key()  HasPointAndRange()  Value()    RangeKeyBounds()  RangeKeys()
a      (true, true)        artichoke  [a,b)             {(@1,apple)}
b      (false, true)       -          [b,c)             {(@7,kiwi), (@1,apple)}
b@2    (true, true)        beet       [b,c)             {(@7,kiwi), (@1,apple)}
c      (false, true)       -          [c,e)             {(@7,kiwi), (@3,banana), (@1,apple)}
e      (false, true)       -          [e,k)             {(@7,kiwi), (@5,orange), (@1,apple)}
k      (false, true)       -          [k,m)             {(@5,orange), (@1,apple)}
m      (false, true)       -          [m,z)             {(@1,apple)}
t@3    (true, true)        turnip     [m,z)             {(@1,apple)}
```

When implementing an MVCC "soft delete range" operation using range keys, the
range key `[b,k)@7→kiwi` may represent that all keys within the range [b,k) are
deleted at MVCC timestamp @7. During iteration, it would be desirable if the
caller could indicate that it does not want to observe any "soft deleted" point
keys, and the iterator can safely skip them. Note that in an MVCC system,
whether or not a key is soft deleted depends on the timestamp at which the
database is read.

This is implemented through "range key masking," where a range key may act as a
mask, hiding point keys with MVCC timestamps beneath the range key.
This
iterator option requires that the client configure the iterator with an MVCC
timestamp `suffix` representing the timestamp at which history should be read.
All range keys with suffixes (MVCC timestamps) less than or equal to the
configured suffix serve as masks. All point keys with suffixes (MVCC timestamps)
less than a covering, masking range key's suffix are hidden.

Specifically, when configured with a `RangeKeyMasking.Suffix` _s_, and there
exists a range key with suffix _r_ covering a point key with suffix _p_, and _s_
≤ _r_ < _p_, then the point key is elided. Here ≤ and < denote the ordering of
suffixes under the `Comparer`, in which more recent MVCC timestamps sort first.

In the above example, if `RangeKeyMasking.Suffix` is set to `@7`, every range
key serves as a mask and the point key `b@2` is hidden during iteration because
it's contained within the masking `[b,k)@7→kiwi` range key. Note that `t@3`
would _not_ be masked, because its timestamp `@3` is more recent than the only
range key that covers it (`[a,z)@1→apple`).

If `RangeKeyMasking.Suffix` were set to `@6` (a historical, point-in-time read),
the `[b,k)@7→kiwi` range key would no longer serve as a mask, and `b@2` would be
visible.

To efficiently implement masking, we cannot rely on the LSM invariant since
`b@100` can be at a lower level than `[a,e)@50`. Instead, we build on
block-property filters, supporting special use of an MVCC timestamp block
property in order to skip blocks wholly containing point keys that are masked by
a range key. The client may configure a block-property collector to record the
highest MVCC timestamps of point keys within blocks.

At read time, when positioned within a range key with a suffix ≤
`RangeKeyMasking.Suffix`, the iterator configures sstable readers to use a
block-property filter to skip any blocks for which the highest MVCC timestamp is
less than the provided suffix.
Additionally, these iterators must consult index
block bounds to ensure the block-property filter is not applied beyond the
bounds of the masking range key.

### CockroachDB use

CockroachDB initially will only use range keys to represent MVCC range
tombstones. See the MVCC range tombstones tech note for more details:

https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/mvcc-range-tombstones.md

### Alternatives

#### A1. Automatic elision of range keys that don't cover keys

We could decide that range keys:

- Don't contribute to `MVCCStats` themselves.
- May be elided by Pebble when they cover zero point keys.

This means that CockroachDB garbage collection does not need to explicitly
remove the range keys, only the point keys they deleted. This option is clean
when paired with `RANGEDEL`s dropping both point and range keys. CockroachDB can
issue `RANGEDEL`s whenever it wants to drop a contiguous swath of points, and
not worry about the fact that it might also need to update the MVCC stats for
overlapping range keys.

However, this option makes deterministic iteration over defragmented range keys
for replica divergence detection challenging, because internal fragmentation may
elide regions of a range key at any point. Producing a normalized form would
require storing state in the value (ie, the original start key) and
recalculating the smallest and largest extant covered point keys within the
range key and replica bounds. This would require maintaining _O_(range-keys)
state during the `storage.ComputeStatsForRange` pass over a replica's combined
point and range iterator.

This likely forces replica divergence detection to use other means (eg, altering
the checksum of covered points) to incorporate MVCC range tombstone state.

This option is also highly tailored to the MVCC Delete Range use case.
Other
range key usages, like ranged intents, would not want this behavior, so we don't
consider it further.

#### A2. Separate LSM of range keys

There are two viable options for where to store range keys. They may be encoded
within the same sstables as points in separate blocks, or in separate sstables
forming a parallel range-key LSM. We examine the tradeoffs between storing range
keys in the same sstable in different blocks ("shared sstables") or separate
sstables forming a parallel LSM ("separate sstables"):

- Storing range keys in separate sstables is possible because the only
  interactions between range keys and point keys happen at a global level.
  Masking is defined over suffixes. It may be extended to be defined over
  sequence numbers too (see 'Sequence numbers' section below), but that is
  optional. Unlike range deletion tombstones, range keys have no effect on point
  keys during compactions.

- With separate sstables, reads may need to open additional sstable(s) and read
  additional blocks. The number of additional sstables is the number of nonempty
  levels in the range-key LSM, so it grows logarithmically with the number of
  range keys. For each sstable, a read must read the index block and a data
  block.

- With our expectation of few range keys, the range-key LSM is expected to be
  small, with one or two levels. Heuristics around sstable boundaries may
  prevent unnecessary range-key reads when there is no covering range key. Range
  key sstables and blocks are expected to have much higher table and block cache
  hit rates, since they are orders of magnitude less dense. Reads in any
  overlapping point sstables all access the same range key sstables.

- With shared sstables, `SeekPrefixGE` cannot use bloom filters to entirely
  eliminate sstables that contain range keys.
  Pebble does not always use bloom
  filters in L6, so once a range key is compacted into L6 its impact on
  `SeekPrefixGE` is lessened. With separate sstables, `SeekPrefixGE` can always
  use bloom filters for point-key sstables. If there are any overlapping
  range-key sstables, the read must read them.

- With shared sstables, range keys create dense sstable boundaries. A range key
  spanning an sstable boundary leaves no gap between the sstables' bounds. This
  can force ingested sstables into higher levels of the LSM, even if the
  sstables' point key spans don't overlap. This problem was previously observed
  with wide `RANGEDEL` tombstones and was mitigated by prioritizing compaction
  of sstables that contain `RANGEDEL` keys. We could do the same with range
  keys, but the write amplification is expected to be much worse. The `RANGEDEL`
  tombstones drop keys and eventually are dropped themselves as long as there is
  not an open snapshot. Range keys do not drop data and are expected to persist
  in L6 for long durations, always requiring ingested sstables to be inserted
  into L5 or above.

- With separate sstables, compaction logic is separate, which helps avoid
  complexity of tricky sstable boundary conditions. Because there are expected
  to be an order of magnitude fewer range keys, we could impose the constraint
  that a prefix cannot be split across multiple range key sstables. The
  simplified compaction logic comes at the cost of higher levels, iterators, etc
  all needing to deal with the concept of two parallel LSMs.

- With shared sstables, the LSM invariant is maintained between range keys and
  point keys. For example, if the point key `b@20` is committed, and
  subsequently a range key `RangeKey([a,c), @25, ...)` is committed, the range
  key will never fall below the covered point `b@20` within the LSM.

We decide to share sstables, because preserving the LSM invariant between range
keys and point keys is expected to be useful in the long-term.

#### A3. Sequence number masking

In the CockroachDB MVCC range tombstone use case, a point key should never be
written below an existing range key with a higher timestamp. The MVCC range
tombstone use case would allow us to dictate that an overlapping range key with
a higher sequence number always masks point keys with lower sequence numbers.
Adding this additional masking scope would avoid the comparatively costly suffix
comparison when a point key _is_ masked by a range key. We need to consider how
sequence number masking might be affected by the merging of range keys within
snapshot stripes.

Consider the committing of range key `[a,z)@{t1}#10`, followed by point keys
`d@t2#11` and `m@t2#11`, followed by range key `[j,z)@{t3}#12`. This sequencing
respects the expected timestamp, sequence number relationship in CockroachDB's
use case. If all keys are flushed within the same sstable, fragmentation and
merging overlapping fragments yields range keys `[a,j)@{t1}#10`,
`[j,z)@{t3,t1}#12`. The key `d@t2#11` must not be masked because it's not
covered by the new range key, and indeed that's the case because the covering
range key's fragment is unchanged: `[a,j)@{t1}#10`.

For now we defer this optimization, with the expectation that we may not be able
to preserve this relationship between sequence numbers and suffixes in all range
key use cases.
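
As a rough illustration of the sequencing example in this section, the Go sketch
below fragments the two range keys and then applies the proposed sequence-number
rule to the two point keys. The `rangeKey` and `fragment` types and the
`fragmentKeys` and `maskedBySeqNum` helpers are hypothetical names invented for
this sketch and are not Pebble APIs. It also assumes user keys compare as plain
strings and treats suffixes as opaque labels.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeKey is a simplified, hypothetical stand-in for a committed range key
// [Start, End)@Suffix#SeqNum. It is not Pebble's internal representation.
type rangeKey struct {
	Start, End string
	Suffix     string
	SeqNum     uint64
}

// fragment is a non-overlapping span, the suffixes of every range key covering
// it (highest sequence number first), and that highest sequence number.
type fragment struct {
	Start, End string
	Suffixes   []string
	SeqNum     uint64
}

// fragmentKeys splits overlapping range keys at every distinct bound and
// merges the keys that cover each resulting span.
func fragmentKeys(keys []rangeKey) []fragment {
	seen := map[string]bool{}
	var bounds []string
	for _, k := range keys {
		for _, b := range []string{k.Start, k.End} {
			if !seen[b] {
				seen[b] = true
				bounds = append(bounds, b)
			}
		}
	}
	sort.Strings(bounds)

	// Visit keys from highest to lowest sequence number.
	sorted := append([]rangeKey(nil), keys...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].SeqNum > sorted[j].SeqNum })

	var frags []fragment
	for i := 0; i+1 < len(bounds); i++ {
		f := fragment{Start: bounds[i], End: bounds[i+1]}
		for _, k := range sorted {
			if k.Start <= f.Start && f.End <= k.End {
				if len(f.Suffixes) == 0 {
					f.SeqNum = k.SeqNum
				}
				f.Suffixes = append(f.Suffixes, k.Suffix)
			}
		}
		if len(f.Suffixes) > 0 {
			frags = append(frags, f)
		}
	}
	return frags
}

// maskedBySeqNum applies the hypothetical rule discussed in this section: a
// point key is masked when its covering fragment has a higher sequence number.
func maskedBySeqNum(frags []fragment, userKey string, seqNum uint64) bool {
	for _, f := range frags {
		if f.Start <= userKey && userKey < f.End {
			return f.SeqNum > seqNum
		}
	}
	return false
}

func main() {
	frags := fragmentKeys([]rangeKey{
		{Start: "a", End: "z", Suffix: "t1", SeqNum: 10},
		{Start: "j", End: "z", Suffix: "t3", SeqNum: 12},
	})
	for _, f := range frags {
		fmt.Printf("[%s,%s)@%v#%d\n", f.Start, f.End, f.Suffixes, f.SeqNum)
	}
	fmt.Println("d@t2#11 masked:", maskedBySeqNum(frags, "d", 11))
	fmt.Println("m@t2#11 masked:", maskedBySeqNum(frags, "m", 11))
}
```

Under these assumptions the sketch yields the fragments `[a,j)` at sequence
number 10 and `[j,z)` at sequence number 12, so `d` written at sequence number
11 stays visible while `m` written at sequence number 11 would be masked.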