# Range merges

**Last update:** April 10, 2019

**Original author:** Nikhil Benesch

This document serves as an end-to-end description of the implementation of range
merges. The target reader is someone who is reasonably familiar with the core
layer but unfamiliar with either the "how" or the "why" of the range merge
implementation.

The most complete documentation, of course, is in the code, tests, and the
surrounding comments, but those pieces are necessarily split across several
files and packages. That scattered knowledge is centralized here, without
excessive detail that is likely to become stale.

## Table of Contents

* [Overview](#overview)
* [Implementation details](#implementation-details)
  * [Preconditions](#preconditions)
  * [Initiating a merge](#initiating-a-merge)
    * [AdminMerge race](#adminmerge-race)
  * [Merge transaction](#merge-transaction)
    * [Transfer of power](#transfer-of-power)
  * [Snapshots](#snapshots)
  * [Merge queue](#merge-queue)
* [Subtle complexities](#subtle-complexities)
  * [Range descriptor generations](#range-descriptor-generations)
  * [Misaligned replica sets](#misaligned-replica-sets)
  * [Replica GC](#replica-gc)
  * [Transaction record GC](#transaction-record-gc)
  * [Unanimity](#unanimity)
* [Safety recap](#safety-recap)
* [Appendix](#appendix)
  * [Key encoding oddities](#key-encoding-oddities)

## Overview

A range merge begins when two adjacent ranges are selected to be merged
together. For example, suppose our adjacent ranges are _P_ and _Q_, somewhere in
the middle of the keyspace:

```
--+-----+-----+--
  |  P  |  Q  |
--+-----+-----+--
```

We'll call _P_ the left-hand side (LHS) of the merge, and _Q_ the right-hand
side (RHS) of the merge. For reasons that will become clear later, we also refer
to _P_ as the subsuming range and _Q_ as the subsumed range.

The merge is coordinated by the LHS. The coordinator begins by verifying that a)
the two ranges are, in fact, adjacent, and b) the replica sets of the two
ranges are aligned. Replica set alignment is a term that is currently only
relevant to merges; it means that the set of stores with replicas of the LHS
exactly matches the set of stores with replicas of the RHS. For example, this
replica set is aligned:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
| P Q |   | P Q |   |     |   | P Q |
+-----+   +-----+   +-----+   +-----+
```

By requiring replica set alignment, the merge operation is reduced to a metadata
update, albeit a tricky one, as the stores that will have a copy of the merged
range _PQ_ already have all the constituent data, by virtue of having a copy of
both _P_ and _Q_ before the merge begins. Note that replicas of _P_ and _Q_ do
not need to be fully up-to-date before the merge begins; they'll be caught up as
necessary during the [transfer of power](#transfer-of-power).

After verifying that the merge is sensible, the coordinator transactionally
updates the implicated range descriptors, adjusting _P_'s range descriptor so
that it extends to _Q_'s end and deleting _Q_'s range descriptor.

Then, the coordinator needs to [atomically move
responsibility](#transfer-of-power) for the data in the RHS to the LHS.
This is tricky, as the lease on the LHS may be held by a different store than
the lease on the RHS. The coordinator notifies the RHS that it is about to be
subsumed and that it is prohibited from serving any additional read or write
traffic. Only when the coordinator has received an acknowledgement from _every_
replica of the RHS, indicating that no traffic is possibly being served on the
RHS, does the coordinator commit the merge transaction.

Like with splits, the merge transaction is committed with a special "commit
trigger" that instructs the receiving store to update its in-memory bookkeeping
to match the updates to the range descriptors in the transaction. The moment the
merge transaction is considered committed, the merge is complete!

At the time of writing, merges are only initiated by the merge queue, which is
responsible both for locating ranges that are in need of a merge and for
aligning their replica sets before initiating the merge.

The remaining sections cover each of these steps in more detail.

## Implementation details

### Preconditions

Not just any two ranges can be merged. The first and most obvious criterion is
that the two ranges must be adjacent. Consider a simplified cluster that has
only three ranges, _A_, _B_, and _C_:

```
+-----+-----+-----+
|  A  |  B  |  C  |
+-----+-----+-----+
```

Ranges _A_ and _B_ can be merged, as can ranges _B_ and _C_, but not ranges _A_
and _C_, as they are not adjacent. Note that adjacent ranges are equivalently
referred to as "neighbors", as in, range _B_ is range _A_'s right-hand neighbor.

The second criterion is that the replica sets must be aligned. To illustrate,
consider a four-node cluster with the default 3x replication. The allocator has
attempted to balance ranges as evenly as possible:

```
Node 1    Node 2    Node 3    Node 4
+-----+   +-----+   +-----+   +-----+
| P Q |   |  P  |   |  Q  |   | P Q |
+-----+   +-----+   +-----+   +-----+
```

Notice how node 2 does not have a copy of _Q_, and node 3 does not have a copy
of _P_. These replica sets are considered "misaligned." Aligning them requires
rebalancing _Q_ from node 3 to node 2, or rebalancing _P_ from node 2 to node 3:

```
Node 1    Node 2    Node 3    Node 4
+-----+   +-----+   +-----+   +-----+
| P Q |   | P Q |   |     |   | P Q |
+-----+   +-----+   +-----+   +-----+

Node 1    Node 2    Node 3    Node 4
+-----+   +-----+   +-----+   +-----+
| P Q |   |     |   | P Q |   | P Q |
+-----+   +-----+   +-----+   +-----+
```

We explored an alternative merge implementation that did not require aligned
replica sets, but found it to be unworkable. See the [misaligned replica sets
misstep](#misaligned-replica-sets) for details.
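To make the alignment condition concrete, here is a minimal sketch of the check
in Go. The types are simplified stand-ins for `roachpb.RangeDescriptor` and
`roachpb.ReplicaDescriptor`, keeping only the fields used by the sketches in
this document; the real check lives in the merge code and is more thorough.

```go
// Simplified stand-ins for the roachpb descriptor types. Only the fields used
// by the sketches in this document are included.
type ReplicaDescriptor struct {
	StoreID int
}

type RangeDescriptor struct {
	RangeID          int64
	StartKey, EndKey string
	Replicas         []ReplicaDescriptor
	Generation       int64
}

// replicaSetsAligned reports whether the two descriptors list exactly the same
// set of stores, which is the precondition that merges require.
func replicaSetsAligned(lhs, rhs RangeDescriptor) bool {
	if len(lhs.Replicas) != len(rhs.Replicas) {
		return false
	}
	stores := make(map[int]bool, len(lhs.Replicas))
	for _, r := range lhs.Replicas {
		stores[r.StoreID] = true
	}
	for _, r := range rhs.Replicas {
		if !stores[r.StoreID] {
			return false
		}
	}
	return true
}
```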
### Initiating a merge

A merge is initiated by sending an AdminMerge request to a range. Like other
admin commands, DistSender will automatically route the request to the
leaseholder of the range, but there is no guarantee that the store will retain
its lease while the admin command is executing.

Note that an AdminMerge request takes no arguments, as there is no choice in
what range will be merged. The recipient of the AdminMerge will always be the
LHS, subsuming range, and its right neighbor at the moment that the
AdminMerge command begins executing will always be the RHS, subsumed range.

It would have been reasonable to have instead used the RHS to coordinate the
merge. That is, the RHS would have been the subsuming range, and the LHS would
have been the subsumed range. Using the LHS to coordinate, however, yields a
nice symmetry with splits, where the range that coordinates a split becomes the
LHS of the split. Maintaining this symmetry means that a range's start key never
changes during its lifetime, while its end key may change arbitrarily in
response to splits and merges.

There is another reason to prefer using the LHS to coordinate involving an
oddity of key encoding and range bounds. It is trivial for a range to send a
request to its right neighbor, as it simply addresses the request to its end
key, but it is difficult to send a request to its left neighbor, as there is no
function to get the key that immediately precedes the range's start key. See the
[key encoding oddities](#key-encoding-oddities) section of the appendix for
details.

At the time of writing, only the [merge queue](#merge-queue) initiates merges,
and it does so by bypassing DistSender and invoking the AdminMerge command
directly on the local replica. At some point in the future, we may wish to
expose manual merges via SQL, at which point the SQL layer will need to send
proper AdminMerge requests through the KV API.

#### AdminMerge race

At present, AdminMerge requests are subject to a small race. It is possible for
the ranges implicated by an AdminMerge request to split or merge between when
the client decides to send an AdminMerge request and when the AdminMerge request
is processed.

For example, suppose the client decides that _P_ and _Q_ should be merged and
sends an AdminMerge request to _P_. It is possible that, before the AdminMerge
request is processed, _P_ splits into _P<sub>1</sub>_ and _P<sub>2</sub>_. The
AdminMerge request will thus result in _P<sub>1</sub>_ and _P<sub>2</sub>_
merging together, and not the desired _P_ and _Q_.

The race could have been avoided if the AdminMerge request required that the
descriptors for the implicated ranges were provided as arguments to the request.
Then the merge could be aborted if the merge transaction discovered that either
of the implicated ranges did not match the corresponding descriptor in the
AdminMerge request arguments, forming a sort of optimistic lock.

Fortunately, the race is rare in practice. If it proves to be a problem, the
scheme described above would be easy to implement while maintaining backwards
compatibility.
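For concreteness, here is a sketch of what that unimplemented optimistic-lock
scheme might look like, reusing the simplified descriptor type from the earlier
sketch. The expected descriptors are hypothetical arguments that today's
AdminMerge request does not actually accept.

```go
import (
	"errors"
	"reflect"
)

// checkMergeExpectations sketches the proposed optimistic lock: the client
// would capture the descriptors it used to make the merge decision, and the
// merge transaction would abort if the descriptors it reads transactionally no
// longer match, i.e., if either range split, merged, or was rebalanced in the
// meantime.
func checkMergeExpectations(expectedLHS, expectedRHS, currentLHS, currentRHS RangeDescriptor) error {
	if !reflect.DeepEqual(expectedLHS, currentLHS) || !reflect.DeepEqual(expectedRHS, currentRHS) {
		return errors.New("descriptors changed since the AdminMerge was issued; aborting merge")
	}
	return nil
}
```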
### Merge transaction

The merge transaction piggybacks on CockroachDB's serializability to provide
much of the necessary synchronization for the bookkeeping updates. For example,
merges cannot occur concurrently with any splits or replica changes on the
implicated ranges, because the merge transaction will naturally conflict with
the split and change-replicas transactions, as all of these transactions attempt
to write updated range descriptors. No additional code was needed to enforce
this, as our standard transaction conflict detection mechanisms kick in here
(write intents, the timestamp cache, the span latch manager, etc.).

Note that there was one surprising synchronization problem that was not
immediately handled by serializability. See [range descriptor
generations](#range-descriptor-generations) for details.

The standard KV operations that the merge transaction performs are:

* Reading the LHS descriptor and RHS descriptor, and verifying that their
  replica sets are aligned.
* Updating the local and meta copy of the LHS descriptor to reflect the widened
  end key.
* Deleting the local and meta copy of the RHS descriptor.
* Writing an entry to the `system.rangelog` table.

These operations are the essence of a merge, and in fact update all necessary
on-disk data! All the remaining complexity exists to update in-memory metadata
while the cluster is live.

Note that the merge transaction's KV operations are not fundamentally dependent
on one another and so could theoretically be performed in any order. There are,
however, several implementation details that enforce some ordering constraints.

First, the merge transaction record needs to be located on the LHS.
Specifically, the transaction record needs to live on the subsuming range, as
the commit trigger that actually applies the merge to the replica's in-memory
state runs on the range where the transaction record lives. The transaction
record is created on the range that the transaction writes first; therefore, the
merge transaction is careful to update the local copy of the LHS descriptor as
its first operation, since the local copy of the LHS descriptor lives on the
LHS.

Second, the merge transaction must ensure that, when it issues the delete
request to remove the local copy of the RHS descriptor, the resulting intent is
actually written to disk. (See the [transfer of power](#transfer-of-power)
subsection for why this is required.) Thanks to [transactional
pipelining][#26599], KV writes can return early, before their intents have
actually been laid down. The intents are not required to make it to disk until
the moment before the transaction commits. The merge transaction simply disables
pipelining to avoid this hazard.
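Putting those constraints together, the transaction's writes look roughly like
the following sketch. The `mergeTxn` interface and key helpers are illustrative
stand-ins for the real `client.Txn` API and the helpers in `pkg/keys`; only the
ordering is meant to be faithful.

```go
// mergeTxn is a minimal stand-in for the transactional KV API used here.
type mergeTxn interface {
	DisablePipelining()
	Put(key string, desc RangeDescriptor) error
	Del(key string) error
}

// Illustrative key helpers. The real code computes range-local and meta2 keys
// via pkg/keys; meta2 records are addressed by a range's end key.
func rangeLocalDescriptorKey(d RangeDescriptor) string { return "/Local/Range/" + d.StartKey + "/Descriptor" }
func meta2Key(endKey string) string                    { return "/Meta2/" + endKey }

// runMergeTxnWrites sketches the order of the merge transaction's writes.
func runMergeTxnWrites(txn mergeTxn, lhs, rhs RangeDescriptor) error {
	// Opt out of transactional pipelining up front so that every intent,
	// in particular the deletion intent on the RHS's local descriptor, is
	// durably written before the transaction commits.
	txn.DisablePipelining()

	// The widened descriptor: the LHS keeps its start key and absorbs the
	// RHS's end key. Every change to a range's bounds bumps its generation.
	merged := lhs
	merged.EndKey = rhs.EndKey
	merged.Generation = lhs.Generation + 1

	// Write the LHS's range-local descriptor copy first: the transaction
	// record is created on the range written first, and the commit trigger
	// must run on the LHS.
	if err := txn.Put(rangeLocalDescriptorKey(lhs), merged); err != nil {
		return err
	}
	// Leave a deletion intent on the RHS's range-local descriptor copy.
	if err := txn.Del(rangeLocalDescriptorKey(rhs)); err != nil {
		return err
	}
	// Update the meta2 addressing records: the widened descriptor replaces
	// the record addressed by the RHS's end key, and the record addressed by
	// the LHS's old end key is deleted. (The rangelog entry is not shown.)
	if err := txn.Put(meta2Key(merged.EndKey), merged); err != nil {
		return err
	}
	return txn.Del(meta2Key(lhs.EndKey))
}
```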
As the last step before the commit, the merge transaction needs to freeze the
RHS, then wait for _every_ replica of the RHS to apply all outstanding commands.
This ensures that, when the merge commits, every LHS replica can blindly assume
that it has perfectly up-to-date data for the RHS. To quickly recap, this is
guaranteed because 1) the replica sets were aligned when the merge transaction
started, 2) rebalances that would misalign the replica sets will conflict with
the merge transaction, causing one of the transactions to abort, 3) the RHS is
frozen and cannot process any new commands, and 4) every replica of the RHS is
caught up on all commands. The process of freezing the RHS and waiting for every
replica to catch up is covered more thoroughly in the [transfer of
power](#transfer-of-power) subsection.

Finally, the merge transaction commits, attaching a special [merge commit
trigger] to the end transaction request. This trigger has four
responsibilities:

1. It ensures the end transaction request knows which intents it can resolve
   locally. Intents that live on the RHS range would naively appear to belong
   to a different range than the one containing the transaction record (i.e.,
   the LHS), but if the merge is committing then the LHS is subsuming the RHS
   and thus the intents can be resolved locally.

   In fact, it's critical that these intents are considered local, because
   local intents are resolved synchronously while remote intents are resolved
   asynchronously. We need to maintain the invariant that, when a store boots
   up and discovers an intent on its local copy of a range descriptor, it can
   simply ignore the intent. Because we enforce that these intents are
   resolved synchronously with the commit of the merge transaction, we are
   guaranteed that, if we see an intent on a local range descriptor, this
   replica has not yet applied the `EndTransaction` request for the merge
   transaction, and it is therefore safe to load the replica. If the intent
   were instead resolved asynchronously, we could observe the state where the
   `EndTransaction` request for the merge had applied but the intent
   resolution had not applied, in which case we would attempt to load both the
   post-merge subsuming replica and the subsumed replica, which would overlap
   and crash the node.

2. It adjusts the LHS's MVCCStats to incorporate the subsumed range's
   MVCCStats.

3. It copies necessary range-ID-local data from the RHS to the LHS, rekeying
   each key to use the LHS's range ID. At the moment, the only necessary data
   is the [transaction abort span].

4. It attaches a [merge payload to the replicated proposal result][pd-flag].
   When each replica of the LHS applies the command, it will notice the merge
   payload and adjust the store's in-memory state to match the changes to the
   on-disk state that were committed by the transaction. This entails
   atomically removing the RHS replica from the store and widening the LHS
   replica.

   This operation involves a delicate dance of acquiring store locks and locks
   for both replicas, in a certain order, at various points in the Raft
   command application flow. The details are too intricate to be worth
   describing here, especially considering that these tangled interactions
   between a store and its replicas are due for a refactor. The best thing to
   do, if you're interested in the details, is to trace through all references
   to `storagepb.ReplicatedEvalResult.Merge`.

[#26599]: https://github.com/cockroachdb/cockroach/pull/26599
[transaction abort span]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/abortspan/abortspan.go
[merge commit trigger]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/batcheval/cmd_end_transaction.go#L984-L994
[pd-flag]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/batcheval/cmd_end_transaction.go#L1033-L1035
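The stats adjustment in the second responsibility is a field-wise addition. A
drastically simplified sketch follows; the real type is `enginepb.MVCCStats`,
which tracks many more counters but is combined the same way.

```go
// mvccStats is a toy model of the per-range MVCC statistics.
type mvccStats struct {
	LiveBytes, KeyBytes, ValBytes, KeyCount, ValCount int64
}

// add folds another range's stats into these stats.
func (ms *mvccStats) add(other mvccStats) {
	ms.LiveBytes += other.LiveBytes
	ms.KeyBytes += other.KeyBytes
	ms.ValBytes += other.ValBytes
	ms.KeyCount += other.KeyCount
	ms.ValCount += other.ValCount
}

// In the merge trigger, the subsumed range's stats are folded into the
// subsuming range's stats so the widened range's bookkeeping stays accurate:
//
//     lhsStats.add(rhsStats)
```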
#### Transfer of power

The trickiest part of the merge transaction is making the LHS responsible for
the keyspace that was previously owned by the RHS. The transfer of power must be
performed atomically; otherwise, all manner of consistency violations can occur.
This is hopefully obvious, but here's a quick example to drive the point home.
Suppose _P_ and _Q_ simultaneously consider themselves responsible for the key
_q_. _Q_ could then allow a write to _q_ at time 1 at the same time that _P_
allowed a read of _q_ at time 2. Consistency requires that either the read see
the write, as the read is executing at a higher timestamp, or that the write is
bumped to time 3. But because _P_ and _Q_ have separate span latch managers, no
synchronization will occur, and the read might fail to see the write!

The transfer of power is complicated by the fact that there is no guarantee that
the leases on the LHS and the RHS will be aligned, nor is there any
straightforward way to provide such a guarantee. (Aligned leaseholders would
allow the merge transaction to use a simple in-memory lock to make the transfer
of power atomic.) The leaseholder of either range might fail at any moment, at
which point the lease can be acquired, after it expires, by any other live
member of the range.

Since aligning the leaseholders is infeasible, the merge transaction implements
what is essentially a distributed lock.

The lock is initialized by sending a [Subsume][subsume-request] request to the
RHS. This is a single-purpose request that exists solely for use in the merge
transaction. It is unlikely to ever be useful in another situation, and (ab)uses
several implementation details to provide the necessary synchronization.

When the Subsume request returns, the RHS has made three important promises:

1. There are no commands in flight.
2. Any future requests will block until the merge transaction completes.
   If the merge transaction commits, the requests will be bounced with a
   RangeNotFound error. If the merge transaction aborts, the requests will
   be processed as normal.
3. If the RHS loses its lease, the new leaseholder will adhere to promises
   1 and 2.

The Subsume request provides promise 1 by declaring that it reads and writes all
addressable keys in the range. This is a bald-faced lie, as the Subsume request
only reads one key and writes no keys, but it forces synchronization with all
latches in the span latch manager, as no other commands can possibly execute in
parallel with a command that claims to write all keys.

**TODO(benesch,nvanbenschoten):** Actually, concurrent reads at lower timestamps
are permitted. Is this a problem? Maybe. Answering this question is difficult
and requires reasoning about the causal chain established by the sequence of
requests sent by the merge transaction.

It provides promise 2 by flipping [a bit][merge-bit] on the replica that
indicates that a subsumption is in progress. When the bit is active, the
replica blocks processing of all requests.

Importantly, the bit needs to be cleared when the merge transaction completes,
so that the requests are not blocked forever. This is the responsibility of the
[merge watcher goroutine][watcher]. Determining whether a transaction has
committed or not is conceptually simple, but the details are brutally
complicated. See the [transaction record GC](#transaction-record-gc) section
for details.

Note that the Subsume request only runs on the leaseholder, and therefore the
merge bit is only set on the leaseholder and the watcher goroutine only runs on
the leaseholder. This is perfectly safe, as none of the follower replicas can
process requests.
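Conceptually, the merge bit is a gate that request processing consults before
proceeding. A simplified sketch of that gate follows; the real implementation
keeps equivalent state on the replica and is considerably more involved (see
the merge bit and merge watcher goroutine linked above).

```go
import (
	"context"
	"sync"
)

// subsumedGate models the "merge bit": while a subsumption is in progress,
// incoming requests block until the merge transaction's fate is known.
type subsumedGate struct {
	mu            sync.Mutex
	mergeComplete chan struct{} // non-nil while a subsumption is in progress
}

// beginSubsume is called when Subsume executes (or when a new leaseholder
// infers an in-progress subsumption from a deletion intent on its descriptor).
func (g *subsumedGate) beginSubsume() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.mergeComplete == nil {
		g.mergeComplete = make(chan struct{})
	}
}

// finishSubsume is called by the merge watcher goroutine once the merge
// transaction is known to have committed or aborted.
func (g *subsumedGate) finishSubsume() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.mergeComplete != nil {
		close(g.mergeComplete)
		g.mergeComplete = nil
	}
}

// waitForMerge blocks an incoming request while the gate is set. Once the gate
// clears, the request is either retried (merge aborted) or bounced back to
// DistSender with a RangeNotFound error (merge committed).
func (g *subsumedGate) waitForMerge(ctx context.Context) error {
	g.mu.Lock()
	ch := g.mergeComplete
	g.mu.Unlock()
	if ch == nil {
		return nil // no subsumption in progress
	}
	select {
	case <-ch:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```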
Promise 3 is actually not provided by the Subsume request at all, but by a hook
in the lease acquisition code path. Whenever a replica acquires a lease, it
checks to see whether its local descriptor has a deletion intent. If it does, it
can infer that a subsumption is in progress, as nothing else leaves a deletion
intent on a range descriptor. In that case, the replica, instead of serving
traffic, flips the merge bit and launches its own merge watcher goroutine, just
as the Subsume command would have. This means there can actually be multiple
replicas of the RHS with the merge bit set and a merge watcher goroutine
running—assuming the old leaseholder did not crash but lost its lease for other
reasons—but this does not cause any problems.

[subsume-request]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/batcheval/cmd_subsume.go#L56-L86
[merge-bit]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/replica.go#L361-L364
[watcher]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/replica.go#L2817-L2926

### Snapshots

The LHS might be advanced over the command that commits a merge with a snapshot.
That means that all the complicated bookkeeping that normally takes place when a
replica processes a command with a non-nil `ReplicatedEvalResult.Merge` is
entirely skipped! Most problematically, the snapshot will need to widen the
receiving replica, but there will be a replica of the RHS in the way—remember,
this is guaranteed by replica set alignment. In fact, the snapshot could be
advancing over several merges, or a combination of several merges and splits, in
which case there will be several RHSes to subsume.

This turns out to be relatively straightforward to handle. If an initialized
replica receives a snapshot that widens it, it can infer that a merge occurred,
and it simply subsumes all replicas that are overlapped by the snapshot in one
shot. This requires the same delicate synchronization dance, mentioned at the
end of the [merge transaction](#merge-transaction) section, to update bookkeeping
information. After all, applying a widening snapshot is simply the bulk version
of applying a merge command directly. The details are too complicated to go into
here, but you can begin your own exploration by starting with this call to
[`Replica.maybeAcquireSnapshotMergeLock`][code-start] and tracing how the
returned `subsumedRepls` value is used.

[code-start]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/replica.go#L4071-L4072

### Merge queue

The merge queue, like most of the other queues in the system, runs on every
store and periodically scans all replicas for which that store holds the lease.
For each replica, the merge queue evaluates whether it should be merged with
its right neighbor. Looking rightward is a natural fit for the reasons
described in the [key encoding oddities](#key-encoding-oddities) section of
the appendix.

In some ways, the merge queue has an easy job. For any given range _P_ and its
right neighbor _Q_, the merge queue synthesizes the hypothetical merged range
_PQ_ and asks whether the split queue would immediately split that merged range.
If the split queue would immediately split _PQ_, then obviously _P_ and _Q_
should not be merged; otherwise, the ranges _should_ be merged! This means that
any improvement to our split heuristics also improves our merge heuristics with
essentially no extra work. For example, load-based splitting hardly required any
changes to the merge queue.
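The decision boils down to a check like the following sketch. The fields and
thresholds are illustrative; the real merge queue consults the zone config's
size bounds and the same load signals that the split queue uses, and the next
paragraphs describe additional wrinkles.

```go
// rangeInfo is a simplified bundle of per-range information. The real merge
// queue obtains the RHS's half of this via a RangeStats request to Q's
// leaseholder, as described below.
type rangeInfo struct {
	SizeBytes int64
	QPS       float64
}

// shouldMerge sketches the core heuristic: synthesize the hypothetical merged
// range PQ and ask whether the split queue would immediately split it again.
func shouldMerge(p, q rangeInfo, minBytes, maxBytes int64, splitQPSThreshold float64) bool {
	// Avoid thrashing: a range at or above the minimum size threshold is not
	// a merge candidate in the first place.
	if p.SizeBytes >= minBytes || q.SizeBytes >= minBytes {
		return false
	}
	merged := rangeInfo{
		SizeBytes: p.SizeBytes + q.SizeBytes,
		QPS:       p.QPS + q.QPS,
	}
	// Would the split queue immediately split PQ, either because it is too
	// large or because it is serving too much load? If so, do not merge.
	if merged.SizeBytes >= maxBytes || merged.QPS >= splitQPSThreshold {
		return false
	}
	return true
}
```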
Note that, to avoid thrashing, ranges at or above the minimum size threshold
(8MB) are never considered for a merge. The minimum size threshold is
configurable on a per-zone basis.

Unfortunately, constructing the hypothetical merged range _PQ_ requires
information about _Q_ that only _Q_'s leaseholder maintains, like the amount of
load that _Q_ is currently experiencing. The merge queue must send a RangeStats
RPC to collect this information from _Q_'s leaseholder, because there is no
guarantee that the current store is _Q_'s leaseholder—or that the current store
even has a replica of _Q_ at all.

To prevent unnecessary RPC chatter, the merge queue uses several heuristics to
avoid sending RangeStats requests when it seems like the merge is unlikely to be
permitted. For example, if it determines that _P_ and _Q_ store different
tables, the split between them is mandatory and it won't bother sending the
RangeStats requests. Similarly, if _P_ is above the minimum size threshold, it
doesn't bother asking about _Q_.

## Subtle complexities

### Range descriptor generations

There was one knotty race that was not immediately eliminated by transaction
serializability. Suppose we have our standard aligned replica set situation:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
| P Q |   | P Q |   |     |   | P Q |
+-----+   +-----+   +-----+   +-----+
```

In an unfortunate twist of fate, a rebalance of _P_ from store 2 to store 3
begins at the same time as a merge of _P_ and _Q_. Let's quickly cover the
valid outcomes of this race.

1. The rebalance commits before the merge. The merge must abort, as the
   replica sets of _P_ and _Q_ are no longer aligned.

2. The merge commits before the rebalance starts. The rebalance should
   voluntarily abort, as the decision to rebalance _P_ needs to be updated in
   light of the merge. It is not, however, a correctness problem if the
   rebalance commits; it simply results in rebalancing a larger range than
   may have been intended.

3. The merge commits before the rebalance ends, but after the rebalance has
   sent a preemptive snapshot to store 3. The rebalance must abort, as
   otherwise the preemptive snapshot it sent to store 3 is a ticking time
   bomb.

   To see why, suppose the rebalance commits. Since the preemptive snapshot
   predates the commit of the merge transaction, the new replica on store 3
   will need to be streamed the Raft command that commits the merge
   transaction. But applying this merge command is disastrous, as store 3
   does not have a replica of _Q_ to merge! This is a very subtle way in which
   replica set alignment can be subverted.

Guaranteeing the correct outcome in case 1 is easy. The merge transaction simply
checks for replica set alignment by transactionally reading the range descriptor
for _P_ and the range descriptor for _Q_ and verifying that they list the same
replicas. Serializability guarantees the rest.

Case 2 is similarly easy to handle. The rebalance transaction simply verifies
that the range descriptor used to make the rebalance decision matches the range
descriptor that it reads transactionally.

Case 3, however, has an extremely subtle pitfall.
It seems like the solution for
case 2 should apply: simply abort the transaction if the range descriptor
changes between when the preemptive snapshot is sent and when the rebalance
transaction starts. But, as it turns out, this is not quite foolproof. What if,
between when the preemptive snapshot is sent and when the rebalance transaction
starts, _P_ and _Q_ merge together and then split at exactly the same key? The
range descriptor for _P_ will look entirely unchanged to the rebalance
transaction!

The solution was to add a generation counter to the range descriptor:

```protobuf
message RangeDescriptor {
  // ...

  // generation is incremented on every split and every merge, i.e., whenever
  // the end_key of this range changes. It is initialized to zero when the range
  // is first created.
  optional int64 generation = 6;
}
```

It is no longer possible for a range descriptor to be unchanged by a sequence of
splits and merges, as every split and merge will bump the generation counter.
Rebalances can thus detect if a merge commits between when the preemptive
snapshot is sent and when the transaction begins, and abort accordingly.
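A tiny sketch of the check this enables, reusing the simplified descriptor type
from earlier: a rebalance compares the descriptor it captured when it sent the
preemptive snapshot against the one it reads transactionally, and aborts if
anything, including the generation, differs.

```go
// descriptorChanged reports whether a range's descriptor changed between two
// observations. Comparing the bounds alone would be fooled by a merge followed
// by a split at exactly the same key; the generation counter cannot be,
// because both the merge and the split bump it.
func descriptorChanged(before, after RangeDescriptor) bool {
	return before.StartKey != after.StartKey ||
		before.EndKey != after.EndKey ||
		before.Generation != after.Generation
}
```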
### Misaligned replica sets

An early implementation allowed merges between ranges with misaligned replica
sets. The intent was to simplify matters by avoiding replica rebalancing.

Consider again our example misaligned replica set:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
| P Q |   |  P  |   |  Q  |   | P Q |
+-----+   +-----+   +-----+   +-----+

P: (s1, s2, s4)
Q: (s1, s3, s4)
```

Note that there are two perspectives shown here. The store boxes represent the
replicas that are *actually* present on that store, from the perspective of the
store itself. The descriptor tuples at the bottom represent the stores that are
considered to be members of the range, from the perspective of the most recently
committed range descriptor.

Now, to merge _P_ and _Q_ in this situation without aligning their replica sets,
store 2 needed to be provided a copy of store 3's data. To accomplish this, a
copy of _Q_'s data was stashed in the merge trigger, and _P_ would write this
data into its store when applying the merge trigger.

There was, sadly, a large and unforeseen problem with lagging replicas. Suppose
store 2 loses network connectivity a moment before ranges _P_ and _Q_ are
merged. Note that store 2 is not required for _P_ and _Q_ to merge, because only
a quorum is required on the LHS to commit a merge. Now the situation looks like
this:

```
Store 1   Store 2   Store 3   Store 4
+-----+   xxxxxxx   +-----+   +-----+
|  PQ |   |  P  |   |     |   |  PQ |
+-----+   xxxxxxx   +-----+   +-----+

PQ: (s1, s2, s4)
```

There is nothing stopping the newly merged _PQ_ range from immediately splitting
into _P_ and _Q'_. Note that _P_ is the same range as the original _P_ (i.e., it
has the same range ID), and so the replica on store 2 is still considered a
member, while _Q'_ is a new range, with a new ID, that is unrelated to _Q_:

```
Store 1   Store 2   Store 3   Store 4
+-----+   xxxxxxx   +-----+   +-----+
| P Q'|   |  P  |   |     |   | P Q'|
+-----+   xxxxxxx   +-----+   +-----+

P: (s1, s2, s4)
Q': (s1, s2, s4)
```

When store 2 comes back online, it will start catching up on missed messages.
But notice how the meta ranges consider store 2 to be a member of _Q'_, because
it was a member of _P_ before the split. The leaseholder for _Q'_ will notice
that store 2's replica is out of date and send over a snapshot so that store 2
can initialize its replica... and all that might happen before store 2 manages
to apply the merge command for _PQ_. If so, applying the merge command for _PQ_
will explode, because the keyspace of the merged range _PQ_ intersects with the
keyspace of _Q'_!

By requiring aligned replica sets, we sidestep this problem. The RHS is, in
effect, a lock on the post-merge keyspace. Suppose we find ourselves in the
analogous situation with replica sets aligned:

```
Store 1   Store 2   Store 3   Store 4
+-----+   xxxxxxx   +-----+   +-----+
| P Q'|   | P Q |   |     |   | P Q'|
+-----+   xxxxxxx   +-----+   +-----+

P: (s1, s2, s4)
Q': (s1, s2, s4)
```

Here, _PQ_ split into _P_ and _Q'_ immediately after merging, but notice how
store 2 has a replica of both _P_ and _Q_ because we required replica set
alignment during the merge. That replica of _Q_ prevents store 2 from
initializing a replica of _Q'_ until either store 2's replica of _P_ applies the
merge command (to _PQ_) and the split command (to _P_ and _Q'_), or store 2's
replica of _P_ is rebalanced away.

### Replica GC

Per the discussion in the last section, we use the replica of the RHS as a lock
on the keyspace extension. This means that we need to be careful not to GC this
replica too early.

It's easiest to see why this is a problem if we consider the case where one
replica is extremely slow in applying a merge:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
|  PQ |   |  PQ |   |     |   | P Q |
+-----+   +-----+   +-----+   +-----+

PQ: (s1, s2, s4)
```

Here, _P_ and _Q_ have just merged. Store 4 hasn't yet processed the merge while
stores 1 and 2 have.

The replica GC queue is continually scanning for replicas that are no longer a
member of their range. What if the replica GC queue on store 4 scans its replica
of _Q_ at this very moment? It would notice that the _Q_ range has been merged
away and, conceivably, conclude that _Q_ could be garbage collected. This would
be disastrous, as when _P_ finally applied the merge trigger it would no longer
have a replica of _Q_ to subsume!

One potential solution would be for the replica GC queue to refuse to GC
replicas for ranges that have been merged away. But that could result in
replicas getting permanently stuck. Suppose that, before store 4 applies the
merge transaction, the _PQ_ range is rebalanced away to store 3:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
|  PQ |   |  PQ |   |  PQ |   | P Q |
+-----+   +-----+   +-----+   +-----+

PQ: (s1, s2, s3)
```

Store 4's replica of _P_ will likely never hear about the merge, as it is no
longer a member of the range and therefore not receiving any additional Raft
messages from the leader, so it will never subsume _Q_. The replica GC queue
_must_ be capable of garbage collecting _Q_ in this case. Otherwise _Q_ will
be stuck on store 4 forever, permanently preventing the store from ever
acquiring a new replica that overlaps with that keyspace.

Solving this problem turns out to be quite tricky.
What the replica GC queue
wants to know when it discovers that _Q_'s range has been subsumed is whether
the local replica of _Q_ might possibly still be subsumed by its local left
neighbor _P_. It can't just ask the local _P_ whether it's about to apply a
merge, since _P_ might be lagging behind, as it is here, and have no idea that a
merge is about to occur.

So the problem reduces to proving that _P_ cannot apply a merge trigger that
will subsume _Q_. The chosen approach is to fetch the current range descriptor
for _P_ from the meta index. If that descriptor exactly matches the local
descriptor, thanks to [range descriptor generations](#range-descriptor-generations),
we are assured that there are no merge triggers that _P_ has yet to apply, and
_Q_ can safely be GC'd.
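In sketch form, reusing the simplified descriptor type and the
`descriptorChanged` helper from earlier (the meta lookup and plumbing are
omitted, and the function name is illustrative):

```go
// canGCSubsumedReplica sketches the replica GC queue's safety check for a
// replica whose range has been merged away. localLeftNeighbor is the store's
// local replica of the left neighbor, if any, and metaLeftNeighbor is the
// descriptor currently recorded in the meta index for that span.
func canGCSubsumedReplica(localLeftNeighbor *RangeDescriptor, metaLeftNeighbor RangeDescriptor) bool {
	if localLeftNeighbor == nil {
		// No local left neighbor exists that could still apply a merge
		// trigger subsuming this replica. (The additional subtlety below
		// explains why this also relies on every LHS replica being
		// initialized before a merge starts.)
		return true
	}
	// If the local copy of the left neighbor is as new as the meta copy,
	// there is no merge trigger it has yet to apply, so the subsumed replica
	// is safe to destroy.
	return !descriptorChanged(*localLeftNeighbor, metaLeftNeighbor)
}
```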
Note that it is possible to form long chains of replicas that can only be GC'd
from left to right; the GC queue is not aware of these dependencies and
therefore processes such chains extremely inefficiently (i.e., by processing
replicas in an arbitrary order instead of the necessary order). These chains
turn out to be extremely rare in practice.

There is one additional subtlety here. Suppose we have two adjacent ranges, _O_
and _Q_. _O_ has just split into _O_ and _P_, but store 3 is lagging and has not
yet processed the split.

```
STATE 1

 Store 1     Store 2     Store 3     Store 4
+-------+   +-------+   +-------+   +-------+
| O P Q |   | O P Q |   | O   Q |   |       |
+-------+   +-------+   +-------+   +-------+

O: (s1, s2, s3)
P: (s1, s2, s3)
Q: (s1, s2, s3)
```

At this point, suppose the leader for the new range _P_ decides that store 3
will need a snapshot to catch up, and starts sending the snapshot over the
network. This will be important later. At the same time, _P_ and _Q_ merge while
store 3 is still lagging.

It may seem strange that this merge is permitted, but notice how the replica
sets are aligned according to the descriptors, even though store 3 does not
physically have a replica of _P_ yet. Here's the new state of the world:

```
STATE 2

 Store 1     Store 2     Store 3     Store 4
+-------+   +-------+   +-------+   +-------+
| O PQ  |   | O PQ  |   | O   Q |   |       |
+-------+   +-------+   +-------+   +-------+

O: (s1, s2, s3)
PQ: (s1, s2, s3)
```

Finally, _O_ is rebalanced from store 3 to store 4, and store 3's now-stale
replica of _O_ is garbage collected:

```
STATE 3

 Store 1     Store 2     Store 3     Store 4
+-------+   +-------+   +-------+   +-------+
| O PQ  |   | O PQ  |   |     Q |   | O     |
+-------+   +-------+   +-------+   +-------+

O: (s1, s2, s4)
PQ: (s1, s2, s3)
```

The replica GC queue might reasonably think that store 3's replica of _Q_ is out
of date, as _Q_ has no left neighbor that could subsume it. But, at any moment
in time, store 3 could finish receiving the snapshot for _P_ that was started
between state 1 and state 2. Crucially, this snapshot predates the merge, so it
will need to apply the merge trigger... and the replica for _Q_ had better be
present on the store!

This hazard is avoided by requiring that all replicas of the LHS are initialized
before a merge begins. This prevents a transition from state 1 to state 2, as
the merge of _P_ and _Q_ cannot occur until store 3 initializes its replica of
_P_. The AdminMerge command will wait a few seconds in the hope that store 3
catches up quickly; otherwise, it will refuse to launch the merge transaction.
It is therefore impossible to end up in a dangerous state, like state 3, and it
is thus safe for the replica GC queue to GC _Q_ if its left neighbor is
generationally up to date.

### Transaction record GC

The merge watcher goroutine needs to wait until the merge transaction completes,
and determine whether or not the transaction committed or aborted. This turns
out to be brutally complicated, thanks to the aggressive garbage collection of
transaction records.

The watcher goroutine begins by sending a PushTxn request to the merge
transaction. It can easily discover the necessary arguments for the PushTxn
request, that is, the ID and key of the current merge transaction record,
because they're recorded in the intent that the merge transaction has left on
the RHS's local copy of the descriptor.

If the PushTxn request reports that the merge transaction committed, we're
guaranteed that the merge transaction did, in fact, complete. That means that we
can mark the RHS replica as destroyed, bounce all requests back to DistSender
(where they'll get retried on the subsuming range), and clean up the watcher
goroutine. The RHS replica will be cleaned up either when the LHS replica
subsumes it or when the replica GC queue notices that it has been abandoned.

If the PushTxn request instead reports that the merge transaction aborted, we're
not guaranteed that the merge transaction actually aborted. The merge
transaction may have committed so quickly that its transaction record was
garbage collected before our PushTxn request arrived. The PushTxn incorrectly
interprets this state to mean that the transaction was aborted, when, in fact,
it was committed and GCed. To be fair, we're somewhat abusing the PushTxn
request here. Outside of merges, a PushTxn request is only sent when a pending
intent is discovered, and transaction records can't be GCed until all their
intents have been resolved.

So we need some way to determine whether a merge transaction was actually
aborted. What we do is look for the effects of the merge transaction in meta2.
If the merge aborted, we'll see our own range descriptor, with our range ID, in
meta2. By contrast, if the merge committed, we'll see a range descriptor for a
different range in meta2.

This complexity is extremely unfortunate, and turns what should be a simple
goroutine, spawned on the RHS leaseholder for every merge transaction

```go
go func() {
    <-txn.Done() // wait for txn to complete
    if txn.Committed() {
        repl.MarkDestroyed("replica subsumed")
    }
    repl.UnblockRequests()
}()
```

into [150 lines of hard-to-follow code][code].

[code]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/replica.go#L2813-L2920
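Stripped of error handling and retries, the fate-determination logic looks like
the following sketch. The push and meta2 lookup are passed in as functions
because the real machinery (a PushTxn request and a meta2 scan) is far more
involved; the descriptor type is the simplified one from earlier.

```go
type txnStatus int

const (
	statusCommitted txnStatus = iota
	statusAborted
)

// mergeCommitted sketches how the merge watcher goroutine decides the merge
// transaction's fate for a subsumed RHS replica.
func mergeCommitted(
	rhs RangeDescriptor,
	pushMergeTxn func() (txnStatus, error),
	readMeta2 func(endKey string) (RangeDescriptor, error),
) (bool, error) {
	status, err := pushMergeTxn()
	if err != nil {
		return false, err
	}
	if status == statusCommitted {
		// A committed report can be trusted outright.
		return true, nil
	}
	// An aborted report cannot be trusted: the transaction may have committed
	// and had its record GC'd before the push arrived. Look for the merge's
	// effects instead. If meta2 still lists the RHS's own descriptor, the
	// merge really did abort; if it lists a different range, the merge
	// committed and the RHS has been subsumed.
	current, err := readMeta2(rhs.EndKey)
	if err != nil {
		return false, err
	}
	return current.RangeID != rhs.RangeID, nil
}
```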
### Unanimity

The largest conceptual incongruity with the current merge implementation is the
fact that it requires unanimous consent from all replicas, instead of a majority
quorum, like everything else in Raft. Further confusing matters, only the RHS
needs unanimous consent; a merge can proceed with only majority consent from the
LHS. In fact, it's even a bit more subtle: while only a majority of the LHS
replicas need to vote on the merge command, all LHS replicas need to confirm
that they are initialized for the merge to start.

There is no theoretical reason that merges need unanimous consent, but the
complexity of the implementation quickly skyrockets without it. For example,
suppose you adjusted the transfer of power so that only a majority of replicas
on the RHS need to be fully up to date before the merge commits. Now, when
applying the merge trigger, the LHS needs to check to see if its copy of the RHS
is up to date; if it's not, the LHS needs to throw away its entire Raft state
and demand a snapshot from the leader. This is both unsightly—our code is worse
off every time we reach into Raft—and less efficient than the existing
implementation, as it requires sending a potentially multi-megabyte snapshot if
one replica of the RHS is just a little bit behind in applying the latest
commands.

It's possible that these problems could be mitigated while retaining the ability
to merge with a minority of replicas offline, but an obvious solution did not
present itself. On the bright side, having too many ranges is unlikely to cause
acute performance problems; that is, a situation where a merge is critical to
the health of a cluster is difficult to imagine. Unlike large ranges, which can
appear suddenly and require an immediate split or log truncation, merges are
only required when there are on the order of tens of thousands of excessively
small ranges, which takes a long time to build up.

## Safety recap

This section is a recap of the various mechanisms, which are described in
detail above, that work together to ensure that merges do not violate
consistency.

The first broad safety mechanism is replica set alignment, which is required so
that every store participating in the merge has a copy of both halves of the
data in the merged range. Replica sets are first optimistically aligned by the
merge queue. The replicas might drift apart, e.g., because the ranges in
question were also targeted for a rebalance by the replicate queue, so the
merge transaction verifies that the replica sets are still aligned from within
the transaction. If a concurrent split or rebalance were to occur on the
implicated ranges, transactional isolation kicks in and aborts one of the
transactions, so we know that the replica sets are still aligned at the moment
that the merge commits.

Crucially, we need to maintain alignment until the merge applies on all replicas
that were present at the time of the merge. This is enforced by refusing to
GC a replica of the RHS of a merge unless it can be proven that the store does
not have a replica of the LHS that predates the merge, _nor_ will it acquire
a replica of the LHS that predates the merge. Proving that it does not currently
have a replica of the LHS that predates the merge is fairly straightforward:
we simply prove that the local left neighbor's generation is the newest
generation, as indicated by the LHS's meta descriptor. Proving that the store
will _never_ acquire a replica of the LHS that predates the merge is
harder—there could be a snapshot in flight that the store is entirely unaware of.
So instead we require that replicas of the LHS in a merge are initialized on
every store before the merge can begin.

The second broad safety mechanism is the range freeze. This ensures that the
subsuming range and the subsumed range do not serve traffic at the same time,
which would lead to clear consistency violations. The mechanism works by tying
the freeze to the lifetime of the merge transaction; the merge will not commit
until all replicas of the RHS are verified to be frozen, and the replicas of the
RHS will not unfreeze unless the merge transaction is verified to be aborted.
Lease transfers are freeze-aware, so the freeze will persist even if the lease
moves around on the RHS during the merge or if the leaseholder restarts. The
implementation of the freeze (ab)uses the span latch manager to flush out
in-flight commands on the RHS, an intent on the local range descriptor to
ensure the freeze persists if the lease is transferred, and an RPC that
repeatedly polls the RHS to wait until it is fully caught up.

## Appendix

The appendix contains digressions that are not directly pertinent to range
merges, but are not covered in documentation elsewhere.

### Key encoding oddities

Lexicographic ordering of keys of unbounded length has the interesting property
that it is always possible to construct the key that immediately succeeds a
given key, but it is not always possible to construct the key that immediately
precedes a given key.

In the following diagrams `\xNN` represents a byte whose value in hexadecimal is
`NN`. The null byte is thus `\x00` and the maximal byte is thus `\xff`.

Now consider a sequence of keys that has no gaps:

```
a
a\x00
a\x00\x00
```

No gaps means that there are no possible keys that can sort between any of the
members of the sequence. For example, there is, simply, no key that sorts
between `a` and `a\x00`.

Because we can construct such a sequence, we must have next and previous
operations over the sequence, which, given a key, construct the immediately
following key and the immediately preceding key, respectively. We can see from
the diagram that the next operation appends a null byte (`\x00`), while the
previous operation strips off that null byte.

But what if we want to perform the previous operation on a key that does not end
in a null byte? For example, what is the key that immediately precedes `b`? It's
not `a`, because `a\x00` sorts between `a` and `b`. Similarly, it's not `a\xff`,
because `a\xff\xff` sorts between `a\xff` and `b`. This process continues
inductively until we conclude the key that immediately precedes `b` is
`a\xff\xff\xff...`, where there are an infinite number of trailing `\xff` bytes.

It is not possible to represent this key in CockroachDB without infinite space.
You could imagine designing the key encoding with an additional bit that means,
"pretend this key has an infinite number of trailing maximal bytes," but
CockroachDB does not have such a bit.

The upshot is that it is trivial to advance in the keyspace using purely
lexical operations, but it is impossible to reverse in the keyspace with purely
lexical operations.
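The cheap direction is purely lexical; a minimal sketch (the real code has an
equivalent helper on `roachpb.Key`):

```go
// nextKey returns the key that immediately follows k in lexicographic order:
// k with a null byte appended. No key can sort between k and nextKey(k).
func nextKey(k []byte) []byte {
	next := make([]byte, len(k)+1)
	copy(next, k)
	next[len(k)] = 0x00
	return next
}

// There is no corresponding prevKey: the key immediately preceding "b" is
// "a\xff\xff\xff..." with infinitely many trailing 0xff bytes, which cannot be
// represented in finite space.
```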
This problem pervades the system. Given a range that spans from `StartKey`,
inclusive, to `EndKey`, exclusive, it is trivial to address a request to the
following range, but *not* the preceding range. To route a request to a range,
we must construct a key that lives inside that range. Constructing such a key
for the following range is trivial, as the end key of a range is, by definition,
contained in the following range. But constructing such a key for the preceding
range would require constructing the key that immediately precedes `StartKey`,
which is not possible with CockroachDB's key encoding.