
     1  # Range merges
     2  
     3  **Last update:** April 10, 2019
     4  
     5  **Original author:** Nikhil Benesch
     6  
     7  This document serves as an end-to-end description of the implementation of range
merges. The target reader is someone who is reasonably familiar with
CockroachDB's core (KV) layer but unfamiliar with either the "how" or the "why"
of the range merge implementation.
    11  
    12  The most complete documentation, of course, is in the code, tests, and the
    13  surrounding comments, but those pieces are necessarily split across several
    14  files and packages. That scattered knowledge is centralized here, without
    15  excessive detail that is likely to become stale.
    16  
    17  ## Table of Contents
    18  
    19  * [Overview](#overview)
    20  * [Implementation details](#implementation-details)
    21    * [Preconditions](#preconditions)
    22    * [Initiating a merge](#initiating-a-merge)
    23      * [AdminMerge race](#adminmerge-race)
    24    * [Merge transaction](#merge-transaction)
    25      * [Transfer of power](#transfer-of-power)
    26    * [Snapshots](#snapshots)
    27    * [Merge queue](#merge-queue)
    28  * [Subtle complexities](#subtle-complexities)
  * [Range descriptor generations](#range-descriptor-generations)
    30    * [Misaligned replica sets](#misaligned-replica-sets)
    31    * [Replica GC](#replica-gc)
    32    * [Transaction record GC](#transaction-record-gc)
    33    * [Unanimity](#unanimity)
    34  * [Safety recap](#safety-recap)
    35  * [Appendix](#appendix)
    36    * [Key encoding oddities](#key-encoding-oddities)
    37  
    38  ## Overview
    39  
    40  A range merge begins when two adjacent ranges are selected to be merged
    41  together. For example, suppose our adjacent ranges are _P_ and _Q_, somewhere in
    42  the middle of the keyspace:
    43  
    44  ```
    45  --+-----+-----+--
    46    |  P  |  Q  |
    47  --+-----+-----+--
    48  ```
    49  
    50  We'll call _P_ the left-hand side (LHS) of the merge, and _Q_ the right-hand
    51  side (RHS) of the merge. For reasons that will become clear later, we also refer
    52  to _P_ as the subsuming range and _Q_ as the subsumed range.
    53  
    54  The merge is coordinated by the LHS. The coordinator begins by verifying that a)
    55  the two ranges are, in fact, adjacent, and b) that the replica sets of the two
    56  ranges are aligned. Replica set alignment is a term that is currently only
    57  relevant to merges; it means that the set of stores with replicas of the LHS
    58  exactly matches the set of stores with replicas of the RHS. For example, this
    59  replica set is aligned:
    60  
    61  ```
    62  Store 1    Store 2     Store 3     Store 4
    63  +-----+    +-----+     +-----+     +-----+
    64  | P Q |    | P Q |     |     |     | P Q |
    65  +-----+    +-----+     +-----+     +-----+
    66  ```
    67  
    68  By requiring replica set alignment, the merge operation is reduced to a metadata
    69  update, albeit a tricky one, as the stores that will have a copy of the merged
    70  range _PQ_ already have all the constituent data, by virtue of having a copy of
    71  both _P_ and _Q_ before the merge begins. Note that replicas of _P_ and _Q_ do
    72  not need to be fully up-to-date before the merge begins; they'll be caught up as
    73  necessary during the [transfer of power](#transfer-of-power).
    74  
    75  After verifying that the merge is sensible, the coordinator transactionally
updates the implicated range descriptors, adjusting _P_'s range descriptor so that
    77  it extends to _Q_'s end and deleting _Q_'s range descriptor.
    78  
    79  Then, the coordinator needs to [atomically move
    80  responsibility](#transfer-of-power) for the data in the RHS to the LHS. This is
    81  tricky, as the lease on the LHS may be held by a different store than the lease
on the RHS. The coordinator notifies the RHS that it is about to be subsumed
and that it is prohibited from serving any additional read or write traffic.
Only when
    84  the coordinator has received an acknowledgement from _every_ replica of the RHS,
    85  indicating that no traffic is possibly being served on the RHS, does the
    86  coordinator commit the merge transaction.
    87  
    88  Like with splits, the merge transaction is committed with a special "commit
    89  trigger" that instructs the receiving store to update its in-memory bookkeeping
    90  to match the updates to the range descriptors in the transaction. The moment the
    91  merge transaction is considered committed, the merge is complete!
    92  
    93  At the time of writing, merges are only initiated by the merge queue, which is
    94  responsible both for locating ranges that are in need of a merge and aligning
    95  their replica sets before initiating the merge.
    96  
    97  The remaining sections cover each of these steps in more detail.
    98  
    99  ## Implementation details
   100  
   101  ### Preconditions
   102  
Not just any two ranges can be merged. The first and most obvious criterion is
that the two ranges must be adjacent. Consider a simplified cluster that has
only three ranges, _A_, _B_, and _C_:
   106  
   107  ```
   108  +-----+-----+-----+
   109  |  A  |  B  |  C  |
   110  +-----+-----+-----+
   111  ```
   112  
   113  Ranges _A_ and _B_ can be merged, as can ranges _B_ and _C_, but not ranges _A_
   114  and _C_, as they are not adjacent. Note that adjacent ranges are equivalently
   115  referred to as "neighbors", as in, range _B_ is range _A_'s right-hand neighbor.
   116  
   117  The second criterion is that the replica sets must be aligned. To illustrate,
   118  consider a four node cluster with the default 3x replication. The allocator has
   119  attempted to balance ranges as evenly as possible:
   120  
   121  ```
   122  Node  1    Node  2     Node  3     Node  4
   123  +-----+    +-----+     +-----+     +-----+
   124  | P Q |    |  P  |     |  Q  |     | P Q |
   125  +-----+    +-----+     +-----+     +-----+
   126  ```
   127  
   128  Notice how node 2 does not have a copy of _Q_, and node 3 does not have a copy
   129  of _P_. These replica sets are considered "misaligned." Aligning them requires
rebalancing _Q_ from node 3 to node 2, or rebalancing _P_ from node 2 to node 3:
   131  
   132  ```
   133  Node  1    Node  2     Node  3     Node  4
   134  +-----+    +-----+     +-----+     +-----+
   135  | P Q |    | P Q |     |     |     | P Q |
   136  +-----+    +-----+     +-----+     +-----+
   137  
   138  Node  1    Node  2     Node  3     Node  4
   139  +-----+    +-----+     +-----+     +-----+
   140  | P Q |    |     |     | P Q |     | P Q |
   141  +-----+    +-----+     +-----+     +-----+
   142  ```
   143  
   144  We explored an alternative merge implementation that did not require aligned
   145  replica sets, but found it to be unworkable. See the [misaligned replica sets
   146  misstep](#misaligned-replica-sets) for details.
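
To make the alignment requirement concrete, here is a minimal, self-contained
sketch of an alignment check. The `ReplicaDescriptor` type and the
`replicaSetsAligned` helper below are simplified stand-ins for illustration,
not the actual types used by the merge code.

```go
package example

// ReplicaDescriptor is a simplified stand-in for the real descriptor type: it
// identifies the store that holds a replica of a range.
type ReplicaDescriptor struct {
	StoreID int
}

// replicaSetsAligned reports whether every store with a replica of the LHS
// also has a replica of the RHS, and vice versa. Order does not matter.
func replicaSetsAligned(lhs, rhs []ReplicaDescriptor) bool {
	if len(lhs) != len(rhs) {
		return false
	}
	lhsStores := make(map[int]bool, len(lhs))
	for _, r := range lhs {
		lhsStores[r.StoreID] = true
	}
	for _, r := range rhs {
		if !lhsStores[r.StoreID] {
			return false
		}
	}
	return true
}
```

In the misaligned example above, the check fails because node 2 appears only in
_P_'s replica set and node 3 only in _Q_'s.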
   147  
   148  ### Initiating a merge
   149  
A merge is initiated by sending an AdminMerge request to a range. Like other
   151  admin commands, DistSender will automatically route the request to the
   152  leaseholder of the range, but there is no guarantee that the store will retain
   153  its lease while the admin command is executing.
   154  
Note that an AdminMerge request takes no arguments, as there is no choice in
what ranges will be merged. The recipient of the AdminMerge will always be the
LHS, i.e., the subsuming range, and its right neighbor at the moment that the
AdminMerge command begins executing will always be the RHS, i.e., the subsumed
range.
   159  
   160  It would have been reasonable to have instead used the RHS to coordinate the
   161  merge. That is, the RHS would have been the subsuming range, and the LHS would
   162  have been the subsumed range. Using the LHS to coordinate, however, yields a
   163  nice symmetry with splits, where the range that coordinates a split becomes the
   164  LHS of the split. Maintaining this symmetry means that a range's start key never
   165  changes during its lifetime, while its end key may change arbitrarily in
   166  response to splits and merges.
   167  
There is another reason to prefer using the LHS to coordinate, which involves an
oddity of key encoding and range bounds. It is trivial for a range to send a
   170  request to its right neighbor, as it simply addresses the request to its end
   171  key, but it is difficult to send a request to its left neighbor, as there is no
   172  function to get the key that immediately precedes the range's start key. See the
   173  [key encoding oddities](#key-encoding-oddities) section of the appendix for
   174  details.
   175  
   176  At the time of writing, only the [merge queue](#merge-queue) initiates merges,
   177  and it does so by bypassing DistSender and invoking the AdminMerge command
   178  directly on the local replica. At some point in the future, we may wish to
   179  expose manual merges via SQL, at which point the SQL layer will need to send
   180  proper AdminMerge requests through the KV API.
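
For illustration, issuing a merge boils down to a single call addressed to a
key inside the LHS. The sketch below assumes a KV client that exposes an
`AdminMerge` helper; the name and signature are illustrative rather than the
exact client API.

```go
package example

import "context"

// adminMerger abstracts the single KV API call we need. The real client
// exposes a similar helper, but the exact signature here is an assumption.
type adminMerger interface {
	// AdminMerge asks the range containing key to merge with its right
	// neighbor. There is no way to name the RHS explicitly.
	AdminMerge(ctx context.Context, key []byte) error
}

// mergeWithRightNeighbor initiates a merge of the range containing lhsKey with
// its right-hand neighbor. The recipient is always the LHS; whatever range
// happens to be its right neighbor when the request executes becomes the RHS.
func mergeWithRightNeighbor(ctx context.Context, db adminMerger, lhsKey []byte) error {
	return db.AdminMerge(ctx, lhsKey)
}
```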
   181  
   182  #### AdminMerge race
   183  
   184  At present, AdminMerge requests are subject to a small race. It is possible for
   185  the ranges implicated by an AdminMerge request to split or merge between when
   186  the client decides to send an AdminMerge request and when the AdminMerge request
   187  is processed.
   188  
   189  For example, suppose the client decides that _P_ and _Q_ should be merged and
   190  sends an AdminMerge request to _P_. It is possible that, before the AdminMerge
   191  request is processed, _P_ splits into _P<sub>1</sub>_ and _P<sub>2</sub>_. The
   192  AdminMerge request will thus result in _P<sub>1</sub>_ and _P<sub>2</sub>_
   193  merging together, and not the desired _P_ and _Q_.
   194  
   195  The race could have been avoided if the AdminMerge request required that the
   196  descriptors for the implicated ranges were provided as arguments to the request.
   197  Then the merge could be aborted if the merge transaction discovered that either
   198  of the implicated ranges did not match the corresponding descriptor in the
   199  AdminMerge request arguments, forming a sort of optimistic lock.
   200  
   201  Fortunately, the race is rare in practice. If it proves to be a problem, the
   202  scheme described above would be easy to implement while maintaining backwards
   203  compatibility.
   204  
   205  ### Merge transaction
   206  
   207  The merge transaction piggybacks on CockroachDB's serializability to provide
   208  much of the necessary synchronization for the bookkeeping updates. For example,
   209  merges cannot occur concurrently with any splits or replica changes on the
implicated ranges, because the merge transaction will naturally conflict with
the corresponding split and replica-change transactions, as both transactions
will attempt to write updated range descriptors. No additional code
   213  was needed to enforce this, as our standard transaction conflict detection
   214  mechanisms kick in here (write intents, the timestamp cache, the span latch
   215  manager, etc.).
   216  
   217  Note that there was one surprising synchronization problem that was not
   218  immediately handled by serializability. See [range descriptor
   219  generations](#range-descriptor-generations) for details.
   220  
   221  The standard KV operations that the merge transaction performs are:
   222  
   223    * Reading the LHS descriptor and RHS descriptor, and verifying that their
   224      replica sets are aligned.
   225    * Updating the local and meta copy of the LHS descriptor to reflect
   226      the widened end key.
   227    * Deleting the local and meta copy of the RHS descriptor.
   228    * Writing an entry to the `system.rangelog` table.
   229  
   230  These operations are the essence of a merge, and in fact update all necessary
   231  on-disk data! All the remaining complexity exists to update in-memory metadata
   232  while the cluster is live.
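
In sketch form, the transactional skeleton might look like the following. The
types, key arguments, and helpers are simplified stand-ins, not the real
implementation; the write ordering (local LHS descriptor first) matters for the
reasons explained in the next paragraphs.

```go
package example

import (
	"context"
	"errors"
)

var errMisaligned = errors.New("replica sets are not aligned")

// rangeDescriptor is a stand-in for the fields that matter in this sketch.
type rangeDescriptor struct {
	StartKey, EndKey []byte
	Replicas         []int // store IDs
}

// kvTxn is a stand-in for the transactional KV operations used by the merge.
type kvTxn interface {
	Get(ctx context.Context, key []byte) (rangeDescriptor, error)
	Put(ctx context.Context, key []byte, desc rangeDescriptor) error
	Del(ctx context.Context, key []byte) error
}

func runMergeTxnSketch(ctx context.Context, txn kvTxn,
	lhsLocalKey, lhsMetaKey, rhsLocalKey, rhsMetaKey, rangeLogKey []byte) error {
	// Read both descriptors and verify that their replica sets are aligned.
	lhs, err := txn.Get(ctx, lhsLocalKey)
	if err != nil {
		return err
	}
	rhs, err := txn.Get(ctx, rhsLocalKey)
	if err != nil {
		return err
	}
	if !aligned(lhs.Replicas, rhs.Replicas) {
		return errMisaligned
	}
	// Widen the LHS to the RHS's end key. The local copy is the transaction's
	// first write so that the transaction record is anchored on the LHS.
	merged := lhs
	merged.EndKey = rhs.EndKey
	if err := txn.Put(ctx, lhsLocalKey, merged); err != nil {
		return err
	}
	if err := txn.Put(ctx, lhsMetaKey, merged); err != nil {
		return err
	}
	// Delete the local and meta copies of the RHS descriptor.
	if err := txn.Del(ctx, rhsLocalKey); err != nil {
		return err
	}
	if err := txn.Del(ctx, rhsMetaKey); err != nil {
		return err
	}
	// Record the merge in system.rangelog (represented here as a plain key).
	return txn.Put(ctx, rangeLogKey, merged)
}

// aligned reports whether two replica sets cover the same stores.
func aligned(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[int]bool, len(a))
	for _, s := range a {
		seen[s] = true
	}
	for _, s := range b {
		if !seen[s] {
			return false
		}
	}
	return true
}
```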
   233  
Note that the merge transaction's KV operations are not fundamentally dependent
on one another and so could theoretically be performed in any order. There are,
however,
   236  several implementation details that enforce some ordering constraints.
   237  
   238  First, the merge transaction record needs to be located on the LHS.
   239  Specifically, the transaction record needs to live on the subsuming range, as
   240  the commit trigger that actually applies the merge to the replica's in-memory
   241  state runs on the range where the transaction record lives. The transaction
   242  record is created on the range that the transaction writes first; therefore, the
   243  merge transaction is careful to update the local copy of the LHS descriptor as
   244  its first operation, since the local copy of the LHS descriptor lives on the
   245  LHS.
   246  
   247  Second, the merge transaction must ensure that, when it issues the delete
   248  request to remove the local copy of the RHS descriptor, the resulting intent is
   249  actually written to disk. (See the [transfer of power](#transfer-of-power)
   250  subsection for why this is required.) Thanks to [transactional
   251  pipelining][#26599], KV writes can return early, before their intents have
   252  actually been laid down. The intents are not required to make it to disk until
   253  the moment before the transaction commits. The merge transaction simply disables
   254  pipelining to avoid this hazard.
   255  
   256  As the last step before the commit, the merge transaction needs to freeze the
   257  RHS, then wait for _every_ replica of the RHS to apply all outstanding commands.
   258  This ensures that, when the merge commits, every LHS replica can blindly assume
   259  that it has perfectly up-to-date data for the RHS. To quickly recap, this is
   260  guaranteed because 1) the replica sets were aligned when the merge transaction
   261  started, 2) rebalances that would misalign the replica sets will conflict with
   262  the merge transaction, causing one of the transactions to abort, 3) the RHS is
   263  frozen and cannot process any new commands, and 4) every replica of the RHS is
   264  caught up on all commands. The process of freezing the RHS and waiting for every
   265  replica to catch up is covered more thoroughly in the [transfer of
   266  power](#transfer-of-power) subsection.
   267  
   268  Finally, the merge transaction commits, attaching a special [merge commit
trigger] to the end transaction request. This trigger has four
responsibilities:
   271  
   272    1. It ensures the end transaction request knows which intents it can resolve
   273       locally. Intents that live on the RHS range would naively appear to belong
   274       to a different range than the one containing the transaction record (i.e.,
   275       the LHS), but if the merge is committing then the LHS is subsuming the RHS
   276       and thus the intents can be resolved locally.
   277  
   278       In fact, it's critical that these intents are considered local, because
   279       local intents are resolved synchronously while remote intents are resolved
   280       asynchronously. We need to maintain the invariant that, when a store boots
   281       up and discovers an intent on its local copy of a range descriptor, it can
   282       simply ignore the intent. Because we enforce that these intents are
   283       resolved synchronously with the commit of the merge transaction, we are
   284       guaranteed that, if we see an intent on a local range descriptor, this
   285       replica has not yet applied the `EndTransaction` request for the merge
   286       transaction, and it is therefore safe to load the replica. If the intent
   287       were instead resolved asynchronously, we could observe the state where the
   288       `EndTransaction` request for the merge had applied but the intent
   289       resolution had not applied, in which case we would attempt to load both the
   290       post-merge subsuming replica, and the subsumed replica, which would overlap
   291       and crash the node.
   292  
  2. It adjusts the LHS's MVCCStats to incorporate the subsumed range's
     MVCCStats (see the sketch after this list).
   295  
  3. It copies necessary range-ID local data from the RHS to the LHS, rekeying
     each key to use the LHS's range ID. At the moment, the only necessary data
     is the
   298       [transaction abort span].
   299  
   300    4. It attaches a [merge payload to the replicated proposal result][pd-flag].
   301       When each replica of the LHS applies the command, it will notice the merge
   302       payload and adjust the store's in-memory state to match the changes to the
   303       on-disk state that were committed by the transaction. This entails
   304       atomically removing the RHS replica from the store and widening the LHS
   305       replica.
   306  
   307       This operation involves a delicate dance of acquiring store locks and locks
   308       for both replicas, in a certain order, at various points in the Raft
   309       command application flow. The details are too intricate to be worth
   310       describing here, especially considering that these tangled interactions
   311       between a store and its replicas are due for a refactor. The best thing to
   312       do, if you're interested in the details, is to trace through all references
   313       to `storagepb.ReplicatedEvalResult.Merge`.
   314  
   315  [#26599]: https://github.com/cockroachdb/cockroach/pull/26599
   316  [transaction abort span]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/abortspan/abortspan.go
   317  [merge commit trigger]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/batcheval/cmd_end_transaction.go#L984-L994
   318  [pd-flag]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/batcheval/cmd_end_transaction.go#L1033-L1035
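
Responsibility 2 is conceptually the simplest: the subsumed range's MVCC stats
are folded field-by-field into the subsuming range's stats. Here is a minimal
sketch with a stand-in stats type (only a few fields shown), rather than the
real MVCCStats:

```go
package example

// mvccStatsSketch is a stand-in for the real MVCC stats type.
type mvccStatsSketch struct {
	LiveBytes, KeyBytes, ValBytes int64
	KeyCount, ValCount            int64
}

// add folds the RHS's stats into the LHS's stats when the merge trigger runs,
// so that the widened LHS accounts for all of the data it now contains.
func (ms *mvccStatsSketch) add(rhs mvccStatsSketch) {
	ms.LiveBytes += rhs.LiveBytes
	ms.KeyBytes += rhs.KeyBytes
	ms.ValBytes += rhs.ValBytes
	ms.KeyCount += rhs.KeyCount
	ms.ValCount += rhs.ValCount
}
```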
   319  
   320  #### Transfer of power
   321  
   322  The trickiest part of the merge transaction is making the LHS responsible for
   323  the keyspace that was previously owned by the RHS. The transfer of power must be
   324  performed atomically; otherwise, all manner of consistency violations can occur.
   325  This is hopefully obvious, but here's a quick example to drive the point home.
   326  Suppose _P_ and _Q_ simultaneously consider themselves responsible for the key
   327  _q_. _Q_ could then allow a write to _q_ at time 1 at the same time that _P_
   328  allowed a read of _q_ at time 2. Consistency requires that either the read see
   329  the write, as the read is executing at a higher timestamp, or that the write is
   330  bumped to time 3. But because _P_ and _Q_ have separate span latch managers, no
   331  synchronization will occur, and the read might fail to see the write!
   332  
   333  The transfer of power is complicated by the fact that there is no guarantee that
   334  the leases on the LHS and the RHS will be aligned, nor is there any
   335  straightforward way to provide such a guarantee. (Aligned leaseholders would
   336  allow the merge transaction to use a simple in-memory lock to make the transfer
   337  of power atomic.) The leaseholder of either range might fail at any moment, at
   338  which point the lease can be acquired, after it expires, by any other live
   339  member of the range.
   340  
Since aligning the leaseholders is infeasible, the merge transaction implements
   342  what is essentially a distributed lock.
   343  
   344  The lock is initialized by sending a [Subsume][subsume-request] request to the
   345  RHS. This is a single-purpose request that exists solely for use in the merge
   346  transaction. It is unlikely to ever be useful in another situation, and (ab)uses
   347  several implementation details to provide the necessary synchronization.
   348  
   349  When the Subsume request returns, the RHS has made three important promises:
   350  
   351    1. There are no commands in flight.
   352    2. Any future requests will block until the merge transaction completes.
   353       If the merge transaction commits, the requests will be bounced with a
   354       RangeNotFound error. If the merge transaction aborts, the requests will
   355       be processed as normal.
   356    3. If the RHS loses its lease, the new leaseholder will adhere to promises
   357       1 and 2.
   358  
   359  The Subsume request provides promise 1 by declaring that it reads and writes all
   360  addressable keys in the range. This is a bald-faced lie, as the Subsume request
   361  only reads one key and writes no keys, but it forces synchronization with all
   362  latches in the span latch manager, as no other commands can possibly execute in
   363  parallel with a command that claims to write all keys.
   364  
   365  **TODO(benesch,nvanbenschoten):** Actually, concurrent reads at lower timestamps
   366  are permitted. Is this a problem? Maybe. Answering this question is difficult
   367  and requires reasoning about the causal chain established by the sequence of
   368  requests sent by the merge transaction.
   369  
   370  It provides promise 2 by flipping [a bit][merge-bit] on the replica that
   371  indicates that a subsumption is in progress. When the bit is active, the
   372  replica blocks processing of all requests.
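
Conceptually, the merge "bit" behaves like a gate that incoming requests wait
on. The sketch below models it with a channel; the names and fields are
illustrative (and synchronization is elided), not the actual replica state.

```go
package example

import (
	"context"
	"errors"
)

var errRangeMergedAway = errors.New("range was merged away")

// replicaSketch models only the merge-related state of a replica.
type replicaSketch struct {
	// mergeGate is non-nil while a subsumption is in progress; it is closed by
	// the watcher goroutine once the merge transaction completes.
	mergeGate chan struct{}
	// destroyed is set by the watcher if the merge committed.
	destroyed bool
}

// maybeWaitForMerge blocks a request while a subsumption is in progress. If
// the merge committed, the request is bounced so that DistSender retries it
// against the subsuming range; if it aborted, processing continues as normal.
func (r *replicaSketch) maybeWaitForMerge(ctx context.Context) error {
	gate := r.mergeGate
	if gate == nil {
		return nil // no merge in progress
	}
	select {
	case <-gate:
		if r.destroyed {
			return errRangeMergedAway
		}
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```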
   373  
   374  Importantly, the bit needs to be cleared when the merge transaction completes,
   375  so that the requests are not blocked forever. This is the responsibility of the
   376  [merge watcher goroutine][watcher]. Determining whether a transaction has
   377  committed or not is conceptually simple, but the details are brutally
   378  complicated. See the [transaction record GC](#transaction-record-gc) section
   379  for details.
   380  
   381  Note that the Subsume request only runs on the leaseholder, and therefore the
   382  merge bit is only set on the leaseholder and the watcher goroutine only runs on
   383  the leaseholder. This is perfectly safe, as none of the follower replicas can
   384  process requests.
   385  
   386  Promise 3 is actually not provided by the Subsume request at all, but by a hook
   387  in the lease acquisition code path. Whenever a replica acquires a lease, it
   388  checks to see whether its local descriptor has a deletion intent. If it does, it
   389  can infer that a subsumption is in progress, as nothing else leaves a deletion
   390  intent on a range descriptor. In that case, the replica, instead of serving
   391  traffic, flips the merge bit and launches its own merge watcher goroutine, just
   392  as the Subsume command would have. This means there can actually be multiple
   393  replicas of the RHS with the merge bit set and a merge watcher goroutine
   394  running—assuming the old leaseholder did not crash but lost its lease for other
   395  reasons—but this does not cause any problems.
   396  
   397  [subsume-request]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/batcheval/cmd_subsume.go#L56-L86
   398  [merge-bit]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/replica.go#L361-L364
   399  [watcher]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/replica.go#L2817-L2926
   400  
   401  ### Snapshots
   402  
A replica of the LHS might be advanced past the command that commits a merge by
a Raft snapshot.
   404  That means that all the complicated bookkeeping that normally takes place when a
   405  replica processes a command with a non-nil `ReplicatedEvalResult.Merge` is
   406  entirely skipped! Most problematically, the snapshot will need to widen the
   407  receiving replica, but there will be a replica of the RHS in the way—remember,
   408  this is guaranteed by replica set alignment. In fact, the snapshot could be
   409  advancing over several merges, or a combination of several merges and splits, in
   410  which case there will be several RHSes to subsume.
   411  
   412  This turns out to be relatively straightforward to handle. If an initialized
replica receives a snapshot that widens it, it can infer that a merge occurred,
   414  and it simply subsumes all replicas that are overlapped by the snapshot in one
   415  shot. This requires the same delicate synchronization dance, mentioned at the
end of the [merge transaction](#merge-transaction) section, to update bookkeeping
   417  information. After all, applying a widening snapshot is simply the bulk version
   418  of applying a merge command directly. The details are too complicated to go into
   419  here, but you can begin your own exploration by starting with this call to
   420  [`Replica.maybeAcquireSnapshotMergeLock`][code-start] and tracing how the
   421  returned `subsumedRepls` value is used.
   422  
   423  [code-start]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/replica.go#L4071-L4072
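
To see what "subsumes all replicas that are overlapped by the snapshot" means
in practice, here is a simplified sketch of identifying the overlapped
neighbors. The types are stand-ins, and the real logic additionally locks the
store and each subsumed replica.

```go
package example

import "bytes"

// span is a simplified [start, end) key span covered by a replica.
type span struct {
	Start, End []byte
}

// subsumedBySnapshot returns the spans of existing replicas on this store that
// fall entirely within the widened bounds carried by an incoming snapshot.
// These replicas must be removed as part of applying the snapshot, exactly as
// if their merges had been applied one at a time.
func subsumedBySnapshot(existing []span, current, snap span) []span {
	var subsumed []span
	for _, r := range existing {
		// Skip the replica that is being widened itself.
		if bytes.Equal(r.Start, current.Start) {
			continue
		}
		// A right-hand neighbor is subsumed if it starts at or after the
		// current end key and ends at or before the snapshot's new end key.
		if bytes.Compare(r.Start, current.End) >= 0 && bytes.Compare(r.End, snap.End) <= 0 {
			subsumed = append(subsumed, r)
		}
	}
	return subsumed
}
```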
   424  
   425  ### Merge queue
   426  
   427  The merge queue, like most of the other queues in the system, runs on every
   428  store and periodically scans all replicas for which that store holds the lease.
   429  For each replica, the merge queue evaluates whether it should be merged with
   430  its right neighbor. Looking rightward is a natural fit for the reasons
   431  described in the [key encoding oddities](#key-encoding-oddities) section of
   432  the appendix.
   433  
   434  In some ways, the merge queue has an easy job. For any given range _P_ and its
   435  right neighbor _Q_, the merge queue synthesizes the hypothetical merged range
   436  _PQ_ and asks whether the split queue would immediately split that merged range.
   437  If the split queue would immediately split _PQ_, then obviously _P_ and _Q_
   438  should not be merged; otherwise, the ranges _should_ be merged! This means that
   439  any improvement to our split heuristics also improves our merge heuristics with
   440  essentially no extra work. For example, load-based splitting hardly required any
   441  changes to the merge queue.
   442  
   443  Note that, to avoid thrashing, ranges at or above the minimum size threshold
   444  (8MB) are never considered for a merge. The minimum size threshold is
   445  configurable on a per-zone basis.
   446  
   447  Unfortunately, constructing the hypothetical merged range _PQ_ requires
   448  information about _Q_ that only _Q_'s leaseholder maintains, like the amount of
   449  load that _Q_ is currently experiencing. The merge queue must send a RangeStats
   450  RPC to collect this information from _Q_'s leaseholder, because there is no
   451  guarantee that the current store is _Q_'s leaseholder—or that the current store
   452  even has a replica of _Q_ at all.
   453  
   454  To prevent unnecessary RPC chatter, the merge queue uses several heuristics to
avoid sending RangeStats requests when it seems like the merge is unlikely to be
   456  permitted. For example, if it determines that _P_ and _Q_ store different
   457  tables, the split between them is mandatory and it won't bother sending the
   458  RangeStats requests. Similarly, if _P_ is above the minimum size threshold, it
   459  doesn't bother asking about _Q_.
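
Putting these heuristics together, the merge queue's decision for a range and
its right neighbor can be sketched as below. All names are illustrative; the
real decision also consults zone configs and checks for mandatory split points
such as table boundaries.

```go
package example

// rangeStatsSketch carries the statistics the merge queue needs about a range.
type rangeStatsSketch struct {
	SizeBytes int64
	QPS       float64
}

type mergeDeciderSketch struct {
	// minBytes is the per-zone minimum range size threshold.
	minBytes int64
	// wouldSplit reports whether the split queue would immediately split a
	// range with the given stats (size- or load-based).
	wouldSplit func(rangeStatsSketch) bool
	// fetchRHSStats issues a RangeStats RPC to the RHS's leaseholder.
	fetchRHSStats func() (rangeStatsSketch, error)
}

// shouldMerge decides whether the LHS should be merged with its right neighbor.
func (d mergeDeciderSketch) shouldMerge(lhs rangeStatsSketch) (bool, error) {
	// Cheap local check first: a range at or above the minimum size threshold
	// is never merged, so don't bother asking about the RHS.
	if lhs.SizeBytes >= d.minBytes {
		return false, nil
	}
	rhs, err := d.fetchRHSStats()
	if err != nil {
		return false, err
	}
	// Synthesize the hypothetical merged range and ask whether the split
	// queue would immediately split it again.
	merged := rangeStatsSketch{
		SizeBytes: lhs.SizeBytes + rhs.SizeBytes,
		QPS:       lhs.QPS + rhs.QPS,
	}
	return !d.wouldSplit(merged), nil
}
```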
   460  
   461  ## Subtle complexities
   462  
   463  ### Range descriptor generations
   464  
   465  There was one knotty race that was not immediately eliminated by transaction
   466  serializability. Suppose we have our standard aligned replica set situation:
   467  
   468  ```
   469  Store 1    Store 2     Store 3     Store 4
   470  +-----+    +-----+     +-----+     +-----+
   471  | P Q |    | P Q |     |     |     | P Q |
   472  +-----+    +-----+     +-----+     +-----+
   473  ```
   474  
In an unfortunate twist of fate, a rebalance of _P_ from store 2 to store 3
begins at the same time as a merge of _P_ and _Q_. Let's quickly cover the valid
outcomes of this race.
   478  
   479    1. The rebalance commits before the merge. The merge must abort, as the
     replica sets of _P_ and _Q_ are no longer aligned.
   481  
   482    2. The merge commits before the rebalance starts. The rebalance should
   483       voluntarily abort, as the decision to rebalance P needs to be updated in
   484       light of the merge. It is not, however, a correctness problem if the
   485       rebalance commits; it simply results in rebalancing a larger range than
   486       may have been intended.
   487  
   488    3. The merge commits before the rebalance ends, but after the rebalance has
   489       sent a preemptive snapshot to store 3. The rebalance must abort, as
   490       otherwise the preemptive snapshot it sent to store 3 is a ticking time
   491       bomb.
   492  
   493       To see why, suppose the rebalance commits. Since the preemptive snapshot
   494       predates the commit of the merge transaction, the new replica on store 3
   495       will need to be streamed the Raft command that commits the merge
   496       transaction. But applying this merge command is disastrous, as store 3
   497       does not have a replica of Q to merge! This is a very subtle way in which
   498       replica set alignment can be subverted.
   499  
   500  Guaranteeing the correct outcome in case 1 is easy. The merge transaction simply
   501  checks for replica set alignment by transactionally reading the range descriptor
   502  for _P_ and the range descriptor for _Q_ and verifying that they list the same
   503  replicas. Serializability guarantees the rest.
   504  
   505  Case 2 is similarly easy to handle. The rebalance transaction simply verifies
   506  that the range descriptor used to make the rebalance decision matches the range
   507  descriptor that it reads transactionally.
   508  
   509  Case 3, however, has an extremely subtle pitfall. It seems like the solution for
   510  case 2 should apply: simply abort the transaction if the range descriptor
   511  changes between when the preemptive snapshot is sent and when the rebalance
   512  transaction starts. But, as it turns out, this is not quite foolproof. What if,
   513  between when the preemptive snapshot is sent and when the rebalance transaction
   514  starts, _P_ and _Q_ merge together and then split at exactly the same key? The
   515  range descriptor for _P_ will look entirely unchanged to the rebalance
   516  transaction!
   517  
   518  The solution was to add a generation counter to the range descriptor:
   519  
   520  ```protobuf
   521  message RangeDescriptor {
   522      // ...
   523  
   524      // generation is incremented on every split and every merge, i.e., whenever
   525      // the end_key of this range changes. It is initialized to zero when the range
   526      // is first created.
   527      optional int64 generation = 6;
   528  }
   529  ```
   530  
   531  It is no longer possible for a range descriptor to be unchanged by a sequence of
   532  splits and merges, as every split and merge will bump the generation counter.
   533  Rebalances can thus detect if a merge commits between when the preemptive
   534  snapshot is sent and when the transaction begins, and abort accordingly.
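
With the generation counter in place, the rebalance's pre-commit check can be a
straightforward descriptor comparison, because any intervening split or merge
is guaranteed to have bumped the generation. A sketch, with a stand-in
descriptor type:

```go
package example

import "bytes"

// descSketch is a stand-in carrying only the fields that matter here.
type descSketch struct {
	StartKey, EndKey []byte
	Generation       int64
}

// descriptorUnchanged reports whether the descriptor read inside the rebalance
// transaction matches the one used to make the rebalance decision (and to send
// the preemptive snapshot). A merge followed by a split at the same key leaves
// the bounds identical but bumps Generation twice, so the check still fails,
// as it must.
func descriptorUnchanged(decision, current descSketch) bool {
	return decision.Generation == current.Generation &&
		bytes.Equal(decision.StartKey, current.StartKey) &&
		bytes.Equal(decision.EndKey, current.EndKey)
}
```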
   535  
   536  ### Misaligned replica sets
   537  
   538  An early implementation allowed merges between ranges with misaligned replica
   539  sets. The intent was to simplify matters by avoiding replica rebalancing.
   540  
   541  Consider again our example misaligned replica set:
   542  
   543  ```
   544  Store 1    Store 2     Store 3     Store 4
   545  +-----+    +-----+     +-----+     +-----+
   546  | P Q |    |  P  |     |  Q  |     | P Q |
   547  +-----+    +-----+     +-----+     +-----+
   548  
   549  P: (s1, s2, s4)
   550  Q: (s1, s3, s4)
   551  ```
   552  
   553  Note that there are two perspectives shown here. The store boxes represent the
   554  replicas that are *actually* present on that store, from the perspective of the
   555  store itself. The descriptor tuples at the bottom represent the stores that are
   556  considered to be members of the range, from the perspective of the most recently
   557  committed range descriptor.
   558  
   559  Now, to merge _P_ and _Q_ in this situation without aligning their replica sets,
   560  store 2 needed to be provided a copy of store 3's data. To accomplish this, a
   561  copy of _Q_'s data was stashed in the merge trigger, and _P_ would write this
   562  data into its store when applying the merge trigger.
   563  
   564  There was, sadly, a large and unforeseen problem with lagging replicas. Suppose
   565  store 2 loses network connectivity a moment before ranges _P_ and _Q_ are
   566  merged. Note that store 2 is not required for _P_ and _Q_ to merge, because only
   567  a quorum is required on the LHS to commit a merge. Now the situation looks like
   568  this:
   569  
   570  ```
   571  Store 1    Store 2     Store 3     Store 4
   572  +-----+    xxxxxxx     +-----+     +-----+
   573  | PQ  |    |  P  |     |     |     | PQ  |
   574  +-----+    xxxxxxx     +-----+     +-----+
   575  
   576  PQ: (s1, s2, s4)
   577  ```
   578  
   579  There is nothing stopping the newly merged _PQ_ range from immediately splitting
into _P_ and _Q'_. Note that _P_ is the same range as the original _P_ (i.e., it
has the same range ID), and so store 2's replica of _P_ is still considered a
member, while _Q'_ is a new range, with a new ID, that is unrelated to _Q_:
   583  
   584  ```
   585  Store 1    Store 2     Store 3     Store 4
   586  +-----+    xxxxxxx     +-----+     +-----+
   587  | P Q'|    |  P  |     |     |     | P Q'|
   588  +-----+    xxxxxxx     +-----+     +-----+
   589  
   590  P:  (s1, s2, s4)
   591  Q': (s1, s2, s4)
   592  ```
   593  
   594  When store 2 comes back online, it will start catching up on missed messages.
But notice how the meta ranges consider store 2 to be a member of _Q'_, because
it was a member of _PQ_ before the split. The leaseholder for _Q'_ will notice
   597  that store 2's replica is out of date and send over a snapshot so that store 2
   598  can initialize its replica... and all that might happen before store 2 manages
   599  to apply the merge command for _PQ_. If so, applying the merge command for _PQ_
   600  will explode, because the keyspace of the merged range _PQ_ intersects with the
   601  keyspace of _Q'_!
   602  
   603  By requiring aligned replica sets, we sidestep this problem. The RHS is, in
   604  effect, a lock on the post-merge keyspace. Suppose we find ourselves in the
   605  analogous situation with replica sets aligned:
   606  
   607  ```
   608  Store 1    Store 2     Store 3     Store 4
   609  +-----+    xxxxxxx     +-----+     +-----+
   610  | P Q'|    | P Q |     |     |     | P Q'|
   611  +-----+    xxxxxxx     +-----+     +-----+
   612  
   613  P:  (s1, s2, s4)
   614  Q': (s1, s2, s4)
   615  ```
   616  
   617  Here, _PQ_ split into _P_ and _Q'_ immediately after merging, but notice how
   618  store 2 has a replica of both _P_ and _Q_ because we required replica set
   619  alignment during the merge. That replica of _Q_ prevents store 2 from
   620  initializing a replica of _Q'_ until either store 2's replica of _P_ applies the
   621  merge command (to _PQ_) and the split command (to _P_ and _Q'_), or store 2's
   622  replica of _P_ is rebalanced away.
   623  
   624  ### Replica GC
   625  
   626  Per the discussion in the last section, we use the replica of the RHS as a lock
   627  on the keyspace extension. This means that we need to be careful not to GC this
   628  replica too early.
   629  
   630  It's easiest to see why this is a problem if we consider the case where one
   631  replica is extremely slow in applying a merge:
   632  
   633  ```
   634  Store 1    Store 2     Store 3     Store 4
   635  +-----+    +-----+     +-----+     +-----+
   636  | PQ  |    | PQ  |     |     |     | P Q |
   637  +-----+    +-----+     +-----+     +-----+
   638  
   639  PQ: (s1, s2, s4)
   640  ```
   641  
   642  Here, _P_ and _Q_ have just merged. Store 4 hasn't yet processed the merge while
   643  stores 1 and 2 have.
   644  
   645  The replica GC queue is continually scanning for replicas that are no longer a
   646  member of their range. What if the replica GC queue on store 4 scans its replica
   647  of _Q_ at this very moment? It would notice that the _Q_ range has been merged
   648  away and, conceivably, conclude that _Q_ could be garbage collected. This would
   649  be disastrous, as when _P_ finally applied the merge trigger it would no longer
   650  have a replica of _Q_ to subsume!
   651  
   652  One potential solution would be for the replica GC queue to refuse to GC
   653  replicas for ranges that have been merged away. But that could result in
   654  replicas getting permanently stuck. Suppose that, before store 4 applies the
   655  merge transaction, the _PQ_ range is rebalanced away to store 3:
   656  
   657  ```
   658  Store 1    Store 2     Store 3     Store 4
   659  +-----+    +-----+     +-----+     +-----+
   660  | PQ  |    | PQ  |     | PQ  |     | P Q |
   661  +-----+    +-----+     +-----+     +-----+
   662  
   663  PQ: (s1, s2, s3)
   664  ```
   665  
   666  Store 4's replica of _P_ will likely never hear about the merge, as it is no
   667  longer a member of the range and therefore not receiving any additional Raft
   668  messages from the leader, so it will never subsume _Q_. The replica GC queue
   669  _must_ be capable of garbage collecting _Q_ in this case. Otherwise _Q_ will
   670  be stuck on store 4 forever, permanently preventing the store from ever
   671  acquiring a new replica that overlaps with that keyspace.
   672  
   673  Solving this problem turns out to be quite tricky. What the replica GC queue
   674  wants to know when it discovers that _Q_'s range has been subsumed is whether
   675  the local replica of _Q_ might possibly still be subsumed by its local left
   676  neighbor _P_. It can't just ask the local _P_ whether it's about to apply a
   677  merge, since _P_ might be lagging behind, as it is here, and have no idea that a
   678  merge is about to occur.
   679  
   680  So the problem reduces to proving that _P_ cannot apply a merge trigger that
   681  will subsume _Q_. The chosen approach is to fetch the current range descriptor
   682  for _P_ from the meta index. If that descriptor exactly matches the local
   683  descriptor, thanks to [range descriptor generations](#range-descriptor-generations),
   684  we are assured that there are no merge triggers that _P_ has yet to apply, and
   685  _Q_ can safely be GC'd.
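
In sketch form, the replica GC queue's check for a subsumed range reduces to a
generation comparison between the store's local copy of the left neighbor and
the authoritative copy fetched from the meta index. The type and helper below
are stand-ins; the real check also handles the case where the store has no
left-neighbor replica at all.

```go
package example

// descGenSketch is a stand-in carrying only a descriptor's generation.
type descGenSketch struct {
	Generation int64
}

// canGCSubsumedReplica reports whether a replica whose range has been merged
// away can safely be garbage collected. localLHS is the store's local copy of
// the left neighbor's descriptor; metaLHS is the copy fetched from the meta
// index. If the local copy is at least as new as the meta copy, the left
// neighbor has no unapplied merge triggers that could still need the RHS.
func canGCSubsumedReplica(localLHS, metaLHS descGenSketch) bool {
	return localLHS.Generation >= metaLHS.Generation
}
```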
   686  
   687  Note that it is possible to form long chains of replicas that can only be GC'd
   688  from left to right; the GC queue is not aware of these dependencies and
   689  therefore processes such chains extremely inefficiently (i.e., by processing
   690  replicas in an arbitrary order instead of the necessary order). These chains
   691  turn out to be extremely rare in practice.
   692  
   693  There is one additional subtlety here. Suppose we have two adjacent ranges, _O_
   694  and _Q_. _O_ has just split into _O_ and _P_, but store 3 is lagging and has not
   695  yet processed the split.
   696  
   697  ```
   698  STATE 1
   699  
   700   Store 1      Store 2      Store 3     Store 4
   701  +-------+    +-------+    +-------+   +-------+
   702  | O P Q |    | O P Q |    | O   Q |   |       |
   703  +-------+    +-------+    +-------+   +-------+
   704  
   705  O: (s1, s2, s3)
   706  P: (s1, s2, s3)
   707  Q: (s1, s2, s3)
   708  ```
   709  
   710  At this point, suppose the leader for the new range _P_ decides that store 3
   711  will need a snapshot to catch up, and starts sending the snapshot over the
   712  network. This will be important later. At the same time, _P_ and _Q_ merge while
   713  store 3 is still lagging.
   714  
   715  It may seem strange that this merge is permitted, but notice how the replica
   716  sets are aligned according to the descriptors, even though store 3 does not
   717  physically have a replica of _P_ yet. Here's the new state of the world:
   718  
   719  ```
   720  STATE 2
   721  
   722   Store 1      Store 2      Store 3     Store 4
   723  +-------+    +-------+    +-------+   +-------+
   724  | O  PQ |    | O  PQ |    | O   Q |   |       |
   725  +-------+    +-------+    +-------+   +-------+
   726  
   727  O:  (s1, s2, s3)
   728  PQ: (s1, s2, s3)
   729  ```
   730  
   731  Finally, _O_ is rebalanced from store 3 to store 4 and garbage collected on
   732  store 4:
   733  
   734  ```
   735  STATE 3
   736  
   737   Store 1      Store 2      Store 3     Store 4
   738  +-------+    +-------+    +-------+   +-------+
   739  | O  PQ |    | O  PQ |    |     Q |   | O     |
   740  +-------+    +-------+    +-------+   +-------+
   741  
   742  O:  (s1, s2, s4)
   743  PQ: (s1, s2, s3)
   744  ```
   745  
   746  The replica GC queue might reasonably think that store 3's replica of _Q_ is out
   747  of date, as _Q_ has no left neighbor that could subsume it. But, at any moment
   748  in time, store 3 could finish receiving the snapshot for _P_ that was started
   749  between state 1 and state 2. Crucially, this snapshot predates the merge, so it
   750  will need to apply the merge trigger... and the replica for _Q_ had better be
   751  present on the store!
   752  
   753  This hazard is avoided by requiring that all replicas of the LHS are initialized
   754  before a merge begins. This prevents a transition from state 1 to state 2, as
   755  the merge of _P_ and _Q_ cannot occur until store 3 initializes its replica of
   756  _P_. The AdminMerge command will wait a few seconds in the hope that store 3
   757  catches up quickly; otherwise, it will refuse to launch the merge transaction.
   758  It is therefore impossible to end up in a dangerous state, like state 3, and it
   759  is thus safe for the replica GC queue to GC _Q_ if its left neighbor is
   760  generationally up to date.
   761  
   762  ### Transaction record GC
   763  
   764  The merge watcher goroutine needs to wait until the merge transaction completes,
and determine whether the transaction committed or aborted. This turns
   766  out to be brutally complicated, thanks to the aggressive garbage collection of
   767  transaction records.
   768  
   769  The watcher goroutine begins by sending a PushTxn request to the merge
   770  transaction. It can easily discover the necessary arguments for the PushTxn
   771  request, that is, the ID and key of the current merge transaction record,
   772  because they're recorded in the intent that the merge transaction has left on
   773  the RHS's local copy of the descriptor.
   774  
   775  If the PushTxn request reports that the merge transaction committed, we're
   776  guaranteed that the merge transaction did, in fact, complete. That means that we
   777  can mark the RHS replica as destroyed, bounce all requests back to DistSender
   778  (where they'll get retried on the subsuming range), and clean up the watcher
   779  goroutine. The RHS replica will be cleaned up either when the LHS replica
   780  subsumes it or when the replica GC queue notices that it has been abandoned.
   781  
   782  If the PushTxn request instead reports that the merge transaction aborted, we're
   783  not guaranteed that the merge transaction actually aborted. The merge
   784  transaction may have committed so quickly that its transaction record was
   785  garbage collected before our PushTxn request arrived. The PushTxn incorrectly
   786  interprets this state to mean that the transaction was aborted, when, in fact,
   787  it was committed and GCed. To be fair, we're somewhat abusing the PushTxn
   788  request here. Outside of merges, a PushTxn request is only sent when a pending
   789  intent is discovered, and transaction records can't be GCed until all their
   790  intents have been resolved.
   791  
   792  So we need some way to determine whether a merge transaction was actually
   793  aborted. What we do is look for the effects of the merge transaction in meta2.
   794  If the merge aborted, we'll see our own range descriptor, with our range ID, in
   795  meta2. By contrast, if the merge committed, we'll see a range descriptor for a
   796  different range in meta2.
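
A sketch of that fallback check, using stand-in types: look up the meta2
descriptor that covers the RHS's start key and see whether it still refers to
the RHS's own range ID.

```go
package example

import "context"

// metaReaderSketch abstracts a consistent read of the meta2 record covering a
// given key; the real code issues a range-lookup-style request.
type metaReaderSketch interface {
	LookupRangeID(ctx context.Context, key []byte) (int64, error)
}

// mergeCommitted distinguishes "the merge transaction aborted" from "it
// committed and its record was already GC'd" after an ambiguous PushTxn
// response. If meta2 still maps the RHS's start key to the RHS's own range ID,
// the merge must have aborted; if it maps to some other range, the merge
// committed and the RHS has been subsumed.
func mergeCommitted(ctx context.Context, meta metaReaderSketch, rhsStartKey []byte, rhsRangeID int64) (bool, error) {
	id, err := meta.LookupRangeID(ctx, rhsStartKey)
	if err != nil {
		return false, err
	}
	return id != rhsRangeID, nil
}
```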
   797  
   798  This complexity is extremely unfortunate, and turns what should be a simple
goroutine, spawned on the RHS leaseholder for every merge transaction,
   800  
   801  ```go
   802  go func() {
   803      <-txn.Done() // wait for txn to complete
   804      if txn.Committed() {
   805          repl.MarkDestroyed("replica subsumed")
   806      }
   807      repl.UnblockRequests()
}()
   809  ```
   810  
   811  into [150 lines of hard to follow code][code].
   812  
   813  [code]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/replica.go#L2813-L2920
   814  
   815  ### Unanimity
   816  
   817  The largest conceptual incongruity with the current merge implementation is the
   818  fact that it requires unanimous consent from all replicas, instead of a majority
   819  quorum, like everything else in Raft. Further confusing matters, only the RHS
   820  needs unanimous consent; a merge can proceed with only majority consent from the
   821  LHS. In fact, it's even a bit more subtle: while only a majority of the LHS
   822  replicas need to vote on the merge command, all LHS replicas need to confirm
   823  that they are initialized for the merge to start.
   824  
   825  There is no theoretical reason that merges need unanimous consent, but the
   826  complexity of the implementation quickly skyrockets without it. For example,
   827  suppose you adjusted the transfer of power so that only a majority of replicas
   828  on the RHS need to be fully up to date before the merge commits. Now, when
   829  applying the merge trigger, the LHS needs to check to see if its copy of the RHS
   830  is up to date; if it's not, the LHS needs to throw away its entire Raft state
   831  and demand a snapshot from the leader. This is both unsightly—our code is worse
   832  off every time we reach into Raft—and less efficient than the existing
   833  implementation, as it requires sending a potentially multi-megabyte snapshot if
   834  one replica of the RHS is just a little bit behind in applying the latest
   835  commands.
   836  
   837  It's possible that these problems could be mitigated while retaining the ability
   838  to merge with a minority of replicas offline, but an obvious solution did not
   839  present itself. On the bright side, having too many ranges is unlikely to cause
   840  acute performance problems; that is, a situation where a merge is critical to
   841  the health of a cluster is difficult to imagine. Unlike large ranges, which can
   842  appear suddenly and require an immediate split or log truncation, merges are
   843  only required when there are on the order of tens of thousands of excessively
   844  small ranges, which takes a long time to build up.
   845  
   846  ## Safety recap
   847  
   848  This section is a recap of the various mechanisms, which are described in
   849  detail above, that work together to ensure that merges do not violate
   850  consistency.
   851  
   852  The first broad safety mechanism is replica set alignment, which is required so
   853  that every store participating in the merge has a copy of both halves of the
   854  data in the merged range. Replica sets are first optimistically aligned by the
merge queue. The replica sets might drift apart, e.g., because the ranges in
   856  question were also targeted for a rebalance by the replicate queue, so the
   857  merge transaction verifies that the replica sets are still aligned from within
   858  the transaction. If a concurrent split or rebalance were to occur on the
   859  implicated ranges, transactional isolation kicks in and aborts one of the
   860  transactions, so we know that the replica sets are still aligned at the moment
   861  that the merge commits.
   862  
   863  Crucially, we need to maintain alignment until the merge applies on all replicas
   864  that were present at the time of the merge. This is enforced by refusing to
   865  GC a replica of the RHS of a merge unless it can be proven that the store does
   866  not have a replica of the LHS that predates the merge, _nor_ will it acquire
   867  a replica of the LHS that predates the merge. Proving that it does not currently
   868  have a replica of the LHS that predates the merge is fairly straightforward:
   869  we simply prove that the local left neighbor's generation is the newest
   870  generation, as indicated by the LHS's meta descriptor. Proving that the store
   871  will _never_ acquire a replica of the LHS that predates the merge is harder—
   872  there could be a snapshot in flight that the LHS is entirely unaware of. So
   873  instead we require that replicas of the LHS in a merge are initialized on every
   874  store before the merge can begin.
   875  
   876  The second broad safety mechanism is the range freeze. This ensures that the
   877  subsuming range and the subsumed range do not serve traffic at the same time,
   878  which would lead to clear consistency violations. The mechanism works by tying
   879  the freeze to the lifetime of the merge transaction; the merge will not commit
   880  until all replicas of the RHS are verified to be frozen, and the replicas of the
   881  RHS will not unfreeze unless the merge transaction is verified to be aborted.
   882  Lease transfers are freeze-aware, so the freeze will persist even if the lease
   883  moves around on the RHS during the merge or if the leaseholder restarts. The
implementation of the freeze (ab)uses the span latch manager to flush out
in-flight commands on the RHS, an intent on the local range descriptor to
ensure the freeze persists if the lease is transferred, and an RPC that
repeatedly polls the RHS to wait until it is fully caught up.
   888  
   889  ## Appendix
   890  
   891  The appendix contains digressions that are not directly pertinent to range
   892  merges, but are not covered in documentation elsewhere.
   893  
   894  ### Key encoding oddities
   895  
   896  Lexicographic ordering of keys of unbounded length has the interesting property
   897  that it is always possible to construct the key that immediately succeeds a
   898  given key, but it is not always possible to construct the key that immediately
   899  precedes a given key.
   900  
   901  In the following diagrams `\xNN` represents a byte whose value in hexadecimal is
   902  `NN`. The null byte is thus `\x00` and the maximal byte is thus `\xff`.
   903  
   904  Now consider a sequence of keys that has no gaps:
   905  
   906  ```
   907  a
   908  a\x00
   909  a\x00\x00
   910  ```
   911  
   912  No gaps means that there are no possible keys that can sort between any of the
   913  members of the sequence. For example, there is, simply, no key that sorts
   914  between `a` and `a\x00`.
   915  
   916  Because we can construct such a sequence, we must have next and previous
   917  operations over the sequence, which, given a key, construct the immediately
   918  following key and the immediately preceding key, respectively. We can see from
   919  the diagram that the next operation appends a null byte (`\x00`), while the
   920  previous operation strips off that null byte.
   921  
   922  But what if we want to perform the previous operation on a key that does not end
   923  in a null byte? For example, what is the key that immediately precedes `b`? It's
   924  not `a`, because `a\x00` sorts between `a` and `b`. Similarly, it's not `a\xff`,
   925  because `a\xff\xff` sorts between `a\xff` and `b`. This process continues
   926  inductively until we conclude the key that immediately precedes `b` is
   927  `a\xff\xff\xff...`, where there are an infinite number of trailing `\xff` bytes.
   928  
   929  It is not possible to represent this key in CockroachDB without infinite space.
   930  You could imagine designing the key encoding with an additional bit that means,
   931  "pretend this key has an infinite number of trailing maximal bytes," but
   932  CockroachDB does not have such a bit.
   933  
   934  The upshot is that it is trivial to advance in the keyspace using purely
   935  lexical operations, but it is impossible to reverse in the keyspace with purely
   936  lexical operations.
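
A short sketch of the asymmetry in code: the "next key" operation is a one-line
append, while a "previous key" operation only exists for keys that happen to
end in a null byte.

```go
package main

import "fmt"

// nextKey returns the key that immediately follows k in lexicographic order:
// simply append a null byte. Every key has a well-defined successor.
func nextKey(k []byte) []byte {
	return append(append([]byte(nil), k...), 0x00)
}

// prevKey returns the key that immediately precedes k, which is only
// representable when k ends in a null byte (strip it off). For any other key,
// e.g. "b", the immediate predecessor would need an infinite run of trailing
// 0xff bytes.
func prevKey(k []byte) ([]byte, bool) {
	if n := len(k); n > 0 && k[n-1] == 0x00 {
		return append([]byte(nil), k[:n-1]...), true
	}
	return nil, false
}

func main() {
	fmt.Printf("%q\n", nextKey([]byte("a"))) // "a\x00"
	if _, ok := prevKey([]byte("b")); !ok {
		fmt.Println(`no representable predecessor for "b"`)
	}
}
```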
   937  
   938  This problem pervades the system. Given a range that spans from `StartKey`,
inclusive, to `EndKey`, exclusive, it is trivial to address a request to the
following range, but *not* to the preceding range. To route a request to a range,
   941  we must construct a key that lives inside that range. Constructing such a key
   942  for the following range is trivial, as the end key of a range is, by definition,
   943  contained in the following range. But constructing such a key for the preceding
   944  range would require constructing the key that immediately precedes `StartKey`,
   945  which is not possible with CockroachDB's key encoding.