
     1  # Range merges
     2  
     3  **Last update:** April 10, 2019
     4  
     5  **Original author:** Nikhil Benesch
     6  
     7  This document serves as an end-to-end description of the implementation of range
merges. The target reader is someone who is reasonably familiar with
CockroachDB's core (KV) layer but unfamiliar with either the "how" or the "why"
of the range merge implementation.
    11  
    12  The most complete documentation, of course, is in the code, tests, and the
    13  surrounding comments, but those pieces are necessarily split across several
    14  files and packages. That scattered knowledge is centralized here, without
    15  excessive detail that is likely to become stale.
    16  
    17  ## Table of Contents
    18  
    19  * [Overview](#overview)
    20  * [Implementation details](#implementation-details)
    21    * [Preconditions](#preconditions)
    22    * [Initiating a merge](#initiating-a-merge)
    23      * [AdminMerge race](#adminmerge-race)
    24    * [Merge transaction](#merge-transaction)
    25      * [Transfer of power](#transfer-of-power)
    26    * [Snapshots](#snapshots)
    27    * [Merge queue](#merge-queue)
    28  * [Subtle complexities](#subtle-complexities)
  * [Range descriptor generations](#range-descriptor-generations)
    30    * [Misaligned replica sets](#misaligned-replica-sets)
    31    * [Replica GC](#replica-gc)
    32    * [Transaction record GC](#transaction-record-gc)
    33    * [Unanimity](#unanimity)
    34  * [Safety recap](#safety-recap)
    35  * [Appendix](#appendix)
    36    * [Key encoding oddities](#key-encoding-oddities)
    37  
    38  ## Overview
    39  
    40  A range merge begins when two adjacent ranges are selected to be merged
    41  together. For example, suppose our adjacent ranges are _P_ and _Q_, somewhere in
    42  the middle of the keyspace:
    43  
    44  ```
    45  --+-----+-----+--
    46    |  P  |  Q  |
    47  --+-----+-----+--
    48  ```
    49  
    50  We'll call _P_ the left-hand side (LHS) of the merge, and _Q_ the right-hand
    51  side (RHS) of the merge. For reasons that will become clear later, we also refer
    52  to _P_ as the subsuming range and _Q_ as the subsumed range.
    53  
    54  The merge is coordinated by the LHS. The coordinator begins by verifying that a)
    55  the two ranges are, in fact, adjacent, and b) that the replica sets of the two
    56  ranges are aligned. Replica set alignment is a term that is currently only
    57  relevant to merges; it means that the set of stores with replicas of the LHS
    58  exactly matches the set of stores with replicas of the RHS. For example, this
    59  replica set is aligned:
    60  
    61  ```
    62  Store 1    Store 2     Store 3     Store 4
    63  +-----+    +-----+     +-----+     +-----+
    64  | P Q |    | P Q |     |     |     | P Q |
    65  +-----+    +-----+     +-----+     +-----+
    66  ```
    67  
    68  By requiring replica set alignment, the merge operation is reduced to a metadata
    69  update, albeit a tricky one, as the stores that will have a copy of the merged
    70  range _PQ_ already have all the constituent data, by virtue of having a copy of
    71  both _P_ and _Q_ before the merge begins. Note that replicas of _P_ and _Q_ do
    72  not need to be fully up-to-date before the merge begins; they'll be caught up as
    73  necessary during the [transfer of power](#transfer-of-power).
    74  
    75  After verifying that the merge is sensible, the coordinator transactionally
updates the implicated range descriptors, adjusting _P_'s range descriptor so that
    77  it extends to _Q_'s end and deleting _Q_'s range descriptor.
    78  
    79  Then, the coordinator needs to [atomically move
    80  responsibility](#transfer-of-power) for the data in the RHS to the LHS. This is
    81  tricky, as the lease on the LHS may be held by a different store than the lease
on the RHS. The coordinator notifies the RHS that it is about to be subsumed
and that it is prohibited from serving any additional read or write traffic.
Only when
    84  the coordinator has received an acknowledgement from _every_ replica of the RHS,
    85  indicating that no traffic is possibly being served on the RHS, does the
    86  coordinator commit the merge transaction.
    87  
    88  Like with splits, the merge transaction is committed with a special "commit
    89  trigger" that instructs the receiving store to update its in-memory bookkeeping
    90  to match the updates to the range descriptors in the transaction. The moment the
    91  merge transaction is considered committed, the merge is complete!
    92  
    93  At the time of writing, merges are only initiated by the merge queue, which is
    94  responsible both for locating ranges that are in need of a merge and aligning
    95  their replica sets before initiating the merge.
    96  
    97  The remaining sections cover each of these steps in more detail.
    98  
    99  ## Implementation details
   100  
   101  ### Preconditions
   102  
Not just any two ranges can be merged. The first and most obvious criterion is
that the two ranges must be adjacent. Consider a simplified cluster that has
only three ranges, _A_, _B_, and _C_:
   106  
   107  ```
   108  +-----+-----+-----+
   109  |  A  |  B  |  C  |
   110  +-----+-----+-----+
   111  ```
   112  
   113  Ranges _A_ and _B_ can be merged, as can ranges _B_ and _C_, but not ranges _A_
   114  and _C_, as they are not adjacent. Note that adjacent ranges are equivalently
   115  referred to as "neighbors", as in, range _B_ is range _A_'s right-hand neighbor.
   116  
   117  The second criterion is that the replica sets must be aligned. To illustrate,
   118  consider a four node cluster with the default 3x replication. The allocator has
   119  attempted to balance ranges as evenly as possible:
   120  
   121  ```
   122  Node  1    Node  2     Node  3     Node  4
   123  +-----+    +-----+     +-----+     +-----+
   124  | P Q |    |  P  |     |  Q  |     | P Q |
   125  +-----+    +-----+     +-----+     +-----+
   126  ```
   127  
   128  Notice how node 2 does not have a copy of _Q_, and node 3 does not have a copy
   129  of _P_. These replica sets are considered "misaligned." Aligning them requires
rebalancing _Q_ from node 3 to node 2, or rebalancing _P_ from node 2 to node 3:
   131  
   132  ```
   133  Node  1    Node  2     Node  3     Node  4
   134  +-----+    +-----+     +-----+     +-----+
   135  | P Q |    | P Q |     |     |     | P Q |
   136  +-----+    +-----+     +-----+     +-----+
   137  
   138  Node  1    Node  2     Node  3     Node  4
   139  +-----+    +-----+     +-----+     +-----+
   140  | P Q |    |     |     | P Q |     | P Q |
   141  +-----+    +-----+     +-----+     +-----+
   142  ```
   143  
   144  We explored an alternative merge implementation that did not require aligned
   145  replica sets, but found it to be unworkable. See the [misaligned replica sets
   146  misstep](#misaligned-replica-sets) for details.
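
To make the alignment requirement concrete, here is a minimal, self-contained
sketch of an alignment check. The `ReplicaDescriptor` type and the
`replicaSetsAligned` helper below are simplified stand-ins for illustration,
not the actual types used by the merge code.

```go
package example

// ReplicaDescriptor is a simplified stand-in for the real descriptor type: it
// identifies the store that holds a replica of a range.
type ReplicaDescriptor struct {
	StoreID int
}

// replicaSetsAligned reports whether every store with a replica of the LHS
// also has a replica of the RHS, and vice versa. Order does not matter.
func replicaSetsAligned(lhs, rhs []ReplicaDescriptor) bool {
	if len(lhs) != len(rhs) {
		return false
	}
	lhsStores := make(map[int]bool, len(lhs))
	for _, r := range lhs {
		lhsStores[r.StoreID] = true
	}
	for _, r := range rhs {
		if !lhsStores[r.StoreID] {
			return false
		}
	}
	return true
}
```

In the misaligned example above, the check fails because node 2 appears only in
_P_'s replica set and node 3 only in _Q_'s.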
   147  
   148  ### Initiating a merge
   149  
A merge is initiated by sending an AdminMerge request to a range. Like other
   151  admin commands, DistSender will automatically route the request to the
   152  leaseholder of the range, but there is no guarantee that the store will retain
   153  its lease while the admin command is executing.
   154  
Note that an AdminMerge request takes no arguments, as there is no choice in
what ranges will be merged. The recipient of the AdminMerge will always be the
LHS, i.e., the subsuming range, and its right neighbor at the moment that the
AdminMerge command begins executing will always be the RHS, i.e., the subsumed
range.
   159  
   160  It would have been reasonable to have instead used the RHS to coordinate the
   161  merge. That is, the RHS would have been the subsuming range, and the LHS would
   162  have been the subsumed range. Using the LHS to coordinate, however, yields a
   163  nice symmetry with splits, where the range that coordinates a split becomes the
   164  LHS of the split. Maintaining this symmetry means that a range's start key never
   165  changes during its lifetime, while its end key may change arbitrarily in
   166  response to splits and merges.
   167  
There is another reason to prefer using the LHS to coordinate, which involves an
oddity of key encoding and range bounds. It is trivial for a range to send a
   170  request to its right neighbor, as it simply addresses the request to its end
   171  key, but it is difficult to send a request to its left neighbor, as there is no
   172  function to get the key that immediately precedes the range's start key. See the
   173  [key encoding oddities](#key-encoding-oddities) section of the appendix for
   174  details.
   175  
   176  At the time of writing, only the [merge queue](#merge-queue) initiates merges,
   177  and it does so by bypassing DistSender and invoking the AdminMerge command
   178  directly on the local replica. At some point in the future, we may wish to
   179  expose manual merges via SQL, at which point the SQL layer will need to send
   180  proper AdminMerge requests through the KV API.
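
For illustration, issuing a merge boils down to a single call addressed to a
key inside the LHS. The sketch below assumes a KV client that exposes an
`AdminMerge` helper; the name and signature are illustrative rather than the
exact client API.

```go
package example

import "context"

// adminMerger abstracts the single KV API call we need. The real client
// exposes a similar helper, but the exact signature here is an assumption.
type adminMerger interface {
	// AdminMerge asks the range containing key to merge with its right
	// neighbor. There is no way to name the RHS explicitly.
	AdminMerge(ctx context.Context, key []byte) error
}

// mergeWithRightNeighbor initiates a merge of the range containing lhsKey with
// its right-hand neighbor. The recipient is always the LHS; whatever range
// happens to be its right neighbor when the request executes becomes the RHS.
func mergeWithRightNeighbor(ctx context.Context, db adminMerger, lhsKey []byte) error {
	return db.AdminMerge(ctx, lhsKey)
}
```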
   181  
   182  #### AdminMerge race
   183  
   184  At present, AdminMerge requests are subject to a small race. It is possible for
   185  the ranges implicated by an AdminMerge request to split or merge between when
   186  the client decides to send an AdminMerge request and when the AdminMerge request
   187  is processed.
   188  
   189  For example, suppose the client decides that _P_ and _Q_ should be merged and
   190  sends an AdminMerge request to _P_. It is possible that, before the AdminMerge
   191  request is processed, _P_ splits into _P<sub>1</sub>_ and _P<sub>2</sub>_. The
   192  AdminMerge request will thus result in _P<sub>1</sub>_ and _P<sub>2</sub>_
   193  merging together, and not the desired _P_ and _Q_.
   194  
   195  The race could have been avoided if the AdminMerge request required that the
   196  descriptors for the implicated ranges were provided as arguments to the request.
   197  Then the merge could be aborted if the merge transaction discovered that either
   198  of the implicated ranges did not match the corresponding descriptor in the
   199  AdminMerge request arguments, forming a sort of optimistic lock.
   200  
   201  Fortunately, the race is rare in practice. If it proves to be a problem, the
   202  scheme described above would be easy to implement while maintaining backwards
   203  compatibility.
   204  
   205  ### Merge transaction
   206  
   207  The merge transaction piggybacks on CockroachDB's serializability to provide
   208  much of the necessary synchronization for the bookkeeping updates. For example,
   209  merges cannot occur concurrently with any splits or replica changes on the
implicated ranges, because the merge transaction will naturally conflict with
the corresponding split and replica-change transactions, as both transactions
will attempt to write updated range descriptors. No additional code
   213  was needed to enforce this, as our standard transaction conflict detection
   214  mechanisms kick in here (write intents, the timestamp cache, the span latch
   215  manager, etc.).
   216  
   217  Note that there was one surprising synchronization problem that was not
   218  immediately handled by serializability. See [range descriptor
   219  generations](#range-descriptor-generations) for details.
   220  
   221  The standard KV operations that the merge transaction performs are:
   222  
   223    * Reading the LHS descriptor and RHS descriptor, and verifying that their
   224      replica sets are aligned.
   225    * Updating the local and meta copy of the LHS descriptor to reflect
   226      the widened end key.
   227    * Deleting the local and meta copy of the RHS descriptor.
   228    * Writing an entry to the `system.rangelog` table.
   229  
   230  These operations are the essence of a merge, and in fact update all necessary
   231  on-disk data! All the remaining complexity exists to update in-memory metadata
   232  while the cluster is live.
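
In sketch form, the transactional skeleton might look like the following. The
types, key arguments, and helpers are simplified stand-ins, not the real
implementation; the write ordering (local LHS descriptor first) matters for the
reasons explained in the next paragraphs.

```go
package example

import (
	"context"
	"errors"
)

var errMisaligned = errors.New("replica sets are not aligned")

// rangeDescriptor is a stand-in for the fields that matter in this sketch.
type rangeDescriptor struct {
	StartKey, EndKey []byte
	Replicas         []int // store IDs
}

// kvTxn is a stand-in for the transactional KV operations used by the merge.
type kvTxn interface {
	Get(ctx context.Context, key []byte) (rangeDescriptor, error)
	Put(ctx context.Context, key []byte, desc rangeDescriptor) error
	Del(ctx context.Context, key []byte) error
}

func runMergeTxnSketch(ctx context.Context, txn kvTxn,
	lhsLocalKey, lhsMetaKey, rhsLocalKey, rhsMetaKey, rangeLogKey []byte) error {
	// Read both descriptors and verify that their replica sets are aligned.
	lhs, err := txn.Get(ctx, lhsLocalKey)
	if err != nil {
		return err
	}
	rhs, err := txn.Get(ctx, rhsLocalKey)
	if err != nil {
		return err
	}
	if !aligned(lhs.Replicas, rhs.Replicas) {
		return errMisaligned
	}
	// Widen the LHS to the RHS's end key. The local copy is the transaction's
	// first write so that the transaction record is anchored on the LHS.
	merged := lhs
	merged.EndKey = rhs.EndKey
	if err := txn.Put(ctx, lhsLocalKey, merged); err != nil {
		return err
	}
	if err := txn.Put(ctx, lhsMetaKey, merged); err != nil {
		return err
	}
	// Delete the local and meta copies of the RHS descriptor.
	if err := txn.Del(ctx, rhsLocalKey); err != nil {
		return err
	}
	if err := txn.Del(ctx, rhsMetaKey); err != nil {
		return err
	}
	// Record the merge in system.rangelog (represented here as a plain key).
	return txn.Put(ctx, rangeLogKey, merged)
}

// aligned reports whether two replica sets cover the same stores.
func aligned(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[int]bool, len(a))
	for _, s := range a {
		seen[s] = true
	}
	for _, s := range b {
		if !seen[s] {
			return false
		}
	}
	return true
}
```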
   233  
Note that the merge transaction's KV operations are not fundamentally dependent
on one another and so could theoretically be performed in any order. There are,
however,
   236  several implementation details that enforce some ordering constraints.
   237  
   238  First, the merge transaction record needs to be located on the LHS.
   239  Specifically, the transaction record needs to live on the subsuming range, as
   240  the commit trigger that actually applies the merge to the replica's in-memory
   241  state runs on the range where the transaction record lives. The transaction
   242  record is created on the range that the transaction writes first; therefore, the
   243  merge transaction is careful to update the local copy of the LHS descriptor as
   244  its first operation, since the local copy of the LHS descriptor lives on the
   245  LHS.
   246  
   247  Second, the merge transaction must ensure that, when it issues the delete
   248  request to remove the local copy of the RHS descriptor, the resulting intent is
   249  actually written to disk. (See the [transfer of power](#transfer-of-power)
   250  subsection for why this is required.) Thanks to [transactional
   251  pipelining][#26599], KV writes can return early, before their intents have
   252  actually been laid down. The intents are not required to make it to disk until
   253  the moment before the transaction commits. The merge transaction simply disables
   254  pipelining to avoid this hazard.
   255  
   256  As the last step before the commit, the merge transaction needs to freeze the
   257  RHS, then wait for _every_ replica of the RHS to apply all outstanding commands.
   258  This ensures that, when the merge commits, every LHS replica can blindly assume
   259  that it has perfectly up-to-date data for the RHS. To quickly recap, this is
   260  guaranteed because 1) the replica sets were aligned when the merge transaction
   261  started, 2) rebalances that would misalign the replica sets will conflict with
   262  the merge transaction, causing one of the transactions to abort, 3) the RHS is
   263  frozen and cannot process any new commands, and 4) every replica of the RHS is
   264  caught up on all commands. The process of freezing the RHS and waiting for every
   265  replica to catch up is covered more thoroughly in the [transfer of
   266  power](#transfer-of-power) subsection.
   267  
   268  Finally, the merge transaction commits, attaching a special [merge commit
trigger] to the end transaction request. This trigger has four
responsibilities:
   271  
   272    1. It ensures the end transaction request knows which intents it can resolve
   273       locally. Intents that live on the RHS range would naively appear to belong
   274       to a different range than the one containing the transaction record (i.e.,
   275       the LHS), but if the merge is committing then the LHS is subsuming the RHS
   276       and thus the intents can be resolved locally.
   277  
   278       In fact, it's critical that these intents are considered local, because
   279       local intents are resolved synchronously while remote intents are resolved
   280       asynchronously. We need to maintain the invariant that, when a store boots
   281       up and discovers an intent on its local copy of a range descriptor, it can
   282       simply ignore the intent. Because we enforce that these intents are
   283       resolved synchronously with the commit of the merge transaction, we are
   284       guaranteed that, if we see an intent on a local range descriptor, this
   285       replica has not yet applied the `EndTransaction` request for the merge
   286       transaction, and it is therefore safe to load the replica. If the intent
   287       were instead resolved asynchronously, we could observe the state where the
   288       `EndTransaction` request for the merge had applied but the intent
   289       resolution had not applied, in which case we would attempt to load both the
   290       post-merge subsuming replica, and the subsumed replica, which would overlap
   291       and crash the node.
   292  
  2. It adjusts the LHS's MVCCStats to incorporate the subsumed range's
     MVCCStats (see the sketch after this list).
   295  
  3. It copies necessary range-ID local data from the RHS to the LHS, rekeying
     each key to use the LHS's range ID. At the moment, the only necessary data
     is the
   298       [transaction abort span].
   299  
   300    4. It attaches a [merge payload to the replicated proposal result][pd-flag].
   301       When each replica of the LHS applies the command, it will notice the merge
   302       payload and adjust the store's in-memory state to match the changes to the
   303       on-disk state that were committed by the transaction. This entails
   304       atomically removing the RHS replica from the store and widening the LHS
   305       replica.
   306  
   307       This operation involves a delicate dance of acquiring store locks and locks
   308       for both replicas, in a certain order, at various points in the Raft
   309       command application flow. The details are too intricate to be worth
   310       describing here, especially considering that these tangled interactions
   311       between a store and its replicas are due for a refactor. The best thing to
   312       do, if you're interested in the details, is to trace through all references
   313       to `storagepb.ReplicatedEvalResult.Merge`.
   314  
   315  [#26599]: https://github.com/cockroachdb/cockroach/pull/26599
   316  [transaction abort span]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/abortspan/abortspan.go
   317  [merge commit trigger]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/batcheval/cmd_end_transaction.go#L984-L994
   318  [pd-flag]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/batcheval/cmd_end_transaction.go#L1033-L1035
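
Responsibility 2 is conceptually the simplest: the subsumed range's MVCC stats
are folded field-by-field into the subsuming range's stats. Here is a minimal
sketch with a stand-in stats type (only a few fields shown), rather than the
real MVCCStats:

```go
package example

// mvccStatsSketch is a stand-in for the real MVCC stats type.
type mvccStatsSketch struct {
	LiveBytes, KeyBytes, ValBytes int64
	KeyCount, ValCount            int64
}

// add folds the RHS's stats into the LHS's stats when the merge trigger runs,
// so that the widened LHS accounts for all of the data it now contains.
func (ms *mvccStatsSketch) add(rhs mvccStatsSketch) {
	ms.LiveBytes += rhs.LiveBytes
	ms.KeyBytes += rhs.KeyBytes
	ms.ValBytes += rhs.ValBytes
	ms.KeyCount += rhs.KeyCount
	ms.ValCount += rhs.ValCount
}
```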
   319  
   320  #### Transfer of power
   321  
   322  The trickiest part of the merge transaction is making the LHS responsible for
   323  the keyspace that was previously owned by the RHS. The transfer of power must be
   324  performed atomically; otherwise, all manner of consistency violations can occur.
   325  This is hopefully obvious, but here's a quick example to drive the point home.
   326  Suppose _P_ and _Q_ simultaneously consider themselves responsible for the key
   327  _q_. _Q_ could then allow a write to _q_ at time 1 at the same time that _P_
   328  allowed a read of _q_ at time 2. Consistency requires that either the read see
   329  the write, as the read is executing at a higher timestamp, or that the write is
   330  bumped to time 3. But because _P_ and _Q_ have separate span latch managers, no
   331  synchronization will occur, and the read might fail to see the write!
   332  
   333  The transfer of power is complicated by the fact that there is no guarantee that
   334  the leases on the LHS and the RHS will be aligned, nor is there any
   335  straightforward way to provide such a guarantee. (Aligned leaseholders would
   336  allow the merge transaction to use a simple in-memory lock to make the transfer
   337  of power atomic.) The leaseholder of either range might fail at any moment, at
   338  which point the lease can be acquired, after it expires, by any other live
   339  member of the range.
   340  
Since aligning the leaseholders is infeasible, the merge transaction implements
   342  what is essentially a distributed lock.
   343  
   344  The lock is initialized by sending a [Subsume][subsume-request] request to the
   345  RHS. This is a single-purpose request that exists solely for use in the merge
   346  transaction. It is unlikely to ever be useful in another situation, and (ab)uses
   347  several implementation details to provide the necessary synchronization.
   348  
   349  When the Subsume request returns, the RHS has made three important promises:
   350  
   351    1. There are no commands in flight.
   352    2. Any future requests will block until the merge transaction completes.
   353       If the merge transaction commits, the requests will be bounced with a
   354       RangeNotFound error. If the merge transaction aborts, the requests will
   355       be processed as normal.
   356    3. If the RHS loses its lease, the new leaseholder will adhere to promises
   357       1 and 2.
   358  
   359  The Subsume request provides promise 1 by declaring that it reads and writes all
   360  addressable keys in the range. This is a bald-faced lie, as the Subsume request
   361  only reads one key and writes no keys, but it forces synchronization with all
   362  latches in the span latch manager, as no other commands can possibly execute in
   363  parallel with a command that claims to write all keys.
   364  
   365  **TODO(benesch,nvanbenschoten):** Actually, concurrent reads at lower timestamps
   366  are permitted. Is this a problem? Maybe. Answering this question is difficult
   367  and requires reasoning about the causal chain established by the sequence of
   368  requests sent by the merge transaction.
   369  
   370  It provides promise 2 by flipping [a bit][merge-bit] on the replica that
   371  indicates that a subsumption is in progress. When the bit is active, the
   372  replica blocks processing of all requests.
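
Conceptually, the merge "bit" behaves like a gate that incoming requests wait
on. The sketch below models it with a channel; the names and fields are
illustrative (and synchronization is elided), not the actual replica state.

```go
package example

import (
	"context"
	"errors"
)

var errRangeMergedAway = errors.New("range was merged away")

// replicaSketch models only the merge-related state of a replica.
type replicaSketch struct {
	// mergeGate is non-nil while a subsumption is in progress; it is closed by
	// the watcher goroutine once the merge transaction completes.
	mergeGate chan struct{}
	// destroyed is set by the watcher if the merge committed.
	destroyed bool
}

// maybeWaitForMerge blocks a request while a subsumption is in progress. If
// the merge committed, the request is bounced so that DistSender retries it
// against the subsuming range; if it aborted, processing continues as normal.
func (r *replicaSketch) maybeWaitForMerge(ctx context.Context) error {
	gate := r.mergeGate
	if gate == nil {
		return nil // no merge in progress
	}
	select {
	case <-gate:
		if r.destroyed {
			return errRangeMergedAway
		}
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```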
   373  
   374  Importantly, the bit needs to be cleared when the merge transaction completes,
   375  so that the requests are not blocked forever. This is the responsibility of the
   376  [merge watcher goroutine][watcher]. Determining whether a transaction has
   377  committed or not is conceptually simple, but the details are brutally
   378  complicated. See the [transaction record GC](#transaction-record-gc) section
   379  for details.
   380  
   381  Note that the Subsume request only runs on the leaseholder, and therefore the
   382  merge bit is only set on the leaseholder and the watcher goroutine only runs on
   383  the leaseholder. This is perfectly safe, as none of the follower replicas can
   384  process requests.
   385  
   386  Promise 3 is actually not provided by the Subsume request at all, but by a hook
   387  in the lease acquisition code path. Whenever a replica acquires a lease, it
   388  checks to see whether its local descriptor has a deletion intent. If it does, it
   389  can infer that a subsumption is in progress, as nothing else leaves a deletion
   390  intent on a range descriptor. In that case, the replica, instead of serving
   391  traffic, flips the merge bit and launches its own merge watcher goroutine, just
   392  as the Subsume command would have. This means there can actually be multiple
   393  replicas of the RHS with the merge bit set and a merge watcher goroutine
   394  running—assuming the old leaseholder did not crash but lost its lease for other
   395  reasons—but this does not cause any problems.
   396  
   397  [subsume-request]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/batcheval/cmd_subsume.go#L56-L86
   398  [merge-bit]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/replica.go#L361-L364
   399  [watcher]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/replica.go#L2817-L2926
   400  
   401  ### Snapshots
   402  
A replica of the LHS might be advanced past the command that commits a merge by
a Raft snapshot.
   404  That means that all the complicated bookkeeping that normally takes place when a
   405  replica processes a command with a non-nil `ReplicatedEvalResult.Merge` is
   406  entirely skipped! Most problematically, the snapshot will need to widen the
   407  receiving replica, but there will be a replica of the RHS in the way—remember,
   408  this is guaranteed by replica set alignment. In fact, the snapshot could be
   409  advancing over several merges, or a combination of several merges and splits, in
   410  which case there will be several RHSes to subsume.
   411  
   412  This turns out to be relatively straightforward to handle. If an initialized
replica receives a snapshot that widens it, it can infer that a merge occurred,
   414  and it simply subsumes all replicas that are overlapped by the snapshot in one
   415  shot. This requires the same delicate synchronization dance, mentioned at the
end of the [merge transaction](#merge-transaction) section, to update bookkeeping
   417  information. After all, applying a widening snapshot is simply the bulk version
   418  of applying a merge command directly. The details are too complicated to go into
   419  here, but you can begin your own exploration by starting with this call to
   420  [`Replica.maybeAcquireSnapshotMergeLock`][code-start] and tracing how the
   421  returned `subsumedRepls` value is used.
   422  
   423  [code-start]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/replica.go#L4071-L4072
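
To see what "subsumes all replicas that are overlapped by the snapshot" means
in practice, here is a simplified sketch of identifying the overlapped
neighbors. The types are stand-ins, and the real logic additionally locks the
store and each subsumed replica.

```go
package example

import "bytes"

// span is a simplified [start, end) key span covered by a replica.
type span struct {
	Start, End []byte
}

// subsumedBySnapshot returns the spans of existing replicas on this store that
// fall entirely within the widened bounds carried by an incoming snapshot.
// These replicas must be removed as part of applying the snapshot, exactly as
// if their merges had been applied one at a time.
func subsumedBySnapshot(existing []span, current, snap span) []span {
	var subsumed []span
	for _, r := range existing {
		// Skip the replica that is being widened itself.
		if bytes.Equal(r.Start, current.Start) {
			continue
		}
		// A right-hand neighbor is subsumed if it starts at or after the
		// current end key and ends at or before the snapshot's new end key.
		if bytes.Compare(r.Start, current.End) >= 0 && bytes.Compare(r.End, snap.End) <= 0 {
			subsumed = append(subsumed, r)
		}
	}
	return subsumed
}
```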
   424  
   425  ### Merge queue
   426  
   427  The merge queue, like most of the other queues in the system, runs on every
   428  store and periodically scans all replicas for which that store holds the lease.
   429  For each replica, the merge queue evaluates whether it should be merged with
   430  its right neighbor. Looking rightward is a natural fit for the reasons
   431  described in the [key encoding oddities](#key-encoding-oddities) section of
   432  the appendix.
   433  
   434  In some ways, the merge queue has an easy job. For any given range _P_ and its
   435  right neighbor _Q_, the merge queue synthesizes the hypothetical merged range
   436  _PQ_ and asks whether the split queue would immediately split that merged range.
   437  If the split queue would immediately split _PQ_, then obviously _P_ and _Q_
   438  should not be merged; otherwise, the ranges _should_ be merged! This means that
   439  any improvement to our split heuristics also improves our merge heuristics with
   440  essentially no extra work. For example, load-based splitting hardly required any
   441  changes to the merge queue.
   442  
   443  Note that, to avoid thrashing, ranges at or above the minimum size threshold
   444  (8MB) are never considered for a merge. The minimum size threshold is
   445  configurable on a per-zone basis.
   446  
   447  Unfortunately, constructing the hypothetical merged range _PQ_ requires
   448  information about _Q_ that only _Q_'s leaseholder maintains, like the amount of
   449  load that _Q_ is currently experiencing. The merge queue must send a RangeStats
   450  RPC to collect this information from _Q_'s leaseholder, because there is no
   451  guarantee that the current store is _Q_'s leaseholder—or that the current store
   452  even has a replica of _Q_ at all.
   453  
   454  To prevent unnecessary RPC chatter, the merge queue uses several heuristics to
avoid sending RangeStats requests when it seems like the merge is unlikely to be
   456  permitted. For example, if it determines that _P_ and _Q_ store different
   457  tables, the split between them is mandatory and it won't bother sending the
   458  RangeStats requests. Similarly, if _P_ is above the minimum size threshold, it
   459  doesn't bother asking about _Q_.
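
Putting these heuristics together, the merge queue's decision for a range and
its right neighbor can be sketched as below. All names are illustrative; the
real decision also consults zone configs and checks for mandatory split points
such as table boundaries.

```go
package example

// rangeStatsSketch carries the statistics the merge queue needs about a range.
type rangeStatsSketch struct {
	SizeBytes int64
	QPS       float64
}

type mergeDeciderSketch struct {
	// minBytes is the per-zone minimum range size threshold.
	minBytes int64
	// wouldSplit reports whether the split queue would immediately split a
	// range with the given stats (size- or load-based).
	wouldSplit func(rangeStatsSketch) bool
	// fetchRHSStats issues a RangeStats RPC to the RHS's leaseholder.
	fetchRHSStats func() (rangeStatsSketch, error)
}

// shouldMerge decides whether the LHS should be merged with its right neighbor.
func (d mergeDeciderSketch) shouldMerge(lhs rangeStatsSketch) (bool, error) {
	// Cheap local check first: a range at or above the minimum size threshold
	// is never merged, so don't bother asking about the RHS.
	if lhs.SizeBytes >= d.minBytes {
		return false, nil
	}
	rhs, err := d.fetchRHSStats()
	if err != nil {
		return false, err
	}
	// Synthesize the hypothetical merged range and ask whether the split
	// queue would immediately split it again.
	merged := rangeStatsSketch{
		SizeBytes: lhs.SizeBytes + rhs.SizeBytes,
		QPS:       lhs.QPS + rhs.QPS,
	}
	return !d.wouldSplit(merged), nil
}
```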
   460  
   461  ## Subtle complexities
   462  
   463  ### Range descriptor generations
   464  
   465  There was one knotty race that was not immediately eliminated by transaction
   466  serializability. Suppose we have our standard aligned replica set situation:
   467  
   468  ```
   469  Store 1    Store 2     Store 3     Store 4
   470  +-----+    +-----+     +-----+     +-----+
   471  | P Q |    | P Q |     |     |     | P Q |
   472  +-----+    +-----+     +-----+     +-----+
   473  ```
   474  
In an unfortunate twist of fate, a rebalance of _P_ from store 2 to store 3
begins at the same time as a merge of _P_ and _Q_. Let's quickly cover the valid
outcomes of this race.
   478  
   479    1. The rebalance commits before the merge. The merge must abort, as the
     replica sets of _P_ and _Q_ are no longer aligned.
   481  
   482    2. The merge commits before the rebalance starts. The rebalance should
   483       voluntarily abort, as the decision to rebalance P needs to be updated in
   484       light of the merge. It is not, however, a correctness problem if the
   485       rebalance commits; it simply results in rebalancing a larger range than
   486       may have been intended.
   487  
   488    3. The merge commits before the rebalance ends, but after the rebalance has
   489       sent a preemptive snapshot to store 3. The rebalance must abort, as
   490       otherwise the preemptive snapshot it sent to store 3 is a ticking time
   491       bomb.
   492  
   493       To see why, suppose the rebalance commits. Since the preemptive snapshot
   494       predates the commit of the merge transaction, the new replica on store 3
   495       will need to be streamed the Raft command that commits the merge
   496       transaction. But applying this merge command is disastrous, as store 3
   497       does not have a replica of Q to merge! This is a very subtle way in which
   498       replica set alignment can be subverted.
   499  
   500  Guaranteeing the correct outcome in case 1 is easy. The merge transaction simply
   501  checks for replica set alignment by transactionally reading the range descriptor
   502  for _P_ and the range descriptor for _Q_ and verifying that they list the same
   503  replicas. Serializability guarantees the rest.
   504  
   505  Case 2 is similarly easy to handle. The rebalance transaction simply verifies
   506  that the range descriptor used to make the rebalance decision matches the range
   507  descriptor that it reads transactionally.
   508  
   509  Case 3, however, has an extremely subtle pitfall. It seems like the solution for
   510  case 2 should apply: simply abort the transaction if the range descriptor
   511  changes between when the preemptive snapshot is sent and when the rebalance
   512  transaction starts. But, as it turns out, this is not quite foolproof. What if,
   513  between when the preemptive snapshot is sent and when the rebalance transaction
   514  starts, _P_ and _Q_ merge together and then split at exactly the same key? The
   515  range descriptor for _P_ will look entirely unchanged to the rebalance
   516  transaction!
   517  
   518  The solution was to add a generation counter to the range descriptor:
   519  
   520  ```protobuf
   521  message RangeDescriptor {
   522      // ...
   523  
   524      // generation is incremented on every split and every merge, i.e., whenever
   525      // the end_key of this range changes. It is initialized to zero when the range
   526      // is first created.
   527      optional int64 generation = 6;
   528  }
   529  ```
   530  
   531  It is no longer possible for a range descriptor to be unchanged by a sequence of
   532  splits and merges, as every split and merge will bump the generation counter.
   533  Rebalances can thus detect if a merge commits between when the preemptive
   534  snapshot is sent and when the transaction begins, and abort accordingly.
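
With the generation counter in place, the rebalance's pre-commit check can be a
straightforward descriptor comparison, because any intervening split or merge
is guaranteed to have bumped the generation. A sketch, with a stand-in
descriptor type:

```go
package example

import "bytes"

// descSketch is a stand-in carrying only the fields that matter here.
type descSketch struct {
	StartKey, EndKey []byte
	Generation       int64
}

// descriptorUnchanged reports whether the descriptor read inside the rebalance
// transaction matches the one used to make the rebalance decision (and to send
// the preemptive snapshot). A merge followed by a split at the same key leaves
// the bounds identical but bumps Generation twice, so the check still fails,
// as it must.
func descriptorUnchanged(decision, current descSketch) bool {
	return decision.Generation == current.Generation &&
		bytes.Equal(decision.StartKey, current.StartKey) &&
		bytes.Equal(decision.EndKey, current.EndKey)
}
```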
   535  
   536  ### Misaligned replica sets
   537  
   538  An early implementation allowed merges between ranges with misaligned replica
   539  sets. The intent was to simplify matters by avoiding replica rebalancing.
   540  
   541  Consider again our example misaligned replica set:
   542  
   543  ```
   544  Store 1    Store 2     Store 3     Store 4
   545  +-----+    +-----+     +-----+     +-----+
   546  | P Q |    |  P  |     |  Q  |     | P Q |
   547  +-----+    +-----+     +-----+     +-----+
   548  
   549  P: (s1, s2, s4)
   550  Q: (s1, s3, s4)
   551  ```
   552  
   553  Note that there are two perspectives shown here. The store boxes represent the
   554  replicas that are *actually* present on that store, from the perspective of the
   555  store itself. The descriptor tuples at the bottom represent the stores that are
   556  considered to be members of the range, from the perspective of the most recently
   557  committed range descriptor.
   558  
   559  Now, to merge _P_ and _Q_ in this situation without aligning their replica sets,
   560  store 2 needed to be provided a copy of store 3's data. To accomplish this, a
   561  copy of _Q_'s data was stashed in the merge trigger, and _P_ would write this
   562  data into its store when applying the merge trigger.
   563  
   564  There was, sadly, a large and unforeseen problem with lagging replicas. Suppose
   565  store 2 loses network connectivity a moment before ranges _P_ and _Q_ are
   566  merged. Note that store 2 is not required for _P_ and _Q_ to merge, because only
   567  a quorum is required on the LHS to commit a merge. Now the situation looks like
   568  this:
   569  
   570  ```
   571  Store 1    Store 2     Store 3     Store 4
   572  +-----+    xxxxxxx     +-----+     +-----+
   573  | PQ  |    |  P  |     |     |     | PQ  |
   574  +-----+    xxxxxxx     +-----+     +-----+
   575  
   576  PQ: (s1, s2, s4)
   577  ```
   578  
   579  There is nothing stopping the newly merged _PQ_ range from immediately splitting
into _P_ and _Q'_. Note that _P_ is the same range as the original _P_ (i.e., it
has the same range ID), and so store 2's replica of _P_ is still considered a
member, while _Q'_ is a new range, with a new ID, that is unrelated to _Q_:
   583  
   584  ```
   585  Store 1    Store 2     Store 3     Store 4
   586  +-----+    xxxxxxx     +-----+     +-----+
   587  | P Q'|    |  P  |     |     |     | P Q'|
   588  +-----+    xxxxxxx     +-----+     +-----+
   589  
   590  P:  (s1, s2, s4)
   591  Q': (s1, s2, s4)
   592  ```
   593  
   594  When store 2 comes back online, it will start catching up on missed messages.
But notice how the meta ranges consider store 2 to be a member of _Q'_, because
it was a member of _PQ_ before the split. The leaseholder for _Q'_ will notice
   597  that store 2's replica is out of date and send over a snapshot so that store 2
   598  can initialize its replica... and all that might happen before store 2 manages
   599  to apply the merge command for _PQ_. If so, applying the merge command for _PQ_
   600  will explode, because the keyspace of the merged range _PQ_ intersects with the
   601  keyspace of _Q'_!
   602  
   603  By requiring aligned replica sets, we sidestep this problem. The RHS is, in
   604  effect, a lock on the post-merge keyspace. Suppose we find ourselves in the
   605  analogous situation with replica sets aligned:
   606  
   607  ```
   608  Store 1    Store 2     Store 3     Store 4
   609  +-----+    xxxxxxx     +-----+     +-----+
   610  | P Q'|    | P Q |     |     |     | P Q'|
   611  +-----+    xxxxxxx     +-----+     +-----+
   612  
   613  P:  (s1, s2, s4)
   614  Q': (s1, s2, s4)
   615  ```
   616  
   617  Here, _PQ_ split into _P_ and _Q'_ immediately after merging, but notice how
   618  store 2 has a replica of both _P_ and _Q_ because we required replica set
   619  alignment during the merge. That replica of _Q_ prevents store 2 from
   620  initializing a replica of _Q'_ until either store 2's replica of _P_ applies the
   621  merge command (to _PQ_) and the split command (to _P_ and _Q'_), or store 2's
   622  replica of _P_ is rebalanced away.
   623  
   624  ### Replica GC
   625  
   626  Per the discussion in the last section, we use the replica of the RHS as a lock
   627  on the keyspace extension. This means that we need to be careful not to GC this
   628  replica too early.
   629  
   630  It's easiest to see why this is a problem if we consider the case where one
   631  replica is extremely slow in applying a merge:
   632  
   633  ```
   634  Store 1    Store 2     Store 3     Store 4
   635  +-----+    +-----+     +-----+     +-----+
   636  | PQ  |    | PQ  |     |     |     | P Q |
   637  +-----+    +-----+     +-----+     +-----+
   638  
   639  PQ: (s1, s2, s4)
   640  ```
   641  
   642  Here, _P_ and _Q_ have just merged. Store 4 hasn't yet processed the merge while
   643  stores 1 and 2 have.
   644  
   645  The replica GC queue is continually scanning for replicas that are no longer a
   646  member of their range. What if the replica GC queue on store 4 scans its replica
   647  of _Q_ at this very moment? It would notice that the _Q_ range has been merged
   648  away and, conceivably, conclude that _Q_ could be garbage collected. This would
   649  be disastrous, as when _P_ finally applied the merge trigger it would no longer
   650  have a replica of _Q_ to subsume!
   651  
   652  One potential solution would be for the replica GC queue to refuse to GC
   653  replicas for ranges that have been merged away. But that could result in
   654  replicas getting permanently stuck. Suppose that, before store 4 applies the
   655  merge transaction, the _PQ_ range is rebalanced away to store 3:
   656  
   657  ```
   658  Store 1    Store 2     Store 3     Store 4
   659  +-----+    +-----+     +-----+     +-----+
   660  | PQ  |    | PQ  |     | PQ  |     | P Q |
   661  +-----+    +-----+     +-----+     +-----+
   662  
   663  PQ: (s1, s2, s3)
   664  ```
   665  
   666  Store 4's replica of _P_ will likely never hear about the merge, as it is no
   667  longer a member of the range and therefore not receiving any additional Raft
   668  messages from the leader, so it will never subsume _Q_. The replica GC queue
   669  _must_ be capable of garbage collecting _Q_ in this case. Otherwise _Q_ will
   670  be stuck on store 4 forever, permanently preventing the store from ever
   671  acquiring a new replica that overlaps with that keyspace.
   672  
   673  Solving this problem turns out to be quite tricky. What the replica GC queue
   674  wants to know when it discovers that _Q_'s range has been subsumed is whether
   675  the local replica of _Q_ might possibly still be subsumed by its local left
   676  neighbor _P_. It can't just ask the local _P_ whether it's about to apply a
   677  merge, since _P_ might be lagging behind, as it is here, and have no idea that a
   678  merge is about to occur.
   679  
   680  So the problem reduces to proving that _P_ cannot apply a merge trigger that
   681  will subsume _Q_. The chosen approach is to fetch the current range descriptor
   682  for _P_ from the meta index. If that descriptor exactly matches the local
   683  descriptor, thanks to [range descriptor generations](#range-descriptor-generations),
   684  we are assured that there are no merge triggers that _P_ has yet to apply, and
   685  _Q_ can safely be GC'd.
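
In sketch form, the replica GC queue's check for a subsumed range reduces to a
generation comparison between the store's local copy of the left neighbor and
the authoritative copy fetched from the meta index. The type and helper below
are stand-ins; the real check also handles the case where the store has no
left-neighbor replica at all.

```go
package example

// descGenSketch is a stand-in carrying only a descriptor's generation.
type descGenSketch struct {
	Generation int64
}

// canGCSubsumedReplica reports whether a replica whose range has been merged
// away can safely be garbage collected. localLHS is the store's local copy of
// the left neighbor's descriptor; metaLHS is the copy fetched from the meta
// index. If the local copy is at least as new as the meta copy, the left
// neighbor has no unapplied merge triggers that could still need the RHS.
func canGCSubsumedReplica(localLHS, metaLHS descGenSketch) bool {
	return localLHS.Generation >= metaLHS.Generation
}
```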
   686  
   687  Note that it is possible to form long chains of replicas that can only be GC'd
   688  from left to right; the GC queue is not aware of these dependencies and
   689  therefore processes such chains extremely inefficiently (i.e., by processing
   690  replicas in an arbitrary order instead of the necessary order). These chains
   691  turn out to be extremely rare in practice.
   692  
   693  There is one additional subtlety here. Suppose we have two adjacent ranges, _O_
   694  and _Q_. _O_ has just split into _O_ and _P_, but store 3 is lagging and has not
   695  yet processed the split.
   696  
   697  ```
   698  STATE 1
   699  
   700   Store 1      Store 2      Store 3     Store 4
   701  +-------+    +-------+    +-------+   +-------+
   702  | O P Q |    | O P Q |    | O   Q |   |       |
   703  +-------+    +-------+    +-------+   +-------+
   704  
   705  O: (s1, s2, s3)
   706  P: (s1, s2, s3)
   707  Q: (s1, s2, s3)
   708  ```
   709  
   710  At this point, suppose the leader for the new range _P_ decides that store 3
   711  will need a snapshot to catch up, and starts sending the snapshot over the
   712  network. This will be important later. At the same time, _P_ and _Q_ merge while
   713  store 3 is still lagging.
   714  
   715  It may seem strange that this merge is permitted, but notice how the replica
   716  sets are aligned according to the descriptors, even though store 3 does not
   717  physically have a replica of _P_ yet. Here's the new state of the world:
   718  
   719  ```
   720  STATE 2
   721  
   722   Store 1      Store 2      Store 3     Store 4
   723  +-------+    +-------+    +-------+   +-------+
   724  | O  PQ |    | O  PQ |    | O   Q |   |       |
   725  +-------+    +-------+    +-------+   +-------+
   726  
   727  O:  (s1, s2, s3)
   728  PQ: (s1, s2, s3)
   729  ```
   730  
   731  Finally, _O_ is rebalanced from store 3 to store 4 and garbage collected on
   732  store 4:
   733  
   734  ```
   735  STATE 3
   736  
   737   Store 1      Store 2      Store 3     Store 4
   738  +-------+    +-------+    +-------+   +-------+
   739  | O  PQ |    | O  PQ |    |     Q |   | O     |
   740  +-------+    +-------+    +-------+   +-------+
   741  
   742  O:  (s1, s2, s4)
   743  PQ: (s1, s2, s3)
   744  ```
   745  
   746  The replica GC queue might reasonably think that store 3's replica of _Q_ is out
   747  of date, as _Q_ has no left neighbor that could subsume it. But, at any moment
   748  in time, store 3 could finish receiving the snapshot for _P_ that was started
   749  between state 1 and state 2. Crucially, this snapshot predates the merge, so it
   750  will need to apply the merge trigger... and the replica for _Q_ had better be
   751  present on the store!
   752  
   753  This hazard is avoided by requiring that all replicas of the LHS are initialized
   754  before a merge begins. This prevents a transition from state 1 to state 2, as
   755  the merge of _P_ and _Q_ cannot occur until store 3 initializes its replica of
   756  _P_. The AdminMerge command will wait a few seconds in the hope that store 3
   757  catches up quickly; otherwise, it will refuse to launch the merge transaction.
   758  It is therefore impossible to end up in a dangerous state, like state 3, and it
   759  is thus safe for the replica GC queue to GC _Q_ if its left neighbor is
   760  generationally up to date.
   761  
   762  ### Transaction record GC
   763  
   764  The merge watcher goroutine needs to wait until the merge transaction completes,
and determine whether the transaction committed or aborted. This turns
   766  out to be brutally complicated, thanks to the aggressive garbage collection of
   767  transaction records.
   768  
   769  The watcher goroutine begins by sending a PushTxn request to the merge
   770  transaction. It can easily discover the necessary arguments for the PushTxn
   771  request, that is, the ID and key of the current merge transaction record,
   772  because they're recorded in the intent that the merge transaction has left on
   773  the RHS's local copy of the descriptor.
   774  
   775  If the PushTxn request reports that the merge transaction committed, we're
   776  guaranteed that the merge transaction did, in fact, complete. That means that we
   777  can mark the RHS replica as destroyed, bounce all requests back to DistSender
   778  (where they'll get retried on the subsuming range), and clean up the watcher
   779  goroutine. The RHS replica will be cleaned up either when the LHS replica
   780  subsumes it or when the replica GC queue notices that it has been abandoned.
   781  
   782  If the PushTxn request instead reports that the merge transaction aborted, we're
   783  not guaranteed that the merge transaction actually aborted. The merge
   784  transaction may have committed so quickly that its transaction record was
   785  garbage collected before our PushTxn request arrived. The PushTxn incorrectly
   786  interprets this state to mean that the transaction was aborted, when, in fact,
   787  it was committed and GCed. To be fair, we're somewhat abusing the PushTxn
   788  request here. Outside of merges, a PushTxn request is only sent when a pending
   789  intent is discovered, and transaction records can't be GCed until all their
   790  intents have been resolved.
   791  
   792  So we need some way to determine whether a merge transaction was actually
   793  aborted. What we do is look for the effects of the merge transaction in meta2.
   794  If the merge aborted, we'll see our own range descriptor, with our range ID, in
   795  meta2. By contrast, if the merge committed, we'll see a range descriptor for a
   796  different range in meta2.
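
A sketch of that fallback check, using stand-in types: look up the meta2
descriptor that covers the RHS's start key and see whether it still refers to
the RHS's own range ID.

```go
package example

import "context"

// metaReaderSketch abstracts a consistent read of the meta2 record covering a
// given key; the real code issues a range-lookup-style request.
type metaReaderSketch interface {
	LookupRangeID(ctx context.Context, key []byte) (int64, error)
}

// mergeCommitted distinguishes "the merge transaction aborted" from "it
// committed and its record was already GC'd" after an ambiguous PushTxn
// response. If meta2 still maps the RHS's start key to the RHS's own range ID,
// the merge must have aborted; if it maps to some other range, the merge
// committed and the RHS has been subsumed.
func mergeCommitted(ctx context.Context, meta metaReaderSketch, rhsStartKey []byte, rhsRangeID int64) (bool, error) {
	id, err := meta.LookupRangeID(ctx, rhsStartKey)
	if err != nil {
		return false, err
	}
	return id != rhsRangeID, nil
}
```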
   797  
   798  This complexity is extremely unfortunate, and turns what should be a simple
goroutine, spawned on the RHS leaseholder for every merge transaction,
   800  
   801  ```go
   802  go func() {
   803      <-txn.Done() // wait for txn to complete
   804      if txn.Committed() {
   805          repl.MarkDestroyed("replica subsumed")
   806      }
   807      repl.UnblockRequests()
}()
   809  ```
   810  
   811  into [150 lines of hard to follow code][code].
   812  
   813  [code]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/replica.go#L2813-L2920
   814  
   815  ### Unanimity
   816  
   817  The largest conceptual incongruity with the current merge implementation is the
   818  fact that it requires unanimous consent from all replicas, instead of a majority
   819  quorum, like everything else in Raft. Further confusing matters, only the RHS
   820  needs unanimous consent; a merge can proceed with only majority consent from the
   821  LHS. In fact, it's even a bit more subtle: while only a majority of the LHS
   822  replicas need to vote on the merge command, all LHS replicas need to confirm
   823  that they are initialized for the merge to start.
   824  
   825  There is no theoretical reason that merges need unanimous consent, but the
   826  complexity of the implementation quickly skyrockets without it. For example,
   827  suppose you adjusted the transfer of power so that only a majority of replicas
   828  on the RHS need to be fully up to date before the merge commits. Now, when
   829  applying the merge trigger, the LHS needs to check to see if its copy of the RHS
   830  is up to date; if it's not, the LHS needs to throw away its entire Raft state
   831  and demand a snapshot from the leader. This is both unsightly—our code is worse
   832  off every time we reach into Raft—and less efficient than the existing
   833  implementation, as it requires sending a potentially multi-megabyte snapshot if
   834  one replica of the RHS is just a little bit behind in applying the latest
   835  commands.
   836  
   837  It's possible that these problems could be mitigated while retaining the ability
   838  to merge with a minority of replicas offline, but an obvious solution did not
   839  present itself. On the bright side, having too many ranges is unlikely to cause
   840  acute performance problems; that is, a situation where a merge is critical to
   841  the health of a cluster is difficult to imagine. Unlike large ranges, which can
   842  appear suddenly and require an immediate split or log truncation, merges are
   843  only required when there are on the order of tens of thousands of excessively
   844  small ranges, which takes a long time to build up.
   845  
   846  ## Safety recap
   847  
   848  This section is a recap of the various mechanisms, which are described in
   849  detail above, that work together to ensure that merges do not violate
   850  consistency.
   851  
   852  The first broad safety mechanism is replica set alignment, which is required so
   853  that every store participating in the merge has a copy of both halves of the
   854  data in the merged range. Replica sets are first optimistically aligned by the
merge queue. The replica sets might drift apart, e.g., because the ranges in
   856  question were also targeted for a rebalance by the replicate queue, so the
   857  merge transaction verifies that the replica sets are still aligned from within
   858  the transaction. If a concurrent split or rebalance were to occur on the
   859  implicated ranges, transactional isolation kicks in and aborts one of the
   860  transactions, so we know that the replica sets are still aligned at the moment
   861  that the merge commits.
   862  
   863  Crucially, we need to maintain alignment until the merge applies on all replicas
   864  that were present at the time of the merge. This is enforced by refusing to
   865  GC a replica of the RHS of a merge unless it can be proven that the store does
   866  not have a replica of the LHS that predates the merge, _nor_ will it acquire
   867  a replica of the LHS that predates the merge. Proving that it does not currently
   868  have a replica of the LHS that predates the merge is fairly straightforward:
   869  we simply prove that the local left neighbor's generation is the newest
   870  generation, as indicated by the LHS's meta descriptor. Proving that the store
   871  will _never_ acquire a replica of the LHS that predates the merge is harder—
   872  there could be a snapshot in flight that the LHS is entirely unaware of. So
   873  instead we require that replicas of the LHS in a merge are initialized on every
   874  store before the merge can begin.
   875  
   876  The second broad safety mechanism is the range freeze. This ensures that the
   877  subsuming range and the subsumed range do not serve traffic at the same time,
   878  which would lead to clear consistency violations. The mechanism works by tying
   879  the freeze to the lifetime of the merge transaction; the merge will not commit
   880  until all replicas of the RHS are verified to be frozen, and the replicas of the
   881  RHS will not unfreeze unless the merge transaction is verified to be aborted.
   882  Lease transfers are freeze-aware, so the freeze will persist even if the lease
   883  moves around on the RHS during the merge or if the leaseholder restarts. The
implementation of the freeze (ab)uses the span latch manager to flush out
in-flight commands on the RHS, an intent on the local range descriptor to
ensure the freeze persists if the lease is transferred, and an RPC that
repeatedly polls the RHS to wait until it is fully caught up.
   888  
   889  ## Appendix
   890  
   891  The appendix contains digressions that are not directly pertinent to range
   892  merges, but are not covered in documentation elsewhere.
   893  
   894  ### Key encoding oddities
   895  
   896  Lexicographic ordering of keys of unbounded length has the interesting property
   897  that it is always possible to construct the key that immediately succeeds a
   898  given key, but it is not always possible to construct the key that immediately
   899  precedes a given key.
   900  
   901  In the following diagrams `\xNN` represents a byte whose value in hexadecimal is
   902  `NN`. The null byte is thus `\x00` and the maximal byte is thus `\xff`.
   903  
   904  Now consider a sequence of keys that has no gaps:
   905  
   906  ```
   907  a
   908  a\x00
   909  a\x00\x00
   910  ```
   911  
   912  No gaps means that there are no possible keys that can sort between any of the
   913  members of the sequence. For example, there is, simply, no key that sorts
   914  between `a` and `a\x00`.
   915  
   916  Because we can construct such a sequence, we must have next and previous
   917  operations over the sequence, which, given a key, construct the immediately
   918  following key and the immediately preceding key, respectively. We can see from
   919  the diagram that the next operation appends a null byte (`\x00`), while the
   920  previous operation strips off that null byte.
   921  
   922  But what if we want to perform the previous operation on a key that does not end
   923  in a null byte? For example, what is the key that immediately precedes `b`? It's
   924  not `a`, because `a\x00` sorts between `a` and `b`. Similarly, it's not `a\xff`,
   925  because `a\xff\xff` sorts between `a\xff` and `b`. This process continues
   926  inductively until we conclude the key that immediately precedes `b` is
   927  `a\xff\xff\xff...`, where there are an infinite number of trailing `\xff` bytes.
   928  
   929  It is not possible to represent this key in CockroachDB without infinite space.
   930  You could imagine designing the key encoding with an additional bit that means,
   931  "pretend this key has an infinite number of trailing maximal bytes," but
   932  CockroachDB does not have such a bit.
   933  
   934  The upshot is that it is trivial to advance in the keyspace using purely
   935  lexical operations, but it is impossible to reverse in the keyspace with purely
   936  lexical operations.
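
A short sketch of the asymmetry in code: the "next key" operation is a one-line
append, while a "previous key" operation only exists for keys that happen to
end in a null byte.

```go
package main

import "fmt"

// nextKey returns the key that immediately follows k in lexicographic order:
// simply append a null byte. Every key has a well-defined successor.
func nextKey(k []byte) []byte {
	return append(append([]byte(nil), k...), 0x00)
}

// prevKey returns the key that immediately precedes k, which is only
// representable when k ends in a null byte (strip it off). For any other key,
// e.g. "b", the immediate predecessor would need an infinite run of trailing
// 0xff bytes.
func prevKey(k []byte) ([]byte, bool) {
	if n := len(k); n > 0 && k[n-1] == 0x00 {
		return append([]byte(nil), k[:n-1]...), true
	}
	return nil, false
}

func main() {
	fmt.Printf("%q\n", nextKey([]byte("a"))) // "a\x00"
	if _, ok := prevKey([]byte("b")); !ok {
		fmt.Println(`no representable predecessor for "b"`)
	}
}
```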
   937  
   938  This problem pervades the system. Given a range that spans from `StartKey`,
inclusive, to `EndKey`, exclusive, it is trivial to address a request to the
following range, but *not* to the preceding range. To route a request to a range,
   941  we must construct a key that lives inside that range. Constructing such a key
   942  for the following range is trivial, as the end key of a range is, by definition,
   943  contained in the following range. But constructing such a key for the preceding
   944  range would require constructing the key that immediately precedes `StartKey`,
   945  which is not possible with CockroachDB's key encoding.