- Feature Name: replica_tombstone
- Status: completed
- Start Date: 2015-07-29
- RFC PR: [#1865](https://github.com/cockroachdb/cockroach/pull/1865)
- Cockroach Issue: [#768](https://github.com/cockroachdb/cockroach/issues/768),
                   [#1878](https://github.com/cockroachdb/cockroach/issues/1878),
                   [#2798](https://github.com/cockroachdb/cockroach/pull/2798)

# Summary

Add a **Replica ID** to each replica in a range descriptor, and pass
this replica ID in all raft messages. When a replica is removed from a
node, record a tombstone with the replica ID, and reject any later
messages referring to that replica.

# Motivation

Replica removal is necessarily an asynchronous process -- the node
holding the removed replica may be down at the time of its removal,
and so any recovering node may have some replicas that should have
been removed but have not yet been garbage-collected. These nodes may
try to send raft messages that could disrupt the legitimate members of
the group. Even worse, if there has been enough turnover in the
membership of a group, a quorum of removed replicas may manage to
elect a lease holder among themselves.

We have an additional complication compared to vanilla raft because we
allow node IDs to be reused (this is necessary for coalesced
heartbeats). We may remove a replica from a node and then re-add it
later with the same node ID, so we must be able to distinguish
messages from an earlier epoch of the range.

Here is a scenario that can lead to split-brain in the current system:

1. A range R has replicas on nodes A, B, and C; C is down.
2. Nodes A and B execute a `ChangeReplicas` transaction to remove
   node C. Several more `ChangeReplicas` transactions follow, adding
   nodes D, E, and F and removing A and B.
3. Nodes A and B garbage-collect their copies of the range.
4. Node C comes back up. When it doesn't hear from the lease holder of
   range R, it starts an election.
5. Nodes A and B see that node C has a more advanced log position for
   range R than they do (since they have nothing), so they vote for it.
   C becomes lease holder and sends snapshots to A and B.
6. There are now two "live" versions of the range. Clients
   (`DistSenders`) whose range descriptor cache is out of date may
   talk to the ABC group instead of the correct DEF group.

The problems caused by removing replicas are also discussed in
[#768](https://github.com/cockroachdb/cockroach/issues/768).

# Detailed design

A **Replica ID** is generated with every `ChangeReplicas` call and
stored in the `Replica` message of the range descriptor. Replica IDs
are monotonically increasing within a range and never reused. They are
generated using a `next_replica_id` field which will be added to the
`RangeDescriptor` (alternative generation strategies include using the
raft log position or database timestamp of the `EndTransaction` call
that commits the membership change, but this information is less
accessible at the point where it would be needed).

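To make the allocation scheme concrete, here is a minimal Go sketch.
The struct and method names are illustrative only; the real
definitions are protobuf messages:

```go
// Illustrative stand-ins for the protobuf messages; the field of
// interest is NextReplicaID, the source of monotonically increasing,
// never-reused replica IDs.
type ReplicaDescriptor struct {
	NodeID    int32
	StoreID   int32
	ReplicaID int32 // unique within the range, never reused
}

type RangeDescriptor struct {
	RangeID       int64
	Replicas      []ReplicaDescriptor
	NextReplicaID int32
}

// addReplica is a hypothetical helper showing where a ChangeReplicas
// call would allocate an ID: the descriptor update commits atomically
// with the membership change, so each ID is handed out exactly once.
func (d *RangeDescriptor) addReplica(nodeID, storeID int32) ReplicaDescriptor {
	rep := ReplicaDescriptor{
		NodeID:    nodeID,
		StoreID:   storeID,
		ReplicaID: d.NextReplicaID,
	}
	d.NextReplicaID++
	d.Replicas = append(d.Replicas, rep)
	return rep
}
```
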
The `ReplicaID` replaces the current `RaftNodeID` (which is
constructed from the `NodeID` and `StoreID`). Raft only knows about
replica IDs, which are never reused, so we don't have to worry about
certain problems that can arise when node IDs are reused (such as vote
changes or log regression). `MultiRaft` is responsible for mapping
`ReplicaIDs` to node and store IDs (which it must do to coalesce
heartbeats). This is done with a new method on the
`WriteableGroupStorage` interface, along with an in-memory cache. Note
that the node and store IDs for a given replica never change once that
replica is created, so we don't need to worry about synchronizing or
invalidating entries in this cache.

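A sketch of that cache, with invented names: the `fromStorage`
callback stands in for the new `WriteableGroupStorage` method, and
entries are keyed by range because replica IDs are only unique within
a range:

```go
import "sync"

// replicaKey scopes a replica ID to its range.
type replicaKey struct {
	rangeID   int64
	replicaID int32
}

// replicaAddr is the node and store a replica lives on.
type replicaAddr struct {
	nodeID  int32
	storeID int32
}

// addrCache is an illustrative stand-in for the in-memory cache.
// Because a replica's node and store never change once the replica is
// created, entries never need to be invalidated or synchronized
// beyond this simple mutex.
type addrCache struct {
	mu sync.Mutex
	m  map[replicaKey]replicaAddr
}

func (c *addrCache) lookup(k replicaKey,
	fromStorage func(replicaKey) (replicaAddr, bool)) (replicaAddr, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if a, ok := c.m[k]; ok {
		return a, true
	}
	if a, ok := fromStorage(k); ok { // consult persistent storage on a miss
		c.m[k] = a
		return a, true
	}
	return replicaAddr{}, false
}
```
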
The raft transport will send the node, store, and replica ID of both
the sender and receiver with every `RaftMessageRequest`. This is how
the `Store` will learn of the replica ID for new ranges (the sender's
IDs must be inserted in the replica ID cache, so we can route
responses to replicas that do not yet appear in any range descriptor
we have stored). The `Store` will drop incoming messages with a
replica ID that is less than the last known one for the range (in
order to minimize disruption from out-of-date servers).

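The shape of that check might be as follows. This reuses the
`ReplicaDescriptor` sketch above; the `storeState` interface and
helper names are invented for illustration:

```go
// RaftMessageRequest here is an illustrative stand-in for the real
// protobuf message; only the addressing fields are shown.
type RaftMessageRequest struct {
	RangeID     int64
	FromReplica ReplicaDescriptor // sender's node, store, and replica ID
	ToReplica   ReplicaDescriptor // receiver's node, store, and replica ID
	// ... raft payload ...
}

// storeState abstracts the pieces of Store state the check needs.
type storeState interface {
	lastReplicaID(rangeID int64) int32
	cacheReplicaAddr(rangeID int64, rep ReplicaDescriptor)
}

// handleRaftMessage drops messages addressed to a replica ID below
// the last one known for the range, and records the sender's
// addressing info so responses can be routed even before any locally
// stored range descriptor mentions the sender.
func handleRaftMessage(s storeState, req RaftMessageRequest) (dropped bool) {
	if req.ToReplica.ReplicaID < s.lastReplicaID(req.RangeID) {
		return true // message from an earlier epoch of the range
	}
	s.cacheReplicaAddr(req.RangeID, req.FromReplica)
	return false
}
```
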
The `DistSender` will also send the replica ID in the header of every
request, and the receiver will reject requests with incorrect replica
IDs (such mismatches will be rare, since a range will normally be
absent from a node for a time before being re-replicated).

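Receiver-side, the check reduces to a comparison; the function and
error names here are invented:

```go
import "errors"

// errStaleReplica is a placeholder for whatever error the receiver
// would return to make the client refresh its descriptor cache.
var errStaleReplica = errors.New("request addressed to an out-of-date replica")

// verifyReplicaID rejects a request whose header names a replica ID
// other than the one currently serving the range on this store.
func verifyReplicaID(headerReplicaID, localReplicaID int32) error {
	if headerReplicaID != localReplicaID {
		return errStaleReplica
	}
	return nil
}
```
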
When a replica is garbage-collected, we write a **tombstone**
containing the replica ID, so that the store can continue to drop
out-of-date raft messages even after the GC.

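A sketch of the tombstone check, using an in-memory map as an
illustrative stand-in for the persisted record (the tombstone must
outlive the replica's data):

```go
// ReplicaTombstone records the ID of a removed replica.
type ReplicaTombstone struct {
	ReplicaID int32
}

// tombstones maps range ID to the tombstone for that range's most
// recently removed local replica.
type tombstones map[int64]ReplicaTombstone

// shouldDrop reports whether an incoming raft message is addressed to
// a replica at or before the tombstoned one, in which case it must be
// dropped rather than allowed to recreate the replica.
func (t tombstones) shouldDrop(rangeID int64, toReplicaID int32) bool {
	ts, ok := t[rangeID]
	return ok && toReplicaID <= ts.ReplicaID
}
```
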
In the scenario above, with this change node C would send nodes A and
B's replica IDs in its vote requests; A and B would see that they have
tombstones for those replicas and would not recreate them.

# Drawbacks

Tombstones must be long-lived, since a node may come back online after
a lengthy delay. We cannot garbage-collect tombstones unless we also
guarantee that no node will come back after a period of time longer
than the tombstone GC time.

# Alternatives

If we had an explicit replica-creation RPC instead of creating
replicas automatically on the first sighting of a raft message, this
problem might go away. However, doing so would be tricky: this RPC
would need to be retried in certain circumstances, and it is difficult
to distinguish cases where a retry is necessary from cases that lead
to the problem discussed here.

# Unresolved questions

* How long must we wait before garbage-collecting tombstones? Can we
  do it at all?