- Feature Name: replica_tombstone
- Status: completed
- Start Date: 2015-07-29
- RFC PR: [#1865](https://github.com/cockroachdb/cockroach/pull/1865)
- Cockroach Issue: [#768](https://github.com/cockroachdb/cockroach/issues/768),
  [#1878](https://github.com/cockroachdb/cockroach/issues/1878),
  [#2798](https://github.com/cockroachdb/cockroach/pull/2798)

# Summary

Add a **Replica ID** to each replica in a range descriptor, and pass
this replica ID in all raft messages. When a replica is removed from a
node, record a tombstone with the replica ID, and reject any later
messages referring to that replica.

# Motivation

Replica removal is necessarily an asynchronous process -- the node
holding the removed replica may be down at the time of its removal,
and so any recovering node may have some replicas that should have
been removed but have not yet been garbage-collected. These nodes may
try to send raft messages that could disrupt the legitimate members of
the group. Even worse, if there has been enough turnover in the
membership of a group, a quorum of removed replicas may manage to
elect a lease holder among themselves.

We have an additional complication compared to vanilla raft because we
allow node IDs to be reused (this is necessary for coalesced
heartbeats). We may remove a replica from a node and then re-add it
later with the same node ID, so we must be able to distinguish
messages from an earlier epoch of the range.

Here is a scenario that can lead to split-brain in the current system:

1. A range R has replicas on nodes A, B, and C; C is down.
2. Nodes A and B execute a `ChangeReplicas` transaction to remove
   node C. Several more `ChangeReplicas` transactions follow, adding
   nodes D, E, and F and removing A and B.
3. Nodes A and B garbage-collect their copies of the range.
4. Node C comes back up. When it doesn't hear from the lease holder of
   range R, it starts an election.
5. Nodes A and B see that node C has a more advanced log position for
   range R than they do (since they have nothing), so they vote for it.
   C becomes lease holder and sends snapshots to A and B.
6. There are now two "live" versions of the range. Clients
   (`DistSenders`) whose range descriptor cache is out of date may
   talk to the ABC group instead of the correct DEF group.

The problems caused by removing replicas are also discussed in
[#768](https://github.com/cockroachdb/cockroach/issues/768).

# Detailed design

A **Replica ID** is generated with every `ChangeReplicas` call and
stored in the `Replica` message of the range descriptor. Replica IDs
are monotonically increasing within a range and never reused. They are
generated using a `next_replica_id` field which will be added to the
`RangeDescriptor` (alternative generation strategies include using the
raft log position or database timestamp of the `EndTransaction` call
that commits the membership change, but this information is less
accessible at the point where it would be needed).

The `ReplicaID` replaces the current `RaftNodeID` (which is
constructed from the `NodeID` and `StoreID`). Raft only knows about
replica IDs, which are never reused, so we don't have to worry about
certain problems that can arise when node IDs are reused (such as vote
changes or log regression). `MultiRaft` is responsible for mapping
`ReplicaIDs` to node and store IDs (which it must do to coalesce
heartbeats). This is done with a new method on the
`WriteableGroupStorage` interface, along with an in-memory cache. Note
that the node and store IDs for a given replica never change once that
replica is created, so we don't need to worry about synchronizing or
invalidating entries in this cache.

The raft transport will send the node, store, and replica ID of both
the sender and receiver with every `RaftMessageRequest`. This is how
the `Store` will learn of the replica ID for new ranges (the sender's
IDs must be inserted into the replica ID cache so that we can route
responses to replicas that do not yet appear in any range descriptor
we have stored). The `Store` will drop incoming messages with a
replica ID that is less than the last known one for the range (in
order to minimize disruption from out-of-date servers).

The `DistSender` will also send the replica ID in the header of every
request, and the receiver will reject requests with incorrect replica
IDs (this will be rare, since a range will normally be absent from a
node for a time before being re-replicated).

When a replica is garbage-collected, we write a **tombstone**
containing the replica ID, so that the store can continue to drop
out-of-date raft messages even after the GC.

In the scenario above, with this change, node C would send node A and
B's replica IDs in its vote requests. They would see that they have a
tombstone for that replica and not recreate it.
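
To make the two store-side checks concrete -- dropping messages whose
replica ID is older than the last known one, and consulting a
tombstone after garbage collection -- here is a minimal,
self-contained sketch. All of the names in it (`storeState`,
`replicaTombstone`, `shouldDropMessage`, and so on) are hypothetical
illustrations and not the actual types or methods in the codebase.

```go
package main

import "fmt"

// ReplicaID is the per-range, monotonically increasing identifier
// introduced by this RFC. IDs are never reused within a range.
type ReplicaID int

// replicaTombstone is written when a replica is garbage-collected so
// that the store keeps rejecting messages addressed to the removed
// replica (or any older replica ID) after its data is gone.
type replicaTombstone struct {
	nextReplicaID ReplicaID // drop anything below this ID
}

// storeState stands in for the per-store bookkeeping: the replica ID
// currently held for each range, and tombstones for ranges whose
// replicas were removed from this store.
type storeState struct {
	currentReplicaID map[int64]ReplicaID        // rangeID -> live replica ID
	tombstones       map[int64]replicaTombstone // rangeID -> tombstone
}

// garbageCollectReplica removes the local replica of a range and
// records a tombstone so later raft traffic for it is dropped.
func (s *storeState) garbageCollectReplica(rangeID int64) {
	id := s.currentReplicaID[rangeID]
	delete(s.currentReplicaID, rangeID)
	// Any future message must carry a replica ID strictly greater
	// than the one that was just removed.
	s.tombstones[rangeID] = replicaTombstone{nextReplicaID: id + 1}
}

// shouldDropMessage applies the filtering rule from the detailed
// design: drop raft messages whose destination replica ID is older
// than either the replica currently held or a recorded tombstone.
func (s *storeState) shouldDropMessage(rangeID int64, toReplicaID ReplicaID) bool {
	if ts, ok := s.tombstones[rangeID]; ok && toReplicaID < ts.nextReplicaID {
		return true // addressed to a garbage-collected replica
	}
	if cur, ok := s.currentReplicaID[rangeID]; ok && toReplicaID < cur {
		return true // sender is using an out-of-date replica ID
	}
	return false
}

func main() {
	s := &storeState{
		currentReplicaID: map[int64]ReplicaID{1: 3},
		tombstones:       map[int64]replicaTombstone{},
	}
	s.garbageCollectReplica(1)
	// A vote request addressed to the removed replica (ID 3) is
	// dropped; a message for a newer incarnation (ID 7) is not.
	fmt.Println(s.shouldDropMessage(1, 3)) // true
	fmt.Println(s.shouldDropMessage(1, 7)) // false
}
```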

# Drawbacks

Tombstones must be long-lived, since a node may come back online after
a lengthy delay. We cannot garbage-collect tombstones unless we also
guarantee that no node will come back after a period of time longer
than the tombstone GC time.

# Alternatives

If we had an explicit replica-creation RPC instead of creating
replicas automatically on the first sighting of a raft message, this
problem might go away. However, doing so would be tricky: this RPC
would need to be retried in certain circumstances, and it is difficult
to distinguish cases where a retry is necessary from cases that lead
to the problem discussed here.

# Unresolved questions

* How long must we wait before garbage-collecting tombstones? Can we
  do it at all?