- Feature Name: replica_tombstone
- Status: completed
- Start Date: 2015-07-29
- RFC PR: [#1865](https://github.com/cockroachdb/cockroach/pull/1865)
- Cockroach Issue: [#768](https://github.com/cockroachdb/cockroach/issues/768),
                   [#1878](https://github.com/cockroachdb/cockroach/issues/1878),
                   [#2798](https://github.com/cockroachdb/cockroach/pull/2798)

# Summary

Add a **Replica ID** to each replica in a range descriptor, and pass
this replica ID in all raft messages. When a replica is removed from a
node, record a tombstone with the replica ID, and reject any later
messages referring to that replica.

# Motivation

Replica removal is necessarily an asynchronous process -- the node
holding the removed replica may be down at the time of its removal,
and so any recovering node may have some replicas that should have
been removed but have not yet been garbage-collected. These nodes may
try to send raft messages that could disrupt the legitimate members of
the group. Even worse, if there has been enough turnover in the
membership of a group, a quorum of removed replicas may manage to
elect a lease holder among themselves.

We have an additional complication compared to vanilla raft because we
allow node IDs to be reused (this is necessary for coalesced
heartbeats). We may remove a replica from a node and then re-add it
later with the same node ID, so we must be able to distinguish
messages from an earlier epoch of the range.

Here is a scenario that can lead to split-brain in the current system:

1. A range R has replicas on nodes A, B, and C; C is down.
2. Nodes A and B execute a `ChangeReplicas` transaction to remove
   node C. Several more `ChangeReplicas` transactions follow, adding
   nodes D, E, and F and removing A and B.
3. Nodes A and B garbage-collect their copies of the range.
4. Node C comes back up. When it doesn't hear from the lease holder of
   range R, it starts an election.
5. Nodes A and B see that node C has a more advanced log position for
   range R than they do (since they have nothing), so they vote for it.
   C becomes lease holder and sends snapshots to A and B.
6. There are now two "live" versions of the range. Clients
   (`DistSenders`) whose range descriptor cache is out of date may
   talk to the ABC group instead of the correct DEF group.

The problems caused by removing replicas are also discussed in
[#768](https://github.com/cockroachdb/cockroach/issues/768).

# Detailed design

A **Replica ID** is generated with every `ChangeReplicas` call and
stored in the `Replica` message of the range descriptor. Replica IDs
are monotonically increasing within a range and never reused. They are
generated using a `next_replica_id` field which will be added to the
`RangeDescriptor` (alternative generation strategies include using the
raft log position or database timestamp of the `EndTransaction` call
that commits the membership change, but this information is less
accessible at the point where it would be needed).

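To make the allocation scheme concrete, here is a minimal Go sketch.
The struct and method names are illustrative only; the real
definitions are protobuf messages:

```go
// Illustrative stand-ins for the protobuf messages; the field of
// interest is NextReplicaID, the source of monotonically increasing,
// never-reused replica IDs.
type ReplicaDescriptor struct {
	NodeID    int32
	StoreID   int32
	ReplicaID int32 // unique within the range, never reused
}

type RangeDescriptor struct {
	RangeID       int64
	Replicas      []ReplicaDescriptor
	NextReplicaID int32
}

// addReplica is a hypothetical helper showing where a ChangeReplicas
// call would allocate an ID: the descriptor update commits atomically
// with the membership change, so each ID is handed out exactly once.
func (d *RangeDescriptor) addReplica(nodeID, storeID int32) ReplicaDescriptor {
	rep := ReplicaDescriptor{
		NodeID:    nodeID,
		StoreID:   storeID,
		ReplicaID: d.NextReplicaID,
	}
	d.NextReplicaID++
	d.Replicas = append(d.Replicas, rep)
	return rep
}
```
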
The `ReplicaID` replaces the current `RaftNodeID` (which is
constructed from the `NodeID` and `StoreID`). Raft only knows about
replica IDs, which are never reused, so we don't have to worry about
certain problems that can arise when node IDs are reused (such as vote
changes or log regression). `MultiRaft` is responsible for mapping
`ReplicaIDs` to node and store IDs (which it must do to coalesce
heartbeats). This is done with a new method on the
`WriteableGroupStorage` interface, along with an in-memory cache. Note
that the node and store IDs for a given replica never change once that
replica is created, so we don't need to worry about synchronizing or
invalidating entries in this cache.

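A sketch of that cache, with invented names: the `fromStorage`
callback stands in for the new `WriteableGroupStorage` method, and
entries are keyed by range because replica IDs are only unique within
a range:

```go
import "sync"

// replicaKey scopes a replica ID to its range.
type replicaKey struct {
	rangeID   int64
	replicaID int32
}

// replicaAddr is the node and store a replica lives on.
type replicaAddr struct {
	nodeID  int32
	storeID int32
}

// addrCache is an illustrative stand-in for the in-memory cache.
// Because a replica's node and store never change once the replica is
// created, entries never need to be invalidated or synchronized
// beyond this simple mutex.
type addrCache struct {
	mu sync.Mutex
	m  map[replicaKey]replicaAddr
}

func (c *addrCache) lookup(k replicaKey,
	fromStorage func(replicaKey) (replicaAddr, bool)) (replicaAddr, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if a, ok := c.m[k]; ok {
		return a, true
	}
	if a, ok := fromStorage(k); ok { // consult persistent storage on a miss
		c.m[k] = a
		return a, true
	}
	return replicaAddr{}, false
}
```
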
The raft transport will send the node, store, and replica ID of both
the sender and receiver with every `RaftMessageRequest`. This is how
the `Store` will learn of the replica ID for new ranges (the sender's
IDs must be inserted in the replica ID cache, so we can route
responses to replicas that do not yet appear in any range descriptor
we have stored). The `Store` will drop incoming messages with a
replica ID that is less than the last known one for the range (in
order to minimize disruption from out-of-date servers).

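The shape of that check might be as follows. This reuses the
`ReplicaDescriptor` sketch above; the `storeState` interface and
helper names are invented for illustration:

```go
// RaftMessageRequest here is an illustrative stand-in for the real
// protobuf message; only the addressing fields are shown.
type RaftMessageRequest struct {
	RangeID     int64
	FromReplica ReplicaDescriptor // sender's node, store, and replica ID
	ToReplica   ReplicaDescriptor // receiver's node, store, and replica ID
	// ... raft payload ...
}

// storeState abstracts the pieces of Store state the check needs.
type storeState interface {
	lastReplicaID(rangeID int64) int32
	cacheReplicaAddr(rangeID int64, rep ReplicaDescriptor)
}

// handleRaftMessage drops messages addressed to a replica ID below
// the last one known for the range, and records the sender's
// addressing info so responses can be routed even before any locally
// stored range descriptor mentions the sender.
func handleRaftMessage(s storeState, req RaftMessageRequest) (dropped bool) {
	if req.ToReplica.ReplicaID < s.lastReplicaID(req.RangeID) {
		return true // message from an earlier epoch of the range
	}
	s.cacheReplicaAddr(req.RangeID, req.FromReplica)
	return false
}
```
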
The `DistSender` will also send the replica ID in the header of every
request, and the receiver will reject requests with incorrect replica
IDs (such mismatches will be rare, since a range will normally be
absent from a node for a time before being re-replicated).

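Receiver-side, the check reduces to a comparison; the function and
error names here are invented:

```go
import "errors"

// errStaleReplica is a placeholder for whatever error the receiver
// would return to make the client refresh its descriptor cache.
var errStaleReplica = errors.New("request addressed to an out-of-date replica")

// verifyReplicaID rejects a request whose header names a replica ID
// other than the one currently serving the range on this store.
func verifyReplicaID(headerReplicaID, localReplicaID int32) error {
	if headerReplicaID != localReplicaID {
		return errStaleReplica
	}
	return nil
}
```
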
When a replica is garbage-collected, we write a **tombstone**
containing the replica ID, so that the store can continue to drop
out-of-date raft messages even after the GC.

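A sketch of the tombstone check, using an in-memory map as an
illustrative stand-in for the persisted record (the tombstone must
outlive the replica's data):

```go
// ReplicaTombstone records the ID of a removed replica.
type ReplicaTombstone struct {
	ReplicaID int32
}

// tombstones maps range ID to the tombstone for that range's most
// recently removed local replica.
type tombstones map[int64]ReplicaTombstone

// shouldDrop reports whether an incoming raft message is addressed to
// a replica at or before the tombstoned one, in which case it must be
// dropped rather than allowed to recreate the replica.
func (t tombstones) shouldDrop(rangeID int64, toReplicaID int32) bool {
	ts, ok := t[rangeID]
	return ok && toReplicaID <= ts.ReplicaID
}
```
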
In the scenario above, with this change node C would send nodes A and
B's replica IDs in its vote requests; A and B would see that they have
tombstones for those replicas and would not recreate them.

# Drawbacks

Tombstones must be long-lived, since a node may come back online after
a lengthy delay. We cannot garbage-collect tombstones unless we also
guarantee that no node will come back after a period of time longer
than the tombstone GC time.

# Alternatives

If we had an explicit replica-creation RPC instead of creating
replicas automatically on the first sighting of a raft message, this
problem might go away. However, doing so would be tricky: this RPC
would need to be retried in certain circumstances, and it is difficult
to distinguish cases where a retry is necessary from cases that lead
to the problem discussed here.

# Unresolved questions

* How long must we wait before garbage-collecting tombstones? Can we
  do it at all?