- Feature Name: Stateless Replica Relocation
- Status: completed
- Start Date: 2015-08-19
- RFC PR: [#2171](https://github.com/cockroachdb/cockroach/pull/2171)
- Cockroach Issue: [#620](https://github.com/cockroachdb/cockroach/issues/620)

# Summary
A relocation is, conceptually, the transfer of a single replica from one store to
another. However, the implementation is necessarily the combination of two operations:

1. Creating a new replica for a range on a new store.
2. Removing a replica of the same range from a different store.

For example, by creating a replica on store Y and then removing a replica from
store X, you have in effect moved the replica from X to Y.

This RFC suggests an overall architectural design goal: the decision to make any
individual operation (either a create or a remove) should be **stateless**. In
other words, the second operation in a replica relocation should not depend on a
specific invocation of the first operation.

# Motivation
For an assortment of reasons, Cockroach will often need to relocate the replicas
of a range. Most immediately, this is needed for repair (when a store dies and
its replicas are no longer usable) and rebalance (relocating replicas on
overloaded stores to stores with excess capacity).

A relocation must be expressed as a combination of two operations:

1. Creating a new replica for a range on a new store.
2. Removing a replica of the same range from a different store.

These operations can happen in either order as long as quorum is maintained in
the range's raft group after each individual operation, although one ordering may
be preferred over another.

Expressing a specific relocation (i.e. "move replica from store X to store Y")
would require maintaining some persistent state to link the two operations
involved. Storing that state presents a number of issues: where is it stored,
in memory or on disk? If it's in memory, does it have to be replicated through
raft or is it local to one node? If on disk, can the persistent state become
stale? How do you detect conflicts between two relocation operations initiated
from different places?

This RFC suggests that no such relocation state should be persisted. Instead, a
system that wants to initiate a relocation will perform only the first
operation; a different system will later detect the need for a complementary
operation and perform it. A relocation is thus completed without any state being
exchanged between those two systems.

By eliminating the need to persist any data about in-progress relocation
operations, the overall system is dramatically simplified.

# Detailed design
The implementation involves a few pieces (a small sketch of the core mechanic
follows this list):

1. Each range must have a persisted *target replication state*. This does not
   prescribe specific replica locations; it specifies a required count of
   replicas, along with some desired attributes for the stores where they are
   placed.
2. The core mechanic is a stateless function which compares the immediate
   replication state of a range to its target replication state; if the two
   differ, this function will either create or remove a replica in order to
   move the range towards the target replication state. By running multiple
   times (adding or removing a replica each time), the target replication
   state will eventually be matched.
3. Any operation that wishes to *relocate* a replica need only perform the first
   operation of the relocation (either a create or a remove). This will perturb
   the range's replication state away from the target; the core function will
   later detect that mismatch and correct it by performing the complementary
   operation of the relocation (a remove or a create).
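To make the core mechanic concrete, here is a minimal sketch in Go. It uses
hypothetical names (`Action`, `nextAction`) rather than actual CockroachDB code,
and it reduces the target replication state to a bare replica count; the real
queue would also weigh the zone configuration's store attributes when choosing
where to add or what to remove.

```go
package replication

// Action is the single stateless step taken to move a range toward its
// target replication state.
type Action int

const (
	ActionNone Action = iota
	ActionAddReplica
	ActionRemoveReplica
)

// nextAction compares the range's current replica count against the target
// count from its zone configuration and returns the one operation (if any)
// that moves the range closer to the target. It is stateless: it looks only
// at the current and target states, never at which system perturbed the
// range or why.
func nextAction(currentReplicas, targetReplicas int) Action {
	switch {
	case currentReplicas < targetReplicas:
		return ActionAddReplica
	case currentReplicas > targetReplicas:
		return ActionRemoveReplica
	default:
		return ActionNone
	}
}
```

Run repeatedly (once per pass of the queue), this converges on the target state
regardless of whether a relocation began with an add or a remove.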
The first piece is already present: each range has a zone configuration
which determines the target replication state for the range.

The second piece, the core mechanic, will be performed by the existing
"replicate queue", which will be renamed the "*replication queue*". This queue is
already used to add replicas to ranges which are under-replicated; it can be
enhanced to remove replicas from over-replicated ranges, thus satisfying the
basic requirements of the core mechanic.

The third piece simply informs the design of systems performing relocations, for
example the upcoming repair and rebalance systems (still being planned). After
identifying a relocation opportunity, these systems will perform the first
operation of the relocation (add or remove) and then insert the corresponding
replica into the local replication queue. The replication queue will then
perform the complementary operation.

The final complication is how to ensure that the replication queue promptly
identifies ranges outside of their ideal target state. As a queue it will be
attached to the replica scanner, but we will also want to enqueue a replica
immediately whenever we knowingly perturb the replication state. Thus,
components initiating a relocation (e.g. rebalance or repair) should immediately
enqueue their target replicas after changing the replication state.

# Drawbacks

### Specific Move Operations
A stateless model for relocation precludes the ability to request specific
relocations; only the first operation can be made with specificity.

For example, the verb "Move replica from X to Y" cannot be expressed with
certainty; instead, only "Move replica to X from (some store)" or "Move replica
to Y from (some store)" can be expressed. The replication queue will be
responsible for selecting an appropriate store for the complementary operation.

It is assumed that this level of specificity is simply not needed for any
relocation operations; there is no currently apparent use case where a move
between a specific pair of stores is needed.

Even if this were necessary, it might be possible to express it by manipulating
the factors behind the individual stateless decisions.

### Thrashing of complementary operation
Because there is no relocation state, the possibility of "thrashing" is
introduced. For example:

1. A rebalance operation adds a new replica to the range.
2. The replication queue picks up the range and detects the need to remove a
   replica; however, it decides to remove the replica that was just added.

This is possible if the rebalance's criteria for selecting a new replica are
sufficiently different from the replication queue's criteria for selecting a
replica to remove.

To reduce this risk, there must be sufficient agreement between the criteria of
the two operations; a rebalance operation should avoid adding a new replica if
there is a realistic chance that the replication queue will immediately remove
it.

This can be realized by adding a "buffer" zone between the different criteria:
when selecting a store for a new replica, the choice should fall *significantly*
outside the criteria for removing a replica, thus reducing the chances that the
new replica will itself be selected for removal.
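One way to picture this buffer is as hysteresis between an "add" threshold and a
"remove" threshold. The sketch below is illustrative only: it assumes a
hypothetical one-dimensional store score (`storeScore`, higher is better) and
made-up threshold values, whereas the real criteria combine capacity, store
attributes, and other signals.

```go
package replication

// removeThreshold is the score below which the replication queue would
// consider a store's replica a candidate for removal.
const removeThreshold = 0.3

// addBuffer is the extra margin a store must clear above removeThreshold
// before the rebalancer will add a replica there. This gap is what keeps the
// replication queue from immediately removing the replica it just gained.
const addBuffer = 0.2

// shouldRemoveFrom reports whether the replication queue would pick this
// store's replica as the one to remove.
func shouldRemoveFrom(storeScore float64) bool {
	return storeScore < removeThreshold
}

// shouldAddTo reports whether the rebalancer may place a new replica on this
// store: its score must clear the removal criteria by a comfortable margin.
func shouldAddTo(storeScore float64) bool {
	return storeScore > removeThreshold+addBuffer
}
```

A store with a score of 0.4 would not be chosen for removal, but it is also not
attractive enough to receive a new replica, so small fluctuations in score
cannot cause an add followed immediately by a remove.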
When properly tuned to avoid thrashing, the stateless nature could instead be
considered a positive, because the system can respond to changes that occur in
the interim; consider a case where, due to a delay, the relative health of nodes
changes and the originally selected replica is no longer a good option. The
stateless system will respond correctly to this situation.

### Delay in complementary operation
By separating the two operations, we are possibly delaying the application of
the second operation. For example, the replication queue could be very busy, or
an untimely crash could result in the range being under- or over-replicated
without being in the replication queue on any store.

However, many of these concerns are allayed by the existence of repair; if the
node goes down, the repair system will add the range to the replication queue on
another store.

Even in an exotic failure scenario, the stateless design will eventually detect
the replication anomaly through the normal operation of the replica scanner.

### Lack of non-voting replicas
Our raft implementation currently lacks support for non-voting members; as a
result, some types of relocation will temporarily make the affected range more
fragile.

A newly created replica is very far behind the current state of the range and
thus needs to receive a snapshot. It may take some time before the replica fully
catches up and can take part in quorum commits.

However, without non-voting replicas we have no choice but to add the new
replica as a full member, thus changing the quorum requirements of the group.
For a range with an odd number of replicas, this increases the quorum count by
one, while the new replica is not yet able to contribute to quorum decisions.
This increases the chance of losing quorum until that replica is caught up, thus
reducing availability.

This could be mitigated somewhat without fully implementing non-voting replicas;
in preparation for adding a replica, we could manually generate a snapshot and
send it to the node *before* adding it to the raft configuration. This would
decrease the window of time between adding the replica and having it fully catch
up.

"Lack of non-voting replicas" is listed as a drawback because going forward with
relocation *without* non-voting replication introduces this fragility,
regardless of how relocation decisions are made. Stateless relocation will still
work correctly when non-voting replicas are implemented; there will simply be a
delay in the case where a replica is added first (e.g. rebalance), with the
removal not taking place until the non-voting replica catches up and is upgraded
to a full group member. This is not trivial, but will still allow for stateless
decisions.
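To make the quorum arithmetic behind this drawback concrete, here is a small
worked sketch; `quorum` is a hypothetical helper, not CockroachDB code.

```go
package replication

// quorum returns the number of votes a raft group of the given size needs in
// order to commit an entry.
func quorum(groupSize int) int {
	return groupSize/2 + 1
}

// Before the add: 3 voters, quorum(3) == 2, so the range tolerates the loss
// of any one replica.
//
// After adding a 4th voter that still needs a snapshot: quorum(4) == 3, but
// only the 3 caught-up replicas can supply those votes, so losing any one of
// them stalls the range until the new replica catches up.
```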
# Alternatives
The main alternative would be some sort of stateful system, where relocation
operations would be expressed explicitly (i.e. "Move replica from X to Y") and
then seen through to completion. For the reasons outlined in the "Motivation"
section, this is considered sub-optimal when making distributed decisions.

### Relocation Master
The ultimate expression of this would be the designation of a "relocation
master", a single node that makes all relocation decisions for the entire
cluster.

There is enough information in gossip for a central node to make acceptable
decisions about replica movement. In some ways the individual decisions would
be worse, because they would be unable to consider current raft state; however,
in aggregate the decisions could be much better, because the central master
could consider groups of relocations together. For example, it would be able to
avoid distributed pitfalls such as under-loaded nodes being suddenly inundated
with requests for new replicas. It would also be able to move replicas to newly
introduced nodes more quickly and correctly.

This system would also have an easier time communicating the individual
relocation decisions that were made, which could be helpful for debugging or
for tweaking relocation criteria.

However, it's important to note that a relocation master is not entirely
incompatible with the "stateless" design; the relocation master could simply be
the initiator of the stateless operations. You could thus get much of the
improved decision making and communication of the master without having to move
to a stateful design.

# Unresolved questions

### Raft leader constraint
The biggest unresolved issue is whether relocation decisions must be constrained
to the raft leader.

There is no firm need for a relocation operation to be initiated from a range
leader; a secondary replica can safely initiate an add or a remove, with safety
guaranteed by a combination of raft and cockroach's transactions.

However, if the criteria for an operation are to consider raft state (such as
which nodes are behind in replication), those decisions could only be made from
the raft leader, which is the only node that has the full raft state.
Alternatively, an interface could be provided for non-leader members to query
that information from the leader.