- Feature Name: Stateless Replica Relocation
- Status: completed
- Start Date: 2015-08-19
- RFC PR: [#2171](https://github.com/cockroachdb/cockroach/pull/2171)
- Cockroach Issue: [#620](https://github.com/cockroachdb/cockroach/issues/620)

# Summary
A relocation is, conceptually, the transfer of a single replica from one store to
another. However, the implementation is necessarily the combination of two operations:

1. Creating a new replica for a range on a new store.
2. Removing a replica of the same range from a different store.

For example, by creating a replica on store Y and then removing a replica from
store X, you have in effect moved the replica from X to Y.

This RFC suggests an overall architectural design goal: the decision to make any
individual operation (either a create or a remove) should be **stateless**. In
other words, the second operation in a replica relocation should not depend on a
specific invocation of the first operation.

# Motivation
For an assortment of reasons, Cockroach will often need to relocate the replicas
of a range. Most immediately, this is needed for repair (when a store dies and
its replicas are no longer usable) and rebalance (relocating replicas on
overloaded stores to stores with excess capacity).

A relocation must be expressed as a combination of two operations:

1. Creating a new replica for a range on a new store.
2. Removing a replica of the same range from a different store.

These operations can happen in either order as long as quorum is maintained in
the range's raft group after each individual operation, although one ordering may
be preferred over another.

Expressing a specific relocation (i.e. "move replica from store X to store Y")
would require maintaining some persistent state to link the two operations
involved. Storing that state presents a number of issues: where is it stored,
in memory or on disk? If it's in memory, does it have to be replicated through
raft or is it local to one node? If on disk, can the persistent state become
stale? How do you detect conflicts between two relocation operations initiated
from different places?

This RFC suggests that no such relocation state should be persisted. Instead, a
system that wants to initiate a relocation will perform only the first
operation; a different system will later detect the need for a complementary
operation and perform it. A relocation is thus completed without any state being
exchanged between those two systems.

By eliminating the need to persist any data about in-progress relocation
operations, the overall system is dramatically simplified.

# Detailed design
The implementation involves a few pieces (a small sketch of the core mechanic
follows this list):

1. Each range must have a persisted *target replication state*. This does not
   prescribe specific replica locations; it specifies a required count of
   replicas, along with some desired attributes for the stores where they are
   placed.
2. The core mechanic is a stateless function which compares the immediate
   replication state of a range to its target replication state; if the two
   differ, this function will either create or remove a replica in order to
   move the range towards the target replication state. By running multiple
   times (adding or removing a replica each time), the target replication
   state will eventually be matched.
3. Any operation that wishes to *relocate* a replica need only perform the first
   operation of the relocation (either a create or a remove). This will perturb
   the range's replication state away from the target; the core function will
   later detect that mismatch and correct it by performing the complementary
   operation of the relocation (a remove or a create).
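To make the core mechanic concrete, here is a minimal sketch in Go. It uses
hypothetical names (`Action`, `nextAction`) rather than actual CockroachDB code,
and it reduces the target replication state to a bare replica count; the real
queue would also weigh the zone configuration's store attributes when choosing
where to add or what to remove.

```go
package replication

// Action is the single stateless step taken to move a range toward its
// target replication state.
type Action int

const (
	ActionNone Action = iota
	ActionAddReplica
	ActionRemoveReplica
)

// nextAction compares the range's current replica count against the target
// count from its zone configuration and returns the one operation (if any)
// that moves the range closer to the target. It is stateless: it looks only
// at the current and target states, never at which system perturbed the
// range or why.
func nextAction(currentReplicas, targetReplicas int) Action {
	switch {
	case currentReplicas < targetReplicas:
		return ActionAddReplica
	case currentReplicas > targetReplicas:
		return ActionRemoveReplica
	default:
		return ActionNone
	}
}
```

Run repeatedly (once per pass of the queue), this converges on the target state
regardless of whether a relocation began with an add or a remove.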
The first piece is already present: each range has a zone configuration
which determines the target replication state for the range.

The second piece, the core mechanic, will be performed by the existing
"replicate queue", which will be renamed the "*replication queue*". This queue is
already used to add replicas to ranges which are under-replicated; it can be
enhanced to remove replicas from over-replicated ranges, thus satisfying the
basic requirements of the core mechanic.

The third piece simply informs the design of systems performing relocations, for
example the upcoming repair and rebalance systems (still being planned). After
identifying a relocation opportunity, these systems will perform the first
operation of the relocation (add or remove) and then insert the corresponding
replica into the local replication queue. The replication queue will then
perform the complementary operation.

The final complication is how to ensure that the replication queue promptly
identifies ranges outside of their ideal target state. As a queue it will be
attached to the replica scanner, but we will also want to enqueue a replica
immediately whenever we knowingly perturb the replication state. Thus,
components initiating a relocation (e.g. rebalance or repair) should immediately
enqueue their target replicas after changing the replication state.

# Drawbacks

### Specific Move Operations
A stateless model for relocation precludes the ability to request specific
relocations; only the first operation can be made with specificity.

For example, the verb "Move replica from X to Y" cannot be expressed with
certainty; instead, only "Move replica to X from (some store)" or "Move replica
to Y from (some store)" can be expressed. The replication queue will be
responsible for selecting an appropriate store for the complementary operation.

It is assumed that this level of specificity is simply not needed for any
relocation operations; there is no currently apparent use case where a move
between a specific pair of stores is needed.

Even if this were necessary, it might be possible to express it by manipulating
the factors behind the individual stateless decisions.

### Thrashing of complementary operation
Because there is no relocation state, the possibility of "thrashing" is
introduced. For example:

1. A rebalance operation adds a new replica to the range.
2. The replication queue picks up the range and detects the need to remove a
   replica; however, it decides to remove the replica that was just added.

This is possible if the rebalance's criteria for selecting a new replica are
sufficiently different from the replication queue's criteria for selecting a
replica to remove.

To reduce this risk, there must be sufficient agreement between the criteria of
the two operations; a rebalance operation should avoid adding a new replica if
there is a realistic chance that the replication queue will immediately remove
it.

This can be realized by adding a "buffer" zone between the different criteria:
when selecting a store for a new replica, the choice should fall *significantly*
outside the criteria for removing a replica, thus reducing the chances that the
new replica will itself be selected for removal.
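One way to picture this buffer is as hysteresis between an "add" threshold and a
"remove" threshold. The sketch below is illustrative only: it assumes a
hypothetical one-dimensional store score (`storeScore`, higher is better) and
made-up threshold values, whereas the real criteria combine capacity, store
attributes, and other signals.

```go
package replication

// removeThreshold is the score below which the replication queue would
// consider a store's replica a candidate for removal.
const removeThreshold = 0.3

// addBuffer is the extra margin a store must clear above removeThreshold
// before the rebalancer will add a replica there. This gap is what keeps the
// replication queue from immediately removing the replica it just gained.
const addBuffer = 0.2

// shouldRemoveFrom reports whether the replication queue would pick this
// store's replica as the one to remove.
func shouldRemoveFrom(storeScore float64) bool {
	return storeScore < removeThreshold
}

// shouldAddTo reports whether the rebalancer may place a new replica on this
// store: its score must clear the removal criteria by a comfortable margin.
func shouldAddTo(storeScore float64) bool {
	return storeScore > removeThreshold+addBuffer
}
```

A store with a score of 0.4 would not be chosen for removal, but it is also not
attractive enough to receive a new replica, so small fluctuations in score
cannot cause an add followed immediately by a remove.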
When properly tuned to avoid thrashing, the stateless nature could instead be
considered a positive, because the system can respond to changes that occur in
the interim; consider a case where, due to a delay, the relative health of nodes
changes and the originally selected replica is no longer a good option. The
stateless system will respond correctly to this situation.

### Delay in complementary operation
By separating the two operations, we are possibly delaying the application of
the second operation. For example, the replication queue could be very busy, or
an untimely crash could result in the range being under- or over-replicated
without being in the replication queue on any store.

However, many of these concerns are allayed by the existence of repair; if the
node goes down, the repair system will add the range to the replication queue on
another store.

Even in an exotic failure scenario, the stateless design will eventually detect
the replication anomaly through the normal operation of the replica scanner.

### Lack of non-voting replicas
Our raft implementation currently lacks support for non-voting members; as a
result, some types of relocation will temporarily make the affected range more
fragile.

A newly created replica is very far behind the current state of the range and
thus needs to receive a snapshot. It may take some time before the replica fully
catches up and can take part in quorum commits.

However, without non-voting replicas we have no choice but to add the new
replica as a full member, thus changing the quorum requirements of the group.
For a range with an odd number of replicas, this increases the quorum count by
one, while the new replica is not yet able to contribute to quorum decisions.
This increases the chance of losing quorum until that replica is caught up, thus
reducing availability.

This could be mitigated somewhat without fully implementing non-voting replicas;
in preparation for adding a replica, we could manually generate a snapshot and
send it to the node *before* adding it to the raft configuration. This would
decrease the window of time between adding the replica and having it fully catch
up.

"Lack of non-voting replicas" is listed as a drawback because going forward with
relocation *without* non-voting replication introduces this fragility,
regardless of how relocation decisions are made. Stateless relocation will still
work correctly when non-voting replicas are implemented; there will simply be a
delay in the case where a replica is added first (e.g. rebalance), with the
removal not taking place until the non-voting replica catches up and is upgraded
to a full group member. This is not trivial, but will still allow for stateless
decisions.
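To make the quorum arithmetic behind this drawback concrete, here is a small
worked sketch; `quorum` is a hypothetical helper, not CockroachDB code.

```go
package replication

// quorum returns the number of votes a raft group of the given size needs in
// order to commit an entry.
func quorum(groupSize int) int {
	return groupSize/2 + 1
}

// Before the add: 3 voters, quorum(3) == 2, so the range tolerates the loss
// of any one replica.
//
// After adding a 4th voter that still needs a snapshot: quorum(4) == 3, but
// only the 3 caught-up replicas can supply those votes, so losing any one of
// them stalls the range until the new replica catches up.
```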
# Alternatives
The main alternative would be some sort of stateful system, where relocation
operations would be expressed explicitly (i.e. "Move replica from X to Y") and
then seen through to completion. For the reasons outlined in the "Motivation"
section, this is considered sub-optimal when making distributed decisions.

### Relocation Master
The ultimate expression of this would be the designation of a "relocation
master", a single node that makes all relocation decisions for the entire
cluster.

There is enough information in gossip for a central node to make acceptable
decisions about replica movement. In some ways the individual decisions would
be worse, because they would be unable to consider current raft state; however,
in aggregate the decisions could be much better, because the central master
could consider groups of relocations together. For example, it would be able to
avoid distributed pitfalls such as under-loaded nodes being suddenly inundated
with requests for new replicas. It would also be able to move replicas to newly
introduced nodes more quickly and correctly.

This system would also have an easier time communicating the individual
relocation decisions that were made, which could be helpful for debugging or
for tweaking relocation criteria.

However, it's important to note that a relocation master is not entirely
incompatible with the "stateless" design; the relocation master could simply be
the initiator of the stateless operations. You could thus get much of the
improved decision making and communication of the master without having to move
to a stateful design.

# Unresolved questions

### Raft leader constraint
The biggest unresolved issue is whether relocation decisions must be constrained
to the raft leader.

There is no firm need for a relocation operation to be initiated from a range
leader; a secondary replica can safely initiate an add or a remove, with safety
guaranteed by a combination of raft and cockroach's transactions.

However, if the criteria for an operation are to consider raft state (such as
which nodes are behind in replication), those decisions could only be made from
the raft leader, which is the only node that has the full raft state.
Alternatively, an interface could be provided for non-leader members to query
that information from the leader.