- Feature Name: Copysets
- Status: draft
- Start Date: 2018-12-04
- Authors: Vijay Karthik, Mohammed Hassan
- RFC PR: (PR # after acceptance of initial draft)
- Cockroach Issue: [#25194](https://github.com/cockroachdb/cockroach/issues/25194)

# Table of Contents

- [Table of Contents](#table-of-contents)
- [Summary](#summary)
- [Motivation](#motivation)
- [Guide-level explanation](#guide-level-explanation)
  - [Design](#design)
    - [Managing copysets](#managing-copysets)
  - [Rebalancing ranges](#rebalancing-ranges)
    - [Copyset score](#copyset-score)
- [Drawbacks](#drawbacks)
- [Rationale and Alternatives](#rationale-and-alternatives)
  - [Chainsets](#chainsets)
- [Testing scenarios](#testing-scenarios)

# Summary

Copysets reduce the probability of data loss in the presence of multi-node
failures in large clusters.

This RFC will present a design for integrating copysets in cockroach and
discuss its tradeoffs. Copysets have previously been discussed in
[RFC #6484](https://github.com/cockroachdb/cockroach/pull/6484).

More details on copysets can be found in the
[academic literature](https://web.stanford.edu/~skatti/pubs/usenix13-copysets.pdf).

# Motivation
In large clusters, the simultaneous loss of multiple nodes has a very high
probability of causing data loss. For example, consider a cluster of 100
nodes with a replication factor of 3 and ~10k ranges. The simultaneous loss
of 2 or more nodes has a very high probability of data loss, since there is
likely to be at least one range among the 10k ranges with 2 of its 3 replicas
on the 2 lost nodes. This probability can be reduced by adding localities to
nodes, since cockroach can tolerate the failure of all nodes in a single
locality, but the loss of two nodes in different localities again has a high
probability of data loss.

Copysets significantly reduce the probability of data loss in the presence
of multi-node failures.

# Guide-level explanation
Copysets divide the cluster into disjoint sets of nodes. The size of each set
is based on the replication factors in use; separate copysets are created for
each replication factor. A range should prefer to allocate its replicas
within a single copyset rather than spread them across copysets.

There are two major components:
1. Managing copysets (which node belongs to which copyset).
   Copyset assignment should take the locality of nodes into account so that
   locality fault tolerance is not lost. Node additions, removals, and
   crashes should also be taken into account when assigning nodes to
   copysets.
2. Rebalancing all replicas of a range to reside within a single copyset on a
   best-effort basis.
   Rebalancing replicas into copysets is important, but some properties, like
   constraints set by a user, should take priority over copysets.

Copysets will initially be an opt-in feature (based on a cluster setting) and
implemented for a scatter width of `replication_factor - 1` (eventually it
will be extended to support higher scatter widths).
**For simplicity, we will explain the design without considering scatter
width in copysets.**

## Design
### Managing copysets
The cluster will be divided into copysets. For each replication factor in the
cluster, separate copysets will be generated.

The requirements for the copysets of a replication factor are:
1. For a scatter width of rf - 1 there should be no overlap of nodes between
   copysets; for scatter widths >= rf, the overlap should be minimized
   (where rf is the replication factor).
2. Copysets should be locality fault tolerant (each node in a copyset should
   preferably be from a different locality).
3. Copysets should be rebalanced on node additions / removals / failures.

Copysets are generated for each replication factor used in the system.
Better failure tolerance could be provided if the copysets for different
replication factors were aligned, but this is not the case in the strategies
presented here.

Two possible strategies for copyset allocation are presented below.

#### Optimal diversity copyset allocation
Optimal allocation (for locality diversity) of copysets for a particular
replication factor can be done as follows:
```
1. Compute num_copysets = floor(num_stores/replication_factor)
2. Sort stores based on increasing order of locality.
3. Assign copysets to stores in a round robin fashion.
```
For example, consider the case where we have stores as follows:
```
Locality1: S1 S2 S3
Locality2: S4 S5 S6
Locality3: S7 S8 S9 S10
```
Copysets for RF 3 would be created as
```
num_copysets = 10/3 = 3
CS1: S1 S4 S7 S10
CS2: S2 S5 S8
CS3: S3 S6 S9
```
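As a concrete illustration, below is a minimal Go sketch of this strategy.
The `Store` type and all names are hypothetical and do not correspond to
cockroach's actual internals:

```go
package copysets

import "sort"

// Store is an illustrative stand-in for a cockroach store: an ID plus the
// locality it belongs to.
type Store struct {
	ID       int
	Locality string
}

// optimalDiversityAllocation sorts stores by locality and then deals them
// out to copysets round robin, so stores that share a locality land in
// different copysets whenever possible.
func optimalDiversityAllocation(stores []Store, rf int) [][]Store {
	numCopysets := len(stores) / rf // floor(num_stores / replication_factor)
	sort.Slice(stores, func(i, j int) bool {
		if stores[i].Locality != stores[j].Locality {
			return stores[i].Locality < stores[j].Locality
		}
		return stores[i].ID < stores[j].ID
	})
	copysets := make([][]Store, numCopysets)
	for i, s := range stores {
		copysets[i%numCopysets] = append(copysets[i%numCopysets], s)
	}
	return copysets
}
```

Running this on the ten stores above reproduces the allocation shown:
`CS1: S1 S4 S7 S10`, `CS2: S2 S5 S8`, `CS3: S3 S6 S9`.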
#### Minimize data movement copyset allocation
In this strategy the goal is to minimize data movement when copysets are
regenerated with a different store list (some stores added, some stores
removed).

This allocation tries to create a copyset-store mapping (with incremental
changes over the previously used copysets) which is diverse in locality. It
tries to minimize the number of changes to the previously used copysets and
to ensure that each store in a copyset belongs to a different locality when
possible.
The allocation:
1. Computes the required number of copysets for the new store list.
2. Assigns previously existing stores to the same copyset ID they belonged
   to (if that copyset ID still exists based on step 1), as long as the
   copyset size is < replication factor.
3. Adds the newly added stores (those not present in the previous copyset
   allocation) and the remaining stores from step 2 to the empty spots in
   each copyset (if the copyset has < replication factor stores or if it is
   the last copyset), after the previously existing stores have been carried
   over. A sketch of steps 1-3 is given below.
4. Swaps stores between copysets to avoid duplicate localities in a single
   copyset till it converges (diversity cannot be improved further).
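Below is a rough Go sketch of steps 1-3, reusing the illustrative `Store`
type from the earlier sketch. The `prev` map (store ID to previous copyset
ID) and all names are assumptions made for illustration; step 4 is covered
in the next section:

```go
package copysets

// carryOverAndFill sketches steps 1-3: compute the number of copysets for
// the new store list, keep surviving stores in their old copysets, and fill
// the remaining spots with new and leftover stores. Swaps (step 4) are
// omitted.
func carryOverAndFill(stores []Store, prev map[int]int, rf int) [][]Store {
	numCopysets := len(stores) / rf // step 1
	copysets := make([][]Store, numCopysets)
	var unassigned []Store
	// Step 2: a surviving store keeps its old copyset ID if that copyset
	// still exists and is not already full.
	for _, s := range stores {
		if id, ok := prev[s.ID]; ok && id < numCopysets && len(copysets[id]) < rf {
			copysets[id] = append(copysets[id], s)
		} else {
			unassigned = append(unassigned, s)
		}
	}
	// Step 3: place new and leftover stores into the empty spots; any
	// surplus (when the store count is not a multiple of rf) goes to the
	// last copyset.
	for cs := range copysets {
		for len(copysets[cs]) < rf && len(unassigned) > 0 {
			copysets[cs] = append(copysets[cs], unassigned[0])
			unassigned = unassigned[1:]
		}
	}
	copysets[numCopysets-1] = append(copysets[numCopysets-1], unassigned...)
	return copysets
}
```

On the `S6` removal example shown in the next section, this carries `S2` and
`S10` over to `CS2` and then fills its empty spot with `S13`.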
#### Swaps
Swaps are made between a source copyset and a target copyset, and guarantee
that the diversity of the source copyset increases while the diversity of
the target copyset does not decrease (or, if it decreases, it does not fall
below the replication factor).

Store swaps are made between a source copyset and a target copyset based on
the localities present in the source and target copysets. The conditions
required for a swap are:
1. The source copyset has diversity < replication factor. This means that
   the source copyset has two stores in a particular locality. One of these
   stores will be a source swap candidate.
2. The target copyset has a locality not present in the source copyset
   (let's call this the target locality). A store from this locality will be
   a target swap candidate.
3. One of the following is true:
   1. The locality of the source swap candidate is not present in the target
      copyset.
   2. The target copyset either
      1. has two stores in the target locality, or
      2. has diversity > replication factor.

By diversity above we mean the number of localities in a copyset.

Point (3) above ensures that the diversity of the target copyset does not
decrease (or, if it decreases, it does not fall below the replication
factor).

A single iteration of swaps considers all `(n choose 2)` copyset
combinations, where `n` is the number of copysets. These iterations continue
till the sum of the diversities of all copysets cannot be improved further
(no swap candidates are found for a whole iteration).

For example, consider the case where we have stores as follows:
```
Locality1: S1 S2 S3
Locality2: S4 S5 S6
Locality3: S7 S8 S9
Locality4: S10 S11 S12 S13
```
And an initial copyset allocation of
```
CS1: S1 S5 S9
CS2: S2 S6 S10
CS3: S3 S7 S11
CS4: S4 S8 S12 S13
```

Say store `S6` is removed.

After step 2 (assign stores to the same copyset ID till the size reaches rf),
we have
```
CS1: S1 S5 S9
CS2: S2 S10
CS3: S3 S7 S11
CS4: S4 S8 S12
```

After filling the empty spots by adding the remaining stores (`S13` in this
case)
```
CS1: S1 S5 S9
CS2: S2 S10 S13
CS3: S3 S7 S11
CS4: S4 S8 S12
```

After swaps (between `CS1` and `CS2`, since `CS2` has 2 stores from
`Locality4`)
```
CS1: S1 S5 S13
CS2: S2 S10 S9
CS3: S3 S7 S11
CS4: S4 S8 S12
```

This strategy may not achieve the optimal possible diversity, but it tries
to ensure that each store within a copyset is from a different locality.

#### Copyset re-generation
The store list considered for copyset allocation is the current list of live
stores. Live stores will be computed the same way the allocator detects live
stores (but throttled stores will not be excluded). Copysets will be
re-generated once the store list has changed and then remained stable for
3 ticks (each tick has a 10s interval).

The copyset allocation can be persisted as a proto in the distributed KV
layer. The copyset strategy which minimizes data movement requires copysets
to be persisted (it requires the previous state to be global and to survive
restarts). The node with the lowest live node ID in the cluster will manage
(persist) the copysets. Other nodes will periodically (every 10s) cache the
persisted copysets and use them for rebalancing.

Copysets will only be re-generated (and persisted) if the store list changes.
In steady state all nodes will periodically read the persisted copysets and
there will be no need to re-generate and persist new copysets.

The cluster can tolerate the failure of one node within each copyset for
RF=3. For example, a 100 node cluster can tolerate the simultaneous failure
of 33 nodes in the best case (for RF=3) without suffering any data loss.
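The trigger logic can be sketched as follows; everything here (names, the
tick plumbing) is an assumption made for illustration, since the real
implementation would hang off the store pool and persist a proto in the KV
layer:

```go
package copysets

import "reflect"

// regenTracker tracks whether the live store list has changed and has then
// stayed stable long enough to justify re-generating copysets.
type regenTracker struct {
	persisted   []int // store IDs used for the currently persisted copysets
	candidate   []int // most recently observed live store list
	stableTicks int   // consecutive ticks the candidate has been unchanged
}

// tick is invoked every 10s with the sorted live store IDs. It returns true
// when copysets should be re-generated and persisted: the list differs from
// the persisted one and has been stable for 3 ticks.
func (t *regenTracker) tick(liveStores []int) bool {
	if !reflect.DeepEqual(liveStores, t.candidate) {
		t.candidate = append([]int(nil), liveStores...)
		t.stableTicks = 0
		return false
	}
	t.stableTicks++
	if t.stableTicks >= 3 && !reflect.DeepEqual(t.candidate, t.persisted) {
		t.persisted = t.candidate
		return true
	}
	return false
}
```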
## Rebalancing ranges
Ranges need to be rebalanced to be contained within a copyset.
There are two range rebalancers currently used in cockroach:
1. Replicate queue
2. Store rebalancer

This RFC will explain the implementation of copyset rebalancing for the
replicate queue, which processes one replica at a time.
Replica rebalancing by the store rebalancer will be disabled if copysets are
enabled (at least for the initial version). The store rebalancer can still
perform lease holder rebalancing.

The allocator uses a scoring function to decide:
1. Which store to use for a new replica of a range.
2. Which replica to remove when a range has more replicas than required.
3. Whether a replica should move from one store to another such that the
   resulting score for the range will be higher.

The scoring function considers the following (given in order of priority):
1. Zone constraints (which are constraints on having certain tables in
   certain zones).
2. Disk fullness: checks whether the source or target is too full.
3. Diversity score difference: the diversity score is proportional to the
   number of different localities the range has a replica in. It looks at
   the `(n choose 2)` pairwise diversity scores of the replicas based on
   their localities, where `n` is the number of replicas.
4. Convergence score difference: the convergence score is used to avoid
   moving ranges whose movement would cause the stats (range count) of a
   range to move away from the global mean.
5. Balance score difference: the balance score is the normalized utilization
   of a node. It currently considers the number of ranges. Nodes with a low
   balance score are preferred.
6. Range count difference: stores with a low range count are preferred.

### Copyset score
For rebalancing ranges into copysets, a new "copyset score" will be added to
the allocator. Priority-wise it will sit between (2) and (3) above: zone
constraints and disk fullness take a higher priority than the copyset score.

Since copyset allocation itself considers diversity, its priority can be
placed above the diversity score.
If copysets are disabled in the cluster, this score will have no impact on
rebalancing.

The copyset score (higher is better) of a range is high if:
1. The range is completely contained within a copyset.
2. The copysets the range is in are under-utilized. We want each copyset to
   be equally loaded.
   If a range is completely contained in a copyset `x`, we should move the
   range completely to a copyset `y` only if the nodes in copyset `y` have a
   **significantly** lower load (for example, the nodes in `y` have a lot
   more free disk space).

So the following replica transition for a range of RF 3 should be allowed in
case (2):
`x x x -> x x y -> x y y -> y y y`
where `x x x` means that the 3 replicas of the range are in copyset `x`.

Let's say `r` is the replication factor of a range. Each of its replicas
belongs to a node with a particular copyset ID. We can formally define the
scores as:
1. Homogeneity score: `number of pairwise same copyset IDs / (r choose 2)`.
2. Idle score: this score is proportional to how "idle" a store is. For
   starters we can consider this to be the % disk free (available capacity /
   total capacity of the store). We want ranges to migrate to copysets with
   a significantly lower load.
   1. The idle score of a store is proportional to the idleness of the
      store, e.g. the % disk free on the store.
   2. The idle score of a copyset is the lowest idle score of the stores in
      the copyset.
   3. The idle score of a range is the weighted average idle score of the
      copysets of the stores the range is present in. A range can be a part
      of multiple copysets while it is in flux (examples are given below).

The copyset score can be defined as
`(k * homogeneity_score + idle_score) / (k + 1)`.
It is normalized and lies between 0 and 1.
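These definitions translate directly into code. The sketch below is
illustrative (it assumes each replica's copyset ID and the idle score of
each replica's copyset are computed elsewhere; names are hypothetical):

```go
package copysets

// homogeneityScore returns the number of pairwise-equal copyset IDs divided
// by (r choose 2), where copysetIDs[i] is the copyset of replica i.
func homogeneityScore(copysetIDs []int) float64 {
	r := len(copysetIDs)
	same := 0
	for i := 0; i < r; i++ {
		for j := i + 1; j < r; j++ {
			if copysetIDs[i] == copysetIDs[j] {
				same++
			}
		}
	}
	return float64(same) / float64(r*(r-1)/2)
}

// copysetScore combines homogeneity and idleness as
// (k*homogeneity + idle) / (k + 1), which is normalized to [0, 1].
// copysetIdle[i] is the idle score of replica i's copyset; averaging per
// replica yields the weighted average described above.
func copysetScore(k float64, copysetIDs []int, copysetIdle []float64) float64 {
	var idleSum float64
	for _, s := range copysetIdle {
		idleSum += s
	}
	idleScore := idleSum / float64(len(copysetIdle))
	return (k*homogeneityScore(copysetIDs) + idleScore) / (k + 1)
}
```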
#### Computation of k
Let's say we want to migrate a range from a copyset `x` to a copyset `y`
only if the idle score of `y` differs by more than `d` (configurable).
If `d` is too small it could lead to thrashing of replicas, so we can use a
value like 15%.
Though the calculations below may seem a bit complex, to the end user we can
just expose `d`, which is easy to understand: the maximum difference allowed
between the idle scores of two copysets in the cluster.

For example, if the idle score of `x` is `a` and that of `y` is `a + d`, we
require:
```
copyset_score(x x x) < copyset_score(x x y)
k * homogeneityScore(x x x) + idleScore(x x x) < k * homogeneityScore(x x y) + idleScore(x x y)
# Generalizing for replication factor r, where r = 3 below
homogeneityScore(x x x) = 1
idleScore(x x x) = ra/r = a
homogeneityScore(x x y) = (r-1 choose 2) / (r choose 2)  # since 1 copyset is different
idleScore(x x y) = ((r-1) * a + a + d)/r = (ra + d) / r
# So we get
k * 1 + a <= k * (r-1 choose 2) / (r choose 2) + (ra + d) / r
=> k <= d / 2
```
For example, for `r = 3`, `d = 0.15` (so `k = 0.075`), an idle score of `x`
of `0.2`, and an idle score of `y` of `0.36`:
```
totalScore(x x x) = 0.075 * 1 + 0.2 = 0.275
totalScore(x x y) = 0.075 * 0.33 + (0.2 + 0.2 + 0.36)/3 = 0.278
```
So a range will migrate from
```
(x x x) -> (x x y) -> (x y y) -> (y y y)
```
The above migration will not happen if `y` has an idle score of `0.34`
(since `d = 0.15`).
The first step, `(x x x) -> (x x y)`, is the hardest, since homogeneity is
broken; the proof for this is given above.
For the `(x x y) -> (x y y)` step, the homogeneity score remains the same
and the idle score improves (since `y` has a better idle score).
For the `(x y y) -> (y y y)` step, both the homogeneity score and the idle
score improve.
When a range actually migrates from `(x x x)` to `(x x y)`, it goes through
an intermediate step `(x x x y)` after which one `x` replica is removed, but
similar math applies.

This scoring function will allow ranges to organically move into copysets
and will try to maintain an approximately equal load among copysets.
Thrashing will be avoided by choosing an appropriate value of `d`.
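The arithmetic above can be verified with a small self-contained program
(scores here are the unnormalized `k * homogeneity + idle` used in the
example; dividing by `k + 1` does not change any comparison):

```go
package main

import "fmt"

// score computes k*homogeneity + mean(idle scores) for a range.
func score(k, homogeneity float64, idle ...float64) float64 {
	var sum float64
	for _, s := range idle {
		sum += s
	}
	return k*homogeneity + sum/float64(len(idle))
}

func main() {
	const k = 0.075 // k = d/2 with d = 0.15
	xxx := score(k, 1, 0.2, 0.2, 0.2)      // 0.275
	xxy := score(k, 1.0/3, 0.2, 0.2, 0.36) // ~0.278: migration proceeds
	bad := score(k, 1.0/3, 0.2, 0.2, 0.34) // ~0.272: migration is blocked
	fmt.Printf("xxx=%.3f xxy=%.3f blocked=%.3f\n", xxx, xxy, bad)
}
```

This reproduces the `0.275` vs `0.278` comparison above, and shows that with
an idle score of `0.34` for `y` the first `(x x x) -> (x x y)` step never
wins.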
## Drawbacks
1. Copysets increase recovery time, since only nodes within the copyset of a
   crashed node can up-replicate its data. This can be mitigated by choosing
   a higher scatter width (a description of scatter width is given in the
   [academic literature](https://web.stanford.edu/~skatti/pubs/usenix13-copysets.pdf)).
2. Zone constraints will not be supported in the initial version of
   copysets. Copyset allocation can later be tweaked to respect zone
   constraints.
3. Heterogeneous clusters: copysets will work in heterogeneous clusters, but
   each copyset will be limited by the weakest node in the copyset (since
   the idle score of a copyset is the lowest node idle score). This may be
   something we can live with.
4. Copysets don't play well with the store rebalancer. For the first cut,
   store-based replica rebalancing will be disabled when copysets are
   enabled. Similar logic can be incorporated into the store rebalancer at a
   later point.

Due to the above drawbacks, copysets will be disabled by default, and there
will be a cluster setting through which users can enable copysets if they
are ok with these drawbacks.

## Rationale and Alternatives
There can be multiple approaches for both copyset allocation and the scoring
function. The design in this RFC is intentionally simple, and the respective
algorithms can be tweaked independently later.

### Chainsets
[Chainsets](http://hackingdistributed.com/2014/02/14/chainsets/) are one way
to make incremental changes to copysets, but again potentially at the cost
of reduced locality diversity. The length of the chain used in chainsets can
be considered equivalent to the replication factor in cockroach.

## Testing scenarios
Apart from unit tests, roachtests can be added which verify copyset based
rebalancing in the presence of:
1. Node additions / removals
2. Node crashes (up to 1/3rd of the cluster)
3. Changes of replication factors
4. Locality fault tolerance
5. Changes of constraints
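As a complement to the roachtests, the copyset allocation invariants
themselves can be unit tested. A sketch (reusing the hypothetical `Store`
and `optimalDiversityAllocation` from the earlier sketches; illustrative
only):

```go
package copysets

import "testing"

// TestCopysetsDisjointAndDiverse checks the two core invariants for a
// scatter width of rf - 1: no store appears in more than one copyset, and
// every copyset spans rf distinct localities when the store counts allow it.
func TestCopysetsDisjointAndDiverse(t *testing.T) {
	stores := []Store{
		{1, "L1"}, {2, "L1"}, {3, "L1"},
		{4, "L2"}, {5, "L2"}, {6, "L2"},
		{7, "L3"}, {8, "L3"}, {9, "L3"}, {10, "L3"},
	}
	const rf = 3
	seen := map[int]bool{}
	for _, cs := range optimalDiversityAllocation(stores, rf) {
		localities := map[string]bool{}
		for _, s := range cs {
			if seen[s.ID] {
				t.Fatalf("store %d appears in more than one copyset", s.ID)
			}
			seen[s.ID] = true
			localities[s.Locality] = true
		}
		if len(localities) < rf {
			t.Errorf("copyset %v spans only %d localities", cs, len(localities))
		}
	}
}
```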