- Feature name: Rebalancing V2
- Status: completed
- Start date: 2016-04-20
- Last revised: 2016-05-03
- Authors: Bram Gruneir & Cuong Do
- RFC PR: [#6484](https://github.com/cockroachdb/cockroach/pull/6484)
- Cockroach Issue:

# Table of Contents

- [Table of Contents](#table-of-contents)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Metrics for evaluating rebalancing](#metrics-for-evaluating-rebalancing)
- [Detailed Design](#detailed-design)
  - [Store: Add the ability to reserve a replica](#store-add-the-ability-to-reserve-a-replica)
  - [Protos: add a timestamp to *StoreDescriptor* and reservations to *StoreCapacity*](#protos-add-a-timestamp-to-storedescriptor-and-reservations-to-storecapacity)
  - [RPC: ReserveReplica](#rpc-reservereplica)
  - [Update *StorePool/Allocator* to call *ReserveReplica*](#update-storepoolallocator-to-call-reservereplica)
- [Drawbacks](#drawbacks)
  - [Too many requests](#too-many-requests)
- [Alternate Allocation Strategies](#alternate-allocation-strategies)
  - [Other enhancements to distributed allocation](#other-enhancements-to-distributed-allocation)
  - [Centralized allocation strategy](#centralized-allocation-strategy)
    - [Allocator lease acquisition](#allocator-lease-acquisition)
    - [Allocator lease renewal](#allocator-lease-renewal)
    - [Updating the allocator’s *StoreDescriptors*](#updating-the-allocators-storedescriptors)
    - [Centralized decision-making](#centralized-decision-making)
    - [Failure modes for allocation lease holders](#failure-modes-for-allocation-lease-holders)
    - [Conclusion](#conclusion)
  - [CopySets](#copysets)
  - [CopySets emulation](#copysets-emulation)
- [Allocation Heuristic Features](#allocation-heuristic-features)
- [Testing Scenarios](#testing-scenarios)
  - [Simulator](#simulator)
- [Unresolved Questions](#unresolved-questions)
  - [Centralized vs Decentralized](#centralized-vs-decentralized)

# Summary

Rebalancing is the redistribution of replicas to optimize for a chosen set of heuristics. Currently,
each range lease holder runs a distributed algorithm that spreads replicas as evenly as possible
across the stores in a cluster. We artificially limit the rate at which each node may move replicas
to avoid the excessive thrashing of replicas that results from making independent rebalancing
decisions based on outdated information (gossiped `StoreDescriptor`s that are up to a minute old).

As detailed later in this document, we’ve weighed decentralized against centralized allocation
algorithms, as well as different heuristics for evaluating whether replicas are properly balanced.
For V2 of our replica allocator, we are adding a replica reservation step to the distributed
allocator and intelligently increasing the frequency at which we gossip `StoreDescriptor`s. These
modifications will significantly reduce the time required to rebalance small-to-medium-sized
clusters while avoiding the waste of resources and degradation in performance caused by excessive
thrashing of replicas.

We’re specifically not addressing load-based rebalancing or heterogeneous stores/nodes in V2.
Moreover, we are not addressing the potential for data unavailability when more than one node goes
offline.
These are important problems that will be addressed in V3 or later.

# Motivation

To allocate replicas for ranges, we currently rely on distributed
[stateless replica relocation](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20150819_stateless_replica_relocation.md).

Each range lease holder is responsible for replica allocation decisions (adding and removing
replicas) for its respective range. This is a good, simple start. However, it is particularly
susceptible to thrashing. Because different range lease holders distribute replicas independently,
they don't necessarily converge on a desirable distribution of replicas within a reasonable number
of replica allocations.

A number of factors contribute to this thrashing. First, there is a lack of updated information on
the current state of all stores, so replication decisions typically rely on outdated data. The
current state of the cluster is retrieved from gossiped store descriptors. However, store
descriptors are only gossiped at an interval of once every minute. So if a store is suddenly
overloaded with replicas, it may take up to one minute (plus gossip network propagation time) for
its updated descriptor to reach various range lease holders.

Secondly, until recently, our replica allocator had no limit on how fast rebalancing could occur.
Combined with the lack of fresh data and the absence of coordination between replica allocators,
over-allocation to a new store is likely to occur.

For example, consider the following scenario where we have 3 perfectly balanced stores:

Let's add an empty store:

As soon as that store is seen by each range’s allocator, the following occurs:

This over-rebalancing continues for many cycles, often resulting in tens of thousands of replica
adds and removes for clusters with minuscule data.

As a stopgap measure, a
[recent change](https://github.com/cockroachdb/cockroach/commit/c4273b9ef7f418cab2ac30a10a8707c1601e5e99)
has added a minimum delay of 65 seconds between rebalance attempts for each node, to reduce
thrashing. This works well for small clusters with little data. However, this severely slows down
the process of rebalancing many replicas in an imbalanced cluster.

# Goals

Rebalancing is a topic of ongoing research and investigation. In this section, goals for a second
version of rebalancing are presented. The section that immediately follows presents a collection of
future goals for potential post V2 work.

Relative priorities are debatable and depend on deployment specifics, so the following are listed
in no particular order:

- Minimizes thrashing.

  Thrashing, which can occur when a node is added or removed, can quickly bring a cluster to a near
  halt: the number of replicas being moved between nodes causes requests to queue up waiting to be
  serviced.

- Is performant in clusters with up to 32 nodes.

  The choice of 32 nodes here matches our OKRs. This limit is to make testing more tractable.
  Performance will be measured using the metrics described in the Metrics section below.

- Is performant in a dynamic cluster. A dynamic cluster is one in which nodes can be added and
  removed arbitrarily.

  While this may seem like an obvious goal, we should ensure that equilibrium is reached quickly in
  cases when one or more nodes are added and removed at once. It should be noted that only a single
  node can be removed at a time, but any number of nodes can be added.

- Handles outages of any number of nodes gracefully, as long as quorum is maintained.

  This is the classic repair scenario.

- Don’t overfill stores.

  Overly full disks tend to perform worse. But this is also to ensure we don’t overly fill a new
  store when it’s added. Using the example from the motivation section above, if the new store
  could only hold 100 replicas it would be catastrophic for the cluster.

# Non-Goals

There are a number of interesting further investigations on improving rebalancing and how it can
impact overall cluster and perhaps individual operation performance. We list them here for
potential future work on what we are calling post V2 rebalancing. Again, these are not ordered by
priority:

- Is performant in heterogeneous clusters.

  Clusters with different-sized stores, different CPUs, and lagging nodes.

- Is performant in large clusters (>>32 nodes).

  Further work past our arbitrary limit of 32 nodes.

- Replicas are moved to where there is demand for them.

  Experiment to see if this would be useful. There may be performance gains from keeping replicas
  of single tables together on the same set of stores.

- Globally distributed data.

  How should the replicas be organized when there are potentially long round trips between
  datacenters?

- Optimizes data transfer based on network topology.

  Examples: ping times between replicas, network distance, rack and datacenter awareness.

- Decrease chance of data unavailability as nodes go down.

  See the discussion on CopySets below.

- Distribute "hot" keys and ranges evenly through the cluster.

  This would greatly help to distribute load and make the cluster more robust, able to handle a
  larger number of queries with lower latency.

- Defragment. Co-locate tables that are too big for a single range on the same set of stores.

  This could speed up all large table queries.

# Metrics for evaluating rebalancing

As with any system, a set of evaluation criteria is required to ensure that changes improve the
cluster. We propose the following criteria. It should be noted that most changes may positively
impact one and negatively affect the others:

- Distribution of data across stores. Measured using the percentage of store capacity available
  (a sketch of this measurement follows these lists).
- Speed at which perturbed systems reach equilibrium. Measured using the number of rebalances and
  the total time until the cluster is stable.

For post V2 consideration:

- User latency. Rebalancing should never affect user query latency, but too many rebalances may do
  just that. Measured using latencies of user queries.
- Distribution of load across stores. Measured using CPU usage and network traffic.
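
To make the first criterion concrete, a test harness could compute it as the standard deviation of
the fraction of capacity available across stores; a balanced cluster drives this toward zero. The
sketch below uses a trimmed-down, hypothetical `StoreCapacity` struct rather than the real proto:

``` golang
// Illustrative only: the "distribution of data" metric computed as the
// standard deviation of the fraction of capacity available on each store.
package metricsketch

import "math"

// StoreCapacity is a trimmed-down placeholder for the gossiped capacity
// information; the real proto has more fields.
type StoreCapacity struct {
	Capacity  int64 // total bytes
	Available int64 // free bytes
}

// availableFractionStdDev returns the standard deviation of the fraction of
// capacity available across stores. Assumes Capacity > 0 for every store.
func availableFractionStdDev(stores []StoreCapacity) float64 {
	if len(stores) == 0 {
		return 0
	}
	fractions := make([]float64, len(stores))
	var mean float64
	for i, s := range stores {
		fractions[i] = float64(s.Available) / float64(s.Capacity)
		mean += fractions[i]
	}
	mean /= float64(len(fractions))
	var variance float64
	for _, f := range fractions {
		d := f - mean
		variance += d * d
	}
	variance /= float64(len(fractions))
	return math.Sqrt(variance)
}
```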

# Detailed Design

The current distributed allocator cannot rebalance quickly because of the >= 65 second rebalancing
backoff. Because removing that backoff would cause excessive allocation thrashing, the `Allocator`
has to be modified to make faster progress while minimizing thrashing.

To reduce thrashing, we are introducing the concept of reserved replicas. Before allocating a new
replica, an allocator will first reserve the space for the new replica. This will be accomplished
by adding a new RPC, `ReserveReplica`, that requests to reserve space for the new replica on one
store. Once the request is received, the store can reply with either a yes or a no. When it replies
with a `reserved`, the space for said replica is reserved for a predetermined amount of time. If no
replica appears within that time, it is no longer reserved. (It should be noted that the size of a
replica depends on the split size based on the table and/or zone. This should be taken into
consideration.)

Each `ReserveReplica` response contains the latest `StoreDescriptor`s from the node with a
node-local timestamp. The caller can use these to update its cached copy of the `StoreDescriptor`s.

When the store replies with a `not reserved`, it also includes possible reasons as to why, for
debugging purposes. These reasons can include:

1. Too full in terms of absolute free disk space (this includes all reserved replica spots)
1. Overloaded (once we define what that term specifically means)
1. Too many current reservations (throttling factor will be determined experimentally)

Any other error, including networking errors, can be considered a `not reserved` response for the
purposes of allocation. When a `not reserved` is received, that response is cached in the store
pool until the next `StoreDescriptor` update. We avoid issuing further `ReserveReplica` calls to
that store until the next `StoreDescriptor` update.

The next subsections contain all of the major tasks that will be required to complete this feature
and further details about each.

## Store: Add the ability to reserve a replica

To prevent a store from being overwhelmed and overloaded, the concept of a reserved replica will be
added to a store. A reserved replica reserves a full replica’s amount of space for a predetermined
amount of time (typically one `storeGossipInterval`) and reserves it for the expected incoming
replica for a specific RangeID. If the replica for the range is not added within the reservation
timeframe, the reservation is removed and the space becomes free again.

If a new replica arrives and there is no reservation, the store will still create the new replica,
and this will not cancel any pre-existing reservations. The gating of when to allow a new
reservation is decided in the `ReserveReplica` RPC and is not part of adding a replica in the
store.

Optionally, the ability to control the amount of currently available space on a store might be used
to slow down a cluster from suddenly jumping on a new node when one becomes available. By
pre-reserving (for a non-existing store) all or most of the new store’s capacity and staggering the
timeouts, it may prevent all replicas from suddenly being interested in adding themselves to the
store. This will require some testing to determine if it will be beneficial.
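
For illustration, the reservation bookkeeping on the store side could look roughly like the
following minimal sketch. The type names, locking, and expire-on-access strategy are placeholders
rather than a settled design:

``` golang
// Illustrative sketch of store-side reservation bookkeeping; names and
// locking strategy are placeholders, not the final implementation.
package storereservation

import (
	"sync"
	"time"
)

type RangeID int64

// reservation holds space for one expected incoming replica.
type reservation struct {
	sizeBytes int64
	expiresAt time.Time
}

// reservations tracks the outstanding reservations for a single store.
type reservations struct {
	mu              sync.Mutex
	byRange         map[RangeID]reservation
	maxReservations int           // throttling factor, determined experimentally
	ttl             time.Duration // typically one gossip interval
}

func newReservations(maxReservations int, ttl time.Duration) *reservations {
	return &reservations{
		byRange:         make(map[RangeID]reservation),
		maxReservations: maxReservations,
		ttl:             ttl,
	}
}

// reserve attempts to reserve sizeBytes for the given range. available is the
// store's current free space, not counting outstanding reservations.
func (r *reservations) reserve(
	rangeID RangeID, sizeBytes, available int64, now time.Time,
) bool {
	r.mu.Lock()
	defer r.mu.Unlock()

	// Drop reservations that expired without being filled.
	var reserved int64
	for id, res := range r.byRange {
		if now.After(res.expiresAt) {
			delete(r.byRange, id)
			continue
		}
		reserved += res.sizeBytes
	}

	// Refuse if the store is already throttled or would become too full.
	if len(r.byRange) >= r.maxReservations {
		return false
	}
	if available-reserved < sizeBytes {
		return false
	}

	r.byRange[rangeID] = reservation{sizeBytes: sizeBytes, expiresAt: now.Add(r.ttl)}
	return true
}

// fill consumes a reservation when the replica for the range actually arrives.
// A replica arriving without a reservation is still accepted by the store.
func (r *reservations) fill(rangeID RangeID) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.byRange, rangeID)
}
```

Whether expired reservations are cleaned up lazily, as above, or by a timer is an open
implementation detail.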

## Protos: add a timestamp to *StoreDescriptor* and reservations to *StoreCapacity*

By adding a timestamp to the `StoreDescriptor` proto, it becomes possible to quickly pick the most
recent `StoreDescriptor`. This timestamp is local to the store that generated the
`StoreDescriptor`. The main use case for this is `ReserveReplica`: regardless of whether the
response is a `reserved` or a `not reserved`, the call also returns updated `StoreDescriptor`s for
all the stores on the node. These updated descriptors will be used to update the cached
`StoreDescriptor`s in the `StorePool` of the node calling `ReserveReplica`. There may be a small
race between these descriptors and new ones arriving via gossip. A timestamp fixes this problem.
Any subsequent calls to the allocator will have a fresher sense of the cluster. Note that it may be
possible to skip the addition of the timestamp by returning a gossip `Info` from the
`ReserveReplica` RPC.

By adding a `reservedSpace` value to `StoreCapacity`, we gain more insight into how the total
capacity of a store is being used and can make better decisions around it. Also, by adding
`activeReservations`, an allocator will be able to choose rebalancing targets that are not
overwhelmed with requests.

## RPC: ReserveReplica

Adding a `ReserveReplica` RPC to a node enables a range to reserve a replica on a store before
calling `ChangeReplicas` and adding the replica directly. Because no node will ever have the most
up-to-date information about another one, the response will always include an updated
`StoreDescriptor` for the store in which a reservation is requested. It should be noted that this
is a new type of RPC in that it addresses a store and not a node or range.

The request will include:

- `StoreID` of the store in which to reserve the replica space.
- `RangeID` of the requesting range. Consider repurposing a `ReplicaDescriptor` here.
- All other parameters that are required by the allocator, such as required attributes.

The response will include:

- `StoreDescriptor`s: the most up-to-date store descriptors for all stores on the node. Note that
  there may be a requirement to limit the number of times `engine.Capacity` is called, as this is
  doing a physical walk of the hard drive. Consider wrapping the descriptor in a gossip `Info`.
- `Status`: an enum or boolean value that indicates whether a replica has been reserved, i.e.
  either `reserved` or `not reserved`.

When determining if a store should reserve a new replica based on a request, it should first check
some basic conditions:

- Is the store being decommissioned?
- Are there too many reserved replicas?
- Is there enough free (non-reserved) space available on the store?

Typically the response will be a `reserved`. A response of `not reserved` will only occur when the
store is being overloaded or is close to being overly full. Even when a reservation has been made,
there is no guarantee that the calling store will actually fill the reservation. It only means that
the space has been reserved.
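
For concreteness, the request and response could carry fields along the following lines, with the
store applying the checks above before granting a reservation. This is a sketch with placeholder Go
types (the real messages would be protos), reusing the `reservedSpace`/`activeReservations`
additions proposed earlier:

``` golang
// Illustrative sketch of the ReserveReplica messages and the store-side
// checks. All names are placeholders; the real definitions would be protos.
package reserverpc

import "time"

type StoreID int32
type RangeID int64

// ReserveReplicaRequest asks one store to hold space for an incoming replica.
type ReserveReplicaRequest struct {
	StoreID    StoreID  // store on which to reserve the space
	RangeID    RangeID  // range that intends to add a replica
	Attributes []string // required attributes (e.g. disk type)
	SizeBytes  int64    // expected size of the incoming replica
}

// StoreDescriptorInfo is a trimmed-down stand-in for the gossiped descriptor,
// extended with the node-local timestamp and reservation fields proposed above.
type StoreDescriptorInfo struct {
	StoreID            StoreID
	UpdatedAt          time.Time // node-local timestamp
	Capacity           int64
	Available          int64
	ReservedSpace      int64 // sum of outstanding reservations
	ActiveReservations int   // number of outstanding reservations
}

// ReserveReplicaResponse reports whether the reservation was granted and
// always carries fresh descriptors for every store on the node.
type ReserveReplicaResponse struct {
	Reserved         bool
	StoreDescriptors []StoreDescriptorInfo
}

// shouldReserve applies the basic checks listed above.
func shouldReserve(
	req ReserveReplicaRequest, desc StoreDescriptorInfo,
	decommissioning bool, maxReservations int,
) bool {
	switch {
	case decommissioning:
		return false
	case desc.ActiveReservations >= maxReservations:
		return false
	case desc.Available-desc.ReservedSpace < req.SizeBytes:
		return false
	}
	return true
}
```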

## Update *StorePool/Allocator* to call *ReserveReplica*

When trying to choose a new store on which to allocate a replica for a range, the allocator first
sorts all the available stores and rules out the ones with incorrect attributes. It then picks the
top store on which to locate the new replica, based on the heuristic discussed at the end of this
document, and calls `ReserveReplica` on that node.

After each call to `ReserveReplica`, the `StorePool` on a node will update its cached
`StoreDescriptor`s. (Consider reusing or extending some of the gossip primitives, as this could be
partially considered a forced gossip update.)

On a `not reserved` response, note that the store refused so that it will not be considered for new
allocations for a short period (perhaps 1 second).

On a `reserved` response, the replica will issue a `replicaChangeRequest` to add the chosen store.
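
Putting the pieces together, the caller-side flow is roughly: sort the candidates, reserve on the
best one, and only issue the replica change once the reservation is granted. The interfaces below
are placeholders for the real `StorePool` and replica-change machinery, and falling through to the
next candidate on a refusal is just one possible policy:

``` golang
// Illustrative sketch of the allocator-side flow: reserve first, then add the
// replica. Types and method names are placeholders.
package allocsketch

import (
	"errors"
	"sort"
	"time"
)

type StoreID int32
type RangeID int64

// candidate is a store that passed attribute filtering, with its score from
// the rebalancing heuristic.
type candidate struct {
	storeID StoreID
	score   float64
}

// storePool stands in for the real StorePool.
type storePool interface {
	// ReserveReplica issues the RPC and refreshes the cached StoreDescriptors
	// from the response. ok is false for a "not reserved" reply or any error.
	ReserveReplica(storeID StoreID, rangeID RangeID) (ok bool, err error)
	// Throttle marks a store as ineligible for new allocations for a while.
	Throttle(storeID StoreID, d time.Duration)
}

// replicaChanger stands in for the machinery that issues the replica change.
type replicaChanger interface {
	ChangeReplicas(rangeID RangeID, target StoreID) error
}

// allocateTarget picks the best candidate, reserves space on it, and only
// then issues the replica change.
func allocateTarget(
	pool storePool, changer replicaChanger, rangeID RangeID, candidates []candidate,
) error {
	// Highest score first.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].score > candidates[j].score
	})
	for _, c := range candidates {
		ok, err := pool.ReserveReplica(c.storeID, rangeID)
		if err != nil || !ok {
			// Treat errors like "not reserved": back off from this store.
			pool.Throttle(c.storeID, time.Second)
			continue
		}
		return changer.ChangeReplicas(rangeID, c.storeID)
	}
	return errors.New("no store accepted the reservation")
}
```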

# Drawbacks

## Too many requests

When a new node joins the cluster and its gossiped `StoreDescriptor` makes its way to all stores
that could stand to have some ranges rebalanced, it may create too much network traffic calling the
`ReserveReplica` RPC. To ensure this doesn't happen, the RPC should be extremely quick to respond
and require very little processing on the receiving store's side, especially in the case that it is
a rejection.

# Alternate Allocation Strategies

This section contains a collection of other techniques and strategies that were considered. Some of
these enhancements may still be included in V2.

## Other enhancements to distributed allocation

Here is a small collection of tweaks that could be used to alter how distributed allocation works.
These are not being implemented now, but could be considered as alternatives if the
`ReserveReplica` strategy doesn’t solve all issues.

- Make the gossiping of `StoreDescriptor`s event driven: gossip anytime a snapshot is applied or a
  replica is garbage collected. If no event occurs, gossip the `StoreDescriptor` every
  `gossipStoresInterval`.

  This could reduce the time it takes for the updated descriptor to make its way to all other
  nodes.

- Decrease the `gossipStoresInterval` to 10 seconds so `StoreDescriptor`s are fresher.

  This adds a lot of churn to the gossiped descriptors, so the increased network traffic might
  outweigh the benefits of faster rebalancing.

- Move from using gossiped `StoreDescriptor`s (updated every 60 seconds) to gossiped
  `StoreStatuses` (written every 10 seconds).

  This would still require gossiping, which would incur the same problem as decreasing the
  `gossipStoresInterval`.

- Based on the latest `StoreDescriptor`s, determine how many stores would likely rebalance in the
  next 10 seconds. Then, each of those stores rebalances with probability
  `1/(# of candidate stores)`. For example, suppose that we're balancing by replica count. Two
  stores have 100 replicas, and one store has 0 replicas. So, there are 2 stores that are good
  candidates to move replicas from. Each of those 2 stores would have a `1/2` probability of
  starting a rebalance. We could speed this up by doing this virtual coin toss `N` times, where `N`
  is the total number of replicas we'd like to move to the destination store.

  This might be a useful option if there is still too much thrashing when a new node is added.

- Don't try to rebalance any other replicas on a store until the previous `ChangeReplicas` call has
  finished and the snapshot has been applied.

  This limits each store to performing a single `ChangeReplicas`/snapshot at a time. It would limit
  thrashing but also greatly increase the time it takes to reach equilibrium.

## Centralized allocation strategy

One way to avoid the thrashing caused by multiple independently acting allocators is to centralize
all replica allocation. In this section, a possible centralized allocation strategy is described in
detail.

### Allocator lease acquisition

Every 60 seconds, each node checks whether there’s an allocation lease holder
("allocator") through a `Get(globalAllocatorKey)`. If that returns no data, the
node tries to become the allocator lease holder using a `CPut` for that key. In
pseudo-code:

``` pseudocode
every 60 seconds:
  result := Get(globalAllocatorKey)
  if result != nil {
    // There is already an allocator. Do nothing.
    return
  }
  err := CPut(globalAllocatorKey, nodeID+"-"+expireTime, nil)
  if err != nil {
    // Some other node became the allocator.
    return
  }
  // This node is now the allocator.
```

### Allocator lease renewal

Near the end of the allocation lease, the current allocator does the following:

``` golang
err := CPut(globalAllocatorKey,
  nodeID+"-"+newExpireTime,
  nodeID+"-"+oldExpireTime)
if err != nil {
  // Renewal failed. Step down as allocator lease holder.
  return err
}
// Renewal succeeded. We’re still the allocation lease holder.
```

For example, if the allocation lease term is 60 seconds, the current allocation lease holder could
try to renew its lease 55 seconds into its term.

We may want to enforce artificial allocator lease term limits to more regularly
exercise the lease acquisition code.

### Updating the allocator’s *StoreDescriptors*

An allocation lease holder needs recent store information to make effective allocation decisions.

This could be achieved using either of two mechanisms: decreasing the interval for gossiping
`StoreDescriptor`s from 60 seconds to a lower value (perhaps 10 seconds), or writing the
descriptors to a system keyspace and retrieving them, possibly using inconsistent reads (also every
10 seconds or so). Using `StoreStatus`es instead of descriptors is also an option. Recall that
`StoreDescriptor` updates are infrequent and the allocation lease holder is the only node making
rebalancing decisions. So, the allocation lease holder could use the latest gossiped
`StoreDescriptor`s and its knowledge of the replica allocation decisions made since the last
`StoreDescriptor` gossip to derive the current state of replica allocation in the cluster.

### Centralized decision-making

Pseudo-code for centralized rebalancing:

``` pseudocode
for _, rangeDesc := range GetAllRangeDescriptorsForCluster() {
  makeAllocationDecision(rangeDesc, allStoreDescriptors)
}
```

The `StoreDescriptor`s are discussed in the previous section. `GetAllRangeDescriptorsForCluster`
warrants specific attention: it needs to retrieve a potentially large number of range descriptors.
For example, suppose that a cluster is storing 100 TiB of de-duplicated data. With 64 MiB ranges,
that is a minimum of 1,638,400 ranges, each with an associated `RangeDescriptor`. Requiring the
scanning, sorting and decision-making based on this large a collection could be a performance
problem. There are clearly methods which may solve some of these bottlenecking problems. Ideas
include only looking to move ranges from high to low loads or using a "power of two" technique to
randomly pick two stores when looking for a rebalance target.
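
Of these, the "power of two" idea is cheap to sketch: sample two distinct stores at random and move
toward the one with fewer replicas, avoiding a full scan and sort. The types below are illustrative
only:

``` golang
// Illustrative sketch of the "power of two choices" idea for picking a
// rebalance target without scanning every store.
package powersketch

import "math/rand"

type StoreID int32

type storeLoad struct {
	storeID    StoreID
	rangeCount int
}

// pickRebalanceTarget returns the less loaded of two distinct randomly chosen
// stores, or false if there are fewer than two stores to choose from.
func pickRebalanceTarget(stores []storeLoad, rng *rand.Rand) (StoreID, bool) {
	if len(stores) < 2 {
		return 0, false
	}
	i := rng.Intn(len(stores))
	j := rng.Intn(len(stores) - 1)
	if j >= i {
		j++ // ensure the two samples are distinct
	}
	if stores[j].rangeCount < stores[i].rangeCount {
		return stores[j].storeID, true
	}
	return stores[i].storeID, true
}
```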

### Failure modes for allocation lease holders

1. Poor network connectivity.
1. Leader node goes down.
1. Overloaded allocator node. This may be unlikely to cause problems that extend beyond one term.
   An overloaded allocator node probably wouldn’t complete its allocator lease renewal KV
   transaction before its term ends.

The likely failure modes can largely be alleviated by using short allocation lease terms.

### Conclusion

***Advantages***

- Less thrashing and no herding, since the allocator will know not to overload a new node.
  Distributed, independently acting allocators can make decisions that run counter to the others’
  decisions.
- Easier to debug, as there is only one place that performs the rebalancing.
- Easier to work with a CopySet-style algorithm (see below for a discussion on CopySets).

***Disadvantages***

- When making rebalancing decisions, there is a lack of information that must be overcome.
  Specifically, the `RangeDescriptor`s that are required when actually making the final decision
  are missing. These are too numerous to be gossiped and must be stored in and retrieved from the
  db directly. On the other hand, in a decentralized system, all `RangeDescriptor`s are already
  available directly in memory in the store.
- When dealing with a cluster that uses attributes, the central allocator will have to handle all
  rebalancing decisions by either using full knowledge of the cluster or by using subsets of the
  cluster based on combinations of all available attributes.
- As the cluster grows, there may be performance issues that arise on the central allocator. One
  way to alleviate this would be to ensure that the centralized allocator itself is located on the
  same node on which all required data exists (be it tables or indexes that might be required).
- If we use `CPut`s for allocator election, the range that contains the leader key becomes a single
  point of failure for the cluster. This could be alleviated by making the allocation lease holder
  the same as the range lease holder of the range holding the `StoreStatus` protos.
- More internal work needs to be done to support a centralized system, be it via an election or
  using the range lease holder of a specific key.

***Verdict***

The main issue that causes the thrashing and overloading of stores is lack of current information.
A big part of that is the lack of knowledge about allocation decisions that are occurring while
making other decisions. A centralized allocator would fix those issues. However, implementation and
performance issues may arise from moving to a central allocator: the potential requirement to
iterate over some or all of the `RangeDescriptor`s, the concern of making all rebalancing decisions
in an expedient manner, and the cases in which the centralized allocator itself is faulty in some
way all make the centralized solution less appealing.

## CopySets

[https://www.usenix.org/system/files/conference/atc13/atc13-cidon.pdf](https://www.usenix.org/system/files/conference/atc13/atc13-cidon.pdf)

By using an algorithm to determine the best CopySets for each type of configuration (ignoring
overlap), we can limit the locations of all replicas, and as the shape of the cluster changes, it
can adapt appropriately.
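
For reference, the copyset generation scheme described in the paper can be sketched in a few lines:
build `ceil(S/(R-1))` random permutations of the nodes, where `R` is the replication factor and `S`
the desired scatter width, and chop each permutation into groups of `R`. The sketch below only
illustrates the idea; node IDs and the handling of leftover nodes are simplified:

``` golang
// Illustrative sketch of permutation-based copyset generation, roughly as
// described in the Copysets paper.
package copysketch

import "math/rand"

type NodeID int32

// generateCopysets builds ceil(s/(r-1)) random permutations of the nodes and
// chops each into groups of r. Nodes left over at the end of a permutation
// are dropped for simplicity.
func generateCopysets(nodes []NodeID, r, s int, rng *rand.Rand) [][]NodeID {
	if r < 2 || len(nodes) < r {
		return nil
	}
	numPermutations := (s + r - 2) / (r - 1) // ceil(s / (r-1))
	var copysets [][]NodeID
	for p := 0; p < numPermutations; p++ {
		perm := append([]NodeID(nil), nodes...)
		rng.Shuffle(len(perm), func(i, j int) { perm[i], perm[j] = perm[j], perm[i] })
		for i := 0; i+r <= len(perm); i += r {
			copysets = append(copysets, append([]NodeID(nil), perm[i:i+r]...))
		}
	}
	return copysets
}
```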

***Advantages***

- Greatly reduces the chance of data loss when more than one node dies.
- No central controller/lease holder.
- No fighting with all ranges when a new node joins or one is lost.
- It will take a bit of time for all nodes to receive the updated gossiped network topology, so
  this might be a good way to gate the changes.
- While there is greater complexity in the algorithm for determining the CopySets themselves, the
  allocator becomes extremely simple.

***Disadvantages***

- When a new node joins, a number of replicas may need to move, all at once, depending on how the
  algorithm is set up. So some artificial limiting may be required when a node is added or removed.
- Heterogeneous environments, in which stores differ in size, also make the CopySets algorithm
  extremely problematic.
- In dynamic environments, ones in which nodes are added and removed, the CopySets algorithm will
  lead to potential store rot.

***Verdict***

While the advantages of CopySets are clear, its disadvantages are too numerous. The CopySets
algorithm only works well in a static (no nodes added or removed) and homogeneous (all stores are
the same size) setup. Trying to work around these limitations leads one into a rabbit hole. Here is
a list of considered ways to shoehorn the algorithm into our system:

- For the dynamic cluster - recalculate the CopySets each time a node is added or removed and then
  move all misplaced replicas.
- For heterogeneous stores - split all store space into blocks (of around 100 or 1000 replicas) and
  run the algorithm against those blocks.
- For zones and different replication factors - have a collection of CopySets, one for each
  combination, with overlap.
- For the overlaps created by the zones fix - make CopySets that contain more than the number of
  replicas, so make the CopySet fit to 4 instead of 3, and rebalance amongst the 4 stores.

Each of these "solutions" adds more complexity and takes away from the original benefit of using
the CopySets algorithm in the first place.

## CopySets emulation

As an alternative to implementing the CopySets algorithm directly, add a secondary tier of
rebalancing that adds an affinity for co-locating replicas on the same set of stores. This can be
done by simply applying a small weight to having all replicas co-located. Note that this should not
override the need for real rebalancing, but, all other factors being equal, choose the store with
the most other replicas in common.

Testing will be required to see if this has the desired effect.
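
One possible shape for that weight: when scoring a candidate store, count the ranges it already
shares with the stores holding this range's existing replicas, scaled by a factor small enough that
it only breaks ties. The names and the weight below are placeholders:

``` golang
// Illustrative sketch of a co-location affinity bonus for candidate stores.
package affinitysketch

type StoreID int32
type RangeID int64

// affinityWeight is deliberately small so that co-location only breaks ties
// and never overrides the primary balancing heuristic.
const affinityWeight = 0.01

// colocationScore returns a small bonus for a candidate store based on how
// many ranges it already shares with the stores holding this range's other
// replicas. rangesByStore maps each store to the set of ranges it holds.
func colocationScore(
	candidate StoreID,
	existingReplicas []StoreID,
	rangesByStore map[StoreID]map[RangeID]bool,
) float64 {
	candidateRanges := rangesByStore[candidate]
	shared := 0
	for _, s := range existingReplicas {
		for r := range rangesByStore[s] {
			if candidateRanges[r] {
				shared++
			}
		}
	}
	return affinityWeight * float64(shared)
}
```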

***Advantages***

- A weaker constraint than straight CopySets. CopySets prescribe exactly where each replica should
  go, while this method will let that happen organically.
- Easy to add to our current balancing heuristic.
- Reduces the chance of data loss when more than one node dies.

***Disadvantages***

- May cause more thrashing and more rebalances before equilibrium is reached.
- It will never be as efficient as straight CopySets.
- There is a chance that the cluster gets into a less desirable state if not done carefully.

***Verdict***

If done well, this could greatly reduce the risk of data loss when more than one node dies. This
should be in serious consideration for rebalancing V3.

# Allocation Heuristic Features

Currently, the allocator makes replica counts converge on the cluster mean range count. This
effectively reduces the standard deviation of replica counts across stores. Stores that are too
full (95% used capacity) do not receive new replicas.

Possible changes for V2:

- **Mean vs. median**

  It is possible that converging on the mean has undesirable consequences under certain scenarios.
  We may want to converge on the median instead. Care should be taken that whatever is chosen works
  well for small *and* large clusters.

For future consideration (post V2):

- **Store capacity available**

  Care must be taken. Using free disk space is problematic, because for nearly empty clusters, the
  OS install dominates disk usage. This will be one of the first aspects to look at in post V2
  work.

- **Node load**

  We will likely want to move replicas off overloaded nodes, for some definition of "load."

- **Node health**

  If a node is serving requests slowly for a sustained period, we should rebalance away from that
  node. This is related but not identical to load.

# Testing Scenarios

The chosen allocation strategy should perform well in the following scenarios:

For V2:

1. Small (3 node) cluster
1. Medium (32 node) cluster
1. Bringing up new nodes in a staggered fashion
1. Bringing up new nodes all at once
1. Removing nodes, one at a time
1. Removing and bringing a node back up after different timeouts
1. Cluster with overloaded stores (i.e. hotspots)
1. Nearly full stores
1. Node permanently going down
1. Network slowness
1. Changing the attribute labels of a store

For future consideration (post V2):

1. Large (100+ node) cluster
1. Very large (1000+ node) cluster
1. Stores with different capacities
1. Heterogeneous nodes (CPU)
1. Replication factor > 3 (some basic testing will be done, but it won’t be concentrated on)
1. Geographically distributed cluster

It will take many iterations to arrive at a replication strategy that works for all of these cases.
These will be incorporated into unit and acceptance tests as applicable.

## Simulator

To aid in testing and speed it up, the rebalancing simulator will be updated. Some of these updates
include:

- Take a running cluster and output the current and all previous states so that the simulator can
  emulate it.
- Convert the custom allocator input formats to protos.
- Update the simulator based on changes proposed in this RFC (i.e. add replica reservations).
- Add a collection of more accurate metrics.

# Unresolved Questions

## Centralized vs Decentralized

Both approaches are clearly viable solutions with advantages and drawbacks. Is one option
objectively better than the other? It might be worthwhile to test the performance of both a
centralized and a decentralized rebalancing scheme in different configurations under different
loads. One option would be to update the simulator to be able to test both, but that would not be
an ideal environment. How much time will this take and can it be done quickly?