
- Feature name: Rebalancing V2
- Status: completed
- Start date: 2016-04-20
- Last revised: 2016-05-03
- Authors: Bram Gruneir & Cuong Do
- RFC PR: [#6484](https://github.com/cockroachdb/cockroach/pull/6484)
- Cockroach Issue:

# Table of Contents

- [Table of Contents](#table-of-contents)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Metrics for evaluating rebalancing](#metrics-for-evaluating-rebalancing)
- [Detailed Design](#detailed-design)
  - [Store: Add the ability to reserve a replica](#store-add-the-ability-to-reserve-a-replica)
  - [Protos: add a timestamp to *StoreDescriptor* and reservations to *StoreCapacity*](#protos-add-a-timestamp-to-storedescriptor-and-reservations-to-storecapacity)
  - [RPC: ReserveReplica](#rpc-reservereplica)
  - [Update *StorePool/Allocator* to call *ReserveReplica*](#update-storepoolallocator-to-call-reservereplica)
- [Drawbacks](#drawbacks)
  - [Too many requests](#too-many-requests)
- [Alternate Allocation Strategies](#alternate-allocation-strategies)
  - [Other enhancements to distributed allocation](#other-enhancements-to-distributed-allocation)
  - [Centralized allocation strategy](#centralized-allocation-strategy)
    - [Allocator lease acquisition](#allocator-lease-acquisition)
    - [Allocator lease renewal](#allocator-lease-renewal)
    - [Updating the allocator’s *StoreDescriptors*](#updating-the-allocators-storedescriptors)
    - [Centralized decision-making](#centralized-decision-making)
    - [Failure modes for allocation lease holders](#failure-modes-for-allocation-lease-holders)
    - [Conclusion](#conclusion)
  - [CopySets](#copysets)
  - [CopySets emulation](#copysets-emulation)
- [Allocation Heuristic Features](#allocation-heuristic-features)
- [Testing Scenarios](#testing-scenarios)
  - [Simulator](#simulator)
- [Unresolved Questions](#unresolved-questions)
  - [Centralized vs Decentralized](#centralized-vs-decentralized)

# Summary

Rebalancing is the redistribution of replicas to optimize for a chosen set of heuristics. Currently,
each range lease holder runs a distributed algorithm that spreads replicas as evenly as possible across
the stores in a cluster. We artificially limit the rate at which each node may move replicas to
avoid the excessive thrashing of replicas that results from making independent rebalancing decisions
based on outdated information (gossiped `StoreDescriptor`s that are up to a minute old).

As detailed later in this document, we’ve weighed decentralized against centralized allocation
algorithms, as well as different heuristics for evaluating whether replicas are properly balanced.
For V2 of our replica allocator, we are adding a replica reservation step to the distributed
allocator and intelligently increasing the frequency at which we gossip `StoreDescriptor`s.
These modifications will significantly reduce the time required to rebalance small-to-medium-sized
clusters while avoiding the waste of resources and degradation in performance caused by excessive
thrashing of replicas.

We’re specifically not addressing load-based rebalancing or heterogeneous stores/nodes in V2.
Moreover, we are not addressing the potential for data unavailability when more than one node goes
offline. These are important problems that will be addressed in V3 or later.

# Motivation

To allocate replicas for ranges, we currently rely on distributed
[stateless replica relocation](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20150819_stateless_replica_relocation.md).

Each range lease holder is responsible for replica allocation decisions (adding and removing replicas)
for its respective range. This is a good, simple start. However, it is particularly susceptible to
thrashing. Because different range lease holders distribute replicas independently, they don't necessarily
converge on a desirable distribution of replicas within a reasonable number of replica allocations.

A number of factors contribute to this thrashing. First, there is a lack of updated information on
the current state of all stores, so replication decisions typically rely on outdated data. The
current state of the cluster is retrieved from gossiped store descriptors. However, store
descriptors are only gossiped at an interval of once every minute. So if a store is suddenly
overloaded with replicas, it may take up to one minute (plus gossip network propagation time) for
its updated descriptor to reach various range lease holders.

Secondly, until recently, our replica allocator had no limit on how fast rebalancing could occur.
Combined with the lack of fresh data and the absence of coordination between replica allocators,
this makes over-allocation to a new store likely.

For example, consider the following scenario where we have 3 perfectly balanced stores:

![Thrashing 1](images/rebalancing-v2-thrashing1.png?raw=true "Thrashing 1")

Let's add an empty store:

![Thrashing 2](images/rebalancing-v2-thrashing2.png?raw=true "Thrashing 2")

As soon as that store is seen by each range’s allocator, the following occurs:

![Thrashing 3](images/rebalancing-v2-thrashing3.png?raw=true "Thrashing 3")

This over-rebalancing continues for many cycles, often resulting in tens of thousands of replica
adds and removes for clusters with minuscule data.

As a stopgap measure, a
[recent change](https://github.com/cockroachdb/cockroach/commit/c4273b9ef7f418cab2ac30a10a8707c1601e5e99)
has added a minimum delay of 65 seconds between rebalance attempts for each node, to reduce
thrashing. This works well for small clusters with little data. However, this severely slows down
the process of rebalancing many replicas in an imbalanced cluster.

# Goals

Rebalancing is a topic of ongoing research and investigation. This section presents the goals for a
second version of rebalancing; the section that immediately follows presents a collection of future
goals for potential post-V2 work.

Relative priorities are debatable and depend on deployment specifics, so the following are listed
in no particular order:

- Minimizes thrashing.

  Thrashing, which can occur when a node is added or removed, can quickly bring a cluster to a near
  halt due to the number of replicas being moved between nodes, which results in requests queuing up
  waiting to be serviced.

- Is performant in clusters with up to 32 nodes.

  The choice of 32 nodes here matches our OKRs. This limit is to make testing more tractable.
  Performance will be measured using the metrics described in the Metrics section below.

- Is performant in a dynamic cluster. A dynamic cluster is one in which nodes can be added and
  removed arbitrarily.

  While this may seem like an obvious goal, we should ensure that equilibrium is reached quickly
  in cases when one or more nodes are added and removed at once. It should be noted that only
  a single node can be removed at a time but any number of nodes can be added.

- Handles outages of any number of nodes gracefully, as long as quorum is maintained.

  This is the classic repair scenario.

- Don’t overfill stores.

  Overly full disks tend to perform worse. But this is also to ensure we don’t overly fill a new
  store when it’s added. Using the example from the motivation section above, if the new store
  could only hold 100 replicas it would be catastrophic for the cluster.

# Non-Goals

There are a number of interesting further investigations on improving rebalancing and how it can
impact overall cluster and perhaps individual operation performance. We list them here for
potential future work on what we are calling post-V2 rebalancing. Again, these are not ordered by
priority:

- Is performant in heterogeneous clusters.

  Clusters with different-sized stores, different CPUs, and lagging nodes.

- Is performant in large clusters (>>32 nodes).

  Further work past our arbitrary limit of 32 nodes.

- Replicas are moved to where there is demand for them.

  Experiment to see if this would be useful. There may be performance gains from keeping replicas of
  single tables together on the same set of stores.

- Globally distributed data.

  How should the replicas be organized when there are potentially long round trips between
  datacenters?

- Optimizes data transfer based on network topology.

  Examples: ping times between replicas, network distance, rack and datacenter awareness.

- Decrease chance of data unavailability as nodes go down.

  See the discussion on CopySets below.

- Distribute "hot" keys and ranges evenly through the cluster.

  This would greatly help to distribute load and make the cluster more robust and able to handle
  a larger number of queries with lower latency.

- Defragment. Co-locate tables that are too big for a single range to the same set of stores.

  This could speed up all large table queries.

# Metrics for evaluating rebalancing

As with any system, a set of evaluation criteria is required to ensure that changes improve the
cluster. We propose the following criteria. It should be noted that most changes may positively
impact one and negatively affect the others:

- Distribution of data across stores. Measured using the percentage of store capacity available.
- Speed at which perturbed systems reach equilibrium. Measured using the number of rebalances
  and the total time until the cluster is stable.

For post-V2 consideration:

- User latency. Rebalancing should never affect user query latency, but too many rebalances may do
  just that. Measured using latencies of user queries.
- Distribution of load across stores. Measured using CPU usage and network traffic.

# Detailed Design

The current distributed allocator cannot rebalance quickly because of the >= 65 second rebalancing
backoff. Because removing that backoff would cause excessive allocation thrashing, the `Allocator`
has to be modified to make faster progress while minimizing thrashing.

To reduce thrashing, we are introducing the concept of reserved replicas. Before allocating a new
replica, an allocator will first reserve the space for the new replica. This will be accomplished
by adding a new RPC, `ReserveReplica`, that requests to reserve space for the new replica on one
store. Once received, the store can reply with either a yes or a no. When it replies with a
`reserved`, the space for said replica is reserved for a predetermined amount of time. If no
replica appears within that time, it is no longer reserved. (It should be noted that the size of a
replica depends on the split size based on the table and/or zone. This should be taken into
consideration.)

Each `ReserveReplica` response contains the latest `StoreDescriptor`s from the node with a
node-local timestamp. The caller can use these to update its cached copy of the `StoreDescriptor`.

When it replies with a `not reserved`, it also includes possible reasons as to why, for debugging
purposes. These reasons can include:

1. Too full in terms of absolute free disk space (this includes all reserved replica spots)
1. Overloaded (once we define what that term specifically means)
1. Too many current reservations (throttling factor will be determined experimentally)

Any other error, including networking errors, can be considered a `not reserved` response for the
purposes of allocation. When a `not reserved` is received, that response is cached in the store
pool until the next `StoreDescriptor` update. We avoid issuing further `ReserveReplica` calls to
that store until the next `StoreDescriptor` update.

The next subsections contain all of the major tasks that will be required to complete this feature
and further details about each.

## Store: Add the ability to reserve a replica

To prevent a store from being overwhelmed and overloaded, the concept of a reserved replica will be
added to a store. A reserved replica reserves a full replica’s amount of space for a predetermined
amount of time (typically for a `storeGossipInterval`) and reserves it for the expected incoming
replica for a specific RangeID. If the replica for the range is not added within the reservation
timeframe, the reservation is removed and the space becomes free again.

If a new replica arrives and there is no reservation, the store will still create the new replica
and this will not cancel any pre-existing reservations. The gating of when to allow a new
reservation is decided in the `ReserveReplica` RPC and is not part of adding a replica in the
store.

Optionally, the ability to control the amount of currently available space on a store might be used
to keep a cluster from suddenly piling onto a new node when one becomes available. By
pre-reserving (for non-existent ranges) all or most of the new store’s capacity and staggering
the timeouts, it may prevent all replicas from suddenly being interested in adding themselves to
the store. This will require some testing to determine if it will be beneficial.

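As a rough sketch of the bookkeeping this implies, the snippet below tracks reservations keyed by
range ID and lets unfulfilled entries lapse after a TTL. The `bookie` type, its fields, and the
prune-on-read behavior are illustrative assumptions, not the actual implementation.

``` golang
package storage

import (
	"sync"
	"time"
)

// reservation is a hypothetical record of space promised to an incoming
// replica of a specific range; names and fields are illustrative only.
type reservation struct {
	rangeID   int64
	sizeBytes int64
	expiresAt time.Time
}

// bookie tracks outstanding reservations for a single store. A reservation
// that is not fulfilled before expiresAt simply lapses the next time the
// store inspects its book.
type bookie struct {
	mu           sync.Mutex
	ttl          time.Duration // typically one storeGossipInterval
	reservations map[int64]reservation
}

// reserve records a reservation for the given range, replacing any prior one.
func (b *bookie) reserve(rangeID, sizeBytes int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.reservations[rangeID] = reservation{
		rangeID:   rangeID,
		sizeBytes: sizeBytes,
		expiresAt: time.Now().Add(b.ttl),
	}
}

// reservedBytes reports the space promised to unexpired reservations,
// pruning expired entries as they are encountered.
func (b *bookie) reservedBytes() int64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	var total int64
	now := time.Now()
	for id, r := range b.reservations {
		if now.After(r.expiresAt) {
			delete(b.reservations, id)
			continue
		}
		total += r.sizeBytes
	}
	return total
}
```
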
## Protos: add a timestamp to *StoreDescriptor* and reservations to *StoreCapacity*

Adding a timestamp to the `StoreDescriptor` proto makes it possible to quickly pick the most recent
`StoreDescriptor`. This timestamp is local to the store that generated the `StoreDescriptor`. The
main use case for this is when calling `ReserveReplica`: regardless of whether the response is a
`reserved` or a `not reserved`, it will also return updated `StoreDescriptor`s for all the stores
on the node. These updated descriptors will be used to update the cached `StoreDescriptor`s in the
`StorePool` of the node calling `ReserveReplica`. There may be a small race between these
descriptors and new ones that are arriving via gossip. A timestamp fixes this problem. Any
subsequent calls to the allocator will have a fresher sense of the cluster. Note that it may be
possible to skip the addition of the timestamp by returning a gossip `Info` from the
`ReserveReplica` RPC.

Adding a `reservedSpace` value to the capacity gives more insight into how the total capacity of a
store is used and allows better decisions to be made around it. Also, by adding
`activeReservations`, an allocator will be able to choose rebalancing targets that are not
overwhelmed with requests.

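The real messages are protocol buffers; the Go structs below only sketch the proposed additions (a
node-local timestamp on the descriptor, plus reservation accounting in the capacity), with field
names and types assumed for illustration.

``` golang
package storage

import "time"

// StoreCapacity sketches the proposed additions to the existing capacity
// proto: how much space is promised to reservations and how many
// reservations are currently outstanding. Field names are assumptions.
type StoreCapacity struct {
	Capacity           int64 // total bytes on the store
	Available          int64 // free bytes, not counting reserved space
	ReservedSpace      int64 // bytes promised to unexpired reservations
	ActiveReservations int32 // number of outstanding reservations
}

// StoreDescriptor sketches the addition of a node-local timestamp so that a
// caller can always keep the freshest descriptor it has seen, whether it
// arrived via gossip or in a ReserveReplica response.
type StoreDescriptor struct {
	StoreID   int32
	Capacity  StoreCapacity
	UpdatedAt time.Time // local clock of the store that generated it
}

// fresher returns the more recently generated of two descriptors for the
// same store; this is how the small race between gossip and RPC responses
// is resolved.
func fresher(a, b StoreDescriptor) StoreDescriptor {
	if a.UpdatedAt.After(b.UpdatedAt) {
		return a
	}
	return b
}
```
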
## RPC: ReserveReplica

Adding a `ReserveReplica` RPC to a node enables a range to reserve a replica on a node before
calling `changeReplica` and adding the replica directly. Because no node will ever have the most
up-to-date information about another one, the response will always include an updated
`StoreDescriptor` for the store in which a reservation is requested. It should be noted that this
is a new type of RPC in that it addresses a store and not a node or range.

The request will include:

- `StoreID` of the store in which to reserve the replica space
- `RangeID` of the requesting range. Consider repurposing a `ReplicaDescriptor` here.
- All other parameters that are required by the allocator, such as required attributes.

The response will include:

- `StoreDescriptor`s The most up-to-date store descriptors for all stores on the node. Note that
  there may be a requirement to limit the number of times `engine.Capacity` is called as this is
  doing a physical walk of the hard drive. Consider wrapping the descriptor in a gossip `Info`.
- `Status` An ENUM or boolean value that indicates if a replica has been reserved or not. Usually
  this will be either a `reserved` or `not reserved`.

When determining if a store should reserve a new replica based on a request, it should first check
some basic conditions:

- Is the store being decommissioned?
- Are there too many reserved replicas?
- Is there enough free (non-reserved) space available on the store?

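A minimal sketch of these checks, assuming hypothetical names and a throttling threshold that would
be determined experimentally:

``` golang
package storage

// reserveStatus mirrors the proposed reserved / not reserved response values.
type reserveStatus int

const (
	reserved reserveStatus = iota
	notReserved
)

// canReserve sketches the basic admission checks above. The throttling
// threshold and the free-space rule are assumptions to be tuned
// experimentally, not settled values.
func canReserve(decommissioning bool, activeReservations, maxReservations int,
	availableBytes, reservedBytes, replicaSizeBytes int64) reserveStatus {
	if decommissioning {
		return notReserved // store is being drained; refuse new replicas
	}
	if activeReservations >= maxReservations {
		return notReserved // too many outstanding reservations
	}
	if availableBytes-reservedBytes < replicaSizeBytes {
		return notReserved // not enough unreserved space for a full replica
	}
	return reserved
}
```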
Typically the response will be a yes. A response of `not reserved` will only occur when the store
is being overloaded or is close to being overly full. Even when a reservation has been made, there
is no guarantee that the calling store will still fill the reservation. It only means that the
space has been reserved.

## Update *StorePool/Allocator* to call *ReserveReplica*

When choosing a new store on which to allocate a replica for a range, the allocator first sorts all
the available stores and rules out the ones with incorrect attributes. It then picks the top store
on which to locate the new replica, based on the heuristic discussed at the end of this document,
and calls `ReserveReplica` on that store's node.

After each call to `ReserveReplica`, the `StorePool` on a node will update its cached
`StoreDescriptor`s. (Consider reusing or extending some of the gossip primitives as this could be
partially considered a forced gossip update.)

On a `not reserved` response, record that the store refused so that it will not be considered for
new allocations for a short period (perhaps 1 second).

On a `reserved` response, the replica will issue a `replicaChangeRequest` to add the chosen store.

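The sketch below illustrates this caller-side flow. The names are hypothetical, and trying the
next-best candidate after a refusal is an assumption on top of what this section prescribes.

``` golang
package storage

import "sort"

// candidate pairs a store with the score produced by the allocation
// heuristic; both the type and the scoring are placeholders.
type candidate struct {
	storeID int32
	score   float64
}

// rebalanceTarget sketches the caller side of the new flow: rank the
// candidate stores, ask the best one for a reservation, and try the next one
// if it refuses. reserve stands in for the ReserveReplica RPC (which also
// returns fresh StoreDescriptors, omitted here), and markRefused records a
// refusal so the store is skipped for a short period.
func rebalanceTarget(cands []candidate,
	reserve func(storeID int32) bool,
	markRefused func(storeID int32)) (int32, bool) {
	// Highest score first, per the heuristic discussed at the end of this RFC.
	sort.Slice(cands, func(i, j int) bool { return cands[i].score > cands[j].score })
	for _, c := range cands {
		if reserve(c.storeID) {
			// Reservation granted: the caller now issues the ChangeReplicas call.
			return c.storeID, true
		}
		markRefused(c.storeID)
	}
	return 0, false
}
```
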
# Drawbacks

## Too many requests

When a new node joins the cluster and its gossiped `StoreDescriptor` makes its way to all stores
that could stand to have some ranges rebalanced, it may create too much network traffic calling the
`ReserveReplica` RPC. To ensure this doesn't happen, the RPC should be extremely quick to respond
and require very little processing on the receiving store's side, especially in the case that it is
a rejection.

# Alternate Allocation Strategies

This section contains a collection of other techniques and strategies that were considered. Some
of these enhancements may still be included in V2.

## Other enhancements to distributed allocation

Here is a small collection of tweaks that could be used to alter how distributed allocation could
work. These are not being implemented now, but could be considered as alternatives if the
`ReserveReplica` strategy doesn’t solve all issues.

- Make the gossiping of `StoreDescriptor`s event driven: gossip any time a snapshot is applied or a
  replica is garbage collected. If no event occurs, gossip the `StoreDescriptor` every
  `gossipStoresInterval`.

  This could reduce the time it takes for the updated descriptor to make its way to all other
  nodes.

- Decrease the `gossipStoresInterval` to 10 seconds so `StoreDescriptor`s are fresher.

  This adds a lot of churn to the gossiped descriptors so the increased network traffic might
  outweigh the benefits of faster rebalancing.

- Move from using gossiped `StoreDescriptor`s (updated every 60 seconds) to gossiped
  `StoreStatuses` (written every 10 seconds).

  This would require gossiping, which would incur the same problem as decreasing the
  `gossipStoresInterval`.

- Based on the latest `StoreDescriptor`s, determine how many stores would likely rebalance in the
  next 10 seconds. Then, each of those stores rebalances with probability
  `1/(# of candidate stores)`. For example, suppose that we're balancing by replica count. Two
  stores have 100 replicas, and one store has 0 replicas. So, there are 2 stores that are good
  candidates to move replicas from. Each of those 2 stores would have a `1/2` probability of
  starting a rebalance. We could speed this up by doing this virtual coin toss `N` times, where `N`
  is the total number of replicas we'd like to move to the destination store.

  This might be a useful option if there is still too much thrashing when a new node is added; a
  sketch of the coin toss follows this list.

- Don't try to rebalance any other replicas on a store until the previous `ChangeReplicas` call has
  finished and the snapshot has been applied.

  This limits each store to performing a single `ChangeReplica`/Snapshot at a time. It would limit
  thrashing but also greatly increase the time it takes to reach equilibrium.

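A sketch of the virtual coin toss described in the probabilistic bullet above; the function name
and parameters are illustrative only.

``` golang
package storage

import "math/rand"

// shouldStartRebalance sketches the virtual coin toss: if numCandidates
// stores are plausible rebalance sources in this window, each one starts a
// rebalance with probability 1/numCandidates, and tossing N times lets up to
// N replicas move toward the destination store faster.
func shouldStartRebalance(rng *rand.Rand, numCandidates, tosses int) bool {
	if numCandidates <= 0 {
		return false
	}
	p := 1.0 / float64(numCandidates)
	for i := 0; i < tosses; i++ {
		if rng.Float64() < p {
			return true
		}
	}
	return false
}
```
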
## Centralized allocation strategy

One way to avoid the thrashing caused by multiple independently acting allocators is to centralize
all replica allocation. In this section, a possible centralized allocation strategy is described in
detail.

### Allocator lease acquisition

Every second, each node checks whether there’s an allocation lease holder
("allocator") through a `Get(globalAllocatorKey)`. If that returns no data, the
node tries to become the allocator lease holder using a `CPut` for that key. In
pseudo-code:

``` pseudocode
    every second:
      result := Get(globalAllocatorKey)
      if result != nil {
        // Some node already holds the allocation lease; do nothing.
        return
      }
      err := CPut(globalAllocatorKey, nodeID+"-"+expireTime, nil)
      if err != nil {
        // Some other node became the allocator.
        return
      }
      // This node is now the allocator.
```

### Allocator lease renewal

Near the end of the allocation lease, the current allocator does the following:

``` golang
    err := CPut(globalAllocatorKey,
      nodeID+"-"+newExpireTime,
      nodeID+"-"+oldExpireTime)
    if err != nil {
      // Re-election failed. Step down as allocator lease holder.
      return err
    }
    // Re-election succeeded. We’re still the allocation lease holder.
```

For example, if the allocation lease term is 60 seconds, the current allocation lease holder could
try to renew its lease 55 seconds into its term.

We may want to enforce artificial allocator lease term limits to more regularly
exercise the lease acquisition code.

### Updating the allocator’s *StoreDescriptors*

An allocation lease holder needs recent store information to make effective allocation decisions.

This could be achieved using either of two different mechanisms: decreasing the interval for
gossiping `StoreDescriptor`s from 60 seconds to a lower value (perhaps 10 seconds), or writing the
descriptors to a system keyspace and retrieving them, possibly using inconsistent reads (also every
10 seconds or so). Using `StoreStatus`es instead of descriptors is also an option. Recall that
`StoreDescriptor` updates are infrequent and the allocation lease holder is the only node making
rebalancing decisions. So, the allocation lease holder could use the latest gossiped
`StoreDescriptor`s and its knowledge of the replica allocation decisions made since the last
`StoreDescriptor` gossip to derive the current state of replica allocation in the cluster.

### Centralized decision-making

Pseudo-code for centralized rebalancing:

``` pseudocode
    for _, rangeDesc := range GetAllRangeDescriptorsForCluster() {
      makeAllocationDecision(rangeDesc, allStoreDescriptors)
    }
```

The `StoreDescriptor`s are discussed in the previous section. `GetAllRangeDescriptorsForCluster`
warrants specific attention: it needs to retrieve a potentially large number of range descriptors.
For example, suppose that a cluster is storing 100 TiB of de-duplicated data. At a 64 MiB maximum
range size, that is a minimum of roughly 1.6 million ranges, each with an associated
`RangeDescriptor`. Requiring the scanning, sorting and decision-making based on a collection this
large could be a performance problem. There are clearly methods which may solve some of these
bottlenecking problems. Ideas include only looking to move ranges from high to low loads or using
a "power of two" technique to randomly pick two stores when looking for a rebalance target.

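As a sketch of the "power of two" idea mentioned above, the allocator could sample two stores at
random and prefer the one with fewer replicas instead of sorting every store; the names and the
replica-count criterion are assumptions.

``` golang
package storage

import "math/rand"

// pickRebalanceTarget samples two stores at random and prefers the one
// holding fewer replicas, avoiding a full sort of every store in a very
// large cluster. replicaCounts maps store IDs to current replica counts.
func pickRebalanceTarget(rng *rand.Rand, storeIDs []int32, replicaCounts map[int32]int) (int32, bool) {
	if len(storeIDs) == 0 {
		return 0, false
	}
	a := storeIDs[rng.Intn(len(storeIDs))]
	b := storeIDs[rng.Intn(len(storeIDs))]
	if replicaCounts[a] <= replicaCounts[b] {
		return a, true
	}
	return b, true
}
```
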
### Failure modes for allocation lease holders

1. Poor network connectivity.
1. Leader node goes down.
1. Overloaded allocator node. This may be unlikely to cause problems that
   extend beyond one term. An overloaded allocator node probably
   wouldn’t complete its allocator lease renewal KV transaction before its term
   ends.

The likely failure modes can largely be alleviated by using short allocation lease terms.

### Conclusion

***Advantages***

- Less thrashing and no herding, since the allocator will know not to overload a new node.
  Distributed, independently acting allocators can make decisions that run counter to the others’
  decisions.
- Easier to debug, as there is only one place that performs the rebalancing.
- Easier to work with a CopySet style algorithm (see below for a discussion on CopySets).

***Disadvantages***

- When making rebalancing decisions, there is a lack of information that must be overcome.
  Specifically, the lack of `RangeDescriptor`s that are required when actually making the final
  decision. These are too numerous to be gossiped and must be stored and retrieved from the db
  directly. On the other hand, in a decentralized system, all `RangeDescriptor`s are already
  available directly in memory in the store.
- When dealing with a cluster that uses attributes, the central allocator will have to handle all
  rebalancing decisions by either using full knowledge of the cluster or by using subsets of the
  cluster based on combinations of all available attributes.
- As the cluster grows, there may be performance issues that arise on the central allocator. One
  way to alleviate this would be to ensure that the centralized allocator itself is located on the
  same node on which all required data exists (be it tables and indexes which might be required).
- If we use `CPuts` for allocator election, the range that contains the leader key becomes a single
  point of failure for the cluster. This could be alleviated by making the allocation lease holder the
  same as the range lease holder of the range holding the `StoreStatus` protos.
- More internal work needs to be done to support a centralized system, be it via an election or
  using the range lease holder of a specific key.

***Verdict***

The main issue that causes the thrashing and overloading of stores is the lack of current
information. A big part of that is the lack of knowledge about allocation decisions that are
occurring while making other decisions. A centralized allocator would fix those issues. However,
the implementation and performance issues that may arise from moving to a central allocator, be it
the potential requirement to iterate over some or all of the `RangeDescriptor`s, the difficulty of
making all rebalancing decisions in an expedient manner, or the cases in which the centralized
allocator itself is faulty in some way, make the centralized solution less appealing.

## CopySets

[https://www.usenix.org/system/files/conference/atc13/atc13-cidon.pdf](https://www.usenix.org/system/files/conference/atc13/atc13-cidon.pdf)

By using an algorithm to determine the best CopySets for each type of configuration (ignoring
overlap), we can limit the locations of all replicas and, as the shape of the cluster changes, it
can adapt appropriately.

***Advantages***

- Greatly reduces the chance of data unavailability when >1 nodes die.
- No central controller/lease holder.
- No fighting with all ranges when a new node joins or one is lost.
- It will take a bit of time for all nodes to receive the updated gossiped network topology, so this
  might be a good way to gate the changes.
- While there is greater complexity in the algorithm for determining the CopySets themselves, the
  allocator becomes extremely simple.

***Disadvantages***

- When a new node joins, it could be that a number of replicas need to move, all at once, depending
  on how the algorithm is set up. So some artificial limiting may be required on a new node being
  added or one being removed.
- Heterogeneous environments, in which stores differ in size, make the CopySets algorithm
  extremely problematic.
- In dynamic environments, ones in which nodes are added and removed, the CopySets algorithm will
  lead to potential store rot.

***Verdict***

While the advantages of CopySets are clear, its disadvantages are too numerous. The CopySets
algorithm only works well in a static (no new nodes added or removed) and homogeneous (all stores
are the same size) setup. Trying to work around these limitations leads one into a rabbit hole.
Here is a list of considered ways to shoehorn the algorithm into our system:

- For the dynamic cluster - recalculate the CopySets each time a node is added or removed and then
  move all misplaced replicas.
- For heterogeneous stores - split all store space into blocks (of around 100 or 1000 replicas) and
  run the algorithm against that.
- For zones and different replication factors - have a collection of CopySets, one for each
  combination, with overlap.
- For the overlaps created by the zones fix - make CopySets that contain more than the number of
  replicas, so make the CopySet fit to 4 instead of 3, and rebalance amongst the 4 stores.

Each of these "solutions" adds more complexity and takes away from the original benefit of using
the CopySet algorithm in the first place.

## CopySets emulation

As an alternative to implementing the CopySets algorithm directly, add a secondary tier of
rebalancing that adds an affinity for co-locating replicas on the same set of stores. This can be
done by simply applying a small weight to having all replicas co-located. Note that this should
not override the need for real rebalancing, but, all other factors being equal, the allocator
should choose a store with the most other replicas in common.

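A minimal sketch of such a weighting, assuming a hypothetical scoring function and an arbitrarily
small affinity weight that would need tuning:

``` golang
package storage

// scoreCandidate adds a small co-location bonus to the existing balance
// score: sharedReplicas counts how many replicas the candidate store already
// has in common with the stores holding this range's other replicas. The
// weight is deliberately small so the affinity never outweighs real
// rebalancing needs; its value here is an arbitrary placeholder.
func scoreCandidate(balanceScore float64, sharedReplicas int) float64 {
	const coLocationWeight = 0.01 // assumed small bonus per shared replica
	return balanceScore + coLocationWeight*float64(sharedReplicas)
}
```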
Testing will be required to see if this has the desired effect.

***Advantages***

- A weaker constraint than straight CopySets. CopySets prescribe exactly where each replica should
  go, while this method will let that happen organically.
- Easy to add to our current balancing heuristic.
- Reduces the chance of data loss when more than one node dies.

***Disadvantages***

- May cause more thrashing and more rebalances before equilibrium is reached.
- It will never be as efficient as straight CopySets.
- There is a chance that the cluster gets into a less desirable state if not done carefully.

***Verdict***

If done well, this could greatly reduce the risk of data loss when more than one node dies. This
should be in serious consideration for rebalancing V3.

# Allocation Heuristic Features

Currently, the allocator makes each store's replica count converge on the cluster mean. This
effectively reduces the standard deviation of replica counts across stores. Stores that are too
full (95% used capacity) do not receive new replicas.

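A sketch of the current heuristic in its simplest form; the function and its signature are
illustrative, while the mean-convergence rule and the 95% threshold come from the paragraph above.

``` golang
package storage

// rebalanceAdvice captures the rule described above: a store above the
// cluster mean replica count should shed replicas, a store below it may
// receive them, and a store using more than 95% of its capacity never
// receives new replicas.
func rebalanceAdvice(replicaCount int, meanReplicaCount, fractionUsed float64) (shed, receive bool) {
	const maxFractionUsed = 0.95
	shed = float64(replicaCount) > meanReplicaCount
	receive = float64(replicaCount) < meanReplicaCount && fractionUsed < maxFractionUsed
	return shed, receive
}
```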
Possible changes for V2:

- **Mean vs. median**
  It is possible that converging on the mean has undesirable consequences under certain scenarios.
  We may want to converge on the median instead. Care should be taken that whatever is chosen works
  well for small *and* large clusters.

For future consideration (post V2):

- **Store capacity available**
  Care must be taken. Using free disk space is problematic, because for nearly empty clusters, the
  OS install dominates disk usage. This will be one of the first aspects to look at in post-V2
  work.

- **Node load**
  We will likely want to move replicas off overloaded nodes for some definition of "load."

- **Node health**
  If a node is serving requests slowly for a sustained period, we should rebalance away from that
  node. This is related but not identical to load.

# Testing Scenarios

The chosen allocation strategy should perform well in the following scenarios:

For V2:

1. Small (3 node) cluster
1. Medium (32 node) cluster
1. Bringing up new nodes in a staggered fashion
1. Bringing up new nodes all at once
1. Removing nodes, one at a time
1. Removing and bringing a node back up after different timeouts
1. Cluster with overloaded stores (i.e. hotspots)
1. Nearly full stores
1. Node permanently going down
1. Network slowness
1. Changing the attribute labels of a store

For future consideration (post V2):

1. Large (100+ node) cluster
1. Very large (1000+ node) cluster
1. Stores with different capacities
1. Heterogeneous nodes (CPU)
1. Replication factor > 3 (some basic testing will be done, but it won’t be concentrated on)
1. Geographically distributed cluster

It will take many iterations to arrive at a replication strategy that works for all of these cases.
These will be incorporated into unit and acceptance tests as applicable.

## Simulator

To aid in and speed up testing, the rebalancing simulator will be updated. Some of these
updates include:

- Being able to take a running cluster and output the current and all previous states so that the
  simulator can emulate it.
- Converting the custom allocator input formats to protos.
- Updating the simulator based on changes proposed in this RFC (e.g., adding replica reservations).
- Adding a collection of more accurate metrics.

# Unresolved Questions

## Centralized vs Decentralized

Both approaches are clearly viable solutions with advantages and drawbacks. Is one option
objectively better than the other? It might be worthwhile to test the performance of both a
centralized and a decentralized rebalancing scheme in different configurations under different
loads. One option would be to update the simulator to be able to test both, but that would not be
an ideal environment. How much time will this take, and can it be done quickly?