
- Feature name: Rebalancing V2
- Status: completed
- Start date: 2016-04-20
- Last revised: 2016-05-03
- Authors: Bram Gruneir & Cuong Do
- RFC PR: [#6484](https://github.com/cockroachdb/cockroach/pull/6484)
- Cockroach Issue:

# Table of Contents

- [Table of Contents](#table-of-contents)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Metrics for evaluating rebalancing](#metrics-for-evaluating-rebalancing)
- [Detailed Design](#detailed-design)
  - [Store: Add the ability to reserve a replica](#store-add-the-ability-to-reserve-a-replica)
  - [Protos: add a timestamp to *StoreDescriptor* and reservations to *StoreCapacity*](#protos-add-a-timestamp-to-storedescriptor-and-reservations-to-storecapacity)
  - [RPC: ReserveReplica](#rpc-reservereplica)
  - [Update *StorePool/Allocator* to call *ReserveReplica*](#update-storepoolallocator-to-call-reservereplica)
- [Drawbacks](#drawbacks)
  - [Too many requests](#too-many-requests)
- [Alternate Allocation Strategies](#alternate-allocation-strategies)
  - [Other enhancements to distributed allocation](#other-enhancements-to-distributed-allocation)
  - [Centralized allocation strategy](#centralized-allocation-strategy)
    - [Allocator lease acquisition](#allocator-lease-acquisition)
    - [Allocator lease renewal](#allocator-lease-renewal)
    - [Updating the allocator’s *StoreDescriptors*](#updating-the-allocators-storedescriptors)
    - [Centralized decision-making](#centralized-decision-making)
    - [Failure modes for allocation lease holders](#failure-modes-for-allocation-lease-holders)
    - [Conclusion](#conclusion)
  - [CopySets](#copysets)
  - [CopySets emulation](#copysets-emulation)
- [Allocation Heuristic Features](#allocation-heuristic-features)
- [Testing Scenarios](#testing-scenarios)
  - [Simulator](#simulator)
- [Unresolved Questions](#unresolved-questions)
  - [Centralized vs Decentralized](#centralized-vs-decentralized)

# Summary

Rebalancing is the redistribution of replicas to optimize for a chosen set of heuristics. Currently,
each range lease holder runs a distributed algorithm that spreads replicas as evenly as possible across
the stores in a cluster. We artificially limit the rate at which each node may move replicas to
avoid the excessive thrashing of replicas that results from making independent rebalancing decisions
based on outdated information (gossiped `StoreDescriptor`s that are up to a minute old).

As detailed later in this document, we’ve weighed decentralized against centralized allocation
algorithms, as well as different heuristics for evaluating whether replicas are properly balanced.
For V2 of our replica allocator, we are adding a replica reservation step to the distributed
allocator and intelligently increasing the frequency at which we gossip `StoreDescriptor`s.
These modifications will significantly reduce the time required to rebalance small-to-medium-sized
clusters while avoiding the waste of resources and degradation in performance caused by excessive
thrashing of replicas.

We’re specifically not addressing load-based rebalancing or heterogeneous stores/nodes in V2.
Moreover, we are not addressing the potential for data unavailability when more than one node goes
offline. These are important problems that will be addressed in V3 or later.

# Motivation

To allocate replicas for ranges, we currently rely on distributed
[stateless replica relocation](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20150819_stateless_replica_relocation.md).

Each range lease holder is responsible for replica allocation decisions (adding and removing replicas)
for its respective range. This is a good, simple start. However, it is particularly susceptible to
thrashing. Because different range lease holders distribute replicas independently, they don't necessarily
converge on a desirable distribution of replicas within a reasonable number of replica allocations.

A number of factors contribute to this thrashing. First, there is a lack of updated information on
the current state of all stores, so replication decisions typically rely on outdated data. The
current state of the cluster is retrieved from gossiped store descriptors. However, store
descriptors are only gossiped at an interval of once every minute. So if a store is suddenly
overloaded with replicas, it may take up to one minute (plus gossip network propagation time) for
its updated descriptor to reach various range lease holders.

Secondly, until recently, our replica allocator had no limit on how fast rebalancing could occur.
Combined with the lack of fresh data and the absence of coordination between replica allocators,
this makes over-allocation to a new store likely.

For example, consider the following scenario where we have 3 perfectly balanced stores:

![Thrashing 1](images/rebalancing-v2-thrashing1.png?raw=true "Thrashing 1")

Let's add an empty store:

![Thrashing 2](images/rebalancing-v2-thrashing2.png?raw=true "Thrashing 2")

As soon as that store is seen by each range’s allocator, the following occurs:

![Thrashing 3](images/rebalancing-v2-thrashing3.png?raw=true "Thrashing 3")

This over-rebalancing continues for many cycles, often resulting in tens of thousands of replica
adds and removes for clusters with minuscule data.

As a stopgap measure, a
[recent change](https://github.com/cockroachdb/cockroach/commit/c4273b9ef7f418cab2ac30a10a8707c1601e5e99)
has added a minimum delay of 65 seconds between rebalance attempts for each node, to reduce
thrashing. This works well for small clusters with little data. However, this severely slows down
the process of rebalancing many replicas in an imbalanced cluster.

# Goals

Rebalancing is a topic of ongoing research and investigation. This section presents the goals for a
second version of rebalancing; the section that immediately follows presents a collection of future
goals for potential post-V2 work.

Relative priorities are debatable and depend on deployment specifics, so the following are listed
in no particular order:

- Minimizes thrashing.

  Thrashing, which can occur when a node is added or removed, can quickly bring a cluster to a near
  halt due to the number of replicas being moved between nodes, which results in requests queuing up
  waiting to be serviced.

- Is performant in clusters with up to 32 nodes.

  The choice of 32 nodes here matches our OKRs. This limit is to make testing more tractable.
  Performance will be measured using the metrics described in the Metrics section below.

- Is performant in a dynamic cluster. A dynamic cluster is one in which nodes can be added and
  removed arbitrarily.

  While this may seem like an obvious goal, we should ensure that equilibrium is reached quickly
  in cases when one or more nodes are added and removed at once. It should be noted that only
  a single node can be removed at a time but any number of nodes can be added.

- Handles outages of any number of nodes gracefully, as long as quorum is maintained.

  This is the classic repair scenario.

- Don’t overfill stores.

  Overly full disks tend to perform worse. But this is also to ensure we don’t overly fill a new
  store when it’s added. Using the example from the motivation section above, if the new store
  could only hold 100 replicas it would be catastrophic for the cluster.

# Non-Goals

There are a number of interesting further investigations on improving rebalancing and how it can
impact overall cluster and perhaps individual operation performance. We list them here for
potential future work on what we are calling post-V2 rebalancing. Again, these are not ordered by
priority:

- Is performant in heterogeneous clusters.

  Clusters with different-sized stores, different CPUs, and lagging nodes.

- Is performant in large clusters (>>32 nodes).

  Further work past our arbitrary limit of 32 nodes.

- Replicas are moved to where there is demand for them.

  Experiment to see if this would be useful. There may be performance gains from keeping replicas of
  single tables together on the same set of stores.

- Globally distributed data.

  How should the replicas be organized when there are potentially long round trips between
  datacenters?

- Optimizes data transfer based on network topology.

  Examples: ping times between replicas, network distance, rack and datacenter awareness.

- Decrease chance of data unavailability as nodes go down.

  See the discussion on CopySets below.

- Distribute "hot" keys and ranges evenly through the cluster.

  This would greatly help to distribute load and make the cluster more robust and able to handle
  a larger number of queries with lower latency.

- Defragment. Co-locate tables that are too big for a single range to the same set of stores.

  This could speed up all large table queries.

# Metrics for evaluating rebalancing

As with any system, a set of evaluation criteria is required to ensure that changes improve the
cluster. We propose the following criteria. It should be noted that most changes may positively
impact one and negatively affect the others:

- Distribution of data across stores. Measured using the percentage of store capacity available.
- Speed at which perturbed systems reach equilibrium. Measured using the number of rebalances
  and the total time until the cluster is stable.

For post-V2 consideration:

- User latency. Rebalancing should never affect user query latency, but too many rebalances may do
  just that. Measured using latencies of user queries.
- Distribution of load across stores. Measured using CPU usage and network traffic.

# Detailed Design

The current distributed allocator cannot rebalance quickly because of the >= 65 second rebalancing
backoff. Because removing that backoff would cause excessive allocation thrashing, the `Allocator`
has to be modified to make faster progress while minimizing thrashing.

To reduce thrashing, we are introducing the concept of reserved replicas. Before allocating a new
replica, an allocator will first reserve the space for the new replica. This will be accomplished
by adding a new RPC, `ReserveReplica`, that requests to reserve space for the new replica on one
store. Once received, the store can reply with either a yes or a no. When it replies with a
`reserved`, the space for said replica is reserved for a predetermined amount of time. If no
replica appears within that time, it is no longer reserved. (It should be noted that the size of a
replica depends on the split size based on the table and/or zone. This should be taken into
consideration.)

Each `ReserveReplica` response contains the latest `StoreDescriptor`s from the node with a
node-local timestamp. The caller can use these to update its cached copy of the `StoreDescriptor`.

When it replies with a `not reserved`, it also includes possible reasons as to why, for debugging
purposes. These reasons can include:

1. Too full in terms of absolute free disk space (this includes all reserved replica spots)
1. Overloaded (once we define what that term specifically means)
1. Too many current reservations (throttling factor will be determined experimentally)

Any other error, including networking errors, can be considered a `not reserved` response for the
purposes of allocation. When a `not reserved` is received, that response is cached in the store
pool until the next `StoreDescriptor` update. We avoid issuing further `ReserveReplica` calls to
that store until the next `StoreDescriptor` update.

The next subsections contain all of the major tasks that will be required to complete this feature
and further details about each.

## Store: Add the ability to reserve a replica

To prevent a store from being overwhelmed and overloaded, the concept of a reserved replica will be
added to a store. A reserved replica reserves a full replica’s amount of space for a predetermined
amount of time (typically for a `storeGossipInterval`) and reserves it for the expected incoming
replica for a specific RangeID. If the replica for the range is not added within the reservation
timeframe, the reservation is removed and the space becomes free again.

If a new replica arrives and there is no reservation, the store will still create the new replica
and this will not cancel any pre-existing reservations. The gating of when to allow a new
reservation is decided in the `ReserveReplica` RPC and is not part of adding a replica in the
store.

Optionally, the ability to control the amount of currently available space on a store might be used
to keep a cluster from suddenly piling onto a new node when one becomes available. By
pre-reserving (for non-existent ranges) all or most of the new store’s capacity and staggering
the timeouts, it may prevent all replicas from suddenly being interested in adding themselves to
the store. This will require some testing to determine if it will be beneficial.

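As a rough sketch of the bookkeeping this implies, the snippet below tracks reservations keyed by
range ID and lets unfulfilled entries lapse after a TTL. The `bookie` type, its fields, and the
prune-on-read behavior are illustrative assumptions, not the actual implementation.

``` golang
package storage

import (
	"sync"
	"time"
)

// reservation is a hypothetical record of space promised to an incoming
// replica of a specific range; names and fields are illustrative only.
type reservation struct {
	rangeID   int64
	sizeBytes int64
	expiresAt time.Time
}

// bookie tracks outstanding reservations for a single store. A reservation
// that is not fulfilled before expiresAt simply lapses the next time the
// store inspects its book.
type bookie struct {
	mu           sync.Mutex
	ttl          time.Duration // typically one storeGossipInterval
	reservations map[int64]reservation
}

// reserve records a reservation for the given range, replacing any prior one.
func (b *bookie) reserve(rangeID, sizeBytes int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.reservations[rangeID] = reservation{
		rangeID:   rangeID,
		sizeBytes: sizeBytes,
		expiresAt: time.Now().Add(b.ttl),
	}
}

// reservedBytes reports the space promised to unexpired reservations,
// pruning expired entries as they are encountered.
func (b *bookie) reservedBytes() int64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	var total int64
	now := time.Now()
	for id, r := range b.reservations {
		if now.After(r.expiresAt) {
			delete(b.reservations, id)
			continue
		}
		total += r.sizeBytes
	}
	return total
}
```
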
## Protos: add a timestamp to *StoreDescriptor* and reservations to *StoreCapacity*

Adding a timestamp to the `StoreDescriptor` proto makes it possible to quickly pick the most recent
`StoreDescriptor`. This timestamp is local to the store that generated the `StoreDescriptor`. The
main use case for this is when calling `ReserveReplica`: regardless of whether the response is a
`reserved` or a `not reserved`, it will also return updated `StoreDescriptor`s for all the stores
on the node. These updated descriptors will be used to update the cached `StoreDescriptor`s in the
`StorePool` of the node calling `ReserveReplica`. There may be a small race between these
descriptors and new ones that are arriving via gossip. A timestamp fixes this problem. Any
subsequent calls to the allocator will have a fresher sense of the cluster. Note that it may be
possible to skip the addition of the timestamp by returning a gossip `Info` from the
`ReserveReplica` RPC.

Adding a `reservedSpace` value to the capacity gives more insight into how the total capacity of a
store is used and allows better decisions to be made around it. Also, by adding
`activeReservations`, an allocator will be able to choose rebalancing targets that are not
overwhelmed with requests.

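The real messages are protocol buffers; the Go structs below only sketch the proposed additions (a
node-local timestamp on the descriptor, plus reservation accounting in the capacity), with field
names and types assumed for illustration.

``` golang
package storage

import "time"

// StoreCapacity sketches the proposed additions to the existing capacity
// proto: how much space is promised to reservations and how many
// reservations are currently outstanding. Field names are assumptions.
type StoreCapacity struct {
	Capacity           int64 // total bytes on the store
	Available          int64 // free bytes, not counting reserved space
	ReservedSpace      int64 // bytes promised to unexpired reservations
	ActiveReservations int32 // number of outstanding reservations
}

// StoreDescriptor sketches the addition of a node-local timestamp so that a
// caller can always keep the freshest descriptor it has seen, whether it
// arrived via gossip or in a ReserveReplica response.
type StoreDescriptor struct {
	StoreID   int32
	Capacity  StoreCapacity
	UpdatedAt time.Time // local clock of the store that generated it
}

// fresher returns the more recently generated of two descriptors for the
// same store; this is how the small race between gossip and RPC responses
// is resolved.
func fresher(a, b StoreDescriptor) StoreDescriptor {
	if a.UpdatedAt.After(b.UpdatedAt) {
		return a
	}
	return b
}
```
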
## RPC: ReserveReplica

Adding a `ReserveReplica` RPC to a node enables a range to reserve a replica on a node before
calling `changeReplica` and adding the replica directly. Because no node will ever have the most
up-to-date information about another one, the response will always include an updated
`StoreDescriptor` for the store in which a reservation is requested. It should be noted that this
is a new type of RPC in that it addresses a store and not a node or range.

The request will include:

- `StoreID` of the store in which to reserve the replica space
- `RangeID` of the requesting range. Consider repurposing a `ReplicaDescriptor` here.
- All other parameters that are required by the allocator, such as required attributes.

The response will include:

- `StoreDescriptor`s The most up-to-date store descriptors for all stores on the node. Note that
  there may be a requirement to limit the number of times `engine.Capacity` is called as this is
  doing a physical walk of the hard drive. Consider wrapping the descriptor in a gossip `Info`.
- `Status` An ENUM or boolean value that indicates if a replica has been reserved or not. Usually
  this will be either a `reserved` or `not reserved`.

When determining if a store should reserve a new replica based on a request, it should first check
some basic conditions:

- Is the store being decommissioned?
- Are there too many reserved replicas?
- Is there enough free (non-reserved) space available on the store?

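A minimal sketch of these checks, assuming hypothetical names and a throttling threshold that would
be determined experimentally:

``` golang
package storage

// reserveStatus mirrors the proposed reserved / not reserved response values.
type reserveStatus int

const (
	reserved reserveStatus = iota
	notReserved
)

// canReserve sketches the basic admission checks above. The throttling
// threshold and the free-space rule are assumptions to be tuned
// experimentally, not settled values.
func canReserve(decommissioning bool, activeReservations, maxReservations int,
	availableBytes, reservedBytes, replicaSizeBytes int64) reserveStatus {
	if decommissioning {
		return notReserved // store is being drained; refuse new replicas
	}
	if activeReservations >= maxReservations {
		return notReserved // too many outstanding reservations
	}
	if availableBytes-reservedBytes < replicaSizeBytes {
		return notReserved // not enough unreserved space for a full replica
	}
	return reserved
}
```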
Typically the response will be a yes. A response of `not reserved` will only occur when the store
is being overloaded or is close to being overly full. Even when a reservation has been made, there
is no guarantee that the calling store will still fill the reservation. It only means that the
space has been reserved.

## Update *StorePool/Allocator* to call *ReserveReplica*

When choosing a new store on which to allocate a replica for a range, the allocator first sorts all
the available stores and rules out the ones with incorrect attributes. It then picks the top store
on which to locate the new replica, based on the heuristic discussed at the end of this document,
and calls `ReserveReplica` on that store's node.

After each call to `ReserveReplica`, the `StorePool` on a node will update its cached
`StoreDescriptor`s. (Consider reusing or extending some of the gossip primitives as this could be
partially considered a forced gossip update.)

On a `not reserved` response, record that the store refused so that it will not be considered for
new allocations for a short period (perhaps 1 second).

On a `reserved` response, the replica will issue a `replicaChangeRequest` to add the chosen store.

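The sketch below illustrates this caller-side flow. The names are hypothetical, and trying the
next-best candidate after a refusal is an assumption on top of what this section prescribes.

``` golang
package storage

import "sort"

// candidate pairs a store with the score produced by the allocation
// heuristic; both the type and the scoring are placeholders.
type candidate struct {
	storeID int32
	score   float64
}

// rebalanceTarget sketches the caller side of the new flow: rank the
// candidate stores, ask the best one for a reservation, and try the next one
// if it refuses. reserve stands in for the ReserveReplica RPC (which also
// returns fresh StoreDescriptors, omitted here), and markRefused records a
// refusal so the store is skipped for a short period.
func rebalanceTarget(cands []candidate,
	reserve func(storeID int32) bool,
	markRefused func(storeID int32)) (int32, bool) {
	// Highest score first, per the heuristic discussed at the end of this RFC.
	sort.Slice(cands, func(i, j int) bool { return cands[i].score > cands[j].score })
	for _, c := range cands {
		if reserve(c.storeID) {
			// Reservation granted: the caller now issues the ChangeReplicas call.
			return c.storeID, true
		}
		markRefused(c.storeID)
	}
	return 0, false
}
```
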
# Drawbacks

## Too many requests

When a new node joins the cluster and its gossiped `StoreDescriptor` makes its way to all stores
that could stand to have some ranges rebalanced, it may create too much network traffic calling the
`ReserveReplica` RPC. To ensure this doesn't happen, the RPC should be extremely quick to respond
and require very little processing on the receiving store's side, especially in the case that it is
a rejection.

# Alternate Allocation Strategies

This section contains a collection of other techniques and strategies that were considered. Some
of these enhancements may still be included in V2.

## Other enhancements to distributed allocation

Here is a small collection of tweaks that could be used to alter how distributed allocation could
work. These are not being implemented now, but could be considered as alternatives if the
`ReserveReplica` strategy doesn’t solve all issues.

- Make the gossiping of `StoreDescriptor`s event driven: gossip any time a snapshot is applied or a
  replica is garbage collected. If no event occurs, gossip the `StoreDescriptor` every
  `gossipStoresInterval`.

  This could reduce the time it takes for the updated descriptor to make its way to all other
  nodes.

- Decrease the `gossipStoresInterval` to 10 seconds so `StoreDescriptor`s are fresher.

  This adds a lot of churn to the gossiped descriptors so the increased network traffic might
  outweigh the benefits of faster rebalancing.

- Move from using gossiped `StoreDescriptor`s (updated every 60 seconds) to gossiped
  `StoreStatuses` (written every 10 seconds).

  This would require gossiping, which would incur the same problem as decreasing the
  `gossipStoresInterval`.

- Based on the latest `StoreDescriptor`s, determine how many stores would likely rebalance in the
  next 10 seconds. Then, each of those stores rebalances with probability
  `1/(# of candidate stores)`. For example, suppose that we're balancing by replica count. Two
  stores have 100 replicas, and one store has 0 replicas. So, there are 2 stores that are good
  candidates to move replicas from. Each of those 2 stores would have a `1/2` probability of
  starting a rebalance. We could speed this up by doing this virtual coin toss `N` times, where `N`
  is the total number of replicas we'd like to move to the destination store.

  This might be a useful option if there is still too much thrashing when a new node is added; a
  sketch of the coin toss follows this list.

- Don't try to rebalance any other replicas on a store until the previous `ChangeReplicas` call has
  finished and the snapshot has been applied.

  This limits each store to performing a single `ChangeReplica`/Snapshot at a time. It would limit
  thrashing but also greatly increase the time it takes to reach equilibrium.

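A sketch of the virtual coin toss described in the probabilistic bullet above; the function name
and parameters are illustrative only.

``` golang
package storage

import "math/rand"

// shouldStartRebalance sketches the virtual coin toss: if numCandidates
// stores are plausible rebalance sources in this window, each one starts a
// rebalance with probability 1/numCandidates, and tossing N times lets up to
// N replicas move toward the destination store faster.
func shouldStartRebalance(rng *rand.Rand, numCandidates, tosses int) bool {
	if numCandidates <= 0 {
		return false
	}
	p := 1.0 / float64(numCandidates)
	for i := 0; i < tosses; i++ {
		if rng.Float64() < p {
			return true
		}
	}
	return false
}
```
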
## Centralized allocation strategy

One way to avoid the thrashing caused by multiple independently acting allocators is to centralize
all replica allocation. In this section, a possible centralized allocation strategy is described in
detail.

### Allocator lease acquisition

Every second, each node checks whether there’s an allocation lease holder
("allocator") through a `Get(globalAllocatorKey)`. If that returns no data, the
node tries to become the allocator lease holder using a `CPut` for that key. In
pseudo-code:

``` pseudocode
    every second:
      result := Get(globalAllocatorKey)
      if result != nil {
        // Some node already holds the allocation lease; do nothing.
        return
      }
      err := CPut(globalAllocatorKey, nodeID+"-"+expireTime, nil)
      if err != nil {
        // Some other node became the allocator.
        return
      }
      // This node is now the allocator.
```

### Allocator lease renewal

Near the end of the allocation lease, the current allocator does the following:

``` golang
    err := CPut(globalAllocatorKey,
      nodeID+"-"+newExpireTime,
      nodeID+"-"+oldExpireTime)
    if err != nil {
      // Re-election failed. Step down as allocator lease holder.
      return err
    }
    // Re-election succeeded. We’re still the allocation lease holder.
```

For example, if the allocation lease term is 60 seconds, the current allocation lease holder could
try to renew its lease 55 seconds into its term.

We may want to enforce artificial allocator lease term limits to more regularly
exercise the lease acquisition code.

### Updating the allocator’s *StoreDescriptors*

An allocation lease holder needs recent store information to make effective allocation decisions.

This could be achieved using either of two different mechanisms: decreasing the interval for
gossiping `StoreDescriptor`s from 60 seconds to a lower value (perhaps 10 seconds), or writing the
descriptors to a system keyspace and retrieving them, possibly using inconsistent reads (also every
10 seconds or so). Using `StoreStatus`es instead of descriptors is also an option. Recall that
`StoreDescriptor` updates are infrequent and the allocation lease holder is the only node making
rebalancing decisions. So, the allocation lease holder could use the latest gossiped
`StoreDescriptor`s and its knowledge of the replica allocation decisions made since the last
`StoreDescriptor` gossip to derive the current state of replica allocation in the cluster.

### Centralized decision-making

Pseudo-code for centralized rebalancing:

``` pseudocode
    for _, rangeDesc := range GetAllRangeDescriptorsForCluster() {
      makeAllocationDecision(rangeDesc, allStoreDescriptors)
    }
```

The `StoreDescriptor`s are discussed in the previous section. `GetAllRangeDescriptorsForCluster`
warrants specific attention: it needs to retrieve a potentially large number of range descriptors.
For example, suppose that a cluster is storing 100 TiB of de-duplicated data. At a 64 MiB maximum
range size, that is a minimum of roughly 1.6 million ranges, each with an associated
`RangeDescriptor`. Requiring the scanning, sorting and decision-making based on a collection this
large could be a performance problem. There are clearly methods which may solve some of these
bottlenecking problems. Ideas include only looking to move ranges from high to low loads or using
a "power of two" technique to randomly pick two stores when looking for a rebalance target.

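As a sketch of the "power of two" idea mentioned above, the allocator could sample two stores at
random and prefer the one with fewer replicas instead of sorting every store; the names and the
replica-count criterion are assumptions.

``` golang
package storage

import "math/rand"

// pickRebalanceTarget samples two stores at random and prefers the one
// holding fewer replicas, avoiding a full sort of every store in a very
// large cluster. replicaCounts maps store IDs to current replica counts.
func pickRebalanceTarget(rng *rand.Rand, storeIDs []int32, replicaCounts map[int32]int) (int32, bool) {
	if len(storeIDs) == 0 {
		return 0, false
	}
	a := storeIDs[rng.Intn(len(storeIDs))]
	b := storeIDs[rng.Intn(len(storeIDs))]
	if replicaCounts[a] <= replicaCounts[b] {
		return a, true
	}
	return b, true
}
```
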
### Failure modes for allocation lease holders

1. Poor network connectivity.
1. Leader node goes down.
1. Overloaded allocator node. This may be unlikely to cause problems that
   extend beyond one term. An overloaded allocator node probably
   wouldn’t complete its allocator lease renewal KV transaction before its term
   ends.

The likely failure modes can largely be alleviated by using short allocation lease terms.

### Conclusion

***Advantages***

- Less thrashing and no herding, since the allocator will know not to overload a new node.
  Distributed, independently acting allocators can make decisions that run counter to the others’
  decisions.
- Easier to debug, as there is only one place that performs the rebalancing.
- Easier to work with a CopySet style algorithm (see below for a discussion on CopySets).

***Disadvantages***

- When making rebalancing decisions, there is a lack of information that must be overcome.
  Specifically, the lack of `RangeDescriptor`s that are required when actually making the final
  decision. These are too numerous to be gossiped and must be stored and retrieved from the db
  directly. On the other hand, in a decentralized system, all `RangeDescriptor`s are already
  available directly in memory in the store.
- When dealing with a cluster that uses attributes, the central allocator will have to handle all
  rebalancing decisions by either using full knowledge of the cluster or by using subsets of the
  cluster based on combinations of all available attributes.
- As the cluster grows, there may be performance issues that arise on the central allocator. One
  way to alleviate this would be to ensure that the centralized allocator itself is located on the
  same node on which all required data exists (be it tables and indexes which might be required).
- If we use `CPuts` for allocator election, the range that contains the leader key becomes a single
  point of failure for the cluster. This could be alleviated by making the allocation lease holder the
  same as the range lease holder of the range holding the `StoreStatus` protos.
- More internal work needs to be done to support a centralized system, be it via an election or
  using the range lease holder of a specific key.

***Verdict***

The main issue that causes the thrashing and overloading of stores is the lack of current
information. A big part of that is the lack of knowledge about allocation decisions that are
occurring while making other decisions. A centralized allocator would fix those issues. However,
the implementation and performance issues that may arise from moving to a central allocator, be it
the potential requirement to iterate over some or all of the `RangeDescriptor`s, the difficulty of
making all rebalancing decisions in an expedient manner, or the cases in which the centralized
allocator itself is faulty in some way, make the centralized solution less appealing.

## CopySets

[https://www.usenix.org/system/files/conference/atc13/atc13-cidon.pdf](https://www.usenix.org/system/files/conference/atc13/atc13-cidon.pdf)

By using an algorithm to determine the best CopySets for each type of configuration (ignoring
overlap), we can limit the locations of all replicas and, as the shape of the cluster changes, it
can adapt appropriately.

***Advantages***

- Greatly reduces the chance of data unavailability when >1 nodes die.
- No central controller/lease holder.
- No fighting with all ranges when a new node joins or one is lost.
- It will take a bit of time for all nodes to receive the updated gossiped network topology, so this
  might be a good way to gate the changes.
- While there is greater complexity in the algorithm for determining the CopySets themselves, the
  allocator becomes extremely simple.

***Disadvantages***

- When a new node joins, it could be that a number of replicas need to move, all at once, depending
  on how the algorithm is set up. So some artificial limiting may be required on a new node being
  added or one being removed.
- Heterogeneous environments, in which stores differ in size, make the CopySets algorithm
  extremely problematic.
- In dynamic environments, ones in which nodes are added and removed, the CopySets algorithm will
  lead to potential store rot.

***Verdict***

While the advantages of CopySets are clear, its disadvantages are too numerous. The CopySets
algorithm only works well in a static (no new nodes added or removed) and homogeneous (all stores
are the same size) setup. Trying to work around these limitations leads one into a rabbit hole.
Here is a list of considered ways to shoehorn the algorithm into our system:

- For the dynamic cluster - recalculate the CopySets each time a node is added or removed and then
  move all misplaced replicas.
- For heterogeneous stores - split all store space into blocks (of around 100 or 1000 replicas) and
  run the algorithm against that.
- For zones and different replication factors - have a collection of CopySets, one for each
  combination, with overlap.
- For the overlaps created by the zones fix - make CopySets that contain more than the number of
  replicas, so make the CopySet fit to 4 instead of 3, and rebalance amongst the 4 stores.

Each of these "solutions" adds more complexity and takes away from the original benefit of using
the CopySet algorithm in the first place.

## CopySets emulation

As an alternative to implementing the CopySets algorithm directly, add a secondary tier of
rebalancing that adds an affinity for co-locating replicas on the same set of stores. This can be
done by simply applying a small weight to having all replicas co-located. Note that this should
not override the need for real rebalancing, but, all other factors being equal, the allocator
should choose a store with the most other replicas in common.

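A minimal sketch of such a weighting, assuming a hypothetical scoring function and an arbitrarily
small affinity weight that would need tuning:

``` golang
package storage

// scoreCandidate adds a small co-location bonus to the existing balance
// score: sharedReplicas counts how many replicas the candidate store already
// has in common with the stores holding this range's other replicas. The
// weight is deliberately small so the affinity never outweighs real
// rebalancing needs; its value here is an arbitrary placeholder.
func scoreCandidate(balanceScore float64, sharedReplicas int) float64 {
	const coLocationWeight = 0.01 // assumed small bonus per shared replica
	return balanceScore + coLocationWeight*float64(sharedReplicas)
}
```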
Testing will be required to see if this has the desired effect.

***Advantages***

- A weaker constraint than straight CopySets. CopySets prescribe exactly where each replica should
  go, while this method will let that happen organically.
- Easy to add to our current balancing heuristic.
- Reduces the chance of data loss when more than one node dies.

***Disadvantages***

- May cause more thrashing and more rebalances before equilibrium is reached.
- It will never be as efficient as straight CopySets.
- There is a chance that the cluster gets into a less desirable state if not done carefully.

***Verdict***

If done well, this could greatly reduce the risk of data loss when more than one node dies. This
should be in serious consideration for rebalancing V3.

# Allocation Heuristic Features

Currently, the allocator makes each store's replica count converge on the cluster mean. This
effectively reduces the standard deviation of replica counts across stores. Stores that are too
full (95% used capacity) do not receive new replicas.

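A sketch of the current heuristic in its simplest form; the function and its signature are
illustrative, while the mean-convergence rule and the 95% threshold come from the paragraph above.

``` golang
package storage

// rebalanceAdvice captures the rule described above: a store above the
// cluster mean replica count should shed replicas, a store below it may
// receive them, and a store using more than 95% of its capacity never
// receives new replicas.
func rebalanceAdvice(replicaCount int, meanReplicaCount, fractionUsed float64) (shed, receive bool) {
	const maxFractionUsed = 0.95
	shed = float64(replicaCount) > meanReplicaCount
	receive = float64(replicaCount) < meanReplicaCount && fractionUsed < maxFractionUsed
	return shed, receive
}
```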
Possible changes for V2:

- **Mean vs. median**
  It is possible that converging on the mean has undesirable consequences under certain scenarios.
  We may want to converge on the median instead. Care should be taken that whatever is chosen works
  well for small *and* large clusters.

For future consideration (post V2):

- **Store capacity available**
  Care must be taken. Using free disk space is problematic, because for nearly empty clusters, the
  OS install dominates disk usage. This will be one of the first aspects to look at in post-V2
  work.

- **Node load**
  We will likely want to move replicas off overloaded nodes for some definition of "load."

- **Node health**
  If a node is serving requests slowly for a sustained period, we should rebalance away from that
  node. This is related but not identical to load.

# Testing Scenarios

The chosen allocation strategy should perform well in the following scenarios:

For V2:

1. Small (3 node) cluster
1. Medium (32 node) cluster
1. Bringing up new nodes in a staggered fashion
1. Bringing up new nodes all at once
1. Removing nodes, one at a time
1. Removing and bringing a node back up after different timeouts
1. Cluster with overloaded stores (i.e. hotspots)
1. Nearly full stores
1. Node permanently going down
1. Network slowness
1. Changing the attribute labels of a store

For future consideration (post V2):

1. Large (100+ node) cluster
1. Very large (1000+ node) cluster
1. Stores with different capacities
1. Heterogeneous nodes (CPU)
1. Replication factor > 3 (some basic testing will be done, but it won’t be concentrated on)
1. Geographically distributed cluster

It will take many iterations to arrive at a replication strategy that works for all of these cases.
These will be incorporated into unit and acceptance tests as applicable.

## Simulator

To aid in and speed up testing, the rebalancing simulator will be updated. Some of these
updates include:

- Being able to take a running cluster and output the current and all previous states so that the
  simulator can emulate it.
- Converting the custom allocator input formats to protos.
- Updating the simulator based on changes proposed in this RFC (e.g., adding replica reservations).
- Adding a collection of more accurate metrics.

# Unresolved Questions

## Centralized vs Decentralized

Both approaches are clearly viable solutions with advantages and drawbacks. Is one option
objectively better than the other? It might be worthwhile to test the performance of both a
centralized and a decentralized rebalancing scheme in different configurations under different
loads. One option would be to update the simulator to be able to test both, but that would not be
an ideal environment. How much time will this take, and can it be done quickly?