
     1  # Rebalancing
     2  
     3  **Last update:** January 2019
     4  
     5  **Original author:** Alex Robinson
     6  
     7  This document serves as an end-to-end description of the current state of range
     8  and lease rebalancing in Cockroach as of v2.1. The target reader is anyone who
     9  is interested in better understanding how and why replica and lease placement
    10  decisions get made. Little detailed knowledge of core should be needed.
    11  
    12  The most complete documentation, of course, is in the code, tests, and the
    13  surrounding comments, but those are necessarily split across several files and
    14  are much tougher to piece together into a logical understanding. That scattered
    15  knowledge is centralized here, without excessive detail that is likely to become
    16  stale.
    17  
    18  ## Table of Contents
    19  
    20  * [Overview](#overview)
    21  * [Considerations](#considerations)
    22  * [Implementation](#implementation)
    23    * [Replicate Queue](#replicate-queue)
    24      * [Choosing an action](#choosing-an-action)
    25      * [Picking an up-replication target](#picking-an-up-replication-target)
    26      * [Picking a down-replication target](#picking-a-down-replication-target)
    27      * [Picking a rebalance target](#picking-a-rebalance-target)
    28      * [Per-replica / expressive constraints](#per-replica--expressive-constraints)
    29      * [Lease transfer decisions](#lease-transfer-decisions)
    30    * [Store Rebalancer](#store-rebalancer)
    31    * [Other details](#other-details)
    32  * [Known issues](#known-issues)
    33  
    34  ## Overview
    35  
    36  Cockroach maintains multiple replicas of each range of data for fault tolerance.
    37  It also maintains a single leaseholder for each range to optimize the
    38  performance of reads and help maintain correctness invariants. The locations of
    39  these replicas and leaseholders are hugely important to the fault tolerance and
    40  performance of the system as a whole, and so Cockroach contains a bunch of logic
    41  that proactively tries to ensure a reasonably optimal distribution of replicas
    42  and leases throughout the cluster.
    43  
    44  ## Considerations
    45  
There are a number of factors that need to be considered when making placement
decisions. For replicas, these include:
    48  
    49  * User-specified constraints in zone configs. These obviously need to be respected.
    50  * Disk fullness. Moving replicas to a store that's out (or nearly out) of disk
    51    space is clearly a bad idea. Moving replicas away from a store that's nearly
    52    out of disk space is often a good idea, but not always.
    53  * Diversity of localities. If we put all of the replicas for a range in just one
    54    or two localities, then a single locality (datacenter, rack, region, etc)
    55    failure will cause data unavailability / loss. We should try to spread
    56    replicas as widely as possible for maximal fault tolerance.
    57  * Number of ranges on each store.
    58  * Load on each node. Uneven distribution of load can cause bottlenecks that
    59    seriously affect the overall throughput of the cluster (e.g. [#26059]).
    60  * Amount of data on each store. We don't want one disk in a cluster to fill up
    61    long before the others, and it's also valuable for recovery time to be
    62    roughly the same after any given node failure, which isn't the case if one
    63    node has significantly more data on it than others.
* Number of writes on each store. Disks have limited IOPS and bandwidth, so
  bottlenecks can be a problem here as well, at least hypothetically.
    67  
    68  We currently don't directly use the last two factors, instead hoping that
    69  balancing the overall load and number of ranges are good enough proxies for the
    70  number of writes and the amount of data, respectively. We previously tried to
    71  integrate these factors into decisions, but allocation decisions became quite
    72  complex and the approach ran into a number of issues ([#17979]), so it has since
    73  been removed in favor of the `StoreRebalancer`'s pure QPS-based rebalancing.
    74  
    75  For lease rebalancing, the considerations include:
    76  
    77  * Lease count on each node. Balancing this should roughly balance out the amount
    78    of load on each node, assuming a uniform distribution.
    79  * QPS on each node. It turns out that not all workloads have a uniform
    80    distribution.
    81  * Locality of data access. If most requests to a range are coming from the other
    82    side of the world, maybe we should move the lease closer to them.
    83  
    84  Note that there is a built-in conflict here -- moving leases closer to where
    85  requests are coming from may require imbalancing the lease count, or causing the
    86  nodes in those localities to have more load than other nodes. Getting this right
    87  can be a tough balancing act, and it's hard to ever be fully confident that
    88  you've gotten it right because there are almost certainly workloads out there
    89  that won't be handled optimally by whatever decision-making logic you implement.
    90  
    91  ## Implementation
    92  
    93  Historically, all rebalancing has been handled by the `ReplicateQueue`. As of
    94  v2.1, there's also a separate component called the `StoreRebalancer` which
    95  focuses specifically on the problem of balancing the QPS on each store. QPS is
    96  used here essentially as a proxy for the (CPU/network/disk) load on each node.
    97  It's not a perfect proxy in general, but it seems to work well in benchmarks and
    98  tests.
    99  
   100  ### Replicate Queue
   101  
   102  The `ReplicateQueue` is one of our handful of replica queues which periodically
   103  iterate over all the replicas in each store. Replicas are queued by the
   104  `replicaScanner` on each store, which simply scans over all replicas at a
   105  configurable pace and runs them through each of the replica queues. Replicas
   106  are also sometimes manually queued in the `ReplicateQueue` upon certain
   107  triggers, as will be explained later.
   108  
   109  Upon being asked to operate on a replica, the `ReplicateQueue` must:
   110  
   111  1. Decide whether the replica's range needs any replica/lease placement changes
   112  2. Decide exactly what change to make
   113  3. Make the change
   114  4. Repeat until no more changes are needed
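
As a rough sketch of the shape of this loop (the callback names and types below
are hypothetical, not the actual `ReplicateQueue` API):

```go
package allocsketch

import (
	"context"
	"time"
)

// change stands in for whatever replica/lease change the queue decided on.
type change struct {
	description string
}

// processReplica keeps planning and applying changes for one replica's range
// until no further change is needed, under a per-replica deadline (see the
// 60-second limit discussed below).
func processReplica(
	ctx context.Context,
	planChange func(context.Context) (change, bool),
	applyChange func(context.Context, change) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, 60*time.Second)
	defer cancel()
	for {
		ch, needed := planChange(ctx) // steps 1 and 2: decide whether and what to change
		if !needed {
			return nil // step 4: stop once no change is needed
		}
		if err := applyChange(ctx, ch); err != nil { // step 3: make the change
			return err
		}
	}
}
```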
   115  
   116  The main interesting bit here, of course, is how the decisions are made, which
   117  is described in detail in the subsections below. The only other points of note
   118  are:
   119  
   120  * The `ReplicateQueue` only acts on ranges for which the local store is the
   121    current leaseholder.
   122  * Only one goroutine is doing all this processing, including the actual sending
   123    of snapshots. This is desirable because it keeps snapshots from overloading
   124    the network and seriously impacting user traffic, but it also has downsides.
   125    In particular, if any part of the processing of a replica gets stalled
   126    (sending a snapshot being the most likely slow part, but IIRC we also had
   127    occasional problems with a lease transfer blocking in the past), then it will
   128    take a long time for the replicate queue to get through all of its store's
   129    replicas. There's a hard limit of 60 seconds processing time per replica, but
   130    even this means that up-replication from a node failure can take a
   131    surprisingly long time in some pathological cases, whether due to a bug or
   132    just due to large replicas and low bandwidth between nodes.
   133  * The 60 second deadline per replica means that sufficiently low snapshot
   134    bandwidth or sufficiently large replicas can make some ranges impossible to
   135    up-replicate or rebalance, because their snapshots can't complete in time
   136    and just get canceled on every attempt. This shouldn't happen with the
   137    default settings of `kv.snapshot_rebalance.max_rate`,
   138    `kv.snapshot_recovery.max_rate`, and `ZoneConfig.RangeMaxBytes`, but
   139    modifications to one or more of them can put a cluster in danger.
* We limit lease transfers away from each node to one per second. This
  long-standing policy hasn't been reconsidered in quite a while, but its known
  downsides ([#19355]) are minimal, and QPS-based lease rebalancing mostly
  obviates them.
* If a range needs to be up-replicated but there are no available matching
  stores, or if a range needs to be processed but doesn't have a quorum of live
  replicas (i.e. it's an "unavailable" range), the replica will be put in
  purgatory to be re-processed when new nodes become live.
   148  
   149  ##### Choosing an action
   150  
   151  First, we must decide what action to do - up-replicate, down-replicate, or
   152  consider a rebalance. This decision is quite simple and can be easily
   153  understood from the code. Essentially we just have to compare the number of
   154  desired replicas from the applicable `ZoneConfig` to the number of non-dead,
   155  non-decommissioning replicas from the range. There's a bit of extra logic needed
   156  to dynamically adjust the desired number of replicas when it's greater than the
   157  number of nodes in the cluster ([#27349], [#30441], [#32949], [#34126]), but
   158  that's about it.
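
A minimal sketch of that comparison, with the dynamic adjustment approximated
by clamping the desired replica count to the number of live nodes (the names
and the exact clamping rule here are illustrative, not the precise logic from
the PRs above):

```go
package allocsketch

type action int

const (
	actionNoop action = iota
	actionAdd
	actionRemove
	actionConsiderRebalance
)

// chooseAction compares the zone config's desired replica count against the
// number of non-dead, non-decommissioning replicas the range currently has.
func chooseAction(desiredReplicas, liveNodes, liveReplicas int) action {
	desired := desiredReplicas
	if desired > liveNodes {
		// Can't place more replicas than there are nodes; keep the target odd,
		// since an even replication factor doesn't add fault tolerance.
		desired = liveNodes
		if desired%2 == 0 && desired > 1 {
			desired--
		}
	}
	switch {
	case liveReplicas < desired:
		return actionAdd
	case liveReplicas > desired:
		return actionRemove
	default:
		return actionConsiderRebalance
	}
}
```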
   159  
   160  ##### Picking an up-replication target
   161  
   162  Picking an up-replication target is relatively straightforward. We can just
   163  iterate over all live stores in the cluster, evaluating them on each of the
[considerations](#considerations) in order, choosing one of the best results. We
   165  will never, ever choose a store that doesn't meet the `ZoneConfig` constraints,
   166  has an overfull disk, or is on the same node as another store that already
   167  contains a replica for the range. After that, we will first prefer maximizing
   168  diversity before considering factors such as the range count on each store. We
   169  notably do not consider the QPS on each store here -- it's only taken into
   170  account by the `StoreRebalancer`, never by the `ReplicateQueue`.
   171  
   172  Rather than always choosing the best result, if there are two similarly good
   173  options we will choose randomly between them. See
   174  https://brooker.co.za/blog/2012/01/17/two-random.html or
   175  https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf for details on
   176  why this behavior is preferable.
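
Putting that together, a simplified version of the selection could look like
the sketch below; `candidateStore` and its scoring fields are stand-ins for the
allocator's real candidate machinery:

```go
package allocsketch

import (
	"math/rand"
	"sort"
)

type candidateStore struct {
	storeID        int
	valid          bool    // passes constraints, isn't too full, isn't on a node that already holds a replica
	diversityScore float64 // how much locality diversity the range would have with this store added
	rangeCount     int
}

// pickAddTarget filters out invalid stores, keeps only the most diverse
// candidates, and then picks randomly between the two with the fewest ranges
// ("power of two choices").
func pickAddTarget(stores []candidateStore, rng *rand.Rand) (candidateStore, bool) {
	var best []candidateStore
	bestDiversity := -1.0
	for _, s := range stores {
		if !s.valid {
			continue
		}
		if s.diversityScore > bestDiversity {
			best, bestDiversity = []candidateStore{s}, s.diversityScore
		} else if s.diversityScore == bestDiversity {
			best = append(best, s)
		}
	}
	if len(best) == 0 {
		return candidateStore{}, false
	}
	sort.Slice(best, func(i, j int) bool { return best[i].rangeCount < best[j].rangeCount })
	if len(best) > 2 {
		best = best[:2]
	}
	return best[rng.Intn(len(best))], true
}
```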
   177  
   178  ##### Picking a down-replication target
   179  
   180  Picking a replica to remove from an over-replicated range is also quite
   181  straightforward. We just iterate over each replica's store, grading it on the
same [considerations](#considerations) as always, choosing one of the two
   183  worst-scoring stores. The only real exception is if one of the replicas is dead;
   184  in such cases, we'll always remove the dead store(s) first. Note that as of
   185  [#28875] we don't remove replicas from dead stores until we have allocated a
   186  replacement replica on a different store. This makes certain data loss scenarios
   187  less likely (see [#25392] for details).
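
A comparable sketch for the removal side (again with illustrative types; the
real grading considers more than is shown here):

```go
package allocsketch

import (
	"math/rand"
	"sort"
)

type removalCandidate struct {
	storeID        int
	dead           bool
	diversityScore float64 // how much this replica contributes to the range's locality diversity
	rangeCount     int
}

// pickRemovalTarget removes replicas on dead stores first; otherwise it grades
// the replicas' stores and picks randomly between the two worst-scoring ones.
func pickRemovalTarget(replicas []removalCandidate, rng *rand.Rand) (removalCandidate, bool) {
	if len(replicas) == 0 {
		return removalCandidate{}, false
	}
	for _, r := range replicas {
		if r.dead {
			return r, true
		}
	}
	sorted := append([]removalCandidate(nil), replicas...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].diversityScore != sorted[j].diversityScore {
			// Removing the replica that contributes least to diversity hurts least.
			return sorted[i].diversityScore < sorted[j].diversityScore
		}
		// Among equally diverse options, prefer shedding from fuller stores.
		return sorted[i].rangeCount > sorted[j].rangeCount
	})
	if len(sorted) > 2 {
		sorted = sorted[:2]
	}
	return sorted[rng.Intn(len(sorted))], true
}
```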
   188  
   189  If the algorithm chooses to remove the local replica, the replica must first
   190  transfer the lease away before it can be removed. Note that while the new
   191  leaseholder's replicate queue will examine the range shortly after acquiring the
   192  lease, it's possible for the new leaseholder to make a different decision. This
   193  isn't a real problem, but it does mean that removing oneself involves more work
   194  and less certainty than removing any of the other replicas.
   195  
   196  ##### Picking a rebalance target
   197  
   198  Deciding when to rebalance is when things start to get legitimately tricky, and
   199  is what much of the allocator code is devoted to. This makes intuitive sense if
you consider that when adding or removing a replica you both
   201  
   202  1. know that you need to take action - unless all the options are truly terrible
   203     you should pick one of them.
   204  2. only have to consider each store with respect to the set of existing
   205     replicas' stores. For adding a replica, this is roughly linear with respect
   206     to the number of live stores in the cluster. For removing a replica, it's
   207     linear with respect to the number of replicas in the range.
   208  
   209  However, when rebalancing, you have to decide whether taking action is actually
   210  desirable. And in practice, you want a bias against action, since there's a real
   211  cost to moving data around, and we don't want to do so unless there's a
   212  correspondingly real benefit. Also, the problem isn't linear any more - it's
   213  roughly O(m*n) when there are m replicas in the range and n live stores in the cluster,
   214  because we have to choose both the replica to be removed and the replica to add.
   215  This is particularly an issue for diversity score calculations and per-replica /
   216  expressive zone config constraints. For example, if you have the following
   217  stores:
   218  
   219  StoreID | Locality              | Range Count
   220  --------|-----------------------|------------
   221  s1      | region=west,zone=a    | 10
   222  s2      | region=west,zone=b    | 10
   223  s3      | region=central,zone=a | 100
   224  s4      | region=east,zone=a    | 100
   225  
And a range has replicas on s1, s3, and s4, then going purely by range count it
would be great to take a replica off of s3 or s4, which are both relatively
overfull. It would also be great to add a replica to s2, which is relatively
underfull. However, replacing s3 or s4 with s2 would hurt the range's
diversity, which we never choose to do without the user explicitly telling us to.
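
To make the diversity argument concrete, here is a toy pairwise scoring (the
real scoring in the allocator differs in detail, but has the same flavor): two
localities score higher the earlier their tiers diverge, and a replica set is
scored by averaging over all pairs.

```go
package allocsketch

// locality is an ordered list of tier values, e.g. {"west", "a"} for
// region=west,zone=a.
type locality []string

// pairDiversity scores two localities by the position of the first tier at
// which they differ: differing at the region tier counts for more than
// differing only at the zone tier.
func pairDiversity(a, b locality) float64 {
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if i >= len(a) || i >= len(b) || a[i] != b[i] {
			return float64(n-i) / float64(n)
		}
	}
	return 0 // identical localities
}

// rangeDiversity averages the pairwise scores over a proposed replica set.
func rangeDiversity(replicas []locality) float64 {
	var sum float64
	var pairs int
	for i := range replicas {
		for j := i + 1; j < len(replicas); j++ {
			sum += pairDiversity(replicas[i], replicas[j])
			pairs++
		}
	}
	if pairs == 0 {
		return 0
	}
	return sum / float64(pairs)
}
```

Under this toy scoring, the existing set {s1, s3, s4} scores 1.0 while
{s1, s2, s4} scores about 0.83, which is why the allocator won't make that swap
no matter how attractive the range counts look.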
   231  
   232  You can probably imagine that as the number of stores grows, doing all the
   233  pairwise comparisons could become quite a bit of work. To optimize these
   234  calculations, we group stores that share the same locality and the same
   235  node/store attributes (a mostly-forgotten feature, but one that still needs to
   236  be accounted for when considering `ZoneConfig` constraints). We can do all
   237  constraint and diversity-scoring calculations just once for each group, and also
pair each group up only against the existing replicas that it could legally
replace without hurting diversity or violating constraints. We then only have to
   240  do range count comparisons within these "comparable" classes of stores.
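
Schematically, the grouping step might look like this (the key and struct
names are illustrative only):

```go
package allocsketch

type storeDesc struct {
	storeID  int
	locality string // e.g. "region=west,zone=a"
	attrs    string // node/store attributes, e.g. "ssd"
}

type storeClass struct {
	locality string
	attrs    string
	stores   []storeDesc
}

// groupComparableStores buckets stores that share a locality and attribute
// set, so that constraint checks and diversity scores can be computed once per
// class instead of once per store.
func groupComparableStores(stores []storeDesc) []storeClass {
	byKey := map[[2]string]*storeClass{}
	var order [][2]string
	for _, s := range stores {
		key := [2]string{s.locality, s.attrs}
		c, ok := byKey[key]
		if !ok {
			c = &storeClass{locality: s.locality, attrs: s.attrs}
			byKey[key] = c
			order = append(order, key)
		}
		c.stores = append(c.stores, s)
	}
	classes := make([]storeClass, 0, len(order))
	for _, key := range order {
		classes = append(classes, *byKey[key])
	}
	return classes
}
```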
   241  
   242  At the end, we can determine which (added replica, removed replica) pairs are
   243  the largest improvement and choose from amongst them. As one last precautionary
   244  step, we then simulate the down-replication logic on the set of replicas that
   245  will result from adding the new replica. If the simulation finds that we would
   246  remove the replica that was just added, we choose not to make that change. This
   247  avoids thrashing, and is needed because we can't atomically add a member to the
   248  raft group at the same time that we remove one. It's possible that this isn't
   249  necessary right now, since the rebalancing code has been significantly improved
   250  since it was added, but at the very least it's a nice fail-safe against future
   251  mistakes.
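
The final safeguard can be expressed compactly; `simulateRemoval` stands in for
the down-replication logic described above:

```go
package allocsketch

type replicaID int

// shouldApplyRebalance simulates the removal decision on the would-be replica
// set and rejects the rebalance if the replica we are about to add would be
// the one chosen for removal, since that add/remove cycle would just thrash.
func shouldApplyRebalance(
	existing []replicaID,
	addTarget replicaID,
	simulateRemoval func(replicas []replicaID) replicaID,
) bool {
	proposed := append(append([]replicaID(nil), existing...), addTarget)
	return simulateRemoval(proposed) != addTarget
}
```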
   252  
   253  ##### Per-replica / expressive constraints
   254  
   255  We support two high-level types of constraints -- those which apply to all
   256  replicas in a range, and those which are scoped to only apply to a particular
   257  number of the replicas in a range (publicly referred to as [per-replica
   258  constraints]). The latter option adds a good deal of subtlety to all allocator
   259  decisions -- up-replication, down-replication, and especially rebalancing.
   260  
   261  In order to satisfy the requirements, we had to split up constraint checking
   262  into separate functions that work differently for adding, removing, and
   263  rebalancing. We also had to add an internal concept of whether a replica is
   264  "necessary" for meeting the required constraints, in addition to the existing
   265  concept of whether or not the replica is valid. A replica is "necessary" if the
per-replica constraints wouldn't be satisfied if the replica weren't part of the
range.
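
As an illustration of the "necessary" concept (the types and matching function
here are stand-ins, not the real constraint-checking code):

```go
package allocsketch

// perReplicaConstraint is a constraint that only a subset of the range's
// replicas needs to satisfy.
type perReplicaConstraint struct {
	numReplicas int                    // how many replicas must match
	matches     func(storeID int) bool // whether a given store satisfies the constraint
}

// isNecessary reports whether removing the replica on storeID would leave some
// per-replica constraint with fewer matching replicas than it requires.
func isNecessary(storeID int, otherStores []int, constraints []perReplicaConstraint) bool {
	for _, c := range constraints {
		if !c.matches(storeID) {
			continue
		}
		matching := 0
		for _, other := range otherStores {
			if c.matches(other) {
				matching++
			}
		}
		if matching < c.numReplicas {
			return true // without this replica the constraint couldn't be met
		}
	}
	return false
}
```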
   268  
   269  For more details on the design of the feature, see the discussion on [#19985].
   270  For the implementation, see [#22819].
   271  
   272  
##### Lease transfer decisions
   274  
   275  For the most part, deciding whether to transfer a lease is a fairly
   276  straightforward decision based on whether the current leaseholder node is in a
   277  draining state and on the lease counts on all the stores holding replicas for a
   278  range. The more complex logic is related to the follow-the-workload
   279  functionality that kicks in if-and-only-if the various nodes holding replicas
   280  are in different localities. The logic involved here is better explained in the
   281  [original RFC](../RFCS/20170125_leaseholder_locality.md) than I could do in less
   282  space here. The logic has not meaningfully changed since the original
   283  design/implementation.
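
For the simpler, non-follow-the-workload part of the decision, the shape is
roughly the following (the threshold is made up for illustration; the real
heuristics live in the allocator):

```go
package allocsketch

type leaseStore struct {
	storeID    int
	draining   bool
	leaseCount int
}

// shouldTransferLease transfers away from a draining leaseholder, or from one
// whose lease count is far enough above the mean of the replica-holding stores
// to be worth the churn.
func shouldTransferLease(current leaseStore, others []leaseStore) bool {
	if current.draining {
		return true
	}
	total := current.leaseCount
	for _, s := range others {
		total += s.leaseCount
	}
	mean := float64(total) / float64(len(others)+1)
	// Only act on a meaningful imbalance, to avoid ping-ponging leases.
	return float64(current.leaseCount) > mean*1.05+5
}
```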
   284  
   285  ### Store Rebalancer
   286  
   287  As of v2.1, Cockroach also includes a separate control loop on each store called
   288  the `StoreRebalancer`. The `StoreRebalancer` exists because we found in [#26059]
   289  that an uneven balance of load on each node was causing serious performance
   290  problems when attempting to run TPC-C at large scale without using partitioning.
Ensuring that each node had a more even balance of work to do was experimentally
   292  found to allow significantly higher and smoother performance.
   293  
   294  The `StoreRebalancer` takes a somewhat different approach to rebalancing,
   295  though. While the `ReplicateQueue` iterates over each replica one at a time,
   296  deciding whether the replica would be better off somewhere else, the
   297  `StoreRebalancer` looks at the overall amount of load (`BatchRequest` QPS
   298  specifically, although it could in theory consider other factors) on each store
   299  and attempts to take action if the local store is overloaded relative to the
   300  other stores in the cluster. This difference is important -- our previous
   301  attempt to rebalance based on load was integrated into the replicate queue, and
   302  it didn't work very well for at least three different reasons:
   303  
   304  1. We bit off more than we could chew, trying to rebalance on too many different
   305     factors at once - range count, keys written per second, and disk space used.
2. Keys written per second was the wrong metric, at least for TPC-C. Experimentation
   showed that the number of `BatchRequest`s being handled by a store per second
   was much more strongly correlated with load imbalance than keys written
   per second.
   310  3. Most importantly, the replicate queue only looks at one replica at a time. It
   311     may see that the load on each store is uneven, but it doesn't have a good way
   of knowing whether the replica in question would be a good one to move to try
   to even things out (if a particular range is relatively low in the metric
   314     we want to even out, it's intuitively a bad one to move). We did start
   315     gossiping quantiles in order to help determine which quantile a range fell in
   316     and thus whether it would be a good one to move, but this was still pretty
   317     imprecise.
   318  
   319  The `StoreRebalancer` solves all these problems. It only focuses on QPS, and by
   320  focusing on the store-level imbalance first and picking ranges to rebalance
   321  later, it can choose ranges that are specifically high in QPS in order to have
   322  the biggest influence on store-level balance with the smallest disruption on
   323  range count (which the `ReplicateQueue` is still responsible for attempting to
   324  even out). Ranges to rebalance are efficiently chosen because we have started
   325  tracking a priority queue of the hottest ranges by QPS on each store. This queue
   326  gets repopulated once a minute, when the existing loop that iterates over all
   327  replicas to compute store-level stats does its thing. This list of hot ranges
   328  can have other uses as well, such as powering debug endpoints for the admin UI
   329  ([#33336]).
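
A sketch of such a bounded hottest-ranges collection, kept as a small min-heap
so only the top k ranges by QPS survive the periodic stats pass (names here are
illustrative, not the actual implementation):

```go
package allocsketch

import "container/heap"

type hotRange struct {
	rangeID int64
	qps     float64
}

// hotHeap is a min-heap ordered by QPS, so the coolest of the tracked ranges
// sits at the root and can be evicted cheaply.
type hotHeap []hotRange

func (h hotHeap) Len() int            { return len(h) }
func (h hotHeap) Less(i, j int) bool  { return h[i].qps < h[j].qps }
func (h hotHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *hotHeap) Push(x interface{}) { *h = append(*h, x.(hotRange)) }
func (h *hotHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// hottestRanges returns (unordered) the k ranges with the highest QPS.
func hottestRanges(all []hotRange, k int) []hotRange {
	if k <= 0 {
		return nil
	}
	h := &hotHeap{}
	for _, r := range all {
		switch {
		case h.Len() < k:
			heap.Push(h, r)
		case r.qps > (*h)[0].qps:
			heap.Pop(h)
			heap.Push(h, r)
		}
	}
	return *h
}
```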
   330  
Interpreting the exact details of how things work from the code should be pretty
straightforward; we attempt to move leases to resolve imbalances first, and only
resort to moving replicas around if moving leases was insufficient to resolve
the imbalance. There are some controls in place to avoid rebalancing when QPS is
too low to matter, to avoid messing with a range that's so hot that it
constitutes the majority of a node's QPS, to skip ranges with too little QPS to
make a real difference, and a few other such safeguards.
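
In code, the shape of that decision is roughly the following (the threshold and
function names are placeholders for illustration, not the `StoreRebalancer`'s
actual knobs):

```go
package allocsketch

// rebalanceStore only acts if this store's QPS is sufficiently above the
// cluster mean, tries shedding leases first, and falls back to moving replicas
// only if that wasn't expected to be enough.
func rebalanceStore(
	localQPS, meanQPS float64,
	shedLeases func(targetQPS float64) float64, // returns the QPS expected to be shed
	moveReplicas func(targetQPS float64),
) {
	const overfullFraction = 0.10 // illustrative; act only when well above the mean
	target := meanQPS * (1 + overfullFraction)
	if localQPS <= target {
		return // close enough to the mean; don't bother
	}
	remaining := localQPS - shedLeases(target)
	if remaining > target {
		moveReplicas(target)
	}
}
```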
   338  
   339  The `StoreRebalancer` can be controlled by a cluster setting that either fully
   340  turns it off, enables just lease rebalancing, or enables both lease and replica
   341  rebalancing, which is the default.
   342  
   343  For more details, see the original prototype ([#26608]) or the final
   344  implementation ([#28340], [#28852]).
   345  
   346  ### Other details
   347  
   348  Before removing a replica or transferring a lease, we need to take the raft
   349  status of the various existing replicas into account. This is important to avoid
   350  temporary unavailability.
   351  
   352  For example, if you transfer the lease for a range to a replica that is way
   353  behind in processing its raft log, it will take some time before that replica
   354  gets around to processing the command which transferred the lease to it, and it
   355  won't be able to serve any requests until it does so.
   356  
   357  Or when considering which replica to remove from a range, we must take care not
   358  to remove a replica that is critical for the range's quorum. If only 3 replicas
   359  out of 5 are caught up with the raft leader's state, we can't remove any of
   360  those 3, but can safely remove either of the other 2.
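
The quorum check can be sketched as follows (illustrative types; the real check
uses the raft progress reported by the leader):

```go
package allocsketch

type raftReplica struct {
	replicaID int
	upToDate  bool // caught up with the raft leader's log
}

// safeToRemove reports whether removing the given replica still leaves enough
// up-to-date replicas to form a quorum of the post-removal range.
func safeToRemove(replicas []raftReplica, removeID int) bool {
	remaining, upToDate := 0, 0
	for _, r := range replicas {
		if r.replicaID == removeID {
			continue
		}
		remaining++
		if r.upToDate {
			upToDate++
		}
	}
	quorum := remaining/2 + 1
	return upToDate >= quorum
}
```

With 5 replicas of which 3 are up to date, removing one of the 3 leaves only 2
up-to-date replicas against a quorum of 3, so the check fails; removing one of
the 2 laggards leaves 3 of 4, which passes.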
   361  
   362  Note that it's possible that the raft state of the underlying replicas changes
   363  between when we do this check and when the actual transfer/removal takes place,
   364  so it isn't a foolproof protection, but the window of risk is very small and we
   365  haven't noticed it being a problem in practice.
   366  
   367  ## Known issues
   368  
   369  * Rebalancing isn't atomic, meaning that adding a new replica and removing the
   370    replica it replaces is done as two separate steps rather than just one. This
   371    leaves room for locality failures between the two steps to cause
   372    unavailability ([#12768]). For example, if a range has replicas in localities
   373    `a`, `b`, and `c`, and wants to rebalance to a different store in `a`, there
   374    will be a short period of time in which 2 of the range's 4 replicas are in
   375    `a`.  If `a` goes down before one of them is removed, the range will be
   376    without a quorum until `a` comes back up.
   377  * Rebalancing doesn't work well with multiple stores per node because we want to
   378    avoid ever putting multiple replicas of the same range on the same node
   379    ([#6782]). This has never been a deal breaker for anyone AFAIK, but occasionally
   380    annoys a user or two.
   381  * `RelocateRange` is flaky in v2.2-alpha versions because we now immediately put a
   382    range through the replicate queue when a new lease is acquired on it ([#31287]).
   383    It may fail to complete its desired changes successfully due to racing with
   384    changes proposed by the new leaseholder.
   385  * `RelocateRange` (and consequently the `StoreRebalancer` as a whole) doesn't
   386    populate any useful information into the `system.rangelog` table, which has
   387    traditionally been the best way to debug rebalancing decisions after the
  fact ([#34130]).
   389  
   390  [#6782]: https://github.com/cockroachdb/cockroach/issues/6782
   391  [#12768]: https://github.com/cockroachdb/cockroach/issues/12768
   392  [#17979]: https://github.com/cockroachdb/cockroach/issues/17979
   393  [#19355]: https://github.com/cockroachdb/cockroach/issues/19355
   394  [#19985]: https://github.com/cockroachdb/cockroach/issues/19985
   395  [#22819]: https://github.com/cockroachdb/cockroach/pulls/22819
   396  [#25392]: https://github.com/cockroachdb/cockroach/issues/25392
   397  [#26059]: https://github.com/cockroachdb/cockroach/issues/26059
   398  [#26608]: https://github.com/cockroachdb/cockroach/pull/26608
   399  [#27349]: https://github.com/cockroachdb/cockroach/pull/27349
   400  [#28340]: https://github.com/cockroachdb/cockroach/pull/28340
   401  [#28852]: https://github.com/cockroachdb/cockroach/pull/28852
   402  [#28875]: https://github.com/cockroachdb/cockroach/pull/28875
   403  [#30441]: https://github.com/cockroachdb/cockroach/pull/30441
   404  [#31287]: https://github.com/cockroachdb/cockroach/issues/31287
   405  [#32949]: https://github.com/cockroachdb/cockroach/pull/32949
   406  [#33336]: https://github.com/cockroachdb/cockroach/pull/33336
   407  [#34126]: https://github.com/cockroachdb/cockroach/pull/34126
   408  [#34130]: https://github.com/cockroachdb/cockroach/issues/34130
   409  [per-replica constraints]: https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html#scope-of-constraints