- Feature Name: Rebalancing plans for 1.1
- Status: in-progress
- Start Date: 2017-06-02
- Authors: Alex Robinson
- RFC PR: [#16296](https://github.com/cockroachdb/cockroach/pull/16296)
- Cockroach Issue:
  - [#12996](https://github.com/cockroachdb/cockroach/issues/12996)
  - [#15988](https://github.com/cockroachdb/cockroach/issues/15988)
  - [#17979](https://github.com/cockroachdb/cockroach/issues/17979)

# Summary

Lay out plans for which rebalancing improvements to make (or not make) in the
1.1 release and designs for how to implement them.

# Background / Motivation

We’ve made a couple of efforts over the past year to improve the balance of
[replicas](20160503_rebalancing_v2.md) and
[leases](20170125_leaseholder_locality.md) across a
cluster, but our balancing algorithms still don’t take into account everything
that a user might care about balancing within their cluster. This document puts
forth plans for what we’ll work on with respect to rebalancing during the 1.1
release cycle. In particular, four different improvements have been proposed.

## Balancing disk capacity, not just number of ranges ("size-based rebalancing")

Our existing rebalancing heuristics only consider the number of ranges on each
node, not the number of bytes they contain, effectively assuming that all
ranges are the same size. This is a flawed assumption -- a large number of
empty ranges can be created when a user drops/truncates a table or runs a
restore from backup that fails to finish. Not considering the size of the
ranges in rebalancing can lead to some nodes containing far more data than
others.

## Balancing request load, not just number of ranges ("load-based rebalancing")

Similarly, the rebalancing heuristics do not consider the amount of load on each
node when making placement decisions. While this works great for some of our
load generators (e.g. kv), it can cause problems with others like ycsb and with
many real-world workloads if many of the most popular ranges end up on the same
node. When deciding whether to move a given range, we should consider how much
load is on that range and on each of the candidate nodes.

## Moving replicas closer to where their load is coming from ("load-based replica locality")

For the 1.0 release, [we added lease transfer
heuristics](20170125_leaseholder_locality.md) that move leases closer to where
requests are coming from in high-latency environments. It’s easy to
imagine a similar heuristic for moving ranges -- if a lot of requests for a
range are coming from a locality that doesn’t have a replica of the range, then
we should add a replica there. That will then enable the lease-transferring
heuristics to transfer the lease there if appropriate, reducing the latency to
access the range.

## Splitting ranges based on load ("load-based splitting")

A single hot range can become a bottleneck. We currently only split ranges when
they hit a size threshold, meaning that all of a cluster’s load could be to a
single range and we wouldn’t do anything about it, even if there are other nodes
in the cluster (that don’t contain the hot range) that are idle. While splitting
decisions may seem somewhat separate from rebalancing decisions, in some
situations splitting a hot range would allow us to more evenly distribute the
load across the cluster by rebalancing one of the halves.

This is so important for performance that we already support manually
introducing range splits, but an automated approach would be more appropriate as
a permanent solution.

# Detailed Design

## Balancing based on multiple factors

Currently when we’re scoring a potential replica rebalance, we only have to
consider the relevant zone config settings and the number of replicas on each
store. This allows us to effectively treat all replicas as if they’re exactly
the same. Adding in factors like the size of the range and the QPS to a range
invalidates that assumption, and forces us to consider how a replica differs
from the typical replica on both dimensions. For example, if node 1 has fewer
replicas than node 2 but more bytes stored on it, then we might be willing to
move a big replica from node 1 to 2 or a small replica from node 2 to 1, but
wouldn’t want the inverse moves.

Thus, in addition to knowing the size or QPS of the particular range
we’re considering rebalancing, we’ll also want some idea of the
distribution of size or QPS per range for the replicas in a store. This will
mean periodically iterating over all the replicas in a store to aggregate
statistics so that we can know whether a range is larger/smaller than others or
has more/less QPS than others. Specifically, we'll try computing a few
percentiles to help pick out the true outliers that would have the greatest
effect on the cluster's balance.

We can then compute rebalance scores by considering the percentiles of a
replica and the under/over-fullness of stores amongst all the considered
dimensions. We will prefer moving replicas at high percentiles away from stores
that are overfull for that dimension toward stores that are less full for that
dimension (and vice versa for low percentiles and underfull stores, under the
expectation that the removed replicas can be replaced by higher-percentile
replicas). The extremeness of a given percentile and under/over-fullness will
increase the weight we give to that dimension. These heuristics will allow us
to combine the different dimensions into a single final score, and should be
covered by a large number of test cases to ensure stability in different
scenarios.
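
To make the scoring heuristic concrete, here is a minimal sketch in Go of how a
replica's per-dimension percentiles and a store's under/over-fullness could be
folded into a single score. The names (`storeFullness`, `replicaPercentiles`,
`rebalanceScore`) and the linear weighting are illustrative assumptions, not
the actual allocator code:

```go
package allocator

// Dimension identifies one of the balancing signals gossiped per store.
type Dimension int

const (
	RangeCount Dimension = iota
	DiskFractionUsed
	QPS
)

// storeFullness records how far a store is from the cluster mean for each
// dimension, normalized so that +1 means maximally overfull and -1 means
// maximally underfull.
type storeFullness map[Dimension]float64

// replicaPercentiles records where a replica falls among its store's replicas
// for each dimension (0.0 = smallest/coldest, 1.0 = largest/hottest).
type replicaPercentiles map[Dimension]float64

// rebalanceScore combines the dimensions into a single number; higher means
// "more worth moving this replica off of this store". A replica scores high
// when it is an outlier (extreme percentile) on a dimension for which the
// store is correspondingly over- or underfull.
func rebalanceScore(fullness storeFullness, pct replicaPercentiles) float64 {
	var score float64
	for dim, full := range fullness {
		p, ok := pct[dim]
		if !ok {
			continue
		}
		// Center the percentile around zero so that median replicas are
		// neutral, then weight by how extreme the store's imbalance is.
		// Overfull store + high-percentile replica => positive contribution.
		score += full * (2*p - 1)
	}
	return score
}
```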

## Size-based rebalancing

Taking size into account seems like the simplest modification of our existing
rebalancing logic, but even so there are a variety of available approaches:

1. We already gossip each store’s total disk capacity and unused disk capacity.
   We could start trying to balance unused disk capacity across all the nodes of
   the cluster. That would mean that in the case of heterogeneous disk sizes,
   nodes with smaller disks might not get much (if any) data rebalanced to them
   if the cluster doesn’t have much data.

2. We could try to balance used disk capacity (i.e. total - unused). In
   heterogeneous clusters, this would mean that some nodes would fill up way
   before others (and potentially way before the cluster fills up as a whole).
   Situations in which some nodes but not others are full are not regularly
   tested yet, so we may have to start testing them if we go this way.

3. We could try to balance the fraction of each disk that is used. This is the
   happy compromise between the previous two options -- it will put data onto
   nodes with smaller disks right from the beginning (albeit less data), and it
   shouldn’t often lead to smaller nodes filling up way before others.

The first option most directly parallels our existing logic that only attempts
to balance the number of replicas without considering the size of each node’s
disk, but the third option appears best overall. It’s likely that we’ll want to
change the replica-count logic as part of this work to take disk size into
account, such that we’ll balance replicas per GB of disk rather than the
absolute number of replicas.
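
To illustrate option 3, here is a sketch of how the balancing signal could be
derived from the capacity values we already gossip. The struct and function
names and the 5% dead band are assumptions for illustration, not existing code:

```go
package allocator

// storeCapacity mirrors the capacity values we already gossip for each store:
// total disk capacity and how much of it is still available.
type storeCapacity struct {
	Capacity  int64 // bytes
	Available int64 // bytes
}

// diskFractionUsed is the signal option 3 balances on: the fraction of the
// store's disk that is in use. Balancing on this keeps small-disk nodes
// participating from the start without filling them up before larger nodes.
func diskFractionUsed(c storeCapacity) float64 {
	if c.Capacity == 0 {
		return 0
	}
	return float64(c.Capacity-c.Available) / float64(c.Capacity)
}

// shouldRebalanceFrom reports whether a store's disk usage is far enough above
// the cluster mean to consider shedding replicas. The 5% dead band is an
// illustrative stand-in for whatever threshold we use to avoid thrashing.
func shouldRebalanceFrom(c storeCapacity, clusterMeanFraction float64) bool {
	return diskFractionUsed(c) > clusterMeanFraction+0.05
}
```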

## Load-based rebalancing

As part of our [leaseholder locality](20170125_leaseholder_locality.md) work, we
started tracking how many requests each range’s leaseholder receives. This gives
us a QPS number for each leaseholder replica, but no data for replicas that
aren’t leaseholders. If we left things this way, our replica rebalancing would
suddenly take a dependency on the cluster’s current distribution of
leaseholders, which is a scary thought given that leaseholder rebalancing
conceptually already depends on replica rebalancing (because it can only balance
leases to where the replicas are). As a result, I think we’ll want to start
tracking the number of applied commands on each replica instead of relying on
the existing leaseholder QPS.

Once we have that per-replica QPS, though, we can aggregate it at the store
level and start including it in the store’s capacity gossip messages to use it
in balancing much like disk space.
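
A rough sketch of what per-replica load tracking and store-level aggregation
might look like. The `replicaLoad` type, its fields, and `storeQPS` are
hypothetical names used only to illustrate the shape of the change:

```go
package allocator

import (
	"sync/atomic"
	"time"
)

// replicaLoad counts commands applied by a single replica so that followers,
// not just leaseholders, carry a load signal. The names are illustrative; real
// counters would hang off the Replica struct.
type replicaLoad struct {
	applied   int64 // commands applied since lastReset (updated atomically)
	lastReset time.Time
}

// recordApplied is called each time the replica applies a command.
func (l *replicaLoad) recordApplied() {
	atomic.AddInt64(&l.applied, 1)
}

// qps returns the average applied-commands-per-second since the last reset.
func (l *replicaLoad) qps(now time.Time) float64 {
	elapsed := now.Sub(l.lastReset).Seconds()
	if elapsed <= 0 {
		return 0
	}
	return float64(atomic.LoadInt64(&l.applied)) / elapsed
}

// storeQPS aggregates per-replica QPS for inclusion in the store's capacity
// gossip message, alongside range count and disk usage.
func storeQPS(replicas []*replicaLoad, now time.Time) float64 {
	var total float64
	for _, r := range replicas {
		total += r.qps(now)
	}
	return total
}
```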

## Load-based replica locality

This is where things get tricky -- while the above goals are about bringing the
cluster into greater balance, trying to move replicas toward the load is likely
to reduce the balance within the cluster. Reducing the thrashing involved in the
leaseholder locality project was quite a lot of work and still isn’t resilient
to certain configurations. When we’re talking about moving replicas rather than
just transferring leases, the cost of thrashing skyrockets because snapshots
consume a lot of disk/network bandwidth.

This also conflicts with one of our design goals from [the original rebalancing
RFC](20150819_stateless_replica_relocation.md), which is that the decision to
make any individual operation should be stateless. Because the counts of
requests by locality are only tracked on the leaseholder, these types of
decisions are inherently stateful, so we should tread carefully in making them.

In the interest of not creating problem cases for users, I’d suggest pushing
this back until we have known demand for it. Custom zone configs paired with
leaseholder locality already do a pretty good job of enabling low-latency access
to data.

## Load-based splitting

Load-based splitting is conceptually pretty simple, but will likely produce
some edge cases in practice. Consider a few representative examples:

1. A range gets a lot of requests for single keys, evenly distributed over the
   range. Splitting will help a lot.

2. A range gets a lot of requests for just a couple of individual keys (and
   the hot requests don't touch multiple hot keys in the same query, a la case
   4). Splitting will help if and only if the split is between the hot keys.

3. A range gets a lot of requests for just a single key. Splitting won’t help at
   all.

4. A range gets a lot of scan requests or other requests that touch multiple
   keys. Splitting could actually make things worse by flipping an operation
   from a single-range operation into a multi-range one.

Given these possibilities, it’s clear that we’re going to need more granular
information than how many requests a range is receiving in order to decide
whether to split a range. What we really need is something that will keep track
of the hottest keys (or key spans) in the hottest ranges. This is basically a
streaming top-k problem, and there are plenty of published algorithms that
should work for us given that we only need approximate results.
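
For example, an approximate top-k tracker in the Space-Saving family could look
roughly like the sketch below. The names (`topKTracker`, `record`) are
hypothetical, and a real version would operate on key spans rather than
individual keys:

```go
package split

// topKTracker is a simplified Space-Saving-style counter over request keys.
// This is only a sketch: a real implementation would track key spans rather
// than single keys, bound its memory per store, and only be enabled for
// ranges whose QPS is already known to be abnormally high.
type topKTracker struct {
	k      int
	counts map[string]int64
}

func newTopKTracker(k int) *topKTracker {
	return &topKTracker{k: k, counts: make(map[string]int64, k)}
}

// record notes one request to key. When the tracker is full, the smallest
// existing counter is evicted and its count is inherited by the new key,
// which is what keeps the estimates approximate but the memory bounded.
func (t *topKTracker) record(key string) {
	if _, ok := t.counts[key]; ok {
		t.counts[key]++
		return
	}
	if len(t.counts) < t.k {
		t.counts[key] = 1
		return
	}
	var minKey string
	minCount := int64(-1)
	for k, c := range t.counts {
		if minCount < 0 || c < minCount {
			minKey, minCount = k, c
		}
	}
	delete(t.counts, minKey)
	t.counts[key] = minCount + 1
}
```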

It’s also worth noting that we’ll only need such stats for ranges that have a
high enough QPS to justify splitting. Thus, our approach will look something
like:

1. Track the QPS to each leaseholder (which we’re already doing as of
   [#13426](https://github.com/cockroachdb/cockroach/pull/13426)).

2. If a given range’s QPS is abnormally high (by virtue of comparing to the
   other ranges), start recording the approximate top-k key spans.
   Correspondingly, if a range's QPS drops down and we had been tracking its
   top-k key spans, we should notice this and stop.

3. Periodically check the top key spans for these top ranges and determine if
   splitting would allow for better distributing the load without creating too
   many more multi-range operations. Picking a split point and determining
   whether it'd be beneficial to split there could be done by sorting the top
   key spans and, for each boundary between them, comparing how many requests
   would go to spans that are to the left of, to the right of, or overlapping
   that possible split point (see the sketch after this list).

4. If a good split point was found, do the split.

5. Sit back and let load-based rebalancing do its thing.
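
A minimal sketch of step 3's evaluation, assuming the tracker above has
produced a set of hot key spans with request counts. The span representation,
scoring, and thresholds are illustrative assumptions:

```go
package split

import "sort"

// spanLoad is one approximate top-k entry: a key span and how many requests
// hit it during the sampling window.
type spanLoad struct {
	start, end string // [start, end), compared lexicographically
	requests   int64
}

// classify counts how much of the hot load would fall entirely left of,
// entirely right of, or overlapping a candidate split key. Overlapping load
// is what would turn single-range operations into multi-range ones.
func classify(spans []spanLoad, splitKey string) (left, right, overlapping int64) {
	for _, s := range spans {
		switch {
		case s.end <= splitKey:
			left += s.requests
		case s.start >= splitKey:
			right += s.requests
		default:
			overlapping += s.requests
		}
	}
	return left, right, overlapping
}

// bestSplit tries the boundary before each hot span (in sorted order) and
// returns the candidate that best balances left vs. right load while keeping
// overlapping load small. ok is false when no candidate looks worthwhile.
func bestSplit(spans []spanLoad) (key string, ok bool) {
	if len(spans) < 2 {
		return "", false
	}
	sort.Slice(spans, func(i, j int) bool { return spans[i].start < spans[j].start })
	best := int64(-1)
	for _, s := range spans[1:] {
		l, r, o := classify(spans, s.start)
		// Illustrative scoring: reward the more balanced split, penalize overlap.
		score := min64(l, r) - 4*o
		if score > best && score > 0 {
			best, key, ok = score, s.start, true
		}
	}
	return key, ok
}

func min64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}
```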

This will take a bit of work to finish, and isn’t critical for 1.1, but
would be a nice addition and comes with much less downside risk than something
like load-based replica locality. We’ll try to get to it if we have the time,
otherwise we can implement it for 1.2.

### Alternatives

The approximate top-k approach to determining split points is fairly precise,
but it also adds some fairly complex logic to the hot code path for serving
requests to replicas. A simpler alternative would be for us to do the following
for each hot range:

1. Pick a possible split point (the mid-point of the range to start with).

1. For each incoming request to the hot replica, record whether the request is
   to the left side, the right side, or both.

1. After a while, examine the results. If most of the requests touched both
   sides, abandon trying to split the range. If most of the requests were split
   pretty evenly between left and right, make the split at the tested key. If
   the results were pretty uneven, try moving the possible split point in the
   direction that received more requests and try again, a la binary search.
   After O(log n) possible split points, we'll either find a decent split point
   or determine that there isn't an equitable split point (because the
   requests are mostly to a single key). A sketch of this probing loop follows
   the list.
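
Here is a sketch of the decision made at the end of each sampling window,
assuming we've counted left-only, right-only, and both-sides requests for the
probed key. The type names and the thresholds (a majority for "mostly both", a
quarter of the total for "pretty even") are illustrative:

```go
package split

// probeResult accumulates, for one candidate split key, how many sampled
// requests touched only the left side, only the right side, or both sides.
type probeResult struct {
	left, right, both int64
}

// action is what to do after a sampling window, mirroring the binary-search
// alternative described above.
type action int

const (
	keepSplitHere action = iota // split at the probed key
	moveLeft                    // probe a key further to the left
	moveRight                   // probe a key further to the right
	giveUp                      // no equitable split point exists
)

func (r probeResult) nextAction() action {
	total := r.left + r.right + r.both
	if total == 0 {
		return giveUp
	}
	// If most requests span the candidate, splitting here would turn
	// single-range operations into multi-range ones.
	if r.both*2 > total {
		return giveUp
	}
	// Reasonably even split of the load: accept the candidate.
	if diff := r.left - r.right; diff < total/4 && diff > -total/4 {
		return keepSplitHere
	}
	// Otherwise move toward the busier side, a la binary search.
	if r.left > r.right {
		return moveLeft
	}
	return moveRight
}
```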

In fact, even if we do use a top-k approach, testing out the split point like
this before making the split might still be smart to ensure that all of the
spans that weren't included in the top-k aren't touching both sides of the
split.

Finally, the simplest alternative of all (proposed by bdarnell on #16296) is
to not do load-based splitting at all, and instead just split more eagerly for
tables with a small number of ranges (where "small" could reasonably be defined
as "less than the number of nodes in the cluster"). This wouldn't help with
steady-state load at all, but it would help with the arguably more common
scenario of a "big bang" of data growth when a service launches or during a
bulk load of data.

### Drawbacks

Splitting ranges based on load could, for certain request patterns, lead to a
large build-up of small ranges that don't receive traffic anymore. For example,
if a table's primary keys are ordered by timestamp, and newer rows are more
popular than old rows, it's very possible that newer parts of the table could
get split based on load but then remain small forever even though they don't
receive much traffic anymore.

This won't cripple the cluster, but is less than ideal. Merge support is being
tracked in [#2433](https://github.com/cockroachdb/cockroach/issues/2433).