- Feature Name: Leaseholder Rebalancing
- Status: completed
- Start Date: 2016-10-26
- Authors: Peter Mattis
- RFC PR: [#10262](https://github.com/cockroachdb/cockroach/pull/10262)
- Cockroach Issue: [#9462](https://github.com/cockroachdb/cockroach/issues/9462) [#9435](https://github.com/cockroachdb/cockroach/issues/9435)

# Summary

Periodically rebalance range leaseholders in order to distribute the
per-leaseholder work.

# Motivation

The primary goal of ensuring leaseholders are distributed throughout a
cluster is to avoid scenarios in which a node is unable to rebalance
replicas away because of the restriction that we refuse to rebalance a
replica which holds a range lease. This restriction is present in
order to prevent an availability hiccup on the range when the
leaseholder is removed from it.

It is interesting to note a problematic behavior of the current
system. The current leaseholder will extend its lease as long as it is
receiving operations for a range, and when a range is split, the lease
for the left-hand side of the split is cloned and given to the
right-hand side. The combined effect is that a newly created cluster
with continuous load applied against it will see a single node slurp
up all of the range leases, which causes both a severe replica
imbalance (since we can't rebalance away from the leaseholder) and a
performance bottleneck. We actually see increased performance by
periodically killing nodes in the cluster.

The second goal is to more evenly distribute load in a cluster. The
leaseholder for a range has extra duties when compared to a follower:
it performs all reads for a range and proposes almost all
writes. [Proposer evaluated KV](20160420_proposer_evaluated_kv.md) will
reduce the cost of write KV operations on followers, exacerbating the
difference between leaseholders and followers. These extra duties
impose additional load on the leaseholder, making it desirable to
spread that load throughout a cluster in order to improve performance.

The last goal is to place the leaseholder for a range near the gateway
node that is accessing the range in order to minimize network RTT. As
an obvious example: it is preferable for the leaseholder to be in the
same datacenter as the gateway node. Note that there is usually more
than one gateway node accessing a range, and there will be common
workloads where the traffic from gateway nodes does not come from a
single locality, making it impossible to satisfy this goal. In general,
we'd like to minimize the aggregate RTT for accessing the range, which
makes the mixture of reads and writes important (reads only need a
round-trip from the gateway to the leaseholder, while writes need a
round-trip to the leaseholder and from the leaseholder to a quorum of
followers). Also, this goal is at odds with the second goal of
distributing load throughout a cluster, and we'll need to be careful
with the heuristics here. It may be beneficial to place the leaseholder
in the same datacenter as the gateways accessing the range, but doing
so can lower total throughput, depending on the workload and on whether
the latencies between datacenters are small.

# Detailed design

This RFC is intended to address the first two goals and punt on the
last one (load-based leaseholder placement). Note that addressing the
second goal of evenly distributing leaseholders across a cluster also
addresses the first goal (the inability to rebalance a replica away
from the leaseholder), since we'll always have sufficient
non-leaseholder replicas with which to perform rebalancing.

Leaseholder rebalancing will be performed using a similar mechanism to
replica rebalancing. The periodically gossiped `StoreCapacity` proto
will be extended with a `lease_count` field. We will also reuse the
overfull/underfull classification used for replica rebalancing, where
overfull indicates a store that has x% more leases than the average
and underfull indicates a store that has x% fewer leases than the
average. Note that the average is computed using the candidate stores,
not all stores in the cluster. Currently, `replicateQueue` has the
following logic:

1. If the range has dead replicas, remove them.
2. If the range is under-replicated, add a replica.
3. If the range is over-replicated, remove a replica.
4. If the range needs rebalancing, add a replica.

The proposal is to add the following logic (after the above replica
rebalancing logic):

5. If the leaseholder is on an overfull store, transfer the lease to
the least loaded follower whose store is below the mean.
6. If the leaseholder's store has a lease count above the mean and one
of the followers is on an underfull store, transfer the lease to the
least loaded follower.
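
Steps 5 and 6 can be sketched as follows. This is an illustrative Go
sketch, not the actual `replicateQueue` code: the `mean` and
`transferTarget` names are invented for the example, and the 5% value
for x is an assumed threshold, not the real one.

```go
package main

import "fmt"

// threshold mirrors the x% overfull/underfull classification used for
// replica rebalancing; 5% is an illustrative value, not the real one.
const threshold = 0.05

// mean returns the average lease count across the candidate stores.
func mean(counts []int) float64 {
	sum := 0
	for _, c := range counts {
		sum += c
	}
	return float64(sum) / float64(len(counts))
}

// transferTarget returns the index of the follower that should receive
// the lease, or -1 if no transfer is warranted. leaseholder is the
// lease count of the leaseholder's store; followers holds the lease
// counts of the follower stores.
func transferTarget(leaseholder int, followers []int) int {
	if len(followers) == 0 {
		return -1
	}
	avg := mean(append([]int{leaseholder}, followers...))
	overfull := avg * (1 + threshold)
	underfull := avg * (1 - threshold)

	// Find the least loaded follower.
	best := 0
	for i, c := range followers {
		if c < followers[best] {
			best = i
		}
	}
	lh := float64(leaseholder)
	bestCount := float64(followers[best])
	switch {
	case lh > overfull && bestCount < avg:
		// Step 5: the leaseholder's store is overfull; transfer to the
		// least loaded follower below the mean.
		return best
	case lh > avg && bestCount < underfull:
		// Step 6: the leaseholder's store is above the mean and a
		// follower's store is underfull.
		return best
	}
	return -1
}

func main() {
	// Leaseholder store is overfull relative to its followers.
	fmt.Println(transferTarget(120, []int{90, 100}))
	// Balanced cluster: no transfer.
	fmt.Println(transferTarget(100, []int{100, 100}))
}
```

Note that both steps transfer to the least loaded follower; they differ
only in which side of the mean must cross the x% threshold before a
transfer is triggered.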
    92  
    93  # Testing
    94  
    95  Individual rebalancing heuristics can be unit tested, but seeing how
    96  those heuristics interact with a real cluster can often reveal
    97  surprising behavior. We have an existing allocation simulation
    98  framework, but it has seen infrequent use. As an alternative, the
    99  `zerosum` tool has been useful in examining rebalancing heuristics. We
   100  propose to fork `zerosum` and create a new `allocsim` tool which will
   101  create a local N-node cluster with controls for generating load and
   102  using smaller range sizes to test various rebalancing scenarios.
   103  
# Future Directions

We eventually need to provide load-based leaseholder placement, both
to place leaseholders close to gateway nodes and to more accurately
balance load across a cluster. Balancing load by balancing replica
counts or leaseholder counts does not capture differences in per-range
activity. For example, one table might be significantly more active
than others in the system, making it desirable to distribute the
ranges in that table more evenly.

At a high level, the expected load on a node is proportional to the
number of replicas/leaseholders on the node. A more accurate
approximation is that it is proportional to the number of bytes on the
node (though this can be thwarted by an administrator who recognizes
that a particular table has higher load and thus sets the target range
size smaller). Rather than balancing replica/leaseholder counts, we
could balance based on the range size (i.e. the "used-bytes" metric).

The second idea is to account for actual load on ranges. The simple
approach to doing this is to maintain an exponentially decaying stat
of operations per range and to multiply this metric by the range size,
giving us a range momentum metric. We would then balance the range
momentum metric across nodes. There are difficulties in making this
work well, the primary one being that load (and thus momentum) can
change rapidly, and we want to avoid the system being overly sensitive
to such changes. Transferring leaseholders is relatively inexpensive,
but not free. Rebalancing a range is fairly heavyweight and can impose
a systemic drain on system resources if done too frequently.
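
The decaying-stat idea can be sketched as follows. This is a
hypothetical illustration (the `rangeMomentum` type and its fields are
invented for the example) of an exponentially weighted moving average
of operations per second, multiplied by range size to give momentum:

```go
package main

import "fmt"

// rangeMomentum sketches the proposed metric: an exponentially
// decaying rate of operations against a range, multiplied by the
// range's size in bytes.
type rangeMomentum struct {
	opsPerSec float64 // exponentially decayed operation rate
	alpha     float64 // smoothing factor in (0, 1]; smaller decays more slowly
}

// record folds the operation rate observed over one measurement
// interval into the decayed average.
func (m *rangeMomentum) record(opsThisInterval float64) {
	m.opsPerSec = m.alpha*opsThisInterval + (1-m.alpha)*m.opsPerSec
}

// momentum is the quantity the allocator would balance across nodes.
func (m *rangeMomentum) momentum(rangeBytes int64) float64 {
	return m.opsPerSec * float64(rangeBytes)
}

func main() {
	m := &rangeMomentum{alpha: 0.25}
	// Three busy intervals followed by an idle one: the decayed rate
	// falls gradually rather than dropping to zero, which damps the
	// allocator's sensitivity to load spikes.
	for _, ops := range []float64{100, 100, 100, 0} {
		m.record(ops)
		fmt.Printf("ops/sec: %.4f\n", m.opsPerSec)
	}
	fmt.Printf("momentum for a 64 MiB range: %.0f\n", m.momentum(64<<20))
}
```

The choice of alpha is the knob that trades responsiveness against the
thrashing concern described above.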

Range momentum by itself does not aid in load-based leaseholder
placement. For that we'll need to pass additional information in each
`BatchRequest` indicating the locality of the originator of the
request, and to maintain per-range metrics about how much load a range
is seeing from each locality. The rebalancer would then attempt to
place leases such that they are spread within the localities they are
receiving load from, modulo their other placement constraints
(i.e. diversity).
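
A minimal sketch of such per-locality accounting might look like the
following. The `localityLoad` type and its methods are hypothetical
names invented for this example, not existing CockroachDB APIs:

```go
package main

import "fmt"

// localityLoad is a hypothetical per-range tally of request load by
// originating locality, as would be fed by a locality field carried on
// each BatchRequest. The rebalancer would prefer lease placement in
// the locality contributing the most load, subject to the range's
// other placement constraints.
type localityLoad struct {
	requests map[string]int64 // locality tiers -> request count
}

func newLocalityLoad() *localityLoad {
	return &localityLoad{requests: make(map[string]int64)}
}

// recordRequest tallies one request from the given originator locality.
func (l *localityLoad) recordRequest(locality string) {
	l.requests[locality]++
}

// hottestLocality returns the locality generating the most requests,
// or "" if none have been recorded.
func (l *localityLoad) hottestLocality() string {
	best, bestCount := "", int64(0)
	for loc, n := range l.requests {
		if n > bestCount {
			best, bestCount = loc, n
		}
	}
	return best
}

func main() {
	l := newLocalityLoad()
	for i := 0; i < 90; i++ {
		l.recordRequest("region=us-east")
	}
	for i := 0; i < 10; i++ {
		l.recordRequest("region=us-west")
	}
	fmt.Println(l.hottestLocality())
}
```

A real implementation would also need to decay these counts over time,
for the same thrashing reasons discussed for range momentum.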

# Drawbacks

* The proposed leaseholder rebalancing mechanisms require a transfer
  lease operation. We have such an operation for use in testing but it
  isn't ready for use in production (yet). This should be rectified
  soon.

# Alternatives

* Rather than placing the leaseholder rebalancing burden on
  `replicateQueue`, we could perform rebalancing when leases are
  acquired/extended. This would work with the current expiration-based
  leases, but not with [epoch-based](20160210_range_leases.md) leases.

* The overfull/underfull heuristics for leaseholder rebalancing
  mirror the heuristics for replica rebalancing. For leaseholder
  rebalancing we could consider other heuristics. For example, we
  could periodically transfer leases at random. We have some
  experimental evidence that this is better than the status quo, due
  to tests which periodically restart nodes and thus cause the leases
  on those nodes to be redistributed in the cluster. The downside to
  this approach is that it isn't clear how to extend it to support
  more sophisticated decisions such as load-based leaseholder
  rebalancing. Random transfers also have the disadvantage of causing
  minor availability disruptions. The system should be able to reach
  an equilibrium in which lease transfers are rare.

* Another signal that could be used in conjunction with the proposed
  overfull/underfull heuristic is the time since the lease was last
  transferred. If we disallow frequent transfers we can prevent
  thrashing and enforce an upper bound on the rate of transfer-related
  "hiccups". The time since last lease transfer can help us choose
  which lease to transfer from an overfull store. This signal will be
  explored if testing shows thrashing is a problem.

* A simple mechanism for avoiding thrashing (moving leases back and
  forth) is to use a larger value for the overfull/underfull
  threshold. This satisfies the primary goal for the RFC at the
  expense of the second goal of balancing leases for improved
  performance.

# Unresolved questions