- Feature Name: Leaseholder Rebalancing
- Status: completed
- Start Date: 2016-10-26
- Authors: Peter Mattis
- RFC PR: [#10262](https://github.com/cockroachdb/cockroach/pull/10262)
- Cockroach Issue: [#9462](https://github.com/cockroachdb/cockroach/issues/9462) [#9435](https://github.com/cockroachdb/cockroach/issues/9435)

# Summary

Periodically rebalance range leaseholders in order to distribute the per-leaseholder work.

# Motivation

The primary goal of ensuring leaseholders are distributed throughout a cluster is to avoid scenarios in which a node is unable to rebalance replicas away because of the restriction that we refuse to rebalance a replica which holds a range lease. This restriction is present in order to prevent an availability hiccup on the range when the leaseholder is removed from it.

It is interesting to note a problematic behavior of the current system. The current leaseholder will extend its lease as long as it is receiving operations for a range. And when a range is split, the lease for the left-hand side of the split is cloned and given to the right-hand side of the split. The combined effect is that a newly created cluster that has continuous load applied against it will see a single node slurp up all of the range leases, which causes a severe replica imbalance (since we can't rebalance away from the leaseholder) as well as a performance bottleneck. We actually see increased performance by periodically killing nodes in the cluster.

The second goal is to more evenly distribute load in a cluster. The leaseholder for a range has extra duties when compared to a follower: it performs all reads for a range and proposes almost all writes.
[Proposer evaluated KV](20160420_proposer_evaluated_kv.md) will reduce the cost of write KV operations on followers, exacerbating the difference between leaseholders and followers. These extra duties impose additional load on the leaseholder, making it desirable to spread that load throughout a cluster in order to improve performance.

The last goal is to place the leaseholder for a range near the gateway node that is accessing the range in order to minimize network RTT. As an obvious example: it is preferable for the leaseholder to be in the same datacenter as the gateway node. Note that there is usually more than one gateway node accessing a range, and there will be common workloads where the traffic from gateway nodes is not coming from a single locality, making it impossible to satisfy this goal. In general, we'd like to minimize the aggregate RTT for accessing the range, which makes the mixture of reads and writes important (reads only need a round-trip from the gateway to the leaseholder, while writes need a round-trip to the leaseholder and from the leaseholder to a quorum of followers). Also, this goal is at odds with the second goal of distributing load throughout a cluster, and we'll need to be careful with heuristics here. It may be beneficial to place the leaseholder in the same datacenter as the gateways accessing the range, but doing so can lower total throughput depending on the workload and if the latencies between datacenters are small.

# Detailed design

This RFC is intended to address the first two goals and punt on the last one (load-based leaseholder placement). Note that addressing the second goal of evenly distributing leaseholders across a cluster will also address the first goal of the inability to rebalance a replica away from the leaseholder, as we'll always have sufficient non-leaseholder replicas in order to perform rebalancing.
Leaseholder rebalancing will be performed using a similar mechanism to replica rebalancing. The periodically gossiped `StoreCapacity` proto will be extended with a `lease_count` field. We will also reuse the overfull/underfull classification used for replica rebalancing, where overfull indicates a store that has x% more leases than the average and underfull indicates a store that has x% fewer leases than the average. Note that the average is computed using the candidate stores, not all stores in the cluster. Currently, `replicateQueue` has the following logic:

1. If the range has dead replicas, remove them.
2. If the range is under-replicated, add a replica.
3. If the range is over-replicated, remove a replica.
4. If the range needs rebalancing, add a replica.

The proposal is to add the following logic (after the above replica rebalancing logic):

5. If the leaseholder is on an overfull store, transfer the lease to the least loaded follower below the mean.
6. If the leaseholder's store has a leaseholder count above the mean and one of the followers has an underfull leaseholder count, transfer the lease to the least loaded follower.

# Testing

Individual rebalancing heuristics can be unit tested, but seeing how those heuristics interact with a real cluster can often reveal surprising behavior. We have an existing allocation simulation framework, but it has seen infrequent use. As an alternative, the `zerosum` tool has been useful in examining rebalancing heuristics. We propose to fork `zerosum` and create a new `allocsim` tool which will create a local N-node cluster with controls for generating load and using smaller range sizes to test various rebalancing scenarios.
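Since the heuristics lend themselves to unit testing in isolation, the decision in steps 5 and 6 above can be isolated as a pure function. The following is a minimal sketch only: the types, helper names, and the 5% threshold are illustrative, not the actual `replicateQueue` code or the real threshold value.

```go
package main

import "fmt"

// storeLeases is a simplified stand-in for the per-store lease_count
// field proposed for the gossiped StoreCapacity proto.
type storeLeases struct {
	StoreID    int
	LeaseCount int
}

// threshold mirrors the x% overfull/underfull classification; 5% here
// is an illustrative value, not the one used for replica rebalancing.
const threshold = 0.05

func mean(stores []storeLeases) float64 {
	var sum int
	for _, s := range stores {
		sum += s.LeaseCount
	}
	return float64(sum) / float64(len(stores))
}

func overfull(s storeLeases, m float64) bool  { return float64(s.LeaseCount) > m*(1+threshold) }
func underfull(s storeLeases, m float64) bool { return float64(s.LeaseCount) < m*(1-threshold) }

// shouldTransferLease applies steps 5 and 6: transfer if the
// leaseholder's store is overfull, or if it is above the mean and some
// follower's store is underfull. The target is the least loaded
// follower below the mean; -1 means no transfer.
func shouldTransferLease(leaseholder storeLeases, followers []storeLeases) int {
	// The mean is computed over the candidate stores only, not the
	// whole cluster.
	all := make([]storeLeases, 0, len(followers)+1)
	all = append(all, followers...)
	all = append(all, leaseholder)
	m := mean(all)

	transfer := overfull(leaseholder, m) // step 5
	if !transfer && float64(leaseholder.LeaseCount) > m {
		for _, f := range followers { // step 6
			if underfull(f, m) {
				transfer = true
				break
			}
		}
	}
	if !transfer {
		return -1
	}
	best, bestCount := -1, leaseholder.LeaseCount
	for _, f := range followers {
		if float64(f.LeaseCount) < m && f.LeaseCount < bestCount {
			best, bestCount = f.StoreID, f.LeaseCount
		}
	}
	return best
}

func main() {
	lh := storeLeases{StoreID: 1, LeaseCount: 90}
	followers := []storeLeases{{StoreID: 2, LeaseCount: 10}, {StoreID: 3, LeaseCount: 20}}
	fmt.Println(shouldTransferLease(lh, followers)) // overfull store 1 sheds to store 2
}
```

A function shaped like this can be driven directly by table tests, while cluster-level interactions are left to `allocsim`.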
# Future Directions

We eventually need to provide load-based leaseholder placement, both to place leaseholders close to gateway nodes and to more accurately balance load across a cluster. Balancing load by balancing replica counts or leaseholder counts does not capture differences in per-range activity. For example, one table might be significantly more active than others in the system, making it desirable to distribute the ranges in that table more evenly.

At a high level, expected load on a node is proportional to the number of replicas/leaseholders on the node. A more accurate approximation is that it is proportional to the number of bytes on the node (though this can be thwarted by an administrator who recognizes a particular table has higher load and thus sets the target range size smaller). Rather than balancing replica/leaseholder counts, we could balance based on the range size (i.e. the "used-bytes" metric).

The second idea is to account for actual load on ranges. The simple approach to doing this is to maintain an exponentially decaying stat of operations per range and to multiply this metric by the range size, giving us a range momentum metric. We then balance the range momentum metric across nodes. There are difficulties to making this work well, the primary one being that load (and thus momentum) can change rapidly, and we want to avoid the system being overly sensitive to such changes. Transferring leaseholders is relatively inexpensive, but not free. Rebalancing a range is fairly heavyweight and can impose a systemic drain on system resources if done too frequently.

Range momentum by itself does not aid in load-based leaseholder placement.
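The decaying per-range stat described above could be maintained roughly as follows. This is a sketch under stated assumptions: the names, the decay factor, and decaying on fixed-interval ticks (rather than by wall time) are illustrative choices, not CockroachDB's actual implementation.

```go
package main

import "fmt"

// rangeStats tracks an exponentially decaying estimate of operations
// against a range; field and method names are illustrative.
type rangeStats struct {
	decayedOps float64 // exponentially decaying operation count
	sizeBytes  int64   // current range size (the "used-bytes" metric)
}

// alpha is the per-interval decay factor; a real system would tune
// this and likely decay by elapsed wall time instead of fixed ticks.
const alpha = 0.9

// record folds one interval's operation count into the decayed stat.
func (rs *rangeStats) record(ops float64) {
	rs.decayedOps = alpha*rs.decayedOps + (1-alpha)*ops
}

// momentum is the proposed balancing metric: decayed load times size.
func (rs *rangeStats) momentum() float64 {
	return rs.decayedOps * float64(rs.sizeBytes)
}

func main() {
	rs := &rangeStats{sizeBytes: 64 << 20} // a 64 MiB range
	// Two busy intervals followed by two idle ones: the momentum rises,
	// then decays smoothly instead of dropping to zero immediately.
	for _, ops := range []float64{100, 100, 0, 0} {
		rs.record(ops)
		fmt.Printf("momentum: %.0f\n", rs.momentum())
	}
}
```

The smoothing is what guards against the over-sensitivity concern above: a brief load spike moves the metric only gradually, so the rebalancer is not whipsawed into transfers it would immediately undo.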
For that we'll need to pass additional information in each `BatchRequest` indicating the locality of the originator of the request, and to maintain per-range metrics about how much load a range is seeing from each locality. The rebalancer would then attempt to place leases such that they are spread within the localities they are receiving load from, modulo their other placement constraints (i.e. diversity).

# Drawbacks

* The proposed leaseholder rebalancing mechanisms require a transfer lease operation. We have such an operation for use in testing, but it isn't ready for use in production (yet). This should be rectified soon.

# Alternatives

* Rather than placing the leaseholder rebalancing burden on `replicateQueue`, we could perform rebalancing when leases are acquired/extended. This would work with the current expiration-based leases, but not with [epoch-based](20160210_range_leases.md) leases.

* The overfull/underfull heuristics for leaseholder rebalancing mirror the heuristics for replica rebalancing. For leaseholder rebalancing we could consider other heuristics. For example, we could periodically transfer leases at random. We have some experimental evidence that this is better than the status quo, from tests which periodically restart nodes and thus cause the leases on those nodes to be redistributed in the cluster. The downside to this approach is that it isn't clear how to extend it to support more sophisticated decisions such as load-based leaseholder rebalancing. Random transfers also have the disadvantage of causing minor availability disruptions. The system should be able to reach an equilibrium in which lease transfers are rare.

* Another signal that could be used in conjunction with the proposed overfull/underfull heuristic is the time since the lease was last transferred.
If we disallow frequent transfers, we can prevent thrashing and enforce an upper bound on the rate of transfer-related "hiccups". The time since the last lease transfer can also help us choose which lease to transfer from an overfull store. This signal will be explored if testing shows thrashing is a problem.

* A simple mechanism for avoiding thrashing (moving leases back and forth) is to use a larger value for the overfull/underfull threshold. This satisfies the primary goal of the RFC at the expense of the second goal of balancing leases for improved performance.

# Unresolved questions