- Feature Name: Leaseholder Rebalancing
- Status: completed
- Start Date: 2016-10-26
- Authors: Peter Mattis
- RFC PR: [#10262](https://github.com/cockroachdb/cockroach/pull/10262)
- Cockroach Issue: [#9462](https://github.com/cockroachdb/cockroach/issues/9462) [#9435](https://github.com/cockroachdb/cockroach/issues/9435)

# Summary

Periodically rebalance range leaseholders in order to distribute the per-leaseholder work.

# Motivation

The primary goal of ensuring leaseholders are distributed throughout a cluster is to avoid scenarios in which a node is unable to rebalance replicas away because of the restriction that we refuse to rebalance a replica which holds a range lease. This restriction is present in order to prevent an availability hiccup on the range when the leaseholder is removed from it.

It is interesting to note a problematic behavior of the current system. The current leaseholder will extend its lease as long as it is receiving operations for a range. And when a range is split, the lease for the left-hand side of the split is cloned and given to the right-hand side of the split. The combined effect is that a newly created cluster that has continuous load applied against it will see a single node slurp up all of the range leases, which causes a severe replica imbalance (since we can't rebalance away from the leaseholder) as well as a performance bottleneck. We actually see increased performance by periodically killing nodes in the cluster.

The second goal is to more evenly distribute load in a cluster. The leaseholder for a range has extra duties when compared to a follower: it performs all reads for a range and proposes almost all writes.
[Proposer evaluated KV](20160420_proposer_evaluated_kv.md) will reduce the cost of write KV operations on followers, exacerbating the difference between leaseholders and followers. These extra duties impose additional load on the leaseholder, making it desirable to spread that load throughout a cluster in order to improve performance.

The last goal is to place the leaseholder for a range near the gateway node that is accessing the range in order to minimize network RTT. As an obvious example: it is preferable for the leaseholder to be in the same datacenter as the gateway node. Note that there is usually more than one gateway node accessing a range, and there will be common workloads where the traffic from gateway nodes is not coming from a single locality, making it impossible to satisfy this goal. In general, we'd like to minimize the aggregate RTT for accessing the range, which makes the mixture of reads and writes important (reads only need a round-trip from the gateway to the leaseholder, while writes need a round-trip to the leaseholder and from the leaseholder to a quorum of followers). Also, this goal is at odds with the second goal of distributing load throughout a cluster, and we'll need to be careful with heuristics here. It may be beneficial to place the leaseholder in the same datacenter as the gateways accessing the range, but doing so can lower total throughput depending on the workload and if the latencies between datacenters are small.

# Detailed design

This RFC is intended to address the first two goals and punt on the last one (load-based leaseholder placement). Note that addressing the second goal of evenly distributing leaseholders across a cluster will also address the first goal of the inability to rebalance a replica away from the leaseholder, as we'll always have sufficient non-leaseholder replicas in order to perform rebalancing.
Leaseholder rebalancing will be performed using a similar mechanism to replica rebalancing. The periodically gossiped `StoreCapacity` proto will be extended with a `lease_count` field. We will also reuse the overfull/underfull classification used for replica rebalancing, where overfull indicates a store that has x% more leases than the average and underfull indicates a store that has x% fewer leases than the average. Note that the average is computed using the candidate stores, not all stores in the cluster. Currently, `replicateQueue` has the following logic:

1. If the range has dead replicas, remove them.
2. If the range is under-replicated, add a replica.
3. If the range is over-replicated, remove a replica.
4. If the range needs rebalancing, add a replica.

The proposal is to add the following logic (after the above replica rebalancing logic):

5. If the leaseholder is on an overfull store, transfer the lease to the least loaded follower below the mean.
6. If the leaseholder's store has a leaseholder count above the mean and one of the followers has an underfull leaseholder count, transfer the lease to the least loaded follower.

# Testing

Individual rebalancing heuristics can be unit tested, but seeing how those heuristics interact with a real cluster can often reveal surprising behavior. We have an existing allocation simulation framework, but it has seen infrequent use. As an alternative, the `zerosum` tool has been useful in examining rebalancing heuristics. We propose to fork `zerosum` and create a new `allocsim` tool which will create a local N-node cluster with controls for generating load and using smaller range sizes to test various rebalancing scenarios.
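Since the heuristics lend themselves to unit testing in isolation, the decision in steps 5 and 6 above can be isolated as a pure function. The following is a minimal sketch only: the types, helper names, and the 5% threshold are illustrative, not the actual `replicateQueue` code or the real threshold value.

```go
package main

import "fmt"

// storeLeases is a simplified stand-in for the per-store lease_count
// field proposed for the gossiped StoreCapacity proto.
type storeLeases struct {
	StoreID    int
	LeaseCount int
}

// threshold mirrors the x% overfull/underfull classification; 5% here
// is an illustrative value, not the one used for replica rebalancing.
const threshold = 0.05

func mean(stores []storeLeases) float64 {
	var sum int
	for _, s := range stores {
		sum += s.LeaseCount
	}
	return float64(sum) / float64(len(stores))
}

func overfull(s storeLeases, m float64) bool  { return float64(s.LeaseCount) > m*(1+threshold) }
func underfull(s storeLeases, m float64) bool { return float64(s.LeaseCount) < m*(1-threshold) }

// shouldTransferLease applies steps 5 and 6: transfer if the
// leaseholder's store is overfull, or if it is above the mean and some
// follower's store is underfull. The target is the least loaded
// follower below the mean; -1 means no transfer.
func shouldTransferLease(leaseholder storeLeases, followers []storeLeases) int {
	// The mean is computed over the candidate stores only, not the
	// whole cluster.
	all := make([]storeLeases, 0, len(followers)+1)
	all = append(all, followers...)
	all = append(all, leaseholder)
	m := mean(all)

	transfer := overfull(leaseholder, m) // step 5
	if !transfer && float64(leaseholder.LeaseCount) > m {
		for _, f := range followers { // step 6
			if underfull(f, m) {
				transfer = true
				break
			}
		}
	}
	if !transfer {
		return -1
	}
	best, bestCount := -1, leaseholder.LeaseCount
	for _, f := range followers {
		if float64(f.LeaseCount) < m && f.LeaseCount < bestCount {
			best, bestCount = f.StoreID, f.LeaseCount
		}
	}
	return best
}

func main() {
	lh := storeLeases{StoreID: 1, LeaseCount: 90}
	followers := []storeLeases{{StoreID: 2, LeaseCount: 10}, {StoreID: 3, LeaseCount: 20}}
	fmt.Println(shouldTransferLease(lh, followers)) // overfull store 1 sheds to store 2
}
```

A function shaped like this can be driven directly by table tests, while cluster-level interactions are left to `allocsim`.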
# Future Directions

We eventually need to provide load-based leaseholder placement, both to place leaseholders close to gateway nodes and to more accurately balance load across a cluster. Balancing load by balancing replica counts or leaseholder counts does not capture differences in per-range activity. For example, one table might be significantly more active than others in the system, making it desirable to distribute the ranges in that table more evenly.

At a high level, expected load on a node is proportional to the number of replicas/leaseholders on the node. A more accurate approximation is that it is proportional to the number of bytes on the node (though this can be thwarted by an administrator who recognizes a particular table has higher load and thus sets the target range size smaller). Rather than balancing replica/leaseholder counts, we could balance based on the range size (i.e. the "used-bytes" metric).

The second idea is to account for actual load on ranges. The simple approach to doing this is to maintain an exponentially decaying stat of operations per range and to multiply this metric by the range size, giving us a range momentum metric. We then balance the range momentum metric across nodes. There are difficulties to making this work well, the primary one being that load (and thus momentum) can change rapidly, and we want to avoid the system being overly sensitive to such changes. Transferring leaseholders is relatively inexpensive, but not free. Rebalancing a range is fairly heavyweight and can impose a systemic drain on system resources if done too frequently.

Range momentum by itself does not aid in load-based leaseholder placement.
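The decaying per-range stat described above could be maintained roughly as follows. This is a sketch under stated assumptions: the names, the decay factor, and decaying on fixed-interval ticks (rather than by wall time) are illustrative choices, not CockroachDB's actual implementation.

```go
package main

import "fmt"

// rangeStats tracks an exponentially decaying estimate of operations
// against a range; field and method names are illustrative.
type rangeStats struct {
	decayedOps float64 // exponentially decaying operation count
	sizeBytes  int64   // current range size (the "used-bytes" metric)
}

// alpha is the per-interval decay factor; a real system would tune
// this and likely decay by elapsed wall time instead of fixed ticks.
const alpha = 0.9

// record folds one interval's operation count into the decayed stat.
func (rs *rangeStats) record(ops float64) {
	rs.decayedOps = alpha*rs.decayedOps + (1-alpha)*ops
}

// momentum is the proposed balancing metric: decayed load times size.
func (rs *rangeStats) momentum() float64 {
	return rs.decayedOps * float64(rs.sizeBytes)
}

func main() {
	rs := &rangeStats{sizeBytes: 64 << 20} // a 64 MiB range
	// Two busy intervals followed by two idle ones: the momentum rises,
	// then decays smoothly instead of dropping to zero immediately.
	for _, ops := range []float64{100, 100, 0, 0} {
		rs.record(ops)
		fmt.Printf("momentum: %.0f\n", rs.momentum())
	}
}
```

The smoothing is what guards against the over-sensitivity concern above: a brief load spike moves the metric only gradually, so the rebalancer is not whipsawed into transfers it would immediately undo.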
For that we'll need to pass additional information in each `BatchRequest` indicating the locality of the originator of the request, and to maintain per-range metrics about how much load a range is seeing from each locality. The rebalancer would then attempt to place leases such that they are spread within the localities they are receiving load from, modulo their other placement constraints (i.e. diversity).

# Drawbacks

* The proposed leaseholder rebalancing mechanisms require a transfer lease operation. We have such an operation for use in testing, but it isn't ready for use in production (yet). This should be rectified soon.

# Alternatives

* Rather than placing the leaseholder rebalancing burden on `replicateQueue`, we could perform rebalancing when leases are acquired/extended. This would work with the current expiration-based leases, but not with [epoch-based](20160210_range_leases.md) leases.

* The overfull/underfull heuristics for leaseholder rebalancing mirror the heuristics for replica rebalancing. For leaseholder rebalancing we could consider other heuristics. For example, we could periodically transfer leases at random. We have some experimental evidence that this is better than the status quo, from tests which periodically restart nodes and thus cause the leases on those nodes to be redistributed in the cluster. The downside to this approach is that it isn't clear how to extend it to support more sophisticated decisions such as load-based leaseholder rebalancing. Random transfers also have the disadvantage of causing minor availability disruptions. The system should be able to reach an equilibrium in which lease transfers are rare.

* Another signal that could be used in conjunction with the proposed overfull/underfull heuristic is the time since the lease was last transferred.
If we disallow frequent transfers, we can prevent thrashing and enforce an upper bound on the rate of transfer-related "hiccups". The time since the last lease transfer can also help us choose which lease to transfer from an overfull store. This signal will be explored if testing shows thrashing is a problem.

* A simple mechanism for avoiding thrashing (moving leases back and forth) is to use a larger value for the overfull/underfull threshold. This satisfies the primary goal of the RFC at the expense of the second goal of balancing leases for improved performance.

# Unresolved questions