- Feature Name: Rebalancing plans for 1.1
- Status: in-progress
- Start Date: 2017-06-02
- Authors: Alex Robinson
- RFC PR: [#16296](https://github.com/cockroachdb/cockroach/pull/16296)
- Cockroach Issue:
  - [#12996](https://github.com/cockroachdb/cockroach/issues/12996)
  - [#15988](https://github.com/cockroachdb/cockroach/issues/15988)
  - [#17979](https://github.com/cockroachdb/cockroach/issues/17979)

# Summary

Lay out plans for which rebalancing improvements to make (or not make) in the
1.1 release and designs for how to implement them.

# Background / Motivation

We’ve made a couple of efforts over the past year to improve the balance of
[replicas](20160503_rebalancing_v2.md) and
[leases](20170125_leaseholder_locality.md) across a cluster, but our balancing
algorithms still don’t take into account everything that a user might care
about balancing within their cluster. This document puts forth plans for what
we’ll work on with respect to rebalancing during the 1.1 release cycle. In
particular, four different improvements have been proposed.

## Balancing disk capacity, not just number of ranges ("size-based rebalancing")

Our existing rebalancing heuristics only consider the number of ranges on each
node, not the number of bytes, effectively assuming that all ranges are the
same size. This is a flawed assumption -- a large number of empty ranges can be
created when a user drops/truncates a table or runs a restore from backup that
fails to finish. Not considering the size of the ranges in rebalancing can lead
to some nodes containing far more data than others.

## Balancing request load, not just number of ranges ("load-based rebalancing")

Similarly, the rebalancing heuristics do not consider the amount of load on
each node when making placement decisions. While this works great for some of
our load generators (e.g. kv), it can cause problems with others like ycsb and
with many real-world workloads if many of the most popular ranges end up on the
same node. When deciding whether to move a given range, we should consider how
much load is on that range and on each of the candidate nodes.

## Moving replicas closer to where their load is coming from ("load-based replica locality")

For the 1.0 release, [we added lease transfer
heuristics](20170125_leaseholder_locality.md) that move leases closer to where
requests are coming from in high-latency environments. It’s easy to imagine a
similar heuristic for moving ranges -- if a lot of requests for a range are
coming from a locality that doesn’t have a replica of the range, then we should
add a replica there. That will then enable the lease-transferring heuristics to
transfer the lease there if appropriate, reducing the latency to access the
range.

## Splitting ranges based on load ("load-based splitting")

A single hot range can become a bottleneck. We currently only split ranges when
they hit a size threshold, meaning that all of a cluster’s load could be to a
single range and we wouldn’t do anything about it, even if there are other
nodes in the cluster (that don’t contain the hot range) that are idle. While
splitting decisions may seem somewhat separate from rebalancing decisions, in
some situations splitting a hot range would allow us to more evenly distribute
the load across the cluster by rebalancing one of the halves.

This is so important for performance that we already support manually
introducing range splits, but an automated approach would be more appropriate
as a permanent solution.

# Detailed Design

## Balancing based on multiple factors

Currently when we’re scoring a potential replica rebalance, we only have to
consider the relevant zone config settings and the number of replicas on each
store. This allows us to effectively treat all replicas as if they’re exactly
the same. Adding in factors like the size of the range and the QPS to a range
invalidates that assumption, and forces us to consider how a replica differs
from the typical replica along each dimension. For example, if node 1 has fewer
replicas than node 2 but more bytes stored on it, then we might be willing to
move a big replica from node 1 to node 2 or a small replica from node 2 to
node 1, but wouldn’t want the inverse moves.

Thus, in addition to knowing the size or QPS of the particular range we’re
considering rebalancing, we’ll also want some idea of the distribution of size
or QPS per range for the replicas in a store. This will mean periodically
iterating over all the replicas in a store to aggregate statistics so that we
can know whether a range is larger/smaller than others or has more/less QPS
than others. Specifically, we'll try computing a few percentiles to help pick
out the true outliers that would have the greatest effect on the cluster's
balance.

We can then compute rebalance scores by considering the percentiles of a
replica and the under/over-fullness of stores amongst all the considered
dimensions. We will prefer moving replicas at high percentiles away from stores
that are overfull for that dimension toward stores that are less full for that
dimension (and vice versa for low percentiles and underfull stores, under the
expectation that the removed replicas can be replaced by higher-percentile
replicas). The more extreme a given percentile and under/over-fullness are, the
more weight we give to that dimension. These heuristics will allow us to
combine the different dimensions into a single final score, and should be
covered by a large number of test cases to ensure stability in different
scenarios.
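To make the combination a bit more concrete, here is a minimal sketch of how
per-dimension percentiles and store fullness could be folded into one score.
Everything here (`dimension`, `rebalanceScore`, the normalization choices) is
an illustrative assumption, not existing allocator code:

```go
package rebalance

// dimension captures, for one balancing signal (range count, bytes, or QPS),
// where a replica and a candidate store fall relative to their peers.
type dimension struct {
	// replicaPercentile is where this replica sits among the store's
	// replicas for this signal (0.0 = smallest, 1.0 = largest).
	replicaPercentile float64
	// storeFullness is the store's deviation from the cluster mean for
	// this signal, normalized so 0 is exactly at the mean, positive is
	// overfull, and negative is underfull.
	storeFullness float64
}

// rebalanceScore combines all dimensions into one number. Positive values
// mean "prefer moving this replica off this store"; negative values mean
// "prefer keeping it (or adding one) here".
func rebalanceScore(dims []dimension) float64 {
	var score float64
	for _, d := range dims {
		// Center the percentile so outliers dominate: +0.5 for the
		// store's largest replica, -0.5 for its smallest.
		outlierness := d.replicaPercentile - 0.5
		// A high-percentile replica on an overfull store (both terms
		// positive) or a low-percentile replica on an underfull store
		// (both negative) adds "move it away" pressure; the more extreme
		// the percentile and the fullness, the larger this dimension's
		// contribution.
		score += outlierness * d.storeFullness
	}
	return score
}
```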
## Size-based rebalancing

Taking size into account seems like the simplest modification of our existing
rebalancing logic, but even so there are a variety of available approaches:

1. We already gossip each store’s total disk capacity and unused disk capacity.
   We could start trying to balance unused disk capacity across all the nodes
   of the cluster. That would mean that in the case of heterogeneous disk
   sizes, nodes with smaller disks might not get much (if any) data rebalanced
   to them if the cluster doesn’t have much data.

2. We could try to balance used disk capacity (i.e. total - unused). In
   heterogeneous clusters, this would mean that some nodes would fill up way
   before others (and potentially way before the cluster fills up as a whole).
   Situations in which some nodes but not others are full are not regularly
   tested yet, so we may have to start testing them if we go this way.

3. We could try to balance the fraction of the disk used. This is the happy
   compromise between the previous two options -- it will put data onto nodes
   with smaller disks right from the beginning (albeit less data), and it
   shouldn’t often lead to smaller nodes filling up way before others.

The first option most directly parallels our existing logic that only attempts
to balance the number of replicas without considering the size of each node’s
disk, but the third option appears best overall. It’s likely that we’ll want to
change the replica logic as part of this work to take disk size into account,
such that we’ll balance replicas per GB of disk rather than absolute number of
replicas.
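As a rough illustration of option 3, the signal to balance would be each
store's fraction of disk used, compared against the cluster-wide mean. This is
only a sketch; the field names do not match the actual gossiped capacity proto:

```go
package rebalance

// storeCapacity holds the disk numbers we already gossip per store.
// Field names are illustrative.
type storeCapacity struct {
	capacityBytes  int64 // total disk size
	availableBytes int64 // unused disk space
}

// fractionUsed is the signal option 3 would balance on.
func fractionUsed(c storeCapacity) float64 {
	if c.capacityBytes == 0 {
		return 0
	}
	return float64(c.capacityBytes-c.availableBytes) / float64(c.capacityBytes)
}

// overfullByFraction reports whether a store is using a noticeably larger
// fraction of its disk than the cluster average, making it a candidate to
// shed replicas regardless of how big its disk is.
func overfullByFraction(c storeCapacity, clusterMeanFraction, threshold float64) bool {
	return fractionUsed(c) > clusterMeanFraction+threshold
}
```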
## Load-based rebalancing

As part of our [leaseholder locality](20170125_leaseholder_locality.md) work,
we started tracking how many requests each range’s leaseholder receives. This
gives us a QPS number for each leaseholder replica, but no data for replicas
that aren’t leaseholders. If we left things this way, our replica rebalancing
would suddenly take a dependency on the cluster’s current distribution of
leaseholders, which is a scary thought given that leaseholder rebalancing
conceptually already depends on replica rebalancing (because it can only
balance leases to where the replicas are). As a result, I think we’ll want to
start tracking the number of applied commands on each replica instead of
relying on the existing leaseholder QPS.

Once we have that per-replica QPS, though, we can aggregate it at the store
level and start including it in the store’s capacity gossip messages to use it
in balancing much like disk space.
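A minimal sketch of the bookkeeping this implies, using hypothetical
`replicaLoad` and `storeCommandRate` helpers (none of which correspond to
existing code): each replica counts the commands it applies over a window, and
the store sums the resulting rates into the value it gossips.

```go
package rebalance

import (
	"sync/atomic"
	"time"
)

// replicaLoad is a hypothetical per-replica counter that every replica bumps
// as it applies a Raft command, whether or not it holds the lease.
type replicaLoad struct {
	appliedCommands int64 // updated atomically
	windowStart     time.Time
}

// recordApplied is called once per applied command.
func (l *replicaLoad) recordApplied() {
	atomic.AddInt64(&l.appliedCommands, 1)
}

// rate returns commands applied per second over the current window.
func (l *replicaLoad) rate(now time.Time) float64 {
	elapsed := now.Sub(l.windowStart).Seconds()
	if elapsed <= 0 {
		return 0
	}
	return float64(atomic.LoadInt64(&l.appliedCommands)) / elapsed
}

// storeCommandRate aggregates per-replica rates into the single number a
// store would include in its capacity gossip message.
func storeCommandRate(replicas []*replicaLoad, now time.Time) float64 {
	var total float64
	for _, r := range replicas {
		total += r.rate(now)
	}
	return total
}
```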
## Load-based replica locality

This is where things get tricky -- while the above goals are about bringing the
cluster into greater balance, trying to move replicas toward the load is likely
to reduce the balance within the cluster. Reducing the thrashing involved in
the leaseholder locality project was quite a lot of work and still isn’t
resilient to certain configurations. When we’re talking about moving replicas
rather than just transferring leases, the cost of thrashing skyrockets because
snapshots consume a lot of disk/network bandwidth.

This also conflicts with one of our design goals from [the original rebalancing
RFC](20150819_stateless_replica_relocation.md), which is that the decision to
make any individual operation should be stateless. Because the counts of
requests by locality are only tracked on the leaseholder, these types of
decisions are inherently stateful, so we should tread carefully in making them.

In the interest of not creating problem cases for users, I’d suggest pushing
this back until we have known demand for it. Custom zone configs paired with
leaseholder locality already do a pretty good job of enabling low-latency
access to data.

## Load-based splitting

Load-based splitting is conceptually pretty simple, but will likely produce
some edge cases in practice. Consider a few representative examples:

1. A range gets a lot of requests for single keys, evenly distributed over the
   range. Splitting will help a lot.

2. A range gets a lot of requests for just a couple of individual keys (and
   the hot requests don't touch multiple hot keys in the same query, a la case
   4). Splitting will help if and only if the split is between the hot keys.

3. A range gets a lot of requests for just a single key. Splitting won’t help
   at all.

4. A range gets a lot of scan requests or other requests that touch multiple
   keys. Splitting could actually make things worse by flipping an operation
   from a single-range operation into a multi-range one.

Given these possibilities, it’s clear that we’re going to need more granular
information than how many requests a range is receiving in order to decide
whether to split a range. What we really need is something that will keep track
of the hottest keys (or key spans) in the hottest ranges. This is basically a
streaming top-k problem, and there are plenty of algorithms that have been
written about that should work for us given that we only need approximate
results.

It’s also worth noting that we’ll only need such stats for ranges that have a
high enough QPS to justify splitting. Thus, our approach will look something
like:

1. Track the QPS to each leaseholder (which we’re already doing as of
   [#13426](https://github.com/cockroachdb/cockroach/pull/13426)).

2. If a given range’s QPS is abnormally high (by virtue of comparing to the
   other ranges), start recording the approximate top-k key spans.
   Correspondingly, if a range's QPS drops back down and we had been tracking
   its top-k key spans, we should notice this and stop.

3. Periodically check the top key spans for these top ranges and determine if
   splitting would allow for better distributing the load without creating too
   many more multi-range operations. Picking a split point and determining
   whether it'd be beneficial to split there could be done by sorting the top
   key spans and, between each of them, comparing how many requests would be
   to spans that are to the left of, to the right of, or overlapping that
   possible split point.

4. If a good split point was found, do the split.

5. Sit back and let load-based rebalancing do its thing.

This will take a bit of work to finish, and isn’t critical for 1.1, but would
be a nice addition and comes with much less downside risk than something like
load-based replica locality. We’ll try to get to it if we have the time;
otherwise we can implement it for 1.2.

### Alternatives

The approximate top-k approach to determining split points is fairly precise,
but also adds some fairly complex logic to the hot code path for serving
requests to replicas. A simpler alternative would be for us to do the following
for each hot range (see the sketch below):

1. Pick a possible split point (the mid-point of the range to start with).

1. For each incoming request to the hot replica, record whether the request is
   to the left side, the right side, or both.

1. After a while, examine the results. If most of the requests touched both
   sides, abandon trying to split the range. If most of the requests were split
   pretty evenly between left and right, make the split at the tested key. If
   the results were pretty uneven, try moving the possible split point in the
   direction that received more requests and try again, a la binary search.
   After O(log n) possible split points, we'll either find a decent split point
   or determine that there isn't an equitable one (because the requests are
   mostly to a single key).

In fact, even if we do use a top-k approach, testing out the split point like
this before making the split might still be smart to ensure that all of the
spans that weren't included in the top-k aren't touching both sides of the
split.
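Here is a minimal sketch of that probing approach, built around a hypothetical
`splitProbe` type. The real key types and request spans are more involved; this
only illustrates the left/right/both counting and the decision step:

```go
package rebalance

import "bytes"

// splitProbe counts, for one candidate split key, how many requests fall
// entirely to its left, entirely to its right, or touch both sides.
type splitProbe struct {
	splitKey []byte
	left     int64
	right    int64
	both     int64
}

// record classifies a request covering the key span [start, end).
func (p *splitProbe) record(start, end []byte) {
	switch {
	case bytes.Compare(end, p.splitKey) <= 0:
		p.left++
	case bytes.Compare(start, p.splitKey) >= 0:
		p.right++
	default:
		p.both++
	}
}

// evaluate implements the decision step described above: give up if too many
// requests straddle the candidate split point, accept it if the two sides are
// roughly even, and otherwise report which direction to move for the next
// round of the binary search.
func (p *splitProbe) evaluate() (accept, moveRight, giveUp bool) {
	total := p.left + p.right + p.both
	if total == 0 {
		// No data yet; keep probing at the current key.
		return false, false, false
	}
	if float64(p.both)/float64(total) > 0.5 {
		// Most requests touch both sides; splitting here would mostly
		// turn single-range operations into multi-range ones.
		return false, false, true
	}
	leftFraction := float64(p.left) / float64(p.left+p.right)
	if leftFraction > 0.4 && leftFraction < 0.6 {
		// Close enough to even; split at the tested key.
		return true, false, false
	}
	// Move the candidate key toward the hotter side and try again.
	return false, p.right > p.left, false
}
```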
Finally, the simplest alternative of all (proposed by bdarnell on #16296) is
to not do load-based splitting at all, and instead just split more eagerly for
tables with a small number of ranges (where "small" could reasonably be defined
as "less than the number of nodes in the cluster"). This wouldn't help with
steady-state load at all, but it would help with the arguably more common
scenario of a "big bang" of data growth when a service launches or during a
bulk load of data.

### Drawbacks

Splitting ranges based on load could, for certain request patterns, lead to a
large build-up of small ranges that don't receive traffic anymore. For example,
if a table's primary keys are ordered by timestamp, and newer rows are more
popular than old rows, it's very possible that newer parts of the table could
get split based on load but then remain small forever even though they don't
receive much traffic anymore.

This won't cripple the cluster, but is less than ideal. Merge support is being
tracked in [#2433](https://github.com/cockroachdb/cockroach/issues/2433).