# Rebalancing

**Last update:** January 2019

**Original author:** Alex Robinson

This document serves as an end-to-end description of the current state of range
and lease rebalancing in Cockroach as of v2.1. The target reader is anyone who
is interested in better understanding how and why replica and lease placement
decisions get made. Little detailed knowledge of core should be needed.

The most complete documentation, of course, is in the code, tests, and the
surrounding comments, but those are necessarily split across several files and
are much tougher to piece together into a logical understanding. That scattered
knowledge is centralized here, without excessive detail that is likely to become
stale.

## Table of Contents

* [Overview](#overview)
* [Considerations](#considerations)
* [Implementation](#implementation)
  * [Replicate Queue](#replicate-queue)
    * [Choosing an action](#choosing-an-action)
    * [Picking an up-replication target](#picking-an-up-replication-target)
    * [Picking a down-replication target](#picking-a-down-replication-target)
    * [Picking a rebalance target](#picking-a-rebalance-target)
    * [Per-replica / expressive constraints](#per-replica--expressive-constraints)
    * [Lease transfer decisions](#lease-transfer-decisions)
  * [Store Rebalancer](#store-rebalancer)
  * [Other details](#other-details)
* [Known issues](#known-issues)

## Overview

Cockroach maintains multiple replicas of each range of data for fault tolerance.
It also maintains a single leaseholder for each range to optimize the
performance of reads and help maintain correctness invariants. The locations of
these replicas and leaseholders are hugely important to the fault tolerance and
performance of the system as a whole, and so Cockroach contains a bunch of logic
that proactively tries to ensure a reasonably optimal distribution of replicas
and leases throughout the cluster.

## Considerations

There are a number of factors that need to be considered when making placement
decisions. For replicas, this includes:

* User-specified constraints in zone configs. These obviously need to be
  respected.
* Disk fullness. Moving replicas to a store that's out (or nearly out) of disk
  space is clearly a bad idea. Moving replicas away from a store that's nearly
  out of disk space is often a good idea, but not always.
* Diversity of localities. If we put all of the replicas for a range in just one
  or two localities, then a single locality (datacenter, rack, region, etc)
  failure will cause data unavailability / loss. We should try to spread
  replicas as widely as possible for maximal fault tolerance.
* Number of ranges on each store.
* Load on each node. Uneven distribution of load can cause bottlenecks that
  seriously affect the overall throughput of the cluster (e.g. [#26059]).
* Amount of data on each store. We don't want one disk in a cluster to fill up
  long before the others, and it's also valuable for recovery time to be
  roughly the same after any given node failure, which isn't the case if one
  node has significantly more data on it than others.
* Number of writes on each store. Disks have a limited amount of IOPs and
  bandwidth, so bottlenecks can be a problem here as well, at least
  hypothetically.

We currently don't directly use the last two factors, instead hoping that
balancing the overall load and number of ranges are good enough proxies for the
number of writes and the amount of data, respectively. We previously tried to
integrate these factors into decisions, but allocation decisions became quite
complex and the approach ran into a number of issues ([#17979]), so it has since
been removed in favor of the `StoreRebalancer`'s pure QPS-based rebalancing.
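To make the relative priority of these factors a little more concrete, here is
a deliberately simplified sketch of how a candidate store might be filtered and
then compared. The types, fields, and thresholds are invented for this note and
do not correspond to the actual allocator code:

```go
package main

import "fmt"

// candidateStore is a hypothetical, simplified view of the per-store state
// that placement decisions look at. It is not the real allocator's type.
type candidateStore struct {
	id            int
	meetsZoneCfg  bool    // satisfies the applicable zone config constraints
	diskFraction  float64 // fraction of disk space in use
	diversityGain float64 // how much locality diversity would improve, in [0, 1]
	rangeCount    int
}

// pickTarget sketches the rough priority order described above: hard
// requirements (constraints, disk fullness) filter candidates outright,
// diversity dominates the comparison, and range count breaks ties.
func pickTarget(cands []candidateStore) (candidateStore, bool) {
	var best candidateStore
	found := false
	for _, c := range cands {
		if !c.meetsZoneCfg || c.diskFraction > 0.95 {
			continue // hard filters: never violate constraints or fill up a disk
		}
		better := c.diversityGain > best.diversityGain ||
			(c.diversityGain == best.diversityGain && c.rangeCount < best.rangeCount)
		if !found || better {
			best, found = c, true
		}
	}
	return best, found
}

func main() {
	cands := []candidateStore{
		{id: 1, meetsZoneCfg: true, diskFraction: 0.40, diversityGain: 0.5, rangeCount: 50},
		{id: 2, meetsZoneCfg: true, diskFraction: 0.40, diversityGain: 1.0, rangeCount: 300},
		{id: 3, meetsZoneCfg: true, diskFraction: 0.97, diversityGain: 1.0, rangeCount: 10},
	}
	if best, ok := pickTarget(cands); ok {
		// Prints store 2: diversity outweighs its higher range count, and
		// store 3 is filtered out for being nearly out of disk.
		fmt.Println("would place replica on store", best.id)
	}
}
```

The real code scores candidates along the same general lines (hard filters
first, then diversity, then balance), but with far more nuance than a single
greater-than comparison.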
For lease rebalancing, the considerations include:

* Lease count on each node. Balancing this should roughly balance out the amount
  of load on each node, assuming a uniform distribution.
* QPS on each node. It turns out that not all workloads have a uniform
  distribution.
* Locality of data access. If most requests to a range are coming from the other
  side of the world, maybe we should move the lease closer to them.

Note that there is a built-in conflict here -- moving leases closer to where
requests are coming from may require unbalancing the lease count, or causing the
nodes in those localities to have more load than other nodes. Getting this right
can be a tough balancing act, and it's hard to ever be fully confident that
you've gotten it right because there are almost certainly workloads out there
that won't be handled optimally by whatever decision-making logic you implement.

## Implementation

Historically, all rebalancing has been handled by the `ReplicateQueue`. As of
v2.1, there's also a separate component called the `StoreRebalancer` which
focuses specifically on the problem of balancing the QPS on each store. QPS is
used here essentially as a proxy for the (CPU/network/disk) load on each node.
It's not a perfect proxy in general, but it seems to work well in benchmarks and
tests.

### Replicate Queue

The `ReplicateQueue` is one of our handful of replica queues which periodically
iterate over all the replicas in each store. Replicas are queued by the
`replicaScanner` on each store, which simply scans over all replicas at a
configurable pace and runs them through each of the replica queues. Replicas
are also sometimes manually queued in the `ReplicateQueue` upon certain
triggers, as will be explained later.

Upon being asked to operate on a replica, the `ReplicateQueue` must:

1. Decide whether the replica's range needs any replica/lease placement changes
2. Decide exactly what change to make
3. Make the change
4. Repeat until no more changes are needed
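In skeleton form, that loop might look like the sketch below. The names here
are hypothetical and the decision is reduced to a bare replica-count
comparison; the real decision logic is what the rest of this section describes:

```go
package main

import "fmt"

// action stands in for the allocator's decision about a range.
type action int

const (
	addReplica action = iota
	removeReplica
	considerRebalance
)

// rangeState is a toy stand-in for the state the queue inspects: the zone
// config's desired replica count and the range's live replica count.
type rangeState struct {
	desiredReplicas int
	liveReplicas    int
}

// computeAction mirrors the shape of step 1: compare desired vs. live replicas.
func computeAction(r rangeState) action {
	switch {
	case r.liveReplicas < r.desiredReplicas:
		return addReplica
	case r.liveReplicas > r.desiredReplicas:
		return removeReplica
	default:
		return considerRebalance
	}
}

// process keeps acting on the range until nothing more needs to change (step 4).
func process(r rangeState) {
	for {
		switch computeAction(r) {
		case addReplica:
			fmt.Println("up-replicating")
			r.liveReplicas++ // pretend the snapshot and config change succeeded
		case removeReplica:
			fmt.Println("down-replicating")
			r.liveReplicas--
		default:
			fmt.Println("considering a rebalance, then done")
			return
		}
	}
}

func main() {
	// A 3x-replicated range that currently has only one live replica.
	process(rangeState{desiredReplicas: 3, liveReplicas: 1})
}
```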
The main interesting bit here, of course, is how the decisions are made, which
is described in detail in the subsections below. The only other points of note
are:

* The `ReplicateQueue` only acts on ranges for which the local store is the
  current leaseholder.
* Only one goroutine is doing all this processing, including the actual sending
  of snapshots. This is desirable because it keeps snapshots from overloading
  the network and seriously impacting user traffic, but it also has downsides.
  In particular, if any part of the processing of a replica gets stalled
  (sending a snapshot being the most likely slow part, but IIRC we also had
  occasional problems with a lease transfer blocking in the past), then it will
  take a long time for the replicate queue to get through all of its store's
  replicas. There's a hard limit of 60 seconds of processing time per replica,
  but even this means that up-replication from a node failure can take a
  surprisingly long time in some pathological cases, whether due to a bug or
  just due to large replicas and low bandwidth between nodes.
* The 60 second deadline per replica means that sufficiently low snapshot
  bandwidth or sufficiently large replicas can make some ranges impossible to
  up-replicate or rebalance, because their snapshots can't complete in time
  and just get canceled on every attempt. This shouldn't happen with the
  default settings of `kv.snapshot_rebalance.max_rate`,
  `kv.snapshot_recovery.max_rate`, and `ZoneConfig.RangeMaxBytes`, but
  modifications to one or more of them can put a cluster in danger.
* We limit lease transfers away from each node to one per second. This is a
  very long-standing policy that hasn't been reconsidered in a long time, but
  it has minimal known downsides ([#19355]) that QPS-based lease rebalancing
  mostly obviates.
* If a range needs to be up-replicated but there are no available matching
  nodes, or if a range needs to be processed but doesn't have a quorum of live
  replicas (i.e. it's an "unavailable" range), the replica will be put in
  purgatory to be re-processed when new nodes become live.

#### Choosing an action

First, we must decide what action to take -- up-replicate, down-replicate, or
consider a rebalance. This decision is quite simple and can be easily
understood from the code. Essentially we just have to compare the number of
desired replicas from the applicable `ZoneConfig` to the number of non-dead,
non-decommissioning replicas of the range. There's a bit of extra logic needed
to dynamically adjust the desired number of replicas when it's greater than the
number of nodes in the cluster ([#27349], [#30441], [#32949], [#34126]), but
that's about it.

#### Picking an up-replication target

Picking an up-replication target is relatively straightforward. We can just
iterate over all live stores in the cluster, evaluating them on each of the
[considerations](#considerations) in order, choosing one of the best results. We
will never, ever choose a store that doesn't meet the `ZoneConfig` constraints,
has an overfull disk, or is on the same node as another store that already
contains a replica for the range. After that, we will first prefer maximizing
diversity before considering factors such as the range count on each store. We
notably do not consider the QPS on each store here -- it's only taken into
account by the `StoreRebalancer`, never by the `ReplicateQueue`.

Rather than always choosing the best result, if there are two similarly good
options we will choose randomly between them. See
https://brooker.co.za/blog/2012/01/17/two-random.html or
https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf for details on
why this behavior is preferable.
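The following sketch shows the flavor of that selection step: pick randomly
among the candidates whose scores are close to the best, rather than always
taking the single best. The scoring, threshold, and names here are made up for
illustration:

```go
package main

import (
	"fmt"
	"math/rand"
)

// storeCandidate is a hypothetical scored candidate; higher scores are better.
type storeCandidate struct {
	id    int
	score float64
}

// selectGood picks randomly among the candidates whose score is close to the
// best one, rather than always taking the single best. This mirrors the
// "two similarly good options" behavior described above; the real code has its
// own notion of scoring and of what counts as "similarly good".
func selectGood(cands []storeCandidate, rng *rand.Rand) (storeCandidate, bool) {
	if len(cands) == 0 {
		return storeCandidate{}, false
	}
	best := cands[0].score
	for _, c := range cands[1:] {
		if c.score > best {
			best = c.score
		}
	}
	// Treat anything within 5% of the best score as equally good.
	var top []storeCandidate
	for _, c := range cands {
		if c.score >= best*0.95 {
			top = append(top, c)
		}
	}
	return top[rng.Intn(len(top))], true
}

func main() {
	rng := rand.New(rand.NewSource(42))
	cands := []storeCandidate{{id: 1, score: 0.99}, {id: 2, score: 1.0}, {id: 3, score: 0.5}}
	c, _ := selectGood(cands, rng)
	fmt.Println("chose store", c.id) // store 1 or 2, but never 3
}
```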
#### Picking a down-replication target

Picking a replica to remove from an over-replicated range is also quite
straightforward. We just iterate over each replica's store, grading it on the
same [considerations](#considerations) as always, choosing one of the two
worst-scoring stores. The only real exception is if one of the replicas is dead;
in such cases, we'll always remove the dead store(s) first. Note that as of
[#28875] we don't remove replicas from dead stores until we have allocated a
replacement replica on a different store. This makes certain data loss scenarios
less likely (see [#25392] for details).

If the algorithm chooses to remove the local replica, the replica must first
transfer the lease away before it can be removed. Note that while the new
leaseholder's replicate queue will examine the range shortly after acquiring the
lease, it's possible for the new leaseholder to make a different decision. This
isn't a real problem, but it does mean that removing oneself involves more work
and less certainty than removing any of the other replicas.

#### Picking a rebalance target

Deciding when to rebalance is when things start to get legitimately tricky, and
is what much of the allocator code is devoted to. This makes intuitive sense if
you consider that when adding or removing a replica you both:

1. know that you need to take action -- unless all the options are truly
   terrible, you should pick one of them.
2. only have to consider each store with respect to the set of existing
   replicas' stores. For adding a replica, this is roughly linear with respect
   to the number of live stores in the cluster. For removing a replica, it's
   linear with respect to the number of replicas in the range.

However, when rebalancing, you have to decide whether taking action is actually
desirable. And in practice, you want a bias against action, since there's a real
cost to moving data around, and we don't want to do so unless there's a
correspondingly real benefit. Also, the problem isn't linear any more -- it's
roughly O(m*n) when there are m replicas in the range and n live stores in the
cluster, because we have to choose both the replica to be removed and the
replica to add. This is particularly an issue for diversity score calculations
and per-replica / expressive zone config constraints. For example, if you have
the following stores:

StoreID | Locality               | Range Count
--------|------------------------|------------
s1      | region=west,zone=a     | 10
s2      | region=west,zone=b     | 10
s3      | region=central,zone=a  | 100
s4      | region=east,zone=a     | 100

And a range has replicas on s1, s3, and s4, then going purely by range count it
would be great to take a replica off of store s3 or s4, which are both
relatively overfull. It would also be great to add a replica to s2, which is
relatively underfull. However, replacing s3 or s4 with s2 would hurt the range's
diversity, which we never choose to do without the user telling us to.

You can probably imagine that as the number of stores grows, doing all the
pairwise comparisons could become quite a bit of work. To optimize these
calculations, we group stores that share the same locality and the same
node/store attributes (a mostly-forgotten feature, but one that still needs to
be accounted for when considering `ZoneConfig` constraints). We can do all
constraint and diversity-scoring calculations just once for each group, and also
pair each group up against only the existing replicas that it could legally
replace without hurting diversity or violating constraints. We then only have to
do range count comparisons within these "comparable" classes of stores.
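To ground the s1-s4 example above, here is a toy diversity measure -- a
simplification, not the allocator's actual scoring function -- that treats the
diversity between two stores as the fraction of locality tiers at which they
differ, and a range's diversity as the average over all pairs of its replicas:

```go
package main

import (
	"fmt"
	"strings"
)

// diversity returns the fraction of locality tiers at which two stores differ,
// e.g. "region=west,zone=a" vs. "region=west,zone=b" -> 0.5. This is a
// simplified stand-in for the allocator's diversity score.
func diversity(a, b string) float64 {
	at, bt := strings.Split(a, ","), strings.Split(b, ",")
	n := len(at)
	if len(bt) > n {
		n = len(bt)
	}
	shared := 0
	for i := 0; i < len(at) && i < len(bt) && at[i] == bt[i]; i++ {
		shared++
	}
	return 1 - float64(shared)/float64(n)
}

// rangeDiversity averages the pairwise diversity over a replica set.
func rangeDiversity(localities []string) float64 {
	sum, pairs := 0.0, 0
	for i := range localities {
		for j := i + 1; j < len(localities); j++ {
			sum += diversity(localities[i], localities[j])
			pairs++
		}
	}
	if pairs == 0 {
		return 0
	}
	return sum / float64(pairs)
}

func main() {
	current := []string{"region=west,zone=a", "region=central,zone=a", "region=east,zone=a"} // s1, s3, s4
	proposed := []string{"region=west,zone=a", "region=west,zone=b", "region=east,zone=a"}   // s1, s2, s4
	// Prints "current: 1.00, after replacing s3 with s2: 0.83".
	fmt.Printf("current: %.2f, after replacing s3 with s2: %.2f\n",
		rangeDiversity(current), rangeDiversity(proposed))
}
```

Replacing s3 with s2 lowers the range's score, which is why the allocator never
makes that swap purely on range-count grounds.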
At the end, we can determine which (added replica, removed replica) pairs are
the largest improvement and choose from amongst them. As one last precautionary
step, we then simulate the down-replication logic on the set of replicas that
would result from adding the new replica. If the simulation finds that we would
remove the replica that was just added, we choose not to make that change. This
avoids thrashing, and is needed because we can't atomically add a member to the
raft group at the same time that we remove one. It's possible that this isn't
necessary right now, since the rebalancing code has been significantly improved
since it was added, but at the very least it's a nice fail-safe against future
mistakes.

#### Per-replica / expressive constraints

We support two high-level types of constraints -- those which apply to all
replicas in a range, and those which are scoped to only apply to a particular
number of the replicas in a range (publicly referred to as [per-replica
constraints]). The latter option adds a good deal of subtlety to all allocator
decisions -- up-replication, down-replication, and especially rebalancing.

In order to satisfy the requirements, we had to split up constraint checking
into separate functions that work differently for adding, removing, and
rebalancing. We also had to add an internal concept of whether a replica is
"necessary" for meeting the required constraints, in addition to the existing
concept of whether or not the replica is valid. A replica is "necessary" if the
per-replica constraints wouldn't be satisfied if the replica weren't part of the
range.

For more details on the design of the feature, see the discussion on [#19985].
For the implementation, see [#22819].
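Here is a rough sketch of the "necessary" idea under simplified assumptions:
constraints are plain key=value attributes, each requiring some number of
matching replicas, and a replica counts as necessary if dropping it would leave
some constraint it matches short of replicas. The real constraint-checking code
handles much more than this (overlapping constraints, attributes vs. locality
tiers, and so on):

```go
package main

import "fmt"

// perReplicaConstraint is a hypothetical representation of a constraint that
// only a subset of a range's replicas must satisfy, e.g. "1 replica in
// region=west".
type perReplicaConstraint struct {
	attr     string // e.g. "region=west"
	replicas int    // how many replicas must match
}

// matches reports whether a store (described by its attributes) satisfies the
// constraint.
func matches(storeAttrs []string, c perReplicaConstraint) bool {
	for _, a := range storeAttrs {
		if a == c.attr {
			return true
		}
	}
	return false
}

// necessary reports whether some constraint that this replica matches would no
// longer have enough matching replicas without it.
func necessary(replica []string, others [][]string, constraints []perReplicaConstraint) bool {
	for _, c := range constraints {
		if !matches(replica, c) {
			continue
		}
		matching := 0
		for _, o := range others {
			if matches(o, c) {
				matching++
			}
		}
		if matching < c.replicas {
			return true // without this replica, the constraint goes unsatisfied
		}
	}
	return false
}

func main() {
	constraints := []perReplicaConstraint{
		{attr: "region=west", replicas: 1},
		{attr: "region=east", replicas: 1},
	}
	west := []string{"region=west"}
	east := []string{"region=east"}
	central := []string{"region=central"}
	fmt.Println(necessary(west, [][]string{east, central}, constraints)) // true: the only west replica
	fmt.Println(necessary(central, [][]string{west, east}, constraints)) // false: matches no constraint
}
```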
#### Lease transfer decisions

For the most part, deciding whether to transfer a lease is a fairly
straightforward decision based on whether the current leaseholder node is in a
draining state and on the lease counts on all the stores holding replicas for a
range. The more complex logic is related to the follow-the-workload
functionality that kicks in if-and-only-if the various nodes holding replicas
are in different localities. The logic involved here is better explained in the
[original RFC](../RFCS/20170125_leaseholder_locality.md) than I could do in less
space here. The logic has not meaningfully changed since the original
design/implementation.

### Store Rebalancer

As of v2.1, Cockroach also includes a separate control loop on each store called
the `StoreRebalancer`. The `StoreRebalancer` exists because we found in [#26059]
that an uneven balance of load on each node was causing serious performance
problems when attempting to run TPC-C at large scale without using partitioning.
Ensuring that each node had a more even balance of work to do was experimentally
found to allow significantly higher and smoother performance.

The `StoreRebalancer` takes a somewhat different approach to rebalancing,
though. While the `ReplicateQueue` iterates over each replica one at a time,
deciding whether the replica would be better off somewhere else, the
`StoreRebalancer` looks at the overall amount of load (`BatchRequest` QPS
specifically, although it could in theory consider other factors) on each store
and attempts to take action if the local store is overloaded relative to the
other stores in the cluster. This difference is important -- our previous
attempt to rebalance based on load was integrated into the replicate queue, and
it didn't work very well for at least three different reasons:

1. We bit off more than we could chew, trying to rebalance on too many different
   factors at once -- range count, keys written per second, and disk space used.
2. Keys written per second was the wrong metric, at least for TPC-C.
   Experimentation showed that the number of `BatchRequest`s being handled by a
   store per second was much more strongly correlated with a load imbalance than
   keys written per second.
3. Most importantly, the replicate queue only looks at one replica at a time. It
   may see that the load on each store is uneven, but it doesn't have a good way
   of knowing whether the replica in question would be a good one to move to try
   to even things out (if a particular range is relatively low in the metric
   we want to even out, it's intuitively a bad one to move). We did start
   gossiping quantiles in order to help determine which quantile a range fell in
   and thus whether it would be a good one to move, but this was still pretty
   imprecise.

The `StoreRebalancer` solves all these problems. It only focuses on QPS, and by
focusing on the store-level imbalance first and picking ranges to rebalance
later, it can choose ranges that are specifically high in QPS in order to have
the biggest influence on store-level balance with the smallest disruption to
range count (which the `ReplicateQueue` is still responsible for attempting to
even out). Ranges to rebalance are efficiently chosen because we have started
tracking a priority queue of the hottest ranges by QPS on each store. This queue
gets repopulated once a minute, when the existing loop that iterates over all
replicas to compute store-level stats does its thing. This list of hot ranges
can have other uses as well, such as powering debug endpoints for the admin UI
([#33336]).

Interpreting the exact details of how things work from the code should be pretty
straightforward; we attempt to move leases to resolve imbalances first, and only
resort to moving replicas around if moving leases was insufficient to resolve
the imbalance. There are some controls in place to avoid rebalancing when QPS is
too low to matter, to avoid messing with a range that's so hot that it
constitutes the majority of a node's QPS, to not bother moving ranges with too
little QPS to make a difference, and a few other such things.

The `StoreRebalancer` can be controlled by a cluster setting that either fully
turns it off, enables just lease rebalancing, or enables both lease and replica
rebalancing, which is the default.

For more details, see the original prototype ([#26608]) or the final
implementation ([#28340], [#28852]).
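The following sketch captures the store-level shape of that logic: given the
local store's QPS, the cluster mean, and the list of hottest ranges, keep
picking ranges to shed until the store is close enough to the mean. The names
are hypothetical and the controls mentioned above are reduced to two crude
inline checks with made-up thresholds:

```go
package main

import (
	"fmt"
	"sort"
)

// hotRange is a hypothetical entry in the per-store list of hottest ranges.
type hotRange struct {
	rangeID int
	qps     float64
}

// pickRangesToShed keeps choosing hot ranges until the local store's QPS would
// be within 10% of the cluster mean. Ranges that make up a huge fraction of the
// store's load, or that barely register, are skipped.
func pickRangesToShed(localQPS, meanQPS float64, hottest []hotRange) []hotRange {
	sort.Slice(hottest, func(i, j int) bool { return hottest[i].qps > hottest[j].qps })
	var picked []hotRange
	for _, r := range hottest {
		if localQPS <= meanQPS*1.1 {
			break // close enough to the mean; stop moving things
		}
		if r.qps > localQPS*0.5 || r.qps < meanQPS*0.01 {
			continue // too hot to move safely, or too cold to be worth moving
		}
		picked = append(picked, r)
		localQPS -= r.qps
	}
	return picked
}

func main() {
	hottest := []hotRange{{1, 900}, {2, 400}, {3, 250}, {4, 2}}
	// With 2000 QPS locally against a 1000 QPS mean, shedding r1 is enough.
	for _, r := range pickRangesToShed(2000, 1000, hottest) {
		fmt.Printf("would move the lease or replicas for r%d (%.0f QPS)\n", r.rangeID, r.qps)
	}
}
```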
### Other details

Before removing a replica or transferring a lease, we need to take the raft
status of the various existing replicas into account. This is important to avoid
temporary unavailability.

For example, if you transfer the lease for a range to a replica that is way
behind in processing its raft log, it will take some time before that replica
gets around to processing the command which transferred the lease to it, and it
won't be able to serve any requests until it does so.

Or when considering which replica to remove from a range, we must take care not
to remove a replica that is critical for the range's quorum. If only 3 replicas
out of 5 are caught up with the raft leader's state, we can't remove any of
those 3, but can safely remove either of the other 2.

Note that it's possible that the raft state of the underlying replicas changes
between when we do this check and when the actual transfer/removal takes place,
so it isn't a foolproof protection, but the window of risk is very small and we
haven't noticed it being a problem in practice.
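A minimal sketch of the quorum-safety reasoning from the 3-of-5 example above.
The types are hypothetical, and the real check consults the actual raft
progress state rather than a single boolean per replica:

```go
package main

import "fmt"

// replicaStatus is a simplified view of what matters for the safety check:
// whether each replica is caught up on the raft log.
type replicaStatus struct {
	storeID  int
	caughtUp bool
}

// safeToRemove reports whether removing the given replica still leaves a
// quorum of the remaining replicas caught up.
func safeToRemove(replicas []replicaStatus, remove int) bool {
	caughtUp, total := 0, 0
	for _, r := range replicas {
		if r.storeID == remove {
			continue
		}
		total++
		if r.caughtUp {
			caughtUp++
		}
	}
	quorum := total/2 + 1
	return caughtUp >= quorum
}

func main() {
	// 3 of 5 replicas are caught up: removing a behind replica is fine, but
	// removing a caught-up one would drop below quorum of the resulting set.
	replicas := []replicaStatus{
		{1, true}, {2, true}, {3, true}, {4, false}, {5, false},
	}
	fmt.Println(safeToRemove(replicas, 5)) // true: 3 of the remaining 4 are caught up
	fmt.Println(safeToRemove(replicas, 1)) // false: only 2 of the remaining 4 are caught up
}
```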
## Known issues

* Rebalancing isn't atomic, meaning that adding a new replica and removing the
  replica it replaces is done as two separate steps rather than just one. This
  leaves room for locality failures between the two steps to cause
  unavailability ([#12768]). For example, if a range has replicas in localities
  `a`, `b`, and `c`, and wants to rebalance to a different store in `a`, there
  will be a short period of time in which 2 of the range's 4 replicas are in
  `a`. If `a` goes down before one of them is removed, the range will be
  without a quorum until `a` comes back up.
* Rebalancing doesn't work well with multiple stores per node because we want to
  avoid ever putting multiple replicas of the same range on the same node
  ([#6782]). This has never been a deal breaker for anyone AFAIK, but
  occasionally annoys a user or two.
* `RelocateRange` is flaky in v2.2-alpha versions because we now immediately put
  a range through the replicate queue when a new lease is acquired on it
  ([#31287]). It may fail to complete its desired changes successfully due to
  racing with changes proposed by the new leaseholder.
* `RelocateRange` (and consequently the `StoreRebalancer` as a whole) doesn't
  populate any useful information into the `system.rangelog` table, which has
  traditionally been the best way to debug rebalancing decisions after the
  fact ([#34130]).

[#6782]: https://github.com/cockroachdb/cockroach/issues/6782
[#12768]: https://github.com/cockroachdb/cockroach/issues/12768
[#17979]: https://github.com/cockroachdb/cockroach/issues/17979
[#19355]: https://github.com/cockroachdb/cockroach/issues/19355
[#19985]: https://github.com/cockroachdb/cockroach/issues/19985
[#22819]: https://github.com/cockroachdb/cockroach/pull/22819
[#25392]: https://github.com/cockroachdb/cockroach/issues/25392
[#26059]: https://github.com/cockroachdb/cockroach/issues/26059
[#26608]: https://github.com/cockroachdb/cockroach/pull/26608
[#27349]: https://github.com/cockroachdb/cockroach/pull/27349
[#28340]: https://github.com/cockroachdb/cockroach/pull/28340
[#28852]: https://github.com/cockroachdb/cockroach/pull/28852
[#28875]: https://github.com/cockroachdb/cockroach/pull/28875
[#30441]: https://github.com/cockroachdb/cockroach/pull/30441
[#31287]: https://github.com/cockroachdb/cockroach/issues/31287
[#32949]: https://github.com/cockroachdb/cockroach/pull/32949
[#33336]: https://github.com/cockroachdb/cockroach/pull/33336
[#34126]: https://github.com/cockroachdb/cockroach/pull/34126
[#34130]: https://github.com/cockroachdb/cockroach/issues/34130
[per-replica constraints]: https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html#scope-of-constraints