# Rebalancing

**Last update:** January 2019

**Original author:** Alex Robinson

This document serves as an end-to-end description of the current state of range
and lease rebalancing in Cockroach as of v2.1. The target reader is anyone who
is interested in better understanding how and why replica and lease placement
decisions get made. Little detailed knowledge of core should be needed.

The most complete documentation, of course, is in the code, tests, and the
surrounding comments, but those are necessarily split across several files and
are much tougher to piece together into a logical understanding. That scattered
knowledge is centralized here, without excessive detail that is likely to become
stale.

## Table of Contents

* [Overview](#overview)
* [Considerations](#considerations)
* [Implementation](#implementation)
  * [Replicate Queue](#replicate-queue)
    * [Choosing an action](#choosing-an-action)
    * [Picking an up-replication target](#picking-an-up-replication-target)
    * [Picking a down-replication target](#picking-a-down-replication-target)
    * [Picking a rebalance target](#picking-a-rebalance-target)
    * [Per-replica / expressive constraints](#per-replica--expressive-constraints)
    * [Lease transfer decisions](#lease-transfer-decisions)
  * [Store Rebalancer](#store-rebalancer)
  * [Other details](#other-details)
* [Known issues](#known-issues)

## Overview

Cockroach maintains multiple replicas of each range of data for fault tolerance.
It also maintains a single leaseholder for each range to optimize the
performance of reads and help maintain correctness invariants. The locations of
these replicas and leaseholders are hugely important to the fault tolerance and
performance of the system as a whole, and so Cockroach contains a bunch of logic
that proactively tries to ensure a reasonably optimal distribution of replicas
and leases throughout the cluster.

## Considerations

There are a number of factors that need to be considered when making placement
decisions. For replicas, this includes:

* User-specified constraints in zone configs. These obviously need to be
  respected.
* Disk fullness. Moving replicas to a store that's out (or nearly out) of disk
  space is clearly a bad idea. Moving replicas away from a store that's nearly
  out of disk space is often a good idea, but not always.
* Diversity of localities. If we put all of the replicas for a range in just one
  or two localities, then a single locality (datacenter, rack, region, etc)
  failure will cause data unavailability / loss. We should try to spread
  replicas as widely as possible for maximal fault tolerance.
* Number of ranges on each store.
* Load on each node. Uneven distribution of load can cause bottlenecks that
  seriously affect the overall throughput of the cluster (e.g. [#26059]).
* Amount of data on each store. We don't want one disk in a cluster to fill up
  long before the others, and it's also valuable for recovery time to be
  roughly the same after any given node failure, which isn't the case if one
  node has significantly more data on it than others.
* Number of writes on each store. Disks have a limited amount of IOPs and
  bandwidth, so bottlenecks can be a problem here as well, at least
  hypothetically.

We currently don't directly use the last two factors, instead hoping that
balancing the overall load and number of ranges are good enough proxies for the
number of writes and the amount of data, respectively. We previously tried to
integrate these factors into decisions, but allocation decisions became quite
complex and the approach ran into a number of issues ([#17979]), so it has since
been removed in favor of the `StoreRebalancer`'s pure QPS-based rebalancing.
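To make the relative priority of these factors a little more concrete, here is
a deliberately simplified sketch of how a candidate store might be filtered and
then compared. The types, fields, and thresholds are invented for this note and
do not correspond to the actual allocator code:

```go
package main

import "fmt"

// candidateStore is a hypothetical, simplified view of the per-store state
// that placement decisions look at. It is not the real allocator's type.
type candidateStore struct {
	id            int
	meetsZoneCfg  bool    // satisfies the applicable zone config constraints
	diskFraction  float64 // fraction of disk space in use
	diversityGain float64 // how much locality diversity would improve, in [0, 1]
	rangeCount    int
}

// pickTarget sketches the rough priority order described above: hard
// requirements (constraints, disk fullness) filter candidates outright,
// diversity dominates the comparison, and range count breaks ties.
func pickTarget(cands []candidateStore) (candidateStore, bool) {
	var best candidateStore
	found := false
	for _, c := range cands {
		if !c.meetsZoneCfg || c.diskFraction > 0.95 {
			continue // hard filters: never violate constraints or fill up a disk
		}
		better := c.diversityGain > best.diversityGain ||
			(c.diversityGain == best.diversityGain && c.rangeCount < best.rangeCount)
		if !found || better {
			best, found = c, true
		}
	}
	return best, found
}

func main() {
	cands := []candidateStore{
		{id: 1, meetsZoneCfg: true, diskFraction: 0.40, diversityGain: 0.5, rangeCount: 50},
		{id: 2, meetsZoneCfg: true, diskFraction: 0.40, diversityGain: 1.0, rangeCount: 300},
		{id: 3, meetsZoneCfg: true, diskFraction: 0.97, diversityGain: 1.0, rangeCount: 10},
	}
	if best, ok := pickTarget(cands); ok {
		// Prints store 2: diversity outweighs its higher range count, and
		// store 3 is filtered out for being nearly out of disk.
		fmt.Println("would place replica on store", best.id)
	}
}
```

The real code scores candidates along the same general lines (hard filters
first, then diversity, then balance), but with far more nuance than a single
greater-than comparison.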
For lease rebalancing, the considerations include:

* Lease count on each node. Balancing this should roughly balance out the amount
  of load on each node, assuming a uniform distribution.
* QPS on each node. It turns out that not all workloads have a uniform
  distribution.
* Locality of data access. If most requests to a range are coming from the other
  side of the world, maybe we should move the lease closer to them.

Note that there is a built-in conflict here -- moving leases closer to where
requests are coming from may require unbalancing the lease count, or causing the
nodes in those localities to have more load than other nodes. Getting this right
can be a tough balancing act, and it's hard to ever be fully confident that
you've gotten it right because there are almost certainly workloads out there
that won't be handled optimally by whatever decision-making logic you implement.

## Implementation

Historically, all rebalancing has been handled by the `ReplicateQueue`. As of
v2.1, there's also a separate component called the `StoreRebalancer` which
focuses specifically on the problem of balancing the QPS on each store. QPS is
used here essentially as a proxy for the (CPU/network/disk) load on each node.
It's not a perfect proxy in general, but it seems to work well in benchmarks and
tests.

### Replicate Queue

The `ReplicateQueue` is one of our handful of replica queues which periodically
iterate over all the replicas in each store. Replicas are queued by the
`replicaScanner` on each store, which simply scans over all replicas at a
configurable pace and runs them through each of the replica queues. Replicas
are also sometimes manually queued in the `ReplicateQueue` upon certain
triggers, as will be explained later.

Upon being asked to operate on a replica, the `ReplicateQueue` must:

1. Decide whether the replica's range needs any replica/lease placement changes
2. Decide exactly what change to make
3. Make the change
4. Repeat until no more changes are needed
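In skeleton form, that loop might look like the sketch below. The names here
are hypothetical and the decision is reduced to a bare replica-count
comparison; the real decision logic is what the rest of this section describes:

```go
package main

import "fmt"

// action stands in for the allocator's decision about a range.
type action int

const (
	addReplica action = iota
	removeReplica
	considerRebalance
)

// rangeState is a toy stand-in for the state the queue inspects: the zone
// config's desired replica count and the range's live replica count.
type rangeState struct {
	desiredReplicas int
	liveReplicas    int
}

// computeAction mirrors the shape of step 1: compare desired vs. live replicas.
func computeAction(r rangeState) action {
	switch {
	case r.liveReplicas < r.desiredReplicas:
		return addReplica
	case r.liveReplicas > r.desiredReplicas:
		return removeReplica
	default:
		return considerRebalance
	}
}

// process keeps acting on the range until nothing more needs to change (step 4).
func process(r rangeState) {
	for {
		switch computeAction(r) {
		case addReplica:
			fmt.Println("up-replicating")
			r.liveReplicas++ // pretend the snapshot and config change succeeded
		case removeReplica:
			fmt.Println("down-replicating")
			r.liveReplicas--
		default:
			fmt.Println("considering a rebalance, then done")
			return
		}
	}
}

func main() {
	// A 3x-replicated range that currently has only one live replica.
	process(rangeState{desiredReplicas: 3, liveReplicas: 1})
}
```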
The main interesting bit here, of course, is how the decisions are made, which
is described in detail in the subsections below. The only other points of note
are:

* The `ReplicateQueue` only acts on ranges for which the local store is the
  current leaseholder.
* Only one goroutine is doing all this processing, including the actual sending
  of snapshots. This is desirable because it keeps snapshots from overloading
  the network and seriously impacting user traffic, but it also has downsides.
  In particular, if any part of the processing of a replica gets stalled
  (sending a snapshot being the most likely slow part, but IIRC we also had
  occasional problems with a lease transfer blocking in the past), then it will
  take a long time for the replicate queue to get through all of its store's
  replicas. There's a hard limit of 60 seconds of processing time per replica,
  but even this means that up-replication from a node failure can take a
  surprisingly long time in some pathological cases, whether due to a bug or
  just due to large replicas and low bandwidth between nodes.
* The 60 second deadline per replica means that sufficiently low snapshot
  bandwidth or sufficiently large replicas can make some ranges impossible to
  up-replicate or rebalance, because their snapshots can't complete in time
  and just get canceled on every attempt. This shouldn't happen with the
  default settings of `kv.snapshot_rebalance.max_rate`,
  `kv.snapshot_recovery.max_rate`, and `ZoneConfig.RangeMaxBytes`, but
  modifications to one or more of them can put a cluster in danger.
* We limit lease transfers away from each node to one per second. This is a
  very long-standing policy that hasn't been reconsidered in a long time, but
  it has minimal known downsides ([#19355]) that QPS-based lease rebalancing
  mostly obviates.
* If a range needs to be up-replicated but there are no available matching
  nodes, or if a range needs to be processed but doesn't have a quorum of live
  replicas (i.e. it's an "unavailable" range), the replica will be put in
  purgatory to be re-processed when new nodes become live.

#### Choosing an action

First, we must decide what action to take -- up-replicate, down-replicate, or
consider a rebalance. This decision is quite simple and can be easily
understood from the code. Essentially we just have to compare the number of
desired replicas from the applicable `ZoneConfig` to the number of non-dead,
non-decommissioning replicas of the range. There's a bit of extra logic needed
to dynamically adjust the desired number of replicas when it's greater than the
number of nodes in the cluster ([#27349], [#30441], [#32949], [#34126]), but
that's about it.

#### Picking an up-replication target

Picking an up-replication target is relatively straightforward. We can just
iterate over all live stores in the cluster, evaluating them on each of the
[considerations](#considerations) in order, choosing one of the best results. We
will never, ever choose a store that doesn't meet the `ZoneConfig` constraints,
has an overfull disk, or is on the same node as another store that already
contains a replica for the range. After that, we will first prefer maximizing
diversity before considering factors such as the range count on each store. We
notably do not consider the QPS on each store here -- it's only taken into
account by the `StoreRebalancer`, never by the `ReplicateQueue`.

Rather than always choosing the best result, if there are two similarly good
options we will choose randomly between them. See
https://brooker.co.za/blog/2012/01/17/two-random.html or
https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf for details on
why this behavior is preferable.
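The following sketch shows the flavor of that selection step: pick randomly
among the candidates whose scores are close to the best, rather than always
taking the single best. The scoring, threshold, and names here are made up for
illustration:

```go
package main

import (
	"fmt"
	"math/rand"
)

// storeCandidate is a hypothetical scored candidate; higher scores are better.
type storeCandidate struct {
	id    int
	score float64
}

// selectGood picks randomly among the candidates whose score is close to the
// best one, rather than always taking the single best. This mirrors the
// "two similarly good options" behavior described above; the real code has its
// own notion of scoring and of what counts as "similarly good".
func selectGood(cands []storeCandidate, rng *rand.Rand) (storeCandidate, bool) {
	if len(cands) == 0 {
		return storeCandidate{}, false
	}
	best := cands[0].score
	for _, c := range cands[1:] {
		if c.score > best {
			best = c.score
		}
	}
	// Treat anything within 5% of the best score as equally good.
	var top []storeCandidate
	for _, c := range cands {
		if c.score >= best*0.95 {
			top = append(top, c)
		}
	}
	return top[rng.Intn(len(top))], true
}

func main() {
	rng := rand.New(rand.NewSource(42))
	cands := []storeCandidate{{id: 1, score: 0.99}, {id: 2, score: 1.0}, {id: 3, score: 0.5}}
	c, _ := selectGood(cands, rng)
	fmt.Println("chose store", c.id) // store 1 or 2, but never 3
}
```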
#### Picking a down-replication target

Picking a replica to remove from an over-replicated range is also quite
straightforward. We just iterate over each replica's store, grading it on the
same [considerations](#considerations) as always, choosing one of the two
worst-scoring stores. The only real exception is if one of the replicas is dead;
in such cases, we'll always remove the dead store(s) first. Note that as of
[#28875] we don't remove replicas from dead stores until we have allocated a
replacement replica on a different store. This makes certain data loss scenarios
less likely (see [#25392] for details).

If the algorithm chooses to remove the local replica, the replica must first
transfer the lease away before it can be removed. Note that while the new
leaseholder's replicate queue will examine the range shortly after acquiring the
lease, it's possible for the new leaseholder to make a different decision. This
isn't a real problem, but it does mean that removing oneself involves more work
and less certainty than removing any of the other replicas.

#### Picking a rebalance target

Deciding when to rebalance is when things start to get legitimately tricky, and
is what much of the allocator code is devoted to. This makes intuitive sense if
you consider that when adding or removing a replica you both:

1. know that you need to take action -- unless all the options are truly
   terrible, you should pick one of them.
2. only have to consider each store with respect to the set of existing
   replicas' stores. For adding a replica, this is roughly linear with respect
   to the number of live stores in the cluster. For removing a replica, it's
   linear with respect to the number of replicas in the range.

However, when rebalancing, you have to decide whether taking action is actually
desirable. And in practice, you want a bias against action, since there's a real
cost to moving data around, and we don't want to do so unless there's a
correspondingly real benefit. Also, the problem isn't linear any more -- it's
roughly O(m*n) when there are m replicas in the range and n live stores in the
cluster, because we have to choose both the replica to be removed and the
replica to add. This is particularly an issue for diversity score calculations
and per-replica / expressive zone config constraints. For example, if you have
the following stores:

StoreID | Locality               | Range Count
--------|------------------------|------------
s1      | region=west,zone=a     | 10
s2      | region=west,zone=b     | 10
s3      | region=central,zone=a  | 100
s4      | region=east,zone=a     | 100

And a range has replicas on s1, s3, and s4, then going purely by range count it
would be great to take a replica off of store s3 or s4, which are both
relatively overfull. It would also be great to add a replica to s2, which is
relatively underfull. However, replacing s3 or s4 with s2 would hurt the range's
diversity, which we never choose to do without the user telling us to.

You can probably imagine that as the number of stores grows, doing all the
pairwise comparisons could become quite a bit of work. To optimize these
calculations, we group stores that share the same locality and the same
node/store attributes (a mostly-forgotten feature, but one that still needs to
be accounted for when considering `ZoneConfig` constraints). We can do all
constraint and diversity-scoring calculations just once for each group, and also
pair each group up against only the existing replicas that it could legally
replace without hurting diversity or violating constraints. We then only have to
do range count comparisons within these "comparable" classes of stores.
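To ground the s1-s4 example above, here is a toy diversity measure -- a
simplification, not the allocator's actual scoring function -- that treats the
diversity between two stores as the fraction of locality tiers at which they
differ, and a range's diversity as the average over all pairs of its replicas:

```go
package main

import (
	"fmt"
	"strings"
)

// diversity returns the fraction of locality tiers at which two stores differ,
// e.g. "region=west,zone=a" vs. "region=west,zone=b" -> 0.5. This is a
// simplified stand-in for the allocator's diversity score.
func diversity(a, b string) float64 {
	at, bt := strings.Split(a, ","), strings.Split(b, ",")
	n := len(at)
	if len(bt) > n {
		n = len(bt)
	}
	shared := 0
	for i := 0; i < len(at) && i < len(bt) && at[i] == bt[i]; i++ {
		shared++
	}
	return 1 - float64(shared)/float64(n)
}

// rangeDiversity averages the pairwise diversity over a replica set.
func rangeDiversity(localities []string) float64 {
	sum, pairs := 0.0, 0
	for i := range localities {
		for j := i + 1; j < len(localities); j++ {
			sum += diversity(localities[i], localities[j])
			pairs++
		}
	}
	if pairs == 0 {
		return 0
	}
	return sum / float64(pairs)
}

func main() {
	current := []string{"region=west,zone=a", "region=central,zone=a", "region=east,zone=a"} // s1, s3, s4
	proposed := []string{"region=west,zone=a", "region=west,zone=b", "region=east,zone=a"}   // s1, s2, s4
	// Prints "current: 1.00, after replacing s3 with s2: 0.83".
	fmt.Printf("current: %.2f, after replacing s3 with s2: %.2f\n",
		rangeDiversity(current), rangeDiversity(proposed))
}
```

Replacing s3 with s2 lowers the range's score, which is why the allocator never
makes that swap purely on range-count grounds.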
At the end, we can determine which (added replica, removed replica) pairs are
the largest improvement and choose from amongst them. As one last precautionary
step, we then simulate the down-replication logic on the set of replicas that
would result from adding the new replica. If the simulation finds that we would
remove the replica that was just added, we choose not to make that change. This
avoids thrashing, and is needed because we can't atomically add a member to the
raft group at the same time that we remove one. It's possible that this isn't
necessary right now, since the rebalancing code has been significantly improved
since it was added, but at the very least it's a nice fail-safe against future
mistakes.

#### Per-replica / expressive constraints

We support two high-level types of constraints -- those which apply to all
replicas in a range, and those which are scoped to only apply to a particular
number of the replicas in a range (publicly referred to as [per-replica
constraints]). The latter option adds a good deal of subtlety to all allocator
decisions -- up-replication, down-replication, and especially rebalancing.

In order to satisfy the requirements, we had to split up constraint checking
into separate functions that work differently for adding, removing, and
rebalancing. We also had to add an internal concept of whether a replica is
"necessary" for meeting the required constraints, in addition to the existing
concept of whether or not the replica is valid. A replica is "necessary" if the
per-replica constraints wouldn't be satisfied if the replica weren't part of the
range.

For more details on the design of the feature, see the discussion on [#19985].
For the implementation, see [#22819].
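Here is a rough sketch of the "necessary" idea under simplified assumptions:
constraints are plain key=value attributes, each requiring some number of
matching replicas, and a replica counts as necessary if dropping it would leave
some constraint it matches short of replicas. The real constraint-checking code
handles much more than this (overlapping constraints, attributes vs. locality
tiers, and so on):

```go
package main

import "fmt"

// perReplicaConstraint is a hypothetical representation of a constraint that
// only a subset of a range's replicas must satisfy, e.g. "1 replica in
// region=west".
type perReplicaConstraint struct {
	attr     string // e.g. "region=west"
	replicas int    // how many replicas must match
}

// matches reports whether a store (described by its attributes) satisfies the
// constraint.
func matches(storeAttrs []string, c perReplicaConstraint) bool {
	for _, a := range storeAttrs {
		if a == c.attr {
			return true
		}
	}
	return false
}

// necessary reports whether some constraint that this replica matches would no
// longer have enough matching replicas without it.
func necessary(replica []string, others [][]string, constraints []perReplicaConstraint) bool {
	for _, c := range constraints {
		if !matches(replica, c) {
			continue
		}
		matching := 0
		for _, o := range others {
			if matches(o, c) {
				matching++
			}
		}
		if matching < c.replicas {
			return true // without this replica, the constraint goes unsatisfied
		}
	}
	return false
}

func main() {
	constraints := []perReplicaConstraint{
		{attr: "region=west", replicas: 1},
		{attr: "region=east", replicas: 1},
	}
	west := []string{"region=west"}
	east := []string{"region=east"}
	central := []string{"region=central"}
	fmt.Println(necessary(west, [][]string{east, central}, constraints)) // true: the only west replica
	fmt.Println(necessary(central, [][]string{west, east}, constraints)) // false: matches no constraint
}
```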
#### Lease transfer decisions

For the most part, deciding whether to transfer a lease is a fairly
straightforward decision based on whether the current leaseholder node is in a
draining state and on the lease counts on all the stores holding replicas for a
range. The more complex logic is related to the follow-the-workload
functionality that kicks in if-and-only-if the various nodes holding replicas
are in different localities. The logic involved here is better explained in the
[original RFC](../RFCS/20170125_leaseholder_locality.md) than I could do in less
space here. The logic has not meaningfully changed since the original
design/implementation.

### Store Rebalancer

As of v2.1, Cockroach also includes a separate control loop on each store called
the `StoreRebalancer`. The `StoreRebalancer` exists because we found in [#26059]
that an uneven balance of load on each node was causing serious performance
problems when attempting to run TPC-C at large scale without using partitioning.
Ensuring that each node had a more even balance of work to do was experimentally
found to allow significantly higher and smoother performance.

The `StoreRebalancer` takes a somewhat different approach to rebalancing,
though. While the `ReplicateQueue` iterates over each replica one at a time,
deciding whether the replica would be better off somewhere else, the
`StoreRebalancer` looks at the overall amount of load (`BatchRequest` QPS
specifically, although it could in theory consider other factors) on each store
and attempts to take action if the local store is overloaded relative to the
other stores in the cluster. This difference is important -- our previous
attempt to rebalance based on load was integrated into the replicate queue, and
it didn't work very well for at least three different reasons:

1. We bit off more than we could chew, trying to rebalance on too many different
   factors at once -- range count, keys written per second, and disk space used.
2. Keys written per second was the wrong metric, at least for TPC-C.
   Experimentation showed that the number of `BatchRequest`s being handled by a
   store per second was much more strongly correlated with a load imbalance than
   keys written per second.
3. Most importantly, the replicate queue only looks at one replica at a time. It
   may see that the load on each store is uneven, but it doesn't have a good way
   of knowing whether the replica in question would be a good one to move to try
   to even things out (if a particular range is relatively low in the metric
   we want to even out, it's intuitively a bad one to move). We did start
   gossiping quantiles in order to help determine which quantile a range fell in
   and thus whether it would be a good one to move, but this was still pretty
   imprecise.

The `StoreRebalancer` solves all these problems. It only focuses on QPS, and by
focusing on the store-level imbalance first and picking ranges to rebalance
later, it can choose ranges that are specifically high in QPS in order to have
the biggest influence on store-level balance with the smallest disruption to
range count (which the `ReplicateQueue` is still responsible for attempting to
even out). Ranges to rebalance are efficiently chosen because we have started
tracking a priority queue of the hottest ranges by QPS on each store. This queue
gets repopulated once a minute, when the existing loop that iterates over all
replicas to compute store-level stats does its thing. This list of hot ranges
can have other uses as well, such as powering debug endpoints for the admin UI
([#33336]).

Interpreting the exact details of how things work from the code should be pretty
straightforward; we attempt to move leases to resolve imbalances first, and only
resort to moving replicas around if moving leases was insufficient to resolve
the imbalance. There are some controls in place to avoid rebalancing when QPS is
too low to matter, to avoid messing with a range that's so hot that it
constitutes the majority of a node's QPS, to not bother moving ranges with too
little QPS to make a difference, and a few other such things.

The `StoreRebalancer` can be controlled by a cluster setting that either fully
turns it off, enables just lease rebalancing, or enables both lease and replica
rebalancing, which is the default.

For more details, see the original prototype ([#26608]) or the final
implementation ([#28340], [#28852]).
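The following sketch captures the store-level shape of that logic: given the
local store's QPS, the cluster mean, and the list of hottest ranges, keep
picking ranges to shed until the store is close enough to the mean. The names
are hypothetical and the controls mentioned above are reduced to two crude
inline checks with made-up thresholds:

```go
package main

import (
	"fmt"
	"sort"
)

// hotRange is a hypothetical entry in the per-store list of hottest ranges.
type hotRange struct {
	rangeID int
	qps     float64
}

// pickRangesToShed keeps choosing hot ranges until the local store's QPS would
// be within 10% of the cluster mean. Ranges that make up a huge fraction of the
// store's load, or that barely register, are skipped.
func pickRangesToShed(localQPS, meanQPS float64, hottest []hotRange) []hotRange {
	sort.Slice(hottest, func(i, j int) bool { return hottest[i].qps > hottest[j].qps })
	var picked []hotRange
	for _, r := range hottest {
		if localQPS <= meanQPS*1.1 {
			break // close enough to the mean; stop moving things
		}
		if r.qps > localQPS*0.5 || r.qps < meanQPS*0.01 {
			continue // too hot to move safely, or too cold to be worth moving
		}
		picked = append(picked, r)
		localQPS -= r.qps
	}
	return picked
}

func main() {
	hottest := []hotRange{{1, 900}, {2, 400}, {3, 250}, {4, 2}}
	// With 2000 QPS locally against a 1000 QPS mean, shedding r1 is enough.
	for _, r := range pickRangesToShed(2000, 1000, hottest) {
		fmt.Printf("would move the lease or replicas for r%d (%.0f QPS)\n", r.rangeID, r.qps)
	}
}
```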
### Other details

Before removing a replica or transferring a lease, we need to take the raft
status of the various existing replicas into account. This is important to avoid
temporary unavailability.

For example, if you transfer the lease for a range to a replica that is way
behind in processing its raft log, it will take some time before that replica
gets around to processing the command which transferred the lease to it, and it
won't be able to serve any requests until it does so.

Or when considering which replica to remove from a range, we must take care not
to remove a replica that is critical for the range's quorum. If only 3 replicas
out of 5 are caught up with the raft leader's state, we can't remove any of
those 3, but can safely remove either of the other 2.

Note that it's possible that the raft state of the underlying replicas changes
between when we do this check and when the actual transfer/removal takes place,
so it isn't a foolproof protection, but the window of risk is very small and we
haven't noticed it being a problem in practice.
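A minimal sketch of the quorum-safety reasoning from the 3-of-5 example above.
The types are hypothetical, and the real check consults the actual raft
progress state rather than a single boolean per replica:

```go
package main

import "fmt"

// replicaStatus is a simplified view of what matters for the safety check:
// whether each replica is caught up on the raft log.
type replicaStatus struct {
	storeID  int
	caughtUp bool
}

// safeToRemove reports whether removing the given replica still leaves a
// quorum of the remaining replicas caught up.
func safeToRemove(replicas []replicaStatus, remove int) bool {
	caughtUp, total := 0, 0
	for _, r := range replicas {
		if r.storeID == remove {
			continue
		}
		total++
		if r.caughtUp {
			caughtUp++
		}
	}
	quorum := total/2 + 1
	return caughtUp >= quorum
}

func main() {
	// 3 of 5 replicas are caught up: removing a behind replica is fine, but
	// removing a caught-up one would drop below quorum of the resulting set.
	replicas := []replicaStatus{
		{1, true}, {2, true}, {3, true}, {4, false}, {5, false},
	}
	fmt.Println(safeToRemove(replicas, 5)) // true: 3 of the remaining 4 are caught up
	fmt.Println(safeToRemove(replicas, 1)) // false: only 2 of the remaining 4 are caught up
}
```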
## Known issues

* Rebalancing isn't atomic, meaning that adding a new replica and removing the
  replica it replaces is done as two separate steps rather than just one. This
  leaves room for locality failures between the two steps to cause
  unavailability ([#12768]). For example, if a range has replicas in localities
  `a`, `b`, and `c`, and wants to rebalance to a different store in `a`, there
  will be a short period of time in which 2 of the range's 4 replicas are in
  `a`. If `a` goes down before one of them is removed, the range will be
  without a quorum until `a` comes back up.
* Rebalancing doesn't work well with multiple stores per node because we want to
  avoid ever putting multiple replicas of the same range on the same node
  ([#6782]). This has never been a deal breaker for anyone AFAIK, but
  occasionally annoys a user or two.
* `RelocateRange` is flaky in v2.2-alpha versions because we now immediately put
  a range through the replicate queue when a new lease is acquired on it
  ([#31287]). It may fail to complete its desired changes successfully due to
  racing with changes proposed by the new leaseholder.
* `RelocateRange` (and consequently the `StoreRebalancer` as a whole) doesn't
  populate any useful information into the `system.rangelog` table, which has
  traditionally been the best way to debug rebalancing decisions after the
  fact ([#34130]).

[#6782]: https://github.com/cockroachdb/cockroach/issues/6782
[#12768]: https://github.com/cockroachdb/cockroach/issues/12768
[#17979]: https://github.com/cockroachdb/cockroach/issues/17979
[#19355]: https://github.com/cockroachdb/cockroach/issues/19355
[#19985]: https://github.com/cockroachdb/cockroach/issues/19985
[#22819]: https://github.com/cockroachdb/cockroach/pull/22819
[#25392]: https://github.com/cockroachdb/cockroach/issues/25392
[#26059]: https://github.com/cockroachdb/cockroach/issues/26059
[#26608]: https://github.com/cockroachdb/cockroach/pull/26608
[#27349]: https://github.com/cockroachdb/cockroach/pull/27349
[#28340]: https://github.com/cockroachdb/cockroach/pull/28340
[#28852]: https://github.com/cockroachdb/cockroach/pull/28852
[#28875]: https://github.com/cockroachdb/cockroach/pull/28875
[#30441]: https://github.com/cockroachdb/cockroach/pull/30441
[#31287]: https://github.com/cockroachdb/cockroach/issues/31287
[#32949]: https://github.com/cockroachdb/cockroach/pull/32949
[#33336]: https://github.com/cockroachdb/cockroach/pull/33336
[#34126]: https://github.com/cockroachdb/cockroach/pull/34126
[#34130]: https://github.com/cockroachdb/cockroach/issues/34130
[per-replica constraints]: https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html#scope-of-constraints