- Feature Name: Rebalancing plans for 1.1
- Status: in-progress
- Start Date: 2017-06-02
- Authors: Alex Robinson
- RFC PR: [#16296](https://github.com/cockroachdb/cockroach/pull/16296)
- Cockroach Issue:
  - [#12996](https://github.com/cockroachdb/cockroach/issues/12996)
  - [#15988](https://github.com/cockroachdb/cockroach/issues/15988)
  - [#17979](https://github.com/cockroachdb/cockroach/issues/17979)

# Summary

Lay out plans for which rebalancing improvements to make (or not make) in the
1.1 release and designs for how to implement them.

# Background / Motivation

We’ve made a couple of efforts over the past year to improve the balance of
[replicas](20160503_rebalancing_v2.md) and
[leases](20170125_leaseholder_locality.md) across a cluster, but our balancing
algorithms still don’t take into account everything that a user might care
about balancing within their cluster. This document puts forth plans for what
we’ll work on with respect to rebalancing during the 1.1 release cycle. In
particular, four different improvements have been proposed.

## Balancing disk capacity, not just number of ranges ("size-based rebalancing")

Our existing rebalancing heuristics only consider the number of ranges on each
node, not the number of bytes, effectively assuming that all ranges are the
same size. This is a flawed assumption -- a large number of empty ranges can be
created when a user drops/truncates a table or runs a restore from backup that
fails to finish. Not considering the size of the ranges in rebalancing can lead
to some nodes containing far more data than others.

## Balancing request load, not just number of ranges ("load-based rebalancing")

Similarly, the rebalancing heuristics do not consider the amount of load on
each node when making placement decisions. While this works great for some of
our load generators (e.g. kv), it can cause problems with others like ycsb and
with many real-world workloads if many of the most popular ranges end up on the
same node. When deciding whether to move a given range, we should consider how
much load is on that range and on each of the candidate nodes.

## Moving replicas closer to where their load is coming from ("load-based replica locality")

For the 1.0 release, [we added lease transfer
heuristics](20170125_leaseholder_locality.md) that move leases closer to where
requests are coming from in high-latency environments. It’s easy to imagine a
similar heuristic for moving ranges -- if a lot of requests for a range are
coming from a locality that doesn’t have a replica of the range, then we should
add a replica there. That will then enable the lease-transferring heuristics to
transfer the lease there if appropriate, reducing the latency to access the
range.

## Splitting ranges based on load ("load-based splitting")

A single hot range can become a bottleneck. We currently only split ranges when
they hit a size threshold, meaning that all of a cluster’s load could be to a
single range and we wouldn’t do anything about it, even if there are other
nodes in the cluster (that don’t contain the hot range) that are idle. While
splitting decisions may seem somewhat separate from rebalancing decisions, in
some situations splitting a hot range would allow us to more evenly distribute
the load across the cluster by rebalancing one of the halves.

This is so important for performance that we already support manually
introducing range splits, but an automated approach would be more appropriate
as a permanent solution.

# Detailed Design

## Balancing based on multiple factors

Currently when we’re scoring a potential replica rebalance, we only have to
consider the relevant zone config settings and the number of replicas on each
store. This allows us to effectively treat all replicas as if they’re exactly
the same. Adding in factors like the size of the range and the QPS to a range
invalidates that assumption, and forces us to consider how a replica differs
from the typical replica along each dimension. For example, if node 1 has fewer
replicas than node 2 but more bytes stored on it, then we might be willing to
move a big replica from node 1 to node 2 or a small replica from node 2 to
node 1, but wouldn’t want the inverse moves.

Thus, in addition to knowing the size or QPS of the particular range we’re
considering rebalancing, we’ll also want some idea of the distribution of size
or QPS per range for the replicas in a store. This will mean periodically
iterating over all the replicas in a store to aggregate statistics so that we
can know whether a range is larger/smaller than others or has more/less QPS
than others. Specifically, we'll try computing a few percentiles to help pick
out the true outliers that would have the greatest effect on the cluster's
balance.

We can then compute rebalance scores by considering the percentiles of a
replica and the under/over-fullness of stores amongst all the considered
dimensions. We will prefer moving replicas at high percentiles away from stores
that are overfull for that dimension toward stores that are less full for that
dimension (and vice versa for low percentiles and underfull stores, under the
expectation that the removed replicas can be replaced by higher-percentile
replicas). The more extreme a given percentile and under/over-fullness are, the
more weight we give to that dimension. These heuristics will allow us to
combine the different dimensions into a single final score, and should be
covered by a large number of test cases to ensure stability in different
scenarios.
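To make the combination a bit more concrete, here is a minimal sketch of how
per-dimension percentiles and store fullness could be folded into one score.
Everything here (`dimension`, `rebalanceScore`, the normalization choices) is
an illustrative assumption, not existing allocator code:

```go
package rebalance

// dimension captures, for one balancing signal (range count, bytes, or QPS),
// where a replica and a candidate store fall relative to their peers.
type dimension struct {
	// replicaPercentile is where this replica sits among the store's
	// replicas for this signal (0.0 = smallest, 1.0 = largest).
	replicaPercentile float64
	// storeFullness is the store's deviation from the cluster mean for
	// this signal, normalized so 0 is exactly at the mean, positive is
	// overfull, and negative is underfull.
	storeFullness float64
}

// rebalanceScore combines all dimensions into one number. Positive values
// mean "prefer moving this replica off this store"; negative values mean
// "prefer keeping it (or adding one) here".
func rebalanceScore(dims []dimension) float64 {
	var score float64
	for _, d := range dims {
		// Center the percentile so outliers dominate: +0.5 for the
		// store's largest replica, -0.5 for its smallest.
		outlierness := d.replicaPercentile - 0.5
		// A high-percentile replica on an overfull store (both terms
		// positive) or a low-percentile replica on an underfull store
		// (both negative) adds "move it away" pressure; the more extreme
		// the percentile and the fullness, the larger this dimension's
		// contribution.
		score += outlierness * d.storeFullness
	}
	return score
}
```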
## Size-based rebalancing

Taking size into account seems like the simplest modification of our existing
rebalancing logic, but even so there are a variety of available approaches:

1. We already gossip each store’s total disk capacity and unused disk capacity.
   We could start trying to balance unused disk capacity across all the nodes
   of the cluster. That would mean that in the case of heterogeneous disk
   sizes, nodes with smaller disks might not get much (if any) data rebalanced
   to them if the cluster doesn’t have much data.

2. We could try to balance used disk capacity (i.e. total - unused). In
   heterogeneous clusters, this would mean that some nodes would fill up way
   before others (and potentially way before the cluster fills up as a whole).
   Situations in which some nodes but not others are full are not regularly
   tested yet, so we may have to start testing them if we go this way.

3. We could try to balance the fraction of the disk used. This is the happy
   compromise between the previous two options -- it will put data onto nodes
   with smaller disks right from the beginning (albeit less data), and it
   shouldn’t often lead to smaller nodes filling up way before others.

The first option most directly parallels our existing logic that only attempts
to balance the number of replicas without considering the size of each node’s
disk, but the third option appears best overall. It’s likely that we’ll want to
change the replica logic as part of this work to take disk size into account,
such that we’ll balance replicas per GB of disk rather than absolute number of
replicas.
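As a rough illustration of option 3, the signal to balance would be each
store's fraction of disk used, compared against the cluster-wide mean. This is
only a sketch; the field names do not match the actual gossiped capacity proto:

```go
package rebalance

// storeCapacity holds the disk numbers we already gossip per store.
// Field names are illustrative.
type storeCapacity struct {
	capacityBytes  int64 // total disk size
	availableBytes int64 // unused disk space
}

// fractionUsed is the signal option 3 would balance on.
func fractionUsed(c storeCapacity) float64 {
	if c.capacityBytes == 0 {
		return 0
	}
	return float64(c.capacityBytes-c.availableBytes) / float64(c.capacityBytes)
}

// overfullByFraction reports whether a store is using a noticeably larger
// fraction of its disk than the cluster average, making it a candidate to
// shed replicas regardless of how big its disk is.
func overfullByFraction(c storeCapacity, clusterMeanFraction, threshold float64) bool {
	return fractionUsed(c) > clusterMeanFraction+threshold
}
```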
## Load-based rebalancing

As part of our [leaseholder locality](20170125_leaseholder_locality.md) work,
we started tracking how many requests each range’s leaseholder receives. This
gives us a QPS number for each leaseholder replica, but no data for replicas
that aren’t leaseholders. If we left things this way, our replica rebalancing
would suddenly take a dependency on the cluster’s current distribution of
leaseholders, which is a scary thought given that leaseholder rebalancing
conceptually already depends on replica rebalancing (because it can only
balance leases to where the replicas are). As a result, I think we’ll want to
start tracking the number of applied commands on each replica instead of
relying on the existing leaseholder QPS.

Once we have that per-replica QPS, though, we can aggregate it at the store
level and start including it in the store’s capacity gossip messages to use it
in balancing much like disk space.
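A minimal sketch of the bookkeeping this implies, using hypothetical
`replicaLoad` and `storeCommandRate` helpers (none of which correspond to
existing code): each replica counts the commands it applies over a window, and
the store sums the resulting rates into the value it gossips.

```go
package rebalance

import (
	"sync/atomic"
	"time"
)

// replicaLoad is a hypothetical per-replica counter that every replica bumps
// as it applies a Raft command, whether or not it holds the lease.
type replicaLoad struct {
	appliedCommands int64 // updated atomically
	windowStart     time.Time
}

// recordApplied is called once per applied command.
func (l *replicaLoad) recordApplied() {
	atomic.AddInt64(&l.appliedCommands, 1)
}

// rate returns commands applied per second over the current window.
func (l *replicaLoad) rate(now time.Time) float64 {
	elapsed := now.Sub(l.windowStart).Seconds()
	if elapsed <= 0 {
		return 0
	}
	return float64(atomic.LoadInt64(&l.appliedCommands)) / elapsed
}

// storeCommandRate aggregates per-replica rates into the single number a
// store would include in its capacity gossip message.
func storeCommandRate(replicas []*replicaLoad, now time.Time) float64 {
	var total float64
	for _, r := range replicas {
		total += r.rate(now)
	}
	return total
}
```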
## Load-based replica locality

This is where things get tricky -- while the above goals are about bringing the
cluster into greater balance, trying to move replicas toward the load is likely
to reduce the balance within the cluster. Reducing the thrashing involved in
the leaseholder locality project was quite a lot of work and still isn’t
resilient to certain configurations. When we’re talking about moving replicas
rather than just transferring leases, the cost of thrashing skyrockets because
snapshots consume a lot of disk/network bandwidth.

This also conflicts with one of our design goals from [the original rebalancing
RFC](20150819_stateless_replica_relocation.md), which is that the decision to
make any individual operation should be stateless. Because the counts of
requests by locality are only tracked on the leaseholder, these types of
decisions are inherently stateful, so we should tread carefully in making them.

In the interest of not creating problem cases for users, I’d suggest pushing
this back until we have known demand for it. Custom zone configs paired with
leaseholder locality already do a pretty good job of enabling low-latency
access to data.

## Load-based splitting

Load-based splitting is conceptually pretty simple, but will likely produce
some edge cases in practice. Consider a few representative examples:

1. A range gets a lot of requests for single keys, evenly distributed over the
   range. Splitting will help a lot.

2. A range gets a lot of requests for just a couple of individual keys (and
   the hot requests don't touch multiple hot keys in the same query, a la case
   4). Splitting will help if and only if the split is between the hot keys.

3. A range gets a lot of requests for just a single key. Splitting won’t help
   at all.

4. A range gets a lot of scan requests or other requests that touch multiple
   keys. Splitting could actually make things worse by flipping an operation
   from a single-range operation into a multi-range one.

Given these possibilities, it’s clear that we’re going to need more granular
information than how many requests a range is receiving in order to decide
whether to split a range. What we really need is something that will keep track
of the hottest keys (or key spans) in the hottest ranges. This is basically a
streaming top-k problem, and there are plenty of algorithms that have been
written about that should work for us given that we only need approximate
results.

It’s also worth noting that we’ll only need such stats for ranges that have a
high enough QPS to justify splitting. Thus, our approach will look something
like:

1. Track the QPS to each leaseholder (which we’re already doing as of
   [#13426](https://github.com/cockroachdb/cockroach/pull/13426)).

2. If a given range’s QPS is abnormally high (by virtue of comparing to the
   other ranges), start recording the approximate top-k key spans.
   Correspondingly, if a range's QPS drops back down and we had been tracking
   its top-k key spans, we should notice this and stop.

3. Periodically check the top key spans for these top ranges and determine if
   splitting would allow for better distributing the load without creating too
   many more multi-range operations. Picking a split point and determining
   whether it'd be beneficial to split there could be done by sorting the top
   key spans and, between each of them, comparing how many requests would be
   to spans that are to the left of, to the right of, or overlapping that
   possible split point.

4. If a good split point was found, do the split.

5. Sit back and let load-based rebalancing do its thing.

This will take a bit of work to finish, and isn’t critical for 1.1, but would
be a nice addition and comes with much less downside risk than something like
load-based replica locality. We’ll try to get to it if we have the time;
otherwise we can implement it for 1.2.

### Alternatives

The approximate top-k approach to determining split points is fairly precise,
but also adds some fairly complex logic to the hot code path for serving
requests to replicas. A simpler alternative would be for us to do the following
for each hot range (see the sketch below):

1. Pick a possible split point (the mid-point of the range to start with).

1. For each incoming request to the hot replica, record whether the request is
   to the left side, the right side, or both.

1. After a while, examine the results. If most of the requests touched both
   sides, abandon trying to split the range. If most of the requests were split
   pretty evenly between left and right, make the split at the tested key. If
   the results were pretty uneven, try moving the possible split point in the
   direction that received more requests and try again, a la binary search.
   After O(log n) possible split points, we'll either find a decent split point
   or determine that there isn't an equitable one (because the requests are
   mostly to a single key).

In fact, even if we do use a top-k approach, testing out the split point like
this before making the split might still be smart to ensure that all of the
spans that weren't included in the top-k aren't touching both sides of the
split.
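Here is a minimal sketch of that probing approach, built around a hypothetical
`splitProbe` type. The real key types and request spans are more involved; this
only illustrates the left/right/both counting and the decision step:

```go
package rebalance

import "bytes"

// splitProbe counts, for one candidate split key, how many requests fall
// entirely to its left, entirely to its right, or touch both sides.
type splitProbe struct {
	splitKey []byte
	left     int64
	right    int64
	both     int64
}

// record classifies a request covering the key span [start, end).
func (p *splitProbe) record(start, end []byte) {
	switch {
	case bytes.Compare(end, p.splitKey) <= 0:
		p.left++
	case bytes.Compare(start, p.splitKey) >= 0:
		p.right++
	default:
		p.both++
	}
}

// evaluate implements the decision step described above: give up if too many
// requests straddle the candidate split point, accept it if the two sides are
// roughly even, and otherwise report which direction to move for the next
// round of the binary search.
func (p *splitProbe) evaluate() (accept, moveRight, giveUp bool) {
	total := p.left + p.right + p.both
	if total == 0 {
		// No data yet; keep probing at the current key.
		return false, false, false
	}
	if float64(p.both)/float64(total) > 0.5 {
		// Most requests touch both sides; splitting here would mostly
		// turn single-range operations into multi-range ones.
		return false, false, true
	}
	leftFraction := float64(p.left) / float64(p.left+p.right)
	if leftFraction > 0.4 && leftFraction < 0.6 {
		// Close enough to even; split at the tested key.
		return true, false, false
	}
	// Move the candidate key toward the hotter side and try again.
	return false, p.right > p.left, false
}
```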
Finally, the simplest alternative of all (proposed by bdarnell on #16296) is
to not do load-based splitting at all, and instead just split more eagerly for
tables with a small number of ranges (where "small" could reasonably be defined
as "less than the number of nodes in the cluster"). This wouldn't help with
steady-state load at all, but it would help with the arguably more common
scenario of a "big bang" of data growth when a service launches or during a
bulk load of data.

### Drawbacks

Splitting ranges based on load could, for certain request patterns, lead to a
large build-up of small ranges that don't receive traffic anymore. For example,
if a table's primary keys are ordered by timestamp, and newer rows are more
popular than old rows, it's very possible that newer parts of the table could
get split based on load but then remain small forever even though they don't
receive much traffic anymore.

This won't cripple the cluster, but is less than ideal. Merge support is being
tracked in [#2433](https://github.com/cockroachdb/cockroach/issues/2433).