- Feature Name: Leaseholder Locality ("Leases follow the workload")
- Status: completed
- Start Date: 2017-01-25
- Authors: Alex Robinson
- RFC PR: [#13233](https://github.com/cockroachdb/cockroach/pull/13233)
- Cockroach Issue: [#13232](https://github.com/cockroachdb/cockroach/issues/13232)

# Summary

Enable range leaseholders to move closer to their clients, which reduces the
latency of KV requests when replicas are far apart.

# Motivation

The primary motivation for moving leaseholders around based on where their
requests are coming from is to reduce the latency of those requests. When a
request for a given key is received by a gateway node, the node must forward
the request to the leaseholder for the key's range. This isn't a big deal if
all of a cluster's nodes are near each other in the network, but if the
leaseholder happens to be halfway around the world from where the request
originated, the network round trip added to the request latency can be quite
high.

This affects both reads and writes. For reads, getting the request to and from
the leaseholder dominates the request latency in widely distributed clusters,
since the leaseholder can serve reads without talking to other replicas. For
writes, even though raft commands will incur an additional round trip between
replicas, removing the round trip to the leaseholder could nearly halve the
total request latency.

While there are typically multiple gateway nodes accessing a range, and they
won't necessarily all be in the same locality, our goal is to minimize the
aggregate RTT cost of accessing a given range by properly placing its
leaseholder.

We believe there are usage patterns that would benefit greatly from better
leaseholder placement. Consider a system that spans datacenters all around the
world. When it's daytime in Asia/Australia, the datacenter(s) there will be
receiving most of the requests. As time passes, more of the requests will
start to originate from Europe, and later on from the Americas. If the
leaseholder for a range is always in an Asian datacenter, then the latency of
accessing that range will be much worse when most of its requests come from
elsewhere. This is where the idea of the lease "following the workload" or
"following the sun" comes from.

Finally, it's worth noting that this goal is somewhat at odds with our desire
to distribute load evenly throughout a cluster (e.g. via [range
rebalancing](20160503_rebalancing_v2.md) and [leaseholder
rebalancing](20161026_leaseholder_rebalancing.md)). In fact, this goal was
specifically pushed off when the initial form of leaseholder rebalancing was
designed and implemented. Placing the leaseholders near their most common
gateways may lower total throughput if it maxes out what the local machines
can do while leaving the machines in other datacenters underutilized,
particularly if the latency between datacenters is small. There is a fine
balance to be kept to avoid minimizing latency at the expense of too much
throughput and vice versa.

# Detailed design

Given that a lease transfer mechanism already exists, the remaining difficulty
lies in deciding when to transfer leases.
We [already have
logic](20161026_leaseholder_rebalancing.md) that decides when to transfer
leases with the goal of ensuring each node has approximately the same number
of leases. Anything we add to make leases follow the workload will need to
play nicely with that existing goal.

## Tracking request origins

In order to have any idea where to rebalance leases to, we first need to track
where the requests for a given range are coming from. To that end, we can add
request origin information to the `Header` that's included in all
`BatchRequest`s.

When tracking the origin of requests, we don't just care about the individual
node that a request came from, but about its locality. To understand this,
consider a cluster with nodes 1-3 in datacenter 1, nodes 4-6 in datacenter 2,
and nodes 7-9 in datacenter 3. Range x in this cluster has replicas on nodes
1, 4, and 7, and the current leaseholder is node 1. If the leaseholder is
receiving a lot of its requests from nodes 8 and 9, then we may want to move
the lease to node 7 even if node 7 itself isn't sending the range much
traffic, because node 7 shares a locality with the origin of the requests.

Luckily, each node already has a `NodeDescriptor` for every other node in the
cluster, and the `NodeDescriptor` proto contains its node's locality
information. Thus, all we need to add to the `BatchRequest` `Header` proto is
a `gateway_node_id` field that gets filled in by the client kv transport.
While the client transport will typically fill this field in with its own ID,
it can also be spoofed when appropriate, such as by DistSQL when a node that
wasn't actually the gateway for a given request makes KV requests on behalf of
the real gateway.

### Alternatives for tracking request origins

Given that we know the IP address of each node in the cluster, we could
potentially try to skip adding the `gateway_node_id` field to each
`BatchRequest` and just rely on the source IP address. That would work in most
cases, but could break down when nodes are communicating with each other via a
load balancer or proxy, without saving much (adding an int to each
`BatchRequest` should have a negligible effect on request size).

We could alternatively include all the locality tags from the source node in
each request, which would eliminate the need to look up the locality of each
node when making decisions based on the collected data. However, this requires
much more data to be sent over the wire and stored per-range in the system,
effectively duplicating the work already done by the gossiping and storage of
node descriptors. It would be reasonable to take this approach -- it would
save an integer's worth of data per request in the case that the nodes in a
cluster don't have any locality labels -- but it doesn't simplify enough to
make up for its added cost.

## Recording request origins

Each leaseholder replica will maintain a map from locality to the number of
requests received from that locality. Ideally, the request counts would decay
exponentially over time. If this proves too difficult to implement efficiently
(e.g. in a cluster with requests coming from tens of localities), we could opt
for swapping out the map with a new map periodically, such as when we examine
it to decide whether to transfer the replica's lease.

If we go with the latter approach, we should also maintain the time at which
we last cleared out the map of request counts so that we can determine the
rate of requests to the range. Understanding the load on each range will help
us prioritize ranges for transfer (and help us balance the load on each node,
even if that isn't the immediate goal of this work).

For the purposes of optimizing lease locality, we'll count each `BatchRequest`
as a single request. For tracking the load on each replica, we'll likely want
to measure something a little fancier, such as the number of non-noop
subrequests in the `BatchRequest` or the number of KVs scanned by the request.
Those would be better estimates of load, but for locality we want to focus on
the number of requests suffering from the large network RTT.
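To make the swap-based option concrete, here's a minimal Go sketch of a
per-replica request-origin tracker. All of the names here (`originTracker`,
`record`, `snapshotAndReset`, the `locality` package) are hypothetical
placeholders rather than names from the actual implementation; the point is
just to show the locality-keyed counts and the reset timestamp living side by
side so that a request rate can be derived whenever the counts are examined.

```go
package locality

import (
	"sync"
	"time"
)

// originTracker is a hypothetical per-replica structure that counts requests
// by the locality of their gateway node. It implements the "swap the map out
// periodically" option described above.
type originTracker struct {
	mu        sync.Mutex
	counts    map[string]int64 // origin locality -> request count
	lastReset time.Time        // when the counts were last cleared
}

func newOriginTracker() *originTracker {
	return &originTracker{
		counts:    make(map[string]int64),
		lastReset: time.Now(),
	}
}

// record is called once per BatchRequest with the locality of the request's
// gateway node, looked up from the gossiped NodeDescriptor matching the
// header's gateway_node_id.
func (t *originTracker) record(locality string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts[locality]++
}

// snapshotAndReset returns the accumulated counts and the duration they
// cover, then starts a fresh measurement period. Dividing the total count by
// the returned duration gives the range's recent request rate.
func (t *originTracker) snapshotAndReset() (map[string]int64, time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	counts, elapsed := t.counts, time.Since(t.lastReset)
	t.counts = make(map[string]int64)
	t.lastReset = time.Now()
	return counts, elapsed
}
```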
## Making lease transfer decisions

Much of the infrastructure needed for making decisions about when to transfer
leases was [already added](https://github.com/cockroachdb/cockroach/pull/10464)
as part of the initial lease balancing work, so all we need to do is start
using request locality as an additional input to the decision-making process.
The difficult problems that we really need to solve are:

* Prioritizing transferring leases for the replicas that are receiving the
  greatest number of requests from distant localities.
* Avoiding thrashing if the origins of requests change frequently.
* Finding the right balance between keeping a similar number of leases on each
  node and moving all the leases to where the most traffic is.

### Prioritizing ranges with the most cross-DC traffic

If a cluster really is spread over datacenters around the world, it's likely
that most of a cluster's ranges will be getting most of their traffic from the
same locality. If this is the case, the most bang for the buck would come from
moving the leases for ranges that are receiving the most requests and whose
request distributions are most heavily skewed toward that locality. To that
end, we may want to periodically record stats about the rate of requests to
the ranges on a store and how many of those requests are from distant
localities. These would be much like the counts we already calculate of the
number of ranges and leases on a store. However, unlike those stats, we may
not want to start gossiping these as part of the store descriptor until we
have a concrete use for them. That will likely come once we start getting
smarter about determining just how much load a node can handle.

Given these stats, we can make decisions per-replica by only transferring
leases for replicas whose requests skew more strongly toward distant
localities, as sketched below.

Of course, there is a risk that moving all the hottest ranges to the same
locality could have a worse effect on throughput if those nodes get
overloaded. We'll have to test the heuristics carefully to tune them and
determine when this could become a problem. It's very possible that we'll have
to expose a tuning knob for this to give users more control over the tradeoff.
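As a rough illustration of how that prioritization might work, the sketch
below scores each replica by the rate of requests arriving from localities
other than the leaseholder's own and sorts candidates accordingly. The types
and functions (`replicaStats`, `crossLocalityScore`, `rankTransferCandidates`)
are hypothetical; the real heuristic and any thresholds would be worked out
through testing.

```go
package locality

import (
	"sort"
	"time"
)

// replicaStats is a hypothetical summary of one leaseholder replica's recent
// traffic, derived from the per-locality request counts described above.
type replicaStats struct {
	rangeID         int64
	leaseholderLoc  string           // locality of the current leaseholder
	countByLocality map[string]int64 // requests per origin locality
	measured        time.Duration    // how long the counts were collected for
}

// crossLocalityScore estimates how much a replica stands to gain from a
// lease transfer: the rate (requests per second) of traffic arriving from
// localities other than the leaseholder's own.
func crossLocalityScore(s replicaStats) float64 {
	var remote int64
	for loc, n := range s.countByLocality {
		if loc != s.leaseholderLoc {
			remote += n
		}
	}
	if s.measured <= 0 {
		return 0
	}
	return float64(remote) / s.measured.Seconds()
}

// rankTransferCandidates orders replicas so that the ones suffering the most
// cross-locality traffic are considered for lease transfers first.
func rankTransferCandidates(replicas []replicaStats) {
	sort.Slice(replicas, func(i, j int) bool {
		return crossLocalityScore(replicas[i]) > crossLocalityScore(replicas[j])
	})
}
```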
### Avoiding thrashing

In order to avoid thrashing of leases, we can partially reuse the mechanisms
already in place for avoiding lease thrashing with respect to leaseholder
balance, particularly the rate-limiting of lease transfers
([#11729](https://github.com/cockroachdb/cockroach/pull/11729)).

Additionally, we can learn from both lease and replica rebalancing that there
needs to be a wide range of configurations in which no action is taken --
e.g. a node with 4% more ranges than the mean won't bother transferring a
range to a node with 4% fewer ranges than the mean. We'll need a similar
cushion in our heuristic such that we only transfer leases away if there's a
significant difference in the number of requests coming from the different
localities.

Along the same lines, we shouldn't make hasty decisions. Measurements of
request distributions are less reliable if we only make them over short time
windows. We'll need to pick a reasonable minimum duration of time to measure
before transferring leases. This will likely be in the tens of minutes --
enough time to get good data, but not so long that we fail to react when
traffic patterns change. An alternative would be to factor in the number of
requests, since a 5 minute measurement with a million samples is much more
trustworthy than a 20 minute measurement with a thousand samples.
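The cushion and the minimum-measurement idea could be combined into a single
gate that runs before any locality-based transfer is even evaluated. The
sketch below is purely illustrative; the constants (`minMeasureDuration`,
`minRequestSamples`, `significantSkew`) and the function name are hypothetical
placeholders whose real values and shape would be chosen through testing.

```go
package locality

import "time"

// Hypothetical tuning constants; the real values would be settled by testing.
const (
	minMeasureDuration = 10 * time.Minute // minimum measurement window
	minRequestSamples  = 100000           // enough samples to trust a shorter window
	significantSkew    = 1.5              // remote locality must see 50% more requests
)

// shouldConsiderLocalityTransfer reports whether a replica's measurements are
// trustworthy enough, and the imbalance large enough, to bother evaluating a
// cross-locality lease transfer at all.
func shouldConsiderLocalityTransfer(
	measured time.Duration, counts map[string]int64, leaseholderLoc string,
) bool {
	var total, local, maxRemote int64
	for loc, n := range counts {
		total += n
		if loc == leaseholderLoc {
			local += n
		} else if n > maxRemote {
			maxRemote = n
		}
	}
	if total == 0 {
		return false
	}
	// Don't make hasty decisions: require either a long enough measurement
	// window or a large enough number of samples.
	if measured < minMeasureDuration && total < minRequestSamples {
		return false
	}
	// Only consider transferring away if the busiest remote locality sees
	// significantly more requests than the leaseholder's own locality.
	return float64(maxRemote) >= significantSkew*float64(local)
}
```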
### What to do if nodes don't have locality information

If nodes don't have any locality information attached to them, we lose most of
our ability to determine where nodes are located with respect to each other.
While we should make an effort to encourage all multi-datacenter deployments
to specify locality information (and can automate the assignment of locality
info in cloud environments), there will certainly be some deployments without
it. We have a few options in such cases:

* Fall back to operating exactly the same as today, not doing any load-based
  lease placement. We'll still be able to use the per-replica request counts
  once we start factoring load into balancing decisions.
* Fall back to a very limited version of load-based lease placement where we
  only move a lease based on request locality if a node holding one of the
  other replicas is forwarding the vast majority of traffic to the range.
* Calculate (and gossip) estimates of the latency between all pairs of nodes
  and use that to guess at the localities of nodes for the sake of
  rebalancing.

The last option can be put off to future optimization efforts (if ever). The
second option may be beneficial if a user sets up a multi-datacenter
deployment without locality info, but in the more common case of low-latency,
single-DC deployments it'd likely just make the lease balancing worse for no
real latency gain. We'll stick with the first option for now.

### Reconciling lease locality with lease balance

As mentioned above, our goals in this RFC come into fairly direct conflict
with the goals of the [lease rebalancing
RFC](20161026_leaseholder_rebalancing.md). If all the requests are coming from
a single locality, it would be ideal from a latency perspective for all the
leases to be there as well, so long as that wouldn't overload the nodes with
too much work. However, it's possible that putting all the leases into that
locality would overload the nodes such that overall cluster throughput (and
even latency) is worse than it would be if the leases were properly balanced.

Ideally, we would have some understanding of how fully utilized each node is
in terms of throughput. That measurement could be gossiped in the same way as
each node's storage utilization and used in allocation decisions. It's not
totally clear what metric to use for this, though, so for now we will leave it
out of allocation decisions (suggestions welcome -- perhaps a rather naive
utilization measurement can go a long way here). We could easily add a measure
of how much load a node has on it, but not of how close to fully utilized it
is.

Similarly, it'd be useful to have an estimate of the RTT between localities
when making allocation decisions. While we don't currently have this (as far
as I'm aware), measuring it wouldn't be very hard, so I suggest that we do so
and use it to help tune lease allocations. We'll have to make sure that this
can't lead to too much thrashing if different nodes come up with different
estimates, but re-measuring it periodically will help fight such issues and
will also have the nice benefit that the cluster will react accordingly if the
RTT between localities changes.

Thus, our heuristic for deciding when to transfer a lease will look something
like this:

* If the cluster doesn't have locality information, fall back to the existing
  behavior. The existing behavior is to transfer the lease if the current
  leaseholder node is considered overfull (based on the `rebalanceThreshold`
  constant, which is currently 5% of the mean number of leases) or if it has
  more than the mean number of leases and another replica is underfull.
  * Note that we can start using the new information on how many requests
    each replica has been serving lately to get an idea of how many requests
    each node is serving, and use that rather than just the number of
    leaseholders. That work is somewhat orthogonal to optimizing leaseholder
    locality, but is worth noting and working on soon.
* If the cluster does have locality information, then measure the RTT between
  localities. As the latency between localities increases, raise a new
  `interLocalityRebalanceThreshold` proportionately. This will affect the
  underfull/overfull calculations for leases when comparing replicas in
  different localities, but the normal `rebalanceThreshold` would still be
  used when comparing replicas within the same locality. The exact numbers for
  this can be worked out in testing, but it will make lease balancing less and
  less important until we eventually don't consider it at all for replicas in
  different localities. If it's legal to make a cross-locality transfer based
  on the nodes in question and the current `interLocalityRebalanceThreshold`,
  then we can begin considering whether it's worth it to make a transfer based
  on the distribution of requests to the replica, as sketched below.
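The second branch of that heuristic might scale the threshold with
inter-locality RTT along the lines of the sketch below. The shape of the
scaling, the constants, and the names other than `rebalanceThreshold`
(`interLocalityRebalanceThreshold`, `crossLocalityTransferLegal`, `baseRTT`)
are assumptions for illustration, not the actual implementation.

```go
package locality

import "time"

// rebalanceThreshold mirrors the existing lease-balancing constant: a node is
// considered over- or underfull if it deviates from the mean lease count by
// more than this fraction.
const rebalanceThreshold = 0.05

// interLocalityRebalanceThreshold relaxes the rebalance threshold for
// comparisons across localities: the higher the RTT between two localities,
// the more lease imbalance we tolerate before balance overrides locality.
func interLocalityRebalanceThreshold(rtt time.Duration) float64 {
	const baseRTT = 10 * time.Millisecond // at or below this, treat as local
	if rtt <= baseRTT {
		return rebalanceThreshold
	}
	// Grow the threshold in proportion to the RTT, capped at a value that
	// effectively stops considering lease balance for distant localities.
	scaled := rebalanceThreshold * float64(rtt) / float64(baseRTT)
	const ignoreBalance = 1.0 // 100% deviation: balance no longer matters
	if scaled > ignoreBalance {
		return ignoreBalance
	}
	return scaled
}

// crossLocalityTransferLegal reports whether transferring one more lease to a
// node with targetLeases leases is permitted given the mean lease count and
// the RTT between the source and target localities.
func crossLocalityTransferLegal(targetLeases, meanLeases float64, rtt time.Duration) bool {
	threshold := interLocalityRebalanceThreshold(rtt)
	// The transfer is allowed as long as it wouldn't push the target too far
	// above the mean, where "too far" grows with the inter-locality RTT.
	return targetLeases+1 <= meanLeases*(1+threshold)
}
```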
# Testing

Much like how `allocsim` has been useful for testing the lease rebalancing
heuristics, it would be very nice to have a repeatable tool for testing this
as well. However, whereas the only real variable when testing lease
rebalancing was the relevant code and the only real output was the number of
leases held by each node, we now have multiple inputs (the code, the latency
between nodes, the locality labels) and multiple outputs (the number of leases
held by each node, the distribution of request latency, and the request
throughput) to consider.

In order to better test this new functionality, I propose:

* Adding a testing knob to simulate additional latency between nodes (see the
  sketch below).
* Extending `allocsim` to be able to set up different locality configurations
  with the new latency knob.
* Extending `allocsim` to send load against specific nodes/localities and
  measure the resulting throughput and latency.
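A minimal sketch of what the latency-simulation knob from the first item above
might look like, assuming a hypothetical hook that a test transport consults
before delivering each request; none of these names are taken from the actual
testing knobs.

```go
package locality

import "time"

// ArtificialLatencyKnob is a hypothetical testing knob that maps ordered
// pairs of localities to the extra one-way delay to inject between them.
type ArtificialLatencyKnob struct {
	// DelayByLocalityPair maps "fromLocality->toLocality" to the added delay,
	// e.g. {"us-east->eu-west": 70 * time.Millisecond}.
	DelayByLocalityPair map[string]time.Duration
}

// MaybeDelay sleeps for the configured artificial latency, if any, before a
// simulated request is delivered from one locality to another. A test
// transport would call this on every RPC it forwards.
func (k *ArtificialLatencyKnob) MaybeDelay(fromLocality, toLocality string) {
	if k == nil || fromLocality == toLocality {
		return
	}
	if d, ok := k.DelayByLocalityPair[fromLocality+"->"+toLocality]; ok {
		time.Sleep(d)
	}
}
```

With something like this in place, `allocsim` could construct a different
latency configuration per test scenario and compare the resulting lease
placements, latencies, and throughput.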
The output of `allocsim` will enable us to better understand how different
heuristics perform on different cluster configurations. We may be able to find
a heuristic that performs reasonably well across a variety of clusters, or we
may find that we have to introduce a tuning knob of some sort to shift the
balance one way or another. Either way, it's tough for us to know exactly
what'll work without a testing tool like this, so we'll rely heavily on it
when tuning the heuristic.

# Future Directions

* It would be helpful to expose information about balancing decisions in the
  UI so that users and developers can better understand what's happening in
  their clusters, which will be particularly important when bad decisions are
  made, causing performance dips.

* There could be situations in which the locality that's generating most of
  the requests to a range doesn't have a local replica for that range. While
  ideally cluster admins would construct `ZoneConfig` settings that make sense
  for their environments and workloads, we could potentially benefit from
  taking per-locality load into account when making replica placement
  decisions (not just lease placement decisions).

* As mentioned above, we should start using the recent request counts being
  added as part of this work to improve the existing replica and lease
  placement heuristics. This should be fairly straightforward; it'll just
  require some benchmarking. There's more discussion of this in the [future
  directions section of the leaseholder rebalancing
  RFC](20161026_leaseholder_rebalancing.md#future-directions).

* As mentioned above, it would be very beneficial if we came up with some way
  of measuring the true load on each node, either as a collection of
  measurements or as some combination of CPU utilization, memory pressure, and
  disk I/O utilization. This would make improving locality and balancing load
  significantly easier and more effective. Ideally we could even differentiate
  between load caused by being a leaseholder and load caused by being a
  follower replica.

* It's possible that this work will actually be harmful in certain odd cluster
  configurations. For example, if a range has two replicas in different parts
  of Europe and only a single replica in Australia, then writes will perform
  better if the lease is in Europe even if most of the range's requests are
  coming from Australia. This is because raft commands will commit much more
  quickly if proposed from one of the replicas in Europe. We may want to take
  into account the latency between the localities of all the different
  replicas to avoid such problems, but such cases aren't critical since
  they're not recommended configurations to begin with.

# Drawbacks

* This approach does not have any true safeguards against overloading nodes in
  the localities where most requests are coming from. We will prefer moving
  load there as long as the inter-locality RTT is high enough, regardless of
  how much load the nodes can handle. At a high level, this is an existing
  issue in our system, but intentionally imbalancing the leases will make the
  risk much worse (until we start factoring some measure of load percentage
  into these decisions). We're going to need some sort of flow control
  mechanism regardless of this change, and once we have it we can use it to
  help with these decisions.
* Relying on measurements of latency from each node to each other locality may
  lead to unexpected thrashing if nodes get drastically different
  measurements. Taking multiple measurements over time should help with this,
  but it's conceivable that certain networks could exhibit a persistent
  difference.
* Unless we decide to do a lot more work determining and gossiping latency
  information between all nodes, these optimizations won't be used if cluster
  admins don't add locality info to their nodes.

# Alternatives

* As mentioned above, we could track the origin of each request by sending the
  locality labels along with each request rather than just the source node ID,
  but that would add a lot of duplicated info to every request.
* Rather than considering it future work, it'd be beneficial if we could make
  allocation decisions based on the actual load on each node. Skipping right
  to that solution would be great (suggestions appreciated!).
* We may find it more effective to measure something other than the number of
  requests handled by a replica. For example, perhaps time spent processing
  requests for the replica or bytes returned in response to requests to the
  replica would be more accurate measurements of the load on the replica.
  These are things that we could potentially experiment with using `allocsim`
  to see if they provide better balance.

# Unresolved questions