- Feature Name: Leaseholder Locality ("Leases follow the workload")
- Status: completed
- Start Date: 2017-01-25
- Authors: Alex Robinson
- RFC PR: [#13233](https://github.com/cockroachdb/cockroach/pull/13233)
- Cockroach Issue: [#13232](https://github.com/cockroachdb/cockroach/issues/13232)

# Summary

Enable range leaseholders to move closer to their clients, which reduces the
latency of KV requests when replicas are far apart.

# Motivation

The primary motivation for moving leaseholders around based on where their
requests are coming from is to reduce the latency of those requests. When a
request for a given key is received by a gateway node, the node must forward
the request to the leaseholder for the key's range. This isn't a big deal if
all of a cluster's nodes are near each other in the network, but if the
leaseholder happens to be halfway around the world from where the request
originated, the network round trip added to the request latency can be quite
high.

This affects both reads and writes. For reads, getting the request to and from
the leaseholder dominates the request latency in widely distributed clusters,
since the leaseholder can serve reads without talking to other replicas. For
writes, even though raft commands will incur an additional round trip between
replicas, removing the round trip to the leaseholder could nearly halve the
total request latency.

While there are typically multiple gateway nodes accessing a range, and they
won't necessarily all be in the same locality, our goal is to minimize the
aggregate RTT cost of accessing a given range by properly placing its
leaseholder.

We believe there are usage patterns that would benefit greatly from better
leaseholder placement. Consider a system that spans datacenters all around the
world. When it's daytime in Asia/Australia, the datacenter(s) there will be
receiving most of the requests. As time passes, more of the requests will
start to originate from Europe, and later on from the Americas. If the
leaseholder for a range is always in an Asian datacenter, then the latency of
accessing that range will be much worse when most of its requests come from
elsewhere. This is where the idea of the lease "following the workload" or
"following the sun" comes from.

Finally, it's worth noting that this goal is somewhat at odds with our desire
to distribute load evenly throughout a cluster (e.g. via [range
rebalancing](20160503_rebalancing_v2.md) and [leaseholder
rebalancing](20161026_leaseholder_rebalancing.md)). In fact, this goal was
specifically pushed off when the initial form of leaseholder rebalancing was
designed and implemented. Placing the leaseholders near their most common
gateways may lower total throughput if it maxes out what the local machines
can do while leaving the machines in other datacenters underutilized,
particularly if the latency between datacenters is small. There is a fine
balance to be kept to avoid minimizing latency at the expense of too much
throughput and vice versa.

# Detailed design

Given that a lease transfer mechanism already exists, the remaining difficulty
lies in deciding when to transfer leases.
We [already have
logic](20161026_leaseholder_rebalancing.md) that decides when to transfer
leases with the goal of ensuring each node has approximately the same number
of leases. Anything we add to make leases follow the workload will need to
play nicely with that existing goal.

## Tracking request origins

In order to have any idea where to rebalance leases to, we first need to track
where the requests for a given range are coming from. To that end, we can add
request origin information to the `Header` that's included in all
`BatchRequest`s.

When tracking the origin of requests, we don't just care about the individual
node that a request came from, but about its locality. To understand this,
consider a cluster with nodes 1-3 in datacenter 1, nodes 4-6 in datacenter 2,
and nodes 7-9 in datacenter 3. Range x in this cluster has replicas on nodes
1, 4, and 7, and the current leaseholder is node 1. If the leaseholder is
receiving a lot of its requests from nodes 8 and 9, then we may want to move
the lease to node 7 even if node 7 itself isn't sending the range much
traffic, because node 7 shares a locality with the origin of the requests.

Luckily, each node already has a `NodeDescriptor` for every other node in the
cluster, and the `NodeDescriptor` proto contains its node's locality
information. Thus, all we need to add to the `BatchRequest` `Header` proto is
a `gateway_node_id` field that gets filled in by the client kv transport.
While the client transport will typically fill this field in with its own ID,
it can also be spoofed when appropriate, such as by DistSQL when a node that
wasn't actually the gateway for a given request makes KV requests on behalf of
the real gateway.

### Alternatives for tracking request origins

Given that we know the IP address of each node in the cluster, we could
potentially try to skip adding the `gateway_node_id` field to each
`BatchRequest` and just rely on the source IP address. That would work in most
cases, but could break down when nodes are communicating with each other via a
load balancer or proxy, without saving much (adding an int to each
`BatchRequest` should have a negligible effect on request size).

We could alternatively include all the locality tags from the source node in
each request, which would eliminate the need to look up the locality of each
node when making decisions based on the collected data. However, this requires
much more data to be sent over the wire and stored per-range in the system,
effectively duplicating the work already done by the gossiping and storage of
node descriptors. It would be reasonable to take this approach -- it would
save an integer's worth of data per request in the case that the nodes in a
cluster don't have any locality labels -- but it doesn't simplify enough to
make up for its added cost.

## Recording request origins

Each leaseholder replica will maintain a map from locality to the number of
requests received from that locality. Ideally, the request counts would decay
exponentially over time. If this proves too difficult to implement efficiently
(e.g. in a cluster with requests coming from tens of localities), we could opt
for swapping out the map with a new map periodically, such as when we examine
it to decide whether to transfer the replica's lease.

If we go with the latter approach, we should also maintain the time at which
we last cleared out the map of request counts so that we can determine the
rate of requests to the range. Understanding the load on each range will help
us prioritize ranges for transfer (and help us balance the load on each node,
even if that isn't the immediate goal of this work).

For the purposes of optimizing lease locality, we'll count each `BatchRequest`
as a single request. For tracking the load on each replica, we'll likely want
to measure something a little fancier, such as the number of non-noop
subrequests in the `BatchRequest` or the number of KVs scanned by the request.
Those would be better estimates of load, but for locality we want to focus on
the number of requests suffering from the large network RTT.
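To make the swap-based option concrete, here's a minimal Go sketch of a
per-replica request-origin tracker. All of the names here (`originTracker`,
`record`, `snapshotAndReset`, the `locality` package) are hypothetical
placeholders rather than names from the actual implementation; the point is
just to show the locality-keyed counts and the reset timestamp living side by
side so that a request rate can be derived whenever the counts are examined.

```go
package locality

import (
	"sync"
	"time"
)

// originTracker is a hypothetical per-replica structure that counts requests
// by the locality of their gateway node. It implements the "swap the map out
// periodically" option described above.
type originTracker struct {
	mu        sync.Mutex
	counts    map[string]int64 // origin locality -> request count
	lastReset time.Time        // when the counts were last cleared
}

func newOriginTracker() *originTracker {
	return &originTracker{
		counts:    make(map[string]int64),
		lastReset: time.Now(),
	}
}

// record is called once per BatchRequest with the locality of the request's
// gateway node, looked up from the gossiped NodeDescriptor matching the
// header's gateway_node_id.
func (t *originTracker) record(locality string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts[locality]++
}

// snapshotAndReset returns the accumulated counts and the duration they
// cover, then starts a fresh measurement period. Dividing the total count by
// the returned duration gives the range's recent request rate.
func (t *originTracker) snapshotAndReset() (map[string]int64, time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	counts, elapsed := t.counts, time.Since(t.lastReset)
	t.counts = make(map[string]int64)
	t.lastReset = time.Now()
	return counts, elapsed
}
```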
## Making lease transfer decisions

Much of the infrastructure needed for making decisions about when to transfer
leases was [already added](https://github.com/cockroachdb/cockroach/pull/10464)
as part of the initial lease balancing work, so all we need to do is start
using request locality as an additional input to the decision-making process.
The difficult problems that we really need to solve are:

* Prioritizing transferring leases for the replicas that are receiving the
  greatest number of requests from distant localities.
* Avoiding thrashing if the origins of requests change frequently.
* Finding the right balance between keeping a similar number of leases on each
  node and moving all the leases to where the most traffic is.

### Prioritizing ranges with the most cross-DC traffic

If a cluster really is spread over datacenters around the world, it's likely
that most of a cluster's ranges will be getting most of their traffic from the
same locality. If this is the case, the most bang for the buck would come from
moving the leases for ranges that are receiving the most requests and whose
request distributions are most heavily skewed toward that locality. To that
end, we may want to periodically record stats about the rate of requests to
the ranges on a store and how many of those requests are from distant
localities. These would be much like the counts we already calculate of the
number of ranges and leases on a store. However, unlike those stats, we may
not want to start gossiping these as part of the store descriptor until we
have a concrete use for them. That will likely come once we start getting
smarter about determining just how much load a node can handle.

Given these stats, we can make decisions per-replica by only transferring
leases for replicas whose requests skew more strongly toward distant
localities, as sketched below.

Of course, there is a risk that moving all the hottest ranges to the same
locality could have a worse effect on throughput if those nodes get
overloaded. We'll have to test the heuristics carefully to tune them and
determine when this could become a problem. It's very possible that we'll have
to expose a tuning knob for this to give users more control over the tradeoff.
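As a rough illustration of how that prioritization might work, the sketch
below scores each replica by the rate of requests arriving from localities
other than the leaseholder's own and sorts candidates accordingly. The types
and functions (`replicaStats`, `crossLocalityScore`, `rankTransferCandidates`)
are hypothetical; the real heuristic and any thresholds would be worked out
through testing.

```go
package locality

import (
	"sort"
	"time"
)

// replicaStats is a hypothetical summary of one leaseholder replica's recent
// traffic, derived from the per-locality request counts described above.
type replicaStats struct {
	rangeID         int64
	leaseholderLoc  string           // locality of the current leaseholder
	countByLocality map[string]int64 // requests per origin locality
	measured        time.Duration    // how long the counts were collected for
}

// crossLocalityScore estimates how much a replica stands to gain from a
// lease transfer: the rate (requests per second) of traffic arriving from
// localities other than the leaseholder's own.
func crossLocalityScore(s replicaStats) float64 {
	var remote int64
	for loc, n := range s.countByLocality {
		if loc != s.leaseholderLoc {
			remote += n
		}
	}
	if s.measured <= 0 {
		return 0
	}
	return float64(remote) / s.measured.Seconds()
}

// rankTransferCandidates orders replicas so that the ones suffering the most
// cross-locality traffic are considered for lease transfers first.
func rankTransferCandidates(replicas []replicaStats) {
	sort.Slice(replicas, func(i, j int) bool {
		return crossLocalityScore(replicas[i]) > crossLocalityScore(replicas[j])
	})
}
```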
### Avoiding thrashing

In order to avoid thrashing of leases, we can partially reuse the mechanisms
already in place for avoiding lease thrashing with respect to leaseholder
balance, particularly the rate-limiting of lease transfers
([#11729](https://github.com/cockroachdb/cockroach/pull/11729)).

Additionally, we can learn from both lease and replica rebalancing that there
needs to be a wide range of configurations in which no action is taken --
e.g. a node with 4% more ranges than the mean won't bother transferring a
range to a node with 4% fewer ranges than the mean. We'll need a similar
cushion in our heuristic such that we only transfer leases away if there's a
significant difference in the number of requests coming from the different
localities.

Along the same lines, we shouldn't make hasty decisions. Measurements of
request distributions are less reliable if we only make them over short time
windows. We'll need to pick a reasonable minimum duration of time to measure
before transferring leases. This will likely be in the tens of minutes --
enough time to get good data, but not so long that we fail to react when
traffic patterns change. An alternative would be to factor in the number of
requests, since a 5 minute measurement with a million samples is much more
trustworthy than a 20 minute measurement with a thousand samples.
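The cushion and the minimum-measurement idea could be combined into a single
gate that runs before any locality-based transfer is even evaluated. The
sketch below is purely illustrative; the constants (`minMeasureDuration`,
`minRequestSamples`, `significantSkew`) and the function name are hypothetical
placeholders whose real values and shape would be chosen through testing.

```go
package locality

import "time"

// Hypothetical tuning constants; the real values would be settled by testing.
const (
	minMeasureDuration = 10 * time.Minute // minimum measurement window
	minRequestSamples  = 100000           // enough samples to trust a shorter window
	significantSkew    = 1.5              // remote locality must see 50% more requests
)

// shouldConsiderLocalityTransfer reports whether a replica's measurements are
// trustworthy enough, and the imbalance large enough, to bother evaluating a
// cross-locality lease transfer at all.
func shouldConsiderLocalityTransfer(
	measured time.Duration, counts map[string]int64, leaseholderLoc string,
) bool {
	var total, local, maxRemote int64
	for loc, n := range counts {
		total += n
		if loc == leaseholderLoc {
			local += n
		} else if n > maxRemote {
			maxRemote = n
		}
	}
	if total == 0 {
		return false
	}
	// Don't make hasty decisions: require either a long enough measurement
	// window or a large enough number of samples.
	if measured < minMeasureDuration && total < minRequestSamples {
		return false
	}
	// Only consider transferring away if the busiest remote locality sees
	// significantly more requests than the leaseholder's own locality.
	return float64(maxRemote) >= significantSkew*float64(local)
}
```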
### What to do if nodes don't have locality information

If nodes don't have any locality information attached to them, we lose most of
our ability to determine where nodes are located with respect to each other.
While we should make an effort to encourage all multi-datacenter deployments
to specify locality information (and can automate the assignment of locality
info in cloud environments), there will certainly be some deployments without
it. We have a few options in such cases:

* Fall back to operating exactly the same as today, not doing any load-based
  lease placement. We'll still be able to use the per-replica request counts
  once we start factoring load into balancing decisions.
* Fall back to a very limited version of load-based lease placement where we
  only move a lease based on request locality if a node holding one of the
  other replicas is forwarding the vast majority of traffic to the range.
* Calculate (and gossip) estimates of the latency between all pairs of nodes
  and use that to guess at the localities of nodes for the sake of
  rebalancing.

The last option can be put off to future optimization efforts (if ever). The
second option may be beneficial if a user sets up a multi-datacenter
deployment without locality info, but in the more common case of low-latency,
single-DC deployments it'd likely just make the lease balancing worse for no
real latency gain. We'll stick with the first option for now.

### Reconciling lease locality with lease balance

As mentioned above, our goals in this RFC come into fairly direct conflict
with the goals of the [lease rebalancing
RFC](20161026_leaseholder_rebalancing.md). If all the requests are coming from
a single locality, it would be ideal from a latency perspective for all the
leases to be there as well, so long as that wouldn't overload the nodes with
too much work. However, it's possible that putting all the leases into that
locality would overload the nodes such that overall cluster throughput (and
even latency) is worse than it would be if the leases were properly balanced.

Ideally, we would have some understanding of how fully utilized each node is
in terms of throughput. That measurement could be gossiped in the same way as
each node's storage utilization and used in allocation decisions. It's not
totally clear what metric to use for this, though, so for now we will leave it
out of allocation decisions (suggestions welcome -- perhaps a rather naive
utilization measurement can go a long way here). We could easily add a measure
of how much load a node has on it, but not of how close to fully utilized it
is.

Similarly, it'd be useful to have an estimate of the RTT between localities
when making allocation decisions. While we don't currently have this (as far
as I'm aware), measuring it wouldn't be very hard, so I suggest that we do so
and use it to help tune lease allocations. We'll have to make sure that this
can't lead to too much thrashing if different nodes come up with different
estimates, but re-measuring it periodically will help fight such issues and
will also have the nice benefit that the cluster will react accordingly if the
RTT between localities changes.

Thus, our heuristic for deciding when to transfer a lease will look something
like this:

* If the cluster doesn't have locality information, fall back to the existing
  behavior. The existing behavior is to transfer the lease if the current
  leaseholder node is considered overfull (based on the `rebalanceThreshold`
  constant, which is currently 5% of the mean number of leases) or if it has
  more than the mean number of leases and another replica is underfull.
  * Note that we can start using the new information on how many requests
    each replica has been serving lately to get an idea of how many requests
    each node is serving, and use that rather than just the number of
    leaseholders. That work is somewhat orthogonal to optimizing leaseholder
    locality, but is worth noting and working on soon.
* If the cluster does have locality information, then measure the RTT between
  localities. As the latency between localities increases, raise a new
  `interLocalityRebalanceThreshold` proportionately. This will affect the
  underfull/overfull calculations for leases when comparing replicas in
  different localities, but the normal `rebalanceThreshold` would still be
  used when comparing replicas within the same locality. The exact numbers for
  this can be worked out in testing, but it will make lease balancing less and
  less important until we eventually don't consider it at all for replicas in
  different localities. If it's legal to make a cross-locality transfer based
  on the nodes in question and the current `interLocalityRebalanceThreshold`,
  then we can begin considering whether it's worth it to make a transfer based
  on the distribution of requests to the replica, as sketched below.
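The second branch of that heuristic might scale the threshold with
inter-locality RTT along the lines of the sketch below. The shape of the
scaling, the constants, and the names other than `rebalanceThreshold`
(`interLocalityRebalanceThreshold`, `crossLocalityTransferLegal`, `baseRTT`)
are assumptions for illustration, not the actual implementation.

```go
package locality

import "time"

// rebalanceThreshold mirrors the existing lease-balancing constant: a node is
// considered over- or underfull if it deviates from the mean lease count by
// more than this fraction.
const rebalanceThreshold = 0.05

// interLocalityRebalanceThreshold relaxes the rebalance threshold for
// comparisons across localities: the higher the RTT between two localities,
// the more lease imbalance we tolerate before balance overrides locality.
func interLocalityRebalanceThreshold(rtt time.Duration) float64 {
	const baseRTT = 10 * time.Millisecond // at or below this, treat as local
	if rtt <= baseRTT {
		return rebalanceThreshold
	}
	// Grow the threshold in proportion to the RTT, capped at a value that
	// effectively stops considering lease balance for distant localities.
	scaled := rebalanceThreshold * float64(rtt) / float64(baseRTT)
	const ignoreBalance = 1.0 // 100% deviation: balance no longer matters
	if scaled > ignoreBalance {
		return ignoreBalance
	}
	return scaled
}

// crossLocalityTransferLegal reports whether transferring one more lease to a
// node with targetLeases leases is permitted given the mean lease count and
// the RTT between the source and target localities.
func crossLocalityTransferLegal(targetLeases, meanLeases float64, rtt time.Duration) bool {
	threshold := interLocalityRebalanceThreshold(rtt)
	// The transfer is allowed as long as it wouldn't push the target too far
	// above the mean, where "too far" grows with the inter-locality RTT.
	return targetLeases+1 <= meanLeases*(1+threshold)
}
```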
# Testing

Much like how `allocsim` has been useful for testing the lease rebalancing
heuristics, it would be very nice to have a repeatable tool for testing this
as well. However, whereas the only real variable when testing lease
rebalancing was the relevant code and the only real output was the number of
leases held by each node, we now have multiple inputs (the code, the latency
between nodes, the locality labels) and multiple outputs (the number of leases
held by each node, the distribution of request latency, and the request
throughput) to consider.

In order to better test this new functionality, I propose:

* Adding a testing knob to simulate additional latency between nodes (see the
  sketch below).
* Extending `allocsim` to be able to set up different locality configurations
  with the new latency knob.
* Extending `allocsim` to send load against specific nodes/localities and
  measure the resulting throughput and latency.
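A minimal sketch of what the latency-simulation knob from the first item above
might look like, assuming a hypothetical hook that a test transport consults
before delivering each request; none of these names are taken from the actual
testing knobs.

```go
package locality

import "time"

// ArtificialLatencyKnob is a hypothetical testing knob that maps ordered
// pairs of localities to the extra one-way delay to inject between them.
type ArtificialLatencyKnob struct {
	// DelayByLocalityPair maps "fromLocality->toLocality" to the added delay,
	// e.g. {"us-east->eu-west": 70 * time.Millisecond}.
	DelayByLocalityPair map[string]time.Duration
}

// MaybeDelay sleeps for the configured artificial latency, if any, before a
// simulated request is delivered from one locality to another. A test
// transport would call this on every RPC it forwards.
func (k *ArtificialLatencyKnob) MaybeDelay(fromLocality, toLocality string) {
	if k == nil || fromLocality == toLocality {
		return
	}
	if d, ok := k.DelayByLocalityPair[fromLocality+"->"+toLocality]; ok {
		time.Sleep(d)
	}
}
```

With something like this in place, `allocsim` could construct a different
latency configuration per test scenario and compare the resulting lease
placements, latencies, and throughput.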
The output of `allocsim` will enable us to better understand how different
heuristics perform on different cluster configurations. We may be able to find
a heuristic that performs reasonably well across a variety of clusters, or we
may find that we have to introduce a tuning knob of some sort to shift the
balance one way or another. Either way, it's tough for us to know exactly
what'll work without a testing tool like this, so we'll rely heavily on it
when tuning the heuristic.

# Future Directions

* It would be helpful to expose information about balancing decisions in the
  UI so that users and developers can better understand what's happening in
  their clusters, which will be particularly important when bad decisions are
  made, causing performance dips.

* There could be situations in which the locality that's generating most of
  the requests to a range doesn't have a local replica for that range. While
  ideally cluster admins would construct `ZoneConfig` settings that make sense
  for their environments and workloads, we could potentially benefit from
  taking per-locality load into account when making replica placement
  decisions (not just lease placement decisions).

* As mentioned above, we should start using the recent request counts being
  added as part of this work to improve the existing replica and lease
  placement heuristics. This should be fairly straightforward; it'll just
  require some benchmarking. There's more discussion of this in the [future
  directions section of the leaseholder rebalancing
  RFC](20161026_leaseholder_rebalancing.md#future-directions).

* As mentioned above, it would be very beneficial if we came up with some way
  of measuring the true load on each node, either as a collection of
  measurements or as some combination of CPU utilization, memory pressure, and
  disk I/O utilization. This would make improving locality and balancing load
  significantly easier and more effective. Ideally we could even differentiate
  between load caused by being a leaseholder and load caused by being a
  follower replica.

* It's possible that this work will actually be harmful in certain odd cluster
  configurations. For example, if a range has two replicas in different parts
  of Europe and only a single replica in Australia, then writes will perform
  better if the lease is in Europe even if most of the range's requests are
  coming from Australia. This is because raft commands will commit much more
  quickly if proposed from one of the replicas in Europe. We may want to take
  into account the latency between the localities of all the different
  replicas to avoid such problems, but such cases aren't critical since
  they're not recommended configurations to begin with.

# Drawbacks

* This approach does not have any true safeguards against overloading nodes in
  the localities where most requests are coming from. We will prefer moving
  load there as long as the inter-locality RTT is high enough, regardless of
  how much load the nodes can handle. At a high level, this is an existing
  issue in our system, but intentionally imbalancing the leases will make the
  risk much worse (until we start factoring some measure of load percentage
  into these decisions). We're going to need some sort of flow control
  mechanism regardless of this change, and once we have it we can use it to
  help with these decisions.
* Relying on measurements of latency from each node to each other locality may
  lead to unexpected thrashing if nodes get drastically different
  measurements. Taking multiple measurements over time should help with this,
  but it's conceivable that certain networks could exhibit a persistent
  difference.
* Unless we decide to do a lot more work determining and gossiping latency
  information between all nodes, these optimizations won't be used if cluster
  admins don't add locality info to their nodes.

# Alternatives

* As mentioned above, we could track the origin of each request by sending the
  locality labels along with each request rather than just the source node ID,
  but that would add a lot of duplicated info to every request.
* Rather than considering it future work, it'd be beneficial if we could make
  allocation decisions based on the actual load on each node. Skipping right
  to that solution would be great (suggestions appreciated!).
* We may find it more effective to measure something other than the number of
  requests handled by a replica. For example, perhaps time spent processing
  requests for the replica or bytes returned in response to requests to the
  replica would be more accurate measurements of the load on the replica.
  These are things that we could potentially experiment with using `allocsim`
  to see if they provide better balance.

# Unresolved questions