- Feature Name: Leaseholder Locality ("Leases follow the workload")
- Status: completed
- Start Date: 2017-01-25
- Authors: Alex Robinson
- RFC PR: [#13233](https://github.com/cockroachdb/cockroach/pull/13233)
- Cockroach Issue:
  - [#13232](https://github.com/cockroachdb/cockroach/issues/13232)

# Summary

Enable range leaseholders to move closer to their clients, which reduces the
latency of KV requests when replicas are far apart.

# Motivation

The primary motivation for moving leaseholders around based on where their
requests are coming from is to reduce the latency of those requests. When a
request for a given key is received by a gateway node, the node must forward the
request to the leaseholder for the key's range. This isn't a big deal if all of
a cluster's nodes are near each other in the network, but if the leaseholder
happens to be halfway around the world from where the request originated, the
network round trip added to the request latency can be quite high.

This affects both reads and writes. For reads, getting the request to and from
the leaseholder dominates the request latency in widely distributed clusters
since the leaseholder can serve reads without talking to other replicas. For
writes, even though raft commands will incur an additional round trip between
replicas, removing the round trip to the leaseholder could nearly halve the
total request latency.

While there are typically multiple gateway nodes accessing a range, and they
won't necessarily all be in the same locality, our goal is to minimize the
aggregate RTT cost of accessing a given range by properly placing its
leaseholder.

We believe there are usage patterns that would benefit greatly from better
leaseholder placement. Consider a system that spans datacenters all around the
world. When it's daytime in Asia/Australia, the datacenter(s) there will be
receiving most of the requests. As time passes, more of the requests will start
to originate from Europe, and later on from the Americas. If the leaseholder for
a range is always in an Asian datacenter, then the latency of accessing that
range will be much worse when most of its requests come from elsewhere. This is
where the idea of the lease "following the workload" or "following the sun" comes
from.

Finally, it's worth noting that this goal is somewhat at odds with our desire
to distribute load evenly throughout a cluster (e.g. via [range
rebalancing](20160503_rebalancing_v2.md) and [leaseholder
rebalancing](20161026_leaseholder_rebalancing.md)). In fact, this goal was
specifically pushed off when the initial form of leaseholder rebalancing was
designed and implemented. Placing the leaseholders near their most common
gateways may lower total throughput if it maxes out what the local machines can
do while leaving the machines in other datacenters underutilized, particularly
if the latency between datacenters is small. There is a fine balance to be
struck so that we don't minimize latency at the expense of too much throughput,
or vice versa.

# Detailed design

Given that a lease transfer mechanism already exists, the remaining difficulty
lies in deciding when to transfer leases. We [already have
logic](20161026_leaseholder_rebalancing.md) that decides when to transfer leases
with the goal of ensuring each node has approximately the same number of leases.
Anything we add to make leases follow the workload will need to play nicely with
that existing goal.

## Tracking request origins

In order to have any idea where to rebalance leases to, we first need to track
where the requests for a given range are coming from. To that end, we can add
request origin information to the `Header` that's included in all
`BatchRequest`s.

When tracking the origin of requests, we don't just care about the individual
node that a request came from, but also its locality. To understand this,
consider a cluster with nodes 1-3 in datacenter 1, nodes 4-6 in datacenter 2,
and nodes 7-9 in datacenter 3. Range x in this cluster has replicas on nodes 1,
4, and 7, and the current leaseholder is node 1. If the leaseholder is receiving
a lot of its requests from nodes 8 and 9, then we may want to move the lease to
node 7 even if node 7 itself isn't sending the range much traffic, because node
7 shares a locality with the origin of the requests.

Luckily, each node already has a `NodeDescriptor` for every other node in the
cluster, and the `NodeDescriptor` proto contains its node's locality
information. Thus, all we need to add to the `BatchRequest` `Header` proto is a
`gateway_node_id` field that gets filled in by the client KV transport. While
the client transport will typically fill this field in with its own node ID, it
can also be spoofed when appropriate, such as by DistSQL when a node that wasn't
actually the gateway for a given request makes KV requests on behalf of the real
gateway.

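To make the lookup concrete, here is a minimal Go sketch of how a leaseholder
might resolve a request's origin locality from the gateway node ID in the
header. The type and method names (`NodeDescriptorCache`, `OriginLocality`) are
illustrative assumptions, not the actual CockroachDB API.

```go
// Illustrative sketch only: resolving a request's origin locality from the
// gateway node ID carried in the BatchRequest header.
package locality

import "fmt"

// NodeDescriptorCache stands in for the per-node info every node already has
// via gossip, mapping node ID to a flattened locality string such as
// "region=us-east,zone=us-east-1a".
type NodeDescriptorCache map[int32]string

// OriginLocality maps a request's gateway node ID back to its locality so the
// leaseholder can aggregate request counts per locality rather than per node.
func (c NodeDescriptorCache) OriginLocality(gatewayNodeID int32) (string, error) {
	loc, ok := c[gatewayNodeID]
	if !ok {
		return "", fmt.Errorf("unknown gateway node %d", gatewayNodeID)
	}
	return loc, nil
}
```

The gossiped `NodeDescriptor`s are what would back such a cache in practice, so
nothing new needs to be exchanged beyond the node ID itself.
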
### Alternatives for tracking request origins

Given that we know the IP address of each node in the cluster, we could
potentially try to skip adding the `gateway_node_id` field to each
`BatchRequest` and just rely on the source IP address. That would work in most
cases, but could break down when nodes are communicating with each other via a
load balancer or proxy, and it wouldn't save much anyway (adding an int to each
`BatchRequest` should have a negligible effect on request size).

We could alternatively include all the locality tags from the source node in
each request, which would eliminate the need to look up the locality of each
node when making decisions based on the collected data. However, this requires
much more data to be sent over the wire and stored per-range in the system,
effectively duplicating the work already done by the gossiping and storage of
node descriptors. It would be reasonable to take this approach -- it would save
an integer's worth of data per request in the case that the nodes in a cluster
don't have any locality labels -- but it doesn't simplify enough to make up for
its added cost.

## Recording request origins

Each leaseholder replica will maintain a map from locality to the number of
requests received from that locality. Ideally, the request counts would decay
exponentially over time. If this proves too difficult to implement efficiently
(e.g. in a cluster with requests coming from tens of localities), we could opt
for swapping out the map with a new map periodically, such as when we examine it
to decide whether to transfer the replica's lease.

If we go with the latter approach, we should also maintain the time that we last
cleared out the map of request counts so that we can determine the rate of
requests to the range. Understanding the load on each range will help us
prioritize them for transfer (and help us to balance the load on each node, even
if that isn't the immediate goal of this work).

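As a rough illustration of the periodic-reset approach described above, here is
a hedged Go sketch of the per-leaseholder bookkeeping; the type and method
names are hypothetical.

```go
// Hypothetical sketch of the per-leaseholder request tracking described
// above, using the swap-out-the-map-periodically approach.
package locality

import (
	"sync"
	"time"
)

// requestCounts tracks how many requests this leaseholder has received from
// each locality since the counts were last cleared.
type requestCounts struct {
	mu          sync.Mutex
	counts      map[string]int64 // locality -> request count
	lastCleared time.Time
}

func newRequestCounts() *requestCounts {
	return &requestCounts{
		counts:      make(map[string]int64),
		lastCleared: time.Now(),
	}
}

// record is called once per BatchRequest with the origin locality.
func (rc *requestCounts) record(locality string) {
	rc.mu.Lock()
	defer rc.mu.Unlock()
	rc.counts[locality]++
}

// snapshotAndReset returns the accumulated counts along with the duration they
// cover (so callers can compute a request rate), then starts a fresh window.
func (rc *requestCounts) snapshotAndReset() (map[string]int64, time.Duration) {
	rc.mu.Lock()
	defer rc.mu.Unlock()
	counts, elapsed := rc.counts, time.Since(rc.lastCleared)
	rc.counts = make(map[string]int64)
	rc.lastCleared = time.Now()
	return counts, elapsed
}
```

The natural place to call something like `snapshotAndReset` is the point
described above where we examine the counts to decide whether to transfer the
replica's lease.
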
For the purposes of optimizing lease locality, we'll count each BatchRequest as
a single request. For tracking the load on each replica, we'll likely want to
measure something a little fancier, such as the number of non-noop subrequests
in the BatchRequest or the number of KVs scanned by the request. These would be
better estimates of load, but for locality we want to focus on the number of
requests suffering from the large network RTT.

## Making lease transfer decisions

Much of the infrastructure needed for making decisions about when to transfer
leases was [already added](https://github.com/cockroachdb/cockroach/pull/10464)
as part of the initial lease balancing work, so all we need to do is start using
request locality as an additional input to the decision-making process. The
difficult problems that we really need to solve are:

* Prioritizing transferring leases for the replicas that are receiving the
  greatest number of requests from distant localities.
* Avoiding thrashing if the origins of requests change frequently.
* Finding the right balance between keeping a similar number of leases on each
  node and moving all the leases to where the most traffic is.

### Prioritizing ranges with the most cross-DC traffic

If a cluster really is spread over datacenters around the world, it's likely
that most of a cluster's ranges will be getting most of their traffic from the
same locality. If this is the case, the most bang for the buck would come from
moving the leases for ranges that are receiving the most requests and whose
request distributions are most skewed toward the given locality. To that end, we
may want to periodically record stats about the rate of requests to the ranges
on a store and how many of those requests are from distant localities. These
would be much like the counts we already calculate of the number of ranges and
leases on a store. However, unlike those stats, we may not want to start
gossiping these as part of the store descriptor until we have a concrete use for
them. That will likely come once we start getting smarter about determining just
how much load a node can handle.

Given these stats, we can make decisions per-replica by only transferring leases
for replicas whose requests skew most strongly toward distant localities.

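For illustration, here is a hedged sketch of how such a per-replica decision
might be scored: combine the range's request rate with the fraction of requests
coming from localities other than the leaseholder's. The function and weighting
are assumptions to be tuned, not a settled heuristic.

```go
// Hypothetical scoring sketch: hotter ranges whose traffic skews more heavily
// toward remote localities are better lease-transfer candidates.
package locality

import "time"

// transferScore combines the request rate over the measurement window with the
// fraction of requests that originated outside the leaseholder's locality.
func transferScore(counts map[string]int64, window time.Duration, leaseholderLocality string) float64 {
	var total, remote int64
	for locality, n := range counts {
		total += n
		if locality != leaseholderLocality {
			remote += n
		}
	}
	if total == 0 || window <= 0 {
		return 0
	}
	qps := float64(total) / window.Seconds()
	remoteFraction := float64(remote) / float64(total)
	return qps * remoteFraction
}
```
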
Of course, there is a risk that moving all the hottest ranges to the same
locality could have a worse effect on throughput if those nodes get overloaded.
We'll have to test the heuristics carefully to tune them and determine when this
could become a problem. It's very possible that we'll have to expose a tuning
knob for this to give users more control over the tradeoff.

### Avoiding thrashing

In order to avoid thrashing of leases, we can partially reuse the mechanisms
already in place for avoiding lease thrashing with respect to leaseholder
balance, particularly the rate-limiting of lease transfers
([#11729](https://github.com/cockroachdb/cockroach/pull/11729)).

Additionally, we can learn from both lease and replica rebalancing that there
needs to be a wide range of configurations in which no action is taken -- e.g.
a node with 4% more ranges than the mean won't bother transferring a range to a
node with 4% fewer ranges than the mean. We'll need a cushion in our heuristic
such that we only transfer leases away if there's a significant difference in
the number of requests coming from the different localities.

Along the same lines, we shouldn't make hasty decisions. Measurements of request
distributions are less reliable if we only make them over short time windows.
We'll need to pick a reasonable minimum duration of time to measure before
transferring leases. This will likely be in the tens of minutes -- enough time
to get good data, but short enough that traffic patterns are unlikely to change
drastically within the window. An alternative would be to factor in the number
of requests, since a 5 minute measurement with a million samples is much more
trustworthy than a 20 minute measurement with a thousand samples.

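The guards above might look roughly like the following sketch: refuse to act on
a window that is both short and sparse, and require the leading remote locality
to beat the local request count by a clear cushion. The constants are
placeholders, not tuned values.

```go
// Hypothetical anti-thrashing guard for cross-locality lease transfers.
package locality

import "time"

const (
	minMeasurementWindow = 10 * time.Minute // roughly "tens of minutes"
	minMeasuredRequests  = int64(1000)      // enough samples to trust a shorter window
	transferCushion      = 1.5              // remote locality must lead by 50%
)

// shouldConsiderTransfer reports whether the measured request distribution is
// trustworthy and skewed enough to even evaluate a cross-locality transfer.
func shouldConsiderTransfer(counts map[string]int64, window time.Duration, leaseholderLocality string) bool {
	var total, local, maxRemote int64
	for locality, n := range counts {
		total += n
		if locality == leaseholderLocality {
			local = n
		} else if n > maxRemote {
			maxRemote = n
		}
	}
	// Don't act on windows that are both short and sparse.
	if window < minMeasurementWindow && total < minMeasuredRequests {
		return false
	}
	// Require a significant skew toward some remote locality.
	return float64(maxRemote) > float64(local)*transferCushion
}
```
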
### What to do if nodes don't have locality information

If nodes don't have any locality information attached to them, we lose most of
our ability to determine where nodes are located with respect to each other.
While we should make an effort to encourage all multi-datacenter deployments to
specify locality information (and can automate the assignment of locality info
in cloud environments), there will certainly be some deployments without it. We
have a few options in such cases:

* Fall back to operating exactly the same as today, not doing any load-based
  lease placement. We'll still be able to use the per-replica request counts
  once we start factoring load into balancing decisions.
* Fall back to a very limited version of load-based lease placement where we
  only move a lease based on request locality if a node holding one of the other
  replicas is forwarding the vast majority of traffic to the range.
* Calculate (and gossip) estimates of the latency between all pairs of nodes and
  use that to guess at the localities of nodes for the sake of rebalancing.

The last option can be put off to future optimization efforts (if ever). The
second option may be beneficial if a user sets up a multi-datacenter deployment
without locality info, but in the more common case of low-latency, single-DC
deployments it'd likely just make the lease balancing worse for no real latency
gain. We'll stick with the first option for now.

### Reconciling lease locality with lease balance

As mentioned above, our goals in this RFC come into fairly direct conflict with
the goals of the [lease rebalancing RFC](20161026_leaseholder_rebalancing.md).
If all the requests are coming from a single locality, it would be ideal from a
latency perspective for all the leases to be there as well, so long as that
wouldn't overload the nodes with too much work. However, it's possible that
putting all the leases into that locality will overload the nodes such that
overall cluster throughput (and even latency) is worse than it would be if the
leases were properly balanced.

Ideally, we would have some understanding of how fully utilized each node is in
terms of throughput. That measurement could be gossiped in the same way as each
node's storage utilization and used in allocation decisions. It's not totally
clear what metric to use for this, though, so for now we will leave it out of
allocation decisions (suggestions welcome -- perhaps a rather naive utilization
measurement can go a long way here). We could easily add a measure of how much
load a node has on it, but not of how close to fully utilized it is.

Similarly, it'd be useful to have an estimate of the RTT between localities when
making allocation decisions. While we don't currently have this (as far as I'm
aware), measuring it wouldn't be very hard, so I suggest that we do so and use
it to help tune lease allocations. We'll have to make sure that this can't lead
to too much thrashing if different nodes come up with different estimates, but
re-measuring this periodically will help fight such issues and also have the
nice benefit that the cluster will react accordingly if the RTT between
localities changes.

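One low-effort way to get such an estimate, sketched below under the assumption
that we can observe RPC round trips to nodes in each locality, is to fold
periodic measurements into an exponentially weighted moving average so the
estimate tracks real latency changes without overreacting to a single noisy
sample.

```go
// Hypothetical sketch of maintaining per-locality RTT estimates from periodic
// latency observations.
package locality

import "time"

type rttEstimator struct {
	alpha     float64                  // weight of the newest sample, e.g. 0.2
	estimates map[string]time.Duration // locality -> smoothed RTT
}

func newRTTEstimator(alpha float64) *rttEstimator {
	return &rttEstimator{alpha: alpha, estimates: make(map[string]time.Duration)}
}

// observe folds one measured round trip to a node in the given locality into
// the smoothed estimate for that locality.
func (e *rttEstimator) observe(locality string, rtt time.Duration) {
	prev, ok := e.estimates[locality]
	if !ok {
		e.estimates[locality] = rtt
		return
	}
	blended := e.alpha*float64(rtt) + (1-e.alpha)*float64(prev)
	e.estimates[locality] = time.Duration(blended)
}
```
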
Thus, our heuristic for deciding when to transfer a lease will look something
like this:

* If the cluster doesn't have locality information, fall back to the existing
  behavior. The existing behavior is to transfer the lease if the current
  leaseholder node is considered overfull (based on the `rebalanceThreshold`
  constant, which is currently 5% of the mean number of leases) or if it has
  more than the mean number of leases and another replica is underfull.
  * Note that we can start using the new information on how many requests each
    replica has been serving lately to get an idea of how many requests each
    node is serving and use that rather than just the number of leaseholders.
    That work is somewhat orthogonal to optimizing leaseholder locality, but is
    worth noting and working on soon.
* If the cluster does have locality information, then measure the RTT between
  localities. As the latency between localities increases, raise a new
  `interLocalityRebalanceThreshold` proportionately (see the sketch below). This
  will affect the underfull/overfull calculations for leases when comparing
  replicas in different localities, but the normal `rebalanceThreshold` would
  still be used when comparing replicas within the same locality. The exact
  numbers for this can be worked out in testing, but it will make lease
  balancing less and less important until we eventually don't consider it at all
  for replicas in different localities. If it's legal to make a cross-locality
  transfer based on the nodes in question and the current
  `interLocalityRebalanceThreshold`, then we can begin considering whether it's
  worth it to make a transfer based on the distribution of requests to the
  replica.

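As a rough illustration of the second bullet above, the threshold might scale
with the measured RTT along these lines; the 5% base matches the existing
`rebalanceThreshold`, while the scaling factor and cap are placeholder numbers
to be settled during testing.

```go
// Hypothetical sketch of scaling the cross-locality rebalance threshold with
// the measured RTT between two localities.
package locality

import "time"

const rebalanceThreshold = 0.05 // existing within-locality threshold (5%)

// interLocalityRebalanceThreshold grows with the RTT between the two
// localities, making lease-count balance matter less and less as the
// localities get farther apart. Beyond the cap, balance is effectively
// ignored for cross-locality comparisons.
func interLocalityRebalanceThreshold(rtt time.Duration) float64 {
	const (
		perTenMillis       = 0.01 // placeholder: +1 percentage point per 10ms of RTT
		ignoreBalanceAbove = 1.0  // placeholder cap
	)
	extra := perTenMillis * float64(rtt.Milliseconds()) / 10.0
	if t := rebalanceThreshold + extra; t < ignoreBalanceAbove {
		return t
	}
	return ignoreBalanceAbove
}
```
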
# Testing

Much like how `allocsim` has been useful for testing the lease rebalancing
heuristics, it would be very nice to have a repeatable tool for testing this as
well. However, whereas the only real variable when testing lease rebalancing was
the relevant code and the only real output was the number of leases held by each
node, we now have multiple inputs (the code, the latency between nodes, the
locality labels) and multiple outputs (the number of leases held by each node,
the distribution of request latency, and the request throughput) to consider.

In order to better test this new functionality, I propose:

* Adding a testing knob to simulate additional latency between nodes (a sketch
  of one possible form of this knob follows below).
* Extending `allocsim` to be able to set up different locality configurations
  with the new latency knob.
* Extending `allocsim` to send load against specific nodes/localities and
  measure the resulting throughput and latency.

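As one possible shape for the latency knob, the sketch below assumes we could
inject delay at the RPC layer via a gRPC client interceptor; the actual knob
may well be wired differently.

```go
// Hypothetical sketch of a testing knob that injects artificial latency into
// outgoing RPCs so allocsim can model geographically distributed clusters on
// a single machine.
package locality

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// ArtificialLatencyInterceptor returns a gRPC unary client interceptor that
// sleeps for the configured one-way delay before sending each request.
func ArtificialLatencyInterceptor(delay time.Duration) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		if delay > 0 {
			select {
			case <-time.After(delay):
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}
```
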
The output of `allocsim` will enable us to better understand how different
heuristics perform on different cluster configurations. We may be able to find a
heuristic that performs reasonably well across a variety of clusters, or may
find that we have to introduce a tuning knob of some sort to shift the balance
one way or another. Either way, it's tough for us to know exactly what'll work
without a testing tool like this, so we'll rely heavily on it when tuning the
heuristic.

# Future Directions

* It would be helpful to expose information about balancing decisions in the UI
  so that users and developers can better understand what's happening in their
  clusters, which will be particularly important when bad decisions are made,
  causing performance dips.

* There could be situations in which the locality that's generating most of the
  requests to a range doesn't have a local replica for that range. While ideally
  cluster admins would construct `ZoneConfig` settings that make sense for their
  environments and workloads, we could potentially benefit from taking
  per-locality load into account when making replica placement decisions (not
  just lease placement decisions).

* As mentioned above, we should start using the recent request counts being
  added as part of this work to improve the existing replica and lease placement
  heuristics. This should be fairly straightforward; it'll just require some
  benchmarking. There's more discussion of this in the [future directions
  section of the leaseholder rebalancing
  RFC](20161026_leaseholder_rebalancing.md#future-directions).

* As mentioned above, it would be very beneficial if we came up with some way of
  measuring the true load on each node, either as a collection of measurements
  or as some combination of CPU utilization, memory pressure, and disk I/O
  utilization. This would make improving locality and balancing load
  significantly easier and more effective. Ideally we could even differentiate
  between load caused by being a leaseholder and load caused by being a follower
  replica.

* It's possible that this work will actually be harmful in certain odd cluster
  configurations. For example, if a range has two replicas in different parts of
  Europe and only a single replica in Australia, then writes will perform better
  if the lease is in Europe even if most of its requests are coming from
  Australia. This is because raft commands will commit much more quickly if
  proposed from one of the replicas in Europe. We may want to take into account
  the latency between the localities of all the different replicas to avoid such
  problems, but such cases aren't critical since they're not recommended
  configurations to begin with.

# Drawbacks

* This approach does not have any true safeguards against overloading nodes in
  the localities where most requests are coming from. We will prefer moving load
  there as long as the inter-locality RTT is high enough, regardless of how much
  load the nodes can handle. At a high level, this is an existing issue in our
  system, but intentionally skewing the lease distribution will make the risk
  much worse (until we start factoring in some measure of load percentage when
  making decisions). We're going to need some sort of flow control mechanism
  regardless of this change, and once we have it we can use it to help with
  these decisions.
* Relying on measurements of latency from each node to each other locality may
  lead to unexpected thrashing if nodes get drastically different measurements.
  Taking multiple measurements over time should help with this, but it's
  conceivable that certain networks could exhibit a persistent difference.
* Unless we decide to do a lot more work determining and gossiping latency
  information between all nodes, these optimizations won't be used if cluster
  admins don't add locality info to the nodes.

# Alternatives

* As mentioned above, we could track the origin of each request by sending the
  locality labels along with each request rather than just the source node ID,
  but that would add a lot of duplicated info along with every request.
* Rather than considering it future work, it'd be beneficial if we could make
  allocation decisions based on the actual load on each node. Skipping right to
  that solution would be great (suggestions appreciated!).
* We may find it more effective to measure something other than the number of
  requests handled by a replica. For example, perhaps time spent processing
  requests for the replica or bytes returned in response to requests to the
  replica would be more accurate measurements of the load on the replica. These
  are things that we could potentially experiment with using `allocsim` to see
  if they provide better balance.

# Unresolved questions