- Feature Name: Node-level mechanism for refreshing range leases
- Status: completed
- Start Date: 2016-02-10
- Authors: Ben Darnell, Spencer Kimball
- RFC PR: [#4288](https://github.com/cockroachdb/cockroach/pull/4288)
- Cockroach Issue: [#315](https://github.com/cockroachdb/cockroach/issues/315),
                   [#6107](https://github.com/cockroachdb/cockroach/issues/6107)

# Summary

This is a proposal to replace the current per-range lease mechanism with
a coarser-grained per-node lease in order to minimize range lease renewal
traffic. In the new system, a range lease will have two components:
a short-lived node lease (managed by the node) and a range lease of indefinite
duration (managed by the range – as it is currently). Only the node-level lease
will need to be automatically refreshed.


# Motivation

All active ranges require a range lease, which is currently updated
via Raft and persisted in the range-local keyspace. Range leases have
a moderate duration (currently 9 seconds) in order to be responsive to
failures. Since they are stored through Raft, they must be maintained
independently for each range and cannot be coalesced as is possible
for heartbeats. If a range is active the lease is renewed before it
expires (currently, after 7.2 seconds). This can result in a
significant amount of traffic to renew leases on ranges.

A motivating example is a table with 10,000 ranges experiencing heavy
read traffic. If the primary key for the table is chosen such that
load is evenly distributed, then read traffic will likely keep each
range active. The range lease must be renewed in order to serve
consistent reads. For 10,000 ranges, that would require 1,388 Raft
commits per second. This imposes a drain on system resources that
grows with the dataset size.

An insight which suggests possible alternatives is that renewing
10,000 range leases is duplicating a lot of work in a system which has
only a small number of nodes. In particular, we mostly care about node
liveness. Currently, each replica holding range leases must
individually renew. What if we could have the node renew for all of
its replicas holding range leases at once?


# Detailed design

We introduce a new node liveness system KV range at the beginning of
the keyspace (not an actual SQL table; it will need to be accessed
with lower-level APIs). This table is special in several respects: it
is gossiped, and leases within its keyspace (and all ranges that
precede it, including meta1 and meta2) use the current, per-range
lease mechanism to avoid circular dependencies. This table maps node
IDs to an epoch counter and an expiration timestamp.

## Node liveness proto

| Column     | Description |
| ---------- | ----------- |
| NodeID     | node identifier |
| Epoch      | monotonically increasing liveness epoch |
| Expiration | timestamp at which the liveness record expires |
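
A minimal Go sketch of what such a liveness record might look like; the
names and field types here are illustrative only, not the actual
CockroachDB proto:

```go
package liveness

import "time"

// NodeID identifies a node in the cluster.
type NodeID int32

// Liveness is the record stored in the node liveness KV range, keyed
// by node ID: an epoch counter plus an expiration timestamp.
type Liveness struct {
	NodeID     NodeID    // node identifier
	Epoch      int64     // monotonically increasing liveness epoch
	Expiration time.Time // timestamp at which the liveness record expires
}
```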

The node liveness system KV range supports a new type of range lease,
referred to hereafter as an "epoch-based" range lease. Epoch-based
range leases specify an epoch in addition to the owner, instead of
using a timestamp expiration. The lease is valid for as long as the
epoch for the lease holder is valid according to the node liveness
table. To hold a valid epoch-based range lease when executing a batch
command, a node must be the owner of the lease, the lease epoch must
match the node's liveness epoch, and the node's liveness expiration
must be at least the maximum clock offset further in the future than
the command timestamp. If any of these conditions are not true,
commands are rejected before being executed (in the case of read-only
commands) or being proposed to raft (in the case of read-write
commands).
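
As a sketch of that check, continuing the illustrative types above
(`Lease`, `canServe`, and `maxOffset` are assumed names, not the real
implementation):

```go
// Lease is an illustrative epoch-based range lease: an owner plus the
// liveness epoch at which the lease was acquired (no expiration).
type Lease struct {
	Holder NodeID // lease owner; 0 means the lease is vacant
	Epoch  int64  // owner's liveness epoch when the lease was taken
}

// canServe reports whether a node may execute a batch command at cmdTS
// under an epoch-based lease, per the three conditions above.
func canServe(nodeID NodeID, lease Lease, liveness Liveness,
	cmdTS time.Time, maxOffset time.Duration) bool {
	if lease.Holder != nodeID {
		return false // this node does not own the lease
	}
	if lease.Epoch != liveness.Epoch {
		return false // the lease's epoch is no longer current
	}
	// The liveness expiration must lead the command timestamp by at
	// least the maximum clock offset.
	return !liveness.Expiration.Before(cmdTS.Add(maxOffset))
}
```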

Expiration-based range leases were previously verified when applying
the raft command by checking the command timestamp against the lease's
expiration. Epoch-based range leases cannot be independently verified
in the same way by each Raft applier, as they rely on state which may
or may not be available (i.e. a slow or broken gossip connection at an
applier). Instead of checking lease parameters both upstream and
downstream of Raft, this new design accommodates both lease types by
checking lease parameters upstream and then verifying that the lease
**has not changed** downstream. The proposer includes its lease with
Raft commands as `OriginLease`. At command-apply time, each node
verifies that the lease in the FSM is equal to the lease verified
upstream of Raft by the proposer.
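
A sketch of this equality check, again using the illustrative `Lease`
type above (`ProposedCommand` and `shouldApply` are hypothetical
names):

```go
// ProposedCommand stands in for a Raft proposal; the proposer attaches
// the lease it verified upstream as OriginLease.
type ProposedCommand struct {
	OriginLease Lease
	// ... the actual write being proposed
}

// shouldApply is evaluated downstream of Raft by every replica: the
// command applies only if the lease recorded in the replica's state
// machine still equals the lease the proposer checked.
func shouldApply(cmd ProposedCommand, fsmLease Lease) bool {
	return cmd.OriginLease == fsmLease
}
```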

To see why lease verification downstream of Raft is required,
consider the following example:

- replica 1 receives a client request for a write
- replica 1 checks the lease; the write is permitted
- replica 1 proposes the command
- time passes, replica 2 commits a new lease
- the command applies on replica 1
- replica 2 serves anomalous reads which don't see the write
- the command applies on replica 2

Each node periodically heartbeats its liveness record, which is
implemented as a conditional put that increases the expiration
timestamp and ensures that the epoch has not changed. If the epoch
does change, *all* of the range leases held by this node are
revoked. A node can only execute commands (propose writes to Raft or
serve reads) if it's the range `LeaseHolder`, the range lease epoch is
equal to the node's liveness epoch, and the command timestamp is less
than the node's liveness expiration minus the maximum clock offset.
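
A sketch of such a heartbeat as a compare-and-swap, with
`LivenessStore` standing in for the KV conditional-put primitive (a
hypothetical interface, not the real API):

```go
// LivenessStore abstracts the conditional put on the liveness key.
type LivenessStore interface {
	// CPut writes newVal only if the stored record equals expected; on
	// a mismatch it returns the actual stored record and an error.
	CPut(key NodeID, expected, newVal Liveness) (Liveness, error)
}

// heartbeat extends this node's liveness expiration while verifying
// that its epoch has not changed.
func heartbeat(store LivenessStore, cur Liveness, livenessDuration time.Duration) error {
	next := cur
	next.Expiration = time.Now().Add(livenessDuration)
	// If another node has incremented our epoch, `cur` no longer
	// matches the stored record, the CPut fails, and every range lease
	// held under the old epoch must be treated as revoked.
	_, err := store.CPut(cur.NodeID, cur, next)
	return err
}
```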

A range lease is valid for as long as the node’s liveness record has
the same epoch. If a node is down (and its node liveness has expired),
another node may revoke its lease(s) by incrementing the non-live
node's liveness epoch. Once this is done the old range lease is
invalidated and a new node may claim the range lease. A range lease
can move from node A to node B only after node A's liveness record has
expired and its epoch has been incremented.
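
A corresponding sketch of revocation via epoch increment, reusing the
hypothetical `LivenessStore` from the heartbeat sketch above:

```go
// incrementEpoch attempts to bump a seemingly dead node's liveness
// epoch, which invalidates every range lease held under the old epoch.
// It reports whether the increment took effect.
func incrementEpoch(store LivenessStore, dead Liveness, now time.Time) (bool, error) {
	if now.Before(dead.Expiration) {
		return false, nil // the node is still live; nothing to revoke
	}
	next := dead
	next.Epoch++
	// The CPut fails if the record changed underneath us (e.g. the node
	// heartbeated after our gossiped view was taken); the caller simply
	// retries with the up-to-date record that is returned.
	_, err := store.CPut(dead.NodeID, dead, next)
	return err == nil, err
}
```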

A node can transfer a range lease it owns without incrementing the
epoch counter by means of a conditional put to the range lease record
which either sets a new `LeaseHolder` or sets the `LeaseHolder` to
0. This is necessary in the case of rebalancing when the node that
holds the range lease is being removed. `AdminTransferLease` will be
enhanced to perform transfers correctly using epoch-based range
leases.
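
A sketch of such a transfer, with `LeaseStore` as a hypothetical
stand-in for a conditional put on the range lease record:

```go
// LeaseStore abstracts a conditional put on a range's lease record.
type LeaseStore interface {
	CPut(expected, newVal Lease) (Lease, error)
}

// transferLease hands the lease to another node (or vacates it when
// `to` is 0) without touching any liveness epoch. `toEpoch` is the
// target node's current liveness epoch.
func transferLease(store LeaseStore, cur Lease, to NodeID, toEpoch int64) error {
	_, err := store.CPut(cur, Lease{Holder: to, Epoch: toEpoch})
	return err
}
```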

An existing lease which uses the traditional, expiration-based
mechanism may be upgraded to an epoch-based lease if the proposer
is the `LeaseHolder` or the lease is expired.

An existing lease which uses the epoch-based mechanism may be acquired
if the `LeaseHolder` is set to 0 or the proposer is incrementing the
epoch. Replicas in the same range will always accept a range lease
request where the epoch is being incremented -- that is, they defer to
the proposer's view of the liveness record. They do not consult their
own view of liveness and can even be disconnected from gossip and
still proceed.
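
A sketch of that acceptance rule for epoch-based leases (illustrative
only, using the `Lease` type above):

```go
// mayAcquire captures the rule sketched above: an epoch-based lease
// may be claimed when it is vacant or when the request carries a
// higher epoch, implying the previous holder's liveness record expired
// and was incremented. Replicas do not consult their own liveness view.
func mayAcquire(cur, req Lease) bool {
	return cur.Holder == 0 || req.Epoch > cur.Epoch
}
```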

[NB: previously this RFC recommended a distributed transaction to
update the range lease record. See note in "Alternatives" below for
details on why that's unnecessary.]

At the raft level, each command currently contains the node ID that
held the lease at the time the command was proposed. This will be
extended to include the epoch of that node’s lease. Commands are
applied or rejected based on their position in the raft log: if the
node ID and epoch match the last committed lease, the command will be
applied; otherwise it will be rejected.

Node liveness records are gossiped by the range lease holder for the
range which contains them. Gossip is used in order to minimize fanout
and make distribution efficient. The best-effort nature of gossip is
acceptable here because timely delivery of node liveness updates is
not required for system correctness. Any node which fails to receive
liveness updates will simply resort to a conditional put to increment
a seemingly not-live node's liveness epoch. The conditional put will
fail because the expected value is out of date, and the correct
liveness record is returned to the caller.


# Performance implications

We expect traffic proportional to the number of nodes in the system.
With 1,000 nodes and a 3s liveness duration threshold, we expect every
node to do a conditional put to update the expiration timestamp every
2.4s. That would correspond to ~417 reqs/second, a not-unreasonable
load for this function. By contrast, using expiration-based leases in
a cluster with 1,000 nodes and 10,000 ranges / node, we'd expect to
see (10,000 ranges * 1,000 nodes / 3 replicas-per-range / 2.4s)
~= 1.39M reqs / second.

We still require the traditional expiration-based range leases for any
ranges located at or before the range containing liveness records. This
might be problematic in the case of meta2 address record ranges, which
are expected to proliferate in a large cluster. This lease traffic
could be obviated if we moved the node liveness records to the very
start of the keyspace, but the historical apportionment of that
keyspace makes such a change difficult. A rough calculation puts the
number of meta2 ranges at between 10 and 50 for a 10M range cluster,
so this seems safe to ignore for the conceivable future.


# Drawbacks

The greatest drawback is relying on the availability of the range
containing the node liveness records. This presents a single point of
failure which is not present to the same degree in the current system.
Even though the first range is crucial to addressing data in the
system, those reads can be inconsistent and meta1 records change
slowly, so availability is likely to be good even in the event the
first range can’t reach consensus. A reasonable solution is to
increase the number of replicas in the zones containing the node
liveness records - something that is generally considered sound
practice in any case. [NB: we also rely on the availability of various
system data. For example, if the `system.lease` info isn't available
we won't be able to serve any SQL traffic].

Another drawback is the concentration of write traffic to the node
liveness records. This could be mitigated by splitting the node
liveness range at arbitrary resolutions, perhaps even to the point
where there’s a single node liveness record per range. This is
unlikely to be much of a problem unless the number of nodes in the
system is significant.


# Alternatives

The cost of the current system of per-range lease renewals could be
reduced easily by changing some constants (increasing range sizes and
lease durations), although the gains would be much less than what is
being proposed here and read-heavy workloads would still spend much of
their time on lease updates.

If we used copysets, there may be an opportunity to maintain range
leases at the granularity of copysets.

## Use of distributed txn for updating liveness records

The original proposal mentioned: "The range lease record is always
updated in a distributed transaction with the node liveness record to
ensure that the epoch counter is consistent and the start time is
greater than the prior range lease holder’s node liveness expiration
(plus the maximum clock offset)."

This has been abandoned mostly out of a desire to avoid changing the
nature of the range lease record and the range lease raft command. To
see why it's not necessary, consider a range lease being updated out
of sync with the node liveness records. That would mean either that
the epoch being incremented is older than the epoch in the liveness
record, or that the lease is being proposed at a timestamp which has
already expired. It's not possible to update to a later epoch or newer
timestamp than what's in the liveness record because epochs are taken
directly from the source and are incremented monotonically; timestamps
are proposed only within the bounds for which a node has successfully
heartbeated the liveness record.

In the event of an earlier timestamp or epoch, the proposer would
succeed in acquiring the range lease, but would then fail immediately
on attempting to use it, as it could not possibly still have an HLC
clock time corresponding to the now-old epoch at which it acquired the
lease.


# Unresolved questions

Should we have a general purpose lease protobuf to describe both? That
is, a single lease proto, with first-range leases using the current
system and all other range leases using the proposed system.

How does this mechanism inform future designs to incorporate quorum
leases?