- Feature Name: Node-level mechanism for refreshing range leases
- Status: completed
- Start Date: 2016-02-10
- Authors: Ben Darnell, Spencer Kimball
- RFC PR: [#4288](https://github.com/cockroachdb/cockroach/pull/4288)
- Cockroach Issue: [#315](https://github.com/cockroachdb/cockroach/issues/315), [#6107](https://github.com/cockroachdb/cockroach/issues/6107)

# Summary

This is a proposal to replace the current per-range lease mechanism with a coarser-grained per-node lease in order to minimize range lease renewal traffic. In the new system, a range lease has two components: a short-lived node lease (managed by the node) and a range lease of indefinite duration (managed by the range, as it is currently). Only the node-level lease needs to be refreshed automatically.

# Motivation

All active ranges require a range lease, which is currently updated via Raft and persisted in the range-local keyspace. Range leases have a moderate duration (currently 9 seconds) in order to be responsive to failures. Since they are stored through Raft, they must be maintained independently for each range and cannot be coalesced as is possible for heartbeats. If a range is active, the lease is renewed before it expires (currently, after 7.2 seconds). This can result in a significant amount of traffic just to renew range leases.

A motivating example is a table with 10,000 ranges experiencing heavy read traffic. If the primary key for the table is chosen such that load is evenly distributed, then read traffic will likely keep each range active. The range lease must be renewed in order to serve consistent reads. For 10,000 ranges, that would require 1,388 Raft commits per second. This imposes a drain on system resources that grows with the dataset size.

An insight which suggests possible alternatives is that renewing 10,000 range leases duplicates a lot of work in a system which has only a small number of nodes. In particular, what we mostly care about is node liveness. Currently, each replica holding a range lease must renew it individually. What if we could have the node renew for all of its replicas holding range leases at once?

# Detailed design

We introduce a new node liveness system KV range at the beginning of the keyspace (not an actual SQL table; it will need to be accessed with lower-level APIs). This table is special in several respects: it is gossiped, and leases within its keyspace (and all ranges that precede it, including meta1 and meta2) use the current, per-range lease mechanism to avoid circular dependencies. The table maps node IDs to an epoch counter and an expiration timestamp.

## Node liveness proto

| Column     | Description |
| ---------- | ----------- |
| NodeID     | node identifier |
| Epoch      | monotonically increasing liveness epoch |
| Expiration | timestamp at which the liveness record expires |

The node liveness system KV range supports a new type of range lease, referred to hereafter as an "epoch-based" range lease. Epoch-based range leases specify an epoch in addition to the owner, instead of using a timestamp expiration. The lease is valid for as long as the epoch for the lease holder is valid according to the node liveness table. To hold a valid epoch-based range lease for the execution of a batch command, a node must be the owner of the lease, the lease epoch must match the node's liveness epoch, and the node's liveness expiration must be at least the maximum clock offset further in the future than the command timestamp. If any of these conditions does not hold, commands are rejected before being executed (in the case of read-only commands) or proposed to Raft (in the case of read-write commands).
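As a concrete illustration, the check described above might look roughly like the following Go sketch. The `Liveness` and `EpochLease` types and the `CheckEpochLease` helper are simplified, hypothetical stand-ins for the real protos; the sketch only restates the three conditions and does not mirror the actual implementation.

```go
package lease

import (
	"fmt"
	"time"
)

// Liveness mirrors the node liveness record described above: a node ID,
// a monotonically increasing epoch, and an expiration timestamp.
type Liveness struct {
	NodeID     int32
	Epoch      int64
	Expiration time.Time
}

// EpochLease is a simplified epoch-based range lease: an owner plus the
// liveness epoch under which the lease was acquired. It carries no
// expiration of its own.
type EpochLease struct {
	OwnerNodeID int32
	Epoch       int64
}

// CheckEpochLease reports whether nodeID may use the lease to execute a
// command at cmdTS: the node must own the lease, the lease epoch must
// match the node's liveness epoch, and the liveness expiration must be at
// least maxOffset beyond the command timestamp.
func CheckEpochLease(l EpochLease, live Liveness, nodeID int32, cmdTS time.Time, maxOffset time.Duration) error {
	if l.OwnerNodeID != nodeID {
		return fmt.Errorf("node %d does not own the lease (owner is %d)", nodeID, l.OwnerNodeID)
	}
	if l.Epoch != live.Epoch {
		return fmt.Errorf("lease epoch %d does not match liveness epoch %d", l.Epoch, live.Epoch)
	}
	if live.Expiration.Before(cmdTS.Add(maxOffset)) {
		return fmt.Errorf("liveness expires at %v, within max offset of command timestamp %v", live.Expiration, cmdTS)
	}
	return nil
}
```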
Expiration-based range leases were previously verified when applying the Raft command by checking the command timestamp against the lease's expiration. Epoch-based range leases cannot be independently verified in the same way by each Raft applier, as they rely on state which may or may not be available (e.g. due to a slow or broken gossip connection at an applier). Instead of checking lease parameters both upstream and downstream of Raft, this new design accommodates both lease types by checking lease parameters upstream and then verifying that the lease **has not changed** downstream. The proposer includes its lease with Raft commands as `OriginLease`. At command-apply time, each node verifies that the lease in the FSM is equal to the lease verified upstream of Raft by the proposer.

To see why lease verification downstream of Raft is required, consider the following example:

- replica 1 receives a client request for a write
- replica 1 checks the lease; the write is permitted
- replica 1 proposes the command
- time passes, replica 2 commits a new lease
- the command applies on replica 1
- replica 2 serves anomalous reads which don't see the write
- the command applies on replica 2

Each node periodically heartbeats its liveness record via a conditional put which increases the expiration timestamp and ensures that the epoch has not changed. If the epoch does change, *all* of the range leases held by this node are revoked. A node can only execute commands (propose writes to Raft or serve reads) if it is the range `LeaseHolder`, the range lease epoch is equal to the node's liveness epoch, and the command timestamp is less than the node's liveness expiration minus the maximum clock offset.
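A minimal sketch of the heartbeat and revocation paths follows, assuming a generic conditional-put interface. `LivenessStore`, `Heartbeat`, and `IncrementEpoch` are hypothetical names for illustration, not the actual KV client API.

```go
package liveness

import (
	"errors"
	"time"
)

// ErrStillLive is returned when trying to increment the epoch of a node
// whose liveness record has not yet expired.
var ErrStillLive = errors.New("liveness record has not expired")

// Liveness is the record stored in the node liveness range (see the
// proto table above).
type Liveness struct {
	NodeID     int32
	Epoch      int64
	Expiration time.Time
}

// LivenessStore stands in for a KV client offering a conditional put
// keyed by node ID: the write succeeds only if the stored record still
// equals expected; otherwise the error carries the current record.
type LivenessStore interface {
	CPut(nodeID int32, updated, expected Liveness) error
}

// Heartbeat extends this node's own liveness by bumping the expiration
// while insisting, via the conditional put, that the epoch is unchanged.
// A failed CPut means another node incremented the epoch, which revokes
// all of this node's epoch-based range leases.
func Heartbeat(s LivenessStore, self Liveness, livenessDuration time.Duration) (Liveness, error) {
	updated := self
	updated.Expiration = time.Now().Add(livenessDuration)
	if err := s.CPut(self.NodeID, updated, self); err != nil {
		return Liveness{}, err
	}
	return updated, nil
}

// IncrementEpoch is what another node does to revoke a seemingly dead
// node's leases: once the target's expiration has passed, bump its epoch
// with a conditional put against the last-known record. A stale expected
// value makes the CPut fail and hands back the correct record.
func IncrementEpoch(s LivenessStore, target Liveness) (Liveness, error) {
	if time.Now().Before(target.Expiration) {
		return Liveness{}, ErrStillLive
	}
	updated := target
	updated.Epoch++
	if err := s.CPut(target.NodeID, updated, target); err != nil {
		return Liveness{}, err
	}
	return updated, nil
}
```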
A range lease is valid for as long as the node's lease has the same epoch. If a node is down (and its node liveness has expired), another node may revoke its lease(s) by incrementing the non-live node's liveness epoch. Once this is done, the old range lease is invalidated and a new node may claim the range lease. A range lease can move from node A to node B only after node A's liveness record has expired and its epoch has been incremented.

A node can transfer a range lease it owns without incrementing the epoch counter by means of a conditional put on the range lease record which either sets a new `LeaseHolder` or sets the `LeaseHolder` to 0. This is necessary in the case of rebalancing, when the node that holds the range lease is being removed. `AdminTransferLease` will be enhanced to perform transfers correctly using epoch-based range leases.

An existing lease which uses the traditional, expiration-based mechanism may be upgraded to an epoch-based lease if the proposer is the `LeaseHolder` or the lease is expired.

An existing lease which uses the epoch-based mechanism may be acquired if the `LeaseHolder` is set to 0 or the proposer is incrementing the epoch. Replicas in the same range will always accept a range lease request where the epoch is being incremented -- that is, they defer to the veracity of the proposer's view of the liveness record. They do not consult their own view of liveness and can proceed even while disconnected from gossip.

[NB: previously this RFC recommended a distributed transaction to update the range lease record. See the note in "Alternatives" below for details on why that's unnecessary.]

At the Raft level, each command currently contains the node ID that held the lease at the time the command was proposed. This will be extended to include the epoch of that node's lease. Commands are applied or rejected based on their position in the Raft log: if the node ID and epoch match the last committed lease, the command will be applied; otherwise it will be rejected.

Node liveness records are gossiped by the range lease holder for the range which contains them. Gossip is used in order to minimize fanout and make distribution efficient. The best-effort nature of gossip is acceptable here because timely delivery of node liveness updates is not required for system correctness. Any node which fails to receive liveness updates will simply resort to a conditional put to increment a seemingly not-live node's liveness epoch. The conditional put will fail because the expected value is out of date, and the correct liveness record is returned to the caller.

# Performance implications

We expect traffic proportional to the number of nodes in the system. With 1,000 nodes and a 3s liveness duration threshold, we expect every node to do a conditional put to update its expiration timestamp every 2.4s. That corresponds to ~417 requests per second, a not-unreasonable load for this function. By contrast, using expiration-based leases in a cluster with 1,000 nodes and 10,000 ranges per node, we'd expect to see (10,000 ranges * 1,000 nodes / 3 replicas-per-range / 2.4s) ~= 1.39M requests per second.

We still require the traditional expiration-based range leases for any ranges located at or before the range containing liveness records. This might be problematic in the case of meta2 address record ranges, which are expected to proliferate in a large cluster. This lease traffic could be obviated if we moved the node liveness records to the very start of the keyspace, but the historical apportionment of that keyspace makes such a change difficult. A rough calculation puts the number of meta2 ranges at between 10 and 50 for a 10M-range cluster, so this seems safe to ignore for the conceivable future.
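For reference, the back-of-envelope arithmetic in this section can be restated as a short snippet. The constants below are the assumptions stated above (1,000 nodes, 10,000 ranges per node, 3 replicas per range, renewals every 2.4s), not measured values.

```go
package main

import "fmt"

func main() {
	const (
		nodes            = 1000.0
		rangesPerNode    = 10000.0
		replicasPerRange = 3.0
		renewalInterval  = 2.4 // seconds (80% of the 3s liveness duration)
	)

	// Epoch-based leases: one liveness heartbeat per node per interval.
	fmt.Printf("liveness heartbeats: ~%.0f reqs/sec\n", nodes/renewalInterval) // ~417

	// Expiration-based leases: one renewal per range lease per interval.
	// Each range is counted once, hence dividing the per-node replica
	// count by the replication factor.
	fmt.Printf("range lease renewals: ~%.2fM reqs/sec\n",
		rangesPerNode*nodes/replicasPerRange/renewalInterval/1e6) // ~1.39M
}
```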
# Drawbacks

The greatest drawback is relying on the availability of the range containing the node liveness records. This presents a single point of failure which is not as severe in the current system. Even though the first range is crucial to addressing data in the system, those reads can be inconsistent and meta1 records change slowly, so availability is likely to be good even in the event the first range can't reach consensus. A reasonable solution is to increase the number of replicas in the zones containing the node liveness records - something that is generally considered sound practice in any case. [NB: we also rely on the availability of various system data. For example, if the `system.lease` info isn't available we won't be able to serve any SQL traffic.]

Another drawback is the concentration of write traffic to the node liveness records. This could be mitigated by splitting the node liveness range at arbitrary resolutions, perhaps even so that there's a single node liveness record per range. This is unlikely to be much of a problem unless the number of nodes in the system is significant.

# Alternatives

The cost of the current system of per-range lease renewals could be reduced easily by changing some constants (increasing range sizes and lease durations), although the gains would be much smaller than what is being proposed here, and read-heavy workloads would still spend much of their time on lease updates.

If we used copysets, there might be an opportunity to maintain leases at the granularity of copysets.

## Use of distributed txn for updating liveness records

The original proposal mentioned: "The range lease record is always updated in a distributed transaction with the node liveness record to ensure that the epoch counter is consistent and the start time is greater than the prior range lease holder's node liveness expiration (plus the maximum clock offset)."

This has been abandoned mostly out of a desire to avoid changing the nature of the range lease record and the range lease Raft command. To see why it's not necessary, consider a range lease being updated out of sync with the node liveness records. That would mean either that the epoch being incremented is older than the epoch in the liveness record, or else that the lease is being requested at a timestamp which has already expired. It's not possible to update to a later epoch or newer timestamp than what's in the liveness record, because epochs are taken directly from the source and are incremented monotonically, and timestamps are proposed only within the bounds for which a node has successfully heartbeated the liveness record.

In the event of an earlier timestamp or epoch, the proposer would succeed at acquiring the range lease, but would then fail immediately on attempting to use it, as it could not possibly still have an HLC clock time corresponding to the now-old epoch at which it acquired the lease.

# Unresolved questions

Should we have a general-purpose lease protobuf that describes both: a single lease proto used for first-range leases under the current system and for all other range leases under the proposed system?

How does this mechanism inform future designs to incorporate quorum leases?
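On the first unresolved question, one possible (purely illustrative) shape for a general-purpose lease proto is a single message in which exactly one of an expiration timestamp or an epoch is set, with validity interpreted accordingly. The Go sketch below is an assumption about what that might look like, not a settled design.

```go
package lease

import "time"

// Lease sketches a single proto covering both lease flavors. Exactly one
// of Expiration or Epoch would be set: ranges at or before the node
// liveness range keep the expiration-based mechanism, while all other
// ranges use epochs.
type Lease struct {
	OwnerNodeID int32
	Start       time.Time

	// Expiration-based: the lease is valid while the command timestamp
	// precedes this time.
	Expiration *time.Time

	// Epoch-based: the lease is valid while the owner's liveness record
	// still carries this epoch.
	Epoch *int64
}

// Type reports which mechanism the lease uses.
func (l Lease) Type() string {
	if l.Epoch != nil {
		return "epoch"
	}
	return "expiration"
}
```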