- Feature Name: Follower Reads
- Status: accepted
- Start Date: 2018-06-03
- Authors: Spencer Kimball, Tobias Schottdorf
- RFC PR: #21056
- Cockroach Issue: #16593

# Summary

Follower reads are consistent reads at historical timestamps from follower
replicas. They make the non-leader replicas in a range suitable sources for
historical reads. Historical reads include both `AS OF SYSTEM TIME` queries
and transactions with a read timestamp sufficiently in the past (for
example long-running analytics queries).

The key enabling technology is the exchange of **closed timestamp updates**
between stores. A closed timestamp update (*CT update*) is a store-wide
timestamp together with (sparse) per-range information on the Raft progress.
Follower replicas use local state built up from successive received CT updates
to ascertain that they have all the state necessary to serve consistent reads
at and below the leaseholder store's closed timestamp.

Follower reads are only possible for epoch-based leases, which includes all user
ranges but excludes some system ranges (such as the addressing metadata ranges).
In what follows, all leases mentioned are epoch-based.

# Motivation

Consistent historical reads are useful for analytics queries and in particular
allow such queries to be carried out more efficiently and, with appropriate
configuration, away from foreground traffic. But historical reads are also key
to a proposal for [reference-like
tables](https://github.com/cockroachdb/cockroach/issues/26301) aimed at cutting
down on foreign key check latencies, particularly in geo-distributed clusters;
they help recover a reasonably recent consistent snapshot of a cluster after a
loss of quorum; and they are one of the ingredients for [Change Data
Capture](https://github.com/cockroachdb/cockroach/pull/25229).

# Guide-level explanation

Fundamentally, the idea is that we already keep multiple consistent copies of
all data via replication, and that we want to utilize all of the copies to
serve reads. Morally speaking, a read that only accesses data written at some
timestamp well in the past *should* be servable from all replicas (assuming
normal operation), as replication typically catches up all the followers
quickly, and most writes happen at "newer" timestamps. Clearly neither of these
two properties is guaranteed, though, so replicas have to be provided with a
way of deciding whether a given read request can be served consistently.

The closed timestamp mechanism provides, between each pair of stores, a regular
(on the order of seconds) exchange of information to that end. At a high level,
these updates contain what one might intuitively expect:

A follower trying to serve a read needs to know that a given timestamp is "safe"
(in the parlance of this RFC, "closed") to serve reads for; there must not be
some in-flight or future write that would invalidate a follower read
retroactively. Each store maintains a data structure, the **min proposal
tracker** (*MPT*) described later, to establish this timestamp.
Similarly, if a range's leaseholder commits a write into its Raft log at index
`P` before announcing a *closed timestamp*, then the follower must wait until it
has caught up to that index `P` before serving reads at the closed timestamp. To
provide this information, each store also includes with each closed timestamp an
updated minimum log index that the follower must reach before "activating" the
associated closed timestamp on that replica.

Providing the information only when there has been write activity on a given
range since the last closed timestamp is key to performance, as a store can
house upwards of 50,000 replicas, and including information about every single
one of them in each update is prohibitive due to the overhead of visiting them.

This is similar to *range quiescence*, which avoids Raft heartbeats between
inactive ranges. It's worth pointing out that quiescent ranges are able to serve
follower reads, and that there is no architectural connection between follower
reads and quiescence, though a range that is quiescent is typically one that
requires no per-range CT update.

As we've seen above, this RFC deals in "log positions" (and closed timestamps).
For technical reasons, the "log position" is not the Raft log position but the
**Lease Applied Index**, a concept introduced by us on top of Raft to handle
Raft-internal reproposals and reorderings. Ultimately, what we're after is a
promise of the form

> no more proposals writing to timestamps less than or equal to `T` are going
> to apply after log index `I`.

This guarantee is tricky to extract from the Raft log index since proposing a
command at log index `I` does not restrict it from showing up at higher log
indices later, especially in leader-not-leaseholder situations. The *Lease
Applied Index* was introduced precisely to have better control, and allows us to
make the above promise.

# Reference-level explanation

This section will focus on the technical details of the closed timestamp
mechanism, with an emphasis on correctness.

A closed timestamp update contains the following information (sent by an origin `Store`):

- the **liveness epoch** (of the origin `Store`)
- a **closed timestamp** (`hlc.Timestamp`, typically trails "real time" by at least a constant target duration)
- a **sequence number** (to allow discarding built-up state on missed updates)
- a map from `RangeID` to **minimum lease applied index** (*MLAI*) that specifies
  the updates to the recipient's map accumulated from all previous updates.

The accumulated per-range state together with the closed timestamp serve as a
guarantee of the form

> Every Raft command proposed after the min lease applied index (MLAI)
> will be ahead of the closed timestamp (CT).

Each store starts out with an empty state for each peer store and epoch, and
merges the *MLAI* updates into the state (overwriting existing *MLAI*s).
Whenever the sequence number received in an update from a peer store shows a
gap, the state for that peer store is reset, and the current update merged into
the empty state: this means that all information regarding ranges not explicitly
mentioned in the current update is lost. Similarly, if the epoch changes, the
state for any prior epoch is discarded and the update applied to an empty state
for the new epoch.
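To make the bookkeeping concrete, here is a minimal sketch of the update
payload and of the per-(origin store, epoch) state a recipient accumulates, in
a hypothetical `ctsketch` package. It is illustrative only: the real code uses
`hlc.Timestamp`, `roachpb.RangeID`, and protobuf-generated messages, and the
reset rules shown are simplified.

```go
package ctsketch

// Timestamp and RangeID are simplified stand-ins for hlc.Timestamp and
// roachpb.RangeID.
type Timestamp int64
type RangeID int64

// Update mirrors the fields listed above.
type Update struct {
	Epoch  int64              // liveness epoch of the origin store
	Closed Timestamp          // closed timestamp
	Seq    int64              // sequence number, used to detect missed updates
	MLAIs  map[RangeID]uint64 // per-range MLAI deltas since the previous update
}

// peerState is what a store accumulates for a single origin store.
type peerState struct {
	epoch  int64
	seq    int64
	closed Timestamp
	mlais  map[RangeID]uint64
}

// Apply merges an update. On an epoch change or a sequence number gap, all
// previously accumulated MLAIs are discarded and the update is applied to an
// empty state, as described above.
func (s *peerState) Apply(u Update) {
	if u.Epoch != s.epoch || u.Seq != s.seq+1 || s.mlais == nil {
		*s = peerState{epoch: u.Epoch, mlais: make(map[RangeID]uint64)}
	}
	s.seq, s.closed = u.Seq, u.Closed
	for rangeID, mlai := range u.MLAIs {
		s.mlais[rangeID] = mlai // overwrite any existing entry
	}
}
```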
At a high level, the design splits into three parts:

1. How are the outgoing updates assembled? This will mainly live in the Replica
   write path: whenever something is proposed to Raft, action needs to be taken
   to reflect this proposal in the next CT update.
2. How are the received updates used and which reads can be served? This lives
   mostly in the read path.
3. How are reads routed to eligible follower replicas? This lives both in
   `DistSender` and the DistSQL physical planner.

We will talk about how the updates are used first, as that is the most natural
starting point for correctness.

To serve a read request at timestamp `T` via follower reads, a replica

1. looks up the lease, noting the store (and thus node) and epoch it belongs to.
1. looks up the CT state known for this node and epoch.
1. checks whether the read timestamp `T` is less than or equal to the closed timestamp.
1. checks whether its *Lease Applied Index* matches or exceeds the *MLAI* for the range (in the absence of an *MLAI*, this check fails by default).

If the checks succeed, the follower serves the read (no update to the
timestamp cache is necessary). If they don't, a `NotLeaseHolderError` is
returned.

Note that if the read fails because no *MLAI* is known for that range, there
needs to be some proactive action to prompt re-sending of the *MLAI*. This is
because without write activity on the range (which is not necessarily going to
happen any time soon) the origin store will not send an update. Strategies to
mitigate this are discussed in a dedicated section below.
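Putting the four checks together, a minimal sketch of the follower-side test
could look as follows. It reuses the simplified types from the sketch above;
none of the names correspond to the actual `Replica` code.

```go
// canServeFollowerRead reports whether a follower with the given lease
// applied index may serve a read at readTS, given the CT state accumulated
// for the leaseholder's store and epoch (nil if none is known).
func canServeFollowerRead(
	readTS Timestamp,
	st *peerState, // CT state for the leaseholder's store and epoch
	rangeID RangeID,
	leaseAppliedIndex uint64,
) bool {
	if st == nil {
		return false // no CT state known for that store and epoch
	}
	if readTS > st.closed {
		return false // the read timestamp has not been closed out yet
	}
	mlai, ok := st.mlais[rangeID]
	if !ok {
		return false // no MLAI known; prompts asking the origin store for one
	}
	return leaseAppliedIndex >= mlai
}
```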
## Implied guarantees

Implicitly, a received update represents the following essential promises:

- the origin node was, at any point in time, live for the given epoch and closed
  timestamp. Concretely, this means that the origin node had a liveness update (for
  the epoch) with the closed timestamp falling *before* the stasis period.

  This guarantees that no other node could forcibly take over the lease at a
  timestamp less than or equal to the closed timestamp, and consequently, for any
  lease (as seen on a follower) for that origin store and epoch, the origin store
  knows about all relevant Raft proposals that need to be applied before serving
  follower reads.

In other words, **the ranges map in the update is authoritative** as long as:

- the *MLAI map* contains an update for any range for which a command has been
  proposed since the last update.

  This guarantee is hopefully not a surprise, but implicit in this is the
  requirement that any relevant write actually increments the lease applied
  index. Luckily, all commands do, except for lease requests (not transfers --
  see below for those), which don't mutate user-visible state.
- the origin store won't (ever) initiate a lease transfer that would allow
  another node to write at or below the closed timestamp. In other words, in the
  case of a lease transfer the next lease would start at a timestamp greater than
  the closed timestamp. This is likely impossible in practice since the transfer
  timestamps and proposed closed timestamps are taken from the same hybrid logical
  clock, but an explicit safeguard will be added just in case.

If this rule were broken, another lease holder could propose commands that
violate the closed timestamp sent by the original node (and a lagging follower
would continue seeing the old lease and convince itself that it was fine to
serve reads).

Lease transfers also require an update in the *MLAI map*; they need to
essentially force the follower to see the new lease before it serves further
follower reads (at which point it will turn to the new leaseholder's store
for guidance). Nothing special is required to get this behavior; a lease
transfer requires a valid *Lease Applied Index*, so the same mechanism that
forces followers to catch up on the Raft log for writes also makes them
observe the new lease. This requires that we wait until the *MLAI* for a
closed timestamp has been reached before deciding which node's state to query.

Note that a node restart implies a change in the liveness epoch, which in
turn invalidates all of the information sent before the restart.

## Recovering from missed updates

To regain a fully populated *MLAI* map when first receiving updates (or after
resetting the state for a peer node), there are two strategies:

1. special-case sequence number zero so that it includes an *MLAI* for all
   ranges for which the lease is held. When an update is missed, the recipient
   notifies the sender, which resets its sequence number to zero (thus sending
   a full update next).
2. ask for updates on individual ranges whenever a follower read request fails
   because of a missing *MLAI*.

We opt to implement both strategies, with the first doing the bulk of the work.
The first strategy is worthwhile because

1. the payload is essentially two varints for each range, amounting to no more
   than 20 bytes on the wire, adding up to a 1MB payload at 50,000 leaseholder
   replicas (but likely much less in practice). Even with 10x as many, a rare
   enough 10MB payload seems unproblematic, especially since it can be streamed.
2. without an eager catch-up, followers will have to warm up "on demand" but the
   routing layer has no insight into this process and will blindly route reads
   to followers, which makes for a poor experience after a node restart.

But this strategy can miss necessary updates as leases get transferred to
otherwise inactive ranges. To guard against these rare cases, the second
strategy serves as a fallback: recipients of updates can specify ranges they
would like to receive an *MLAI* for in the next update. They do this when they
observe a range state that suggests that an update has been missed, in
particular when a replica has no known *MLAI* stored for the (non-recent) lease.

## Constructing outgoing updates

To get in the right mindset for this, consider the simplified situation of a
`Store` without any pending or (near) future write activity, that is, there are
(and will be) no in-flight Raft proposals. Now, we want to send an initial CT
update to another store. This means two things:

1. the need to "close" a timestamp, i.e. to prevent any future write activity
   visible at this timestamp for any write proposed by this store as a
   leaseholder (for the current epoch).
2. the need to track an *MLAI* for each replica (for which the lease for the
   epoch is held).

The first requirement is roughly equivalent to bumping the low water mark of
the timestamp cache to one logical tick above the desired closed timestamp
(though doing that in practice would perform poorly).

The second one is also straightforward: simply read the *Lease Applied Index* for
each replica; since nothing is in-flight, that's all the followers need to know
about.
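In this idealized quiescent case, assembling an update is almost mechanical.
Below is a sketch under that assumption, reusing the simplified `Update` type
from earlier; the `laiByRange` argument stands in for iterating over the
store's lease-holding replicas.

```go
// naiveUpdate closes out now-target and reports, for every range whose lease
// this store holds, the replica's current lease applied index. This only
// works because nothing is in flight; the MPT introduced below replaces it.
func naiveUpdate(epoch, seq int64, now, target Timestamp, laiByRange map[RangeID]uint64) Update {
	u := Update{
		Epoch:  epoch,
		Seq:    seq,
		Closed: now - target,
		MLAIs:  make(map[RangeID]uint64, len(laiByRange)),
	}
	for rangeID, lai := range laiByRange {
		u.MLAIs[rangeID] = lai
	}
	return u
}
```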
In reality, there will sometimes be ongoing writes on a replica for which we want
to obtain an *MLAI*, and so 1) and 2) get more complicated.

Instead of adjusting the timestamp cache, we introduce a dedicated data
structure, the **minimum proposal tracker** (*MPT*), which tracks (at coarse
granularity) the timestamps for which proposals are still ongoing. In
particular, it can decide when it is safe to close out a higher timestamp than
before. This replaces 1), but retrieving an *MLAI* is also less straightforward
than before.

Assume the replica shows a *Lease Applied Index* of 12, but three proposals are
in-flight while another two have acquired latches and are still evaluating.
Presumably the in-flight proposals were assigned *Lease Applied Indexes* 13
through 15, and the ones being evaluated will receive 16 and 17 (depending on
the order in which they enter Raft). This is where the *MPT*'s second function
comes in: it tracks writes until they are assigned a (provisional) *Lease
Applied Index*, and makes sure that an authoritative *MLAI* delta is returned
with each closed timestamp. This delta is *authoritative* in the sense that it
will reflect the largest **proposed** *MLAI* seen relevant to the newly closed
timestamp (relative to the previous one).

Consequently, when we say that a proposal is tracked, we're talking about the
interval between determining the request timestamp (which is after acquiring
latches) and determining the proposal's *Lease Applied Index*.

It's natural to ask whether there may be "false positives", i.e. whether a
command proposed for some *Lease Applied Index* may never actually return from
Raft with a corresponding update to the range state. The response is that this
isn't possible: a command proposed to Raft is either retried until it's clear
that the desired *Lease Applied Index* has already been surpassed (in which case
there is no problem) or the leaseholder process exits (in which case there will
be a new leaseholder and previous in-flight commands that never made it into the
log are irrelevant).

The naive approach of tracking the maximum assigned lease applied index is
problematic. To see this, consider the relevant example of a store that wants to
close out a timestamp around five seconds in the past, but which has high write
throughput on some range. Tracking the maximum proposed lease applied index
until we close out the timestamp `now()-5s` means that a follower won't be able
to serve reads until it has caught up on the last five seconds as well, even
though they are likely not relevant to the reads it wants to serve. This
motivates the precise form of the *MPT*, which has two adjacent "buckets" that
it moves forward in time: one tracking proposals relevant to the next closed
timestamp, and one with proposals relevant for the one after that.
The MPT consists of the previously emitted closed timestamp (zero initially) and
a prospective next closed timestamp aptly named `next` (always strictly larger
than `closed`) at or below which new writes are not accepted. It also contains
two ref counts and *MLAI* maps, one associated with the timestamps below `next`
and one with those at or above it.

Its API is roughly the following:

```go
// t := NewTracker()

// In Replica write path:
waitCmdQueue(ba)
applyTimestampCache(ba)
ts, done := t.Track(ba.Timestamp)
ba.ForwardTimestamp(ts)
proposal := evaluate(ba)
proposal.LeaseAppliedIndex = <X> // assigned while creating the proposal
done(proposal.LeaseAppliedIndex)
propose(proposal)

// In periodic store-level loop:
closedTS, mlaiMap := t.CloseWithNext(clock.Now() - TargetDuration)
sendUpdateToPeers(closedTS, mlaiMap)
```

Note that by using this API for *any* proposal it is guaranteed that we produce
all the updates promised to consumers of the CT updates. A few redundant pieces
of information may be sent (i.e. for lease requests triggering on a follower
range) but these are infrequent and cause no harm.

In what follows we'll go through an example, which for simplicity assumes that
all writes relate to the same range (thus reducing the *MLAI* maps to scalars).
The state of the *MPT* is laid out as in the diagram below. You see a previously
closed timestamp as well as a prospective next closed timestamp. There are three
proposals tracked at timestamps strictly less than `next`, and one proposal at
`next` or higher. Additionally, for proposals strictly less than `next`, the
*MLAI* `8` was recorded while that for the other side is `17`.

```
 closed             next
   |             @8  | @17
   |             #3  | #1
   |                 |
   v                 v
---------------------------------------------------------> time
```

Let's walk through an example of how the MPT works. For ease of illustration, we
restrict attention to activity on a single replica (which avoids having a *map*
of *MLAI*s; now it's just one). Initially, `closed` and `next` demarcate some
time interval. Three commands arrive; `next`'s right side picks up a refcount of
three (new commands are forced above `next`, though in this case they were there
to begin with):

```
 closed        next           commands
   |        @0  | @0           /\  \________
   |        #0  | #3          /  \          |
   v            v             v  v          v
------------------------------x--x----------x------------> time
```

Next, it's time to construct a CT update. Since `next`'s left has a refcount of
zero, we know that nothing is in progress for timestamps below `next`, which
will now officially become a closed timestamp. To do so, `next` is returned to
the client along with the *MLAI* for its left (there is none this time around).
Additionally, the data structure is set up for the next iteration: `closed` is
forwarded to `next`, and `next` forwarded to a suitable timestamp some constant
target duration away from the current time. The commands previously tracked
ahead of `next` are now on its left. Note that even though one of the commands
has a timestamp ahead of `next`, it is now tracked to its left. This is fine; it
just means that we're taking a command into account earlier than required for
correctness.
```
                                     next
                                   @0  | @0
              closed     commands  #3  | #0
                |              /\ \____|____
                |             /  \     |    |
                v             v  v     v    v
------------------------------x--x----------x------------> time
```

Two of the commands get proposed (at *LAI*s, say, 10 and 11), decrementing
the left refcount and adding an *MLAI* entry of 11 (the max of the two) to it.
Additionally, two new commands arrive, this time at timestamps below `next`.
These commands are forced above `next` first, so the refcount goes to the right.
These new commands get proposed quickly (so they don't show
up again) and the right refcount will drop back to zero (though it will retain
the max *MLAI* seen, likely 13).

```
                     in-flight
 closed               command           next
   |                           \      @11| @0
   |                            \     #1 | #2
   v                             v       v
---------------------------------x-----------------------> time
                                           ^
                                           |
            ______________________________/
            |         forwarding   |
            |                      |
       new command          new command
 (finishes quickly @13)  (finishes quickly @12)
```

The remaining command sticks around in the evaluation phase. This is
unfortunate; it's time for another CT update, but we can't send a higher closed
timestamp than before (and must stick to the same one with an empty *MLAI* map):

```
(blocked)                            (blocked)
                     in-flight
 closed               command           next
   |                           \      @11| @13
   |                            \     #1 | #0
   v                             v       v
---------------------------------x-----------------------> time
```

Finally the command gets proposed at LAI 14. A new command comes in at some
reasonable timestamp and the right side picks up a ref. Note the resulting
odd-looking situation in which the left is @14 and the right @13 (this is fine;
the client tracks the maximum seen):

```
 closed        next                          in-flight
   |        @14| @13                         proposal
   |        #0 | #1                              |
   v           v                                 v
----------------------------------------------------x----> time
```

Time for the next CT update. We can finally close `next` (emitting @14) and move
it to `now - target duration`, moving the right side refcount and *MLAI* to the
left in the process.

```
             closed                         in-flight      @13| @0
               |                             proposal      #1 | #0
               |                                    |   _____/
               |                                    |  /
               v                                    v v
----------------------------------------------------x----> time
```

## Initial catch-up

The main mechanism for propagating *MLAI*s is triggered by proposals. When an
initial update is created, valid *MLAI*s have to be obtained for all ranges for
which followers are supposed to be able to serve reads. This raises two practical
questions: for which replicas should an *MLAI* be produced, and how to produce it.

We create an *MLAI* for all ranges for which (at the time of checking) the
current state indicates that the lease is held by the local store (this can have
both false positives and false negatives, but a missed follower read will trigger
a proactive update for the range it occurred on).

The initial catch-up is simple: before closing a timestamp (via the MPT), iterate
through all ranges and (if they show the store as holding the lease) feed the
MPT a proposal that lets it know the most recent *Lease Applied Index* on that
replica:

```go
_, done := t.Track(hlc.Timestamp{})
repl.mu.Lock()
lai := repl.mu.lastAssignedLeaseIndex
repl.mu.Unlock()
done(lai)
```

This can race with other proposals, but the MPT will track the maximum seen.
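To tie the walkthrough together, here is a minimal single-range sketch of the
tracker, continuing the simplified `Timestamp` stand-in from the earlier
sketches. It is illustrative only: the real tracker keeps per-range *MLAI*
maps, uses `hlc` timestamps, and handles edge cases omitted here.

```go
import "sync"

// bucket accumulates state for one side of next.
type bucket struct {
	refs int    // proposals on this side that are still being tracked
	mlai uint64 // maximum lease applied index proposed on this side
}

// Tracker is a single-range sketch of the min proposal tracker (MPT).
type Tracker struct {
	mu     sync.Mutex
	closed Timestamp // last closed timestamp emitted
	next   Timestamp // prospective next closed timestamp
	left   *bucket   // proposals relevant to the next closed timestamp
	right  *bucket   // proposals relevant to the one after that
}

func NewTracker(next Timestamp) *Tracker {
	return &Tracker{next: next, left: &bucket{}, right: &bucket{}}
}

// Track registers a proposal, forwarding its timestamp above next if
// necessary. The returned closure must be called with the proposal's lease
// applied index once that has been assigned.
func (t *Tracker) Track(ts Timestamp) (Timestamp, func(lai uint64)) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if ts <= t.next {
		ts = t.next + 1 // new writes are not accepted at or below next
	}
	b := t.right // new proposals always land on the right side
	b.refs++
	return ts, func(lai uint64) {
		t.mu.Lock()
		defer t.mu.Unlock()
		if lai > b.mlai {
			b.mlai = lai
		}
		b.refs--
	}
}

// CloseWithNext tries to close out next and forward it to target. If
// proposals relevant to next are still evaluating (or target does not lie
// ahead of next), the previously closed timestamp is re-emitted with no MLAI.
func (t *Tracker) CloseWithNext(target Timestamp) (Timestamp, uint64, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.left.refs != 0 || target <= t.next {
		return t.closed, 0, false
	}
	mlai := t.left.mlai
	t.closed, t.next = t.next, target
	t.left, t.right = t.right, &bucket{} // the right side becomes the new left
	return t.closed, mlai, true
}
```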
## Timestamp forwarding and intents

We forward commands' timestamps in order to guarantee that they don't
produce visible data at timestamps below the CT. A case in which that
is less obvious is that of an intent.

To see this, consider that a transaction has two relevant timestamps:
`OrigTimestamp` (also known as its read timestamp) and `Timestamp`
(also known as its commit timestamp). While the timestamp we forward
is `Timestamp`, the transaction internally will in fact attempt to
write at `OrigTimestamp` (but relies on moving these intents to their
actual timestamp later, when they are resolved). This prevents certain
anomalies, particularly with `SNAPSHOT` isolation.

Naively, this violates the guarantee: we promise that no more data will appear
below a certain timestamp. Note however that this data isn't visible at
timestamps below the commit timestamp (which was forwarded): to read the value,
the intent has to be resolved first, which implies that it will move at least to
`Timestamp` in the process, restoring the required guarantee.

Similarly, this does not impede the usefulness of the CT mechanism for
recovery: the restored consistent state may contain intents. But the
restored consistent state also allows resolving all of the intents in
the same way, since what matters is the transaction record. The result
will be that the intents are simply dropped, unless there is a committed
transaction record, in which case they will commit.

Note that for the CDC use case, this closed timestamp mechanism is a necessary,
but not sufficient, solution. In particular, a CDC consumer must find (or track)
and resolve all intents at timestamps below a given closed timestamp first.

## Splits/Merges

No action is necessary for splits: the leaseholders of the LHS and RHS are
colocated and so share the same closed timestamp mechanism. For convenience an
update for the RHS is added to the next round of outgoing updates; otherwise,
follower reads for the RHS would cut out for a moment.

Merges are more interesting since the leaseholders of the RHS and the LHS are
not necessarily colocated. If the RHS's store has closed a higher timestamp, say
1000, while the LHS's store is only at 500, after the merge commands might be
accepted on the combined range under the closed timestamp 500 that violate the
closed timestamp 1000. To counteract this, the `Subsume` operation
returns the closed timestamp on the origin store and the merging replica
takes it into account. Initially, the merge trigger will populate the
timestamp cache for the right side of the merge; if this has too big an impact
on the timestamp cache (especially as merges are rolled out, we might merge
away large swaths of empty ranges), we can also store the timestamp on the
replica and use it to forward proposals manually.

## Routing layer

This RFC proposes a somewhat simplistic implementation at the routing layer: at
`DistSender` and its DistSQL counterpart, if a read is for a timestamp earlier
than the current time less a target duration (which adds comfortable padding to
when followers are ideally able to serve these reads), it is sent to the nearest
replica (as measured by health, latency, locality, and perhaps a jitter),
instead of to the leaseholder.
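A minimal sketch of that routing-side test, using plain wall-clock types; the
helper name and parameters are illustrative, not the actual `DistSender` code.

```go
import "time"

// useFollowerRead reports whether a read-only batch at readTS should be sent
// to the nearest replica rather than to the leaseholder.
func useFollowerRead(readTS, now time.Time, target time.Duration, readOnly bool) bool {
	if !readOnly {
		return false // writes must go to the leaseholder
	}
	// Only reads comfortably older than the closed timestamp target duration
	// are candidates; anything fresher is routed to the leaseholder as usual.
	return readTS.Before(now.Add(-target))
}
```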
When a read is handled by a replica not equipped to serve it via a regular or
follower read, a `NotLeaseHolderError` is returned and future requests for that
same (chunk of) batch will make no attempt to use follower reads; this avoids
getting stuck in an endless loop when followers lag significantly. Similarly,
follower reads are never attempted for ranges known not to use epoch-based
leases.

## Further work

While the design outlined so far should give a reasonably performant baseline,
it has several shortcomings that will need to be addressed in follow-up work:

### Lagging followers

Assume that timestamps are closed at a 5s target duration every second, and
that the last proposal taken into account for each closed timestamp finishes
evaluating just before the timestamp is closed out. In that case, the *MLAI*
check on the followers is more likely to fail for a short moment until the Raft
log has caught up with the very recent proposal; if the catch-up takes longer
than the interval at which the timestamps are closed out, no follower read will
ever be possible. A similar scenario applies to followers far removed from the
usual commit quorum or lagging for any other reason. This should be fairly
rare, but seems important enough to be tackled in follow-up work.

The fundamental problem here is that older closed timestamps are discarded when
a new one is received, resulting in the follower never catching up to the current
closed timestamp. If it remembered the previous CT updates, it could at least
serve reads for those timestamps. This calls for a mechanism that holds on to
previous *CT*s and *MLAI*s so that reads further in the past can be served.
This won't be implemented initially to keep the complexity in the first version
to a minimum.

One way to address the problem is the following: on receipt of a CT update, copy
the CT and MLAI into the range state if the Raft log has caught up to the MLAI
(keeping the most recently overwritten value around to serve reads for). This
means that the replica will always have a valid CT during normal operation,
though one that lags the received updates (various variations on this theme
exist). However, note the strong connection to the following section:

### Recovery from insufficient quorum

As mentioned in the initial paragraphs, follower reads can help recover a
recent consistent state of an unavailable cluster, by determining the maximum
timestamp at which every range has a surviving replica that can serve a
follower read (if all replicas of a range are lost, there is obviously no hope
of consistent recovery).
At this timestamp, a consistent read of the entire keyspace (excluding ranges
with expiration-based leases) can be carried out and used to construct a
backup. Note that if replicas of such ranges persisted the last lease they
held, the timestamp could be lowered to the minimum over all surviving such
replicas' last leases, for a consistent (but less recent) read of the *whole*
keyspace.

For maximum generality, it is desirable to in principle be able to recover
without relying on in-memory state, so that a termination of the running
process does not bar a subsequent recovery.
Naively this can be achieved by persisting all received *CT* updates (with
some eviction policy that rolls up old updates into a more recent initial
state), though the eventual implementation may opt to persist at the Replica
level instead (where updates that have been caught up to can more easily be
pruned).

### Range feeds

[Range feeds][RangeFeed] are a range-level mechanism to stream updates to an
upstream Change Data Capture processor. Range feeds will rely on closed
timestamps and will want to relay them to an upstream consumer as soon as
possible. This suggests a reactive mechanism that notifies the replicas with an
active range feed on receipt of a CT update; given a registry of such replicas,
this is easy to add.

### `AS OF SYSTEM TIME RECENT`

With the advent of closed timestamps, we can also simplify `AS OF SYSTEM TIME`
by allowing users to let the server choose a reasonable "recent" timestamp in
the past for which reads can be distributed better. Note that, unlike what was
requested in [this issue][autoaost], there is no guarantee about blocking on
conflicting writers. However, since a transaction that has `PENDING` status
with a timestamp that has since been closed out is likely to have to restart
(or ideally refresh) anyway, we could consider allowing it to be pushed.

## Rationale and Alternatives

This design appears to be the sane solution given the boundary conditions.

## Unresolved questions

### Configurability

For now, the min proposal timestamp roughly trails real time by five seconds.
This can be made configurable, for example via a cluster setting or, if more
granularity is required, via zone configs (which in turn requires being able to
retrieve the history of the settings value, or a mechanism that smears out the
change over some period of time, to avoid failed follower reads).

Transactions which exceed the lag are usually forced to restart, though this
will often happen through a refresh (which is comparatively cheap, though it
needs to be tested).

[RangeFeed]: https://github.com/cockroachdb/cockroach/pull/26782
[autoaost]: https://github.com/cockroachdb/cockroach/issues/25405