github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20180603_follower_reads.md

     1  - Feature Name: Follower Reads
     2  - Status: accepted
     3  - Start Date: 2018-06-03
     4  - Authors: Spencer Kimball, Tobias Schottdorf
     5  - RFC PR: #21056
     6  - Cockroach Issue: #16593
     7  
     8  # Summary
     9  
    10  Follower reads are consistent reads at historical timestamps from follower
    11  replicas. They make the non-leader replicas in a range suitable sources for
     12  historical reads. Historical reads include both `AS OF SYSTEM TIME` queries
     13  and transactions with a read timestamp sufficiently in the past (for
    14  example long-running analytics queries).
    15  
    16  The key enabling technology is the exchange of **closed timestamp updates**
    17  between stores. A closed timestamp update (*CT update*) is a store-wide
    18  timestamp together with (sparse) per-range information on the Raft progress.
    19  Follower replicas use local state built up from successive received CT updates
    20  to ascertain that they have all the state necessary to serve consistent reads
    21  at and below the leaseholder store's closed timestamp.
    22  
    23  Follower reads are only possible for epoch-based leases, which includes all user
    24  ranges but excludes some system ranges (such as the addressing metadata ranges).
    25  In what follows all mentioned leases are epoch-based.
    26  
    27  # Motivation
    28  
    29  Consistent historical reads are useful for analytics queries and in particular
    30  allow such queries to be carried out more efficiently and, with appropriate
    31  configuration, away from foreground traffic. But historical reads are also key
    32  to a proposal for [reference-like
    33  tables](https://github.com/cockroachdb/cockroach/issues/26301) aimed at cutting
    34  down on foreign key check latencies particularly in geo-distributed clusters;
    35  they help recover a reasonably recent consistent snapshot of a cluster after a
    36  loss of quorum; and they are one of the ingredients for [Change Data
    37  Capture](https://github.com/cockroachdb/cockroach/pull/25229).
    38  
    39  # Guide-level explanation
    40  
    41  Fundamentally, the idea is that we already keep multiple consistent copies of
    42  all data via replication, and that we want to utilize all of the copies to
    43  serve reads. Morally speaking, a read which only cares to access data that was
    44  written at some timestamp well in the past *should* be servable from all
    45  replicas (assuming normal operation), as replication typically catches up all
    46  the followers quickly, and most writes happen at "newer" timestamps. Clearly
     47  neither of these two properties is guaranteed though, so replicas have to be
     48  provided with a way of deciding whether a given read request can be served
    49  consistently.
    50  
    51  The closed timestamp mechanism provides between each pair of stores a regular
    52  (on the order of seconds) exchange of information to that end. At a high level,
    53  these updates contain what one might intuitively expect:
    54  
    55  A follower trying to serve a read needs to know that a given timestamp is "safe"
    56  (in parlance of this RFC, "closed") to serve reads for; there must not be some
    57  in-flight or future write that would invalidate a follower read retroactively.
    58  Each store maintains a data structure, the **min proposal tracker** (*MPT*)
    59  described later, to establish this timestamp.
    60  
    61  Similarly, if a range's leaseholder commits a write into its Raft log at index
    62  `P` before announcing a *closed timestamp*, then the follower must wait until it
    63  has caught up to that index `P` before serving reads at the closed timestamp. To
    64  provide this information, each store also includes with each closed timestamp an
    65  updated minimum log index that the follower must reach before "activating" the
    66  associated closed timestamp on that replica.
    67  
    68  Providing the information only when there has been write activity on a given
    69  range since the last closed timestamp is key to performance, as a store can
    70  house upwards of 50000 replicas, and including information about every single
    71  one of them in each update is prohibitive due to the overhead of visiting them.
    72  
    73  This is similar to *range quiescence*, which avoids Raft heartbeats between
    74  inactive ranges. It's worth pointing out that quiescent ranges are able to serve
    75  follower reads, and that there is no architectural connection between follower
    76  reads in quiescence, though a range that is quiescent is typically one that
    77  requires no per-range CT update.
    78  
    79  As we've seen above, this RFC deals in "log positions" (and closed timestamps).
    80  For technical reasons, the "log position" is not the Raft log position but the
    81  **Lease Applied Index**, a concept introduced by us on top of Raft to handle
    82  Raft-internal reproposals and reorderings. Ultimately, what we're after is a
    83  promise of the form
    84  
     85  > no more proposals writing to timestamps less than or equal to `T` are going
    86  to apply after log index `I`.
    87  
    88  This guarantee is tricky to extract from the Raft log index since proposing a
    89  command at log index `I` does not restrict it from showing up at higher log
    90  indices later, especially in leader-not-leaseholder situations. The *Lease
    91  Applied Index* was introduced precisely to have better control, and allows us to
    92  make the above promise.
    93  
    94  # Reference-level explanation
    95  
    96  This section will focus on the technical details of the closed timestamp
    97  mechanism, with an emphasis on correctness.
    98  
    99  A closed timestamp update contains the following information (sent by an origin `Store`):
   100  
   101  - the **liveness epoch** (of the origin `Store`)
   102  - a **closed timestamp** (`hlc.Timestamp`, typically trails "real time" by at least a constant target duration)
   103  - a **sequence number** (to allow discarding built-up state on missed updates)
   104  - a map from `RangeID` to **minimum lease applied index** (*MLAI*) that specifies
   105    the updates to the recipient's map accumulated from all previous updates.
   106  
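         For concreteness, such an update could be sketched as the following Go struct.
         This is purely illustrative: the type and field names below are made up for
         this sketch, and the real payload would be a protobuf message.

         ```go
         // Illustrative sketch of a CT update; all names here are hypothetical.
         type RangeID int64
         type LeaseAppliedIndex uint64

         // Timestamp stands in for hlc.Timestamp in these sketches.
         type Timestamp struct {
             WallTime int64
             Logical  int32
         }

         // CTUpdate is the per-peer payload periodically sent by an origin store.
         type CTUpdate struct {
             Epoch    int64                         // liveness epoch of the origin store
             ClosedTS Timestamp                     // the newly closed timestamp
             SeqNum   uint64                        // gap detection: missed updates reset state
             MLAIs    map[RangeID]LeaseAppliedIndex // per-range deltas since the last update
         }
         ```
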
   107  The accumulated per-range state together with the closed timestamp serve as a
   108  guarantee of the form
   109  
   110  > Every Raft command proposed after the min lease applied index (MLAI)
   111  will be ahead of the closed timestamp (CT).
   112  
   113  Each store starts out with an empty state for each peer store and epoch, and
   114  merges the *MLAI* updates into the state (overwriting existing *MLAI*s).
   115  Whenever the sequence number received in an update from a peer store displays a
   116  gap, the state for that peer store is reset, and the current update merged into
   117  the empty state: this means that all information regarding ranges not explicitly
   118  mentioned in the current update is lost. Similarly, if the epoch changes, the
   119  state for any prior epoch is discarded and the update applied to an empty state
   120  for the new epoch.
   121  
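         A minimal sketch of this recipient-side bookkeeping, reusing the hypothetical
         `CTUpdate` type from the previous sketch (locking omitted):

         ```go
         // peerState accumulates what a store has learned from one origin store.
         type peerState struct {
             epoch    int64
             seq      uint64
             closedTS Timestamp
             mlais    map[RangeID]LeaseAppliedIndex
         }

         // apply merges a CT update into the accumulated state. On an epoch change or
         // a sequence number gap, the previously accumulated per-range state is
         // discarded and the update is merged into an empty state, as described above.
         func (s *peerState) apply(u CTUpdate) {
             if u.Epoch != s.epoch || u.SeqNum != s.seq+1 {
                 s.epoch = u.Epoch
                 s.mlais = map[RangeID]LeaseAppliedIndex{}
             }
             s.seq = u.SeqNum
             s.closedTS = u.ClosedTS
             for rangeID, mlai := range u.MLAIs {
                 s.mlais[rangeID] = mlai // newer entries overwrite older ones
             }
         }
         ```
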
   122  At a high level, the design splits into three parts:
   123  
   124  1. How are the outgoing updates assembled? This will mainly live in the Replica write
   125  path: whenever something is proposed to Raft, action needs to be taken to
   126  reflect this proposal in the next CT update.
   127  2. How are the received updates used and which reads can be served? This lives
   128  mostly in the read path.
   129  3. How are reads routed to eligible follower replicas? This lives both in
   130  `DistSender` and the DistSQL physical planner.
   131  
   132  We will talk about how they are used first, as that is the most natural
   133  starting point for correctness.
   134  
   135  To serve a read request at timestamp `T` via follower reads, a replica
   136  
   137  1. looks up the lease, noting the store (and thus node) and epoch it belongs to.
   138  1. looks up the CT state known for this node and epoch.
   139  1. checks whether the read timestamp `T` is less than or equal to the closed timestamp.
   140  1. checks whether its *Lease Applied Index* matches or exceeds the *MLAI* for the range (in the absence of an *MLAI*, this check fails by default).
   141  
    142  If the checks succeed, the follower serves the read (no update to the timestamp
    143  cache is necessary). If they don't, a `NotLeaseHolderError` is returned.
   144  
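         In code, these checks might look roughly as follows. This is a sketch reusing
         the hypothetical types from the earlier sketches, not the actual
         implementation:

         ```go
         // canServeFollowerRead sketches the follower-side decision: the read
         // timestamp must be closed out by the leaseholder's store for the lease's
         // epoch, and the local replica must have caught up to the MLAI announced
         // for the range.
         func canServeFollowerRead(
             readTS Timestamp,
             leaseEpoch int64,
             state *peerState, // accumulated CT state for the leaseholder's store
             rangeID RangeID,
             localLAI LeaseAppliedIndex,
         ) bool {
             if state == nil || state.epoch != leaseEpoch {
                 return false // no (or stale) closed timestamp information
             }
             if !lessEq(readTS, state.closedTS) {
                 return false // the read timestamp has not been closed out yet
             }
             mlai, ok := state.mlais[rangeID]
             if !ok {
                 return false // no MLAI known: fail and prompt a re-send (see below)
             }
             return localLAI >= mlai // must have applied everything up to the MLAI
         }

         func lessEq(a, b Timestamp) bool {
             return a.WallTime < b.WallTime ||
                 (a.WallTime == b.WallTime && a.Logical <= b.Logical)
         }
         ```
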
   145  Note that if the read fails because no *MLAI* is known for that range, there
   146  needs to be some proactive action to prompt re-sending of the *MLAI*. This is
   147  because without write activity on the range (which is not necessarily going to
   148  happen any time soon) the origin store will not send an update. Strategies to
   149  mitigate this are discussed in a dedicated section below.
   150  
   151  ## Implied guarantees
   152  
   153  Implicitly, a received update represents the following essential promises:
   154  
    155  - the origin node was, at some point in time, live for the given epoch and closed
   156    timestamp. Concretely, this means that the origin node had a liveness update (for
   157    the epoch) with the closed timestamp falling *before* the stasis period.
   158  
   159    This guarantees that no other node could forcibly take over the lease at a
   160    timestamp less than or equal to the closed timestamp, and consequently for any
   161    lease (as seen on a follower) for that origin store and epoch the origin store
   162    knows about all relevant Raft proposals that need to be applied before serving
   163    follower reads.
   164  
   165    In other words, **the ranges map in the update is authoritative** as long as:
   166  - the *MLAI map* contains an update for any range for which a command has been
   167    proposed since the last update.
   168  
   169    This guarantee is hopefully not a surprise, but implicit in this is the
   170    requirement that any relevant write actually increments the lease applied
   171    index. Luckily, all commands do, except for lease requests (not transfers --
   172    see below for those), which don't mutate user-visible state.
   173  - the origin store won't (ever) initiate a lease transfer that would allow
   174    another node to write at or below the closed timestamp. In other words, in the
   175    case of a lease transfer the next lease would start at a timestamp greater than
   176    the closed timestamp. This is likely impossible in practice since the transfer
   177    timestamps and proposed closed timestamps are taken from the same hybrid logical
   178    clock, but an explicit safeguard will be added just in case.
   179  
   180    If this rule were broken, another lease holder could propose commands that
   181    violate the closed timestamp sent by the original node (and a lagging follower
   182    would continue seeing the old lease and convince itself that it was fine to
   183    serve reads).
   184  
   185    Lease transfers also require an update in the *MLAI map*; they need to
   186    essentially force the follower to see the new lease before they serve further
   187    follower reads (at which point they will turn to the new leaseholder's store
   188    for guidance). Nothing special is required to get this behavior; a lease
   189    transfer requires a valid *Lease Applied Index*, so the same mechanism that
   190    forces followers to catch up on the Raft log for writes also makes them
    191    observe the new lease. This requires that we wait until we have reached the
    192    MLAI for a closed timestamp before deciding which node's state to query.
   193  
   194    Note that a node restart implies a change in the liveness epoch, which in
   195    turn invalidates all of the information sent before the restart.
   196  
   197  ## Recovering from missed updates
   198  
   199  To regain a fully populated *MLAI* map when first receiving updates (or after
   200  resetting the state for a peer node), there are two strategies:
   201  
   202  1. special case sequence number zero so that it includes an *MLAI* for all
   203     ranges for which the lease is held. When an update is missed, the recipient
    204     notifies the sender, which resets its sequence number to zero (thus sending
   205     a full update next).
   206  2. ask for updates on individual ranges whenever a follower read request fails
   207     because of a missing *MLAI*.
   208  
   209  We opt to implement both strategies, with the first doing the bulk of the work.
   210  The first strategy is worthwhile because
   211  
   212  1. the payload is essentially two varints for each range, amounting to no more than
    213     20 bytes on the wire, adding up to a 1MB payload at 50000 leaseholder replicas
   214     (but likely much less in practice).
    215     Even with 10x as many, a rare enough 10MB payload seems unproblematic,
   216     especially since it can be streamed.
   217  2. without an eager catch-up, followers will have to warm up "on demand" but the
   218     routing layer has no insight into this process and will blindly route reads
   219     to followers, which makes for a poor experience after a node restart.
   220  
   221  But this strategy can miss necessary updates as leases get transferred to
   222  otherwise inactive ranges. To guard against these rare cases, the second
   223  strategy serves as a fallback: recipients of updates can specify ranges they
   224  would like to receive an MLAI for in the next update. They do this when they
   225  observe a range state that suggests that an update has been missed, in
   226  particular when a replica has no known MLAI stored for the (non-recent) lease.
   227  
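         A sketch of the bookkeeping for this second strategy; the type and method
         names are made up, and a real implementation would need synchronization:

         ```go
         // pendingMLAIRequests remembers ranges for which a follower read failed due
         // to a missing MLAI, so that they can be asked for explicitly in the next
         // exchange with the origin store.
         type pendingMLAIRequests map[RangeID]struct{}

         func (p pendingMLAIRequests) note(id RangeID) {
             p[id] = struct{}{}
         }

         // drain returns the ranges to request and resets the set.
         func (p pendingMLAIRequests) drain() []RangeID {
             ids := make([]RangeID, 0, len(p))
             for id := range p {
                 ids = append(ids, id)
                 delete(p, id)
             }
             return ids
         }
         ```
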
   228  ## Constructing outgoing updates
   229  
    230  To get in the right mindset for this, consider the simplified situation of a
   231  `Store` without any pending or (near) future write activity, that is, there are
   232  (and will be) no in-flight Raft proposals. Now, we want to send an initial CT
   233  update to another store. This means two things:
   234  
   235  1. the need to "close" a timestamp, i.e. preventing any future write activity visible
   236     at this timestamp, for any write proposed by this store as a leaseholder (for the
   237     current epoch).
   238  2. Tracking an *MLAI* for each replica (for which the lease for the epoch is held).
   239  
   240  The first requirement is roughly equivalent to bumping the low water mark of
   241  the timestamp cache to one logical tick above the desired closed timestamp
   242  (though doing that in practice would perform poorly).
   243  
   244  The second one is also straightforward: simply read the *Lease Applied Index* for
   245  each replica; since nothing is in-flight, that's all the followers need to know
   246  about.
   247  
   248  In reality, there will sometimes be ongoing writes on a replica for which we want
   249  to obtain an *MLAI*, and so 1) and 2) get more complicated.
   250  
   251  Instead of adjusting the timestamp cache, we introduce a dedicated data
   252  structure, the **minimum proposal tracker** (*MPT*), which tracks (at coarse
   253  granularity) the timestamps for which proposals are still ongoing. In
   254  particular, it can decide when it is safe to close out a higher timestamp than
   255  before. This replaces 1), but retrieving an *MLAI* is also less straightforward
   256  than before.
   257  
   258  Assume the replica shows a *Lease Applied Index* of 12, but three proposals are
   259  in-flight whereas another two have acquired latches and are still evaluating.
   260  Presumably the in-flight proposals were assigned to *Lease Applied Indexes* 13
    261  through 15, and the ones being evaluated will receive 16 and 17 (depending on
   262  the order in which they enter Raft). This is where the *MPT*'s second function
   263  comes in: it tracks writes until they are assigned a (provisional) *Lease
   264  Applied Index*, and makes sure that an authoritative *MLAI* delta is returned
   265  with each closed timestamp. This delta is *authoritative* in the sense that it
   266  will reflect the largest **proposed** *MLAI* seen relevant to the newly closed
   267  timestamp (relative to the previous one).
   268  
   269  Consequently when we say that a proposal is tracked, we're talking about the
   270  interval between determining the request timestamp (which is after acquiring
   271  latches) and determining the proposal's *Lease Applied Index*.
   272  
   273  It's natural to ask whether there may be "false positives", i.e. whether a
   274  command proposed for some *Lease Applied Index* may never actually return from
   275  Raft with a corresponding update to the range state. The response is that this
    276  isn't possible: a command proposed to Raft is either retried until it's clear that
   277  the desired *Lease Applied Index* has already been surpassed (in which case
   278  there is no problem) or the leaseholder process exits (in which case there will
   279  be a new leaseholder and previous in-flight commands that never made it into the
   280  log are irrelevant).
   281  
   282  The naive approach of tracking the maximum assigned lease applied index is
   283  problematic. To see this, consider the relevant example of a store that wants to
   284  close out a timestamp around five seconds in the past, but which has high write
   285  throughput on some range. Tracking the maximum proposed lease applied index
   286  until we close out the timestamp `now()-5s` means that a follower won't be able
   287  to serve reads until it has caught up on the last five seconds as well, even
   288  though they are likely not relevant to the reads it wants to serve. This
   289  motivates the precise form of the *MPT*, which has two adjacent "buckets" that
   290  it moves forward in time: one tracking proposals relevant to the next closed
   291  timestamp, and one with proposals relevant for the one after that.
   292  
   293  The MPT consists of the previously emitted closed timestamp (zero initially) and
   294  a prospective next closed timestamp aptly named `next` (always strictly larger
   295  than `closed`) at or below which new writes are not accepted. It also contains
   296  two ref counts and *MLAI* maps associated to below and above `next`,
   297  respectively.
   298  
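         Concretely, the tracker's state could be sketched as follows. Names and layout
         are illustrative; the two `bucket`s hold the ref counts and *MLAI* maps for
         below and at-or-above `next`, respectively:

         ```go
         // bucket holds a refcount of tracked (not yet proposed) commands together
         // with the maximum Lease Applied Index proposed so far, per range.
         type bucket struct {
             refCount int
             mlais    map[RangeID]LeaseAppliedIndex
         }

         // minProposalTracker sketches the MPT described above. Locking is omitted.
         type minProposalTracker struct {
             closed Timestamp // most recently emitted closed timestamp
             next   Timestamp // prospective next closed timestamp (always > closed)
             left   *bucket   // proposals that must be reflected before `next` closes
             right  *bucket   // proposals relevant only to the closed timestamp after that
         }

         func newMinProposalTracker(next Timestamp) *minProposalTracker {
             return &minProposalTracker{
                 next:  next,
                 left:  &bucket{mlais: map[RangeID]LeaseAppliedIndex{}},
                 right: &bucket{mlais: map[RangeID]LeaseAppliedIndex{}},
             }
         }
         ```
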
   299  Its API is roughly the following:
   300  
   301  ```go
   302  // t := NewTracker()
   303  
   304  // In Replica write path:
   305  waitCmdQueue(ba)
   306  applyTimestampCache(ba)
    307  ts, done := t.Track(ba.Timestamp)
   308  ba.ForwardTimestamp(ts)
   309  proposal := evaluate(ba)
   310  proposal.LeaseAppliedIndex = <X>
   311  done(proposal.LeaseAppliedIndex)
   312  propose(proposal)
   313  
   314  // In periodic store-level loop:
   315  closedTS, mlaiMap := t.CloseWithNext(clock.Now()-TargetDuration)
   316  sendUpdateToPeers(closedTS, mlaiMap)
   317  ```
   318  
   319  Note that by using this API for *any* proposal it is guaranteed that we produce
   320  all the updates promised to consumers of the CT updates. A few redundant pieces
    321  of information may be sent (e.g. for lease requests triggering on a follower
   322  range) but these are infrequent and cause no harm.
   323  
    324  In what follows we'll go through an example, which for simplicity assumes that
   325  all writes relate to the same range (thus reducing the *MLAI* maps to scalars).
   326  The state of the *MPT* is laid out as in the diagram below. You see a previously
   327  closed timestamp as well as a prospective next closed timestamp. There are three
   328  proposals tracked at timestamps strictly less than `next`, and one proposal at
   329  `next` or higher. Additionally, for proposals strictly less than `next`, the
   330  *MLAI* `8` was recorded while that for the other side is `17`.
   331  
   332  ```
   333     closed           next
   334        |            @8 | @17
   335        |            #3 | #1
   336        |               |
   337        v               v
   338  ---------------------------------------------------------> time
   339  ```
   340  
   341  Let's walk through an example of how the MPT works. For ease of illustration, we
   342  restrict to activity on a single replica (which avoids having a *map* of
   343  *MLAI*s; now it's just one). Initially, `closed` and `next` demarcate some time
   344  interval. Three commands arrive; `next`'s right side picks up a refcount of three
   345  (new commands are forced above `next`, though in this case they were there to begin
   346  with):
   347  
   348  ```
   349            closed    next    commands
   350               |     @0 | @0     /\   \_______
   351               |     #0 | #3    /  \          |
   352               v        v       v  v          v
   353  ------------------------------x--x----------x------------> time
   354  ```
   355  
   356  Next, it's time to construct a CT update. Since `next`'s left has a refcount of
   357  zero, we know that nothing is in progress for timestamps below `next`, which
   358  will now officially become a closed timestamp. To do so, `next` is returned to
   359  the client along with the *MLAI* for its left (there is none this time around).
   360  Additionally, the data structure is set up for the next iteration: `closed` is
   361  forwarded to `next`, and `next` forwarded to a suitable timestamp some constant
   362  target duration away from the current time. The commands previously tracked
   363  ahead of `next` are now on its left. Note that even though one of the commands
   364  has a timestamp ahead of `next`, it is now tracked to its left. This is fine; it
   365  just means that we're taking a command into account earlier than required for
   366  correctness.
   367  
   368  ```
   369                                           next
   370                                         @0 | @0
   371                      closed   commands  #3 | #0
   372                        |        /\   \_____|__
   373                        |       /  \        | |
   374                        v       v  v        v v
   375  ------------------------------x--x----------x------------> time
   376  ```
   377  
   378  Two of the commands get proposed (at *LAI*s, say, 10 and 11), decrementing
   379  the left refcount and adding an *MLAI* entry of 11 (the max of the two) to it.
   380  Additionally, two new commands arrive, this time at timestamps below `next`.
    381  These commands are forced above `next` first, so the refcount goes to the right.
   382  These new commands get proposed quickly (so they don't show
   383  up again) and the right refcount will drop back to zero (though it will retain the
   384  max *MLAI* seen, likely 13).
   385  
   386  ```
   387                              in-flight
   388                     closed    command     next
   389                        |         \       @11| @0
   390                        |          \      #1 | #2
   391                        v          v         v
   392  ---------------------------------x-----------------------> time
    393                                             ʌ
   394                                             |
   395              _______________________________/
   396             |   forwarding    |
   397             |                 |
   398         new command         new command
   399     (finishes quickly @13) (finishes quickly @12)
   400  ```
   401  
   402  The remaining command sticks around in the evaluation phase. This is
   403  unfortunate; it's time for another CT update, but we can't send a higher closed
    404  timestamp than before (and must stick to the same one with an empty *MLAI* map).
   405  
   406  ```
   407                    (blocked)             (blocked)
   408                              in-flight
   409                     closed    command     next
   410                        |         \       @11| @13
   411                        |          \      #1 | #0
   412                        v          v         v
   413  ---------------------------------x-----------------------> time
   414  ```
   415  
   416  Finally the command gets proposed at LAI 14. A new command comes in at some
   417  reasonable timestamp and the right side picks up a ref. Note the resulting
   418  odd-looking situation in which the left is @14 and the right @13 (this is fine;
   419  the client tracks the maximum seen):
   420  
   421  ```
   422                     closed                next     in-flight
   423                        |                 @14| @13  proposal
   424                        |                 #0 | #1     |
   425                        v                    v        v
   426  ----------------------------------------------------x----> time
   427  ```
   428  
   429  Time for the next CT update. We can finally close `next` (emitting @14) and move
   430  it to `now-target duration`, moving the right side refcount and *MLAI* to the
   431  left in the process.
   432  
   433  ```
   434                                           closed   in-flight  @13| @0
   435                                             |      proposal   #1 | #0
   436                                             |        |     _____/
   437                                             |        |    /
   438                                             v        v   v
   439  ----------------------------------------------------x----> time
   440  ```
   441  
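         The mechanics of the walkthrough can be summarized in a sketch of the two
         tracker operations, continuing the hypothetical `minProposalTracker` from
         before. Locking is again omitted, `lessEq` is the timestamp comparison from
         the earlier sketch, and (unlike the API block above) the proposal's range ID
         is passed to the returned closure along with the assigned index:

         ```go
         // track forwards a proposal's timestamp above `next` if necessary, adds a
         // ref to the right bucket, and returns a closure to be invoked once the
         // proposal has been assigned its Lease Applied Index.
         func (t *minProposalTracker) track(ts Timestamp) (Timestamp, func(RangeID, LeaseAppliedIndex)) {
             if lessEq(ts, t.next) {
                 // New writes at or below `next` are not accepted; forward them.
                 ts = Timestamp{WallTime: t.next.WallTime, Logical: t.next.Logical + 1}
             }
             b := t.right // the bucket may later rotate into the left slot; keep the pointer
             b.refCount++
             return ts, func(rangeID RangeID, lai LeaseAppliedIndex) {
                 b.refCount--
                 if lai > b.mlais[rangeID] {
                     b.mlais[rangeID] = lai
                 }
             }
         }

         // closeWithNext attempts to close out the current `next` (assumed to be less
         // than newNext). If proposals relevant to `next` are still pending, the
         // previously closed timestamp is re-emitted with an empty delta; otherwise
         // the buckets rotate and `next` becomes the new closed timestamp.
         func (t *minProposalTracker) closeWithNext(newNext Timestamp) (Timestamp, map[RangeID]LeaseAppliedIndex) {
             if t.left.refCount != 0 {
                 return t.closed, nil
             }
             closedOut, delta := t.next, t.left.mlais
             t.closed, t.next = t.next, newNext
             t.left, t.right = t.right, &bucket{mlais: map[RangeID]LeaseAppliedIndex{}}
             return closedOut, delta
         }
         ```
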
   442  ## Initial catch-up
   443  
   444  The main mechanism for propagating *MLAI*s is triggered by proposals. When an
   445  initial update is created, valid *MLAI*s have to be obtained for all ranges for
   446  which followers are supposed to be able to serve reads. This raises two practical
   447  questions: for which replicas should an *MLAI* be produced, and how to produce one.
   448  
   449  We create an *MLAI* for all ranges for which (at the time of checking) the
   450  current state indicates that the lease is held by the local store (this can have
   451  both false positives and false negatives but a missed follower read will trigger
    452  a proactive update for the range it occurred on).
   453  
   454  The initial catch-up is simple: before closing a timestamp (via the MPT), iterate
   455  through all ranges and (if they show the store as holding the lease) feed the
   456  MPT a proposal that lets it know the most recent *Lease Applied Index* on that
   457  replica:
   458  
   459  ```go
    460  _, done := t.Track(hlc.Timestamp{})
   461  repl.mu.Lock()
   462  lai := repl.mu.lastAssignedLeaseIndex
   463  repl.mu.Unlock()
   464  done(lai)
   465  ```
   466  
   467  This can race with other proposals, but the MPT will track the maximum seen.
   468  
   469  ## Timestamp forwarding and intents
   470  
   471  We forward commands' timestamps in order to guarantee that they don't
   472  produce visible data at timestamps below the CT. A case in which that
   473  is less obvious is that of an intent.
   474  
   475  To see this, consider that a transaction has two relevant timestamps:
   476  `OrigTimestamp` (also known as its read timestamp) and `Timestamp`
    477  (also known as its commit timestamp). While the timestamp we forward
    478  is `Timestamp`, the transaction internally will in fact attempt to
    479  write at `OrigTimestamp` (but relies on moving these intents to their
   480  actual timestamp later, when they are resolved). This prevents certain
   481  anomalies, particularly with `SNAPSHOT` isolation.
   482  
   483  Naively, this violates the guarantee: we promise that no more data will appear
   484  below a certain timestamp. Note however that this data isn't visible at
   485  timestamps below the commit timestamp (which was forwarded): to read the value,
   486  the intent has to be resolved first, which implies that it will move at least to
    487  `Timestamp` in the process, restoring the guarantee required.
   488  
   489  Similarly, this does not impede the usefulness of the CT mechanism for
   490  recovery: the restored consistent state may contain intents. But the
   491  restored consistent state also allows resolving all of the intents in
   492  the same way, since what matters is the transaction record. The result
   493  will be that the intents are simply dropped, unless there is a committed
   494  transaction record, in which case they will commit.
   495  
   496  Note that for the CDC use case, this closed timestamp mechanism is a necessary,
   497  but not sufficient, solution. In particular, a CDC consumer must find (or track)
   498  and resolve all intents at timestamps below a given closed timestamp first.
   499  
   500  ## Splits/Merges
   501  
   502  No action is necessary for splits: the leaseholders of the LHS and RHS are
   503  colocated and so share the same closed timestamp mechanisms. For convenience an
   504  update for the RHS is added to the next round of outgoing updates, otherwise
   505  follower reads for the RHS would cut out for a moment.
   506  
   507  Merges are more interesting since the leaseholders of the RHS and the LHS are
   508  not necessarily colocated. If the RHS's store has closed a higher timestamp, say
   509  1000, while the LHS's store is only at 500, after the merge commands might be
   510  accepted on the combined range under the closed timestamp 500 that violate the
   511  closed timestamp 1000. To counteract this, the `Subsume` operation
   512  returns the closed timestamp on the origin store and the merging replica
    513  takes it into account. Initially, the merge trigger will populate the
   514  timestamp cache for the right side of the merge; if this has too big an impact
   515  on the timestamp cache (especially as merges are rolled out, we might merge
   516  away large swaths of empty ranges), we can also store the timestamp on the
   517  replica and use it to forward proposals manually.
   518  
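         A sketch of the alternative mentioned at the end of the paragraph above:
         remembering the subsumed range's closed timestamp on the merged replica and
         forwarding proposals past it manually. The names are hypothetical; `lessEq`
         and `Timestamp` are from the earlier sketches:

         ```go
         // mergedRangeState sketches the manual-forwarding alternative: the closed
         // timestamp returned by Subsume for the RHS is remembered on the merged
         // replica, and proposals are bumped past it so that the promise made by the
         // RHS's former store is never violated.
         type mergedRangeState struct {
             subsumedClosedTS Timestamp
         }

         func (s *mergedRangeState) forwardProposalTimestamp(ts Timestamp) Timestamp {
             if lessEq(ts, s.subsumedClosedTS) {
                 return Timestamp{
                     WallTime: s.subsumedClosedTS.WallTime,
                     Logical:  s.subsumedClosedTS.Logical + 1,
                 }
             }
             return ts
         }
         ```
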
   519  ## Routing layer
   520  
   521  This RFC proposes a somewhat simplistic implementation at the routing layer: At
   522  `DistSender` and its DistSQL counterpart, if a read is for a timestamp earlier
   523  than the current time less a target duration (which adds comfortable padding to
   524  when followers are ideally able to serve these reads), it is sent to the nearest
   525  replica (as measured by health, latency, locality, and perhaps a jitter),
   526  instead of to the leaseholder.
   527  
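         The resulting routing predicate is small; a sketch follows. The function name
         and the fixed target duration constant are illustrative (five seconds is the
         example value used elsewhere in this RFC):

         ```go
         // closedTimestampTargetDurationNanos mirrors the target duration by which
         // closed timestamps trail real time.
         const closedTimestampTargetDurationNanos = int64(5e9)

         // canTryFollowerRead sketches the DistSender-side decision: only read-only
         // batches whose timestamp trails the current time by at least the target
         // duration are routed to the nearest replica instead of the leaseholder.
         func canTryFollowerRead(readOnly bool, batchTS, now Timestamp) bool {
             return readOnly &&
                 batchTS.WallTime < now.WallTime-closedTimestampTargetDurationNanos
         }
         ```
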
   528  When a read is handled by a replica not equipped to serve it via a regular or
   529  follower read, a `NotLeaseHolderError` is returned and future requests for that
   530  same (chunk of) batch will make no attempt to use follower reads; this avoids
   531  getting stuck in an endless loop when followers lag significantly. Similarly,
    532  follower reads are never attempted for ranges known not to use epoch-based
   533  leases.
   534  
   535  ## Further work
   536  
   537  While the design outlined so far should give a reasonably performant baseline,
   538  it has several shortcomings that will need to be addressed in follow-up work:
   539  
   540  ### Lagging followers
   541  
   542  Assume that timestamps are closed at a 5s target duration every second, and
   543  that the last proposal taken into account for each closed timestamp finishes
   544  evaluating just before the timestamp is closed out. In that case, the *MLAI*
   545  check on the followers is more likely to fail for a short moment until the Raft
   546  log has caught up with the very recent proposal; if the catch-up takes longer
   547  than the interval at which the timestamps are closed out, no follower read will
   548  ever be possible. A similar scenario applies to followers far removed from the
   549  usual commit quorum or lagging for any other reason. This should be fairly
   550  rare, but seems important enough to be tackled in follow-up work.
   551  
   552  The fundamental problem here is that older closed timestamps are discarded when
   553  a new one is received, resulting in the follower never catching up to the current
   554  closed timestamp. If it remembered the previous CT updates, it could at least
   555  serve reads for that timestamp. This calls for a mechanism that holds on to
   556  previous *CT*s and *MLAI*s so that reads further in the past can be served.
   557  This won't be implemented initially to keep the complexity in the first version
   558  to a minimum.
   559  
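         A sketch of such a history, purely for illustration (as just noted, nothing
         like this is planned for the first version):

         ```go
         // ctEntry pairs a closed timestamp with the MLAI the replica must have
         // applied before it may serve reads at that timestamp.
         type ctEntry struct {
             closedTS Timestamp
             mlai     LeaseAppliedIndex
         }

         // ctHistory keeps a bounded number of past entries for a range, oldest first.
         type ctHistory struct {
             entries []ctEntry
         }

         // bestServable returns the largest retained closed timestamp whose MLAI the
         // replica has already reached, if any.
         func (h *ctHistory) bestServable(localLAI LeaseAppliedIndex) (Timestamp, bool) {
             for i := len(h.entries) - 1; i >= 0; i-- {
                 if localLAI >= h.entries[i].mlai {
                     return h.entries[i].closedTS, true
                 }
             }
             return Timestamp{}, false
         }
         ```
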
   560  One way to address the problem is the following: On receipt of a CT update, copy
   561  the CT and MLAI into the range state if the Raft log has caught up to the MLAI
   562  (keeping the most recently overwritten value around to serve reads for). This
   563  means that the replica will always have a valid CT during normal operation,
   564  though one that lags the received updates (various variations on this theme
   565  exist). However, note the strong connection to the following section:
   566  
   567  ### Recovery from insufficient quorum
   568  
   569  As mentioned in the initial paragraphs, follower reads can help recover a
   570  recent consistent state of an unavailable cluster, by determining the maximum
   571  timestamp at which every range has a surviving replica that can serve a
   572  follower read (if all replicas of a range are lost, there is obviously no hope
   573  of consistent recovery).
   574  At this timestamp, a consistent read of the entire keyspace (excluding
   575  expiration-based ranges) can be carried out and used to construct a backup.
   576  Note that if expiration-based replicas persisted the last lease they held, the
    577  timestamp could be lowered to the minimum over all surviving expiration-based
   578  replicas' last leases, for a consistent (but less recent) read of the *whole*
   579  keyspace.
   580  
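         The computation just described amounts to a min-over-ranges of a
         max-over-surviving-replicas. A sketch, with `servableByRange` assumed to be
         assembled from the surviving replicas' closed timestamp state:

         ```go
         // recoveryTimestamp returns the highest timestamp at which every range has
         // at least one surviving replica able to serve a follower read: for each
         // range, take the best timestamp any surviving replica can serve, then take
         // the minimum across ranges. The boolean is false if some range has no
         // usable replica (or the input is empty), in which case no consistent
         // recovery is possible this way.
         func recoveryTimestamp(servableByRange map[RangeID][]Timestamp) (Timestamp, bool) {
             var result Timestamp
             first := true
             for _, tss := range servableByRange {
                 if len(tss) == 0 {
                     return Timestamp{}, false
                 }
                 best := tss[0]
                 for _, ts := range tss[1:] {
                     if lessEq(best, ts) {
                         best = ts
                     }
                 }
                 if first || lessEq(best, result) {
                     result = best
                     first = false
                 }
             }
             return result, !first
         }
         ```
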
   581  For maximum generality, it is desirable to in principle be able to recover
   582  without relying on in-memory state, so that a termination of the running
   583  process does not bar a subsequent recovery.
   584  
   585  Naively this can be achieved by persisting all received *CT* updates (with
   586  some eviction policy that rolls up old updates into a more recent initial
   587  state), though the eventual implementation may opt to persist at the Replica
   588  level instead (where updates caught up to can more easily be pruned).
   589  
   590  ### Range feeds
   591  
    592  [Range feeds][RangeFeed] are a range-level mechanism to stream updates to an upstream
   593  Change Data Capture processor. Range feeds will rely on closed timestamps and
   594  will want to relay them to an upstream consumer as soon as possible.  This
   595  suggests a reactive mechanism that notifies the replicas with an active Range
   596  feed on receipt of a CT update; given a registry of such replicas, this is easy
   597  to add.
   598  
    599  ### `AS OF SYSTEM TIME RECENT`
   600  
   601  With the advent of closed timestamps, we can also simplify `AS OF SYSTEM TIME`
    602  by allowing users to let the server choose a reasonable "recent" timestamp in
   603  the past for which reads can be distributed better. Note that, other than
   604  what was requested in [this issue][autoaost], there is no guarantee about
   605  blocking on conflicting writers. However, since a transaction that has
   606  `PENDING` status with a timestamp that has since been closed out is likely
   607  to have to restart (or ideally refresh) anyway, we could consider allowing it
   608  to be pushed.
   609  
   610  ## Rationale and Alternatives
   611  
   612  This design appears to be the sane solution given boundary conditions.
   613  
   614  ## Unresolved questions
   615  
   616  ### Configurability
   617  
   618  For now, the min proposal timestamp roughly trails real time by five seconds.
   619  This can be made configurable, for example via a cluster setting or, if more
   620  granularity is required, via zone configs (which in turn requires being able to
   621  retrieve the history of the settings value or a mechanism that smears out the
   622  change over some period of time, to avoid failed follower reads).
   623  
   624  Transactions which exceed the lag are usually forced to restart, though this
   625  will often happen through a refresh (which is comparatively cheap, though it
   626  needs to be tested).
   627  
   628  [RangeFeed]: https://github.com/cockroachdb/cockroach/pull/26782
   629  [autoaost]: https://github.com/cockroachdb/cockroach/issues/25405