github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20180603_follower_reads.md

     1  - Feature Name: Follower Reads
     2  - Status: accepted
     3  - Start Date: 2018-06-03
     4  - Authors: Spencer Kimball, Tobias Schottdorf
     5  - RFC PR: #21056
     6  - Cockroach Issue: #16593
     7  
     8  # Summary
     9  
    10  Follower reads are consistent reads at historical timestamps from follower
    11  replicas. They make the non-leader replicas in a range suitable sources for
     12  historical reads. Historical reads include both `AS OF SYSTEM TIME` queries
     13  and transactions with a read timestamp sufficiently in the past (for
    14  example long-running analytics queries).
    15  
    16  The key enabling technology is the exchange of **closed timestamp updates**
    17  between stores. A closed timestamp update (*CT update*) is a store-wide
    18  timestamp together with (sparse) per-range information on the Raft progress.
    19  Follower replicas use local state built up from successive received CT updates
    20  to ascertain that they have all the state necessary to serve consistent reads
    21  at and below the leaseholder store's closed timestamp.
    22  
    23  Follower reads are only possible for epoch-based leases, which includes all user
    24  ranges but excludes some system ranges (such as the addressing metadata ranges).
    25  In what follows all mentioned leases are epoch-based.
    26  
    27  # Motivation
    28  
    29  Consistent historical reads are useful for analytics queries and in particular
    30  allow such queries to be carried out more efficiently and, with appropriate
    31  configuration, away from foreground traffic. But historical reads are also key
    32  to a proposal for [reference-like
    33  tables](https://github.com/cockroachdb/cockroach/issues/26301) aimed at cutting
    34  down on foreign key check latencies particularly in geo-distributed clusters;
    35  they help recover a reasonably recent consistent snapshot of a cluster after a
    36  loss of quorum; and they are one of the ingredients for [Change Data
    37  Capture](https://github.com/cockroachdb/cockroach/pull/25229).
    38  
    39  # Guide-level explanation
    40  
    41  Fundamentally, the idea is that we already keep multiple consistent copies of
    42  all data via replication, and that we want to utilize all of the copies to
    43  serve reads. Morally speaking, a read which only cares to access data that was
    44  written at some timestamp well in the past *should* be servable from all
    45  replicas (assuming normal operation), as replication typically catches up all
    46  the followers quickly, and most writes happen at "newer" timestamps. Clearly
     47  neither of these two properties is guaranteed though, so replicas have to be
     48  provided with a way of deciding whether a given read request can be served
    49  consistently.
    50  
    51  The closed timestamp mechanism provides between each pair of stores a regular
    52  (on the order of seconds) exchange of information to that end. At a high level,
    53  these updates contain what one might intuitively expect:
    54  
    55  A follower trying to serve a read needs to know that a given timestamp is "safe"
    56  (in parlance of this RFC, "closed") to serve reads for; there must not be some
    57  in-flight or future write that would invalidate a follower read retroactively.
    58  Each store maintains a data structure, the **min proposal tracker** (*MPT*)
    59  described later, to establish this timestamp.
    60  
    61  Similarly, if a range's leaseholder commits a write into its Raft log at index
    62  `P` before announcing a *closed timestamp*, then the follower must wait until it
    63  has caught up to that index `P` before serving reads at the closed timestamp. To
    64  provide this information, each store also includes with each closed timestamp an
    65  updated minimum log index that the follower must reach before "activating" the
    66  associated closed timestamp on that replica.
    67  
    68  Providing the information only when there has been write activity on a given
    69  range since the last closed timestamp is key to performance, as a store can
    70  house upwards of 50000 replicas, and including information about every single
    71  one of them in each update is prohibitive due to the overhead of visiting them.
    72  
    73  This is similar to *range quiescence*, which avoids Raft heartbeats between
    74  inactive ranges. It's worth pointing out that quiescent ranges are able to serve
    75  follower reads, and that there is no architectural connection between follower
    76  reads in quiescence, though a range that is quiescent is typically one that
    77  requires no per-range CT update.
    78  
    79  As we've seen above, this RFC deals in "log positions" (and closed timestamps).
    80  For technical reasons, the "log position" is not the Raft log position but the
    81  **Lease Applied Index**, a concept introduced by us on top of Raft to handle
    82  Raft-internal reproposals and reorderings. Ultimately, what we're after is a
    83  promise of the form
    84  
     85  > no more proposals writing to timestamps less than or equal to `T` are going
    86  to apply after log index `I`.
    87  
    88  This guarantee is tricky to extract from the Raft log index since proposing a
    89  command at log index `I` does not restrict it from showing up at higher log
    90  indices later, especially in leader-not-leaseholder situations. The *Lease
    91  Applied Index* was introduced precisely to have better control, and allows us to
    92  make the above promise.
    93  
    94  # Reference-level explanation
    95  
    96  This section will focus on the technical details of the closed timestamp
    97  mechanism, with an emphasis on correctness.
    98  
    99  A closed timestamp update contains the following information (sent by an origin `Store`):
   100  
   101  - the **liveness epoch** (of the origin `Store`)
   102  - a **closed timestamp** (`hlc.Timestamp`, typically trails "real time" by at least a constant target duration)
   103  - a **sequence number** (to allow discarding built-up state on missed updates)
   104  - a map from `RangeID` to **minimum lease applied index** (*MLAI*) that specifies
   105    the updates to the recipient's map accumulated from all previous updates.
   106  
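         For concreteness, such an update could be sketched as the following Go struct.
         This is purely illustrative: the type and field names below are made up for
         this sketch, and the real payload would be a protobuf message.

         ```go
         // Illustrative sketch of a CT update; all names here are hypothetical.
         type RangeID int64
         type LeaseAppliedIndex uint64

         // Timestamp stands in for hlc.Timestamp in these sketches.
         type Timestamp struct {
             WallTime int64
             Logical  int32
         }

         // CTUpdate is the per-peer payload periodically sent by an origin store.
         type CTUpdate struct {
             Epoch    int64                         // liveness epoch of the origin store
             ClosedTS Timestamp                     // the newly closed timestamp
             SeqNum   uint64                        // gap detection: missed updates reset state
             MLAIs    map[RangeID]LeaseAppliedIndex // per-range deltas since the last update
         }
         ```
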
   107  The accumulated per-range state together with the closed timestamp serve as a
   108  guarantee of the form
   109  
   110  > Every Raft command proposed after the min lease applied index (MLAI)
   111  will be ahead of the closed timestamp (CT).
   112  
   113  Each store starts out with an empty state for each peer store and epoch, and
   114  merges the *MLAI* updates into the state (overwriting existing *MLAI*s).
   115  Whenever the sequence number received in an update from a peer store displays a
   116  gap, the state for that peer store is reset, and the current update merged into
   117  the empty state: this means that all information regarding ranges not explicitly
   118  mentioned in the current update is lost. Similarly, if the epoch changes, the
   119  state for any prior epoch is discarded and the update applied to an empty state
   120  for the new epoch.
   121  
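         A minimal sketch of this recipient-side bookkeeping, reusing the hypothetical
         `CTUpdate` type from the previous sketch (locking omitted):

         ```go
         // peerState accumulates what a store has learned from one origin store.
         type peerState struct {
             epoch    int64
             seq      uint64
             closedTS Timestamp
             mlais    map[RangeID]LeaseAppliedIndex
         }

         // apply merges a CT update into the accumulated state. On an epoch change or
         // a sequence number gap, the previously accumulated per-range state is
         // discarded and the update is merged into an empty state, as described above.
         func (s *peerState) apply(u CTUpdate) {
             if u.Epoch != s.epoch || u.SeqNum != s.seq+1 {
                 s.epoch = u.Epoch
                 s.mlais = map[RangeID]LeaseAppliedIndex{}
             }
             s.seq = u.SeqNum
             s.closedTS = u.ClosedTS
             for rangeID, mlai := range u.MLAIs {
                 s.mlais[rangeID] = mlai // newer entries overwrite older ones
             }
         }
         ```
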
   122  At a high level, the design splits into three parts:
   123  
   124  1. How are the outgoing updates assembled? This will mainly live in the Replica write
   125  path: whenever something is proposed to Raft, action needs to be taken to
   126  reflect this proposal in the next CT update.
   127  2. How are the received updates used and which reads can be served? This lives
   128  mostly in the read path.
   129  3. How are reads routed to eligible follower replicas? This lives both in
   130  `DistSender` and the DistSQL physical planner.
   131  
   132  We will talk about how they are used first, as that is the most natural
   133  starting point for correctness.
   134  
   135  To serve a read request at timestamp `T` via follower reads, a replica
   136  
   137  1. looks up the lease, noting the store (and thus node) and epoch it belongs to.
   138  1. looks up the CT state known for this node and epoch.
   139  1. checks whether the read timestamp `T` is less than or equal to the closed timestamp.
   140  1. checks whether its *Lease Applied Index* matches or exceeds the *MLAI* for the range (in the absence of an *MLAI*, this check fails by default).
   141  
    142  If the checks succeed, the follower serves the read (no update to the timestamp
    143  cache is necessary). If they don't, a `NotLeaseHolderError` is returned.
   144  
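         In code, these checks might look roughly as follows. This is a sketch reusing
         the hypothetical types from the earlier sketches, not the actual
         implementation:

         ```go
         // canServeFollowerRead sketches the follower-side decision: the read
         // timestamp must be closed out by the leaseholder's store for the lease's
         // epoch, and the local replica must have caught up to the MLAI announced
         // for the range.
         func canServeFollowerRead(
             readTS Timestamp,
             leaseEpoch int64,
             state *peerState, // accumulated CT state for the leaseholder's store
             rangeID RangeID,
             localLAI LeaseAppliedIndex,
         ) bool {
             if state == nil || state.epoch != leaseEpoch {
                 return false // no (or stale) closed timestamp information
             }
             if !lessEq(readTS, state.closedTS) {
                 return false // the read timestamp has not been closed out yet
             }
             mlai, ok := state.mlais[rangeID]
             if !ok {
                 return false // no MLAI known: fail and prompt a re-send (see below)
             }
             return localLAI >= mlai // must have applied everything up to the MLAI
         }

         func lessEq(a, b Timestamp) bool {
             return a.WallTime < b.WallTime ||
                 (a.WallTime == b.WallTime && a.Logical <= b.Logical)
         }
         ```
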
   145  Note that if the read fails because no *MLAI* is known for that range, there
   146  needs to be some proactive action to prompt re-sending of the *MLAI*. This is
   147  because without write activity on the range (which is not necessarily going to
   148  happen any time soon) the origin store will not send an update. Strategies to
   149  mitigate this are discussed in a dedicated section below.
   150  
   151  ## Implied guarantees
   152  
   153  Implicitly, a received update represents the following essential promises:
   154  
    155  - the origin node was, at some point in time, live for the given epoch and closed
   156    timestamp. Concretely, this means that the origin node had a liveness update (for
   157    the epoch) with the closed timestamp falling *before* the stasis period.
   158  
   159    This guarantees that no other node could forcibly take over the lease at a
   160    timestamp less than or equal to the closed timestamp, and consequently for any
   161    lease (as seen on a follower) for that origin store and epoch the origin store
   162    knows about all relevant Raft proposals that need to be applied before serving
   163    follower reads.
   164  
   165    In other words, **the ranges map in the update is authoritative** as long as:
   166  - the *MLAI map* contains an update for any range for which a command has been
   167    proposed since the last update.
   168  
   169    This guarantee is hopefully not a surprise, but implicit in this is the
   170    requirement that any relevant write actually increments the lease applied
   171    index. Luckily, all commands do, except for lease requests (not transfers --
   172    see below for those), which don't mutate user-visible state.
   173  - the origin store won't (ever) initiate a lease transfer that would allow
   174    another node to write at or below the closed timestamp. In other words, in the
   175    case of a lease transfer the next lease would start at a timestamp greater than
   176    the closed timestamp. This is likely impossible in practice since the transfer
   177    timestamps and proposed closed timestamps are taken from the same hybrid logical
   178    clock, but an explicit safeguard will be added just in case.
   179  
   180    If this rule were broken, another lease holder could propose commands that
   181    violate the closed timestamp sent by the original node (and a lagging follower
   182    would continue seeing the old lease and convince itself that it was fine to
   183    serve reads).
   184  
   185    Lease transfers also require an update in the *MLAI map*; they need to
   186    essentially force the follower to see the new lease before they serve further
   187    follower reads (at which point they will turn to the new leaseholder's store
   188    for guidance). Nothing special is required to get this behavior; a lease
   189    transfer requires a valid *Lease Applied Index*, so the same mechanism that
   190    forces followers to catch up on the Raft log for writes also makes them
    191    observe the new lease. This requires that we wait until we have reached the
    192    MLAI for a closed timestamp before deciding which node's state to query.
   193  
   194    Note that a node restart implies a change in the liveness epoch, which in
   195    turn invalidates all of the information sent before the restart.
   196  
   197  ## Recovering from missed updates
   198  
   199  To regain a fully populated *MLAI* map when first receiving updates (or after
   200  resetting the state for a peer node), there are two strategies:
   201  
   202  1. special case sequence number zero so that it includes an *MLAI* for all
   203     ranges for which the lease is held. When an update is missed, the recipient
    204     notifies the sender, which resets its sequence number to zero (thus sending
   205     a full update next).
   206  2. ask for updates on individual ranges whenever a follower read request fails
   207     because of a missing *MLAI*.
   208  
   209  We opt to implement both strategies, with the first doing the bulk of the work.
   210  The first strategy is worthwhile because
   211  
   212  1. the payload is essentially two varints for each range, amounting to no more than
    213     20 bytes on the wire, adding up to a 1MB payload at 50000 leaseholder replicas
   214     (but likely much less in practice).
    215     Even with 10x as many, a rare enough 10MB payload seems unproblematic,
   216     especially since it can be streamed.
   217  2. without an eager catch-up, followers will have to warm up "on demand" but the
   218     routing layer has no insight into this process and will blindly route reads
   219     to followers, which makes for a poor experience after a node restart.
   220  
   221  But this strategy can miss necessary updates as leases get transferred to
   222  otherwise inactive ranges. To guard against these rare cases, the second
   223  strategy serves as a fallback: recipients of updates can specify ranges they
   224  would like to receive an MLAI for in the next update. They do this when they
   225  observe a range state that suggests that an update has been missed, in
   226  particular when a replica has no known MLAI stored for the (non-recent) lease.
   227  
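         A sketch of the bookkeeping for this second strategy; the type and method
         names are made up, and a real implementation would need synchronization:

         ```go
         // pendingMLAIRequests remembers ranges for which a follower read failed due
         // to a missing MLAI, so that they can be asked for explicitly in the next
         // exchange with the origin store.
         type pendingMLAIRequests map[RangeID]struct{}

         func (p pendingMLAIRequests) note(id RangeID) {
             p[id] = struct{}{}
         }

         // drain returns the ranges to request and resets the set.
         func (p pendingMLAIRequests) drain() []RangeID {
             ids := make([]RangeID, 0, len(p))
             for id := range p {
                 ids = append(ids, id)
                 delete(p, id)
             }
             return ids
         }
         ```
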
   228  ## Constructing outgoing updates
   229  
    230  To get in the right mindset for this, consider the simplified situation of a
   231  `Store` without any pending or (near) future write activity, that is, there are
   232  (and will be) no in-flight Raft proposals. Now, we want to send an initial CT
   233  update to another store. This means two things:
   234  
   235  1. the need to "close" a timestamp, i.e. preventing any future write activity visible
   236     at this timestamp, for any write proposed by this store as a leaseholder (for the
   237     current epoch).
   238  2. Tracking an *MLAI* for each replica (for which the lease for the epoch is held).
   239  
   240  The first requirement is roughly equivalent to bumping the low water mark of
   241  the timestamp cache to one logical tick above the desired closed timestamp
   242  (though doing that in practice would perform poorly).
   243  
   244  The second one is also straightforward: simply read the *Lease Applied Index* for
   245  each replica; since nothing is in-flight, that's all the followers need to know
   246  about.
   247  
   248  In reality, there will sometimes be ongoing writes on a replica for which we want
   249  to obtain an *MLAI*, and so 1) and 2) get more complicated.
   250  
   251  Instead of adjusting the timestamp cache, we introduce a dedicated data
   252  structure, the **minimum proposal tracker** (*MPT*), which tracks (at coarse
   253  granularity) the timestamps for which proposals are still ongoing. In
   254  particular, it can decide when it is safe to close out a higher timestamp than
   255  before. This replaces 1), but retrieving an *MLAI* is also less straightforward
   256  than before.
   257  
   258  Assume the replica shows a *Lease Applied Index* of 12, but three proposals are
   259  in-flight whereas another two have acquired latches and are still evaluating.
   260  Presumably the in-flight proposals were assigned to *Lease Applied Indexes* 13
    261  through 15, and the ones being evaluated will receive 16 and 17 (depending on
   262  the order in which they enter Raft). This is where the *MPT*'s second function
   263  comes in: it tracks writes until they are assigned a (provisional) *Lease
   264  Applied Index*, and makes sure that an authoritative *MLAI* delta is returned
   265  with each closed timestamp. This delta is *authoritative* in the sense that it
   266  will reflect the largest **proposed** *MLAI* seen relevant to the newly closed
   267  timestamp (relative to the previous one).
   268  
   269  Consequently when we say that a proposal is tracked, we're talking about the
   270  interval between determining the request timestamp (which is after acquiring
   271  latches) and determining the proposal's *Lease Applied Index*.
   272  
   273  It's natural to ask whether there may be "false positives", i.e. whether a
   274  command proposed for some *Lease Applied Index* may never actually return from
   275  Raft with a corresponding update to the range state. The response is that this
    276  isn't possible: a command proposed to Raft is either retried until it's clear that
   277  the desired *Lease Applied Index* has already been surpassed (in which case
   278  there is no problem) or the leaseholder process exits (in which case there will
   279  be a new leaseholder and previous in-flight commands that never made it into the
   280  log are irrelevant).
   281  
   282  The naive approach of tracking the maximum assigned lease applied index is
   283  problematic. To see this, consider the relevant example of a store that wants to
   284  close out a timestamp around five seconds in the past, but which has high write
   285  throughput on some range. Tracking the maximum proposed lease applied index
   286  until we close out the timestamp `now()-5s` means that a follower won't be able
   287  to serve reads until it has caught up on the last five seconds as well, even
   288  though they are likely not relevant to the reads it wants to serve. This
   289  motivates the precise form of the *MPT*, which has two adjacent "buckets" that
   290  it moves forward in time: one tracking proposals relevant to the next closed
   291  timestamp, and one with proposals relevant for the one after that.
   292  
   293  The MPT consists of the previously emitted closed timestamp (zero initially) and
   294  a prospective next closed timestamp aptly named `next` (always strictly larger
   295  than `closed`) at or below which new writes are not accepted. It also contains
   296  two ref counts and *MLAI* maps associated to below and above `next`,
   297  respectively.
   298  
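         Concretely, the tracker's state could be sketched as follows. Names and layout
         are illustrative; the two `bucket`s hold the ref counts and *MLAI* maps for
         below and at-or-above `next`, respectively:

         ```go
         // bucket holds a refcount of tracked (not yet proposed) commands together
         // with the maximum Lease Applied Index proposed so far, per range.
         type bucket struct {
             refCount int
             mlais    map[RangeID]LeaseAppliedIndex
         }

         // minProposalTracker sketches the MPT described above. Locking is omitted.
         type minProposalTracker struct {
             closed Timestamp // most recently emitted closed timestamp
             next   Timestamp // prospective next closed timestamp (always > closed)
             left   *bucket   // proposals that must be reflected before `next` closes
             right  *bucket   // proposals relevant only to the closed timestamp after that
         }

         func newMinProposalTracker(next Timestamp) *minProposalTracker {
             return &minProposalTracker{
                 next:  next,
                 left:  &bucket{mlais: map[RangeID]LeaseAppliedIndex{}},
                 right: &bucket{mlais: map[RangeID]LeaseAppliedIndex{}},
             }
         }
         ```
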
   299  Its API is roughly the following:
   300  
   301  ```go
   302  // t := NewTracker()
   303  
   304  // In Replica write path:
   305  waitCmdQueue(ba)
   306  applyTimestampCache(ba)
    307  ts, done := t.Track(ba.Timestamp)
   308  ba.ForwardTimestamp(ts)
   309  proposal := evaluate(ba)
   310  proposal.LeaseAppliedIndex = <X>
   311  done(proposal.LeaseAppliedIndex)
   312  propose(proposal)
   313  
   314  // In periodic store-level loop:
   315  closedTS, mlaiMap := t.CloseWithNext(clock.Now()-TargetDuration)
   316  sendUpdateToPeers(closedTS, mlaiMap)
   317  ```
   318  
   319  Note that by using this API for *any* proposal it is guaranteed that we produce
   320  all the updates promised to consumers of the CT updates. A few redundant pieces
    321  of information may be sent (e.g. for lease requests triggering on a follower
   322  range) but these are infrequent and cause no harm.
   323  
    324  In what follows we'll go through an example, which for simplicity assumes that
   325  all writes relate to the same range (thus reducing the *MLAI* maps to scalars).
   326  The state of the *MPT* is laid out as in the diagram below. You see a previously
   327  closed timestamp as well as a prospective next closed timestamp. There are three
   328  proposals tracked at timestamps strictly less than `next`, and one proposal at
   329  `next` or higher. Additionally, for proposals strictly less than `next`, the
   330  *MLAI* `8` was recorded while that for the other side is `17`.
   331  
   332  ```
   333     closed           next
   334        |            @8 | @17
   335        |            #3 | #1
   336        |               |
   337        v               v
   338  ---------------------------------------------------------> time
   339  ```
   340  
   341  Let's walk through an example of how the MPT works. For ease of illustration, we
   342  restrict to activity on a single replica (which avoids having a *map* of
   343  *MLAI*s; now it's just one). Initially, `closed` and `next` demarcate some time
   344  interval. Three commands arrive; `next`'s right side picks up a refcount of three
   345  (new commands are forced above `next`, though in this case they were there to begin
   346  with):
   347  
   348  ```
   349            closed    next    commands
   350               |     @0 | @0     /\   \_______
   351               |     #0 | #3    /  \          |
   352               v        v       v  v          v
   353  ------------------------------x--x----------x------------> time
   354  ```
   355  
   356  Next, it's time to construct a CT update. Since `next`'s left has a refcount of
   357  zero, we know that nothing is in progress for timestamps below `next`, which
   358  will now officially become a closed timestamp. To do so, `next` is returned to
   359  the client along with the *MLAI* for its left (there is none this time around).
   360  Additionally, the data structure is set up for the next iteration: `closed` is
   361  forwarded to `next`, and `next` forwarded to a suitable timestamp some constant
   362  target duration away from the current time. The commands previously tracked
   363  ahead of `next` are now on its left. Note that even though one of the commands
   364  has a timestamp ahead of `next`, it is now tracked to its left. This is fine; it
   365  just means that we're taking a command into account earlier than required for
   366  correctness.
   367  
   368  ```
   369                                           next
   370                                         @0 | @0
   371                      closed   commands  #3 | #0
   372                        |        /\   \_____|__
   373                        |       /  \        | |
   374                        v       v  v        v v
   375  ------------------------------x--x----------x------------> time
   376  ```
   377  
   378  Two of the commands get proposed (at *LAI*s, say, 10 and 11), decrementing
   379  the left refcount and adding an *MLAI* entry of 11 (the max of the two) to it.
   380  Additionally, two new commands arrive, this time at timestamps below `next`.
    381  These commands are forced above `next` first, so the refcount goes to the right.
   382  These new commands get proposed quickly (so they don't show
   383  up again) and the right refcount will drop back to zero (though it will retain the
   384  max *MLAI* seen, likely 13).
   385  
   386  ```
   387                              in-flight
   388                     closed    command     next
   389                        |         \       @11| @0
   390                        |          \      #1 | #2
   391                        v          v         v
   392  ---------------------------------x-----------------------> time
    393                                             ʌ
   394                                             |
   395              _______________________________/
   396             |   forwarding    |
   397             |                 |
   398         new command         new command
   399     (finishes quickly @13) (finishes quickly @12)
   400  ```
   401  
   402  The remaining command sticks around in the evaluation phase. This is
   403  unfortunate; it's time for another CT update, but we can't send a higher closed
    404  timestamp than before (and must stick to the same one with an empty *MLAI* map).
   405  
   406  ```
   407                    (blocked)             (blocked)
   408                              in-flight
   409                     closed    command     next
   410                        |         \       @11| @13
   411                        |          \      #1 | #0
   412                        v          v         v
   413  ---------------------------------x-----------------------> time
   414  ```
   415  
   416  Finally the command gets proposed at LAI 14. A new command comes in at some
   417  reasonable timestamp and the right side picks up a ref. Note the resulting
   418  odd-looking situation in which the left is @14 and the right @13 (this is fine;
   419  the client tracks the maximum seen):
   420  
   421  ```
   422                     closed                next     in-flight
   423                        |                 @14| @13  proposal
   424                        |                 #0 | #1     |
   425                        v                    v        v
   426  ----------------------------------------------------x----> time
   427  ```
   428  
   429  Time for the next CT update. We can finally close `next` (emitting @14) and move
   430  it to `now-target duration`, moving the right side refcount and *MLAI* to the
   431  left in the process.
   432  
   433  ```
   434                                           closed   in-flight  @13| @0
   435                                             |      proposal   #1 | #0
   436                                             |        |     _____/
   437                                             |        |    /
   438                                             v        v   v
   439  ----------------------------------------------------x----> time
   440  ```
   441  
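         The mechanics of the walkthrough can be summarized in a sketch of the two
         tracker operations, continuing the hypothetical `minProposalTracker` from
         before. Locking is again omitted, `lessEq` is the timestamp comparison from
         the earlier sketch, and (unlike the API block above) the proposal's range ID
         is passed to the returned closure along with the assigned index:

         ```go
         // track forwards a proposal's timestamp above `next` if necessary, adds a
         // ref to the right bucket, and returns a closure to be invoked once the
         // proposal has been assigned its Lease Applied Index.
         func (t *minProposalTracker) track(ts Timestamp) (Timestamp, func(RangeID, LeaseAppliedIndex)) {
             if lessEq(ts, t.next) {
                 // New writes at or below `next` are not accepted; forward them.
                 ts = Timestamp{WallTime: t.next.WallTime, Logical: t.next.Logical + 1}
             }
             b := t.right // the bucket may later rotate into the left slot; keep the pointer
             b.refCount++
             return ts, func(rangeID RangeID, lai LeaseAppliedIndex) {
                 b.refCount--
                 if lai > b.mlais[rangeID] {
                     b.mlais[rangeID] = lai
                 }
             }
         }

         // closeWithNext attempts to close out the current `next` (assumed to be less
         // than newNext). If proposals relevant to `next` are still pending, the
         // previously closed timestamp is re-emitted with an empty delta; otherwise
         // the buckets rotate and `next` becomes the new closed timestamp.
         func (t *minProposalTracker) closeWithNext(newNext Timestamp) (Timestamp, map[RangeID]LeaseAppliedIndex) {
             if t.left.refCount != 0 {
                 return t.closed, nil
             }
             closedOut, delta := t.next, t.left.mlais
             t.closed, t.next = t.next, newNext
             t.left, t.right = t.right, &bucket{mlais: map[RangeID]LeaseAppliedIndex{}}
             return closedOut, delta
         }
         ```
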
   442  ## Initial catch-up
   443  
   444  The main mechanism for propagating *MLAI*s is triggered by proposals. When an
   445  initial update is created, valid *MLAI*s have to be obtained for all ranges for
   446  which followers are supposed to be able to serve reads. This raises two practical
   447  questions: for which replicas should an *MLAI* be produced, and how to produce one.
   448  
   449  We create an *MLAI* for all ranges for which (at the time of checking) the
   450  current state indicates that the lease is held by the local store (this can have
   451  both false positives and false negatives but a missed follower read will trigger
    452  a proactive update for the range it occurred on).
   453  
   454  The initial catch-up is simple: before closing a timestamp (via the MPT), iterate
   455  through all ranges and (if they show the store as holding the lease) feed the
   456  MPT a proposal that lets it know the most recent *Lease Applied Index* on that
   457  replica:
   458  
   459  ```go
    460  _, done := t.Track(hlc.Timestamp{})
   461  repl.mu.Lock()
   462  lai := repl.mu.lastAssignedLeaseIndex
   463  repl.mu.Unlock()
   464  done(lai)
   465  ```
   466  
   467  This can race with other proposals, but the MPT will track the maximum seen.
   468  
   469  ## Timestamp forwarding and intents
   470  
   471  We forward commands' timestamps in order to guarantee that they don't
   472  produce visible data at timestamps below the CT. A case in which that
   473  is less obvious is that of an intent.
   474  
   475  To see this, consider that a transaction has two relevant timestamps:
   476  `OrigTimestamp` (also known as its read timestamp) and `Timestamp`
    477  (also known as its commit timestamp). While the timestamp we forward
    478  is `Timestamp`, the transaction internally will in fact attempt to
    479  write at `OrigTimestamp` (but relies on moving these intents to their
   480  actual timestamp later, when they are resolved). This prevents certain
   481  anomalies, particularly with `SNAPSHOT` isolation.
   482  
   483  Naively, this violates the guarantee: we promise that no more data will appear
   484  below a certain timestamp. Note however that this data isn't visible at
   485  timestamps below the commit timestamp (which was forwarded): to read the value,
   486  the intent has to be resolved first, which implies that it will move at least to
    487  `Timestamp` in the process, restoring the guarantee required.
   488  
   489  Similarly, this does not impede the usefulness of the CT mechanism for
   490  recovery: the restored consistent state may contain intents. But the
   491  restored consistent state also allows resolving all of the intents in
   492  the same way, since what matters is the transaction record. The result
   493  will be that the intents are simply dropped, unless there is a committed
   494  transaction record, in which case they will commit.
   495  
   496  Note that for the CDC use case, this closed timestamp mechanism is a necessary,
   497  but not sufficient, solution. In particular, a CDC consumer must find (or track)
   498  and resolve all intents at timestamps below a given closed timestamp first.
   499  
   500  ## Splits/Merges
   501  
   502  No action is necessary for splits: the leaseholders of the LHS and RHS are
   503  colocated and so share the same closed timestamp mechanisms. For convenience an
   504  update for the RHS is added to the next round of outgoing updates, otherwise
   505  follower reads for the RHS would cut out for a moment.
   506  
   507  Merges are more interesting since the leaseholders of the RHS and the LHS are
   508  not necessarily colocated. If the RHS's store has closed a higher timestamp, say
   509  1000, while the LHS's store is only at 500, after the merge commands might be
   510  accepted on the combined range under the closed timestamp 500 that violate the
   511  closed timestamp 1000. To counteract this, the `Subsume` operation
   512  returns the closed timestamp on the origin store and the merging replica
    513  takes it into account. Initially, the merge trigger will populate the
   514  timestamp cache for the right side of the merge; if this has too big an impact
   515  on the timestamp cache (especially as merges are rolled out, we might merge
   516  away large swaths of empty ranges), we can also store the timestamp on the
   517  replica and use it to forward proposals manually.
   518  
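         A sketch of the alternative mentioned at the end of the paragraph above:
         remembering the subsumed range's closed timestamp on the merged replica and
         forwarding proposals past it manually. The names are hypothetical; `lessEq`
         and `Timestamp` are from the earlier sketches:

         ```go
         // mergedRangeState sketches the manual-forwarding alternative: the closed
         // timestamp returned by Subsume for the RHS is remembered on the merged
         // replica, and proposals are bumped past it so that the promise made by the
         // RHS's former store is never violated.
         type mergedRangeState struct {
             subsumedClosedTS Timestamp
         }

         func (s *mergedRangeState) forwardProposalTimestamp(ts Timestamp) Timestamp {
             if lessEq(ts, s.subsumedClosedTS) {
                 return Timestamp{
                     WallTime: s.subsumedClosedTS.WallTime,
                     Logical:  s.subsumedClosedTS.Logical + 1,
                 }
             }
             return ts
         }
         ```
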
   519  ## Routing layer
   520  
   521  This RFC proposes a somewhat simplistic implementation at the routing layer: At
   522  `DistSender` and its DistSQL counterpart, if a read is for a timestamp earlier
   523  than the current time less a target duration (which adds comfortable padding to
   524  when followers are ideally able to serve these reads), it is sent to the nearest
   525  replica (as measured by health, latency, locality, and perhaps a jitter),
   526  instead of to the leaseholder.
   527  
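         The resulting routing predicate is small; a sketch follows. The function name
         and the fixed target duration constant are illustrative (five seconds is the
         example value used elsewhere in this RFC):

         ```go
         // closedTimestampTargetDurationNanos mirrors the target duration by which
         // closed timestamps trail real time.
         const closedTimestampTargetDurationNanos = int64(5e9)

         // canTryFollowerRead sketches the DistSender-side decision: only read-only
         // batches whose timestamp trails the current time by at least the target
         // duration are routed to the nearest replica instead of the leaseholder.
         func canTryFollowerRead(readOnly bool, batchTS, now Timestamp) bool {
             return readOnly &&
                 batchTS.WallTime < now.WallTime-closedTimestampTargetDurationNanos
         }
         ```
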
   528  When a read is handled by a replica not equipped to serve it via a regular or
   529  follower read, a `NotLeaseHolderError` is returned and future requests for that
   530  same (chunk of) batch will make no attempt to use follower reads; this avoids
   531  getting stuck in an endless loop when followers lag significantly. Similarly,
    532  follower reads are never attempted for ranges known not to use epoch-based
   533  leases.
   534  
   535  ## Further work
   536  
   537  While the design outlined so far should give a reasonably performant baseline,
   538  it has several shortcomings that will need to be addressed in follow-up work:
   539  
   540  ### Lagging followers
   541  
   542  Assume that timestamps are closed at a 5s target duration every second, and
   543  that the last proposal taken into account for each closed timestamp finishes
   544  evaluating just before the timestamp is closed out. In that case, the *MLAI*
   545  check on the followers is more likely to fail for a short moment until the Raft
   546  log has caught up with the very recent proposal; if the catch-up takes longer
   547  than the interval at which the timestamps are closed out, no follower read will
   548  ever be possible. A similar scenario applies to followers far removed from the
   549  usual commit quorum or lagging for any other reason. This should be fairly
   550  rare, but seems important enough to be tackled in follow-up work.
   551  
   552  The fundamental problem here is that older closed timestamps are discarded when
   553  a new one is received, resulting in the follower never catching up to the current
   554  closed timestamp. If it remembered the previous CT updates, it could at least
   555  serve reads for that timestamp. This calls for a mechanism that holds on to
   556  previous *CT*s and *MLAI*s so that reads further in the past can be served.
   557  This won't be implemented initially to keep the complexity in the first version
   558  to a minimum.
   559  
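         A sketch of such a history, purely for illustration (as just noted, nothing
         like this is planned for the first version):

         ```go
         // ctEntry pairs a closed timestamp with the MLAI the replica must have
         // applied before it may serve reads at that timestamp.
         type ctEntry struct {
             closedTS Timestamp
             mlai     LeaseAppliedIndex
         }

         // ctHistory keeps a bounded number of past entries for a range, oldest first.
         type ctHistory struct {
             entries []ctEntry
         }

         // bestServable returns the largest retained closed timestamp whose MLAI the
         // replica has already reached, if any.
         func (h *ctHistory) bestServable(localLAI LeaseAppliedIndex) (Timestamp, bool) {
             for i := len(h.entries) - 1; i >= 0; i-- {
                 if localLAI >= h.entries[i].mlai {
                     return h.entries[i].closedTS, true
                 }
             }
             return Timestamp{}, false
         }
         ```
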
   560  One way to address the problem is the following: On receipt of a CT update, copy
   561  the CT and MLAI into the range state if the Raft log has caught up to the MLAI
   562  (keeping the most recently overwritten value around to serve reads for). This
   563  means that the replica will always have a valid CT during normal operation,
   564  though one that lags the received updates (various variations on this theme
   565  exist). However, note the strong connection to the following section:
   566  
   567  ### Recovery from insufficient quorum
   568  
   569  As mentioned in the initial paragraphs, follower reads can help recover a
   570  recent consistent state of an unavailable cluster, by determining the maximum
   571  timestamp at which every range has a surviving replica that can serve a
   572  follower read (if all replicas of a range are lost, there is obviously no hope
   573  of consistent recovery).
   574  At this timestamp, a consistent read of the entire keyspace (excluding
   575  expiration-based ranges) can be carried out and used to construct a backup.
   576  Note that if expiration-based replicas persisted the last lease they held, the
    577  timestamp could be lowered to the minimum over all surviving expiration-based
   578  replicas' last leases, for a consistent (but less recent) read of the *whole*
   579  keyspace.
   580  
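         The computation just described amounts to a min-over-ranges of a
         max-over-surviving-replicas. A sketch, with `servableByRange` assumed to be
         assembled from the surviving replicas' closed timestamp state:

         ```go
         // recoveryTimestamp returns the highest timestamp at which every range has
         // at least one surviving replica able to serve a follower read: for each
         // range, take the best timestamp any surviving replica can serve, then take
         // the minimum across ranges. The boolean is false if some range has no
         // usable replica (or the input is empty), in which case no consistent
         // recovery is possible this way.
         func recoveryTimestamp(servableByRange map[RangeID][]Timestamp) (Timestamp, bool) {
             var result Timestamp
             first := true
             for _, tss := range servableByRange {
                 if len(tss) == 0 {
                     return Timestamp{}, false
                 }
                 best := tss[0]
                 for _, ts := range tss[1:] {
                     if lessEq(best, ts) {
                         best = ts
                     }
                 }
                 if first || lessEq(best, result) {
                     result = best
                     first = false
                 }
             }
             return result, !first
         }
         ```
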
   581  For maximum generality, it is desirable to in principle be able to recover
   582  without relying on in-memory state, so that a termination of the running
   583  process does not bar a subsequent recovery.
   584  
   585  Naively this can be achieved by persisting all received *CT* updates (with
   586  some eviction policy that rolls up old updates into a more recent initial
   587  state), though the eventual implementation may opt to persist at the Replica
   588  level instead (where updates caught up to can more easily be pruned).
   589  
   590  ### Range feeds
   591  
    592  [Range feeds][RangeFeed] are a range-level mechanism to stream updates to an upstream
   593  Change Data Capture processor. Range feeds will rely on closed timestamps and
   594  will want to relay them to an upstream consumer as soon as possible.  This
   595  suggests a reactive mechanism that notifies the replicas with an active Range
   596  feed on receipt of a CT update; given a registry of such replicas, this is easy
   597  to add.
   598  
    599  ### `AS OF SYSTEM TIME RECENT`
   600  
   601  With the advent of closed timestamps, we can also simplify `AS OF SYSTEM TIME`
    602  by allowing users to let the server choose a reasonable "recent" timestamp in
   603  the past for which reads can be distributed better. Note that, other than
   604  what was requested in [this issue][autoaost], there is no guarantee about
   605  blocking on conflicting writers. However, since a transaction that has
   606  `PENDING` status with a timestamp that has since been closed out is likely
   607  to have to restart (or ideally refresh) anyway, we could consider allowing it
   608  to be pushed.
   609  
   610  ## Rationale and Alternatives
   611  
   612  This design appears to be the sane solution given boundary conditions.
   613  
   614  ## Unresolved questions
   615  
   616  ### Configurability
   617  
   618  For now, the min proposal timestamp roughly trails real time by five seconds.
   619  This can be made configurable, for example via a cluster setting or, if more
   620  granularity is required, via zone configs (which in turn requires being able to
   621  retrieve the history of the settings value or a mechanism that smears out the
   622  change over some period of time, to avoid failed follower reads).
   623  
   624  Transactions which exceed the lag are usually forced to restart, though this
   625  will often happen through a refresh (which is comparatively cheap, though it
   626  needs to be tested).
   627  
   628  [RangeFeed]: https://github.com/cockroachdb/cockroach/pull/26782
   629  [autoaost]: https://github.com/cockroachdb/cockroach/issues/25405