- Feature Name: Follower Reads Adoption
- Implementation Status: draft
- Start Date: 2018-12-27
- Authors: Andrew Werner
- RFC PR: #33474
- Cockroach Issue: #16593

# Summary

Follower reads are consistent reads at historical timestamps served from follower replicas. They make the non-leaseholder replicas in a range suitable sources for historical reads. Historical reads include both `AS OF SYSTEM TIME` queries and transactions with a read timestamp sufficiently far in the past (for example, long-running analytics queries). Most of the machinery required to safely perform these reads was implemented in the [Follower Reads RFC](../20180603_follower_reads.md). Follower reads can greatly improve query performance by avoiding wide-area RPCs and by reducing traffic on leaseholders. This document proposes mechanisms to expose follower reads: a new SQL function to determine a reasonable read timestamp, the accompanying change to make that function legal for use with `AS OF SYSTEM TIME` queries, and tweaks to make the SQL physical planner and the kv DistSender aware that follower reads are possible. Lastly, this proposal suggests extending the `BEGIN TRANSACTION` and `SET TRANSACTION` statements to allow read-only transactions to be performed at a single transaction timestamp that can read from followers. Given the intention to make follower reads an enterprise feature, some of the complexity in this proposal stems from the need to inject behavior from CCL code.

# Motivation

Given that cockroachdb stores multiple replicas of data, a client might expect that reads can be served from any of those replicas. In order to provide its high level of consistency and isolation, cockroachdb currently requires that all reads for a range go to the current leaseholder. For reads against data written sufficiently far in the past, consistency and isolation should not, in principle, be a concern, as no concurrent writes should be possible. Many queries do not require a completely up-to-date view of the database, such as analytical workloads for report generation. Enabling reads to be performed within the same data center can greatly increase throughput for large reads and greatly reduce latency for small ones. Increasing performance and lowering the cost of running large analytical queries is valuable, especially for geo-distributed deployments. Providing a convenient mechanism to request local rather than leaseholder historical reads will be a compelling enterprise feature.

# Guide-level explanation

The Follower Reads RFC lays out the mechanisms behind closed timestamps and presents the rules the store follows to determine whether a replica can serve a read. This document deals with how clients can make use of this behavior offered by the storage layer. The work described in the Follower Reads RFC provides, through its `Closed Timestamp` mechanism, the means by which a replica determines whether a query timestamp is adequately old to allow a follower read; that work has already enabled replicas to serve read requests at past timestamps.
In order to expose this functionality to clients, all that needs to be done is to convince the SQL physical planner to direct historical reads to local nodes and to coax the `DistSender` to send requests to followers when appropriate.

The `Closed Timestamp` is tracked on a per-range basis and attempts to lag behind "real" time by a target duration controlled by the cluster setting `kv.closed_timestamp.target_duration`. As of this writing the value defaults to 30 seconds but could likely be lowered to 5-10 seconds (below some threshold it may begin to interfere with ongoing transactions). The closed timestamp subsystem seeks to update the closed timestamp at an interval defined as a fraction of the `target_duration`, termed the `close_fraction`. For example, if the `target_duration` is 30s and the `close_fraction` is 0.2, then the subsystem will try to keep the closed timestamp 30s behind real time and will try to update the value every 6s. This proposal seeks to achieve follower reads through a stateless approximation of when a follower read is possible: a read may be directed to a follower if its timestamp lags the present by some multiple of the target duration, controlled by a hidden cluster setting which we'll refer to as the `target_multiple`. While this may occasionally lead to a failed attempt to read from a follower, it yields a simple implementation with a fixed tradeoff between staleness (the amount of time behind the "present" at which `follower_read_timestamp()` reads occur) and the risk of needing to perform a leaseholder read (which will happen seamlessly due to a NotLeaseHolderError). The `target_multiple` thus defines the tradeoff between staleness and the likelihood of a follower read failing.
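
To make this arithmetic concrete, the illustrative snippet below (not part of the proposed implementation) computes the closed timestamp update interval and the effective staleness of `follower_read_timestamp()` from the example settings above, using the initial `target_multiple` of 3 proposed later in this document:

```go
package main

import (
    "fmt"
    "time"
)

func main() {
    const (
        targetDuration = 30 * time.Second // kv.closed_timestamp.target_duration
        closeFraction  = 0.2              // fraction of target_duration between updates
        targetMultiple = 3.0              // hidden setting proposed below
    )
    // How often the subsystem attempts to publish a new closed timestamp.
    updateInterval := time.Duration(float64(targetDuration) * closeFraction)
    // How far behind "real" time follower_read_timestamp() reads occur.
    staleness := time.Duration(float64(targetDuration) * (1 + closeFraction*targetMultiple))
    fmt.Println(updateInterval, staleness) // prints "6s 48s"
}
```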

In order to ease the burden on the client of determining an adequately old timestamp for use with an `AS OF SYSTEM TIME` query, this RFC introduces a new SQL function, `follower_read_timestamp()`, which is effectively syntactic shorthand for the computation over the above-mentioned cluster settings, and extends `AS OF SYSTEM TIME` to allow a non-constant expression. After this change, and with `kv.closed_timestamp.follower_reads_enabled` enabled, clients can trivially encourage their `SELECT` statements to be directed to physically close replicas. For example, assuming the table kv.kv exists, the query below would perform a read against the nearest replica:

```sql
SELECT * FROM kv.kv AS OF SYSTEM TIME follower_read_timestamp() ORDER BY k LIMIT 10;
```

The physical planning of SQL evaluation currently tries to schedule DistSQL processors on the nodes which are currently the leaseholders for the ranges of interest. This allocation is performed via the `distsqlplan.SpanResolver`, which internally uses a `leaseHolderOracle` that provides a ReplicaDesc for a given RangeDesc according to a policy. This RFC refactors the oracle logic into its own package and provides (via injection) a new follower-read-aware policy.

The `kv.DistSender` currently attempts to send all writes, and all reads at consistency levels other than INCONSISTENT, to the current leaseholder for a range, falling back to replica proximity. This RFC adds an injectable `CanUseFollowerReads` function, defaulting to returning `false`, which the DistSender consults when determining whether to locate the current leaseholder.

While these changes enable and ease performing individual `SELECT` queries against follower replicas, they do not enable running an entire (read-only) transaction at a single point in time and thus benefitting from the performance gains offered by follower reads. This document proposes an extension to the `SET TRANSACTION` and `BEGIN TRANSACTION` statements to allow a clause analogous to the `AS OF SYSTEM TIME` clause available to `SELECT` statements today. This change will ease multi-statement historical reads, potentially enabling use of existing code which relies on a transaction object, and will provide a mechanism to run historical reads with a HIGH transaction priority, eliminating the possibility of blocking on a long-running read-write transaction. `SET TRANSACTION` must be the first statement following `BEGIN`. Note that `SET TRANSACTION AS OF SYSTEM TIME` implies `READ ONLY`. A historical read-only transaction thus will look something like the following:

```sql
BEGIN;
SET TRANSACTION AS OF SYSTEM TIME follower_read_timestamp(), PRIORITY HIGH;
SELECT * FROM kv.kv;
COMMIT;
```

or alternatively:

```sql
BEGIN TRANSACTION AS OF SYSTEM TIME follower_read_timestamp(), PRIORITY HIGH;
SELECT * FROM kv.kv;
COMMIT;
```

# Reference-level explanation

This section focuses on the specific details of plumbing the functionality required to expose follower reads through the codebase. Because follower reads will be implemented as an enterprise feature, the core enabling logic will live in a CCL-licensed package, `pkg/ccl/followerreadsccl`. This package will then inject the needed abstractions to provide the following six changes:

1. Define the `target_multiple` internal setting.
2. Add the `follower_read_timestamp()` SQL builtin.
3. Extend SQL evaluation to support `follower_read_timestamp()` with `AS OF SYSTEM TIME`.
4. Abstract the replica selection mechanism for SQL physical planning.
5. Modify DistSender logic to determine when it may safely send a read to a follower.
6. Extend the `SET TRANSACTION` and `BEGIN TRANSACTION` statements to enable setting the TxnTime.

## Detailed Design

### The `target_multiple` Cluster setting

The new cluster setting will be defined inside of `followerreadsccl` and thus will only exist in CCL builds. The setting will be a float value greater than or equal to one which is combined with `kv.closed_timestamp.target_duration` to determine the time at which `follower_read_timestamp()` should evaluate. A function in followerreadsccl like the one below will compute the offset:

```go
// recentDuration returns the duration to be used as the offset to create a
// follower_read_timestamp(). The same value, plus a unit of clock uncertainty,
// should then be used to determine if a query can use follower reads.
func recentDuration(st *cluster.Settings) time.Duration {
    targetMultiple := TargetMultiple.Get(&st.SV)
    targetDuration := closedts.TargetDuration.Get(&st.SV)
    closeFraction := closedts.CloseFraction.Get(&st.SV)
    return -1 * time.Duration(float64(targetDuration)*
        (1+closeFraction*targetMultiple))
}
```
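
As a sketch of how this offset might be consumed (the helper names and the `maxClockOffset` parameter below are assumptions for illustration, not actual `followerreadsccl` API), the negative duration is added to the current time to produce the builtin's result, and the same value, widened by clock uncertainty, bounds which query timestamps qualify for follower reads:

```go
// followerReadTimestamp is a hypothetical helper showing how the builtin's
// result could be derived: "now" shifted into the past by recentDuration.
func followerReadTimestamp(now time.Time, st *cluster.Settings) time.Time {
    return now.Add(recentDuration(st))
}

// canUseFollowerRead is a hypothetical companion check: a query timestamp
// qualifies only if it lies at least recentDuration plus a clock-uncertainty
// allowance in the past.
func canUseFollowerRead(
    st *cluster.Settings, maxClockOffset time.Duration, now, queryTS time.Time,
) bool {
    return queryTS.Before(now.Add(recentDuration(st) - maxClockOffset))
}
```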

The setting represents the tradeoff between the staleness of `follower_read_timestamp()` queries and the chance that such queries fail to be performed on a follower. The initial choice of value is `3`, which is likely rather conservative. Given that the current target duration for closed timestamps is 30s and the close fraction is 0.2, queries performed with `follower_read_timestamp()` should lag "real" time by roughly 48s. Lowering the target duration to 10s would lead to a 16s real-time delay. Note that while this setting does control a tradeoff, it is deeply tied to implementation details and will be hidden from customers.

### SQL `follower_read_timestamp()` builtin

A new SQL builtin `follower_read_timestamp()` is added which calls through to an injected function that determines a timestamp at which a follower read can likely be performed. This builtin will only be functional with the CCL version of CRDB and an enterprise license.

### `follower_read_timestamp()` expressions in `AS OF SYSTEM TIME` clauses

Prior to this change, SQL evaluation enforced that the expression passed to `AS OF SYSTEM TIME` clauses be constant. This requirement prevented the use of the above-mentioned function. Allowing some evaluation of non-constant expressions enables useful functionality such as taking the maximum of two timestamps, allowing enforcement of a maximum staleness.

Allowing `AS OF SYSTEM TIME` clauses to accept arbitrary expressions is problematic, as general SQL scalar evaluation is not available at the time the AS OF clause expression is evaluated. The intention is to permit only the `follower_read_timestamp()` function and to reject all other non-constant expressions.

### Abstract replica selection for SQL physical planning

The physical planning of SQL query evaluation attempts to place evaluation near the leaseholders of the ranges involved when they are known, falling back to a policy which seeks to pack requests onto nearby nodes. This logic is encapsulated in an interface called a `leaseHolderOracle` (henceforth Oracle) which is constructed based on a policy. Today's policy is called the `binPackingLeaseHolderChoice`. All of this logic currently resides in the `sql/distsqlplan` package and is used by the `SpanResolver`. A span resolver uses a `*client.Txn` to create a `SpanResolverIterator` which iterates through ranges and provides replica selection.

This proposal moves the Oracle logic into a new package, `sql/distsqlplan/replicaoracle`, which extends the current abstraction for selecting a replica given a policy so that it can additionally take the current transaction into account. The package will also provide a mechanism to register new policies, which, as we'll see, followerreadsccl will exploit. In addition to today's existing binPacking and random policies, the new package will include a policy which selects the closest replica.

Prior to this change the policy is used to statically construct an Oracle which is used throughout the life of the SpanResolver. An Oracle provides a single method:

```go
// ChoosePreferredReplica returns a choice for one range. Implementors are free
// to use the queryState param, which has info about the number of ranges
// already handled by each node for the current SQL query. The state is not
// updated with the result of this method; the caller is in charge of that.
//
// A RangeUnavailableError can be returned if there's no information in gossip
// about any of the nodes that might be tried.
ChoosePreferredReplica(
    context.Context, roachpb.RangeDescriptor, OracleQueryState,
) (kv.ReplicaInfo, error)
```
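
As an illustration of the closest-replica policy mentioned above, here is a minimal sketch written against simplified stand-in types (a node ID plus an observed latency) rather than the real `kv.ReplicaInfo` and the gossip-backed proximity information the actual oracle would consult:

```go
// replicaChoice is a simplified stand-in for kv.ReplicaInfo carrying only the
// proximity signal this sketch needs.
type replicaChoice struct {
    nodeID  int
    latency time.Duration
}

// chooseClosestReplica sketches the closest policy: ignore leaseholder
// information entirely and pick the replica with the lowest observed latency.
func chooseClosestReplica(replicas []replicaChoice) (replicaChoice, error) {
    if len(replicas) == 0 {
        return replicaChoice{}, errors.New("no replicas to choose from")
    }
    best := replicas[0]
    for _, r := range replicas[1:] {
        if r.latency < best.latency {
            best = r
        }
    }
    return best, nil
}
```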

The change will add a layer of indirection: rather than holding an Oracle, the SpanResolver will hold an OracleFactory with the following interface:

```go
// OracleFactory creates an oracle for a Txn.
type OracleFactory interface {
    // Oracle provides an Oracle to select an appropriate replica for a range.
    Oracle(*client.Txn) Oracle
}
```

For the existing policies, OracleFactory can be implemented by the same concrete structs which implement today's Oracles by merely returning themselves in calls to `Oracle()`. This mechanism allows different policies to be used for different Txns, namely the closest policy for historical queries and the binPacking policy for all others. This `FollowerReadsAwarePolicy` will check whether the OrigTimestamp of a Txn is before now less `recentDuration` plus a clock uncertainty duration. The `followerreadsccl` code will then register this new policy and assign it to the global var `distsqlplan.ReplicaOraclePolicy`.

### Expose DistSender For Determining Follower Read Safety

The last hurdle to exposing follower reads is that the `kv.DistSender` attempts to send batch requests to current leaseholders, which may prevent reads from going to nearby follower replicas. In order to inform the DistSender that it can send a batch request to a follower we add a new global var in the kv package:

```go
var CanSendToFollower = func(
    clusterID uuid.UUID, _ *cluster.Settings, _ *roachpb.BatchRequest,
) bool {
    return false
}
```

This function is consulted by the DistSender when it decides whether to look up a cached leaseholder in `DistSender.sendSingleRange`. The followerreadsccl package can then inject a new implementation which ensures that the batch request is read-only and then verifies that it meets the criteria for a follower read. Note that a batch request is read-only when all of its contained requests are neither writes nor admin requests.

This simplistic policy has potentially negative implications for export requests. Right now export relies on balanced range leadership to spread load through the cluster. Export requests are also commonly run at historical timestamps. If we were to blindly allow all sufficiently old, read-only batches to be sent to the nearest replica, export load might all flood a single node in a multi-node cluster. To deal with this we'll explicitly exclude batches which contain export requests from consideration for follower reads.
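
Putting the batch-level checks together, the sketch below captures the intended filter in simplified form; the `request` type and its flags are stand-ins for the per-request properties the real implementation would inspect on a `roachpb.BatchRequest` (the timestamp condition is handled separately, as sketched earlier):

```go
// request is a stand-in for the flags the injected implementation would
// inspect on each request contained in a batch.
type request struct {
    isWrite, isAdmin, isExport bool
}

// batchEligibleForFollowerRead reports whether a batch may be considered for
// follower reads: every request must be neither a write nor an admin request,
// and batches containing export requests are excluded outright.
func batchEligibleForFollowerRead(reqs []request) bool {
    for _, r := range reqs {
        if r.isWrite || r.isAdmin || r.isExport {
            return false
        }
    }
    return len(reqs) > 0
}
```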

### Extending `SET TRANSACTION` to support `AS OF SYSTEM TIME`

It is easy to imagine cases where a client would like to run multiple reads at the same historical timestamp. Today this would require adding an `AS OF` clause to each `SELECT`; this is cumbersome even when the exact timestamp is known, and is made worse when the timestamp which the client would like to use is generated by the server. In order to reuse the same timestamp the client would need to jump through the hoop of first generating the timestamp and then plumbing it into each individual statement. To simplify this workflow we'll expose a mechanism to run an entire transaction at a single timestamp by extending the `SET TRANSACTION` statement to support the `AS OF SYSTEM TIME` clause.

While it might on some level seem reasonable to allow arbitrary read-write queries to be performed at historical timestamps, due to the mechanics of closed timestamps such write operations could never successfully commit. Because the MinProposalTracker would effectively block write operations, we'll enforce that `SET TRANSACTION AS OF SYSTEM TIME` implies `SET TRANSACTION READ ONLY`.

Allowing the transaction timestamp to be set relates to another problem: the priority of an implicit transaction cannot be set. The problem arises when reasoning about follower reads which encounter unresolved intents that might affect the value they read. As always, a read must resolve any unresolved intent which might affect it, and it may need to wait if that intent belongs to a transaction which is still running and is not at a lower priority than the read. If a client runs a `follower_read_timestamp()` read-only transaction as proposed above, then that client can additionally set the transaction priority to HIGH, which ensures that the read never waits on pending long-running update transactions.

## Drawbacks

There are very few drawbacks to implementing the high-level idea. Most of the groundwork has already been laid. Any obvious downsides come from the impurity of the injection required to realize the functionality as an enterprise feature.

## Rationale and Alternatives

### Stateful closed timestamp tracking

One potential downside of this approach is that, in an edge case, it may have a detrimental effect on cluster performance in the face of bursty traffic and a large volume of follower reads. Imagine a situation where a large percentage of client traffic is due to follower reads and the cluster is heavily loaded such that all transactions are performing acceptably, but if the workload were shifted entirely such that all requests were forced to go to leaseholders, the cluster would not be capable of acceptably serving the traffic. If a burst of load or some other cluster event were then to lead one or more replicas to fall behind in their ability to publish closed timestamps, the leaseholders would begin receiving all of the load that had been going to followers. It is possible that this concern is not realistic in most common cases, and it seems straightforward to mitigate by increasing the target multiple. The problem becomes worse as the replication factor increases beyond 3 to numbers like 7 or 9. Furthermore, even if the increased load during this bursty period does not meaningfully affect OLTP traffic, it may lead to massively increased latency for queries which in the previous regime had been fast.

A more sophisticated mechanism which statefully tracks closed timestamps on a per-range basis on all nodes would allow `follower_read_timestamp()` to always evaluate to a timestamp which is known to be closed. Such an approach may, in the common case, be less pessimistic than this proposal's target_multiple and, unlike the optimistic approach, would be sure to always safely perform follower reads.
That being said, a stateful approach would require nodes to track closed timestamps for all replicas at planning time and may additionally require new mechanisms to mark requests as known to be safe for follower reads. Furthermore, the state tracking may be prohibitively expensive on large clusters.

### Node-local Dynamic Adjustment of `targetMultiple`

Another, less invasive, alternative might be to dynamically adjust the target multiple by detecting NotLeaseHolderErrors for queries which were expected to hit followers. This could mitigate the flood of consistent reads in the face of lagging closed timestamps, but would make the semantics of `follower_read_timestamp()` harder to understand and would require increased participation from the DistSender to provide the feedback.