- Feature Name: Follower Reads Adoption
- Implementation Status: draft
- Start Date: 2018-12-27
- Authors: Andrew Werner
- RFC PR: #33474
- Cockroach Issue: #16593

# Summary

Follower reads are consistent reads at historical timestamps served from follower replicas. They make the non-leaseholder replicas in a range suitable sources for historical reads. Historical reads include both `AS OF SYSTEM TIME` queries and transactions with a read timestamp sufficiently far in the past (for example, long-running analytics queries). Most of the machinery required to safely perform these reads was implemented in the [Follower Reads RFC](../20180603_follower_reads.md). Follower reads can greatly improve query performance by avoiding wide-area RPCs and by reducing traffic on leaseholders. This document proposes mechanisms to expose follower reads: a new SQL function to determine a reasonable read timestamp, the accompanying change to make that function legal for use with `AS OF SYSTEM TIME` queries, and tweaks to make the SQL physical planner and the kv DistSender aware that follower reads are possible. Lastly, this proposal suggests extending the `BEGIN TRANSACTION` and `SET TRANSACTION` statements to allow read-only transactions to be performed at a single transaction timestamp that can read from followers. Given the intention to make follower reads an enterprise feature, some of the complexity in this proposal stems from the need to inject behavior from CCL code.

# Motivation

Given that cockroachdb stores multiple replicas of data, a client might expect that reads can be served from any of those replicas. In order to provide its high level of consistency and isolation, cockroachdb currently requires that all reads for a range go to the current leaseholder. For reads against data written sufficiently far in the past, consistency and isolation should not, in principle, be a concern, as no concurrent writes should be possible. Many queries do not require a completely up-to-date view of the database, such as analytical workloads for report generation. Enabling reads to be performed within the same data center can greatly increase throughput for large reads and greatly reduce latency for small ones. Increasing performance and lowering the cost of running large analytical queries is valuable, especially for geo-distributed deployments. Providing a convenient mechanism to request local rather than leaseholder historical reads will be a compelling enterprise feature.

# Guide-level explanation

The Follower Reads RFC lays out the mechanisms behind closed timestamps and presents the rules the store follows to determine whether a replica can serve a read. This document deals with how clients can make use of this behavior offered by the storage layer. The work described in the Follower Reads RFC provides, through its `Closed Timestamp` mechanism, the means by which a replica determines whether a query timestamp is adequately old to allow a follower read; that work has already enabled replicas to serve read requests at past timestamps.
In order to expose this functionality to clients, all that needs to be done is to convince the SQL physical planner to direct historical reads to local nodes and to coax the `DistSender` to send requests to followers when appropriate.

The `Closed Timestamp` is tracked on a per-range basis and attempts to lag behind "real" time by a target duration controlled by the cluster setting `kv.closed_timestamp.target_duration`. As of this writing the value defaults to 30 seconds but could likely be lowered to 5-10 seconds (below some threshold it may begin to interfere with ongoing transactions). The closed timestamp subsystem seeks to update the closed timestamp at an interval defined as a fraction of the `target_duration`, termed the `close_fraction`. For example, if the `target_duration` is 30s and the `close_fraction` is 0.2, then the subsystem will try to keep the closed timestamp 30s behind real time and will try to update the value every 6s. This proposal seeks to achieve follower reads through a stateless approximation of when a follower read is possible: a read may be directed to a follower if its timestamp lags the present by some multiple of the target duration, controlled by a hidden cluster setting which we'll refer to as the `target_multiple`. While this may occasionally lead to a failed attempt to read from a follower, it yields a simple implementation with a fixed tradeoff between staleness (the amount of time behind the "present" at which `follower_read_timestamp()` reads occur) and the risk of needing to perform a leaseholder read (which will happen seamlessly due to a NotLeaseHolderError). The `target_multiple` thus defines the tradeoff between staleness and the likelihood of a follower read failing.
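
To make this arithmetic concrete, the illustrative snippet below (not part of the proposed implementation) computes the closed timestamp update interval and the effective staleness of `follower_read_timestamp()` from the example settings above, using the initial `target_multiple` of 3 proposed later in this document:

```go
package main

import (
    "fmt"
    "time"
)

func main() {
    const (
        targetDuration = 30 * time.Second // kv.closed_timestamp.target_duration
        closeFraction  = 0.2              // fraction of target_duration between updates
        targetMultiple = 3.0              // hidden setting proposed below
    )
    // How often the subsystem attempts to publish a new closed timestamp.
    updateInterval := time.Duration(float64(targetDuration) * closeFraction)
    // How far behind "real" time follower_read_timestamp() reads occur.
    staleness := time.Duration(float64(targetDuration) * (1 + closeFraction*targetMultiple))
    fmt.Println(updateInterval, staleness) // prints "6s 48s"
}
```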

In order to ease the burden on the client of determining an adequately old timestamp for use with an `AS OF SYSTEM TIME` query, this RFC introduces a new SQL function, `follower_read_timestamp()`, which is effectively syntactic shorthand for the computation over the above-mentioned cluster settings, and extends `AS OF SYSTEM TIME` to allow a non-constant expression. After this change, and with `kv.closed_timestamp.follower_reads_enabled` enabled, clients can trivially encourage their `SELECT` statements to be directed to physically close replicas. For example, assuming the table kv.kv exists, the query below would perform a read against the nearest replica:

```sql
SELECT * FROM kv.kv AS OF SYSTEM TIME follower_read_timestamp() ORDER BY k LIMIT 10;
```

The physical planning of SQL evaluation currently tries to schedule DistSQL processors on the nodes which are currently the leaseholders for the ranges of interest. This allocation is performed via the `distsqlplan.SpanResolver`, which internally uses a `leaseHolderOracle` that provides a ReplicaDesc for a given RangeDesc according to a policy. This RFC refactors the oracle logic into its own package and provides (via injection) a new follower-read-aware policy.

The `kv.DistSender` currently attempts to send all writes, and all reads at consistency levels other than INCONSISTENT, to the current leaseholder for a range, falling back to replica proximity. This RFC adds an injectable `CanUseFollowerReads` function, defaulting to returning `false`, which the DistSender consults when determining whether to locate the current leaseholder.

While these changes enable and ease performing individual `SELECT` queries against follower replicas, they do not enable running an entire (read-only) transaction at a single point in time and thus benefitting from the performance gains offered by follower reads. This document proposes an extension to the `SET TRANSACTION` and `BEGIN TRANSACTION` statements to allow a clause analogous to the `AS OF SYSTEM TIME` clause available to `SELECT` statements today. This change will ease multi-statement historical reads, potentially enabling use of existing code which relies on a transaction object, and will provide a mechanism to run historical reads with a HIGH transaction priority, eliminating the possibility of blocking on a long-running read-write transaction. `SET TRANSACTION` must be the first statement following `BEGIN`. Note that `SET TRANSACTION AS OF SYSTEM TIME` implies `READ ONLY`. A historical read-only transaction thus will look something like the following:

```sql
BEGIN;
SET TRANSACTION AS OF SYSTEM TIME follower_read_timestamp(), PRIORITY HIGH;
SELECT * FROM kv.kv;
COMMIT;
```

or alternatively:

```sql
BEGIN TRANSACTION AS OF SYSTEM TIME follower_read_timestamp(), PRIORITY HIGH;
SELECT * FROM kv.kv;
COMMIT;
```

# Reference-level explanation

This section focuses on the specific details of plumbing the functionality required to expose follower reads through the codebase. Because follower reads will be implemented as an enterprise feature, the core enabling logic will live in a CCL-licensed package, `pkg/ccl/followerreadsccl`. This package will then inject the needed abstractions to provide the following six changes:

1. Define the `target_multiple` internal setting.
2. Add the `follower_read_timestamp()` SQL builtin.
3. Extend SQL evaluation to support `follower_read_timestamp()` with `AS OF SYSTEM TIME`.
4. Abstract the replica selection mechanism for SQL physical planning.
5. Modify DistSender logic to determine when it may safely send a read to a follower.
6. Extend the `SET TRANSACTION` and `BEGIN TRANSACTION` statements to enable setting the TxnTime.

## Detailed Design

### The `target_multiple` Cluster setting

The new cluster setting will be defined inside of `followerreadsccl` and thus will only exist in CCL builds. The setting will be a float value greater than or equal to one which is combined with `kv.closed_timestamp.target_duration` to determine the time at which `follower_read_timestamp()` should evaluate. A function in followerreadsccl like the one below will compute the offset:

```go
// recentDuration returns the duration to be used as the offset to create a
// follower_read_timestamp(). The same value, plus a unit of clock uncertainty,
// should then be used to determine if a query can use follower reads.
func recentDuration(st *cluster.Settings) time.Duration {
    targetMultiple := TargetMultiple.Get(&st.SV)
    targetDuration := closedts.TargetDuration.Get(&st.SV)
    closeFraction := closedts.CloseFraction.Get(&st.SV)
    return -1 * time.Duration(float64(targetDuration)*
        (1+closeFraction*targetMultiple))
}
```
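
As a sketch of how this offset might be consumed (the helper names and the `maxClockOffset` parameter below are assumptions for illustration, not actual `followerreadsccl` API), the negative duration is added to the current time to produce the builtin's result, and the same value, widened by clock uncertainty, bounds which query timestamps qualify for follower reads:

```go
// followerReadTimestamp is a hypothetical helper showing how the builtin's
// result could be derived: "now" shifted into the past by recentDuration.
func followerReadTimestamp(now time.Time, st *cluster.Settings) time.Time {
    return now.Add(recentDuration(st))
}

// canUseFollowerRead is a hypothetical companion check: a query timestamp
// qualifies only if it lies at least recentDuration plus a clock-uncertainty
// allowance in the past.
func canUseFollowerRead(
    st *cluster.Settings, maxClockOffset time.Duration, now, queryTS time.Time,
) bool {
    return queryTS.Before(now.Add(recentDuration(st) - maxClockOffset))
}
```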

The setting represents the tradeoff between the staleness of `follower_read_timestamp()` queries and the chance that such queries fail to be performed on a follower. The initial choice of value is `3`, which is likely rather conservative. Given that the current target duration for closed timestamps is 30s and the close fraction is 0.2, queries performed with `follower_read_timestamp()` should lag "real" time by roughly 48s. Lowering the target duration to 10s would lead to a 16s real-time delay. Note that while this setting does control a tradeoff, it is deeply tied to implementation details and will be hidden from customers.

### SQL `follower_read_timestamp()` builtin

A new SQL builtin `follower_read_timestamp()` is added which calls through to an injected function that determines a timestamp at which a follower read can likely be performed. This builtin will only be functional with the CCL version of CRDB and an enterprise license.

### `follower_read_timestamp()` expressions in `AS OF SYSTEM TIME` clauses

Prior to this change, SQL evaluation enforced that the expression passed to `AS OF SYSTEM TIME` clauses be constant. This requirement prevented the use of the above-mentioned function. Allowing some evaluation of non-constant expressions enables useful functionality such as taking the maximum of two timestamps, allowing enforcement of a maximum staleness.

Allowing `AS OF SYSTEM TIME` clauses to accept arbitrary expressions is problematic, as general SQL scalar evaluation is not available at the time the AS OF clause expression is evaluated. The intention is to permit only the `follower_read_timestamp()` function and to reject all other non-constant expressions.

### Abstract replica selection for SQL physical planning

The physical planning of SQL query evaluation attempts to place evaluation near the leaseholders of the ranges involved when they are known, falling back to a policy which seeks to pack requests onto nearby nodes. This logic is encapsulated in an interface called a `leaseHolderOracle` (henceforth Oracle) which is constructed based on a policy. Today's policy is called the `binPackingLeaseHolderChoice`. All of this logic currently resides in the `sql/distsqlplan` package and is used by the `SpanResolver`. A span resolver uses a `*client.Txn` to create a `SpanResolverIterator` which iterates through ranges and provides replica selection.

This proposal moves the Oracle logic into a new package, `sql/distsqlplan/replicaoracle`, which extends the current abstraction for selecting a replica given a policy so that it can additionally take the current transaction into account. The package will also provide a mechanism to register new policies, which, as we'll see, followerreadsccl will exploit. In addition to today's existing binPacking and random policies, the new package will include a policy which selects the closest replica.

Prior to this change the policy is used to statically construct an Oracle which is used throughout the life of the SpanResolver. An Oracle provides a single method:

```go
// ChoosePreferredReplica returns a choice for one range. Implementors are free
// to use the queryState param, which has info about the number of ranges
// already handled by each node for the current SQL query. The state is not
// updated with the result of this method; the caller is in charge of that.
//
// A RangeUnavailableError can be returned if there's no information in gossip
// about any of the nodes that might be tried.
ChoosePreferredReplica(
    context.Context, roachpb.RangeDescriptor, OracleQueryState,
) (kv.ReplicaInfo, error)
```
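
As an illustration of the closest-replica policy mentioned above, here is a minimal sketch written against simplified stand-in types (a node ID plus an observed latency) rather than the real `kv.ReplicaInfo` and the gossip-backed proximity information the actual oracle would consult:

```go
// replicaChoice is a simplified stand-in for kv.ReplicaInfo carrying only the
// proximity signal this sketch needs.
type replicaChoice struct {
    nodeID  int
    latency time.Duration
}

// chooseClosestReplica sketches the closest policy: ignore leaseholder
// information entirely and pick the replica with the lowest observed latency.
func chooseClosestReplica(replicas []replicaChoice) (replicaChoice, error) {
    if len(replicas) == 0 {
        return replicaChoice{}, errors.New("no replicas to choose from")
    }
    best := replicas[0]
    for _, r := range replicas[1:] {
        if r.latency < best.latency {
            best = r
        }
    }
    return best, nil
}
```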

The change will add a layer of indirection: rather than holding an Oracle, the SpanResolver will hold an OracleFactory with the following interface:

```go
// OracleFactory creates an oracle for a Txn.
type OracleFactory interface {
    // Oracle provides an Oracle to select an appropriate replica for a range.
    Oracle(*client.Txn) Oracle
}
```

For the existing policies, OracleFactory can be implemented by the same concrete structs which implement today's Oracles by merely returning themselves in calls to `Oracle()`. This mechanism allows different policies to be used for different Txns, namely the closest policy for historical queries and the binPacking policy for all others. This `FollowerReadsAwarePolicy` will check whether the OrigTimestamp of a Txn is before now less `recentDuration` plus a clock uncertainty duration. The `followerreadsccl` code will then register this new policy and assign it to the global var `distsqlplan.ReplicaOraclePolicy`.

### Expose DistSender For Determining Follower Read Safety

The last hurdle to exposing follower reads is that the `kv.DistSender` attempts to send batch requests to current leaseholders, which may prevent reads from going to nearby follower replicas. In order to inform the DistSender that it can send a batch request to a follower we add a new global var in the kv package:

```go
var CanSendToFollower = func(
    clusterID uuid.UUID, _ *cluster.Settings, _ *roachpb.BatchRequest,
) bool {
    return false
}
```

This function is consulted by the DistSender when it decides whether to look up a cached leaseholder in `DistSender.sendSingleRange`. The followerreadsccl package can then inject a new implementation which ensures that the batch request is read-only and then verifies that it meets the criteria for a follower read. Note that a batch request is read-only when all of its contained requests are neither writes nor admin requests.

This simplistic policy has potentially negative implications for export requests. Right now export relies on balanced range leadership to spread load through the cluster. Export requests are also commonly run at historical timestamps. If we were to blindly allow all sufficiently old, read-only batches to be sent to the nearest replica, export load might all flood a single node in a multi-node cluster. To deal with this we'll explicitly exclude batches which contain export requests from consideration for follower reads.
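
Putting the batch-level checks together, the sketch below captures the intended filter in simplified form; the `request` type and its flags are stand-ins for the per-request properties the real implementation would inspect on a `roachpb.BatchRequest` (the timestamp condition is handled separately, as sketched earlier):

```go
// request is a stand-in for the flags the injected implementation would
// inspect on each request contained in a batch.
type request struct {
    isWrite, isAdmin, isExport bool
}

// batchEligibleForFollowerRead reports whether a batch may be considered for
// follower reads: every request must be neither a write nor an admin request,
// and batches containing export requests are excluded outright.
func batchEligibleForFollowerRead(reqs []request) bool {
    for _, r := range reqs {
        if r.isWrite || r.isAdmin || r.isExport {
            return false
        }
    }
    return len(reqs) > 0
}
```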

### Extending `SET TRANSACTION` to support `AS OF SYSTEM TIME`

It is easy to imagine cases where a client would like to run multiple reads at the same historical timestamp. Today this would require adding an `AS OF` clause to each `SELECT`; this is cumbersome even when the exact timestamp is known, and is made worse when the timestamp which the client would like to use is generated by the server. In order to reuse the same timestamp the client would need to jump through the hoop of first generating the timestamp and then plumbing it into each individual statement. To simplify this workflow we'll expose a mechanism to run an entire transaction at a single timestamp by extending the `SET TRANSACTION` statement to support the `AS OF SYSTEM TIME` clause.

While it might on some level seem reasonable to allow arbitrary read-write queries to be performed at historical timestamps, due to the mechanics of closed timestamps such write operations could never successfully commit. Because the MinProposalTracker would effectively block write operations, we'll enforce that `SET TRANSACTION AS OF SYSTEM TIME` implies `SET TRANSACTION READ ONLY`.

Allowing the transaction timestamp to be set relates to another problem: the priority of an implicit transaction cannot be set. The problem arises when reasoning about follower reads which encounter unresolved intents that might affect the value they read. As always, a read must resolve any unresolved intent which might affect it, and it may need to wait if that intent belongs to a transaction which is still running and is not at a lower priority than the read. If a client runs a `follower_read_timestamp()` read-only transaction as proposed above, then that client can additionally set the transaction priority to HIGH, which ensures that the read never waits on pending long-running update transactions.

## Drawbacks

There are very few drawbacks to implementing the high-level idea. Most of the groundwork has already been laid. Any obvious downsides come from the impurity of the injection required to realize the functionality as an enterprise feature.

## Rationale and Alternatives

### Stateful closed timestamp tracking

One potential downside of this approach is that, in an edge case, it may have a detrimental effect on cluster performance in the face of bursty traffic and a large volume of follower reads. Imagine a situation where a large percentage of client traffic is due to follower reads and the cluster is heavily loaded such that all transactions are performing acceptably, but if the workload were shifted entirely such that all requests were forced to go to leaseholders, the cluster would not be capable of acceptably serving the traffic. If a burst of load or some other cluster event were then to lead one or more replicas to fall behind in their ability to publish closed timestamps, the leaseholders would begin receiving all of the load that had been going to followers. It is possible that this concern is not realistic in most common cases, and it seems straightforward to mitigate by increasing the target multiple. The problem becomes worse as the replication factor increases beyond 3 to numbers like 7 or 9. Furthermore, even if the increased load during this bursty period does not meaningfully affect OLTP traffic, it may lead to massively increased latency for queries which in the previous regime had been fast.

A more sophisticated mechanism which statefully tracks closed timestamps on a per-range basis on all nodes would allow `follower_read_timestamp()` to always evaluate to a timestamp which is known to be closed. Such an approach may, in the common case, be less pessimistic than this proposal's target_multiple and, unlike the optimistic approach, would be sure to always safely perform follower reads.
That being said, a stateful approach would require nodes to track closed timestamps for all replicas at planning time and may additionally require new mechanisms to mark requests as known to be safe for follower reads. Furthermore, the state tracking may be prohibitively expensive on large clusters.

### Node-local Dynamic Adjustment of `targetMultiple`

Another, less invasive, alternative might be to dynamically adjust the target multiple by detecting NotLeaseHolderErrors for queries which were expected to hit followers. This could mitigate the flood of consistent reads in the face of lagging closed timestamps, but would make the semantics of `follower_read_timestamp()` harder to understand and would require increased participation from the DistSender to provide the feedback.