github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/tech-notes/life_of_a_query.md (about) 1 # Life of a SQL Query 2 3 Original author: Andrei Matei 4 5 ## Introduction 6 7 This document aims to explain the execution of an SQL query against 8 CockroachDB, explaining the code paths through the various layers of 9 the system (network protocol, SQL session management, parsing, 10 execution planning, syntax tree transformations, query running, 11 interface with the KV code, routing of KV requests, request 12 processing, Raft, on-disk storage engine). The idea is to provide a 13 high-level unifying view of the structure of the various components; 14 no one will be explored in particular depth but pointers to other 15 documentation will be provided where such documentation exists. Code 16 pointers will abound. 17 18 This document will generally not discuss design decisions; it will 19 rather focus on tracing through the actual (current) code. 20 21 The intended audience for this post is folks curious about a dive 22 through the architecture of a modern, albeit young, database presented 23 differently than in a [design 24 doc](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md). It 25 will hopefully also be helpful for open source contributors and new 26 Cockroach Labs engineers. 27 28 ## Limitations 29 30 This document does not cover some important aspects of query execution, 31 in particular major developments that have occurred after the document 32 was initially authored; including but not limited to: 33 34 - how are transaction and SQL session timestamps assigned 35 - proposer-evaluated KV 36 - concurrent statement execution inside a SQL transaction 37 - 1PC optimizations for UPDATE and INSERT 38 - column families 39 - composite encoding for collated strings and DECIMAL values in PK and indexes 40 - txn contention handling via pusher queues 41 42 ## Postgres Wire Protocol 43 44 A SQL query arrives at the server through the Postgres wire protocol 45 (CockroachDB speaks the Postgres protocol for compatibility with 46 existing client drivers and applications). The `pgwire` package 47 implements protocol-related functionality; once a client connection is 48 authenticated, it is represented by a 49 [`pgwire.v3Conn`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/pgwire/v3.go#L164) 50 struct (it wraps a [`net.Conn`](https://golang.org/pkg/net/#Conn) 51 interface - Go's 52 sockets). [`v3Conn.serve()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/pgwire/v3.go#L313) 53 implements the "read query - execute it - return result" loop. The 54 protocol is message-oriented: for the lifetime of the connection, we 55 [read a 56 message](https://github.com/cockroachdb/cockroach/blob/677f6f18b/pkg/sql/pgwire/v3.go#L324) 57 usually representing one or more SQL statements, pass it to the 58 `sql.Executor` for executing all the statements in the batch and, once 59 that's done and the results have been produced, [serialize them and 60 send them to the 61 client](https://github.com/cockroachdb/cockroach/blob/677f6f18b/pkg/sql/pgwire/v3.go#L755). 62 63 Notice that the results are not streamed to the client and, moreover, a whole batch of statements might be executed before any results are sent back. 64 65 ## SQL Executor 66 67 The 68 [`sql.Executor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L194) 69 is responsible for parsing statements, executing them and returning 70 results back to the `pgwire.v3Conn`. The main entry point is 71 [`Executor.execRequest()`](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L481), 72 which receives a batch of statements as a raw `string`. The execution 73 of the batch is done in the context of a 74 [`sql.Session`](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/session.go#L61) 75 object which accumulates information about the state of the connection 76 (e.g. the database that has been selected, the various variables that 77 can be set, the transaction status), as well as accounting the memory 78 in use at any given time by this connection. The `Executor` also 79 manipulates a 80 [`planner`](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/planner.go#L39) 81 struct which provides the functionality around actually planning and 82 executing a query. 83 84 `Executor.execRequest()` implements a state-machine of sorts by 85 receiving batches of statements from `pgwire`, executing them one by 86 one, updating the `Session`'s transaction state (did a new transaction 87 just begin or an old transaction just end? did we encounter an error 88 which forces us to abort the current transaction?) and returning 89 results and control back to pgwire. The next batch of statements 90 received from the client will continue from the transaction state left 91 by the previous batch. 92 93 ### Parsing 94 95 The first thing the `Executor` does is [parse the 96 statements](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L495); 97 parsing uses a LALR parser generated by `go-yacc` from a [Yacc-like grammar 98 file](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/parser/sql.y), 99 originally copied from Postgres and stripped down, and then gradually 100 grown organically with ever-more SQL support. The process of parsing 101 transforms a `string` into an array of ASTs (Abstract Syntax Trees), 102 one for each statement. The AST nodes are structs defined in the 103 `sql/parser` package, generally of two types - statements and 104 expressions. Expressions implement a common interface useful for 105 applying tree transformations. These ASTs will later be transformed by 106 the `planner` into an execution plan. 107 108 ### Statement Execution 109 110 With a list of statements in hand, `Executor.execRequest()` goes 111 through them in order and executes one transaction's worth of 112 statements at a time (i.e. groups of statements between a `BEGIN` and 113 `COMMIT/ROLLBACK` statements, or single statements executed outside of 114 a transaction). If the session had an open transaction after execution 115 of the previous batch, we continue consuming statements until a 116 `COMMIT/ROLLBACK`. This "consuming of statements" is done by the call 117 to 118 [`runTxnAttempt`](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L587); 119 this function returns after executing statements until the 120 `COMMIT/ROLLBACK` has been encountered. 121 122 There is an impedance mismatch that has to be explained here, around 123 the interfacing of the SQL `Executor/session` code, which is 124 stream-oriented (with statements being executed one at a time possibly 125 within the scope of SQL transactions) and CockroachDB's Key/Value (KV) 126 interface, which is request oriented with transactions explicitly 127 attached to every request. The most interesting interface for the KV 128 layer of the database is the 129 [`Txn.Exec()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/txn.go#L465) 130 method. `Txn` lives in the `internal/client` package, which contains 131 the KV client interface (the "client" and the server in this context 132 are both internal to CockroachDB, although we used to expose the KV 133 interface externally in the past and it's not out of the question that 134 we'll do it again in the future). `Txn` represents a KV transaction; 135 there's generally one associated with the SQL session, reused between 136 client ping-pongs. 137 138 The `Txn.Exec` interface takes a callback and some execution options 139 and, based on those options, executes the callback possibly multiple 140 times and commits the transaction afterwards. If allowed by the 141 options, the callback might be called multiple times, to deal with 142 retries of transactions that are [sometimes 143 necessary](https://www.cockroachlabs.com/docs/stable/transactions.html#transaction-retries) 144 in CockroachDB (usually because of data contention). The SQL 145 `Executor` might or might not want to let the KV client perform such 146 retries automatically. 147 148 To hint at the complications: a single SQL 149 statement executed outside of a SQL transaction (i.e. an "implicit 150 transaction") can be safely retried. However, a SQL transaction 151 spanning multiple client requests will have different statements 152 executed in different callbacks passed to `Txn.Exec()`; as such, it is 153 not sufficient to retry one of these callbacks - we have to retry all 154 the statements in the transaction, and generally some of these 155 statements might be conditional on the client's logic and thus cannot 156 be retried verbatim (i.e. different results for a `SELECT` might 157 trigger different subsequent statements). In this case, we bubble up a 158 retryable error to the client; more details about this can be read in 159 our [transaction 160 documentation](https://www.cockroachlabs.com/docs/stable/transactions.html#client-side-intervention). This 161 complexity is captured in 162 [`Executor.execRequest()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L495), 163 which has logic for setting the different execution options and 164 contains a [suitable 165 callback](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L604) 166 [passed to 167 `Txn.Exec()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L643); 168 this callback will call 169 [`runTxnAttempt()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L621). The 170 statement execution code path continues inside the callback, but it is 171 worth noting that, from this moment on, we have interfaced with the 172 (client of the) KV layer and everything below is executing in the 173 context of a KV transaction. 174 175 ### Building execution plans 176 177 Now that we have figured out what (KV) transaction we're running inside 178 of, we are concerned with executing SQL statements one at a 179 time. `runTxnAttempt()` has a few layers below it dealing with the 180 various states a SQL transaction can be in 181 ([open](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L966s) 182 / 183 [aborted](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L887) 184 / [waiting for a user 185 retry](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L944), 186 etc.), but the interesting one is 187 [execStmt](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L1258). This 188 guy [creates an "execution 189 plan"](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L1261) 190 for a statement and runs it. 191 192 An execution plan in CockroachDB is a tree of 193 [`planNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad/pkg/sql/plan.go#L72) 194 nodes, similar in spirit to the AST but, this time, containing 195 semantic information and also runtime state. This tree is built by 196 [`planner.makePlan()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/plan.go#L199), 197 which takes a parsed statement and returns the root of the `planNode` 198 tree after having performed all the semantic analysis and various 199 transformations. The nodes in this tree are actually "executable" 200 (they have `Start()` and `Next()` methods), and each one will consume 201 data produced by its children (e.g. a `JoinNode` has [`left and 202 right`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/join.go#L125) 203 children whose data it consumes). 204 205 Currently building the execution plan, performing semantic analysis 206 and applying various transformations is a pretty ad-hoc process, but 207 we are working on replacing the code with a more structured process and 208 separating the IR (Intermediate Representation) used for analysis and 209 transforms from the runtime structures (see this WIP 210 RFC)[https://github.com/cockroachdb/cockroach/pull/10055/files#diff-542aa8b21b245d1144c920577333ceed]. 211 212 In the meantime, the `planner` [looks at the type of the 213 statement](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/plan.go#L248) 214 at the top of the AST and, for each statement type, invokes a specific 215 method that builds the execution plan. For example, the tree for a 216 `SELECT` statement is produced by 217 [`planner.SelectClause()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L257). Notice 218 how different aspects of a `SELECT` statement are handled there: a 219 `scanNode` is created 220 (`renderNode.initFrom()`->...-> 221 [`planner.Scan()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/data_source.go#L441)) 222 to scan a table, a `WHERE` clause is transformed into an expression 223 and assigned to a 224 `filterNode`, 225 an `ORDER BY` clause is [turned into a 226 `sortNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L296), 227 etc. In the end, a 228 [`selectTopNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L320) 229 is produced, which in fact is a tree of a `groupNode`, a `windowNode`, 230 a `sortNode`, a `distinctNode` and a `renderNode` wrapping a 231 `scanNode` acting as an original data source). 232 233 Finally, the execution plan is simplified and optimized somewhat; this 234 includes removing the `selectTopNode` wrappers and eliding all no-op 235 intermediate nodes. 236 237 To make this notion of the execution plan more concrete, consider one 238 actually "rendered" by the `EXPLAIN` statement: 239 240 ```sql 241 root@:26257> create table customers( 242 name string primary key, 243 address string, 244 state string, 245 index SI (state) 246 ); 247 248 root@:26257> insert into customers values 249 ('Google', '1600 Amphitheatre Parkway', 'CA'), 250 ('Apple', '1 Infinite Loop', 'CA'), 251 ('IBM', '1 New Orchard Road ', 'NY'); 252 253 root@:26257> EXPLAIN(EXPRS,NOEXPAND,NOOPTIMIZE,METADATA) SELECT * FROM customers WHERE address like '%Infinite%' ORDER BY state; 254 +-------+--------+----------+---------------------------+------------------------+----------+ 255 | Level | Type | Field | Description | Columns | Ordering | 256 +-------+--------+----------+---------------------------+------------------------+----------+ 257 | 0 | select | | | (name, address, state) | +state | 258 | 1 | nosort | | | (name, address, state) | +state | 259 | 1 | | order | +@3 | | | 260 | 1 | render | | | (name, address, state) | | 261 | 1 | | render 0 | name | | | 262 | 1 | | render 1 | address | | | 263 | 1 | | render 2 | state | | | 264 | 2 | filter | | | (name, address, state) | | 265 | 2 | | filter | address LIKE '%Infinite%' | | | 266 | 3 | scan | | | (name, address, state) | | 267 | 3 | | table | customers@primary | | | 268 +-------+--------+----------+---------------------------+------------------------+----------+ 269 ``` 270 271 You can see data being produced by a `scanNode`, being filtered by a 272 `filterNode` (presented as "filter"), and then sorted by a `sortNode` 273 (presented as "nosort", because we have turned off order analysis with 274 NOEXPAND and the sort node doesn't know yet whether sorting is 275 needed), wrapped in a `selectTopNode` (presented as "select"). 276 277 With plan simplification turned on, the EXPLAIN output becomes: 278 279 ``` 280 root@:26257> EXPLAIN (EXPRS,METADATA) SELECT * FROM customers WHERE address LIKE '%Infinite%' ORDER BY state; 281 +-------+------+--------+---------------------------+------------------------+--------------+ 282 | Level | Type | Field | Description | Columns | Ordering | 283 +-------+------+--------+---------------------------+------------------------+--------------+ 284 | 0 | sort | | | (name, address, state) | +state | 285 | 0 | | order | +state | | | 286 | 1 | scan | | | (name, address, state) | +name,unique | 287 | 1 | | table | customers@primary | | | 288 | 1 | | spans | ALL | | | 289 | 1 | | filter | address LIKE '%Infinite%' | | | 290 +-------+------+--------+---------------------------+------------------------+--------------+ 291 ``` 292 293 #### Expressions 294 295 A subset of ASTs are 296 [`parser.Expr`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/expr.go#L25), 297 representing various "expressions" - parts of statements that can 298 occur in many various places - in a `WHERE` clause, in a `LIMIT` 299 clause, in an `ORDER BY` clause, as the projections of a `SELECT` 300 statement, etc. Expressions nodes implement a common interface so that 301 a [visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern) can 302 be applied to them for different transformations and 303 analysis. Regardless of where they appear in the query, all 304 expressions need some common processing (e.g. names appearing in them 305 need to be resolved to columns from data sources). These tasks are 306 performed by 307 [`planner.analyzeExpr`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/analyze.go#L1596). Each 308 `planNode` is responsible for calling `analyzeExpr` on the expressions 309 it contains, usually at node creation time (again, we hope to unify 310 our execution planning more in the future). 311 312 `planner.analyzeExpr` performs the following tasks: 313 314 1. resolving names (the `colA` in `select 3 * colA from MyTable` needs 315 to be replaced by an index within the rows produced by the underlying 316 data source (usually a `scanNode`)) 317 2. normalization (e.g. `a = 1 + 1` -> `a = 2`, ` a not between b and c` -> `(a < b) or (a > c)`) 318 3. type checking (see [the typing 319 RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160203_typing.md) 320 for an in-depth discussion of Cockroach's typing system). 321 322 1. constant folding (e.g. `1 + 2` becomes `3`): we perform exact 323 arithmetic using [the same library used by the Go 324 compiler](https://golang.org/pkg/go/constant/) and classify all the 325 constants into two categories: [numeric - 326 `NumVal`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L108) 327 or [string-like - 328 `StrVal`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L281). These 329 representations of the constants are smart enough to figure out the 330 set of types that can represent the value 331 (e.g. [`NumVal.AvailableTypes`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L188) 332 - `5` can be represented as `int, decimal or float`, but `5.4` can 333 only be represented as `decimal or float`) This will come in useful 334 in the next step. 335 336 2. type inference and propagation: this analysis phase assigns a 337 result type to an expression, and in the process types all the 338 sub-expressions. Typed expressions are represented by the 339 [`TypedExpr`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/expr.go#L48) 340 interface, and they are finally able to evaluate themselves to a 341 result value through the `Eval` method. The typing algorithm is 342 presented in detail in the typing RFC: the general idea is that 343 it's a recursive algorithm operating on sub-expressions; each level 344 of the recursion may take a hint about the desired outcome, and 345 each expression node takes that hint into consideration while 346 weighting what options it has. In the absence of a hint, there's 347 also a set of "natural typing" rules. For example, a `NumVal` 348 described above 349 [checks](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L61) 350 whether the hint is compatible with its list of possible 351 types. This process also deals with [`overload 352 resolution`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/type_check.go#L371) 353 for function calls and operators. 354 4. replacing sub-query syntax nodes by a `sql.subquery` execution plan 355 node. 356 357 A note about sub-queries: consider a query like `select * from 358 Employees where DepartmentID in (select DepartmentID from Departments 359 where NumEmployees > 100)`. The query on the `Departments` table is 360 called a sub-query. Subqueries are recognized and replaced with an 361 execution node by 362 [`subqueryVisitor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/subquery.go#L294). The 363 subqueries are then run and replaced by their results through the 364 [`subqueryPlanVisitor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/subquery.go#L194). This 365 is usually done by various top-level nodes when they start execution 366 (e.g. `renderNode.Start()`). 367 368 ### Notable `planNodes` 369 370 As hinted throughout, execution plan nodes are responsible for 371 executing parts of a query. Each one consumes data from lower-level 372 nodes, performs some logic, and feeds data into a higher-level one. 373 374 After being constructed, their main methods are 375 [`Start`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/plan.go#L142), 376 which initiates the processing, and 377 [`Next`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/plan.go#L149), 378 which is called repeatedly to produce the next row. 379 380 To tie this to the [SQL Executor](#sql-executor) section above, 381 `executor.execLocal()`, 382 the method responsible for executing one statement, calls 383 `plan.Next()` repeatedly and accumulates the results. 384 385 Consider some `planNode`s involved in running a `SELECT` 386 statement, using the table defined above and 387 388 ```sql 389 SELECT * FROM customers WHERE State LIKE 'C%' AND strpos(address, 'Infinite') != 0 ORDER BY Name; 390 ``` 391 392 as a slightly contrived example. This is supposed to return customers 393 from states starting with "C" and whose address contains the string 394 "Infinite". To get excited, let's see the query plan for this 395 statement: 396 397 ```sql 398 root@:26257> EXPLAIN(EXPRS) SELECT * FROM customers WHERE State LIKE 'C%' and strpos(address, 'Infinite') != 0 order by name; 399 +-------+------------+--------+----------------------------------+ 400 | Level | Type | Field | Description | 401 +-------+------------+--------+----------------------------------+ 402 | 0 | sort | | | 403 | 0 | | order | +name | 404 | 1 | index-join | | | 405 | 2 | scan | | | 406 | 2 | | table | customers@SI | 407 | 2 | | spans | /"C"-/"D" | 408 | 2 | | filter | state LIKE 'C%' | 409 | 2 | scan | | | 410 | 2 | | table | customers@primary | 411 | 2 | | filter | strpos(address, 'Infinite') != 0 | 412 +-------+------------+--------+----------------------------------+ 413 ``` 414 415 So the plan produced for this query, from top (highest-level) to 416 bottom, looks like: 417 418 ``` 419 sortNode -> indexJoinNode -> scanNode (index) 420 -> scanNode (PK) 421 ``` 422 423 Before we inspect the nodes in turn, one thing deserves explanation: 424 how did the `indexJoinNode` (which indicates that the query is going 425 to use the "SI" index) come to be? The fact that this query uses an 426 index is not apparent in the syntactical structure of the `SELECT` 427 statement, and so this plan is not simply a product of the mechanical 428 tree building hinted to above. Indeed, there's a step that we haven't 429 mentioned before: ["plan 430 expansion"](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/expand_plan.go#L28). Among 431 other things, this step performs "index selection" (more information 432 about the algorithms currently used for index selection can be found 433 in [Radu's blog 434 post](https://www.cockroachlabs.com/blog/index-selection-cockroachdb-2/)). We're 435 looking for indexes that can be scanned to efficiently retrieve only 436 rows that match (part of) the filter. In our case, the "SI" index 437 (indexing the state) can be scanned to efficiently retrieve only the 438 rows that are candidates for satisfying the `state LIKE 'C%'` 439 expression (in an ecstasy to agony moment, we see that our index 440 selection / expression normalization code [is smart 441 enough](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/analyze.go#L1436) 442 to infer that `state LIKE 'C%'` implies `state >= 'C' AND state < 443 'D'`, but is not smart enough to infer that the two expressions are in 444 fact equivalent and thus the filter can be elided altogether). We 445 won't go into plan expansion or index selection here, but the index 446 selection process happens [in the expansion of the 447 `SelectNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/expand_plan.go#L283) 448 and, as a byproduct, produces `indexJoinNode`s configured with the 449 index spans to be scanned. 450 451 Now let's see how these `planNode`s run: 452 453 1. [`sortNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L31): 454 The `sortNode` sorts the rows produced by its child and corresponds to 455 the `ORDER BY` SQL clause. The 456 [constructor](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L60) 457 has a bunch of logic related to the quirky rules for name resolution 458 from SQL92/99. Another interesting fact is that, if we're sorting by a 459 non-trivial expression (e.g. `SELECT a, b ... ORDER BY a + b`), we 460 need the `a + b` values (for every row) to be produced by a 461 lower-level node. This is achieved through a pattern that's also 462 present in other node: the lower node capable of evaluating 463 expressions and rendering their results is the `renderNode`; the 464 `sortNode` constructor checks if the expressions it needs are already 465 rendered by that node and, if they are not, asks for them to be 466 produced through the 467 [`renderNode.addOrMergeRenders()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L206) 468 method. The actual sorting is performed in the `sortNode.Next()` 469 method. The first time it is called, [it consumes all the data 470 produced by the child 471 node](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L359) 472 and accumulates it into `n.sortStrategy` (an interface hiding multiple 473 sorting algorithms). When the last row is consumed, 474 `n.sortStrategy.Finish()` is called, at which time the sorting 475 algorithm finishes its processing. Subsequent calls to 476 `sortNode.Next()` simply iterate through the results of sorting 477 algorithm. 478 479 2. [`indexJoinNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L30): 480 The `indexJoinNode` implements joining of results from an index with 481 the rows of a table. It is used when an index can be used for a query, 482 but it doesn't contain all the necessary columns; columns not 483 available in the index need to be retrieved from the Primary Key (PK) 484 key-values. The `indexJoinNode` sits on top of two scan nodes - one 485 configured to scan the index, and one that is constantly reconfigured 486 to do "point lookups" by PK. In the case of our query, we can see that 487 the "SI" index is used to read a compact set of rows that match the 488 "state" filter but, since it doesn't contain the "address" columns, 489 the PK also needs to be used. Each index KV pair contains the primary 490 key of the row, so there is enough information to do PK 491 lookups. [`indexJoinNode.Next`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L273) 492 keeps reading rows from the index and, for each one, adds a spans to 493 be read by the PK. Once enough such spans have been batched, they are 494 all [read from the 495 PK](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L257). As 496 described in the section on [SQL rows to KV 497 pairs](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#data-mapping-between-the-sql-model-and-kv)) 498 from the design doc, each SQL row is represented as a single KV pair 499 in the indexes, but as multiple consecutive rows in the PK 500 (represented by a "key span"). 501 502 An interesting detail has to do with 503 how filters are handled: note that the `state LIKE 'C%'` condition is 504 evaluated by the index scan, and the `strpos(address, 'Infinite') != 505 0` condition is evaluated by the PK scan. This is nice because it 506 means that we will be filtering as much as we can on the index side and 507 we will be doing fewer expensive PK lookups. The code that figures out 508 which conjunction is to be evaluated where is in `splitFilter()`, 509 called by the [`indexJoinNode` 510 constructor](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L180). 511 512 3. [`scanNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/scan.go#L33): 513 The `scanNode` generally constitutes the source of a `renderNode` or `filterNode`; 514 it is responsible for scanning over the key/value pairs for a table or 515 index and reconstructing them into rows. This node is starting to 516 smell like rubber meeting a road, because we are getting closer to the 517 actual data - the monolithic, distributed KV map. You'll see that the 518 [`Next()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/scan.go#L214) 519 method is not particularly climactic, since it delegates the work to a 520 `rowFetcher`, described below. There's one interesting thing that the 521 `scanNode` does: it [runs a filter 522 expression](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/scan.go#L233), 523 just like the `filterNode`. That is because we are trying to push down 524 parts of the `WHERE` clause as far as possible. This is generally a 525 work in progress, see `filter_opt.go`. The idea is 526 that a query like 527 ```sql 528 /* Select the orders placed by each customer in the first year of membership. */ 529 SELECT * FROM Orders o inner join Customers c ON o.CustomerID = c.ID WHERE Orders.amount > 10 AND Customers.State = 'NY' AND age(c.JoinDate, o.Date) < INTERVAL '1 year' 530 ``` 531 is going to be compiled into two `scanNode`s, one for `Customers`, 532 one for `Orders`. Each one of them can do the part of filtering 533 that refers exclusively to their respective tables, and then the 534 higher-level `joinNode` only needs to evaluate expressions that 535 need data from both (i.e. `age(c.JoinDate, o.Date) < INTERVAL '1 536 year'`). 537 538 Let's continue downwards, looking at the structures that the 539 `scanNode` uses for actually reading data. 540 541 1. [`rowFetcher`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/rowfetcher.go#L35): 542 The `rowFetcher` is responsible for iterating through key-value 543 pairs, figuring out where a SQL table or index row ends (remember 544 that a SQL row is potentially encoded in multiple KV entries), and 545 decoding all the keys and values in SQL column values, dealing with 546 differences between the primary index and other indexes and with 547 the [layout of a 548 table](https://www.cockroachlabs.com/docs/stable/column-families.html). For 549 details on the mapping between SQL rows and KV pairs, see the 550 [corresponding section from the Design 551 Doc](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#data-mapping-between-the-sql-model-and-kv) and the [encoding tech note](encoding.md). 552 553 The `rowFetcher` also [performs 554 decoding](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/table.go#L953) 555 from on-disk byte arrays to the representation of data that we do 556 most processing on: implementation of the 557 [`parser.Datum`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/datum.go#L57) 558 interface. For details on what the on-disk format is for different 559 data types, browse around the [`util/encoding 560 directory`](https://github.com/cockroachdb/cockroach/tree/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/util/encoding). 561 562 For actually reading a KV pair from the database, the `rowFetcher` delegates to the `kvBatchFetcher`. 563 564 2. [`kvBatchFetcher`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/kvfetcher.go#L84): 565 The `kvBatchFetcher` finally reads data from the KV database. It 566 understands nothing of SQL concepts, such as tables, rows or 567 columns. When it is created, it is configured with a number of "key 568 spans" that it needs to read (these might be, for example, a single 569 span for reading a whole table, or a couple of spans for reading 570 parts of the PK or of an index). 571 572 To actually read data from the KV database, the `kvBatchFetcher` uses the 573 KV layer's "client" interface, namely 574 [`client.Batch`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/batch.go#L30). This 575 is where the "SQL layer" interfaces with the "KV layer" - the 576 `kvBatchFetcher` will 577 [build](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/kvfetcher.go#L197) 578 such `Batch`es of requests, [send them for 579 execution](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/kvfetcher.go#L203) 580 in the context of the KV transaction (remember the `Transaction` 581 mentioned in the [Statement Execution 582 section](#StatementExecution)), [read the 583 results](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/kvfetcher.go#L220) 584 and return them to the hierarchy of `planNodes`. The requests being 585 sent to the KV layer, in the case of this read-only query, are 586 [`ScanRequest`s](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L204). 587 588 The rest of this document will walk through the "execution" of KV 589 requests, such as the ones sent by the `kvBatchFetcher`. 590 591 ## KV 592 593 The KV layer of CockroachDB deals with execution of "requests". The 594 protocol-buffer-based API is defined in 595 [api.proto](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto), 596 listing the various types of requests and response. In practice, the 597 KV's client always sends 598 [`BatchRequest`s](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L848), 599 a generic request containing a collection of other requests. All 600 requests have a 601 [Header](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L788) 602 which contains routing information (which replica a request is 603 destined for) and [transaction 604 information](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L818) 605 - in the context of which transaction is the request to be executed. 606 607 ### The KV client interface 608 609 Clients "send" KV requests using a client interface (currently this 610 interface is internal, used by SQL, but we might offer it directly to 611 users in some form in the future). This client interface contains 612 primitives for [starting a (KV) 613 transaction](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/db.go#L452) 614 (remember, the SQL `Executor` uses this to run every statement in the 615 context of a transaction). Afterwards, a 616 [`Txn`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/txn.go#L36) 617 object is available for [executing requests in the context of that 618 transaction](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/txn.go#L295) 619 - this is what, for example, the `kvBatchFetcher` uses. If you trace what 620 happens inside that `Txn.Run()` method you eventually get to 621 [`txn.db.sender.Send(..., 622 batch)`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/txn.go#L587): 623 the request starts percolating through a hierarchy of 624 [`Sender`s](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/sender.go#L27) 625 - objects that perform various peripheral tasks and ultimately route 626 the request to a replica for execution. `Sender`s have a single method 627 - `Send()` - which ultimately passes the request to the lower 628 level. Let's go down this "sending" rabbit hole: `TxnCoordSender -> 629 DistSender -> Node -> Stores -> Store -> Replica`. The first two run 630 on the same node as the that received the SQL query and is doing the 631 SQL processing (the "gateway node"), the others run on the nodes 632 responsible for the data that is being accessed (the "range node"). 633 634 ### TxnCoordSender 635 636 The top-most `client.Sender` is the 637 [`TxnCoordSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/txn_coord_sender.go#L140). A 638 TxnCoordSender is responsible for dealing with transactions' state 639 (see the [Transaction Management section of the design 640 doc](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md)). After 641 a transaction is started, the TxnCoordSender starts asynchronously 642 sending heartbeat messages to that transaction's "txn record", to keep 643 it live. It also keeps track of each written key or key range over the 644 course of the transaction. When the transaction is committed or 645 aborted, it clears accumulated write intents for the transaction. All 646 requests being performed as part of a transaction have to go through 647 the same `TxnCoordSender` so that all write intents are accounted for 648 and eventually cleaned up. After performing this bookkeeping, the 649 request is [passed to the 650 `DistSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/txn_coord_sender.go#L418). 651 652 ### DistSender 653 654 The 655 [`DistSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L133) 656 is truly a workhorse: it handles the communication between the gateway 657 node and the (possibly many) range nodes, putting the "distributed" in 658 "distributed database". It receives `BatchRequest`s, looks at the 659 requests inside the batch, figures out what range each command needs 660 to go to, finds the nodes/replicas responsible for that range, routes 661 the requests there and then collects and reassembles the results. 662 663 Let's go through the code a bit: 664 665 1. The request is subdivided into ranges: `DistSender.Send()` calls 666 [`DistSender.divideAndSendBatchToRanges()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L620) 667 which 668 [iterates](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L720) 669 over the constituent ranges of requests by using a 670 [`RangeIterator`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_iter.go#L27) 671 (a single request, such as a `ScanRequest` can refer to a key span 672 that might straddle potentially many ranges). A lot of things hide 673 behind this innocent-looking iteration: the cluster's range metadata 674 needs to be accessed in order to find the mapping of keys to ranges 675 (info on this metadata can be found [in the Range Metadata 676 section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#range-metadata) 677 of the design doc). Range metadata is stored as regular data in the 678 cluster, in a two-level index mapping range end keys to descriptors 679 about the replicas of the respective range (the ranges storing this 680 index are called "meta-ranges"). The `RangeIterator` logically 681 iterates over these descriptors, in range key order. Brace yourselves: 682 for moving from one range to the next, the iterator [calls back into 683 the 684 `DistSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_iter.go#L156), 685 which knows how to find the descriptor of the range responsible for 686 one particular key. The `DistSender` 687 [delegates](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L441) 688 resolving a key to a descriptor to the 689 [`rangeDescriptorCache`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L76) 690 (a LRU tree cache, indexed by range end key). This cache 691 desynchronized with reality as ranges in a cluster split or move 692 around; when an entry is discovered to be stale, we'll see below that 693 the `DistSender` removes it from the cache. 694 695 In the happy case, the cache has information about a descriptor 696 covering the key we're interested in and [it returns 697 it](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L295). In 698 the unhappy case, it needs to [perform a 699 scan](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L323) 700 of a meta-range. To do so, we need to know the descriptor of the 701 meta-range containing the descriptor we're interested in, which is 702 retrieved using [a recursive call into the 703 cache](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L431). This 704 recursion cannot go on forever - the descriptor of a regular range is 705 in a meta2-range (a 2nd level index range), and the descriptors for 706 meta2-ranges are present in the (one and only) meta1-range. Once the 707 meta-descriptor of the range we want to scan is known, the cache 708 [delegates back again to the 709 `DistSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L439), 710 which [sends a 711 `RangeLookupRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L305) 712 KV command addressed directly to the meta range (so the `DistSender` 713 is not recursively involved in routing this request). 714 715 2. Each sub-request (partial batch) is sent to its range. This is done 716 through the call to 717 [`DistSender.sendPartialBatchAsync()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L793) 718 which 719 [truncates](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L894) 720 all the requests in the batch to the current range and then it 721 [sends](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L926) 722 the truncated batch to a range. All these partial batches are sent 723 concurrently. 724 725 `sendPartialBatch()` is the level at which error stemming from stale 726 `rangeDescriptorCache` information are handled: the range descriptor 727 that's detected to be stale is 728 [evicted](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L974) 729 from the cache and the partial batch is 730 [reprocessed](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L982). 731 732 3. Sending a partial batch to a single range implies selecting the 733 right replica of that range and performing an RPC to it. By default, 734 each range is replicated three ways, but only one of the three 735 replicas is the "lease holder" - the temporarily designed owner of 736 that range, in charge of coordinating all reads and writes to it (see 737 the [Range Leases 738 section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#range-leases) 739 in the design doc). Figuring out which replica has the lease is done 740 through another cache - the 741 [`leaseHolderCache`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/leaseholder_cache.go#L28), 742 whose information can also get stale. 743 744 The method of the `DistSender` dealing with this is 745 [`sendSingleRange`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L451). It 746 will use the cache to send the request to the lease holder, but it's 747 also prepared to try the other replicas, in [order of 748 "proximity"](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L461). The 749 replica that the cache says is the leaseholder is simply [moved to the 750 front](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L467) 751 of the list of replicas to be tried and then an [RPC is 752 sent](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L474) 753 to all of them, in order. 754 755 4. Sending the RPCs is initiated by 756 [`sendToReplicas`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1095), 757 which sends the request to the [first 758 one](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1136) 759 and subsequently to [the 760 other](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1234), 761 until one succeeds or returns a processing error. Processing errors 762 are distinguished from routing errors by 763 [`handlePerReplicaError`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1258) 764 which, among others, handles [wrong information in the 765 `leaseHolderCache`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1273). 766 767 5. Actually sending the RPCs is hidden behind the [`Transport 768 interface`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/transport.go#L87). Concretely, 769 [`grpcTransport.SendNext()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/transport.go#L159) 770 does [gRPC](http://www.grpc.io/) calls to the nodes containing the 771 destination replicas, namely to a service implementing the [`Internal` 772 service](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/roachpb/api.proto#L892). 773 774 6. The (async) responses from the different replicas are 775 [combined](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L691) 776 into a single `BatchResponse`, which is ultimately returned from the 777 `Send()` method. 778 779 We've now gone through the relevant things that happen on the gateway 780 node. Further, we're going to look at what happens on the "remote" 781 side - on each of the ranges. 782 783 ### RPC server - Node and Stores 784 785 We've seen how the `DistSender` splits `BatchRequest` into partial 786 batches, each containing commands local to a single replica, and how 787 these commands are sent to the lease holders of their ranges through 788 RPCs. We're now moving to the "server" end of these RPCs. The struct 789 that implements the RPC service is 790 [`Node`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/server/node.go#L850). The 791 `Node` doesn't do anything of great relevance; it 792 [delegates](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/server/node.go#L825) 793 the request to its 794 [`Stores`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/stores.go#L36) 795 member which represents a collection of "stores" (on-disk databases 796 imagined to be one per physical disk, see [the Architecture 797 section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#architecture) 798 of the design doc). The `Stores` implements the `Sender` interface, 799 just like the gateway layers that we've seen before, resuming the 800 pattern of wrapping another `Sender` and passing requests down through 801 the `Send()` method. 802 803 `Stores.Send()` 804 [identifies](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/stores.go#L160) 805 which particular store contains the destination replica (based on 806 routing info filled into the request by the `DistSender`) and 807 [routes](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/stores.go#L187) 808 the request there. One interesting thing that the `Stores` does, in 809 case requests from the current transactions have already been 810 processed on this node, is [update the upper 811 bound](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/stores.go#L166) 812 on the uncertainty interval to be used by the current request (see the 813 ["Choosing a Timestamp" section of the design 814 doc](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md)) 815 for details on uncertainty intervals). The uncertainty interval 816 dictates which timestamps for values are ambiguous because of clock 817 skew between nodes (the values for which don't know if they were 818 written before or after the serialization point of the current 819 txn). This code realizes that, if a request from the current txn has 820 been processed on this node before, no value written after that node's 821 timestamp at the time of that other request processing is ambiguous. 822 823 ### Store - intent resolution 824 825 A 826 [`Store`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L366) 827 represents one physical disk device. For our purposes, a `Store` 828 mostly 829 [delegates](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2485) 830 the request to a `replica`, but it has one important role - in case 831 the request runs into "write intents" (i.e. uncommitted values), it 832 deals with those intents. This handles read-write and write-write 833 conflicts between transactions. Notice that the code calling the 834 `replica` is inside [a big infinite retry 835 loop](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2454) 836 and that a bunch of the code inside it deals with 837 [`WriteIntentError`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2494). When 838 we see such an error, we [try to "resolve" 839 it](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2522) 840 using the 841 [`intentResolver`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/intent_resolver.go#L43). Resolving 842 means figuring out if the transaction to which the intent belongs is 843 still pending (it might already be committed or aborted, in which case 844 the intent is "resolved"), or possibly "pushing" the transaction in 845 question (forcing it to restart at a higher timestamp, such that it 846 doesn't conflict with the current txn). If the conflicting txn is no 847 longer pending or if it was pushed, then the intents can be properly 848 resolved (i.e. either replaced by a committed value, or simply 849 discarded). The first part - figuring out the txn status or pushing it 850 - is done in 851 [`intentResolver.maybePushTransaction`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/intent_resolver.go#L220): 852 we can see that a series of `PushTxnRequest`s are batched and sent to 853 the cluster (meaning the hierarchy of `Sender`s on the current node 854 will be used, top to bottom, to route the requests to the various 855 transaction records - see the ["Transaction execution flow" 856 section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md) 857 of the design doc). In case the transaction we're trying to push is 858 still pending, the decision about whether or not the push is 859 successful is done [deep in the processing of the 860 `PushTxnRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L1380) 861 (several levels below the `Store` level we're discussing here, in the 862 stack for the spinned-off `PushTxnRequest`) based on the relative 863 priorities of the pusher/pushee txns. 864 865 The second part - replacing the intents that can now be resolved, is 866 done through a call to 867 [`intentResolver.resolveIntents`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/intent_resolver.go#L97). 868 Back where we left off in 869 [`Store.Send()`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2522), 870 the call to the `intentResolver`, if successful, will change the 871 `resolved` field of the `WriteIntentError` which will cause us to 872 [retry 873 immediately](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2533). Otherwise, 874 we'll retry according to an exponential backoff, waiting for the still 875 pending transaction that we couldn't push to complete - we don't want 876 to retry too soon, as we'd almost surely run into the same intent 877 again (we're working to replace this "polling"-based mechanism for 878 waiting for a conflicting txn to finish with something more reactive). 879 880 ### Replica - executing reads, proposing Raft commands 881 882 A 883 [`Replica`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L246) 884 represents one copy of range, which in turn is a contiguous keyspace 885 managed by one instance of the Raft consensus algorithm. The system 886 tries to keep ranges around 64MB, by default. The `Replica` is the 887 final `Sender` in our hierarchy. The role of all the other `Sender`s 888 was, mostly, to route requests to the `Replica` currently acting as 889 the lease holder for the range (a _primus inter pares_ `Replica` that 890 takes on a bunch of coordination responsibilities we'll explore 891 below). A replica deals with read requests differently than write 892 requests. Reads are evaluated directly, whereas writes will enter 893 another big chapter in their life and go through the Raft consensus 894 protocol. 895 896 The difference between the paths of read requests vs write requests is 897 seen immediately: `replica.Send()` quickly [branches 898 off](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1322) 899 based on the request type. We'll talk about the read/write paths in 900 turn. 901 902 ### Read request path 903 904 The first thing that is done for a read request is checking if the 905 request got to the right place (i.e. the current replica is the lease 906 holder); remember that a lot of the routing was done based on caches 907 or out-right guesses. This check is performed by 908 [`replica.redirectOnOrAcquireLease()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1783), 909 a rabbit hole in its own right. Let's just say that, in case the 910 current replica is not the lease holder, `redirectOnOrAcquireLease` 911 either 912 [redirects](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L992) 913 to the lease holder, if there is a valid lease (remember that the 914 `DistSender` will handle such redirections), or [requests a new 915 lease](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L981) 916 otherwise, in the hope that it will become the lease 917 holder. Requesting a lease is done through the 918 [`pendingLeaseRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_range_lease.go#L33) 919 helper struct, which coalesces multiple requests for the same lease 920 and eventually constructs a 921 [`RequestLeaseRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_range_lease.go#L140) 922 and [sends it for 923 execution](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_range_lease.go#L205) 924 directly to the replica (as we've seen in other cases, bypassing all 925 the senders to avoid recursing infinitely). In case a lease is 926 requested, `redirectOnOrAcquireLease` will 927 [wait](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1012) 928 for that request to complete and check if it was successful. 929 930 Once the lease situation has been settled, the next thing to do for 931 the read is synchronizing it with possible in-flight writes - if a 932 write to an overlapping key span is in progress, the read might need 933 to see its value, so we can't race with it; we must wait until the 934 write is done. This synchronization is done through the 935 [`CommandQueue`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/command_queue.go#L32) 936 struct - an interval tree maintaining all the in-flight requests, 937 indexed by the key or span of keys that they touch. Waiting for the 938 writes is done inside 939 [`replica.beginCmds()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1525). Notice 940 that immediately after figuring out which commands we need to wait 941 for, we atomically [add the current 942 read](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1527) 943 to the command queue in order to block future writes. This overlaps in 944 spirit with the use of the `TimestampCache` structure described below 945 and in fact there is a 946 [proposal](https://forum.cockroachlabs.com/t/why-do-we-keep-read-commands-in-the-command-queue/360) 947 for not putting reads in the queue. Removal of commands from the queue 948 is [done 949 later](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1811) 950 through the callback returned by `beginCmds`. This epilogue also does 951 something else that's important: it [records the 952 read](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1429) 953 in the 954 [`TimestampCache`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/timestamp_cache.go#L85), 955 a bounded in-memory cache from key range to the latest timestamp at 956 which it was read. This structure serves to protect against violations 957 of the Snapshot Isolation transaction isolation level (the lowest that 958 CockroachDB provides) which require that the outcome of reads must be 959 preserved, i.e. a write of a key at a lower timestamp than a previous 960 read must not succeed (see the Read-Write Conflicts – Read Timestamp 961 Cache section in [Matt's blog 962 post](https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/)). As 963 we'll see in the writes section, writes consult this structure to make 964 sure they're not writing "under" a read that has already been 965 performed. 966 967 Now we're reading to actually evaluate the read - control moves to 968 [`replica.evaluateBatch()`](https://github.com/cockroachdb/cockroach/blob/75c26e5498ef26a4adde9425cb43682d38ec8ee4/pkg/storage/replica.go#L5225) 969 which calls 970 [`replica.evaluateCommand`](https://github.com/cockroachdb/cockroach/blob/75c26e5498ef26a4adde9425cb43682d38ec8ee4/pkg/storage/replica.go#L5286) 971 for each request in the batch. `evaluateCommand` switches over the request 972 types using a [helper request to method 973 map](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L84) 974 and [passes 975 execution](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L160) 976 to the request-specific method. One typical read request is a 977 [`ScanRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L204); 978 this is evaluated by 979 [`evalScan`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L343). The 980 code is very brief - it immediately 981 [calls](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L374) 982 a corresponding on the `engine` - a handle to the on-disk 983 [RocksDB](http://rocksdb.org/) database. Before we dig a bit into 984 this `engine`, let's look at what `evalScan` will do next: it will 985 [return 986 intents](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L360) 987 to the higher levels. These are intents that the scanning 988 encountered, but they didn't prohibit it from continuing 989 (e.g. intents with a timestamp higher than the timestamp at which 990 we're reading - the read doesn't care if those intents are committed 991 or not); this is in contrast with intents that do block the read - 992 those, as we'll see below, are transformed into `WriteIntentError`s 993 which we've seen that they're handled by the [`Store`](#Store). These 994 non-interfering intents are collected for cleanup purposes - they 995 might be garbage left-over by dead transactions and we want to 996 proactively clean them up. They're returned up the stack until 997 `replica.addReadOnlyCmd` [tries to clean them 998 up](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1837) 999 using our old friend, the `intentResolver`. 1000 1001 ### Engine 1002 1003 We're getting to the bottom of the CockroachDB stack - the 1004 [`Engine`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/engine.go#L159) 1005 is an interface abstracting away different on-disk stores. The only 1006 implementation we currently use is 1007 [`RocksDB`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/rocksdb.go#L265), 1008 which is a wrapper around the RocksDB C++ library. We won't go into 1009 this wrapper other than to say that it uses 1010 [cgo](https://golang.org/cmd/cgo/) for interfacing with C++ code. We 1011 also won't go into the RocksDB code which, although it's obviously an 1012 important part of servicing a request, is not something that 1013 CockroachDB devs generally deal with. 1014 1015 For reads, the entry point into the `engine` package is 1016 [`mvccScanInternal()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/mvcc.go#L1480). This 1017 performs a scan over the KV database, dealing with the data 1018 representation we use for [MultiVersion Concurrency 1019 Control](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) 1020 (MVCC). It iterates over the key/vals of the requested range and 1021 [appends](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/mvcc.go#L1511) 1022 each one to the results. The MVCC details, such as the fact that we 1023 keep multiple versions of each key (for different timestamps) and the 1024 intents, are handled by 1025 [`MVCCIterate()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/mvcc.go#L1555), 1026 which uses an iterator provided by the `Engine` to scan over 1027 key/vals. It delegates reading key/vals and advancing the iterator to 1028 [`mvccGetInternal()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/mvcc.go#L659). 1029 1030 ### Write request path 1031 1032 Write requests are conceptually more interesting than reads because 1033 they're not simply serviced by one node/replica. Instead, they go 1034 through the Raft consensus algorithm, which maintains an ordered 1035 commit log, and are then applied by all of a range's replicas (see the 1036 [Raft 1037 section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#raft---consistency-of-range-replicas) 1038 of the design doc for more details). The replica that initiates this 1039 process is, just like in the read case, the lease holder. Execution on 1040 this lease holder is thus broken into two stages - before ("upstream 1041 of" in code) Raft and below ("downstream of") Raft. The upstream stage 1042 will eventually block for the corresponding Raft command to be applied 1043 *locally* (after the command has been applied locally, future reads 1044 are guaranteed to see its effects). 1045 1046 For what follows we'll introduce some terminology. We've already seen 1047 that a `replica` (and the KV subsystem in general) receives 1048 *requests*. In what follows, these requests will be _evaluated_, which 1049 transforms them to Raft _commands_. The commands in turn are 1050 _proposed_ to the Raft consensus group and, after the Raft group 1051 accepts the proposals and commits them, control comes back to the 1052 `Replica`s (all of the replicas of a range this time, not just the 1053 lease holder), which _apply_ them. 1054 1055 Execution of write commands, mirroring the reads, starts in 1056 [`replica.addWriteCmd()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1857). This 1057 method just contains a retry loop that deals with exceptional cases in 1058 which requests need to be evaluated repeatedly and delegates to 1059 [`replica.tryAddWriteCmd`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1872). This 1060 guy does a number of things: 1061 1062 1. It [waits until overlapping in-flight requests are 1063 done](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1959) 1064 and adds the current as an in-flight request to the `CommandQueue` 1065 (similar to the reads). 1066 1067 2. It checks that the current replica is the lease holder by [calling 1068 the `redirectOnOrAcquireLease` 1069 method](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1983), 1070 just like the reads. 1071 1072 3. It ["applies the timestamp 1073 cache"](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2002) 1074 - meaning that the `TimestampCache` we've discussed above is checked 1075 to see if the write can proceed at the timestamp at which it's trying 1076 to modify that database. If it can't (because there's been a more 1077 recent overlapping read), the write's timestamp [is 1078 bumped](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1688) 1079 to a timestamp later than any overlapping read). 1080 1081 4. It evaluates the request and proposes resulting Raft commands. It 1082 all starts with [this call to 1083 replica.propose()](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2036). We'll 1084 describe the process below (it will be fun) but, before we do, let's 1085 see what the current method will do afterwards. 1086 1087 5. The call to `replica.propose` returns a channel that [we'll wait 1088 on](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2051). This 1089 is the decoupling point that we've anticipated above - the point where 1090 we cede control to the Raft machinery. The `replica` doing the 1091 proposals accepts its role as merely one of many replicas and waits 1092 for the consensus group to make progress in lock-step. The channel 1093 will receive a result when the (local) replica has applied the 1094 respective commands, which can happen only after the commands have 1095 been committed to the shared Raft log (a global operation). 1096 1097 6. As in the reads case, at the end of the `tryAddWriteCmd` method, 1098 [an 1099 epilogue](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1968) 1100 will remove the request from the `CommandQueue` and add it to the 1101 timestamp cache. 1102 1103 #### Evaluation of requests and application of Raft commands 1104 1105 As promised, let's see what happens inside `replica.propose()`. The 1106 first thing is the process of evaluation, i.e. turning a KV *request* 1107 into a Raft *command*. This is done through the call to 1108 [`requestToProposal()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2270), 1109 which quickly calls 1110 [`evaluateProposal()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2119), 1111 which in turn quickly calls the surprisingly-named 1112 [`applyRaftCommandInBatch`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2158). This 1113 last method [simulates the execution of the 1114 request](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3737), 1115 if you will, and records all the would-be changes to the `Engine` into 1116 a "batch" (these batches are how RocksDB models transactions). This 1117 batch will be 1118 [serialized](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3778) 1119 into a Raft command. If we were to commit this batch now, the changes 1120 would be live, but just on this one replica, which would be a 1121 potential data consistency violation. Instead, we [abort 1122 it](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3784). It 1123 will resurrect again when the command "comes out of Raft", as we'll 1124 see. 1125 1126 The simulation part takes place inside the 1127 [`executeWriteBatch()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3822) 1128 method. This takes in the `roachpb.BatchRequest` (the KV request we've 1129 been dealing with all along), [allocates an 1130 `engine.Batch`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3835) 1131 and delegates to 1132 [`evaluateBatch()`](https://github.com/cockroachdb/cockroach/blob/75c26e5498ef26a4adde9425cb43682d38ec8ee4/pkg/storage/replica.go#L5225). This 1133 fellow finally 1134 [iterates](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L4041) 1135 over the individual requests in the batch and, for each one, calls 1136 [`evaluateCommand`](https://github.com/cockroachdb/cockroach/blob/75c26e5498ef26a4adde9425cb43682d38ec8ee4/pkg/storage/replica_command.go#L63). We've 1137 seen `evaluateCommand` before, on the read path. It switches over the 1138 different types of requests and [calls a method specific to each 1139 type](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L160). One 1140 such method would be 1141 [`evalPut`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L225), 1142 which writes a value for a key. Inside it we'll see a [call to the 1143 engine](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L249) 1144 to perform this write (but remember, it's all performed inside a 1145 RocksDB transaction, the `engine.Batch`). 1146 1147 This was all for the purposes of recording the engine changes that 1148 need to be proposed to Raft. Let's unwind the stack to 1149 `replica.propose` (the method that started this section), and see what 1150 happens with the result of 1151 [`requestToProposal`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2270). For 1152 one, it gets [inserted 1153 into](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2292) 1154 the "pending proposals map" - a structure that will make the 1155 connection between a command being *applied* and `tryAddWriteCmd` 1156 which will be blocked on a channel waiting for the local 1157 application. More importantly, it gets passed to 1158 [`replica.submitProposalLocked`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2294) 1159 which, eventually, calls 1160 [`raftGroup.Propose()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2396). This 1161 `raftGroup` is a handle to a consensus group, implemented by the [Etcd 1162 Raft library](https://github.com/coreos/etcd/tree/master/raft), a 1163 black box to which we submit proposals to have them serialized through 1164 majority voting into a coherent distributed log. This library is 1165 responsible for passing the commands in order to all the replicas for 1166 *application*. 1167 1168 This concludes the discussion of the part specific to the lease holder 1169 replica: how commands are proposed to Raft and how the lease holder is 1170 waiting for them to be applied before returning a reply to the (KV) 1171 client. What's missing is the discussion on how exactly they are 1172 applied. 1173 1174 #### Raft command application 1175 1176 We've seen how commands are "put into" Raft. But how do they "come 1177 out"? The Etcd Raft library implements a distributed state machine 1178 whose description is beyond the present scope. Suffice to say that we 1179 have a 1180 [`raftProcessor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/scheduler.go#L112) 1181 interface that state transitions from this library call to. Our older 1182 friend the `Store` implements this interface and the important method 1183 is 1184 [`Store.processReady()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/store.go#L3337). This 1185 will eventually call back into a specific `replica` (the replica of a 1186 range that's being modified by each command), namely it will call 1187 [`handleRaftReadyRaftMuLocked`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2484). This 1188 will [iterate through newly committed 1189 commands](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2616) 1190 calling 1191 [`processRaftCommand`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3247) 1192 for each one. This will in take the serialized `engine.Batch` and call 1193 [`replica.applyRaftCommand`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3467) 1194 with it. Here the batch is [deserialized and applied to the 1195 engine](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3644) 1196 and, this time, unlike on the proposed side in 1197 `applyRaftCommandInBatch`, the changes are actually [committed to 1198 storage](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3683). The 1199 command has now been applied (on one particular replica, but keep in 1200 mind that the process described in this section happens on *every* 1201 replica). 1202 1203 We've glossed over something in `processRaftCommand` that's important: 1204 after applying the command, if the current replica is proposer 1205 (i.e. the lease holder), we need to signal the proposer (which, as we 1206 saw in the previous section, is blocked in `tryAddWriteCmd`). This 1207 happens at the [very 1208 end](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3513). We've 1209 now come full circle - the proposer will now be unblocked and receive 1210 a response on the channel it was waiting on, and it can unwind the 1211 stack letting its client know that the request is complete. This reply 1212 can travel through the hierarchy of `Sender`s, back from the lease 1213 holder node to the SQL gateway node, to a SQL tree of `planNode`s, to 1214 the SQL Executor, and, through the `pgwire` implementation, to the SQL 1215 client. 1216