github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/tech-notes/life_of_a_query.md (about)

     1  # Life of a SQL Query
     2  
     3  Original author: Andrei Matei
     4  
     5  ## Introduction
     6  
     7  This document aims to explain the execution of an SQL query against
     8  CockroachDB, explaining the code paths through the various layers of
     9  the system (network protocol, SQL session management, parsing,
    10  execution planning, syntax tree transformations, query running,
    11  interface with the KV code, routing of KV requests, request
    12  processing, Raft, on-disk storage engine). The idea is to provide a
    13  high-level unifying view of the structure of the various components;
    14  no one will be explored in particular depth but pointers to other
    15  documentation will be provided where such documentation exists. Code
    16  pointers will abound.
    17  
    18  This document will generally not discuss design decisions; it will
    19  rather focus on tracing through the actual (current) code.
    20  
    21  The intended audience for this post is folks curious about a dive
    22  through the architecture of a modern, albeit young, database presented
    23  differently than in a [design
    24  doc](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md). It
    25  will hopefully also be helpful for open source contributors and new
    26  Cockroach Labs engineers.
    27  
    28  ## Limitations
    29  
    30  This document does not cover some important aspects of query execution,
    31  in particular major developments that have occurred after the document
    32  was initially authored; including but not limited to:
    33  
    34  - how are transaction and SQL session timestamps assigned
    35  - proposer-evaluated KV
    36  - concurrent statement execution inside a SQL transaction
    37  - 1PC optimizations for UPDATE and INSERT
    38  - column families
    39  - composite encoding for collated strings and DECIMAL values in PK and indexes
    40  - txn contention handling via pusher queues
    41  
    42  ## Postgres Wire Protocol
    43  
    44  A SQL query arrives at the server through the Postgres wire protocol
    45  (CockroachDB speaks the Postgres protocol for compatibility with
    46  existing client drivers and applications). The `pgwire` package
    47  implements protocol-related functionality; once a client connection is
    48  authenticated, it is represented by a
    49  [`pgwire.v3Conn`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/pgwire/v3.go#L164)
    50  struct (it wraps a [`net.Conn`](https://golang.org/pkg/net/#Conn)
    51  interface - Go's
    52  sockets). [`v3Conn.serve()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/pgwire/v3.go#L313)
    53  implements the "read query - execute it - return result" loop. The
    54  protocol is message-oriented: for the lifetime of the connection, we
    55  [read a
    56  message](https://github.com/cockroachdb/cockroach/blob/677f6f18b/pkg/sql/pgwire/v3.go#L324)
    57  usually representing one or more SQL statements, pass it to the
    58  `sql.Executor` for executing all the statements in the batch and, once
    59  that's done and the results have been produced, [serialize them and
    60  send them to the
    61  client](https://github.com/cockroachdb/cockroach/blob/677f6f18b/pkg/sql/pgwire/v3.go#L755).
    62  
    63  Notice that the results are not streamed to the client and, moreover, a whole batch of statements might be executed before any results are sent back.
    64  
    65  ## SQL Executor
    66  
    67  The
    68  [`sql.Executor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L194)
    69  is responsible for parsing statements, executing them and returning
    70  results back to the `pgwire.v3Conn`. The main entry point is
    71  [`Executor.execRequest()`](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L481),
    72  which receives a batch of statements as a raw `string`. The execution
    73  of the batch is done in the context of a
    74  [`sql.Session`](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/session.go#L61)
    75  object which accumulates information about the state of the connection
    76  (e.g. the database that has been selected, the various variables that
    77  can be set, the transaction status), as well as accounting the memory
    78  in use at any given time by this connection. The `Executor` also
    79  manipulates a
    80  [`planner`](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/planner.go#L39)
    81  struct which provides the functionality around actually planning and
    82  executing a query.
    83  
    84  `Executor.execRequest()` implements a state-machine of sorts by
    85  receiving batches of statements from `pgwire`, executing them one by
    86  one, updating the `Session`'s transaction state (did a new transaction
    87  just begin or an old transaction just end? did we encounter an error
    88  which forces us to abort the current transaction?) and returning
    89  results and control back to pgwire. The next batch of statements
    90  received from the client will continue from the transaction state left
    91  by the previous batch.
    92  
    93  ### Parsing
    94  
    95  The first thing the `Executor` does is [parse the
    96  statements](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L495);
    97  parsing uses a LALR parser generated by `go-yacc` from a [Yacc-like grammar
    98  file](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/parser/sql.y),
    99  originally copied from Postgres and stripped down, and then gradually
   100  grown organically with ever-more SQL support. The process of parsing
   101  transforms a `string` into an array of ASTs (Abstract Syntax Trees),
   102  one for each statement. The AST nodes are structs defined in the
   103  `sql/parser` package, generally of two types - statements and
   104  expressions. Expressions implement a common interface useful for
   105  applying tree transformations. These ASTs will later be transformed by
   106  the `planner` into an execution plan.
   107  
   108  ### Statement Execution
   109  
   110  With a list of statements in hand, `Executor.execRequest()` goes
   111  through them in order and executes one transaction's worth of
   112  statements at a time (i.e. groups of statements between a `BEGIN` and
   113  `COMMIT/ROLLBACK` statements, or single statements executed outside of
   114  a transaction). If the session had an open transaction after execution
   115  of the previous batch, we continue consuming statements until a
   116  `COMMIT/ROLLBACK`. This "consuming of statements" is done by the call
   117  to
   118  [`runTxnAttempt`](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L587);
   119  this function returns after executing statements until the
   120  `COMMIT/ROLLBACK` has been encountered.
   121  
   122  There is an impedance mismatch that has to be explained here, around
   123  the interfacing of the SQL `Executor/session` code, which is
   124  stream-oriented (with statements being executed one at a time possibly
   125  within the scope of SQL transactions) and CockroachDB's Key/Value (KV)
   126  interface, which is request oriented with transactions explicitly
   127  attached to every request. The most interesting interface for the KV
   128  layer of the database is the
   129  [`Txn.Exec()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/txn.go#L465)
   130  method. `Txn` lives in the `internal/client` package, which contains
   131  the KV client interface (the "client" and the server in this context
   132  are both internal to CockroachDB, although we used to expose the KV
   133  interface externally in the past and it's not out of the question that
   134  we'll do it again in the future). `Txn` represents a KV transaction;
   135  there's generally one associated with the SQL session, reused between
   136  client ping-pongs.
   137  
   138  The `Txn.Exec` interface takes a callback and some execution options
   139  and, based on those options, executes the callback possibly multiple
   140  times and commits the transaction afterwards. If allowed by the
   141  options, the callback might be called multiple times, to deal with
   142  retries of transactions that are [sometimes
   143  necessary](https://www.cockroachlabs.com/docs/stable/transactions.html#transaction-retries)
   144  in CockroachDB (usually because of data contention). The SQL
   145  `Executor` might or might not want to let the KV client perform such
   146  retries automatically.
   147  
   148  To hint at the complications: a single SQL
   149  statement executed outside of a SQL transaction (i.e. an "implicit
   150  transaction") can be safely retried. However, a SQL transaction
   151  spanning multiple client requests will have different statements
   152  executed in different callbacks passed to `Txn.Exec()`; as such, it is
   153  not sufficient to retry one of these callbacks - we have to retry all
   154  the statements in the transaction, and generally some of these
   155  statements might be conditional on the client's logic and thus cannot
   156  be retried verbatim (i.e. different results for a `SELECT` might
   157  trigger different subsequent statements). In this case, we bubble up a
   158  retryable error to the client; more details about this can be read in
   159  our [transaction
   160  documentation](https://www.cockroachlabs.com/docs/stable/transactions.html#client-side-intervention). This
   161  complexity is captured in
   162  [`Executor.execRequest()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L495),
   163  which has logic for setting the different execution options and
   164  contains a [suitable
   165  callback](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L604)
   166  [passed to
   167  `Txn.Exec()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L643);
   168  this callback will call
   169  [`runTxnAttempt()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L621). The
   170  statement execution code path continues inside the callback, but it is
   171  worth noting that, from this moment on, we have interfaced with the
   172  (client of the) KV layer and everything below is executing in the
   173  context of a KV transaction.
   174  
   175  ### Building execution plans
   176  
   177  Now that we have figured out what (KV) transaction we're running inside
   178  of, we are concerned with executing SQL statements one at a
   179  time. `runTxnAttempt()` has a few layers below it dealing with the
   180  various states a SQL transaction can be in
   181  ([open](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L966s)
   182  /
   183  [aborted](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L887)
   184  / [waiting for a user
   185  retry](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L944),
   186  etc.), but the interesting one is
   187  [execStmt](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L1258). This
   188  guy [creates an "execution
   189  plan"](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L1261)
   190  for a statement and runs it.
   191  
   192  An execution plan in CockroachDB is a tree of
   193  [`planNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad/pkg/sql/plan.go#L72)
   194  nodes, similar in spirit to the AST but, this time, containing
   195  semantic information and also runtime state. This tree is built by
   196  [`planner.makePlan()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/plan.go#L199),
   197  which takes a parsed statement and returns the root of the `planNode`
   198  tree after having performed all the semantic analysis and various
   199  transformations. The nodes in this tree are actually "executable"
   200  (they have `Start()` and `Next()` methods), and each one will consume
   201  data produced by its children (e.g. a `JoinNode` has [`left and
   202  right`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/join.go#L125)
   203  children whose data it consumes).
   204  
   205  Currently building the execution plan, performing semantic analysis
   206  and applying various transformations is a pretty ad-hoc process, but
   207  we are working on replacing the code with a more structured process and
   208  separating the IR (Intermediate Representation) used for analysis and
   209  transforms from the runtime structures (see this WIP
   210  RFC)[https://github.com/cockroachdb/cockroach/pull/10055/files#diff-542aa8b21b245d1144c920577333ceed].
   211  
   212  In the meantime, the `planner` [looks at the type of the
   213  statement](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/plan.go#L248)
   214  at the top of the AST and, for each statement type, invokes a specific
   215  method that builds the execution plan. For example, the tree for a
   216  `SELECT` statement is produced by
   217  [`planner.SelectClause()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L257). Notice
   218  how different aspects of a `SELECT` statement are handled there: a
   219  `scanNode` is created
   220  (`renderNode.initFrom()`->...->
   221  [`planner.Scan()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/data_source.go#L441))
   222  to scan a table, a `WHERE` clause is transformed into an expression
   223  and assigned to a
   224  `filterNode`,
   225  an `ORDER BY` clause is [turned into a
   226  `sortNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L296),
   227  etc. In the end, a
   228  [`selectTopNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L320)
   229  is produced, which in fact is a tree of a `groupNode`, a `windowNode`,
   230  a `sortNode`, a `distinctNode` and a `renderNode` wrapping a
   231  `scanNode` acting as an original data source).
   232  
   233  Finally, the execution plan is simplified and optimized somewhat; this
   234  includes removing the `selectTopNode` wrappers and eliding all no-op
   235  intermediate nodes.
   236  
   237  To make this notion of the execution plan more concrete, consider one
   238  actually "rendered" by the `EXPLAIN` statement:
   239  
   240  ```sql
   241  root@:26257> create table customers(
   242  name string primary key,
   243  address string,
   244  state string,
   245  index SI (state)
   246  );
   247  
   248  root@:26257> insert into customers values
   249  ('Google', '1600 Amphitheatre Parkway', 'CA'),
   250  ('Apple', '1 Infinite Loop', 'CA'),
   251  ('IBM', '1 New Orchard Road ', 'NY');
   252  
   253  root@:26257> EXPLAIN(EXPRS,NOEXPAND,NOOPTIMIZE,METADATA) SELECT * FROM customers WHERE address like '%Infinite%' ORDER BY state;
   254  +-------+--------+----------+---------------------------+------------------------+----------+
   255  | Level |  Type  |  Field   |        Description        |        Columns         | Ordering |
   256  +-------+--------+----------+---------------------------+------------------------+----------+
   257  |     0 | select |          |                           | (name, address, state) | +state   |
   258  |     1 | nosort |          |                           | (name, address, state) | +state   |
   259  |     1 |        | order    | +@3                       |                        |          |
   260  |     1 | render |          |                           | (name, address, state) |          |
   261  |     1 |        | render 0 | name                      |                        |          |
   262  |     1 |        | render 1 | address                   |                        |          |
   263  |     1 |        | render 2 | state                     |                        |          |
   264  |     2 | filter |          |                           | (name, address, state) |          |
   265  |     2 |        | filter   | address LIKE '%Infinite%' |                        |          |
   266  |     3 | scan   |          |                           | (name, address, state) |          |
   267  |     3 |        | table    | customers@primary         |                        |          |
   268  +-------+--------+----------+---------------------------+------------------------+----------+
   269  ```
   270  
   271  You can see data being produced by a `scanNode`, being filtered by a
   272  `filterNode` (presented as "filter"), and then sorted by a `sortNode`
   273  (presented as "nosort", because we have turned off order analysis with
   274  NOEXPAND and the sort node doesn't know yet whether sorting is
   275  needed), wrapped in a `selectTopNode` (presented as "select").
   276  
   277  With plan simplification turned on, the EXPLAIN output becomes:
   278  
   279  ```
   280  root@:26257> EXPLAIN (EXPRS,METADATA) SELECT * FROM customers WHERE address LIKE '%Infinite%' ORDER BY state;
   281  +-------+------+--------+---------------------------+------------------------+--------------+
   282  | Level | Type | Field  |        Description        |        Columns         |   Ordering   |
   283  +-------+------+--------+---------------------------+------------------------+--------------+
   284  |     0 | sort |        |                           | (name, address, state) | +state       |
   285  |     0 |      | order  | +state                    |                        |              |
   286  |     1 | scan |        |                           | (name, address, state) | +name,unique |
   287  |     1 |      | table  | customers@primary         |                        |              |
   288  |     1 |      | spans  | ALL                       |                        |              |
   289  |     1 |      | filter | address LIKE '%Infinite%' |                        |              |
   290  +-------+------+--------+---------------------------+------------------------+--------------+
   291  ```
   292  
   293  #### Expressions
   294  
   295  A subset of ASTs are
   296  [`parser.Expr`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/expr.go#L25),
   297  representing various "expressions" - parts of statements that can
   298  occur in many various places - in a `WHERE` clause, in a `LIMIT`
   299  clause, in an `ORDER BY` clause, as the projections of a `SELECT`
   300  statement, etc. Expressions nodes implement a common interface so that
   301  a [visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern) can
   302  be applied to them for different transformations and
   303  analysis. Regardless of where they appear in the query, all
   304  expressions need some common processing (e.g. names appearing in them
   305  need to be resolved to columns from data sources). These tasks are
   306  performed by
   307  [`planner.analyzeExpr`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/analyze.go#L1596). Each
   308  `planNode` is responsible for calling `analyzeExpr` on the expressions
   309  it contains, usually at node creation time (again, we hope to unify
   310  our execution planning more in the future).
   311  
   312  `planner.analyzeExpr` performs the following tasks:
   313  
   314  1. resolving names (the `colA` in `select 3 * colA from MyTable` needs
   315     to be replaced by an index within the rows produced by the underlying
   316     data source (usually a `scanNode`))
   317  2. normalization (e.g. `a = 1 + 1` -> `a = 2`, ` a not between b and c` -> `(a < b) or (a > c)`)
   318  3. type checking (see [the typing
   319     RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160203_typing.md)
   320     for an in-depth discussion of Cockroach's typing system).
   321  
   322     1. constant folding (e.g. `1 + 2` becomes `3`): we perform exact
   323        arithmetic using [the same library used by the Go
   324        compiler](https://golang.org/pkg/go/constant/) and classify all the
   325        constants into two categories: [numeric -
   326        `NumVal`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L108)
   327        or [string-like -
   328        `StrVal`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L281). These
   329        representations of the constants are smart enough to figure out the
   330        set of types that can represent the value
   331        (e.g. [`NumVal.AvailableTypes`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L188)
   332        - `5` can be represented as `int, decimal or float`, but `5.4` can
   333        only be represented as `decimal or float`) This will come in useful
   334        in the next step.
   335  
   336     2. type inference and propagation: this analysis phase assigns a
   337        result type to an expression, and in the process types all the
   338        sub-expressions. Typed expressions are represented by the
   339        [`TypedExpr`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/expr.go#L48)
   340        interface, and they are finally able to evaluate themselves to a
   341        result value through the `Eval` method. The typing algorithm is
   342        presented in detail in the typing RFC: the general idea is that
   343        it's a recursive algorithm operating on sub-expressions; each level
   344        of the recursion may take a hint about the desired outcome, and
   345        each expression node takes that hint into consideration while
   346        weighting what options it has. In the absence of a hint, there's
   347        also a set of "natural typing" rules. For example, a `NumVal`
   348        described above
   349        [checks](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L61)
   350        whether the hint is compatible with its list of possible
   351        types. This process also deals with [`overload
   352        resolution`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/type_check.go#L371)
   353        for function calls and operators.
   354  4. replacing sub-query syntax nodes by a `sql.subquery` execution plan
   355     node.
   356  
   357  A note about sub-queries: consider a query like `select * from
   358  Employees where DepartmentID in (select DepartmentID from Departments
   359  where NumEmployees > 100)`. The query on the `Departments` table is
   360  called a sub-query. Subqueries are recognized and replaced with an
   361  execution node by
   362  [`subqueryVisitor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/subquery.go#L294). The
   363  subqueries are then run and replaced by their results through the
   364  [`subqueryPlanVisitor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/subquery.go#L194). This
   365  is usually done by various top-level nodes when they start execution
   366  (e.g. `renderNode.Start()`).
   367  
   368  ### Notable `planNodes`
   369  
   370  As hinted throughout, execution plan nodes are responsible for
   371  executing parts of a query. Each one consumes data from lower-level
   372  nodes, performs some logic, and feeds data into a higher-level one.
   373  
   374  After being constructed, their main methods are
   375  [`Start`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/plan.go#L142),
   376  which initiates the processing, and
   377  [`Next`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/plan.go#L149),
   378  which is called repeatedly to produce the next row.
   379  
   380  To tie this to the [SQL Executor](#sql-executor) section above,
   381  `executor.execLocal()`,
   382  the method responsible for executing one statement, calls
   383  `plan.Next()` repeatedly and accumulates the results.
   384  
   385  Consider some `planNode`s involved in running a `SELECT`
   386  statement, using the table defined above and
   387  
   388  ```sql
   389  SELECT * FROM customers WHERE State LIKE 'C%' AND strpos(address, 'Infinite') != 0 ORDER BY Name;
   390  ```
   391  
   392  as a slightly contrived example. This is supposed to return customers
   393  from states starting with "C" and whose address contains the string
   394  "Infinite". To get excited, let's see the query plan for this
   395  statement:
   396  
   397  ```sql
   398  root@:26257> EXPLAIN(EXPRS) SELECT * FROM customers WHERE State LIKE 'C%' and strpos(address, 'Infinite') != 0 order by name;
   399  +-------+------------+--------+----------------------------------+
   400  | Level |    Type    | Field  |           Description            |
   401  +-------+------------+--------+----------------------------------+
   402  |     0 | sort       |        |                                  |
   403  |     0 |            | order  | +name                            |
   404  |     1 | index-join |        |                                  |
   405  |     2 | scan       |        |                                  |
   406  |     2 |            | table  | customers@SI                     |
   407  |     2 |            | spans  | /"C"-/"D"                        |
   408  |     2 |            | filter | state LIKE 'C%'                  |
   409  |     2 | scan       |        |                                  |
   410  |     2 |            | table  | customers@primary                |
   411  |     2 |            | filter | strpos(address, 'Infinite') != 0 |
   412  +-------+------------+--------+----------------------------------+
   413  ```
   414  
   415  So the plan produced for this query, from top (highest-level) to
   416  bottom, looks like:
   417  
   418  ```
   419  sortNode -> indexJoinNode -> scanNode (index)
   420  			  -> scanNode (PK)
   421  ```
   422  
   423  Before we inspect the nodes in turn, one thing deserves explanation:
   424  how did the `indexJoinNode` (which indicates that the query is going
   425  to use the "SI" index) come to be? The fact that this query uses an
   426  index is not apparent in the syntactical structure of the `SELECT`
   427  statement, and so this plan is not simply a product of the mechanical
   428  tree building hinted to above. Indeed, there's a step that we haven't
   429  mentioned before: ["plan
   430  expansion"](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/expand_plan.go#L28). Among
   431  other things, this step performs "index selection" (more information
   432  about the algorithms currently used for index selection can be found
   433  in [Radu's blog
   434  post](https://www.cockroachlabs.com/blog/index-selection-cockroachdb-2/)). We're
   435  looking for indexes that can be scanned to efficiently retrieve only
   436  rows that match (part of) the filter. In our case, the "SI" index
   437  (indexing the state) can be scanned to efficiently retrieve only the
   438  rows that are candidates for satisfying the `state LIKE 'C%'`
   439  expression (in an ecstasy to agony moment, we see that our index
   440  selection / expression normalization code [is smart
   441  enough](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/analyze.go#L1436)
   442  to infer that `state LIKE 'C%'` implies `state >= 'C' AND state <
   443  'D'`, but is not smart enough to infer that the two expressions are in
   444  fact equivalent and thus the filter can be elided altogether). We
   445  won't go into plan expansion or index selection here, but the index
   446  selection process happens [in the expansion of the
   447  `SelectNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/expand_plan.go#L283)
   448  and, as a byproduct, produces `indexJoinNode`s configured with the
   449  index spans to be scanned.
   450  
   451  Now let's see how these `planNode`s run:
   452  
   453  1. [`sortNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L31):
   454     The `sortNode` sorts the rows produced by its child and corresponds to
   455     the `ORDER BY` SQL clause. The
   456     [constructor](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L60)
   457     has a bunch of logic related to the quirky rules for name resolution
   458     from SQL92/99. Another interesting fact is that, if we're sorting by a
   459     non-trivial expression (e.g. `SELECT a, b ... ORDER BY a + b`), we
   460     need the `a + b` values (for every row) to be produced by a
   461     lower-level node. This is achieved through a pattern that's also
   462     present in other node: the lower node capable of evaluating
   463     expressions and rendering their results is the `renderNode`; the
   464     `sortNode` constructor checks if the expressions it needs are already
   465     rendered by that node and, if they are not, asks for them to be
   466     produced through the
   467     [`renderNode.addOrMergeRenders()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L206)
   468     method.  The actual sorting is performed in the `sortNode.Next()`
   469     method. The first time it is called, [it consumes all the data
   470     produced by the child
   471     node](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L359)
   472     and accumulates it into `n.sortStrategy` (an interface hiding multiple
   473     sorting algorithms). When the last row is consumed,
   474     `n.sortStrategy.Finish()` is called, at which time the sorting
   475     algorithm finishes its processing. Subsequent calls to
   476     `sortNode.Next()` simply iterate through the results of sorting
   477     algorithm.
   478  
   479  2. [`indexJoinNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L30):
   480     The `indexJoinNode` implements joining of results from an index with
   481     the rows of a table. It is used when an index can be used for a query,
   482     but it doesn't contain all the necessary columns; columns not
   483     available in the index need to be retrieved from the Primary Key (PK)
   484     key-values. The `indexJoinNode` sits on top of two scan nodes - one
   485     configured to scan the index, and one that is constantly reconfigured
   486     to do "point lookups" by PK. In the case of our query, we can see that
   487     the "SI" index is used to read a compact set of rows that match the
   488     "state" filter but, since it doesn't contain the "address" columns,
   489     the PK also needs to be used. Each index KV pair contains the primary
   490     key of the row, so there is enough information to do PK
   491     lookups. [`indexJoinNode.Next`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L273)
   492     keeps reading rows from the index and, for each one, adds a spans to
   493     be read by the PK. Once enough such spans have been batched, they are
   494     all [read from the
   495     PK](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L257). As
   496     described in the section on [SQL rows to KV
   497     pairs](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#data-mapping-between-the-sql-model-and-kv))
   498     from the design doc, each SQL row is represented as a single KV pair
   499     in the indexes, but as multiple consecutive rows in the PK
   500     (represented by a "key span").
   501  
   502     An interesting detail has to do with
   503     how filters are handled: note that the `state LIKE 'C%'` condition is
   504     evaluated by the index scan, and the `strpos(address, 'Infinite') !=
   505     0` condition is evaluated by the PK scan. This is nice because it
   506     means that we will be filtering as much as we can on the index side and
   507     we will be doing fewer expensive PK lookups. The code that figures out
   508     which conjunction is to be evaluated where is in `splitFilter()`,
   509     called by the [`indexJoinNode`
   510     constructor](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L180).
   511  
   512  3. [`scanNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/scan.go#L33):
   513     The `scanNode` generally constitutes the source of a `renderNode` or `filterNode`;
   514     it is responsible for scanning over the key/value pairs for a table or
   515     index and reconstructing them into rows. This node is starting to
   516     smell like rubber meeting a road, because we are getting closer to the
   517     actual data - the monolithic, distributed KV map. You'll see that the
   518     [`Next()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/scan.go#L214)
   519     method is not particularly climactic, since it delegates the work to a
   520     `rowFetcher`, described below. There's one interesting thing that the
   521     `scanNode` does: it [runs a filter
   522     expression](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/scan.go#L233),
   523     just like the `filterNode`. That is because we are trying to push down
   524     parts of the `WHERE` clause as far as possible. This is generally a
   525     work in progress, see `filter_opt.go`. The idea is
   526     that a query like
   527  ```sql
   528  /* Select the orders placed by each customer in the first year of membership. */
   529  SELECT * FROM Orders o inner join Customers c ON o.CustomerID = c.ID WHERE Orders.amount > 10 AND Customers.State = 'NY' AND age(c.JoinDate, o.Date) < INTERVAL '1 year'
   530  ```
   531     is going to be compiled into two `scanNode`s, one for `Customers`,
   532     one for `Orders`. Each one of them can do the part of filtering
   533     that refers exclusively to their respective tables, and then the
   534     higher-level `joinNode` only needs to evaluate expressions that
   535     need data from both (i.e. `age(c.JoinDate, o.Date) < INTERVAL '1
   536     year'`).
   537  
   538     Let's continue downwards, looking at the structures that the
   539     `scanNode` uses for actually reading data.
   540  
   541     1. [`rowFetcher`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/rowfetcher.go#L35):
   542        The `rowFetcher` is responsible for iterating through key-value
   543        pairs, figuring out where a SQL table or index row ends (remember
   544        that a SQL row is potentially encoded in multiple KV entries), and
   545        decoding all the keys and values in SQL column values, dealing with
   546        differences between the primary index and other indexes and with
   547        the [layout of a
   548        table](https://www.cockroachlabs.com/docs/stable/column-families.html). For
   549        details on the mapping between SQL rows and KV pairs, see the
   550        [corresponding section from the Design
   551        Doc](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#data-mapping-between-the-sql-model-and-kv) and the [encoding tech note](encoding.md).
   552  
   553        The `rowFetcher` also [performs
   554        decoding](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/table.go#L953)
   555        from on-disk byte arrays to the representation of data that we do
   556        most processing on: implementation of the
   557        [`parser.Datum`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/datum.go#L57)
   558        interface. For details on what the on-disk format is for different
   559        data types, browse around the [`util/encoding
   560        directory`](https://github.com/cockroachdb/cockroach/tree/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/util/encoding).
   561  
   562        For actually reading a KV pair from the database, the `rowFetcher` delegates to the `kvBatchFetcher`.
   563  
   564     2. [`kvBatchFetcher`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/kvfetcher.go#L84):
   565        The `kvBatchFetcher` finally reads data from the KV database. It
   566        understands nothing of SQL concepts, such as tables, rows or
   567        columns. When it is created, it is configured with a number of "key
   568        spans" that it needs to read (these might be, for example, a single
   569        span for reading a whole table, or a couple of spans for reading
   570        parts of the PK or of an index).
   571  
   572        To actually read data from the KV database, the `kvBatchFetcher` uses the
   573        KV layer's "client" interface, namely
   574        [`client.Batch`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/batch.go#L30). This
   575        is where the "SQL layer" interfaces with the "KV layer" - the
   576        `kvBatchFetcher` will
   577        [build](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/kvfetcher.go#L197)
   578        such `Batch`es of requests, [send them for
   579        execution](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/kvfetcher.go#L203)
   580        in the context of the KV transaction (remember the `Transaction`
   581        mentioned in the [Statement Execution
   582        section](#StatementExecution)), [read the
   583        results](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sqlbase/kvfetcher.go#L220)
   584        and return them to the hierarchy of `planNodes`. The requests being
   585        sent to the KV layer, in the case of this read-only query, are
   586        [`ScanRequest`s](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L204).
   587  
   588  The rest of this document will walk through the "execution" of KV
   589  requests, such as the ones sent by the `kvBatchFetcher`.
   590  
   591  ## KV
   592  
   593  The KV layer of CockroachDB deals with execution of "requests". The
   594  protocol-buffer-based API is defined in
   595  [api.proto](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto),
   596  listing the various types of requests and response. In practice, the
   597  KV's client always sends
   598  [`BatchRequest`s](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L848),
   599  a generic request containing a collection of other requests. All
   600  requests have a
   601  [Header](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L788)
   602  which contains routing information (which replica a request is
   603  destined for) and [transaction
   604  information](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L818)
   605  - in the context of which transaction is the request to be executed.
   606  
   607  ### The KV client interface
   608  
   609  Clients "send" KV requests using a client interface (currently this
   610  interface is internal, used by SQL, but we might offer it directly to
   611  users in some form in the future). This client interface contains
   612  primitives for [starting a (KV)
   613  transaction](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/db.go#L452)
   614  (remember, the SQL `Executor` uses this to run every statement in the
   615  context of a transaction). Afterwards, a
   616  [`Txn`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/txn.go#L36)
   617  object is available for [executing requests in the context of that
   618  transaction](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/txn.go#L295)
   619  - this is what, for example, the `kvBatchFetcher` uses. If you trace what
   620  happens inside that `Txn.Run()` method you eventually get to
   621  [`txn.db.sender.Send(...,
   622  batch)`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/txn.go#L587):
   623  the request starts percolating through a hierarchy of
   624  [`Sender`s](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/internal/client/sender.go#L27)
   625  - objects that perform various peripheral tasks and ultimately route
   626  the request to a replica for execution. `Sender`s have a single method
   627  - `Send()` - which ultimately passes the request to the lower
   628  level. Let's go down this "sending" rabbit hole: `TxnCoordSender ->
   629  DistSender -> Node -> Stores -> Store -> Replica`. The first two run
   630  on the same node as the that received the SQL query and is doing the
   631  SQL processing (the "gateway node"), the others run on the nodes
   632  responsible for the data that is being accessed (the "range node").
   633  
   634  ### TxnCoordSender
   635  
   636  The top-most `client.Sender` is the
   637  [`TxnCoordSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/txn_coord_sender.go#L140). A
   638  TxnCoordSender is responsible for dealing with transactions' state
   639  (see the [Transaction Management section of the design
   640  doc](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md)). After
   641  a transaction is started, the TxnCoordSender starts asynchronously
   642  sending heartbeat messages to that transaction's "txn record", to keep
   643  it live. It also keeps track of each written key or key range over the
   644  course of the transaction. When the transaction is committed or
   645  aborted, it clears accumulated write intents for the transaction. All
   646  requests being performed as part of a transaction have to go through
   647  the same `TxnCoordSender` so that all write intents are accounted for
   648  and eventually cleaned up.  After performing this bookkeeping, the
   649  request is [passed to the
   650  `DistSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/txn_coord_sender.go#L418).
   651  
   652  ### DistSender
   653  
   654  The
   655  [`DistSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L133)
   656  is truly a workhorse: it handles the communication between the gateway
   657  node and the (possibly many) range nodes, putting the "distributed" in
   658  "distributed database". It receives `BatchRequest`s, looks at the
   659  requests inside the batch, figures out what range each command needs
   660  to go to, finds the nodes/replicas responsible for that range, routes
   661  the requests there and then collects and reassembles the results.
   662  
   663  Let's go through the code a bit:
   664  
   665  1. The request is subdivided into ranges: `DistSender.Send()` calls
   666     [`DistSender.divideAndSendBatchToRanges()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L620)
   667     which
   668     [iterates](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L720)
   669     over the constituent ranges of requests by using a
   670     [`RangeIterator`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_iter.go#L27)
   671     (a single request, such as a `ScanRequest` can refer to a key span
   672     that might straddle potentially many ranges). A lot of things hide
   673     behind this innocent-looking iteration: the cluster's range metadata
   674     needs to be accessed in order to find the mapping of keys to ranges
   675     (info on this metadata can be found [in the Range Metadata
   676     section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#range-metadata)
   677     of the design doc). Range metadata is stored as regular data in the
   678     cluster, in a two-level index mapping range end keys to descriptors
   679     about the replicas of the respective range (the ranges storing this
   680     index are called "meta-ranges"). The `RangeIterator` logically
   681     iterates over these descriptors, in range key order. Brace yourselves:
   682     for moving from one range to the next, the iterator [calls back into
   683     the
   684     `DistSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_iter.go#L156),
   685     which knows how to find the descriptor of the range responsible for
   686     one particular key. The `DistSender`
   687     [delegates](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L441)
   688     resolving a key to a descriptor to the
   689     [`rangeDescriptorCache`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L76)
   690     (a LRU tree cache, indexed by range end key). This cache
   691     desynchronized with reality as ranges in a cluster split or move
   692     around; when an entry is discovered to be stale, we'll see below that
   693     the `DistSender` removes it from the cache.
   694  
   695     In the happy case, the cache has information about a descriptor
   696     covering the key we're interested in and [it returns
   697     it](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L295). In
   698     the unhappy case, it needs to [perform a
   699     scan](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L323)
   700     of a meta-range. To do so, we need to know the descriptor of the
   701     meta-range containing the descriptor we're interested in, which is
   702     retrieved using [a recursive call into the
   703     cache](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L431). This
   704     recursion cannot go on forever - the descriptor of a regular range is
   705     in a meta2-range (a 2nd level index range), and the descriptors for
   706     meta2-ranges are present in the (one and only) meta1-range. Once the
   707     meta-descriptor of the range we want to scan is known, the cache
   708     [delegates back again to the
   709     `DistSender`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/range_cache.go#L439),
   710     which [sends a
   711     `RangeLookupRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L305)
   712     KV command addressed directly to the meta range (so the `DistSender`
   713     is not recursively involved in routing this request).
   714  
   715  2. Each sub-request (partial batch) is sent to its range. This is done
   716     through the call to
   717     [`DistSender.sendPartialBatchAsync()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L793)
   718     which
   719     [truncates](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L894)
   720     all the requests in the batch to the current range and then it
   721     [sends](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L926)
   722     the truncated batch to a range. All these partial batches are sent
   723     concurrently.
   724  
   725     `sendPartialBatch()` is the level at which error stemming from stale
   726     `rangeDescriptorCache` information are handled: the range descriptor
   727     that's detected to be stale is
   728     [evicted](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L974)
   729     from the cache and the partial batch is
   730     [reprocessed](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L982).
   731  
   732  3. Sending a partial batch to a single range implies selecting the
   733     right replica of that range and performing an RPC to it. By default,
   734     each range is replicated three ways, but only one of the three
   735     replicas is the "lease holder" - the temporarily designed owner of
   736     that range, in charge of coordinating all reads and writes to it (see
   737     the [Range Leases
   738     section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#range-leases)
   739     in the design doc). Figuring out which replica has the lease is done
   740     through another cache - the
   741     [`leaseHolderCache`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/leaseholder_cache.go#L28),
   742     whose information can also get stale.
   743  
   744     The method of the `DistSender` dealing with this is
   745     [`sendSingleRange`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L451). It
   746     will use the cache to send the request to the lease holder, but it's
   747     also prepared to try the other replicas, in [order of
   748     "proximity"](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L461). The
   749     replica that the cache says is the leaseholder is simply [moved to the
   750     front](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L467)
   751     of the list of replicas to be tried and then an [RPC is
   752     sent](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L474)
   753     to all of them, in order.
   754  
   755  4. Sending the RPCs is initiated by
   756     [`sendToReplicas`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1095),
   757     which sends the request to the [first
   758     one](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1136)
   759     and subsequently to [the
   760     other](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1234),
   761     until one succeeds or returns a processing error. Processing errors
   762     are distinguished from routing errors by
   763     [`handlePerReplicaError`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1258)
   764     which, among others, handles [wrong information in the
   765     `leaseHolderCache`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L1273).
   766  
   767  5. Actually sending the RPCs is hidden behind the [`Transport
   768     interface`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/transport.go#L87). Concretely,
   769     [`grpcTransport.SendNext()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/transport.go#L159)
   770     does [gRPC](http://www.grpc.io/) calls to the nodes containing the
   771     destination replicas, namely to a service implementing the [`Internal`
   772     service](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/roachpb/api.proto#L892).
   773  
   774  6. The (async) responses from the different replicas are
   775     [combined](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/kv/dist_sender.go#L691)
   776     into a single `BatchResponse`, which is ultimately returned from the
   777     `Send()` method.
   778  
   779  We've now gone through the relevant things that happen on the gateway
   780  node. Further, we're going to look at what happens on the "remote"
   781  side - on each of the ranges.
   782  
   783  ### RPC server - Node and Stores
   784  
   785  We've seen how the `DistSender` splits `BatchRequest` into partial
   786  batches, each containing commands local to a single replica, and how
   787  these commands are sent to the lease holders of their ranges through
   788  RPCs. We're now moving to the "server" end of these RPCs. The struct
   789  that implements the RPC service is
   790  [`Node`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/server/node.go#L850). The
   791  `Node` doesn't do anything of great relevance; it
   792  [delegates](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/server/node.go#L825)
   793  the request to its
   794  [`Stores`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/stores.go#L36)
   795  member which represents a collection of "stores" (on-disk databases
   796  imagined to be one per physical disk, see [the Architecture
   797  section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#architecture)
   798  of the design doc). The `Stores` implements the `Sender` interface,
   799  just like the gateway layers that we've seen before, resuming the
   800  pattern of wrapping another `Sender` and passing requests down through
   801  the `Send()` method.
   802  
   803  `Stores.Send()`
   804  [identifies](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/stores.go#L160)
   805  which particular store contains the destination replica (based on
   806  routing info filled into the request by the `DistSender`) and
   807  [routes](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/stores.go#L187)
   808  the request there. One interesting thing that the `Stores` does, in
   809  case requests from the current transactions have already been
   810  processed on this node, is [update the upper
   811  bound](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/stores.go#L166)
   812  on the uncertainty interval to be used by the current request (see the
   813  ["Choosing a Timestamp" section of the design
   814  doc](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md))
   815  for details on uncertainty intervals). The uncertainty interval
   816  dictates which timestamps for values are ambiguous because of clock
   817  skew between nodes (the values for which don't know if they were
   818  written before or after the serialization point of the current
   819  txn). This code realizes that, if a request from the current txn has
   820  been processed on this node before, no value written after that node's
   821  timestamp at the time of that other request processing is ambiguous.
   822  
   823  ### Store - intent resolution
   824  
   825  A
   826  [`Store`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L366)
   827  represents one physical disk device. For our purposes, a `Store`
   828  mostly
   829  [delegates](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2485)
   830  the request to a `replica`, but it has one important role - in case
   831  the request runs into "write intents" (i.e. uncommitted values), it
   832  deals with those intents. This handles read-write and write-write
   833  conflicts between transactions. Notice that the code calling the
   834  `replica` is inside [a big infinite retry
   835  loop](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2454)
   836  and that a bunch of the code inside it deals with
   837  [`WriteIntentError`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2494). When
   838  we see such an error, we [try to "resolve"
   839  it](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2522)
   840  using the
   841  [`intentResolver`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/intent_resolver.go#L43). Resolving
   842  means figuring out if the transaction to which the intent belongs is
   843  still pending (it might already be committed or aborted, in which case
   844  the intent is "resolved"), or possibly "pushing" the transaction in
   845  question (forcing it to restart at a higher timestamp, such that it
   846  doesn't conflict with the current txn). If the conflicting txn is no
   847  longer pending or if it was pushed, then the intents can be properly
   848  resolved (i.e. either replaced by a committed value, or simply
   849  discarded). The first part - figuring out the txn status or pushing it
   850  - is done in
   851  [`intentResolver.maybePushTransaction`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/intent_resolver.go#L220):
   852  we can see that a series of `PushTxnRequest`s are batched and sent to
   853  the cluster (meaning the hierarchy of `Sender`s on the current node
   854  will be used, top to bottom, to route the requests to the various
   855  transaction records - see the ["Transaction execution flow"
   856  section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md)
   857  of the design doc). In case the transaction we're trying to push is
   858  still pending, the decision about whether or not the push is
   859  successful is done [deep in the processing of the
   860  `PushTxnRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L1380)
   861  (several levels below the `Store` level we're discussing here, in the
   862  stack for the spinned-off `PushTxnRequest`) based on the relative
   863  priorities of the pusher/pushee txns.
   864  
   865  The second part - replacing the intents that can now be resolved, is
   866  done through a call to
   867  [`intentResolver.resolveIntents`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/intent_resolver.go#L97).
   868  Back where we left off in
   869  [`Store.Send()`](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2522),
   870  the call to the `intentResolver`, if successful, will change the
   871  `resolved` field of the `WriteIntentError` which will cause us to
   872  [retry
   873  immediately](https://github.com/cockroachdb/cockroach/blob/801a90b25b217e4ed5a833b15151c252bde0d1b0/pkg/storage/store.go#L2533). Otherwise,
   874  we'll retry according to an exponential backoff, waiting for the still
   875  pending transaction that we couldn't push to complete - we don't want
   876  to retry too soon, as we'd almost surely run into the same intent
   877  again (we're working to replace this "polling"-based mechanism for
   878  waiting for a conflicting txn to finish with something more reactive).
   879  
   880  ### Replica - executing reads, proposing Raft commands
   881  
   882  A
   883  [`Replica`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L246)
   884  represents one copy of range, which in turn is a contiguous keyspace
   885  managed by one instance of the Raft consensus algorithm. The system
   886  tries to keep ranges around 64MB, by default.  The `Replica` is the
   887  final `Sender` in our hierarchy. The role of all the other `Sender`s
   888  was, mostly, to route requests to the `Replica` currently acting as
   889  the lease holder for the range (a _primus inter pares_ `Replica` that
   890  takes on a bunch of coordination responsibilities we'll explore
   891  below).  A replica deals with read requests differently than write
   892  requests. Reads are evaluated directly, whereas writes will enter
   893  another big chapter in their life and go through the Raft consensus
   894  protocol.
   895  
   896  The difference between the paths of read requests vs write requests is
   897  seen immediately: `replica.Send()` quickly [branches
   898  off](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1322)
   899  based on the request type. We'll talk about the read/write paths in
   900  turn.
   901  
   902  ### Read request path
   903  
   904  The first thing that is done for a read request is checking if the
   905  request got to the right place (i.e. the current replica is the lease
   906  holder); remember that a lot of the routing was done based on caches
   907  or out-right guesses. This check is performed by
   908  [`replica.redirectOnOrAcquireLease()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1783),
   909  a rabbit hole in its own right. Let's just say that, in case the
   910  current replica is not the lease holder, `redirectOnOrAcquireLease`
   911  either
   912  [redirects](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L992)
   913  to the lease holder, if there is a valid lease (remember that the
   914  `DistSender` will handle such redirections), or [requests a new
   915  lease](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L981)
   916  otherwise, in the hope that it will become the lease
   917  holder. Requesting a lease is done through the
   918  [`pendingLeaseRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_range_lease.go#L33)
   919  helper struct, which coalesces multiple requests for the same lease
   920  and eventually constructs a
   921  [`RequestLeaseRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_range_lease.go#L140)
   922  and [sends it for
   923  execution](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_range_lease.go#L205)
   924  directly to the replica (as we've seen in other cases, bypassing all
   925  the senders to avoid recursing infinitely). In case a lease is
   926  requested, `redirectOnOrAcquireLease` will
   927  [wait](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1012)
   928  for that request to complete and check if it was successful.
   929  
   930  Once the lease situation has been settled, the next thing to do for
   931  the read is synchronizing it with possible in-flight writes - if a
   932  write to an overlapping key span is in progress, the read might need
   933  to see its value, so we can't race with it; we must wait until the
   934  write is done. This synchronization is done through the
   935  [`CommandQueue`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/command_queue.go#L32)
   936  struct - an interval tree maintaining all the in-flight requests,
   937  indexed by the key or span of keys that they touch. Waiting for the
   938  writes is done inside
   939  [`replica.beginCmds()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1525). Notice
   940  that immediately after figuring out which commands we need to wait
   941  for, we atomically [add the current
   942  read](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1527)
   943  to the command queue in order to block future writes. This overlaps in
   944  spirit with the use of the `TimestampCache` structure described below
   945  and in fact there is a
   946  [proposal](https://forum.cockroachlabs.com/t/why-do-we-keep-read-commands-in-the-command-queue/360)
   947  for not putting reads in the queue. Removal of commands from the queue
   948  is [done
   949  later](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1811)
   950  through the callback returned by `beginCmds`. This epilogue also does
   951  something else that's important: it [records the
   952  read](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1429)
   953  in the
   954  [`TimestampCache`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/timestamp_cache.go#L85),
   955  a bounded in-memory cache from key range to the latest timestamp at
   956  which it was read. This structure serves to protect against violations
   957  of the Snapshot Isolation transaction isolation level (the lowest that
   958  CockroachDB provides) which require that the outcome of reads must be
   959  preserved, i.e. a write of a key at a lower timestamp than a previous
   960  read must not succeed (see the Read-Write Conflicts – Read Timestamp
   961  Cache section in [Matt's blog
   962  post](https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/)). As
   963  we'll see in the writes section, writes consult this structure to make
   964  sure they're not writing "under" a read that has already been
   965  performed.
   966  
   967   Now we're reading to actually evaluate the read - control moves to
   968   [`replica.evaluateBatch()`](https://github.com/cockroachdb/cockroach/blob/75c26e5498ef26a4adde9425cb43682d38ec8ee4/pkg/storage/replica.go#L5225)
   969   which calls
   970   [`replica.evaluateCommand`](https://github.com/cockroachdb/cockroach/blob/75c26e5498ef26a4adde9425cb43682d38ec8ee4/pkg/storage/replica.go#L5286)
   971   for each request in the batch. `evaluateCommand` switches over the request
   972   types using a [helper request to method
   973   map](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L84)
   974   and [passes
   975   execution](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L160)
   976   to the request-specific method. One typical read request is a
   977   [`ScanRequest`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/roachpb/api.proto#L204);
   978   this is evaluated by
   979   [`evalScan`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L343). The
   980   code is very brief - it immediately
   981   [calls](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L374)
   982   a corresponding on the `engine` - a handle to the on-disk
   983   [RocksDB](http://rocksdb.org/) database. Before we dig a bit into
   984   this `engine`, let's look at what `evalScan` will do next: it will
   985   [return
   986   intents](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L360)
   987   to the higher levels. These are intents that the scanning
   988   encountered, but they didn't prohibit it from continuing
   989   (e.g. intents with a timestamp higher than the timestamp at which
   990   we're reading - the read doesn't care if those intents are committed
   991   or not); this is in contrast with intents that do block the read -
   992   those, as we'll see below, are transformed into `WriteIntentError`s
   993   which we've seen that they're handled by the [`Store`](#Store). These
   994   non-interfering intents are collected for cleanup purposes - they
   995   might be garbage left-over by dead transactions and we want to
   996   proactively clean them up. They're returned up the stack until
   997   `replica.addReadOnlyCmd` [tries to clean them
   998   up](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1837)
   999   using our old friend, the `intentResolver`.
  1000  
  1001  ### Engine
  1002  
  1003  We're getting to the bottom of the CockroachDB stack - the
  1004  [`Engine`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/engine.go#L159)
  1005  is an interface abstracting away different on-disk stores. The only
  1006  implementation we currently use is
  1007  [`RocksDB`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/rocksdb.go#L265),
  1008  which is a wrapper around the RocksDB C++ library. We won't go into
  1009  this wrapper other than to say that it uses
  1010  [cgo](https://golang.org/cmd/cgo/) for interfacing with C++ code. We
  1011  also won't go into the RocksDB code which, although it's obviously an
  1012  important part of servicing a request, is not something that
  1013  CockroachDB devs generally deal with.
  1014  
  1015  For reads, the entry point into the `engine` package is
  1016  [`mvccScanInternal()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/mvcc.go#L1480). This
  1017  performs a scan over the KV database, dealing with the data
  1018  representation we use for [MultiVersion Concurrency
  1019  Control](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)
  1020  (MVCC). It iterates over the key/vals of the requested range and
  1021  [appends](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/mvcc.go#L1511)
  1022  each one to the results. The MVCC details, such as the fact that we
  1023  keep multiple versions of each key (for different timestamps) and the
  1024  intents, are handled by
  1025  [`MVCCIterate()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/mvcc.go#L1555),
  1026  which uses an iterator provided by the `Engine` to scan over
  1027  key/vals. It delegates reading key/vals and advancing the iterator to
  1028  [`mvccGetInternal()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/engine/mvcc.go#L659).
  1029  
  1030  ### Write request path
  1031  
  1032  Write requests are conceptually more interesting than reads because
  1033  they're not simply serviced by one node/replica. Instead, they go
  1034  through the Raft consensus algorithm, which maintains an ordered
  1035  commit log, and are then applied by all of a range's replicas (see the
  1036  [Raft
  1037  section](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#raft---consistency-of-range-replicas)
  1038  of the design doc for more details). The replica that initiates this
  1039  process is, just like in the read case, the lease holder. Execution on
  1040  this lease holder is thus broken into two stages - before ("upstream
  1041  of" in code) Raft and below ("downstream of") Raft. The upstream stage
  1042  will eventually block for the corresponding Raft command to be applied
  1043  *locally* (after the command has been applied locally, future reads
  1044  are guaranteed to see its effects).
  1045  
  1046  For what follows we'll introduce some terminology. We've already seen
  1047  that a `replica` (and the KV subsystem in general) receives
  1048  *requests*. In what follows, these requests will be _evaluated_, which
  1049  transforms them to Raft _commands_. The commands in turn are
  1050  _proposed_ to the Raft consensus group and, after the Raft group
  1051  accepts the proposals and commits them, control comes back to the
  1052  `Replica`s (all of the replicas of a range this time, not just the
  1053  lease holder), which _apply_ them.
  1054  
  1055  Execution of write commands, mirroring the reads, starts in
  1056  [`replica.addWriteCmd()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1857). This
  1057  method just contains a retry loop that deals with exceptional cases in
  1058  which requests need to be evaluated repeatedly and delegates to
  1059  [`replica.tryAddWriteCmd`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1872). This
  1060  guy does a number of things:
  1061  
  1062  1. It [waits until overlapping in-flight requests are
  1063     done](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1959)
  1064     and adds the current as an in-flight request to the `CommandQueue`
  1065     (similar to the reads).
  1066  
  1067  2. It checks that the current replica is the lease holder by [calling
  1068     the `redirectOnOrAcquireLease`
  1069     method](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1983),
  1070     just like the reads.
  1071  
  1072  3. It ["applies the timestamp
  1073     cache"](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2002)
  1074     - meaning that the `TimestampCache` we've discussed above is checked
  1075     to see if the write can proceed at the timestamp at which it's trying
  1076     to modify that database. If it can't (because there's been a more
  1077     recent overlapping read), the write's timestamp [is
  1078     bumped](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1688)
  1079     to a timestamp later than any overlapping read).
  1080  
  1081  4. It evaluates the request and proposes resulting Raft commands. It
  1082     all starts with [this call to
  1083     replica.propose()](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2036). We'll
  1084     describe the process below (it will be fun) but, before we do, let's
  1085     see what the current method will do afterwards.
  1086  
  1087  5. The call to `replica.propose` returns a channel that [we'll wait
  1088     on](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2051). This
  1089     is the decoupling point that we've anticipated above - the point where
  1090     we cede control to the Raft machinery. The `replica` doing the
  1091     proposals accepts its role as merely one of many replicas and waits
  1092     for the consensus group to make progress in lock-step. The channel
  1093     will receive a result when the (local) replica has applied the
  1094     respective commands, which can happen only after the commands have
  1095     been committed to the shared Raft log (a global operation).
  1096  
  1097  6. As in the reads case, at the end of the `tryAddWriteCmd` method,
  1098     [an
  1099     epilogue](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L1968)
  1100     will remove the request from the `CommandQueue` and add it to the
  1101     timestamp cache.
  1102  
  1103  #### Evaluation of requests and application of Raft commands
  1104  
  1105  As promised, let's see what happens inside `replica.propose()`. The
  1106  first thing is the process of evaluation, i.e. turning a KV *request*
  1107  into a Raft *command*. This is done through the call to
  1108  [`requestToProposal()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2270),
  1109  which quickly calls
  1110  [`evaluateProposal()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2119),
  1111  which in turn quickly calls the surprisingly-named
  1112  [`applyRaftCommandInBatch`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2158). This
  1113  last method [simulates the execution of the
  1114  request](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3737),
  1115  if you will, and records all the would-be changes to the `Engine` into
  1116  a "batch" (these batches are how RocksDB models transactions). This
  1117  batch will be
  1118  [serialized](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3778)
  1119  into a Raft command. If we were to commit this batch now, the changes
  1120  would be live, but just on this one replica, which would be a
  1121  potential data consistency violation. Instead, we [abort
  1122  it](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3784). It
  1123  will resurrect again when the command "comes out of Raft", as we'll
  1124  see.
  1125  
  1126  The simulation part takes place inside the
  1127  [`executeWriteBatch()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3822)
  1128  method. This takes in the `roachpb.BatchRequest` (the KV request we've
  1129  been dealing with all along), [allocates an
  1130  `engine.Batch`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3835)
  1131  and delegates to
  1132  [`evaluateBatch()`](https://github.com/cockroachdb/cockroach/blob/75c26e5498ef26a4adde9425cb43682d38ec8ee4/pkg/storage/replica.go#L5225). This
  1133  fellow finally
  1134  [iterates](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L4041)
  1135  over the individual requests in the batch and, for each one, calls
  1136  [`evaluateCommand`](https://github.com/cockroachdb/cockroach/blob/75c26e5498ef26a4adde9425cb43682d38ec8ee4/pkg/storage/replica_command.go#L63). We've
  1137  seen `evaluateCommand` before, on the read path. It switches over the
  1138  different types of requests and [calls a method specific to each
  1139  type](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L160). One
  1140  such method would be
  1141  [`evalPut`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L225),
  1142  which writes a value for a key. Inside it we'll see a [call to the
  1143  engine](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica_command.go#L249)
  1144  to perform this write (but remember, it's all performed inside a
  1145  RocksDB transaction, the `engine.Batch`).
  1146  
  1147  This was all for the purposes of recording the engine changes that
  1148  need to be proposed to Raft. Let's unwind the stack to
  1149  `replica.propose` (the method that started this section), and see what
  1150  happens with the result of
  1151  [`requestToProposal`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2270). For
  1152  one, it gets [inserted
  1153  into](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2292)
  1154  the "pending proposals map" - a structure that will make the
  1155  connection between a command being *applied* and `tryAddWriteCmd`
  1156  which will be blocked on a channel waiting for the local
  1157  application. More importantly, it gets passed to
  1158  [`replica.submitProposalLocked`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2294)
  1159  which, eventually, calls
  1160  [`raftGroup.Propose()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2396). This
  1161  `raftGroup` is a handle to a consensus group, implemented by the [Etcd
  1162  Raft library](https://github.com/coreos/etcd/tree/master/raft), a
  1163  black box to which we submit proposals to have them serialized through
  1164  majority voting into a coherent distributed log. This library is
  1165  responsible for passing the commands in order to all the replicas for
  1166  *application*.
  1167  
  1168  This concludes the discussion of the part specific to the lease holder
  1169  replica: how commands are proposed to Raft and how the lease holder is
  1170  waiting for them to be applied before returning a reply to the (KV)
  1171  client. What's missing is the discussion on how exactly they are
  1172  applied.
  1173  
  1174  #### Raft command application
  1175  
  1176  We've seen how commands are "put into" Raft. But how do they "come
  1177  out"? The Etcd Raft library implements a distributed state machine
  1178  whose description is beyond the present scope. Suffice to say that we
  1179  have a
  1180  [`raftProcessor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/scheduler.go#L112)
  1181  interface that state transitions from this library call to. Our older
  1182  friend the `Store` implements this interface and the important method
  1183  is
  1184  [`Store.processReady()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/store.go#L3337). This
  1185  will eventually call back into a specific `replica` (the replica of a
  1186  range that's being modified by each command), namely it will call
  1187  [`handleRaftReadyRaftMuLocked`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2484). This
  1188  will [iterate through newly committed
  1189  commands](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L2616)
  1190  calling
  1191  [`processRaftCommand`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3247)
  1192  for each one. This will in take the serialized `engine.Batch` and call
  1193  [`replica.applyRaftCommand`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3467)
  1194  with it. Here the batch is [deserialized and applied to the
  1195  engine](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3644)
  1196  and, this time, unlike on the proposed side in
  1197  `applyRaftCommandInBatch`, the changes are actually [committed to
  1198  storage](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3683). The
  1199  command has now been applied (on one particular replica, but keep in
  1200  mind that the process described in this section happens on *every*
  1201  replica).
  1202  
  1203  We've glossed over something in `processRaftCommand` that's important:
  1204  after applying the command, if the current replica is proposer
  1205  (i.e. the lease holder), we need to signal the proposer (which, as we
  1206  saw in the previous section, is blocked in `tryAddWriteCmd`). This
  1207  happens at the [very
  1208  end](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/storage/replica.go#L3513). We've
  1209  now come full circle - the proposer will now be unblocked and receive
  1210  a response on the channel it was waiting on, and it can unwind the
  1211  stack letting its client know that the request is complete. This reply
  1212  can travel through the hierarchy of `Sender`s, back from the lease
  1213  holder node to the SQL gateway node, to a SQL tree of `planNode`s, to
  1214  the SQL Executor, and, through the `pgwire` implementation, to the SQL
  1215  client.
  1216