# Transactional interface between SQL and KV (and TxnCoordSender)

Original authors: knz, andrei

This tech note explains how the SQL/KV interface currently works, up
to the level of detail necessary to understand the processing of
batches, error handling and SQL savepoints, so as to understand bug
findings in this area and to participate in design discussions.

Table of contents:

- [Introduction](#Introduction)
- [client.Txn and RootTxns](#clientTxn-and-RootTxns)
- [LeafTxns and txn state repatriation](#LeafTxns-and-txn-state-repatriation)
- [client.Txn, meta and TxnCoordSender](#clientTxn-meta-and-TxnCoordSender)
- [Interceptors: between TxnCoordSender and DistSender](#Interceptors-between-TxnCoordSender-and-DistSender)
- [TxnCoordSender state](#TxnCoordSender-state)
- [Summary of the all-is-well path](#Summary-of-the-all-is-well-path)
- [Error handling in TxnCoordSender](#Error-handling-in-TxnCoordSender)
- [Error handling with LeafTxns](#Error-handling-with-LeafTxns)
- [Concurrency between root and leaf](#Concurrency-between-root-and-leaf)
- [KV sequence numbers](#KV-sequence-numbers)
- [Seqnum consistency across TxnCoordSenders](#Seqnum-consistency-across-TxnCoordSenders)

## Introduction

At a high level, CockroachDB is architected into the following layers:

1. SQL (incl. pgwire, SQL state machine, planning and execution)
2. Transactional (YOU ARE HERE)
3. Distribution (incl. range leasing, rebalancing)
4. Replication (incl. Raft, replica lifecycle)
5. Storage (incl. engines: RocksDB, Pebble)

This tech note pertains to level 2 in this list and especially to the
boundary between levels 1 and 2.

Conceptually, the "transactional layer" offers an API to the SQL layer
which enables it to consider the CockroachDB cluster as a
transactional KV store. The API offers ways to "define a KV
transaction" and "execute KV operations" (and get results and errors),
and maintains the lifecycle of KV transaction objects.

In particular, it is responsible for a number of optimizations relating
to transient fault recovery (incl. implicit/automatic retries inside
that layer, invisible to SQL) and transaction conflict resolution
(incl. implicit/automatic txn reordering by updating timestamps).

Its other boundary, between levels 2 and 3 above, is another API
offered by level 3 called `DistSender`. That level also allows the
levels above it to "execute KV operations" but it has very little
logic for error recovery and does not maintain KV transaction state
itself. (In fact, levels 3 and below are mostly stateless with
regard to SQL client connections. The transactional layer is the last
level that manages internal state on behalf of a single SQL client.)

The transactional layer's role is thus to *translate* requests coming
from above (SQL) to go below (DistSender), performing optimizations
and error recovery during that process.

Since the interactions are relatively complex, the remainder of the
tech note introduces the concepts incrementally. The explanations at
the beginning are thus somewhat inaccurate, merely providing an
on-ramp to understanding for the reader.

Interestingly, in the context of the transactional layer, the word
**"client"** designates the local SQL layer (e.g. a SQL executor or a
distsql processor), not the remote client app. Likewise, the word
**"server"** designates the local `DistSender` on each node, not the
CockroachDB node as a whole. This differs from the terminology in
the other layers. For example, in SQL: "client" = remote client app,
"server" = gateway or distsql execution server; in replication:
"client" = layer 2, "server" = leaseholder for some range. Depending
on the reader's own technical background, some caution will be needed
while reading the rest of this tech note.

## client.Txn and RootTxns

The first two actors in this explanation are:

- the SQL executor, which organizes the state of the SQL transaction
  and the sequencing of statements on the gateway.
- a transaction object called "RootTxn" (the name will be motivated
  later), that exists on the SQL gateway, and which stores the "main"
  state of the SQL/KV transaction - for example whether it's aborted,
  waiting for a client-side retry, etc.

A simplified view of the interaction between the two is as follows:

![](txn_coord_sender/txnbase.png)

- the SQL executor instantiates an object of Go type `*client.Txn` with
  its type set to `RootTxn` (hence the name).
- during query execution, the SQL exec code (on the gateway; we ignore
  distributed execution for now) uses the `Run()` method on that
  object to run KV operations.
- "under the hood", the RootTxn translates the `Run()` calls into
  BatchRequests sent into the cluster, and translates the
  BatchResponses back into updates to the `client.Batch` object
  provided by the SQL code.
- at the end of the SQL transaction (either commit or rollback, or
  close on error), a call is made to the RootTxn to
  finalize its state.
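
For illustration, here is a minimal sketch of what this client-facing
API looks like to a KV client such as the SQL layer. It assumes the
`client` package (`pkg/internal/client` at the revision linked later in
this note); the helper name, keys and values are made up, and error
handling is reduced to returning the error.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/internal/client"
)

// runExampleTxn is a hypothetical helper showing how a KV client drives a
// RootTxn. db.Txn instantiates a *client.Txn of type RootTxn, runs the
// closure (retrying it on retryable errors), and finalizes the txn state
// at the end (commit on success, rollback on error).
func runExampleTxn(ctx context.Context, db *client.DB) error {
	return db.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
		b := txn.NewBatch()
		b.Put("hello", "world") // queued KV write
		b.Get("hello")          // queued KV read
		// Run() translates the batch into one or more BatchRequests and
		// folds the BatchResponses back into b.Results.
		return txn.Run(ctx, b)
	})
}
```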

## LeafTxns and txn state repatriation

When a query becomes distributed, we want other nodes to be able to
run KV operations "on behalf" of the main SQL transaction running on
the gateway. This needs the same txn ID, timestamp, list of write
intents, etc., so we can't just create a fresh new RootTxn on each node
where a distsql processor runs.

Instead, there is some new complexity, involving three new actors:

- one or more distSQL servers running on nodes other than the
  gateway, which receive requests from the gateway to execute
  work on behalf of a SQL session running there.
- distSQL units of work, called "flows", which are specified
  to run some processing code and, relevant here, operate
  using...
- ... another transaction object called "LeafTxn", which contains
  a copy of many fields of the original RootTxn and is
  able to run KV **read** operations.

This works as follows:

![](txn_coord_sender/leafbase.png)

- the SQL executor instantiates the RootTxn as usual.
- when a distributed query is about to start, the distsql
  execution code pulls out a struct from the RootTxn
  called "LeafTxnInputState". This contains e.g. the txn ID,
  timestamp and write intents as outlined above.
- the trimmed meta struct is sent along with the flow
  request to a remote distsql server.
- on the other node, the distsql server instantiates the
  `LeafTxn` object using the provided meta struct as input.
- the distsql processor(s) (e.g. a table reader) then use
  the LeafTxn to run KV batches.
- when query execution completes, the distsql processor
  extracts a similar state struct off the LeafTxn
  called `LeafTxnFinalState` and the
  result is repatriated to the gateway when the
  flow is shut down.
- on the gateway, repatriated LeafTxn state structs
  are merged into the RootTxn using `UpdateRootWithLeafFinalState()`.
- on the gateway, any error produced by a LeafTxn is also "ingested"
  into the RootTxn to perform additional error recovery and clean-up,
  using `UpdateStateOnRemoteRetryableErr()`.
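
Reduced to the client-facing calls named above, the fork/repatriate
cycle looks roughly like the sketch below. This is illustrative only:
the signatures are simplified, error handling is elided, and the flow
setup/teardown machinery that actually carries these structs between
nodes is not shown.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/internal/client"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// On the gateway, before shipping the flow (RootTxn side).
func forkForFlow(ctx context.Context, rootTxn *client.Txn) *roachpb.LeafTxnInputState {
	tis := rootTxn.GetLeafTxnInputState(ctx)
	// tis is serialized into the flow request sent to the remote node.
	return tis
}

// On the remote node: build a LeafTxn, run the read-only work, and
// extract the final Leaf state for repatriation.
func runFlowWork(
	ctx context.Context, db *client.DB, gatewayNodeID roachpb.NodeID,
	tis *roachpb.LeafTxnInputState,
) *roachpb.LeafTxnFinalState {
	leafTxn := client.NewLeafTxn(ctx, db, gatewayNodeID, tis)
	// ... distsql processors issue read-only KV batches through leafTxn ...
	return leafTxn.GetLeafTxnFinalState(ctx)
}

// Back on the gateway, once the flow has shut down (RootTxn side).
func repatriate(ctx context.Context, rootTxn *client.Txn, tfs *roachpb.LeafTxnFinalState) {
	_ = rootTxn.UpdateRootWithLeafFinalState(ctx, tfs)
}
```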

Why do we need to bring back state from a LeafTxn into a RootTxn?

There are many uses for this data repatriation, not all of which will
be detailed here.

One use which is good to explain why the repatriation is _necessary_
is that of refresh spans: as KV reads are issued by the LeafTxn, it
populates a list of refresh spans. If we did not repatriate these
spans, then a subsequent txn conflict check would not take the reads
performed by the LeafTxn into account, and could incorrectly decide to
refresh the txn (bump its commit ts into the future and retry
automatically) even though those reads have become stale, instead of
pushing the retry error back to the client.

Another use of repatriation that's not strictly necessary but is
nevertheless a useful optimization is the case when the transaction
is aborted concurrently (e.g. if a deadlock was detected by another
txn). If the KV reads done on behalf of the LeafTxn observe that the
txn record has become aborted, this new state will be repatriated and
the RootTxn will know that the entire KV txn has become aborted. This
is faster than letting the RootTxn discover this state later, at the
next KV operation launched on its behalf.

Related issues:
https://github.com/cockroachdb/cockroach/issues/41222
https://github.com/cockroachdb/cockroach/issues/41992

## client.Txn, meta and TxnCoordSender

The two sections above used a simplified picture with
a single "transaction object".

In truth, the [type
`*client.Txn`](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/internal/client/txn.go#L32)
is merely a thin facade for the SQL client. It contains, among other things:

- a type tag (RootTxn/LeafTxn)
- a reference to an [interface type
  `TxnSender`](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/internal/client/sender.go#L57),
  which abstracts the `Send()` operation to send batch requests to the
  rest of the cluster.

In particular it does not contain the "main" txn payload including
commit timestamp, intents, etc.

Where is that payload then? Also, where are the refresh spans and
other in-flight txn properties stored?

The object referenced by `*client.Txn` is an instance of a coordinator
component called the "TxnCoordSender", of [type
`kv.TxnCoordSender`](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/kv/txn_coord_sender.go#L104).

The TxnCoordSender (hereafter abbreviated TCS), as its name implies,
is in charge of maintaining the state of the txn at the top of the KV
layer, and of coordinating the distribution of KV batches to the
layers underneath, together with error handling, txn conflict
management, etc.

The TCS is also, itself, a rather thin data structure.

Its main payload is what the KV team actually calls the "txn object",
of [type
`roachpb.Transaction`](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/roachpb/data.proto#L302),
which in turn also
[contains](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/roachpb/data.proto#L310)
a copy of the "txn meta" object, of [type
`enginepb.TxnMeta`](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/storage/engine/enginepb/mvcc3.proto#L18).

The separation of purpose between `roachpb.Transaction` and
`enginepb.TxnMeta` is not further relevant in this note, and we will just
call them collectively "the txn object".

With this in place, the interaction goes roughly as follows:

![](txn_coord_sender/txncoordsender.png)

The txn object is sent along in the header of every `BatchRequest`
produced by TCS while it processes a `client.Batch` from
SQL or other KV clients. This is passed along the
transaction/replication/storage boundaries and the low-level MVCC code in
storage has access to (a sufficient part of) the txn object during
processing of each single KV operation.

Additionally, the execution of low-level KV operations can _update_
their copy of (parts of) the txn object. This will populate e.g. the
list of observed timestamps, used for later txn conflict resolution.
The resulting txn state then flows back to TCS in the
header of every `BatchResponse`. Upon receiving a BatchResponse, the
TCS *merges* the received txn object in the response with
the txn object it already has, using the `txn.Update()` method.
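
As a rough sketch of this round trip, consider the hypothetical,
heavily simplified coordinator below. The `exampleTCS` type and its
`sendBatch` method are invented for illustration; the real
`TxnCoordSender` additionally runs the interceptor stack described in
the next section and handles errors much more carefully.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/internal/client"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// exampleTCS is a hypothetical stand-in for the coordinator, holding a
// copy of the txn object and a reference to the layer below.
type exampleTCS struct {
	wrapped client.Sender // next stage; eventually the node's DistSender
	mu      struct {
		txn roachpb.Transaction // the coordinator's copy of the txn object
	}
}

func (tcs *exampleTCS) sendBatch(
	ctx context.Context, ba roachpb.BatchRequest,
) (*roachpb.BatchResponse, *roachpb.Error) {
	// Attach a copy of the current txn object to the outgoing batch header.
	txnCopy := tcs.mu.txn
	ba.Txn = &txnCopy

	// Hand the batch to the layer below.
	br, pErr := tcs.wrapped.Send(ctx, ba)
	if pErr != nil {
		return nil, pErr
	}

	// Merge the txn object returned in the response header into the one
	// the coordinator already has.
	if br.Txn != nil {
		tcs.mu.txn.Update(br.Txn)
	}
	return br, nil
}
```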

## Interceptors: between TxnCoordSender and DistSender

The explanation above suggested that TCS sends
BatchRequests to "the cluster".

In truth, "the cluster" is really the distribution layer,
the overall architectural layer immediately under the transaction
layer in CockroachDB. Its entry point is an object called
[`DistSender`](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/kv/dist_sender.go),
of which there is one instance per node.

The interface between TCS and DistSender is an interface
called `client.Sender` which defines a method `Send(BatchRequest)
(BatchResponse, error)`.

So _conceptually_, we have something like this in the code:

![](txn_coord_sender/distsender.png)

However, there's a little more complexity hidden in there. If we had a
direct call from `TCS.Send()` into `DistSender.Send()`,
then a single blob of code in TCS itself would need to deal
with all the complexity of txn pipelining, parallel commits, etc.

To facilitate reasoning about the code and to ease maintenance, the
txn management logic is split away from TCS itself, and
across multiple other components arranged in a _pipeline_ placed between
TCS and DistSender. Each stage of this pipeline is called
an "interceptor" and is responsible for a single aspect of txn
coordination. Each also contains additional local state.

Two example interceptors that happen to be relevant to this note are:

- the
  [`txnSpanRefresher`](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/kv/txn_interceptor_span_refresher.go#L103),
  which contains and manages the read and write refresh spans already
  mentioned above.
- the
  [`txnSeqNumAllocator`](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/kv/txn_interceptor_seq_num_allocator.go#L58),
  which assigns [sequence numbers](#KV-sequence-numbers) to individual KV
  operations in batches.

Thus, in reality, the call stack looks more like this:

![](txn_coord_sender/interceptors.png)

TCSs allocated for RootTxns use [the full pipeline of
interceptors (6 of them as of this
writing)](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/kv/txn_coord_sender.go#L529),
whereas LeafTxns, which only handle read requests, use [only a
subset](https://github.com/cockroachdb/cockroach/blob/a57647381a4714b48f6ec6dec0bf766eaa6746dd/pkg/kv/txn_coord_sender.go#L556).
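
The interceptor pattern itself is easy to sketch: each stage holds a
reference to the next stage and implements the same send-like
interface. The snippet below is a hypothetical, stripped-down
illustration of that structure (the real interceptors implement an
internal, locked variant of this interface and carry much more state):

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// batchSender is a hypothetical stand-in for the send interface that every
// pipeline stage (and, at the bottom, the DistSender) implements.
type batchSender interface {
	Send(context.Context, roachpb.BatchRequest) (*roachpb.BatchResponse, *roachpb.Error)
}

// toyInterceptor is a made-up interceptor: it does its own bit of work
// around the call to the next stage, exactly like the real interceptors
// (seqnum allocation, span refreshes, pipelining, ...) do.
type toyInterceptor struct {
	wrapped batchSender // the next stage in the pipeline
}

func (i *toyInterceptor) Send(
	ctx context.Context, ba roachpb.BatchRequest,
) (*roachpb.BatchResponse, *roachpb.Error) {
	// "On the way in": inspect/annotate the request, possibly short-circuit
	// with an error of our own.
	br, pErr := i.wrapped.Send(ctx, ba)
	// "On the way out": inspect the response or error, update local state,
	// possibly absorb the error entirely.
	return br, pErr
}
```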

## TxnCoordSender state

The overall "current state" of a TCS is thus distributed
between various Go structs:

- the txn object (`roachpb.Transaction`),
- the set of its interceptors (each interceptor contains a portion of the TCS state
  sufficient and necessary for its local processing),
- a few fields of its "own", including a summary of the lifecycle of
  the txn object called `txnState` (relevant to this note; we'll come
  back to this later).

This overall state is a native Go struct and not a protobuf. However,
[as we've seen above](#LeafTxns-and-txn-state-repatriation), distributed execution needs to take the
"current state" of a RootTxn and carry it over to another node to
build a LeafTxn.

For this purpose, a separate protobuf message `LeafTxnInputState` is
defined. The TCS's `GetLeafTxnInputState()` method initially populates
it by asking every interceptor in turn to write its portion of the
state into it.

Conversely, when the state of a LeafTxn is repatriated and is to be
"merged" into the RootTxn, the `UpdateRootWithLeafFinalState()` method
uses the `Update()` method on the `roachpb.Transaction` sub-object
(which merges the state of the txn object itself), then asks every
interceptor, in turn, to collect the bits of state it may be
interested in merging too.

For example, that's where the RootTxn's txnSpanRefresher interceptor
picks up the spans accumulated in the LeafTxn.
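
One way to picture this in code (purely illustrative; the real
interceptor interface spells these hooks differently) is that each
interceptor implements a pair of hooks that the TCS calls when forking
and when merging:

```go
// leafStateHooks is a hypothetical illustration of the fork/merge hooks.
type leafStateHooks interface {
	// Called by GetLeafTxnInputState(): contribute this interceptor's
	// portion of the state needed to start a LeafTxn elsewhere.
	populateLeafInputState(*roachpb.LeafTxnInputState)
	// Called when a LeafTxnFinalState is repatriated: fold the bits this
	// interceptor cares about (e.g. refresh spans) back into local state.
	importLeafFinalState(*roachpb.LeafTxnFinalState)
}
```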

## Summary of the all-is-well path

To summarize the previous sections, the SQL/KV interface
involves the following actors:

- a `client.Txn` object, which doesn't know much, other than...
- a reference to a `TCS`, which stores:
  - (a copy of) the current txn object incl. `roachpb.Transaction` and `enginepb.TxnMeta`,
  - a set of txn interceptors, each with its own local state,
  - at the end of the interceptor pipeline, a reference to the local node's `DistSender` object,
  - a little additional TCS local state including a "txn status" field called `txnState`.

When a KV batch request arrives from SQL through `client.Txn`, it is
passed through TCS and the stack of interceptors, delivered to
DistSender, and the responses flow back up the same path.

Now on to the next question: *what about errors?*

## Error handling in TxnCoordSender

For simplicity, in this section we'll start with the simple
case of a RootTxn without any LeafTxn.

When an error is encountered either in DistSender or "underneath"
(remote replicas, etc.), it flows back through the interceptors
into the TCS's `Send()` method.

Some interceptors peek into the error object and update their local
state. Some of them (like the `txnSpanRefresher`) fully absorb the
error to turn it into a non-error.

Additionally, some interceptors can generate errors of their own
either "on the way in" (towards DistSender), which causes a shortcut
to the return path; or "on the way out" (alongside a BatchResponse).

When `(TCS).Send()` receives an error from the chain
of interceptors, it distinguishes between 7 kinds of errors, currently
split into three groups:

- sub-group 1: *same-TCS recoverable* errors, which cause the TCS to
  perform partial or full error recovery.

  This group contains 3 kinds of errors:

  1. *recoverable errors with in-place recovery*, where the TCS
     will handle the error internally, then retry the operation
     in a way that's invisible to the higher levels.
     In this case, the txn object remains "live" and
     its "identity" (ID, epoch) is unchanged.

     For example, txn refreshes are processed automatically
     in this way.

  2. *recoverable errors with txn restart*, where the
     TCS resets the txn object to a state where the
     client (the SQL layer) can restart the operation,
     or tell its own client to attempt the operation again
     (client-side retries). In this case,
     the txn object remains "live" but its identity
     (epoch) changes immediately.

     Example sequence diagram in the case of a recoverable error with txn
     restart:

     ![](txn_coord_sender/erecovery.png)

  3. *deferred retry errors*, where the TCS remembers that the error
     has occurred but pretends the operation succeeded for the benefit
     of the (SQL) client. The error is only reported at
     the end of the SQL txn, where the client is requested to
     perform a client-side retry.

     This is currently used only for `WriteTooOldError`.

- sub-group 2: *new-TCS recoverable* errors, which cause the TCS to
  become "trashed" (and unusable), but where the `*client.Txn` can
  continue/restart with a new TCS:

  1. *retryable transaction aborts* (`TransactionRetryWithProtoRefreshError`
     with a `TransactionAbortedError` payload), which occur when the KV
     transaction gets aborted by some other transaction. This happens in case of
     deadlock, or in case the coordinator fails to heartbeat the txn record for a
     few seconds and another transaction is blocked on one of our intents. Faux
     `TransactionAbortedErrors` can also happen for transactions that straddle a
     lease transfer (the new leaseholder is not able to verify that the transaction
     had not been aborted by someone else before the lease transfer because we lost
     the information in the old timestamp cache).

  For these errors, the TCS becomes unusable but the `*client.Txn`
  immediately replaces the TCS with a fresh one; see
  `UpdateStateOnRemoteRetryableErr()` and
  `replaceSenderIfTxnAbortedLocked()` in `client/txn.go`.

- sub-group 3: *unrecoverable* errors, which cause both the TCS and
  `*client.Txn` to become "trashed".

  This group contains 3 kinds of errors:

  1. *permanent transaction errors*, which occur when the transaction
     encounters a permanent unrecoverable error, typically due to a
     client logic error (e.g. AOST read under GC).

  2. *transient processing errors*, for which it is certain that
     further processing is theoretically still possible after
     the error occurs. For example, attempting to read data using
     a historical timestamp that has already been garbage collected,
     `CPut` condition failure, transient network error on the read path, etc.

  3. *unhandled errors*, for which it is not certain that further
     processing is safe or sound (or where we haven't yet proven that
     it is). For example, "ambiguous result" errors, "store
     unavailable" and internal assertion errors fall in this category.

  When an unrecoverable error occurs, the TCS changes its `txnState` to
  `txnError`. After this happens, any further attempt to use the
  TCS will be rejected without even attempting further
  processing. At the SQL level, this is then recognized as a forced txn
  abort after which only ROLLBACK is accepted (or where COMMIT will
  produce a "txn is aborted" error).

  Example sequence diagram in the case of an unrecoverable error:

  ![](txn_coord_sender/eunrecoverable.png)

Summary table:

| Group                | Error kind                         | Example                                    | Current recovery                                          |
|----------------------|------------------------------------|--------------------------------------------|-----------------------------------------------------------|
| same-TCS recoverable | recoverable with in-place recovery | `ReadWithinUncertaintyIntervalError`       | internal auto retry, txn identity preserved               |
| same-TCS recoverable | recoverable with txn restart       | commit deadline exceeded error             | tell client to retry, reset txn object to new epoch       |
| same-TCS recoverable | deferred retry                     | transaction push, write too old (?)        | store error state, reveal retry error only at commit time |
| new-TCS recoverable  | retryable txn aborts               | transaction aborted by concurrent txn      | TCS becomes unusable but `client.Txn` can resume          |
| unrecoverable        | non-retryable txn aborts           | read under GC threshold                    | hard fail, TCS becomes unusable                           |
| unrecoverable        | transient processing errors        | CPut condition failure                     | hard fail, TCS becomes unusable (see below)               |
| unrecoverable        | unhandled errors                   | store unavailable error, assertion failure | hard fail, TCS becomes unusable                           |

The keen reader may wonder why transient processing errors cause the txn
object and the TCS to become unusable. Indeed, there is no good reason
for that. It is actually poor design, as a SQL client may legitimately want
to continue using the txn object after detecting a logical error (e.g.
duplicate row) or transient error (e.g. network connection reset). **This
is set to change with the introduction of savepoints.**

Another important aspect of "recoverable errors with txn restart" and
"retryable txn aborts", which will become more noteworthy below, is
that the txn object stored inside the TCS may become different "on the
way out" (back to client.Txn and the SQL layer) from what it was "on
the way in". It is currently the responsibility of the client (SQL
layer), which may have its own copy of the txn object, to pick up this
change. Cross-references on this point:

- `(*client.Txn) UpdateStateOnRemoteRetryableErr()`
- `(*DistSQLReceiver) Push()` -- `roachpb.ErrPriority(meta.Err) > roachpb.ErrPriority(r.resultWriter.Err())`
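
From the perspective of SQL-layer code, the practical consequence of
the grouping above is a small decision tree when a KV call returns an
error. Here is a hedged sketch: the error type named below is the one
discussed in this section, while the helper and its return values are
made up for illustration.

```go
import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// classifyTxnErr is a hypothetical helper sketching how a KV client such
// as the SQL executor reacts to the error groups described above.
func classifyTxnErr(err error) string {
	switch err.(type) {
	case nil:
		// Same-TCS recoverable errors were already handled (or deferred)
		// inside the TCS; the client never sees them here.
		return "ok"
	case *roachpb.TransactionRetryWithProtoRefreshError:
		// A retry error: the txn object was reset (new epoch) or the TCS
		// replaced; the client can retry using the same *client.Txn.
		return "retry"
	default:
		// Unrecoverable: the TCS is in txnError; only rollback is possible.
		return "abort"
	}
}
```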

## Some additional wisdom by Tobias

> Whenever we extract information from an error that counts as a read,
> we have to make sure that read is accounted for by the KV store. For
> example, a ConditionFailedError from a CPut is in fact a successful
> read; if this isn't accounted for properly at the KV layer (timestamp
> cache update), there will be anomalies (CPut has code to do exactly
> that here). I generally find it unfortunate that we're relying on
> errors to return what are really logical results, and I hope that we
> buy out of that as much as we can. For CPut, for example, we'd have
> `ConditionalPutResponse` carry a flag that tells us the actual value and
> whether a write was carried out. I suspect we're using errors for some
> of these only because errors eagerly halt the processing of the
> current batch. Continuing past errors generally needs a good amount
> of idempotency (for example, getting a timeout during a CPut and
> retrying the CPut without seqnos could read-your-own-write). We had no
> way of doing that prior to the intent history and seqnos.

> By the way, in case you haven't stumbled upon this yet, the txn span
> refresher (an interceptor inside TCS) has a horrible contract with
> `DistSender`, where DistSender returns a partially populated
> response batch on errors, from which the refresher then picks out
> spans for its refresh set. I'm wondering how this hasn't backfired
> yet.

## Concurrency between root and leaf

Today, it is not valid (= KV/SQL protocol violation) to perform KV
operations using LeafTxns concurrently with a RootTxn,
or to use multiple RootTxns for the same txn object side-by-side.

Note that while the SQL code is architected to take this restriction
into account, *it is not currently enforced on the KV side*. We
sometimes see bugs (e.g. #41222 / #41992) occurring because we do not
have infrastructure in place to detect violations of this restriction.

This restriction exists for 3 reasons, one of them actually invalid
(a flawed past understanding):

- KV writes must be uniquely identified, for txn pipelining. Since the
  identification is currently performed using a single counter in the txn
  object, there cannot be more than one TCS using this counter at a time.

  Today only the RootTxn can process KV writes, so this restricts
  write concurrency to just 1 RootTxn. Even if LeafTxns had enough
  complexity to process writes, concurrency would be limited by this
  counter, until/unless we can guarantee that two separate TCSs
  generate separate operation identifiers.

  (For example, by combining sequence bits with a TCS ID.)

  Tobias notes:

  > I wonder if we also need their read/write spans to be
  > non-overlapping. There's all sort of weird stuff if they do
  > overlap, though maybe it's not actually illegal. Either way, not
  > something we're going to do today or tomorrow.

- RootTxns update the txn object during error processing (see previous section).

  If we let the RootTxn process operations and perform error recovery
  while a LeafTxn is active, we'd need to answer difficult questions.

  Consider the following sequence:

  ![](txn_coord_sender/mismatch.png)

  In this situation, at t1 the RootTxn handles a retriable error
  by preparing the next client retry attempt via a new txn object,
  then at a later instant t2 it is augmented by a LeafTxn whose
  state was part of its "past", using the original txn object.
  How to combine the states from the two "generations" of the txn object?

  To avoid this situation altogether, any use of a LeafTxn
  comes with a requirement to not use the RootTxn at all
  while the LeafTxn is active.

- (Mistaken) Expectation that distributed reads are able to observe
  concurrent writes on other nodes.

  The KV read and write operations are mutually ordered using seqnums.
  If we were to expect that a read is able to observe a write
  performed on a separate node, it would be necessary to synchronize
  seqnums across nodes for every KV write. This is neither practical
  nor currently implemented.

  This expectation is what currently mandates that there be no LeafTxn
  active while KV writes are being processed by a RootTxn.

  (The restriction is lifted by observing that the expectation is
  invalid: PostgreSQL semantics require that all reads performed by a
  mutation statement observe the state of the db prior to any
  write. So there is no requirement of read-the-writes inside a single
  SQL statement. The current crdb behavior actually is a bug, our
  [current halloween
  problem](https://github.com/cockroachdb/cockroach/issues/28842).
  Since LeafTxns are re-generated across SQL statements, it's trivial
  to get the right semantics without a restriction on LeafTxn/RootTxn
  concurrency.)

The astute reader may wonder how distSQL deals with the requirement
that no LeafTxn be active while a RootTxn is active, or no RootTxn be
active while LeafTxns are active. To make this happen, there is code in
the distsql planner that selects whether to use _either_ multiple
LeafTxns, one per node / distsql processor, _or_ a single RootTxn,
shared by all distsql processors (which forces them to run on the
gateway, serially, using a single goroutine). (This code also has bugs;
see e.g. issues #41222 / #41992.)
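
A hypothetical sketch of that planning decision (the names below are
invented for illustration; the criterion - writes force a single
RootTxn on the gateway - is the one described in this note):

```go
// concurrencyMode describes which kind of txn objects a distributed plan
// may use, given the restriction described above.
type concurrencyMode int

const (
	useLeafTxns    concurrencyMode = iota // read-only plan, can fan out
	useRootTxnOnly                        // plan contains writes, stay on the gateway
)

// pickTxnMode is an invented helper mirroring the decision made by the
// distsql planner: a plan that performs KV writes must run through the
// single RootTxn (serially, on the gateway); a read-only distributed plan
// can instead give each remote flow its own LeafTxn.
func pickTxnMode(planHasMutations bool) concurrencyMode {
	if planHasMutations {
		return useRootTxnOnly
	}
	return useLeafTxns
}
```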

## KV sequence numbers

At the SQL/KV interface, KV operations are associated with *sequence numbers* (seqnums):

- write operations generate new seqnums, which are stored inside write
  intents.
- read operations operate "at" a particular seqnum: an MVCC read that
  encounters an intent ignores the values written at later seqnums
  and returns the most recent value prior to that seqnum instead.
- combined read/write operations, like CPut, operate their read part
  at their write seqnum - 1.
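
For intuition, here is a small sketch using the `client.Txn` API
(`seqnumExample` and the keys/values are made up, and the seqnum values
in the comments are invented; only their relative ordering matters):

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/internal/client"
)

// seqnumExample illustrates the rules above: within a single txn, later
// reads observe earlier writes of the same txn because each read carries
// the seqnum of the last write.
func seqnumExample(ctx context.Context, txn *client.Txn) error {
	if err := txn.Put(ctx, "k", "a"); err != nil { // assigned, say, seq 1
		return err
	}
	if err := txn.Put(ctx, "k", "b"); err != nil { // assigned seq 2
		return err
	}
	// This read is sent with seq 2, so it returns "b". A read pinned at
	// seq 1 (as a LeafTxn forked before the second write would issue)
	// would return "a" instead, ignoring the later intent value.
	_, err := txn.Get(ctx, "k")
	return err
}
```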

Today the TCS (the component that receives KV request batches from
SQL) is responsible for generating seqnums.

The seqnum counter's current value is kept in three locations:

- a local variable inside the TCS, in one of the interceptors (the
  `txnSeqNumAllocator` mentioned above);
- the `enginepb.TxnMeta` record, inside the `roachpb.Transaction` held inside the `LeafTxnInputState`;
- the `enginepb.TxnMeta` record, inside the `roachpb.Transaction` held inside the header of every executed KV batch.

These three values are synchronized as follows:

- The interceptor's counter is incremented for every KV write operation,
  and the current counter value (with or without increment) is copied to
  the `Sequence` field in the *request header* of every KV operation
  flowing through the interceptor. This ensures that:

  - every write gets a new sequence number.
  - every read gets a copy of the seqnum of the last write.
- The `Sequence` field in the request header of individual KV
  operations is also copied to the same-name field in the `TxnMeta` of the
  batch header in certain circumstances (most notably by another, later
  interceptor, the `txnPipeliner`) for use during txn conflict
  resolution and write reordering.
- When a TCS is instantiated from a LeafTxnInputState (e.g. forking a
  RootTxn into a LeafTxn), the counter value from the TxnMeta inside the LeafTxnInputState
  is copied into the interceptor.
- When a LeafTxnInputState is constructed from a TCS, the value is copied
  from the interceptor.

Final note: the seqnum is scoped to the current txn epoch. When the
epoch field is incremented, the seqnum generator resets to 0. The
overall ordering of operations thus also needs to take the epoch into
account.

## Seqnum consistency across TxnCoordSenders

The current code was designed with the assumption that a single TCS
can issue writes and assign new seqnums to requests.

Today the code is organized to use only a single RootTxn (and no LeafTxns) for
SQL statements that perform writes, so that anything that may
update the seqnum ends up running sequentially in a single goroutine.

It's interesting to consider how this would work with LeafTxns if
we were to relax the restriction and allow multiple readers
with one writer.

The main mechanism that helps is that without writes, a TCS will
continue to assign the same seqnum to every read. A LeafTxn forked
from a RootTxn will thus continue to use the seqnum last generated by
the RootTxn before it was forked.

So if we have a SQL sequence like this:

1. UPDATE
2. SELECT

and the SELECT is distributed with LeafTxns, all the read requests
performed on its behalf by other nodes will use the last (epoch, seqnum)
generated for the UPDATE and thus be able to "observe the writes".

The astute reader can then consider what happens for the UPDATE
itself. What if the UPDATE itself happens to be distributed, with some
LeafTxns on other nodes running the "read part" of the UPDATE,
and the RootTxn on the gateway issuing the KV write operations?

Here it would also work, as follows:

- at the beginning of the UPDATE's execution, _before any writes have
  been issued_, the UPDATE's LeafTxns are forked. This ensures that any
  further distributed reads by the UPDATE will be using the last
  (epoch, seqnum) generated by the statement _before_ the UPDATE.
- during the UPDATE's execution, the RootTxn increments its counter
  to perform the mutation. This increase remains invisible
  to the update's LeafTxns.

By ensuring that the read path only sees the writes prior to the
seqnum at the start of execution, it will be unaffected by subsequent
writes. This solves crdb's [current halloween
problem](https://github.com/cockroachdb/cockroach/issues/28842).
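
To make this concrete with made-up numbers: suppose the RootTxn's
counter stands at (epoch 0, seqnum 4) when the UPDATE starts. The
LeafTxns forked at that point issue all of their reads at (0, 4). The
RootTxn then assigns seqnums 5, 6, ... to the UPDATE's writes; since
the LeafTxns keep reading at seqnum 4, the MVCC rule from the
[KV sequence numbers](#KV-sequence-numbers) section makes them ignore
those newer intent values, so the UPDATE never re-reads its own output.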