- Feature Name: parallel commit
- Status: completed
- Start Date: 2018-03-24
- Authors: Tobias Schottdorf, Nathan VanBenschoten
- RFC PR: #24194
- Cockroach Issue: #30999

# Summary

Cut the commit latency for the final batch of a transaction in half, from two rounds of consensus down to one, by addressing a shortcoming of the transactional model which does not allow `EndTransaction` to be sent in parallel with other writes.

This is achieved through the introduction of a new transaction status `STAGED` which can be written in parallel with the final batch. This transaction status is usually short-lived; during failure events, transactions may be abandoned in this status and are recovered via a newly-introduced *status resolution process*.

With proper batching, this change translates directly into a reduction of SQL write latencies.

The latency until intents are resolved remains the same (two rounds of consensus). Thus, contended workloads are expected to profit less from this feature.

# Motivation

Consensus latencies are the price CockroachDB pays for its consistency guarantees; they are not usually incurred on monolithic or NoSQL databases, and can be used as an argument against the use of CockroachDB.

As a database that prides itself on geo-distributed use cases, we must strive to reduce the latency incurred by common OLTP transactions to near the theoretical minimum: the sum of all read latencies plus one consensus latency.

This is far from true at the time of writing: chatty SQL transactions incur a consensus latency on every write (unless they can use `RETURNING NOTHING`) due to the naive underlying implementation.

Consider the (implicit) transaction

```sql
INSERT INTO t VALUES (1, 'x'), (2, 'y'), (3, 'z');
```

where the table `t` has a simple KV schema and has been split at `2` and `3`. Since this is an implicit transaction, the SQL layer will send it as a single KV batch of the form

```
BeginTransaction(1)
CPut(1)
CPut(2)
CPut(3)
EndTransaction(1)
```

And, when split by range boundaries, `DistSender` will consider the subbatches

```
BeginTransaction(1)
CPut(1)
EndTransaction(1)
```

```
CPut(2)
```

```
CPut(3)
```

To minimize latency, you would want to send these three batches in parallel. However, this is not possible today.

To see why not, consider doing so and finding that the first two batches apply successfully, whereas the third fails (say, with a unique constraint violation).

The first batch both created and committed the transaction record, so any intents written become live. However, the third value is never written; we have lost a write.

To work around this, `DistSender` detects batches which try to commit a transaction, and forces the commit to be sent in a separate batch when necessary. This results in the following requests being sent in parallel first:

```
BeginTransaction(1)
CPut(1)
```

```
CPut(2)
```

```
CPut(3)
```

and then, contingent upon the success of the above wave,

```
EndTransaction(1)
```

This effectively doubles the latency (ignoring the client<->gateway latency).
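For illustration, the two-wave pattern can be sketched in Go-flavored pseudocode. This is not the actual `DistSender` implementation; `splitOffCommit`, `splitByRange`, `sendParallel`, and `send` (as well as the request types) are hypothetical stand-ins.

```go
// Sketch of today's two-wave commit protocol (hypothetical helpers,
// not the actual DistSender code).
func sendCommittingBatch(ctx context.Context, ba BatchRequest) error {
	// Split the committing EndTransaction off into its own batch.
	writes, commit := splitOffCommit(ba)

	// Wave 1: all writes, fanned out to their ranges in parallel.
	if err := sendParallel(ctx, splitByRange(writes)); err != nil {
		return err // nothing was committed; safe to return the error
	}

	// Wave 2: only once every write has succeeded is it safe to commit.
	// This second round of consensus is what doubles the latency.
	return send(ctx, commit)
}
```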
This RFC should be considered in conjunction with #16026, which avoids write latencies until commit time (incurring only read latencies until then). With #16026 alone, transaction latencies would roughly equal the sum of read latencies, plus *two* rounds of consensus. This proposal reduces that to the minimum possible: the sum of read latencies, plus *one* round of consensus.

# Guide-level explanation

As seen in the last section, the fundamental problem is that transactions can't be marked as committed until there is proof that the final batch of writes succeeded.

A new transaction status `STAGED` avoids this problem by populating a transaction record with enough information for *anyone* to prove (or disprove!) that these writes are all present. If they are, the transaction must be marked committed; otherwise, it can be marked as aborted. This means that the coordinator can return to the client early when it knows that these writes have succeeded, and can send the actual commit operation lazily, as a performance optimization.

Omitting error handling for now, we want the coordinator to be able to

1. send `EndTransaction` with status `STAGED` in parallel with the last batch,
1. on success of all writes, return to the client,
1. asynchronously send `EndTransaction` with status `COMMITTED`, which as a side effect
1. resolves the intents.

To achieve this, the semantics of what it means for a transaction to commit change. As it stands, they are

> A transaction is committed if and only if there is a transaction record with status COMMITTED.

The proposed changed condition (referred to as the **commit condition**) is:

> A transaction is committed if and only if a) there is a transaction record with status COMMITTED, or b) one with status STAGED and all of the intents written in the last batch of that transaction are present.

We refer to a transaction in status `STAGED` for which the *commit condition* holds as **implicitly committed**.

When there is a `COMMITTED` transaction record, it is **explicitly committed**. In typical operation, encountered transaction records are usually *explicitly committed*.

However, when coordinators crash or become disconnected, this does not hold, and we need `STAGED` transaction records to contain enough information to check the *commit condition*. To achieve this, the set of intent spans for the writes in the last batch is included in the staging `EndTransaction` request; we call these intent spans the **promised writes**.

In fact, to later arrive at a consistent `COMMITTED` record (which needs to contain all writes for GC purposes), the staging `EndTransaction` request will also separately contain the intent spans for all prior batches.

The changes suggested in this document chiefly affect `DistSender` and the conflict resolution machinery. The `STAGED` transaction status is never encountered by consumers of `DistSender` and must not be used by them.
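Expressed as a predicate, the *commit condition* reads as follows. This is a minimal Go sketch; `TxnRecord`, its fields, and the `intentPresent` helper are hypothetical stand-ins, and the real check is performed by the status resolution process described later.

```go
// isCommitted sketches the commit condition; all names are hypothetical.
func isCommitted(rec TxnRecord) bool {
	switch rec.Status {
	case COMMITTED:
		return true // explicitly committed
	case STAGED:
		// Implicitly committed if and only if every promised write of
		// the final batch left an intent at the provisional commit
		// timestamp.
		for _, w := range rec.PromisedWrites {
			if !intentPresent(w.Key, rec.ID, rec.Timestamp, w.Seq) {
				return false
			}
		}
		return true
	default:
		return false
	}
}
```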
## Example Notation

Throughout the document, examples illustrate the algorithm. In the examples, key `ki` lives on range `i`; time flows from left to right and each line corresponds to one goroutine (which may be reused for compactness). A batch directed at a single range is grouped in curly braces. A vertical line demarcates that the events to the right wait for the events on the left to complete. `Write(ki)`, `Stage(ki)`, `Commit(ki)` are requests (from the final batch in the transaction) sent by `DistSender`. `Resolve(ki)` is invoked from the intent resolver of the replica housing the transaction record.

The examples are accompanied by explanations which may contain additional details not reflected in the notation.

We now go through some basic examples from the point of view of the `DistSender` machinery.

## Example (basic happy case)

The transaction sends two writes and the staged txn record in parallel to three ranges. When all operations come back successfully, the client receives an ack for its commit. An asynchronous request marks the transaction record as committed (this is not required for correctness).

Note that the transaction did not write to `k2` in the final batch, but has written there earlier (for that's where the transaction record is anchored). We elide the intent resolution for it, which happens as part of the `Commit`.

```
Write(k1) ok|              |Resolve(k1) ok
Stage(k2) ok|Commit(k2) ok |
Write(k3) ok|              |Resolve(k3) ok
            ^- ack commit to client
```

## Example (1PC commit)

Today's behavior is maintained: whenever the final batch of the transaction addresses into a single range, the batch is sent as-is. In particular, no `STAGED` transaction record is ever written. Additionally, if the complete transaction is only one batch, the 1PC optimization (which avoids writing the transaction record in the first place) can apply on the range receiving it.

### 1PC transaction (success)

```
{Begin(k1), Write(k1), Commit(k1)} ok
```

### 1PC transaction (range split then success)

In the below example, the 1PC attempt fails since the range has split.

```
            .- range key mismatch
            v
{Begin(k1), Write(k1, k2), Commit(k1)} err
```

Instead, the request is retried and sent in the "conventional" way (the intent at `k1` is resolved as part of the `Commit`):

```
{Begin, Write, Stage}(k1) ok|Commit(k1)
Write(k2) ok                |          |Resolve(k2)
                            ^- ack commit to client
```

### Failed write

#### Non-retryable

A write fails hard (a good example is `ConditionFailedError`, which corresponds to a unique constraint violation); the error bubbles up to the client:

```
{Write,Stage}(k1) ok
Write(k2) err <- returned to client
Write(k3) ok
```

The client then aborts (or retries) the transaction:

```
Abort(k1) ok|
            |Resolve(k2)
            |Resolve(k3)
```

The client has to stop using the transaction for the current epoch (i.e. it has to restart the txn), or newly written intents would not be reflected in the staged transaction record, and thus in the *commit condition*.

#### Retryable

Writes in the final batch can have outcomes that force a transaction retry: their timestamp can be pushed, they can catch the `WriteTooOld` flag, or they incur a straight-up `TransactionRetryError`.

```
Stage(k1) ok
Write(k2) retry <- returned to client
Write(k3) ok
```

On the face of it, this looks very similar to the non-retryable case, and from the perspective of `DistSender`, it is. However, viewed through the lens of the *commit condition*, this is different because the transaction is potentially abortable: it has not laid down a committable intent at `k2`. However, as explained later, concurrent transactions will only make use of this if the transaction record looks abandoned, which wouldn't be the case here.
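Putting the guide-level pieces together, the coordinator-side flow for an eligible final batch looks roughly as follows. This is a hedged sketch; `makeEndTxn`, `promisedWrites`, `sendParallel`, `sendAsync`, and `ackClient` are hypothetical helpers, and error handling is reduced to its essence.

```go
// Sketch of the parallel-commit happy path from the coordinator's
// perspective. All helper names are hypothetical.
func commitInParallel(ctx context.Context, txn *Transaction, finalWrites []Request) error {
	// The staged record travels in parallel with the final batch's writes.
	stage := makeEndTxn(txn, STAGED, promisedWrites(finalWrites))
	if err := sendParallel(ctx, append(finalWrites, stage)); err != nil {
		return err // the client must now retry or abort the transaction
	}

	// All promised writes succeeded: the transaction is implicitly
	// committed, and the client can be acked immediately.
	ackClient()

	// Making the commit explicit (and resolving intents) happens
	// asynchronously; it is an optimization, not a correctness requirement.
	sendAsync(ctx, makeEndTxn(txn, COMMITTED, nil /* no promised writes */))
	return nil
}
```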
# Reference-level explanation

## Detailed design

### txnCommitter

The parallel commit machinery lives in a `txnInterceptor` called `txnCommitter`. The interceptor lives below `txnPipeliner` in the `TxnCoordSender`'s interceptor stack. It gets activated when a `BatchRequest` containing at least one request plus a committing `EndTransaction` arrives. These other requests may be `BeginTransaction` requests, writes, `QueryIntent` requests, or any other transactional requests.

The code first checks whether the batch is eligible for parallel commit. This means that the batch contains no

- ranged (write) request. At the time of writing, the only such request type is `DeleteRange`, and we tend to try to use it less (#23258). Handling ranged requests is difficult (near impossible). Consider a `DeleteRange` that writes into a span and lays down an intent (on the single affected key), and a later (unrelated) write into that range (on a previously empty key). The status resolution process needs to know that the span related to the promised write was written atomically, so that it can conclude from finding one intent that all intents are there. This implies that the ranged requests corresponding to promised writes must not be split across ranges, which adds an unreasonable amount of complexity.
- commit trigger. Commit triggers are only used by internal transactions, for which the transaction record's lease is usually colocated with the client running the transaction (so that only one extra round trip to the follower is necessary to commit on the slow path). Support for this can be added later: add a commit trigger to the `STAGED` proto and, if set, don't allow the commit trigger to change when the transaction commits explicitly.

If the batch is not eligible, it passes through the `txnCommitter` unchanged. It will be sent with a `COMMITTED` status as always.

When it is eligible, a copy of the `EndTransaction` request is created, with the status changed to `STAGED`, and the `PromisedWrites` field is populated from the other requests in the batch according to the protobuf below.

```
// promised_writes is only set when the transaction record is in
// status `STAGED` (part of the parallel commit mechanism) and is
// the set of key spans with associated sequence numbers at which
// the transaction's writes from the last batch were directed. This
// is required to be able to determine the true status of a
// transaction whose coordinator abandoned it in `STAGED` status.
// The so-called status resolution process needs to decide whether
// the promised writes are present. If so, the transaction is
// `COMMITTED`. Otherwise, it is `ABORTED`.
//
// The parallel commit mechanism is only active for batches which
// contain no ranged requests or commit trigger.
repeated message SequencedWrite {
  bytes key = 1;
  int32 seq = 3;
} promised_writes = 17;
```
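For concreteness, the eligibility check and rewrite could look roughly like the sketch below. The helper and accessor names (`committingEndTxn`, `hasRangedWrite`, `pointWrites`, `WithReplacedEndTxn`) are hypothetical; the real interceptor operates on `roachpb.BatchRequest`.

```go
// maybeStageCommit sketches the txnCommitter's rewrite of an eligible
// committing batch. All helper names are hypothetical.
func maybeStageCommit(ba BatchRequest) BatchRequest {
	et, ok := committingEndTxn(ba)
	if !ok || hasRangedWrite(ba) || et.HasCommitTrigger() {
		// Not eligible: the commit is sent with status COMMITTED as before.
		return ba
	}
	staged := et.Clone()
	staged.Status = STAGED
	// Record each point write of the final batch along with its sequence
	// number, so that the status resolution process can later prove or
	// disprove the commit condition.
	for _, w := range pointWrites(ba) {
		staged.PromisedWrites = append(staged.PromisedWrites,
			SequencedWrite{Key: w.Key(), Seq: w.SeqNum()})
	}
	return ba.WithReplacedEndTxn(staged)
}
```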
In the happy case, all requests come back successfully. Errors (from above the routing layer) are propagated up the stack, with the client (`TxnCoordSender`) restarting (which includes refreshing) or aborting the transaction as desired (to avoid forcing concurrent transactions into the status resolution process).

Note that RPC-level errors from the final batch will be turned into ambiguous commit errors, as they are today (this happens at the RPC layer and won't be affected by the changes proposed here).

When the writes come back, it is checked whether they require a transaction restart. This is the case if the returned transaction has had its timestamp pushed, or if it has the `WriteTooOld` flag set. If so, a transaction retry error is synthesized and sent up the `TxnCoordSender`'s interceptor stack. Note that `TxnCoordSender` also expects to learn about the intents which were written and abuses the `client.Sender` interface to this effect; these semantics are kept, though we will consider widening the interface between `TxnCoordSender` and `DistSender` during implementation to account for refreshes.

If the transaction *can* commit, the `txnCommitter` returns the final transaction to the client and asynchronously sends an `EndTransactionRequest` that finalizes the transaction record (and, as a side effect, eagerly resolves the intents). Expected responses to the `EndTransactionRequest` are RPC errors (timeouts, etc.) and success. In particular, we can assert that the commit is not rejected.

When adding in this new interceptor, we will also move `txnIntentCollector` below the `txnPipeliner`, so that the new order between them becomes: `txnPipeliner` -> `txnIntentCollector` -> `txnCommitter`. Getting the refresh behavior correct may also require us to move `txnSpanRefresher` above all three of these interceptors. There are no known issues with making those rearrangements.

### DistSender

`DistSender` is adjusted to allow `EndTransaction` requests to be sent in parallel with other requests if their status is anything other than `COMMITTED`. In practice, this will never be used with `PENDING` or `ABORTED` statuses, but it means that `EndTransaction` requests with a `STAGED` status can be sent in parallel with the other requests in their batch.

### Replica

The `txnCommitter` is too high up the stack to know whether a committing `EndTransaction` request can skip the `STAGED` status and move straight to a `COMMITTED` status. This is the case when all promised writes in the request are on the same range as the transaction record. This detection could be performed in the `DistSender`, but that's not desirable because it leaks too much knowledge about transactions into the `DistSender`.

Instead, Replica is made aware of `EndTransaction` requests with the `STAGED` status. In a similar fashion to how Replica detects 1-phase commit batches, it is made to check for `STAGED` `EndTransaction` requests that can skip the `STAGED` phase and move straight to `COMMITTED`. When the request finishes, it will return to the txn coordinator, informing it of the new transaction status.
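A minimal sketch of that Replica-level detection follows; `containsAll` and the field names are hypothetical, and in reality this check would happen during request evaluation, alongside the existing 1PC detection.

```go
// maybeUpgradeToCommitted sketches the Replica-side fast path. The
// containsAll helper and field names are hypothetical.
func maybeUpgradeToCommitted(et *EndTransactionRequest, desc RangeDescriptor) {
	if et.Status != STAGED {
		return
	}
	// If every promised write is on the same range as the transaction
	// record, the whole batch applies atomically (or not at all), so
	// the staging step proves nothing extra and can be skipped.
	if containsAll(desc, et.PromisedWrites) {
		et.Status = COMMITTED
	}
}
```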
### Status Resolution Process

The status resolution process deals with the case in which `DistSender` writes a `STAGED` transaction record but fails to finalize it, leaving behind an abandoned transaction (i.e. one whose record hasn't received a heartbeat recently).

Its goal is to either abort the transaction or to determine that it is in fact committed, by trying to prevent one of the transaction's declared *promised writes* of the final batch, which are stored in the `STAGED` transaction record.

The status resolution process is triggered by a reader or writer which encounters an intent, or by the GC queue as it tries to remove old transaction records. In today's code, they issue a `PushTxnRequest` which may be held up by the txn wait queue.

With the introduction of the `STAGED` transaction status, push requests may fail even though the pushee is abandoned, and so the txn wait queue (on the leaseholder of the range to which the `STAGED` transaction is anchored) triggers the *status resolution process* when a transaction record is discovered as `STAGED` and abandoned, making sure to have only one such process in flight per transaction record.

#### PushTxn

`PushTxn` can't simply change the status of a `STAGED` transaction. Instead, it returns such transactions verbatim. All consumers of `PushTxnResult` are updated to deal with this outcome, and must call into the status resolution machinery instead.

An alternative to be considered is to return an error instead. This doesn't work well for batches of push requests, though. If during implementation we decide on the first option, we may also consider removing `TransactionPushError` in the process.

#### QueryIntent(Prevent=true)

At the heart of the process is trying to prevent an intent of the transaction from being laid down (which is only possible if it isn't already there) at the provisional commit timestamp. To achieve this efficiently, we introduce a new point read request, `QueryIntent`, which optionally prevents missing intents from ever being written in the future. This request populates the timestamp cache (as any read does) and returns whether there is an intent at the given key for the specified transaction and timestamp, with *at least* the specified sequence number. We don't check the exact sequence number because a batch could contain overlapping writes, in which case only the latest sequence number matters. If we trust that `PromisedWrites` has been faithfully populated, checking for "greater than or equal" is equivalent to (but simpler than) computing the sequence number of the last write to each key and checking for equality.

The request also returns whether there was an intent of the transaction *above* the expected timestamp. If this happens, the transaction has restarted or been pushed, and this should instruct the caller to check the transaction record for a new update (since status resolution isn't kicked off unless a transaction looks abandoned, this may not be worth it in practice).

As an optimization, we might return a structured error when an intent was prevented (but still populate the timestamp cache), to short-circuit execution of the remainder of the batch.

#### The Algorithm

1. wait until the transaction is abandoned (to avoid spurious aborts) -- this is done by the txn wait queue in the common path.
1. retrieve the *promised writes* from the `STAGED` transaction record, and note the transaction ID and provisional commit timestamp.
1. construct a batch at the provisional commit timestamp that contains a `QueryIntent(Prevent=true)` for each *promised write*, and includes the provisional commit timestamp and the transaction ID.
1. Run the batch, and act depending on the outcome (see the sketch after this list):
   1. if an intent was prevented, abort the transaction. But note a subtlety: the transaction may have restarted and written a new `STAGED` record with a higher provisional commit timestamp, and may now be (implicitly or explicitly) committed. In this case our aborting `EndTransaction` would fail, as it checks the provisional commit timestamp against the record's.
   1. on other errors, retry as appropriate, but check the transaction record for updates before each retry.
   1. if all intents were found in place, propose a committing `EndTransaction`. Note that the fact that we discovered the transaction as committed implies that the original client, if still alive, can only ever issue an `EndTransaction` that attempts to `COMMIT` the transaction as well.
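As a Go sketch (hypothetical request constructors and helpers; the real process is driven by the txn wait queue), the algorithm reads:

```go
// resolveStatus sketches the status resolution process for an abandoned
// STAGED transaction record. All names are hypothetical.
func resolveStatus(ctx context.Context, rec TxnRecord) error {
	// One QueryIntent(Prevent=true) per promised write, evaluated at
	// the provisional commit timestamp noted in the record.
	var ba BatchRequest
	for _, w := range rec.PromisedWrites {
		ba.Add(&QueryIntentRequest{
			Key:       w.Key,
			TxnID:     rec.ID,
			Timestamp: rec.Timestamp,
			Sequence:  w.Seq,
			Prevent:   true,
		})
	}
	resp, err := send(ctx, ba)
	if err != nil {
		// On other errors, re-check the transaction record before retrying.
		return checkRecordAndRetry(ctx, rec)
	}
	if resp.AnyPrevented() {
		// The transaction can no longer become implicitly committed at
		// this timestamp; try to abort it. This may fail if the txn
		// restarted with a higher provisional commit timestamp.
		return abortTxn(ctx, rec)
	}
	// All promised writes are in place: the commit condition holds and
	// the transaction must be marked COMMITTED.
	return commitTxn(ctx, rec)
}
```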
#### Transaction record recreation

The status resolution mechanism in conjunction with the eager transaction GC path can lead to transaction records being recreated in the wrong status. There are two kinds of eager GC (though the difference is immaterial here): first, transaction records which on commit have only intents on the same range are deleted immediately; second, after commit, when the intent resolver has resolved the external intents, the transaction record is deleted (via an extra consensus write).

For example, take an *implicitly committed* transaction. While a status resolution is in progress and has failed to prevent any of the intents, a concurrent writer discovers one of the intents and prepares to push the transaction. Before it does that, the status resolution succeeds, the transaction is committed, the intents are resolved, and the transaction record is removed thanks to eager GC. When the push is invoked, it recreates the transaction record as `ABORTED`. This is not an anomaly, because the intent is at this point already committed, but it's on dangerously thin ice: the aborter is now likely to assume that it has actually aborted the competing transaction. In the [improving ambiguous results](#improving-ambiguous-results) section, a transaction tries to abort its own record to avoid having to return an ambiguous result to the SQL client. With the above race, it could erroneously return a transaction aborted error even though the transaction actually committed.

To address this race, we will introduce a distinction between committing `EndTransactionRequest`s issued by the status resolution process and those issued by `TxnCoordSender`. The transaction record will only be deleted after the `TxnCoordSender` has sent its commit, or through the GC queue.

#### Examples

Assume that the transaction promised writes at `k1`, `k2`, and `k3` and is anchored at `k1`. `t1` is its provisional commit timestamp when the record is discovered.

##### Missing intent (at k2)

```
QueryIntent(k1, t1) found    |Abort(k1, t1) ok GC(k1) ok
QueryIntent(k2, t1) prevented|
QueryIntent(k3, t1) found    |
```

##### Missing intent racing with restart

An intent at t1 is prevented, but before the transaction record can be aborted, the transaction restarts and manages to (implicitly) commit. As the abort fails, the resolution process observes a transaction record with new activity and waits, observing the commit shortly thereafter. Note that the transaction's record would receive heartbeats throughout this process, so that status resolution would not be kicked off in the first place under normal conditions.
```
QueryIntent(k1, t1) found    |Write(k1, t2) ok        |
QueryIntent(k2, t1) prevented|{Write,Stage}(k2, t2) ok|     |Commit(k2) ok GC(k2) ok
QueryIntent(k3, t1) found    |Write(k3, t2) ok        |
                             |Abort(k1, t1)           |fail |Wait ok
                                                              ↑
                                        sees either no txn record or a committed
                                        one; either way, not `STAGED` any more
```

In the other scenario, the abort would have beaten the `Stage(k2)` and the transaction would be aborted.

##### Multiple status resolutions racing

Status resolution only writes a single key (the transaction record) and does so conditionally, and intents are only resolved after this conditional write. As a result, having multiple status resolutions racing does not cause anomalies.

In the history below, two status resolutions race so that the second `QueryIntent` sees the committed version (and concludes that it has prevented the write). As it tries to update the transaction record, it fails and realizes that the transaction is now committed.

```
QueryIntent(k1, t1) found Commit(k1, t1) Resolve(k1) ok|
QueryIntent'(k1, t1)                                   |prevented Abort(k1, t1) fail
```

### Performance Implications

There is an extra write to the WAL due to the newly introduced intermediate transaction status, and as a result, throughput may decrease (which is bad) while latencies decrease (which is good). In the long term, we may be able to piggyback the committing `EndTransaction` on other Raft traffic to recoup that loss. See #22349, which discusses similar ideas in the context of intent resolution.

### No-Op Writes

The set of **promised writes** are writes that *need to leave an intent*. This is in contrast to the remaining *intent spans*, which only need to be a *superset* of the actually written intents. Today, this holds: any successful "write" command leaves an intent.

But it is a restriction for the future: we must never issue [no-op writes](https://github.com/cockroachdb/cockroach/issues/23942) in the final batch of a transaction, and appropriate care must be taken to preserve this invariant. As of https://github.com/cockroachdb/cockroach/commit/9d7c35f, we assert against these no-op writes for successful point writes. Even with this added level of protection, we'll need to remember never to issue no-op writes as promised writes.

### Error handling

Today, a client transaction could issue a `ConditionalPut` in the final batch, receive a `ConditionFailedError`, and decide to try a different write afterwards. This becomes strictly illegal with parallel commits, as it allows for races that cause anomalies. Once a `STAGED` transaction record is sent, the *promised writes* (for that epoch) must not change. A client must restart or roll back the transaction following any error (but note that the client could use a read instead of the conditional put in the first place, which is slightly less performant).

Care must be taken to guard against this. For example, SQL `UPSERT` handling might be susceptible to such bugs.
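One way to guard this invariant at the coordinator is sketched below. The state (`stagedAtEpoch`) and enforcement point are hypothetical; where exactly such a check belongs is left to implementation.

```go
// Hypothetical coordinator-side guard: once a STAGED record has been
// written for an epoch, no further writes may be issued in that epoch.
func (tc *txnCoordinator) maybeRejectWrite(epoch uint32) error {
	if tc.stagedAtEpoch != nil && *tc.stagedAtEpoch == epoch {
		return errors.New(
			"writes after a staged commit are illegal; restart or roll back the txn")
	}
	return nil
}
```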
### Improving Ambiguous Results

At the `DistSender` level, we may see more ambiguous commits. An ambiguous commit occurs when a commit operation (which includes an *implicit commit*) was sent but it is not known whether it was processed or not. Since we send more writes in parallel with the commit after this change, the chance of such errors increases. We can make use of the `STAGED` status to improve on this as necessary.

After an ambiguous operation, we run the status resolution process (skipping the transaction record lookup; we may not have written our staged record, but we know the promised writes and provisional commit timestamp anyway). If we manage to prevent a write, we may return a transaction retry error to the client instead. Otherwise, if there is a transaction record, we can also run the process hoping for a positive outcome.

Note that we must not simply try to abort our own transaction record. The record may have become *implicitly committed* and then *explicitly committed* by a status resolution process, and then garbage collected. (And even if the record weren't garbage collected, a status resolution process may have proven that the record was committed but would then find it aborted, which is a confusion we'd be wise to avoid.)

### Metrics

Metrics will be added for the number of status resolution attempts as well as the number of final batches eligible/not eligible for parallel commit. If the latter metric is substantial, we will need to lift some of the conditions around eligibility.

### Interplay with #26599 (transactional write pipelining)

After #26599, when the transaction tries to commit, there may be outstanding in-flight RPCs from earlier batches. Instead of waiting for those to complete, the `client.Txn` will pass promised writes along with the final batch, which `DistSender` will include in the staged transaction record and also send out as `QueryIntent`s (which will be treated as writes, i.e. they may end up being Raft no-ops or batched up with another request to a range -- that is OK, but needs to be tested). There are a few possible outcomes:

- in the likely case, none of the intents will be prevented (and they are found at the right timestamps), as they have been dispatched earlier; the `QueryIntent`s succeed, as does the rest of the batch, and `DistSender` announces success to the client.
- an intent is prevented or found at the wrong timestamp. This is treated like a write having failed or having pushed the transaction, respectively.

Care will need to be taken in splitting batches with `EndTransaction(status=STAGED)` requests from `QueryIntent` requests for pipelined writes to the same range as the txn record. If care is not taken, the `EndTransaction` request will be blocked by the `spanlatch.Manager` with the rest of its batch, and no speedup will be observed by the client. In fact, in conjunction with the [Replica-level detection](#replica), this would behave almost identically to how the case behaves today.

This will require that we allow multiple batches to the same range to be sent in parallel.
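For illustration, folding the in-flight pipelined writes into the final batch could look roughly like this; all names are hypothetical, and whether `Prevent` is set for these commit-time checks is an implementation detail (see #26599).

```go
// attachInFlightWrites sketches how earlier, still-unproven pipelined
// writes could join the final batch. All names are hypothetical.
func attachInFlightWrites(
	ba BatchRequest, inFlight []SequencedWrite, txnID UUID, ts Timestamp,
) BatchRequest {
	for _, w := range inFlight {
		// Each in-flight write becomes a QueryIntent in the final batch
		// and a promised write in the STAGED transaction record.
		ba.Add(&QueryIntentRequest{
			Key:       w.Key,
			TxnID:     txnID,
			Timestamp: ts,
			Sequence:  w.Seq,
			Prevent:   true,
		})
	}
	return ba
}
```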
## Drawbacks

- The complexity introduced here is nontrivial; it's especially worrying to share the knowledge of whether a transaction is committed between `DistSender` and `EndTransaction`. The system becomes much more prone to transactional anomalies when unexpected commits/aborts are sent, or when clients retry inappropriately. New invariants are added and need to be protected.
- The extra WAL write will likely show up in some benchmarks. Avoiding it adds extra complexity, but this may be required when this mechanism is first introduced.

## Rationale and Alternatives

There does not appear to be a viable alternative to this design. The impact of not doing this is to accept that commit latencies remain at roughly double what they could be.

### Extended proposal

There is an extended form of this proposal which allows the intents to be resolved in parallel with the commit (as opposed to after it). This alternative is presently considered out of scope, as it requires transaction IDs to be embedded into all committed versions (this is required for the status resolution process). Doing so requires a significant engineering effort, but few of the details in this proposal change.

## Unresolved questions

No fundamental questions presently.