github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20151111_txn_gc.md

- Feature Name: txn-gc
- Status: completed
- Start Date: 2015-11-12
- RFC PR: [#3100](https://github.com/cockroachdb/cockroach/pull/3100)
- Cockroach Issue: [#2062](https://github.com/cockroachdb/cockroach/issues/2062)

# Summary

* Rename `{Response->Sequence}Cache` and `Transaction{Table->Cache}`.
* Store the client's intents in the `Txn` record on (successful) `EndTransaction`.
* GC transaction records asynchronously following the client's `EndTransaction` call, after successful resolution of any outstanding intents.
* GC sequence cache entries with the `ResolveIntent{,Range}` requests that are carried out as a consequence of the client's `EndTransaction` (successful or not).
* Check the sequence cache on non-writes as well, and store the epoch along with the sequence (to allow for the next point).
* Poison the sequence cache when aborting an intent after a `Push` during `ResolveIntent{,Range}`, preventing the anomaly in #2231 (own writes vanishing).
* Keep the "slow" GC queue, which cleans up after abandoned txns or node crashes.

# Motivation

Both transaction and sequence cache records should be deleted when they're no longer useful, ideally without introducing extra work. The procedure outlined here accomplishes that in the vast majority of cases (including all non-abandoned transactions).

# Detailed design

Refresher:

* there is exactly one txn record on a single range (the one that the
  transaction is initiated on via `BeginTransaction`).
* there is exactly one sequence cache entry for each `Range` mutated by the txn
  with a sequence counter (mutation with non-increasing sequence number triggers
  a txn restart).
* in the absence of a transaction record, a `Push` always aborts the transaction.

## GCing Txn Records

The above means that we can always garbage collect aborted transactions with only a best-effort attempt to clean up their intents (but we'll do so only after the client's `EndTransaction` or, if that never happens, the "slow way" via the GC queue; see below).
For committed transactions, we must guarantee that no open intents exist before deleting the entry (we already synchronously resolve all intents local to the transaction record and GC the record right away if no external intents exist). The straightforward solution is to have `EndTransaction` persist the external intents on the transaction record and let the goroutine which resolves them asynchronously do a little more work: after successfully carrying out the batch of `ResolveIntent` requests, it can delete the corresponding txn record.
All of this is best effort: we're still going to have a GC queue which walks over old transaction entries, poking old transactions and retrying their intent resolution for the .0001% of transactions which are left hanging.

## GCing Sequence Cache Entries

* it's safe to remove a sequence cache entry when we know the transaction isn't running any more. That's after the client's `EndTransaction` gets executed (regardless of its outcome), but not when a concurrent transaction manages to abort it by means of a `Push`.
* there are sequence cache entries on any range mutated by the transaction, and we'll be sending `ResolveIntent` there anyway, so we simply make (idempotently) clearing the sequence cache entry a side effect of a `ResolveIntent{,Range}` (when it's carried out as part of `EndTransaction`).
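
A minimal sketch of that side effect, under stated assumptions: `Range`, its two maps, and the `fromEndTxn` flag are hypothetical and only illustrate the rule that the entry is cleared when the resolution stems from `EndTransaction` and left alone otherwise.

```go
package main

import "fmt"

// Range is a hypothetical, simplified range holding intents and a sequence
// cache keyed by transaction ID.
type Range struct {
	intents  map[string]string // key -> ID of the txn owning the intent
	seqCache map[string]int64  // txn ID -> last seen sequence number
}

// ResolveIntent removes the intent on key if txnID owns it. When the request
// stems from the client's EndTransaction (fromEndTxn), the txn is known to be
// finished, so its sequence cache entry is cleared as well. Deleting a missing
// map entry is a no-op, which makes the cleanup idempotent.
func (r *Range) ResolveIntent(key, txnID string, fromEndTxn bool) {
	if owner, ok := r.intents[key]; ok && owner == txnID {
		delete(r.intents, key)
	}
	if fromEndTxn {
		delete(r.seqCache, txnID)
	}
}

func main() {
	r := &Range{
		intents:  map[string]string{"a": "txn1", "b": "txn1"},
		seqCache: map[string]int64{"txn1": 7},
	}
	r.ResolveIntent("a", "txn1", false) // not from EndTransaction: entry is not cleared
	fmt.Println(len(r.intents), len(r.seqCache))
	r.ResolveIntent("b", "txn1", true) // from EndTransaction: entry cleared
	r.ResolveIntent("b", "txn1", true) // repeating the request is harmless
	fmt.Println(len(r.intents), len(r.seqCache))
}
```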

## Intentional Sequence Cache Poisoning

On a `ResolveIntent` triggered through an aborting `Push`, we can actually deal with #2231 nicely. The issue there is that a running transaction may not know that it has already been aborted, which leads to anomalies related to the fact that its intents may be gone (so it may not read what it wrote). The key, again, is `ResolveIntent{,Range}`:

* store not only the sequence number, but also the minimal expected epoch.
* `(epoch, seq) < (epoch', seq') iff epoch < epoch' || (epoch == epoch' && seq < seq')`.
* upon aborting an intent after a `Push`, we simply poison the sequence cache on that range (setting `sequence=math.MaxInt64`). Assuming that we check the sequence cache on **every** batch (not only for writes), we trigger a transaction restart should the transaction come back to the `Range`. If checking the sequence cache on reads turns out to be a performance problem, there are going to be ways to avoid disk I/O in most cases.
  The retry increases the epoch, so when the txn comes back, it will be able to perform normally.
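
The ordering and the poisoning rule can be sketched together. The names `SeqEntry`, `SeqCache`, `Poison`, and `Admit` are hypothetical, and the sketch assumes that every batch carries a fresh, increasing sequence number and that only writes record their entry.

```go
package main

import (
	"fmt"
	"math"
)

// SeqEntry is a hypothetical sequence cache entry: the minimal expected epoch
// plus the last seen sequence number.
type SeqEntry struct {
	Epoch uint32
	Seq   int64
}

// Less implements the lexicographic order on (epoch, seq) given above.
func (a SeqEntry) Less(b SeqEntry) bool {
	return a.Epoch < b.Epoch || (a.Epoch == b.Epoch && a.Seq < b.Seq)
}

// SeqCache maps a txn ID to its entry on one range.
type SeqCache map[string]SeqEntry

// Poison makes every further batch of the given epoch fail the check.
func (c SeqCache) Poison(txnID string, epoch uint32) {
	c[txnID] = SeqEntry{Epoch: epoch, Seq: math.MaxInt64}
}

// Admit is consulted on every batch, reads included. It returns false when the
// incoming (epoch, seq) is not strictly larger than the stored entry, which
// means the txn must restart. Only writes record their entry.
func (c SeqCache) Admit(txnID string, in SeqEntry, write bool) bool {
	if cur, ok := c[txnID]; ok && !cur.Less(in) {
		return false
	}
	if write {
		c[txnID] = in
	}
	return true
}

func main() {
	c := SeqCache{}
	fmt.Println(c.Admit("txn1", SeqEntry{Epoch: 0, Seq: 1}, true)) // first write passes
	c.Poison("txn1", 0)                                             // intent aborted after a Push
	fmt.Println(c.Admit("txn1", SeqEntry{Epoch: 0, Seq: 2}, false)) // a read in epoch 0 must restart
	fmt.Println(c.Admit("txn1", SeqEntry{Epoch: 1, Seq: 1}, true))  // epoch 1 proceeds normally
}
```

Because `math.MaxInt64` is the largest possible sequence number, no batch of the poisoned epoch can compare greater than the stored entry, while any batch of a later epoch does.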

## Interaction with Splits and Merges

On both Split and Merge we'll copy the entry (keeping the larger one on collision).
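
A minimal sketch of the copy rule, assuming the `(epoch, seq)` ordering from the previous section decides which entry is "larger". `SeqEntry`, `mergeSeqCaches`, and `splitSeqCache` are hypothetical names, and the caches are plain maps.

```go
package main

import "fmt"

// SeqEntry is a hypothetical sequence cache entry ordered by (epoch, seq).
type SeqEntry struct {
	Epoch uint32
	Seq   int64
}

// Less orders entries lexicographically by epoch, then sequence.
func (a SeqEntry) Less(b SeqEntry) bool {
	return a.Epoch < b.Epoch || (a.Epoch == b.Epoch && a.Seq < b.Seq)
}

// mergeSeqCaches copies every entry of src into dst, keeping the larger entry
// when both sides hold one for the same txn.
func mergeSeqCaches(dst, src map[string]SeqEntry) {
	for id, e := range src {
		if cur, ok := dst[id]; !ok || cur.Less(e) {
			dst[id] = e
		}
	}
}

// splitSeqCache returns a copy of the cache for the new right-hand range;
// the left-hand cache is left untouched.
func splitSeqCache(left map[string]SeqEntry) map[string]SeqEntry {
	right := make(map[string]SeqEntry, len(left))
	mergeSeqCaches(right, left)
	return right
}

func main() {
	left := map[string]SeqEntry{"txn1": {Epoch: 1, Seq: 3}}
	right := splitSeqCache(left)
	right["txn1"] = SeqEntry{Epoch: 1, Seq: 5} // the txn advanced on the right side
	mergeSeqCaches(left, right)
	fmt.Println(left["txn1"])
}
```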

## "Slow" GC Path

The slow path to sequence cache GC takes place in the following situations:

* a transaction is abandoned (so a list of intents is never sent by the client)
* a sequence cache entry is duplicated during `Split` to a `Range` not touched by `ResolveIntent{,Range}` for its transaction.

In the same queue which grooms the transaction cache, we'll also groom the local sequence cache with the goal of finding "inactive" entries, pinging their transaction and removing them according to the outcome. To be able to do that, we need to persist more information in the sequence cache key:

* `Txn.Key` (to get the range which holds the transaction entry, since we only have `Txn.ID` from the sequence cache entry), and
* `Txn.Timestamp` (to figure out whether this is "probably" an inactive transaction)

Some of the additional overhead could be avoided if transaction IDs encoded some of that information. For example, instead of UUID4 transaction IDs we could adopt the scheme `<hlc_wall=64bit,hlc_logical=32bit><random=32bit>`, but the entropy is considerably lower. This is out of scope for this RFC.
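
For illustration only, the 128-bit layout named above could be packed and unpacked as follows. The byte order and the function names `newTxnID` and `timestampOf` are assumptions; the RFC does not specify an encoding.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// newTxnID lays out a 128-bit ID as <wall:64><logical:32><random:32>,
// big-endian (the byte order is an assumption made for this sketch).
func newTxnID(wall int64, logical int32) ([16]byte, error) {
	var id [16]byte
	binary.BigEndian.PutUint64(id[0:8], uint64(wall))
	binary.BigEndian.PutUint32(id[8:12], uint32(logical))
	if _, err := rand.Read(id[12:16]); err != nil {
		return id, err
	}
	return id, nil
}

// timestampOf recovers the HLC components embedded in the ID.
func timestampOf(id [16]byte) (wall int64, logical int32) {
	wall = int64(binary.BigEndian.Uint64(id[0:8]))
	logical = int32(binary.BigEndian.Uint32(id[8:12]))
	return wall, logical
}

func main() {
	id, err := newTxnID(1447286400000000000, 3)
	if err != nil {
		panic(err)
	}
	wall, logical := timestampOf(id)
	fmt.Printf("%x -> wall=%d logical=%d\n", id[:], wall, logical)
}
```

Only 32 of the 128 bits are random here, which is the entropy loss the paragraph above refers to.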

# Drawbacks

Checking the sequence cache on reads could show up during performance tuning (though that is not necessarily expected), in which case some extra caching should do the trick to avoid I/O.

Likewise, deleting the txn entries may need some batching for performance (to save Raft proposals; again straightforward to do).

# Alternatives

The original design proposed keeping track of the cluster-wide oldest intent's timestamp, which would allow all txn entries older than that timestamp to be GC'ed. That mechanism, with its global characteristics, doesn't seem preferable to the one outlined above (especially since the latter introduces little complexity and no significant performance hits or new RPCs), and it does not immediately solve #2231.

# Unresolved questions