github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/tech-notes/index-backfill.md (about) 1 # Index backfills 2 3 This document explains why index backfills using historical reads and retry 4 transactions are always correct. 5 6 Index backfills work as follows: 7 8 1. Wait until index converges to write/delete state. This means that every 9 write and delete to the primary table gets reflected in the index. 10 2. Fix an MVCC timestamp, `tScan`. This timestamp will be used to perform scans 11 in the fast path, which are fast because historical scans don't need to 12 acquire latches that could conflict with foreground traffic. 13 3. Backfill in chunks of size `n`, using DistSQL. The chunks should be distinct, 14 but the chunk procedure works even if they're not. Each chunk performs steps 15 3 to the end, occurring in parallel. 16 4. Do a KV scan over the span handed to the chunk by DistSQL, at timestamp 17 `tScan`. Limit to the `n` chunk size. Create index entries for every found 18 key. 19 5. Send `InitPut` commands for all of the index entries. These commands succeed 20 when the index key's value is empty (and never written to, not even a 21 tombstone) or when the key's value is identical to that of the index entry. 22 The fast path is when all of the commands succeed - then the chunk is done. 23 6. If any of the `InitPut` commands fail, revert to the slow path. Create a new 24 transaction at `tNow`, in which the scan is re-run from scratch, index 25 entries are rebuilt with data at `tNow`, and `InitPut`s are sent in the same 26 transaction. These `InitPut` commands are set to *not fail* if a tombstone is 27 encountered, to resolve conflicts generated during the `DELETE_ONLY` state. 28 If this transaction fails due to an `InitPut` failure, that indicates an 29 actual uniqueness violation or other error and the schema change is 30 reversed. 31 32 ## Why is this algorithm correct? 33 34 If `InitPut` with index values created with data from `tScan` succeeds (in 35 other words, sees either an empty entry or an entry with identical data no 36 matter what the MVCC timestamp), the index entry is always correct. To see, 37 let's look case by case. 38 39 We'll use the following table as an example throughout, where `k` is the 40 primary key and assuming we're building a unique index on `v`. Since we're 41 building a unique index, the key of the index will be `v` by itself. The value 42 of the index will be `k`. 43 44 Further, let's say the schema change flips into delete mode at time `tDelete` 45 and write/delete mode at time `tWrite`, where `tScan > tWrite > tDelete > 0`. 46 Recall `tScan` is a fixed timestamp chosen for the backfill's historical reads. 47 48 | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) | 49 |-----|-----|----------------------|---------------|---------------|------------------------------| 50 | `1` | `a` | 0 | | | | 51 | `2` | `b` | `tWrite+1` | `b` | `2` | `tWrite+1` | 52 | `3` | `c` | 0 | `c` | `(tombstone)` | `tScan+1` | 53 | `3` | `d` | `tScan+1` | `d` | `3` | `tScan+1` | 54 | `4` | `e` | 0 | | | | 55 | `5` | `e` | `tScan+1` | `e` | `5` | `tScan+1` | 56 | `6` | `f` | 0 | | | | 57 | `6` | ` ` | `tScan+1` | `f` | `(tombstone)` | `tScan+1` | 58 | `7` | `g` | `0` | | | | 59 | `8` | `g` | `tScan+1` | `g` | `8` | `tScan+1` | 60 | `9` | `h` | `0` | | | | 61 | `9` | ` ` | `tDelete+1` | `h` | `(tombstone)` | `tDelete+1` | 62 | `9` | `h` | `tDelete+2` | `h` | `(tombstone)` | `tDelete+1` | 63 64 ## The `InitPut` succeeds 65 66 There are two scenarios where `InitPut` succeeds - there's no existing value or 67 the existing value is identical to the expected value. 68 69 ### The `InitPut` sees an empty entry 70 71 In this case, the index entry found by `InitPut` has never been written to. 72 Since we know that all operations on the table at or after `tScan` cause updates 73 to the index, an absent index entry indicates that the primary table entry has 74 not been updated since `tScan` and is therefore safe to echo to the index. 75 76 For example, when the backfiller encounters the pair `1` `a`, the corresponding 77 index entry is empty, so it proceeds to write. 78 79 | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) | 80 |-----|-----|----------------------|---------------|---------------|------------------------------| 81 | `1` | `a` | 0 | `a` | `1` | `after tScan` | 82 83 ### The `InitPut` sees a non-empty entry with identical value bytes 84 85 In this case, the index entry found by `InitPut` has a value identical to the 86 key of the table. There are just two ways this could happen, both of which are 87 ok: 88 89 1. A table entry was written after `tWrite` but before `tScan`, leaving behind 90 a corresponding index entry. This case is safe since we're just repeating 91 previous work. 92 2. A backfiller dies before it checkpoints, leaving behind an index entry. When 93 the backfiller restarts, it encounters its previously written entry. This 94 case is safe since we're again just repeating previous work. 95 96 There's never a case where this would accidentally cause a uniqueness 97 constraint violation, since the index entry contains the (guaranteed unique) 98 primary key of the table in the value. A uniqueness constraint violation always 99 manifests as an unexpected value for a key. 100 101 For example, when the backfiller encounters the pair `2` `b`, the corresponding 102 index entry is `b` `2`, so the backfiller does nothing and moves on. 103 104 | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) | 105 |-----|-----|----------------------|---------------|---------------|------------------------------| 106 | `2` | `b` | `tWrite+1` | `b` | `2` | `tWrite+1` | 107 108 ## The `InitPut` fails 109 110 In all other cases, the `InitPut` will fail. In these cases, we will retry the 111 operation transactionally with a fresh read to determine whether to return a 112 uniqueness constraint violation or to write a more up to date value. 113 114 ### The `InitPut` fails due to a changed table value 115 116 In the case where a table entry's value changes, the backfill encounters a 117 tombstone and fails. When retried, the backfill will generate a new index entry 118 with the correct key and write it. 119 120 For example, when the pair `3` `c` is encountered, the `InitPut` sees a 121 tombstone and fails. When retried transactionally, the pair is determined to 122 actually be `3` `d`. The transactional `InitPut` then does nothing since the 123 entry was already in the index. 124 125 | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) | 126 |-----|-----|----------------------|---------------|---------------|------------------------------| 127 | `3` | `c` | 0 | `c` | `(tombstone)` | `tScan+1` | 128 | `3` | `d` | `tScan+1` | `d` | `3` | `tScan+1` | 129 130 ### The `InitPut` fails due to a changed table key 131 132 In the case where a table entry's key changes, the backfill encounters a 133 mismatched value and fails. When retried, the backfill won't see the old key 134 and will do nothing. This is correct because the new entry will have already 135 been written by the table update, since it's occurred after `tWrite`. 136 137 For example, when the pair `4` `e` is encountered, the `InitPut` sees a 138 `5` when it expects a `4` and fails. When retried transactionally, there's no 139 longer an entry for `4`. 140 141 | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) | 142 |-----|-----|----------------------|---------------|---------------|------------------------------| 143 | `4` | `e` | 0 | | | | 144 | `5` | `e` | `tScan+1` | `e` | `5` | `tScan+1` | 145 146 ### The `InitPut` fails due to a deleted entry 147 148 In the case where a table entry was deleted, the backfill encounters a tombstone 149 and fails. When retried, the backfill won't see the deleted entry and will 150 therefore do nothing. 151 152 For example, when the pair `6` `f` is encountered, the `InitPut` sees a 153 tombstone and fails. When retried transactionally, there's no longer an entry 154 for `6`. 155 156 | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) | 157 |-----|-----|----------------------|---------------|---------------|------------------------------| 158 | `6` | `f` | 0 | | | | 159 | `6` | ` ` | `tScan+1` | `f` | `(tombstone)` | `tScan+1` | 160 161 ### The `InitPut` fails due to a duplicate value 162 163 In the case where an entry was written that violates a uniquness constraint by 164 duplicating a value that had been written before `tWrite`, the backfill 165 encounters a mismatched value and fails. When retried, the entry is still there 166 and fails `InitPut` again, which fails the backfill. 167 168 For example, when the pair `7` `g` is encountered, the `InitPut` sees an `8` 169 when it expects a `7` and fails. When retried transactionally, the `InitPut` 170 still sees an `8` and fails again, correctly halting the backfill with a real 171 uniqueness constraint violation. 172 173 | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) | 174 |-----|-----|----------------------|---------------|---------------|------------------------------| 175 | `7` | `g` | `0` | | | | 176 | `8` | `g` | `tScan+1` | `g` | `8` | `tScan+1` | 177 178 ### The `InitPut` fails due to a reinserted table value during delete only mode 179 180 In the case where a table entry is deleted and re-added when the cluster is 181 in the delete only state, the backfill encounters a tombstone value and fails. 182 When retried, the entry is still there but will not fail `InitPut` because 183 the transactional retries instruct `InitPut` to treat tombstone values the same 184 as absent values. 185 186 For example, when the pair `9` `h` is encountered, the `InitPut` sees a 187 tombstone when it expects a `9` and fails. When retried transactionally, the 188 `InitPut` still sees a tombstone and succeeds. 189 190 | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) | 191 |-----|-----|----------------------|---------------|---------------|------------------------------| 192 | `9` | `h` | `0` | | | | 193 | `9` | ` ` | `tDelete+1` | `h` | `(tombstone)` | `tDelete+1` | 194 | `9` | `h` | `tDelete+2` | `h` | `(tombstone)` | `tDelete+1` | 195 | `9` | `h` | `tScan+1 | `h` | `9` | `tScan+1` | 196 197 ## Tests 198 199 The above scenarios are exercised in a unit test in 200 `pkg/sql/indexbackfiller_test.go`. If this tech note is updated, please ensure 201 that the updates are propagated to that test.