github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/tech-notes/index-backfill.md (about)

     1  # Index backfills
     2  
     3  This document explains why index backfills using historical reads and retry
     4  transactions are always correct.
     5  
     6  Index backfills work as follows:
     7  
     8  1. Wait until index converges to write/delete state. This means that every
     9     write and delete to the primary table gets reflected in the index.
    10  2. Fix an MVCC timestamp, `tScan`. This timestamp will be used to perform scans
    11     in the fast path, which are fast because historical scans don't need to
    12     acquire latches that could conflict with foreground traffic.
    13  3. Backfill in chunks of size `n`, using DistSQL. The chunks should be distinct,
    14     but the chunk procedure works even if they're not. Each chunk performs steps
    15     3 to the end, occurring in parallel.
    16  4. Do a KV scan over the span handed to the chunk by DistSQL, at timestamp
    17     `tScan`. Limit to the `n` chunk size. Create index entries for every found
    18     key.
    19  5. Send `InitPut` commands for all of the index entries. These commands succeed
    20     when the index key's value is empty (and never written to, not even a
    21     tombstone) or when the key's value is identical to that of the index entry.
    22     The fast path is when all of the commands  succeed - then the chunk is done.
    23  6. If any of the `InitPut` commands fail, revert to the slow path. Create a new
    24     transaction at `tNow`, in which the scan is re-run from scratch, index
    25     entries are rebuilt with data at `tNow`, and `InitPut`s are sent in the same
    26     transaction. These `InitPut` commands are set to *not fail* if a tombstone is
    27     encountered, to resolve conflicts generated during the `DELETE_ONLY` state.
    28     If this transaction fails due to an `InitPut` failure, that indicates an
    29     actual uniqueness violation or other error and the schema change is
    30     reversed.
    31  
    32  ## Why is this algorithm correct?
    33  
    34  If `InitPut` with index values created with data from `tScan` succeeds (in
    35  other words, sees either an empty entry or an entry with identical data no
    36  matter what the MVCC timestamp), the index entry is always correct. To see,
    37  let's look case by case.
    38  
    39  We'll use the following table as an example throughout, where `k` is the
    40  primary key and assuming we're building a unique index on `v`. Since we're
    41  building a unique index, the key of the index will be `v` by itself. The value
    42  of the index will be `k`.
    43  
    44  Further, let's say the schema change flips into delete mode at time `tDelete`
    45  and write/delete mode at time `tWrite`, where `tScan > tWrite > tDelete > 0`.
    46  Recall `tScan` is a fixed timestamp chosen for the backfill's historical reads.
    47  
    48  | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) |
    49  |-----|-----|----------------------|---------------|---------------|------------------------------|
    50  | `1` | `a` | 0                    |               |               |                              |
    51  | `2` | `b` | `tWrite+1`           | `b`           | `2`           | `tWrite+1`                   |
    52  | `3` | `c` | 0                    | `c`           | `(tombstone)` | `tScan+1`                    |
    53  | `3` | `d` | `tScan+1`            | `d`           | `3`           | `tScan+1`                    |
    54  | `4` | `e` | 0                    |               |               |                              |
    55  | `5` | `e` | `tScan+1`            | `e`           | `5`           | `tScan+1`                    |
    56  | `6` | `f` | 0                    |               |               |                              |
    57  | `6` | ` ` | `tScan+1`            | `f`           | `(tombstone)` | `tScan+1`                    |
    58  | `7` | `g` | `0`                  |               |               |                              |
    59  | `8` | `g` | `tScan+1`            | `g`           | `8`           | `tScan+1`                    |
    60  | `9` | `h` | `0`                  |               |               |                              |
    61  | `9` | ` ` | `tDelete+1`          | `h`           | `(tombstone)` | `tDelete+1`                  |
    62  | `9` | `h` | `tDelete+2`          | `h`           | `(tombstone)` | `tDelete+1`                  |
    63  
    64  ## The `InitPut` succeeds
    65  
    66  There are two scenarios where `InitPut` succeeds - there's no existing value or
    67  the existing value is identical to the expected value.
    68  
    69  ### The `InitPut` sees an empty entry
    70  
    71  In this case, the index entry found by `InitPut` has never been written to.
    72  Since we know that all operations on the table at or after `tScan` cause updates
    73  to the index, an absent index entry indicates that the primary table entry has
    74  not been updated since `tScan` and is therefore safe to echo to the index.
    75  
    76  For example, when the backfiller encounters the pair `1` `a`, the corresponding
    77  index entry is empty, so it proceeds to write.
    78  
    79  | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) |
    80  |-----|-----|----------------------|---------------|---------------|------------------------------|
    81  | `1` | `a` | 0                    | `a`           | `1`           | `after tScan`                |
    82  
    83  ### The `InitPut` sees a non-empty entry with identical value bytes
    84  
    85  In this case, the index entry found by `InitPut` has a value identical to the
    86  key of the table. There are just two ways this could happen, both of which are
    87  ok:
    88  
    89  1. A table entry was written after `tWrite` but before `tScan`, leaving behind
    90     a corresponding index entry. This case is safe since we're just repeating
    91     previous work.
    92  2. A backfiller dies before it checkpoints, leaving behind an index entry. When
    93     the backfiller restarts, it encounters its previously written entry. This
    94     case is safe since we're again just repeating previous work.
    95  
    96  There's never a case where this would accidentally cause a uniqueness
    97  constraint violation, since the index entry contains the (guaranteed unique)
    98  primary key of the table in the value. A uniqueness constraint violation always
    99  manifests as an unexpected value for a key.
   100  
   101  For example, when the backfiller encounters the pair `2` `b`, the corresponding
   102  index entry is `b` `2`, so the backfiller does nothing and moves on.
   103  
   104  | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) |
   105  |-----|-----|----------------------|---------------|---------------|------------------------------|
   106  | `2` | `b` | `tWrite+1`           | `b`           | `2`           | `tWrite+1`                   |
   107  
   108  ## The `InitPut` fails
   109  
   110  In all other cases, the `InitPut` will fail. In these cases, we will retry the
   111  operation transactionally with a fresh read to determine whether to return a
   112  uniqueness constraint violation or to write a more up to date value.
   113  
   114  ### The `InitPut` fails due to a changed table value
   115  
   116  In the case where a table entry's value changes, the backfill encounters a
   117  tombstone and fails. When retried, the backfill will generate a new index entry
   118  with the correct key and write it.
   119  
   120  For example, when the pair `3` `c` is encountered, the `InitPut` sees a
   121  tombstone and fails. When retried transactionally, the pair is determined to
   122  actually be `3` `d`. The transactional `InitPut` then does nothing since the
   123  entry was already in the index.
   124  
   125  | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) |
   126  |-----|-----|----------------------|---------------|---------------|------------------------------|
   127  | `3` | `c` | 0                    | `c`           | `(tombstone)` | `tScan+1`                    |
   128  | `3` | `d` | `tScan+1`            | `d`           | `3`           | `tScan+1`                    |
   129  
   130  ### The `InitPut` fails due to a changed table key
   131  
   132  In the case where a table entry's key changes, the backfill encounters a
   133  mismatched value and fails. When retried, the backfill won't see the old key
   134  and will do nothing. This is correct because the new entry will have already
   135  been written by the table update, since it's occurred after `tWrite`.
   136  
   137  For example, when the pair `4` `e` is encountered, the `InitPut` sees a
   138  `5` when it expects a `4` and fails. When retried transactionally, there's no
   139  longer an entry for `4`.
   140  
   141  | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) |
   142  |-----|-----|----------------------|---------------|---------------|------------------------------|
   143  | `4` | `e` | 0                    |               |               |                              |
   144  | `5` | `e` | `tScan+1`            | `e`           | `5`           | `tScan+1`                    |
   145  
   146  ### The `InitPut` fails due to a deleted entry
   147  
   148  In the case where a table entry was deleted, the backfill encounters a tombstone
   149  and fails. When retried, the backfill won't see the deleted entry and will
   150  therefore do nothing.
   151  
   152  For example, when the pair `6` `f` is encountered, the `InitPut` sees a
   153  tombstone and fails. When retried transactionally, there's no longer an entry
   154  for `6`.
   155  
   156  | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) |
   157  |-----|-----|----------------------|---------------|---------------|------------------------------|
   158  | `6` | `f` | 0                    |               |               |                              |
   159  | `6` | ` ` | `tScan+1`            | `f`           | `(tombstone)` | `tScan+1`                    |
   160  
   161  ### The `InitPut` fails due to a duplicate value
   162  
   163  In the case where an entry was written that violates a uniquness constraint by
   164  duplicating a value that had been written before `tWrite`, the backfill
   165  encounters a mismatched value and fails. When retried, the entry is still there
   166  and fails `InitPut` again, which fails the backfill.
   167  
   168  For example, when the pair `7` `g` is encountered, the `InitPut` sees an `8`
   169  when it expects a `7` and fails. When retried transactionally, the `InitPut`
   170  still sees an `8` and fails again, correctly halting the backfill with a real
   171  uniqueness constraint violation.
   172  
   173  | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) |
   174  |-----|-----|----------------------|---------------|---------------|------------------------------|
   175  | `7` | `g` | `0`                  |               |               |                              |
   176  | `8` | `g` | `tScan+1`            | `g`           | `8`           | `tScan+1`                    |
   177  
   178  ### The `InitPut` fails due to a reinserted table value during delete only mode
   179  
   180  In the case where a table entry is deleted and re-added when the cluster is
   181  in the delete only state, the backfill encounters a tombstone value and fails.
   182  When retried, the entry is still there but will not fail `InitPut` because
   183  the transactional retries instruct `InitPut` to treat tombstone values the same
   184  as absent values.
   185  
   186  For example, when the pair `9` `h` is encountered, the `InitPut` sees a
   187  tombstone when it expects a `9` and fails. When retried transactionally, the
   188  `InitPut` still sees a tombstone and succeeds.
   189  
   190  | `k` | `v` | `t` (MVCC timestamp) | `idx_k` (`v`) | `idx_v` (`k`) | `idx_t` (idx MVCC timestamp) |
   191  |-----|-----|----------------------|---------------|---------------|------------------------------|
   192  | `9` | `h` | `0`                  |               |               |                              |
   193  | `9` | ` ` | `tDelete+1`          | `h`           | `(tombstone)` | `tDelete+1`                  |
   194  | `9` | `h` | `tDelete+2`          | `h`           | `(tombstone)` | `tDelete+1`                  |
   195  | `9` | `h` | `tScan+1             | `h`           | `9`           | `tScan+1`                    |
   196  
   197  ## Tests
   198  
   199  The above scenarios are exercised in a unit test in
   200  `pkg/sql/indexbackfiller_test.go`. If this tech note is updated, please ensure
   201  that the updates are propagated to that test.