- Feature Name: online_schema_change
- Status: completed
- Start Date: 2015-10-14
- RFC PR: [#2842](https://github.com/cockroachdb/cockroach/pull/2842)
- Cockroach Issue: [#2036](https://github.com/cockroachdb/cockroach/issues/2036)

# Summary

Implement online schema change: adding or removing an index or column
on a table without blocking access to the table.

# Motivation

The existing schema change operations such as adding or removing an
index are performed as a single transaction. This approach was
convenient for the initial implementation but is infeasible for a
table with any significant amount of data. Additionally, the single
transaction approach necessitates reading the table descriptor on
every SQL operation, which is a significant performance bottleneck.

We will implement online schema change, which breaks down a high-level
schema change operation such as `CREATE INDEX` into a series of
discrete steps in such a way that user transactions are never blocked
from accessing the table, yet table and index data are never left in
an invalid state.

Online schema change will be built on top of [table descriptor
leases](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151009_table_descriptor_lease.md),
which provide the mechanism for asynchronously distributing
modifications to table descriptors. This RFC is concerned with the
actual steps of performing a schema change.

# Detailed design

Table descriptors will be enhanced to contain information about an
on-going schema change operation:

```proto
message TableDescriptor {
  ...

  // At most one schema change mutation is in progress at a time
  // (field numbers are illustrative).
  oneof mutation {
    IndexDescriptor add_index = 8;
    IndexDescriptor drop_index = 9;
    ColumnDescriptor add_column = 10;
    ColumnDescriptor drop_column = 11;
  }
}
```

Additionally, index and column descriptors will be enhanced to contain
a state:

```proto
enum State {
  DELETE_ONLY = 0;
  WRITE_ONLY = 1;
  PUBLIC = 2;
}
```

The `DELETE_ONLY` state specifies that `UPDATE` and `DELETE`
operations should only remove keys from the index or delete the data
for the column. The `WRITE_ONLY` state specifies that `INSERT`,
`UPDATE` and `DELETE` operations should fully maintain the data for
the index or column, adding or deleting as necessary. In both states,
the column or index is not used for read operations. Lastly, the
`PUBLIC` state allows the index or column to be used normally.

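To make these semantics concrete, here is a minimal sketch of how a
row update might treat a single secondary index depending on its
state. The types and helpers are illustrative stand-ins, not the
actual implementation:

```go
// State mirrors the descriptor state from the proto above.
type State int

const (
	DeleteOnly State = iota // DELETE_ONLY
	WriteOnly               // WRITE_ONLY
	Public                  // PUBLIC
)

// Batch is a stand-in for the KV write batch used by the SQL layer.
type Batch struct{ ops []string }

func (b *Batch) Del(key string)        { b.ops = append(b.ops, "Del "+key) }
func (b *Batch) Put(key, value string) { b.ops = append(b.ops, "Put "+key+"="+value) }

// updateIndexEntry shows how an UPDATE treats one secondary index.
func updateIndexEntry(b *Batch, state State, oldKey, newKey, newValue string) {
	switch state {
	case DeleteOnly:
		// DELETE_ONLY: remove the stale entry but never write a new one.
		b.Del(oldKey)
	case WriteOnly, Public:
		// WRITE_ONLY and PUBLIC: fully maintain the index data.
		b.Del(oldKey)
		b.Put(newKey, newValue)
	}
	// Reads are unaffected here: only PUBLIC indexes are consulted by
	// read operations.
}
```
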
The state for an index or column moves from `DELETE_ONLY` to
`WRITE_ONLY` to `PUBLIC`, or in reverse. It is invalid to jump from
the `DELETE_ONLY` state to the `PUBLIC` state or vice versa. By moving
the state in these discrete steps it can be shown that a cluster will
not create invalid data when two consecutive versions of the
descriptor are in use concurrently. The table lease mechanism will be
used to ensure that there are only two active versions of a descriptor
in use in a cluster, and the schema change implementation will ensure
that these two versions follow the state transition invariant. The
state transition diagram (Figure 3) in [Online, Asynchronous Schema
Change in
F1](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41376.pdf)
is a useful reference.

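The invariant can be stated compactly: with the states ordered
`DELETE_ONLY` < `WRITE_ONLY` < `PUBLIC`, a single descriptor version
bump may move an index or column by at most one step in either
direction. A hypothetical check, reusing the `State` type from the
sketch above:

```go
// legalTransition reports whether a state change between two
// consecutive descriptor versions respects the one-step invariant.
func legalTransition(from, to State) bool {
	diff := int(to) - int(from)
	return diff >= -1 && diff <= 1
}
```
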
When an index is added to a table, a potentially time-consuming
backfill operation needs to be performed to create the index
data. This operation is performed while the index is in the
`WRITE_ONLY` state and completed before moving the index to the
`PUBLIC` state. Backfilling creates index entries for every row in the
table. As such, it is infeasible for this backfilling to be performed
as a single transaction: the transaction would create an unreasonable
number of write intents and user transactions could cause the
transaction to abort. Instead, backfill will be performed as a series
of small transactions that each process a fraction of the table:

```go
startKey := MakeTableIndexKey(desc.ID, desc.Primary)
endKey := startKey.PrefixEnd()
for bytes.Compare(startKey, endKey) < 0 {
	var lastKey roachpb.Key
	err := db.Txn(func(ctx context.Context, txn *Txn) error {
		// Run at very low priority so that conflicting user
		// transactions win and the backfill retries.
		txn.SetPriority(VeryLowPriority)
		// Read the next chunk of the primary index (at most 1000 keys).
		scan, err := txn.Scan(startKey, endKey, 1000)
		if err != nil {
			return err
		}
		if len(scan) == 0 {
			// Nothing left to backfill.
			lastKey = endKey
			return nil
		}
		// Only process complete rows; a row may span multiple keys.
		lastKey = getLastKeyOfFullRowInScan(scan)
		// Generate index entries for this chunk and commit them in the
		// same transaction that performed the scan.
		b := makeIndexKeysBatch(scan)
		return txn.CommitInBatch(ctx, b)
	})
	if err != nil {
		// Abort!
	}
	startKey = lastKey.Next()
}
```

The above pseudo-code is intended to give the gist of how backfilling
will work. We'll iterate over the primary index, transactionally
retrieving keys and generating index entries. We perform the
backfilling work transactionally in order to avoid anomalies involving
deletions. Backfilling might duplicate work performed by concurrent
insertions or updates, but the blind overwrites of identical data are
safe.

Dropping an index involves a single `DelRange` operation. Dropping a
column is more involved and will be performed as a back-delete
process, similar to the above backfill process, which loops over the
primary index for the table and deletes the column keys (a sketch of
the per-chunk work follows).

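Reusing the shape of the backfill pseudo-code above, the per-chunk
transaction of the back-delete would delete the dropped column's keys
rather than write index entries. The `KeyValue`, `ColumnID`,
`getFullRowsInScan` and `makeColumnKey` names here are hypothetical
helpers, not the actual implementation:

```go
// makeColumnDeleteBatch is the back-delete analogue of
// makeIndexKeysBatch in the backfill loop: for each complete row in
// the scanned chunk of the primary index, delete the key that stores
// the dropped column's value.
func makeColumnDeleteBatch(scan []KeyValue, colID ColumnID) *Batch {
	b := &Batch{}
	for _, row := range getFullRowsInScan(scan) {
		b.Del(makeColumnKey(row.PrimaryKeyPrefix, colID))
	}
	return b
}
```
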
Since schema change operations are potentially long-running, they need
to be restartable or abortable if the node performing them dies. We
accomplish this by performing the schema change operation for a table
on a well-known node: the replica holding the range lease for the
first range of the table (i.e. the range containing the key
`/<tableID>`). When a node receives a schema change operation such as
`CREATE INDEX` it will forward the operation to this "table lease
holder". When the table lease holder restarts it will load the
associated table descriptor and restart or abort the schema change
operation. Note that aborting a schema change operation needs to
maintain the invariant that the descriptor version only increases.

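A sketch of the resume path on the table lease holder, assuming
hypothetical helpers (`loadTableDescriptor`, `resumeMutation`) for
loading a descriptor and driving its mutation through the remaining
states:

```go
// maybeResumeSchemaChange runs when a node acquires the range lease
// for a table's first range. If the descriptor records an in-flight
// mutation, the schema change is picked up where it left off.
func maybeResumeSchemaChange(ctx context.Context, db *DB, tableID ID) error {
	desc, err := loadTableDescriptor(ctx, db, tableID)
	if err != nil {
		return err
	}
	if desc.Mutation == nil {
		// No schema change in flight for this table.
		return nil
	}
	// Drive the recorded mutation forward (or roll it back), stepping
	// the state one transition at a time and bumping the descriptor
	// version monotonically at each step.
	return resumeMutation(ctx, db, desc)
}
```
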
For initial implementation simplicity, we will only allow a single
schema change operation at a time per table. This restriction can be
lifted in the future if we see a benefit in allowing concurrent schema
changes on a table (e.g. concurrently adding multiple indexes).

# Drawbacks

* None. This is rock solid.

# Alternatives

* No real good ones. We could try to come up with some sort of global
  table lock mechanism and then synchronously perform the
  backfill. This seems as difficult as the current proposal and much
  worse for the user experience.

# Unresolved questions

* Is there a way to avoid performing the backfilling work
  transactionally? Is it worth optimizing? Delete, in particular,
  seems problematic: if we scan a portion of the primary index
  non-transactionally and then generate index keys, a concurrent
  delete can remove one of the rows without generating a corresponding
  delete of the index key, unless we leave a "tombstone" deletion for
  the index key.

* If the node performing the backfill gets restarted we should figure
  out a way to avoid restarting the backfill from scratch. One thought
  is that the backfill operation can periodically checkpoint the high
  water mark of its progress: either in the descriptor itself (taking
  care not to bump the version) or in a separate backfill checkpoint
  table (see the sketch below).

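A sketch of the checkpoint-table variant, with a hypothetical key
encoding (`makeBackfillCheckpointKey`); the transaction shape follows
the backfill pseudo-code above:

```go
// checkpointBackfill periodically records the backfill's high water
// mark so that a restarted node can resume from it instead of
// rescanning the whole table.
func checkpointBackfill(db *DB, tableID ID, highWater roachpb.Key) error {
	return db.Txn(func(ctx context.Context, txn *Txn) error {
		// One checkpoint row per table; a blind overwrite is safe
		// because the high water mark only ever advances.
		return txn.Put(makeBackfillCheckpointKey(tableID), highWater)
	})
}
```
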
* Figure out how to distribute the backfill work. Ideally we would
  have each range of the primary index generate and write the index
  keys. Given that this is not an urgent need, I feel this is best
  left to a separate RFC.