- Feature Name: online_schema_change
- Status: completed
- Start Date: 2015-10-14
- RFC PR: [#2842](https://github.com/cockroachdb/cockroach/pull/2842)
- Cockroach Issue: [#2036](https://github.com/cockroachdb/cockroach/issues/2036)

# Summary

Implement online schema change: adding or removing an index or column
on a table without blocking access to the table.

# Motivation

The existing schema change operations such as adding or removing an
index are performed as a single transaction. This approach was
convenient for the initial implementation but is infeasible for a
table with any significant amount of data. Additionally, the single
transaction approach necessitates reading the table descriptor on
every SQL operation, which is a significant performance bottleneck.

We will implement online schema change, which breaks down a high-level
schema change operation such as `CREATE INDEX` into a series of
discrete steps in such a way that user transactions are never blocked
from accessing the table, yet table and index data are never left in
an invalid state.

Online schema change will be built on top of [table descriptor
leases](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151009_table_descriptor_lease.md),
which provide the mechanism for asynchronously distributing
modifications to table descriptors. This RFC is concerned with the
actual steps of performing a schema change.

# Detailed design

Table descriptors will be enhanced to contain information about an
on-going schema change operation:

```proto
message TableDescriptor {
  ...

  // At most one schema change mutation is in progress at a time
  // (field numbers are illustrative).
  oneof mutation {
    IndexDescriptor add_index = 8;
    IndexDescriptor drop_index = 9;
    ColumnDescriptor add_column = 10;
    ColumnDescriptor drop_column = 11;
  }
}
```

Additionally, index and column descriptors will be enhanced to contain
a state:

```proto
enum State {
  DELETE_ONLY = 0;
  WRITE_ONLY = 1;
  PUBLIC = 2;
}
```

The `DELETE_ONLY` state specifies that `UPDATE` and `DELETE`
operations should only remove keys from the index or delete the data
for the column. The `WRITE_ONLY` state specifies that `INSERT`,
`UPDATE` and `DELETE` operations should fully maintain the data for
the index or column, adding or deleting as necessary. In both states,
the column or index is not used for read operations. Lastly, the
`PUBLIC` state allows the index or column to be used normally.

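To make these semantics concrete, here is a minimal sketch of how a
row update might treat a single secondary index depending on its
state. The types and helpers are illustrative stand-ins, not the
actual implementation:

```go
// State mirrors the descriptor state from the proto above.
type State int

const (
	DeleteOnly State = iota // DELETE_ONLY
	WriteOnly               // WRITE_ONLY
	Public                  // PUBLIC
)

// Batch is a stand-in for the KV write batch used by the SQL layer.
type Batch struct{ ops []string }

func (b *Batch) Del(key string)        { b.ops = append(b.ops, "Del "+key) }
func (b *Batch) Put(key, value string) { b.ops = append(b.ops, "Put "+key+"="+value) }

// updateIndexEntry shows how an UPDATE treats one secondary index.
func updateIndexEntry(b *Batch, state State, oldKey, newKey, newValue string) {
	switch state {
	case DeleteOnly:
		// DELETE_ONLY: remove the stale entry but never write a new one.
		b.Del(oldKey)
	case WriteOnly, Public:
		// WRITE_ONLY and PUBLIC: fully maintain the index data.
		b.Del(oldKey)
		b.Put(newKey, newValue)
	}
	// Reads are unaffected here: only PUBLIC indexes are consulted by
	// read operations.
}
```
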
The state for an index or column moves from `DELETE_ONLY` to
`WRITE_ONLY` to `PUBLIC`, or in reverse. It is invalid to jump from
the `DELETE_ONLY` state to the `PUBLIC` state or vice versa. By moving
the state in these discrete steps it can be shown that a cluster will
not create invalid data when two consecutive versions of the
descriptor are in use concurrently. The table lease mechanism will be
used to ensure that there are only two active versions of a descriptor
in use in a cluster, and the schema change implementation will ensure
that these two versions follow the state transition invariant. The
state transition diagram (Figure 3) in [Online, Asynchronous Schema
Change in
F1](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41376.pdf)
is a useful reference.

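The invariant can be stated compactly: with the states ordered
`DELETE_ONLY` < `WRITE_ONLY` < `PUBLIC`, a single descriptor version
bump may move an index or column by at most one step in either
direction. A hypothetical check, reusing the `State` type from the
sketch above:

```go
// legalTransition reports whether a state change between two
// consecutive descriptor versions respects the one-step invariant.
func legalTransition(from, to State) bool {
	diff := int(to) - int(from)
	return diff >= -1 && diff <= 1
}
```
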
When an index is added to a table, a potentially time-consuming
backfill operation needs to be performed to create the index
data. This operation is performed while the index is in the
`WRITE_ONLY` state and completed before moving the index to the
`PUBLIC` state. Backfilling creates index entries for every row in the
table. As such, it is infeasible for this backfilling to be performed
as a single transaction: the transaction would create an unreasonable
number of write intents and user transactions could cause the
transaction to abort. Instead, backfill will be performed as a series
of small transactions that each process a fraction of the table:

```go
startKey := MakeTableIndexKey(desc.ID, desc.Primary)
endKey := startKey.PrefixEnd()
for bytes.Compare(startKey, endKey) < 0 {
	var lastKey roachpb.Key
	err := db.Txn(func(ctx context.Context, txn *Txn) error {
		// Run at very low priority so that conflicting user
		// transactions win and the backfill retries.
		txn.SetPriority(VeryLowPriority)
		// Read the next chunk of the primary index (at most 1000 keys).
		scan, err := txn.Scan(startKey, endKey, 1000)
		if err != nil {
			return err
		}
		if len(scan) == 0 {
			// Nothing left to backfill.
			lastKey = endKey
			return nil
		}
		// Only process complete rows; a row may span multiple keys.
		lastKey = getLastKeyOfFullRowInScan(scan)
		// Generate index entries for this chunk and commit them in the
		// same transaction that performed the scan.
		b := makeIndexKeysBatch(scan)
		return txn.CommitInBatch(ctx, b)
	})
	if err != nil {
		// Abort!
	}
	startKey = lastKey.Next()
}
```

The above pseudo-code is intended to give the gist of how backfilling
will work. We'll iterate over the primary index, transactionally
retrieving keys and generating index entries. We perform the
backfilling work transactionally in order to avoid anomalies involving
deletions. Backfilling might duplicate work performed by concurrent
insertions or updates, but the blind overwrites of identical data are
safe.

Dropping an index involves a single `DelRange` operation. Dropping a
column is more involved and will be performed as a back-delete
process, similar to the above backfill process, which loops over the
primary index for the table and deletes the column keys (a sketch of
the per-chunk work follows).

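Reusing the shape of the backfill pseudo-code above, the per-chunk
transaction of the back-delete would delete the dropped column's keys
rather than write index entries. The `KeyValue`, `ColumnID`,
`getFullRowsInScan` and `makeColumnKey` names here are hypothetical
helpers, not the actual implementation:

```go
// makeColumnDeleteBatch is the back-delete analogue of
// makeIndexKeysBatch in the backfill loop: for each complete row in
// the scanned chunk of the primary index, delete the key that stores
// the dropped column's value.
func makeColumnDeleteBatch(scan []KeyValue, colID ColumnID) *Batch {
	b := &Batch{}
	for _, row := range getFullRowsInScan(scan) {
		b.Del(makeColumnKey(row.PrimaryKeyPrefix, colID))
	}
	return b
}
```
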
Since schema change operations are potentially long-running, they need
to be restartable or abortable if the node performing them dies. We
accomplish this by performing the schema change operation for a table
on a well-known node: the replica holding the range lease for the
first range of the table (i.e. the range containing the key
`/<tableID>`). When a node receives a schema change operation such as
`CREATE INDEX` it will forward the operation to this "table lease
holder". When the table lease holder restarts it will load the
associated table descriptor and restart or abort the schema change
operation. Note that aborting a schema change operation needs to
maintain the invariant that the descriptor version only increases.

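A sketch of the resume path on the table lease holder, assuming
hypothetical helpers (`loadTableDescriptor`, `resumeMutation`) for
loading a descriptor and driving its mutation through the remaining
states:

```go
// maybeResumeSchemaChange runs when a node acquires the range lease
// for a table's first range. If the descriptor records an in-flight
// mutation, the schema change is picked up where it left off.
func maybeResumeSchemaChange(ctx context.Context, db *DB, tableID ID) error {
	desc, err := loadTableDescriptor(ctx, db, tableID)
	if err != nil {
		return err
	}
	if desc.Mutation == nil {
		// No schema change in flight for this table.
		return nil
	}
	// Drive the recorded mutation forward (or roll it back), stepping
	// the state one transition at a time and bumping the descriptor
	// version monotonically at each step.
	return resumeMutation(ctx, db, desc)
}
```
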
For initial implementation simplicity, we will only allow a single
schema change operation at a time per table. This restriction can be
lifted in the future if we see a benefit in allowing concurrent schema
changes on a table (e.g. concurrently adding multiple indexes).

# Drawbacks

* None. This is rock solid.

# Alternatives

* No real good ones. We could try to come up with some sort of global
  table lock mechanism and then synchronously perform the
  backfill. This seems as difficult as the current proposal and much
  worse for the user experience.

# Unresolved questions

* Is there a way to avoid performing the backfilling work
  transactionally? Is it worth optimizing? Delete, in particular,
  seems problematic: if we scan a portion of the primary index
  non-transactionally and then generate index keys, a concurrent
  delete can remove one of the rows without generating a corresponding
  delete of the index key, unless we leave a "tombstone" deletion for
  the index key.

* If the node performing the backfill gets restarted we should figure
  out a way to avoid restarting the backfill from scratch. One thought
  is that the backfill operation can periodically checkpoint the high
  water mark of its progress: either in the descriptor itself (taking
  care not to bump the version) or in a separate backfill checkpoint
  table (see the sketch below).

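A sketch of the checkpoint-table variant, with a hypothetical key
encoding (`makeBackfillCheckpointKey`); the transaction shape follows
the backfill pseudo-code above:

```go
// checkpointBackfill periodically records the backfill's high water
// mark so that a restarted node can resume from it instead of
// rescanning the whole table.
func checkpointBackfill(db *DB, tableID ID, highWater roachpb.Key) error {
	return db.Txn(func(ctx context.Context, txn *Txn) error {
		// One checkpoint row per table; a blind overwrite is safe
		// because the high water mark only ever advances.
		return txn.Put(makeBackfillCheckpointKey(tableID), highWater)
	})
}
```
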
* Figure out how to distribute the backfill work. Ideally we would
  have each range of the primary index generate and write the index
  keys. Given that this is not an urgent need, I feel this is best
  left to a separate RFC.