github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20151014_online_schema_change.md (about) 1 - Feature Name: online_schema_change 2 - Status: completed 3 - Start Date: 2015-10-14 4 - RFC PR: [#2842](https://github.com/cockroachdb/cockroach/pull/2842) 5 - Cockroach Issue: [#2036](https://github.com/cockroachdb/cockroach/issues/2036) 6 7 # Summary 8 9 Implement online schema change: adding or removing an index or column 10 to a table without blocking access to the table. 11 12 # Motivation 13 14 The existing schema change operations such as adding or removing an 15 index are performed as a single transaction. This approach was 16 convenient for initial implementation but infeasible for a table with 17 any significant amount of data. Additionally, the single transaction 18 approach necessitates reading the table descriptor on every SQL 19 operation which is a significant performance bottleneck. 20 21 We will implement online schema change which breaks down a high-level 22 schema change operation such as `CREATE INDEX` into a series of 23 discrete steps in such a way that user transactions are never blocked 24 from accessing the table and yet never leave table or index data in an 25 invalid state. 26 27 Online schema change will be built on top of [table descriptor 28 leases](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151009_table_descriptor_lease.md) 29 which describes the mechanism for asynchronously distributing 30 modifications to table descriptors. This RFC is concerned with the 31 actual steps of performing a schema change. 32 33 # Detailed design 34 35 Table descriptors will be enhanced to contain information about an 36 on-going schema change operation: 37 38 ```proto 39 message TableDescriptor { 40 ... 41 42 oneof mutation { 43 optional IndexDescriptor add_index; 44 optional IndexDescriptor drop_index; 45 optional ColumnDescriptor add_column; 46 optional ColumnDescriptor drop_column; 47 } 48 } 49 ``` 50 51 Additionally, index and column descriptors will be enhanced to contain 52 a state: 53 54 ```proto 55 enum State { 56 DELETE_ONLY; 57 WRITE_ONLY; 58 PUBLIC; 59 } 60 ``` 61 62 The `DELETE_ONLY` state specifies that `UPDATE` and `DELETE` 63 operations should only remove keys from the index or for the 64 column. The `WRITE_ONLY` state specifies that `INSERT`, `UPDATE` and 65 `DELETE` operations should maintain the data for the index or column: 66 adding or deleting as necessary. For both states, the column or index 67 is not used for read operations. Lastly, the `PUBLIC` state allows the 68 index or column to be used normally. 69 70 The state for an index or column moves from `DELETE_ONLY` to 71 `WRITE_ONLY` to `PUBLIC` or in reverse. It is invalid to jump from the 72 `DELETE_ONLY` to `PUBLIC` state or vice versa. By moving the state in 73 these discrete steps it can be shown that that a cluster will not 74 create invalid data when two consecutive versions of the descriptor 75 are in use concurrently. The table lease mechanism will be used to 76 ensure that there are only two active versions of a descriptor in use 77 in a cluster and the scheme change implementation will ensure that 78 these two versions follow the state transition invariant. The state 79 transition diagram (Figure 3) in [Online, Asynchronous Schema Change 80 in 81 F1](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41376.pdf) 82 is a useful reference. 83 84 When an index is added to a table a potentially time consuming 85 backfill operation needs to be performed to create the index 86 data. This operation is performed when the index is in the 87 `WRITE_ONLY` state and completed before moving the index to the 88 `PUBLIC` state. Backfilling creates index entries for every row in the 89 table. As such, it is infeasible for this backfilling to be performed 90 transactionally: the transaction would create an unreasonable number 91 of write intents and user transactions could cause the transaction to 92 abort. Instead, backfill will be performed as a series of small 93 transactions that process a fraction of the table at a time: 94 95 ```go 96 startKey := MakeTableIndexKey(desc.ID, desc.Primary) 97 endKey := startKey.PrefixEnd() 98 for startKey != endKey { 99 var lastKey roachpb.Key 100 err := db.Txn(func(ctx context.Context, txn *Txn) error { 101 txn.SetPriority(VeryLowPriority) 102 scan, err := txn.Scan(startKey, endKey, 1000) 103 if err != nil { 104 return err 105 } 106 lastKey = getLastKeyOfFullRowInScan(scan) 107 b := makeIndexKeysBatch(scan) 108 return txn.CommitInBatch(ctx, b) 109 }) 110 if err != nil { 111 // Abort! 112 } 113 startKey = lastKey.Next() 114 } 115 ``` 116 117 The above pseudo-code is intended to give the gist of how backfilling 118 will work. We'll iterate over the primary index transactionally 119 retrieving keys and generating index entries. We perform the 120 backfilling work transactionally in order to avoid anomalies involving 121 deletions. Backfilling might duplicate work performed by concurrent 122 insertions or updates, but the blind overwrites of identical data are 123 safe. 124 125 Dropping an index involves a single `DelRange` operation. Dropping a 126 column is more involved and will be performed as a back-delete 127 process, similar to the above backfill process, which loops over the 128 primary index for the table and deletes the column keys. 129 130 Since schema change operations are potentially long running they need 131 to be restartable or abortable if the node performing them dies. We 132 accomplish this by performing the schema change operation for a table 133 on a well known node: the replica holding the range lease for the 134 first range of the table (i.e. containing the key `/<tableID>`). When 135 a node receives a schema change operation such as `CREATE INDEX` it 136 will forward the operation to this "table lease holder". When the table 137 lease holder restarts it will load the associated table descriptor and 138 restart or abort the schema change operation. Note that aborting a 139 schema change operation needs to maintain the invariant that the 140 descriptor version only increase. 141 142 For initial implementation simplicity, we will only allow a single 143 schema change operation at a time per table. This restriction can be 144 lifted in the future if we see a benefit in allowing concurrent schema 145 changes on a table (e.g. concurrently adding multiple indexes). 146 147 # Drawbacks 148 149 * None. This is rock solid. 150 151 # Alternatives 152 153 * No real good ones. We could try to come up with some sort of global 154 table lock mechanism and then synchronously perform the 155 backfill. This seems as difficult as the current proposal and much 156 worse for the user experience. 157 158 # Unresolved questions 159 160 * Is there a way to avoid performing the backfilling work 161 transactionally? Is it worth optimizing? Delete, in particular, 162 seems problematic. If we scan a portion of the primary index 163 non-transactionally and then generate index keys, a concurrent 164 delete can come in and delete one of the rows and not generate a 165 delete of the index key unless we left a "tombstone" deletion for 166 the index key. 167 168 * If the node performing the backfill gets restarted we should figure 169 out a way to avoid restarting the backfill from scratch. One thought 170 is that the backfill operation can periodically checkpoint the high 171 water mark of its progress: either in the descriptor itself (taking 172 care not to bump the version) or in a separate backfill checkpoint 173 table. 174 175 * Figure out how to distribute the backfill work. Ideally we would 176 have each range of the primary index generate and write the index 177 keys. Given that this is not an urgent need, I feel this is best 178 left to a separate RFC.