github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20170314_sql_split_syntax.md (about) 1 - Feature Name: SQL Syntax for splitting tables 2 - Status: completed 3 - Start Date: 2017-03-14 4 - Authors: Radu Berinde 5 - RFC PR: [#14146](https://github.com/cockroachdb/cockroach/pull/14146) 6 - Cockroach Issue: [#13665](https://github.com/cockroachdb/cockroach/issues/13665) 7 8 # Summary 9 10 This RFC proposes new syntax for introducing table splits and relocating ranges via SQL. 11 12 Specifically, we want to be able to: 13 - introduce range splits at specific points in a table or index; 14 - trigger a "scattering" of ranges in a table or index (or in an area of a 15 table/index); 16 - relocate specific ranges to a specified set of replicas. 17 18 # Motivation 19 20 The main motivation is to allow setting up tests, benchmarks, and reproducible 21 testbeds, especially for DistSQL. One set of some sample tests that we want to 22 be able to set up directly from logic tests are in 23 [TestDistSQLPlanner](https://github.com/cockroachdb/cockroach/blob/cc5ba57/pkg/sql/distsql_physical_planner_test.go#L75). 24 These tests need to split tables in a specific way, and they need to reprogram 25 the replicas and range leaders of each split. 26 27 The secondary motivation is for restore (as in backup/restore), which needs to 28 introduce a specific set of key splits. Currently this is done by custom code 29 which works as follows: the keys are sorted; a split is introduced on the middle 30 key, then the splits to the left and respectively to the right are processed 31 recursively (in parallel). We want the new syntax to implement this algorithm so 32 that backup/restore can use it (assuming they can switch to using split point 33 specified via column values as opposed to keys). 34 35 We currently have the `ALTER TABLE/INDEX SPLIT AT` statement which allows 36 introducing a table or index split at a specific tuple of column values. The 37 main drawback is that it cannot be used *programmatically*: each split must be 38 in its own statement; we cannot generate (e.g. via a `SELECT`) a set of values 39 where to split. We want to write tests like the one linked above without 40 hardcoding each split in the test file; and we want to be able to easily change 41 the test table sizes. In addition, the `SPLIT AT` statements don't support 42 control of replication. 43 44 # Detailed design 45 46 ### 1. `ALTER TABLE/INDEX SPLIT AT` enhancement ### 47 48 The existing `SPLIT AT` syntax is changed: instead of taking a tuple of column 49 values for a single split, `SPLIT AT` takes an arbitrary select clause where 50 each row contains the primary key values; a split is issued for each row. 51 52 ```sql 53 ALTER TABLE <table> SPLIT AT <select_statement> 54 ALTER INDEX <table>@index SPLIT AT <select_statement> 55 ``` 56 57 Existing uses of `ALTER TABLE SPLIT AT (pk1,pk2,...)` are changed to `ALTER 58 TABLE SPLIT AT VALUES (pk1,pk2,...)` (and similarly for `ALTER INDEX`); 59 examples: 60 61 62 | Old | New | 63 |----------------------------------------|-----------------------------------------------| 64 | `ALTER TABLE d.t SPLIT AT ('c', 3)` | `ALTER TABLE d.t SPLIT AT VALUES ('c', 3)` | 65 | `ALTER INDEX d.t@s_idx SPLIT AT ('f')` | `ALTER INDEX d.t@s_idx SPLIT AT VALUES ('f')` | 66 67 The new syntax allows introducing multiple split points, as well as calculated 68 or table-driven split points; for example: 69 ```sql 70 ALTER TABLE t1 SPLIT AT VALUES ('c', 3), ('d', 4), ('e', 5) 71 ALTER INDEX t2@idx SPLIT AT SELECT i*10 FROM GENERATE_SERIES(1, 5) AS g(i) 72 ALTER TABLE t3 SPLIT AT SELECT pk1, pk2 FROM d.t WHERE split_here 73 ``` 74 75 The `SPLIT AT` statement is also extended to support providing only a *prefix* 76 of the primary key or index values. Example: 77 ```sql 78 CREATE TABLE t (k1, k2, k3, v INT, PRIMARY KEY (k1, k2, k3)) 79 80 -- Introduce a split at at /t/primary/1 81 ALTER TABLE t SPLIT AT VALUES (1) 82 83 -- Introduce a split at at /t/primary/1/2 84 ALTER TABLE t SPLIT AT VALUES (1,2) 85 86 -- Introduce a split at at /t/primary/1/2/3 87 ALTER TABLE t SPLIT AT VALUES (1,2,3) 88 ``` 89 90 Note that the statement returns only after the splits have been preformed. 91 92 Implementation note: the `SPLIT AT` implementation can issue the splits in any 93 order; for example, if there are many splits, it is advantageous to sort the 94 split points, split at the middle point, then recursively process the left and 95 right sides (in parallel). 96 97 *Interleaved tables*: the command works as expected; the split will inherently 98 cause a corresponding split in the parent or child tables/indexes. 99 100 ##### Return values ##### 101 102 `ALTER TABLE/INDEX SPLIT AT` currently returns a row with two columns: the key 103 in raw and pretty-printed form. The same schema is kept; a row for each split is 104 output. 105 106 If a split already exists, `SPLIT AT` currently returns an error. 107 108 TBD: what is the correct behavior with multiple split points? Ignore this error? 109 110 ### 2. `ALTER TABLE/INDEX SCATTER` ### 111 112 A new pair of statements similar to `SPLIT AT` are introduced. Each has two 113 forms. The first form causes all the ranges for that table or index to be 114 "scattered": for each range, a new random set of stores are chosen for replicas 115 (in accordance with the zone config). 116 117 ```sql 118 ALTER TABLE <table> SCATTER 119 ALTER INDEX <table>@index SCATTER 120 ``` 121 122 The second form allows only a specific area of a table or index to be scattered. 123 The area is specified using a pair of tuples (primary key or index column 124 values) for the start and end of the area. The values are interpreted similar to 125 `SPLIT AT`, and similarly a prefix of the primary key or index columns can be 126 specified: 127 128 ```sql 129 ALTER TABLE <table> SCATTER 130 ALTER TABLE <table> SCATTER FROM (startPK1, startPK2, ...) TO (endPK1, endPK2, ...) 131 ALTER INDEX <table>@index SCATTER 132 ALTER INDEX <table>@index SCATTER FROM (startCol1, startCol2, ...) TO (endCol1, endCol2, ...) 133 ``` 134 135 Note that no new split points are introduced by `SCATTER`. 136 137 Examples: 138 ```sql 139 ALTER TABLE t SCATTER 140 ALTER TABLE t SCATTER FROM (1,1) TO (1,2) 141 ALTER INDEX t@idx SCATTER FROM (1) TO (2) 142 ``` 143 144 The statement returns only after the relocations are complete. 145 146 *Interleaved tables*: the command works as expected (the ranges may contain rows 147 for parent or child tables/indexes). 148 149 ### 3. `ALTER TABLE/INDEX EXPERIMENTAL_RELOCATE` ### 150 151 The `EXPERIMENTAL_RELOCATE` statements can be used to relocate specific ranges to 152 specific stores. This is very low-level functionality and is intended to be used 153 sparingly, mainly for setting up tests which benefit from a predetermined data 154 distribution. The rebalancing queues should be stopped in order to make the 155 relocations "stick". 156 157 ```sql 158 ALTER TABLE <table> EXPERIMENTAL_RELOCATE <select_statement> 159 ALTER INDEX <table>@<index> EXPERIMENTAL_RELOCATE <select_statement> 160 ``` 161 162 `EXPERIMENTAL_RELOCATE` takes a select statement with the following result schema: 163 - the first column is the relocation information: an array of integers, where 164 each integer is a store ID. This indicates the set of replicas for the range; 165 the first replica in the array will be the new lease owner. 166 167 - the rest of the columns indicate a point in the table as a list of primary 168 key or index columns (similar to `SPLIT AT`). The range that contains this 169 point is relocated according to the first column. Note that, just like `SPLIT 170 AT`, a row with this particular set of values doesn't have to exist in the 171 table. 172 173 Examples: 174 ```sql 175 CREATE TABLE t (k1, k2, k3, v INT, PRIMARY KEY (k1, k2, k3)) 176 177 -- Move the range containing /t/primary/1/2/3 to store 1: 178 ALTER TABLE t EXPERIMENTAL_RELOCATE VALUES (ARRAY[1], 1, 2, 3) 179 180 -- Move the range containing /t/primary/1/2 to stores 5,6,7 (with 5 as lease owner): 181 ALTER TABLE t EXPERIMENTAL_RELOCATE VALUES (ARRAY[5,6,7], 1, 2) 182 183 -- Move even k1s to store 1, odd k1s to store 2: 184 ALTER TABLE t EXPERIMENTAL_RELOCATE SELECT ARRAY[1+i%2], i FROM GENERATE_SERIES(1, 10) AS g(i) 185 ``` 186 187 The statement returns only after the relocations are complete. 188 189 *Interleaved tables*: the command works as expected (the ranges may contain rows 190 for parent or child tables/indexes). 191 192 ### 4. `crdb_internal.ranges` and `ranges_cached` system table ### 193 194 To facilitate testing the implementation of the new commands (as well as allow a 195 user to verify what the commands did), we introduce a `crdb_internal.ranges` 196 system table that can be used to look at all the ranges on the system, or the 197 ranges from a table. 198 199 The schema of the table is as follows: 200 201 Column | Type | Description 202 ---------------|------------|------------------------------ 203 `start_key` | BYTES | Range start key (raw) 204 `start_pretty` | STRING | Range start key (pretty-printed) 205 `end_key` | BYTES | Range end key (raw) 206 `end_pretty` | STRING | Range end key (pretty-printed) 207 `database` | STRING | Database name (if range is part of a table) 208 `table` | STRING | Table name (if range is part of a table); for interleaved tables this is always the root table. 209 `index` | STRING | Index name (if range is part of a non-primary index); 210 `replicas` | ARRAY(INT) | Replica store IDs 211 `lease_holder` | INT | Lease holder store ID 212 213 The last two columns could be hidden (so they are only available if `SELECT`ed 214 for specifically). 215 216 Implementation notes: 217 - the system table infrastructure will be improved so the row producing 218 function has access to filters; specifying a `table` or `index` filter that 219 should be optimized to only look at the ranges for that table. 220 - the row producing function should also have access to needed columns; that 221 way the more expensive lease holder determination can be omitted if the 222 column is not needed. 223 224 A second table, `crdb_internal.ranges_cached` has the same schema, but it 225 returns data from the range and lease holder caches. Specifically: the ranges 226 along with `replicas` information are populated from the range cache; for each 227 range, if that range ID has an entry in the lease holder cache, `lease_holder` 228 is set according to that entry; otherwise it is NULL. 229 230 # Alternatives 231 232 An alternative considered was to create statements that operate on *keys* (key 233 prefixes) and provide functions that generate key prefixes from column values. 234 This was deemed as too low-level for the SQL interface. 235 236 Another alternative considered was to also use a function for the splitting 237 functionality, e.g. `SELECT split_at(..)`. The problem is that it forces the 238 splits to happen sequentially; we cannot implement the algorithm mentioned above 239 that parallelizes the splits. One way around this would be to introduce `split_at` 240 as an *aggregation* function (akin to `sum`). 241 242 Alternatives considered for `crdb_internal.ranges`: 243 - a `SHOW RANGES FOR TABLE/INDEX` statement; the system table was deemed more 244 useful. 245 - having multiple system tables (e.g. a separate one for lease holders) and 246 using joins as necessary; this requires too many changes to make sure we only 247 generate the parts of the table that are needed. 248 249 # Unresolved questions