github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20170314_sql_split_syntax.md (about)

     1  - Feature Name: SQL Syntax for splitting tables
     2  - Status: completed
     3  - Start Date: 2017-03-14
     4  - Authors: Radu Berinde
     5  - RFC PR: [#14146](https://github.com/cockroachdb/cockroach/pull/14146)
     6  - Cockroach Issue: [#13665](https://github.com/cockroachdb/cockroach/issues/13665)
     7  
     8  # Summary
     9  
    10  This RFC proposes new syntax for introducing table splits and relocating ranges via SQL.
    11  
    12  Specifically, we want to be able to:
    13   - introduce range splits at specific points in a table or index;
    14   - trigger a "scattering" of ranges in a table or index (or in an area of a
    15     table/index);
    16   - relocate specific ranges to a specified set of replicas.
    17  
    18  # Motivation
    19  
    20  The main motivation is to allow setting up tests, benchmarks, and reproducible
    21  testbeds, especially for DistSQL. One set of some sample tests that we want to
    22  be able to set up directly from logic tests are in
    23  [TestDistSQLPlanner](https://github.com/cockroachdb/cockroach/blob/cc5ba57/pkg/sql/distsql_physical_planner_test.go#L75).
    24  These tests need to split tables in a specific way, and they need to reprogram
    25  the replicas and range leaders of each split.
    26  
    27  The secondary motivation is for restore (as in backup/restore), which needs to
    28  introduce a specific set of key splits. Currently this is done by custom code
    29  which works as follows: the keys are sorted; a split is introduced on the middle
    30  key, then the splits to the left and respectively to the right are processed
    31  recursively (in parallel). We want the new syntax to implement this algorithm so
    32  that backup/restore can use it (assuming they can switch to using split point
    33  specified via column values as opposed to keys).
    34  
    35  We currently have the `ALTER TABLE/INDEX SPLIT AT` statement which allows
    36  introducing a table or index split at a specific tuple of column values. The
    37  main drawback is that it cannot be used *programmatically*: each split must be
    38  in its own statement; we cannot generate (e.g. via a `SELECT`) a set of values
    39  where to split. We want to write tests like the one linked above without
    40  hardcoding each split in the test file; and we want to be able to easily change
    41  the test table sizes. In addition, the `SPLIT AT` statements don't support
    42  control of replication.
    43  
    44  # Detailed design
    45  
    46  ### 1. `ALTER TABLE/INDEX SPLIT AT` enhancement ###
    47  
    48  The existing `SPLIT AT` syntax is changed: instead of taking a tuple of column
    49  values for a single split, `SPLIT AT` takes an arbitrary select clause where
    50  each row contains the primary key values; a split is issued for each row.
    51  
    52  ```sql
    53  ALTER TABLE <table> SPLIT AT <select_statement>
    54  ALTER INDEX <table>@index SPLIT AT <select_statement>
    55  ```
    56  
    57  Existing uses of `ALTER TABLE SPLIT AT (pk1,pk2,...)` are changed to `ALTER
    58  TABLE SPLIT AT VALUES (pk1,pk2,...)` (and similarly for `ALTER INDEX`);
    59  examples:
    60  
    61  
    62  | Old                                    | New                                           |
    63  |----------------------------------------|-----------------------------------------------|
    64  | `ALTER TABLE d.t SPLIT AT ('c', 3)`    | `ALTER TABLE d.t SPLIT AT VALUES ('c', 3)`    |
    65  | `ALTER INDEX d.t@s_idx SPLIT AT ('f')` | `ALTER INDEX d.t@s_idx SPLIT AT VALUES ('f')` |
    66  
    67  The new syntax allows introducing multiple split points, as well as calculated
    68  or table-driven split points; for example:
    69  ```sql
    70  ALTER TABLE t1 SPLIT AT VALUES ('c', 3), ('d', 4), ('e', 5)
    71  ALTER INDEX t2@idx SPLIT AT SELECT i*10 FROM GENERATE_SERIES(1, 5) AS g(i)
    72  ALTER TABLE t3 SPLIT AT SELECT pk1, pk2 FROM d.t WHERE split_here
    73  ```
    74  
    75  The `SPLIT AT` statement is also extended to support providing only a *prefix*
    76  of the primary key or index values. Example:
    77  ```sql
    78  CREATE TABLE t (k1, k2, k3, v INT, PRIMARY KEY (k1, k2, k3))
    79  
    80  -- Introduce a split at at /t/primary/1
    81  ALTER TABLE t SPLIT AT VALUES (1)
    82  
    83  -- Introduce a split at at /t/primary/1/2
    84  ALTER TABLE t SPLIT AT VALUES (1,2)
    85  
    86  -- Introduce a split at at /t/primary/1/2/3
    87  ALTER TABLE t SPLIT AT VALUES (1,2,3)
    88  ```
    89  
    90  Note that the statement returns only after the splits have been preformed.
    91  
    92  Implementation note: the `SPLIT AT` implementation can issue the splits in any
    93  order; for example, if there are many splits, it is advantageous to sort the
    94  split points, split at the middle point, then recursively process the left and
    95  right sides (in parallel).
    96  
    97  *Interleaved tables*: the command works as expected; the split will inherently
    98  cause a corresponding split in the parent or child tables/indexes.
    99  
   100  ##### Return values #####
   101  
   102  `ALTER TABLE/INDEX SPLIT AT` currently returns a row with two columns: the key
   103  in raw and pretty-printed form. The same schema is kept; a row for each split is
   104  output.
   105  
   106  If a split already exists, `SPLIT AT` currently returns an error.
   107  
   108  TBD: what is the correct behavior with multiple split points? Ignore this error?
   109  
   110  ### 2. `ALTER TABLE/INDEX SCATTER` ###
   111  
   112  A new pair of statements similar to `SPLIT AT` are introduced. Each has two
   113  forms. The first form causes all the ranges for that table or index to be
   114  "scattered": for each range, a new random set of stores are chosen for replicas
   115  (in accordance with the zone config).
   116  
   117  ```sql
   118  ALTER TABLE <table> SCATTER
   119  ALTER INDEX <table>@index SCATTER
   120  ```
   121  
   122  The second form allows only a specific area of a table or index to be scattered.
   123  The area is specified using a pair of tuples (primary key or index column
   124  values) for the start and end of the area. The values are interpreted similar to
   125  `SPLIT AT`, and similarly a prefix of the primary key or index columns can be
   126  specified:
   127  
   128  ```sql
   129  ALTER TABLE <table> SCATTER
   130  ALTER TABLE <table> SCATTER FROM (startPK1, startPK2, ...) TO (endPK1, endPK2, ...)
   131  ALTER INDEX <table>@index SCATTER
   132  ALTER INDEX <table>@index SCATTER FROM (startCol1, startCol2, ...) TO (endCol1, endCol2, ...)
   133  ```
   134  
   135  Note that no new split points are introduced by `SCATTER`.
   136  
   137  Examples:
   138  ```sql
   139  ALTER TABLE t SCATTER
   140  ALTER TABLE t SCATTER FROM (1,1) TO (1,2)
   141  ALTER INDEX t@idx SCATTER FROM (1) TO (2)
   142  ```
   143  
   144  The statement returns only after the relocations are complete.
   145  
   146  *Interleaved tables*: the command works as expected (the ranges may contain rows
   147  for parent or child tables/indexes).
   148  
   149  ### 3. `ALTER TABLE/INDEX EXPERIMENTAL_RELOCATE` ###
   150  
   151  The `EXPERIMENTAL_RELOCATE` statements can be used to relocate specific ranges to
   152  specific stores. This is very low-level functionality and is intended to be used
   153  sparingly, mainly for setting up tests which benefit from a predetermined data
   154  distribution. The rebalancing queues should be stopped in order to make the
   155  relocations "stick".
   156  
   157  ```sql
   158  ALTER TABLE <table> EXPERIMENTAL_RELOCATE <select_statement>
   159  ALTER INDEX <table>@<index> EXPERIMENTAL_RELOCATE <select_statement>
   160  ```
   161  
   162  `EXPERIMENTAL_RELOCATE` takes a select statement with the following result schema:
   163   - the first column is the relocation information: an array of integers, where
   164     each integer is a store ID. This indicates the set of replicas for the range;
   165     the first replica in the array will be the new lease owner.
   166  
   167   - the rest of the columns indicate a point in the table as a list of primary
   168     key or index columns (similar to `SPLIT AT`). The range that contains this
   169     point is relocated according to the first column. Note that, just like `SPLIT
   170     AT`, a row with this particular set of values doesn't have to exist in the
   171     table.
   172  
   173  Examples:
   174  ```sql
   175  CREATE TABLE t (k1, k2, k3, v INT, PRIMARY KEY (k1, k2, k3))
   176  
   177  -- Move the range containing /t/primary/1/2/3 to store 1:
   178  ALTER TABLE t EXPERIMENTAL_RELOCATE VALUES (ARRAY[1], 1, 2, 3)
   179  
   180  -- Move the range containing /t/primary/1/2 to stores 5,6,7 (with 5 as lease owner):
   181  ALTER TABLE t EXPERIMENTAL_RELOCATE VALUES (ARRAY[5,6,7], 1, 2)
   182  
   183  -- Move even k1s to store 1, odd k1s to store 2:
   184  ALTER TABLE t EXPERIMENTAL_RELOCATE SELECT ARRAY[1+i%2], i FROM GENERATE_SERIES(1, 10) AS g(i)
   185  ```
   186  
   187  The statement returns only after the relocations are complete.
   188  
   189  *Interleaved tables*: the command works as expected (the ranges may contain rows
   190  for parent or child tables/indexes).
   191  
   192  ### 4. `crdb_internal.ranges` and `ranges_cached` system table ###
   193  
   194  To facilitate testing the implementation of the new commands (as well as allow a
   195  user to verify what the commands did), we introduce a `crdb_internal.ranges`
   196  system table that can be used to look at all the ranges on the system, or the
   197  ranges from a table.
   198  
   199  The schema of the table is as follows:
   200  
   201  Column         | Type       | Description
   202  ---------------|------------|------------------------------
   203  `start_key`    | BYTES      | Range start key (raw)
   204  `start_pretty` | STRING     | Range start key (pretty-printed)
   205  `end_key`      | BYTES      | Range end key (raw)
   206  `end_pretty`   | STRING     | Range end key (pretty-printed)
   207  `database`     | STRING     | Database name (if range is part of a table)
   208  `table`        | STRING     | Table name (if range is part of a table); for interleaved tables this is always the root table.
   209  `index`        | STRING     | Index name (if range is part of a non-primary index); 
   210  `replicas`     | ARRAY(INT) | Replica store IDs
   211  `lease_holder` | INT        | Lease holder store ID
   212  
   213  The last two columns could be hidden (so they are only available if `SELECT`ed
   214  for specifically).
   215  
   216  Implementation notes:
   217   - the system table infrastructure will be improved so the row producing
   218     function has access to filters; specifying a `table` or `index` filter that
   219     should be optimized to only look at the ranges for that table.
   220   - the row producing function should also have access to needed columns; that
   221     way the more expensive lease holder determination can be omitted if the
   222     column is not needed.
   223  
   224  A second table, `crdb_internal.ranges_cached` has the same schema, but it
   225  returns data from the range and lease holder caches. Specifically: the ranges
   226  along with `replicas` information are populated from the range cache; for each
   227  range, if that range ID has an entry in the lease holder cache, `lease_holder`
   228  is set according to that entry; otherwise it is NULL.
   229  
   230  # Alternatives
   231  
   232  An alternative considered was to create statements that operate on *keys* (key
   233  prefixes) and provide functions that generate key prefixes from column values.
   234  This was deemed as too low-level for the SQL interface.
   235  
   236  Another alternative considered was to also use a function for the splitting
   237  functionality, e.g. `SELECT split_at(..)`. The problem is that it forces the
   238  splits to happen sequentially; we cannot implement the algorithm mentioned above
   239  that parallelizes the splits. One way around this would be to introduce `split_at`
   240  as an *aggregation* function (akin to `sum`).
   241  
   242  Alternatives considered for `crdb_internal.ranges`:
   243   - a `SHOW RANGES FOR TABLE/INDEX` statement; the system table was deemed more
   244     useful.
   245   - having multiple system tables (e.g. a separate one for lease holders) and
   246     using joins as necessary; this requires too many changes to make sure we only
   247     generate the parts of the table that are needed.
   248  
   249  # Unresolved questions