     1  - Feature Name: SQL Table Partitioning
     2  - Status: in-progress
     3  - Start Date: 2017-01-25
     4  - Authors: Nikhil Benesch, Daniel Harrison, David Taylor
     5  - RFC PR: [#18683]
     6  
     7  
     8  # Summary
     9  
    10  Table partitioning provides row-level control over how and where data is stored.
    11  
Specifically, `CREATE TABLE`, `ALTER TABLE`, and their `INDEX` counterparts
    13  learn to accept a partition specification, which maps ranges of the index to
    14  named partitions. Zone configs learn to target these partitions so that locality
    15  constraints and replication settings can vary per partition instead of per table
    16  or database.
    17  
    18  
    19  # Motivation
    20  
    21  The magic of CockroachDB is its ability to stitch together physically-disjoint
    22  ranges of data into one logical table, no matter whether the servers are
    23  separated by several racks or several continents.
    24  
    25  If the cluster operator has any of the following requirements, however,
    26  CockroachDB’s default range allocation policy might be a poor fit:
    27  
    28  - **Latency guarantees.** The operator must ensure that accessing particular
    29    data from a particular part of the world will be fast. Since the default
    30    allocation policy optimizes average latency across all ranges, it might
    31    violate the latency requirements on a particular range in the service of
    32    balancing the cluster.
    33  
    34  - **Data sovereignty.** The operator is subject to regulations that require
    35    certain data to live on servers in certain geographical regions. The default
    36    allocation policy will blindly move data from overfull regions to underfull
    37    regions, potentially violating these regulations.
    38  
    39  - **Cost reduction.** The operator needs to store infrequently-accessed data on
    40    slower hardware (e.g. spinning disks) to save money. The default allocation
    41    policy will happily shuffle data between nodes with SSDs and nodes with HDDs.
    42  
Zone configs can help with the above, provided the constraint applies to an
    44  entire table. To meet latency guarantees, a zone config can constrain an entire
    45  table to a particular data center. To reduce costs, a zone config can constrain
    46  an entire infrequently-accessed table to nodes with slower hardware. Data
    47  sovereignty, however, requires further work.
    48  
    49  Often, however, the requirement applies to only a subset of the table. A `users`
    50  table, for example, should store its American users in North America, its
    51  European users in Europe, etc. Maybe an `articles` table should store the most
    52  recent few months of posts on SSDs, but everything else on HDDs. Creating
    53  individual tables (`users_america`, `users_europe`, `articles_recent`,
    54  `articles_archive` etc.) would allow today’s zone configs to solve the problem,
    55  but would bring back some of the pain of manual sharding that CockroachDB was
    56  designed to avoid. Clients would be responsible for reading data from and
    57  writing data to the right shard and stitching data from multiple shards together
    58  if necessary.
    59  
    60  By allowing zone configs to target subsets (i.e., “partitions”) of a table, we
    61  allow operators to meet their latency, regulatory, and cost requirements while
    62  preserving the magic of CockroachDB: that is, a logical table that hides the
    63  complex physical sharding underneath.
    64  
    65  
    66  # A note on scope
    67  
    68  To ensure the design proposed within can be implemented in a reasonable
    69  timeframe, the following items are considered out-of-scope:
    70  
    71  - **Improving the zone config interface.** The YAML/CLI interface for specifying
    72    zone configs is clunky and will get more confusing when partitions are added,
    73    but improvements are better left to a comprehensive redesign (which is already
  planned for row-level TTLs, auditing, etc). Instead, this RFC proposes the
    75    minimum changes necessary to the existing interface to support partitioning.
    76  
    77  - **Data sovereignty.** The design within is necessary but not
    78    sufficient to meet data sovereignty regulations. In particular, the proposed
    79    scheme provides little protection against misconfiguration and little
    80    visibility into, at a given moment, what data is actually stored where.
    81    Additionally, we'll need a resolution to protected information leaking via
    82    keys, which make their way into various system ranges (e.g. meta2). The rest
    83    of the data sovereignty work will be developed in a followup RFC.
    84  
    85  - **Engineering work schedule.** Some of this work will make 2.0 and some won’t,
    86    but determining milestones is intentionally omitted from this RFC.
    87  
    88  - **Altering the primary key columns of a table.** The design below relies on
    89    careful primary key selection, which means tables created without partitioning
    90    in mind may not have a suitable schema. For now, a workaround is to export and
    91    reimport the data with a new schema. Eventually, we need to support primary
    92    key schema changes (see [#19141]).
    93  
    94  
    95  # Guide-level explanation
    96  
    97  ## Example: geographic partitioning
    98  
    99  Consider a globally-distributed online storefront, RoachMart, that sells and
   100  ships live cockroaches. RoachMart processes several hundred requests per second,
   101  each of which might generate several queries to RoachMart’s nine-node
   102  CockroachDB cluster. (Live cockroaches are very popular these days.) To keep
   103  these requests speedy, RoachMart’s operators want to automatically locate each
   104  user’s data in datacenters near that user.
   105  
   106  Today, RoachMart has two data centers: one in North America and one in
   107  Australia. The company expects to open data centers in Europe, Africa, and Asia
   108  soon, so they want to ensure they can migrate users to the new data centers once
   109  they open.
   110  
   111  In CockroachDB, partitions are cheap and easy to redefine at will, with one
   112  caveat: partitions must be defined over columns that are a prefix of the primary
   113  key. A `users` table with just a string primary key, like
   114  
   115  ```sql
   116  CREATE TABLE users (
   117    email STRING PRIMARY KEY,
   118    continent STRING,
   119    ...
   120  );
   121  ```
   122  
   123  cannot be partitioned by `continent` because `continent` is not part of the
   124  primary key. Instead, the table needs to be specified with a composite primary
   125  key
   126  
   127  ```sql
   128  CREATE TABLE users (
   129    email STRING,
   130    continent STRING,
   131    ...
   132    PRIMARY KEY (continent, email)
   133  );
   134  ```
   135  
   136  where the partition column `continent` appears first in the key.
   137  
   138  Note that this usage of composite primary keys is somewhat unusual. See the
   139  [Partitioning key selection] section below for a discussion.
   140  
   141  Since it’s not currently possible to change a table’s primary key, the partition
   142  columns included in the primary key when the table is created must be granular
   143  enough to support any partitioning scheme that might be desired in the future.
   144  Even if it were possible, changing the primary key would require rewriting *all*
   145  the data in the table and any tables interleaved beneath, a potentially
   146  intractable operation. See [Partitioning and index columns] for a more detailed
   147  discussion of this drawback.
   148  
   149  RoachMart is worried that, someday, they’ll have more than one data center per
   150  continent, so they decide to use `country` instead of `continent` as their
   151  partition column. Using the `PARTITION BY LIST` syntax, they can group
   152  individual countries into larger partitions, one for each of their data centers:
   153  
   154  ```sql
   155  CREATE TABLE roachmart.users (
   156    email STRING,
   157    country STRING,
   158    ...,
   159    PRIMARY KEY (country, email)
   160  ) PARTITION BY LIST (country) (
   161    PARTITION australia VALUES IN ('AU', 'NZ'),
   162    PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
  PARTITION default VALUES IN (DEFAULT)
   164  );
   165  ```
   166  
   167  As requirements shift (e.g., a new data center opens, an existing data center is
   168  running low on capacity, etc.), a country can be seamlessly migrated to a new
   169  partition with an `ALTER TABLE` command that repartitions the table:
   170  
   171  ```sql
   172  ALTER TABLE roachmart.users PARTITION BY LIST (country) (
   173    PARTITION australia VALUES IN ('AU', 'NZ'),
   174    PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
   175    PARTITION scandinavia VALUES IN ('DK', 'NO', 'SE'),
   176    PARTITION default VALUES IN (DEFAULT)
   177  );
   178  ```
   179  
   180  Each partition is required to have a unique name that logically identifies it
   181  across repartitions. Above, the `default` partition is considered equivalent to
   182  the original `default` partition, though it is missing three countries that are
   183  now in the new `scandinavia` partition.
   184  
   185  On their own, partitions are inert. The SQL schema for `roachmart.users` does
   186  nothing to actually locate the `australia` partition in Australia. Applying
   187  functionality to a partition requires zone configs. Just as each database and
   188  table can be targeted by a [zone config] that overrides the number and location
   189  of its replicas, each partition can be targeted by a zone config.
   190  
   191  So, RoachMart restarts each of its CockroachDB nodes with a `--locality` flag
   192  that indicates its data center:
   193  
   194  ```
   195  $ ./cockroach start --locality=datacenter=au1
   196  $ ./cockroach start --locality=datacenter=us1
   197  ```
   198  
   199  Then they can apply a zone config to each partition that restricts it to the
   200  appropriate locality:
   201  
   202  ```
   203  $ cat australia.zone.yml
   204  constraints: [+datacenter=au1]
   205  $ cat north_america.zone.yml
   206  constraints: [+datacenter=us1]
   207  $ ./cockroach zone set roachmart.users --partition=australia -f australia.zone.yml
   208  $ ./cockroach zone set roachmart.users --partition=north_america -f north_america.zone.yml
   209  ```
   210  
   211  The replicate queue on each node will notice the updated zone config and begin
   212  rebalancing the cluster. Users with `country = 'NZ'` will live on a range that
   213  is only replicated within the au1 data center and users with `country = 'US'`
   214  will live on a range that is only replicated within the us1 data center.
   215  
   216  
   217  ### Additional colocated tables
   218  
   219  Data in the `users` table is now stored in the correct place, but the typical
   220  RoachMart request accesses more than just the `users` table. In particular,
   221  RoachMart would like to colocate users’ orders. Interleaved tables make this
   222  simple.
   223  
   224  ```sql
   225  CREATE TABLE roachmart.orders (
   226    user_country STRING,
   227    user_email STRING,
   228    id INT,
   229    part_id INT,
   230    ...
   231    PRIMARY KEY (user_country, user_email, id),
   232    FOREIGN KEY (user_country, user_email) REFERENCES users
   233  ) INTERLEAVE IN PARENT users (user_country, user_email)
   234  ```
   235  
   236  The key encoding of interleaved tables ensures that the zone config of any
   237  top-level partitions applies to the data interleaved within.
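
For illustration, here is a sketch of the interleaved key layout, using table
names in place of table IDs (`#` marks the interleave sentinel):

```
/users/1/<user_country>/<user_email>                 -> [users columns]
/users/1/<user_country>/<user_email>/#/orders/1/<id> -> [orders columns]
```

Every `orders` key shares the `/users/1/<user_country>/...` prefix of its
parent row, so a zone config applied to the key span of a `users` partition
covers the interleaved `orders` rows as well.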
   238  
   239  
   240  ### Partitioned secondary indices
   241  
   242  RoachMart’s warehouse team wants to know how many orders are processed for each
   243  part they sell. They want a secondary index to make the query efficient. Since
   244  the common operation is to look at one country’s orders at a time, they
partition the index on the same column as the `users` table.
   246  
   247  ```sql
   248  CREATE INDEX part_idx ON roachmart.orders (user_country, part_id)
   249  PARTITION BY LIST (user_country) (
   250    ...
   251  );
   252  ```
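
The warehouse team’s query might then look like the following (illustrative);
the `user_country` constraint restricts the scan to a single partition:

```sql
SELECT part_id, count(*)
FROM roachmart.orders
WHERE user_country = 'AU'
GROUP BY part_id;
```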
   253  
   254  ## Example: date partitioning
   255  
   256  RoachBlog, a free weblog provider, is worried about the amount of data stored in
   257  their `articles` table. The table is growing at the rate of approximately 100GB
   258  per day. (Cockroach enthusiasts are prolific.)
   259  
   260  An investigation of their query traffic has revealed that, save for a few
   261  outliers, articles published more than 30 days ago receive virtually no traffic.
   262  They’d like to move those articles to nodes with HDDs to save money on hardware.
   263  Smartly—as if they had anticipated this eventuality—RoachBlog’s engineers
   264  designed the table with a `TIMESTAMP` primary key:
   265  
   266  ```sql
   267  CREATE TABLE roachblog.articles (
   268    id SERIAL,
   269    published TIMESTAMP,
   270    author_id INT,
   271    ...,
   272    PRIMARY KEY (published, id)
   273  );
   274  ```
   275  
   276  Listing out every `published` timestamp that should be considered “archived”
   277  would be infeasible, so they use the range partitioning syntax instead:
   278  
   279  ```sql
   280  ALTER TABLE roachblog.articles PARTITION BY RANGE (published) (
   281    PARTITION archived VALUES FROM (MINVALUE) TO ('2017-12-04'),
   282    PARTITION recent VALUES FROM ('2017-12-04') TO (MAXVALUE)
   283  );
   284  ```
   285  
RoachBlog plans to run this statement every week with an updated partition split
   287  point. Repartitioning was designed to be quite cheap to support this very use
   288  case, and indeed no actual data needs to be rewritten.
   289  
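For example, the following week’s statement would simply advance the split
point (dates illustrative):

```sql
ALTER TABLE roachblog.articles PARTITION BY RANGE (published) (
  PARTITION archived VALUES FROM (MINVALUE) TO ('2017-12-11'),
  PARTITION recent VALUES FROM ('2017-12-11') TO (MAXVALUE)
);
```
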
N.B.: Each time this statement is run it will produce a new split point in the
   291  table, which could leave a lot of small ranges (until we implement range
   292  merges). Documentation will need to caution users about repartitioning too
   293  often.
   294  
As with RoachMart, RoachBlog now need only launch its nodes with appropriate
   296  store attributes
   297  
   298  ```
   299  $ ./cockroach start --store=path=/mnt/crdb,attrs=ssd
   300  $ ./cockroach start --store=path=/mnt/crdb,attrs=hdd
   301  ```
   302  
   303  and install the corresponding zone config:
   304  
   305  ```
   306  $ cat recent.zone.yml
   307  constraints: [+ssd]
   308  $ cat archived.zone.yml
   309  constraints: [+hdd]
   310  $ ./cockroach zone set roachblog.articles --partition=recent -f recent.zone.yml
   311  $ ./cockroach zone set roachblog.articles --partition=archived -f archived.zone.yml
   312  ```
   313  
   314  
   315  ### Global secondary indices
   316  
   317  RoachBlog has a page which lists all articles written by a given author. There
   318  are no sovereignty issues with this, so they create a global index to keep this
   319  page fast:
   320  
   321  ```sql
   322  CREATE INDEX author_idx ON roachblog.articles (author_id)
   323  ```
   324  
   325  ## Partitioning key selection
   326  
   327  Typically, a composite key is used when the unique identifier for a row is
   328  naturally formed from more than one column. Consider, for example, a university
   329  course catalog that spans multiple years. Every course has an ID, like CS101,
   330  but that ID is reused every year. The primary key, then, must include both the
   331  course ID and the term it was offered. In this case, you might very reasonably
   332  choose to form a composite key out of three columns:
   333  
   334  ```sql
   335  CREATE TABLE courses (
   336    course_id STRING,
   337    term      STRING,
   338    year      INT,
   339    ...
   340    PRIMARY KEY (course_id, term, year)
   341  )
   342  ```
   343  
This is an accepted practice in schema design. You might even choose a
different column order for the primary key based on expected query patterns;
e.g., (`year`, `term`, `course_id`) might be a better order if the application
frequently queries for the courses in a given year.
   348  
Partitioning stretches this practice by constructing a composite key around a
column that could uniquely identify a row on its own. Notice that in the
`courses` table, no proper subset of the primary key columns is sufficient to
uniquely identify a row. In `roachmart.users`, however, `email` alone is
sufficient to identify a row; the `country` column is superfluous and exists
only to facilitate partitioning. The biggest consequence of this oddity is the
potential for user confusion; overspecified keys cause no known technical
problems.
   357  
   358  In fact, the guidelines for constructing a partitioning key under this scheme
   359  are quite straightforward: include everything you wish to partition by in the
   360  key, in the order you wish to nest your subpartitions, and follow it with
   361  everything that would have otherwise been in the key.
   362  
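For example, a table to be partitioned by region and subpartitioned by creation
time, where `id` alone would otherwise have been the key, could be declared as
follows (schema illustrative):

```sql
CREATE TABLE events (
  region     STRING,
  created_at TIMESTAMP,
  id         SERIAL,
  ...,
  PRIMARY KEY (region, created_at, id)
);
```
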
   363  See the [Partitioning and index columns](#partitioning-and-index-columns)
   364  section for a discussion of why this approach was chosen.
   365  
   366  ## Summary: initial partitioning and repartitioning
   367  
   368  - Partitions can be added to a table when it is created, or at any time
   369    afterward. (However, since the primary key of a table cannot be changed,
   370    forethought is still required.)
   371  
   372  - Partitions are defined over one or more columns. These columns must be a
   373    prefix of the primary key (or secondary index).
   374  
   375  - Indexes can also be partitioned, but are not required to be.
   376  
   377  - Each partition is required to have a name that is unique among all partitions
   378    on that table or index.
   379  
   380  - Repartitioning is relatively cheap; data is not rewritten. The table metadata
   381    is updated and the allocator is left to move the data as necessary. A
   382    partition with the same name before and after a repartitioning is considered
   383    the same partition. During a repartitioning, any partition zone config entries
   384    with no corresponding name in the new partitions are removed.
   385  
   386  - Future work to expose zone configs through SQL will likely allow for updating
   387    partitions and the corresponding zone configs together. In the meantime, users
   388    can create an empty partition, apply a zone config, then repartition.
   389  
   390  - After a table is partitioned, its partitions can be targeted by zone
  constraints via the existing CLI, e.g. `./cockroach zone set
   392    database.table --partition=partition -f zone.yml`.
   393  
   394  - As before, CockroachDB uses the most granular zone config available. Zone
   395    configs that target a partition are considered more granular than those that
   396    target a table, which in turn are considered more granular than those that
   397    target a database. Configs do not inherit, but unspecified fields when a zone
  override is first created are copied from the parent, as is currently the case.
   399  
   400  
   401  ## Usage: locality–resilience tradeoff
   402  
   403  There exists a tradeoff between making reads/writes fast and surviving failures.
   404  Consider a partition with three replicas of `roachmart.users` for Australian
   405  users. If only one replica is pinned to an Australian datacenter, then reads may
   406  be fast (via [leases follow the sun]) but writes will be slow. If two replicas
are pinned to an Australian datacenter, then reads and writes will be fast (as
   408  long as the cross-ocean link has enough bandwidth that the third replica doesn’t
   409  fall behind and hit the quota pool). If those two replicas are in the same
   410  datacenter, then loss of one datacenter can lead to data unavailability, so some
deployments may want two separate Australian datacenters. If all three replicas
   412  are in Australian datacenters, then three Australian datacenters are needed to
   413  be resilient to a datacenter loss.
   414  
   415  
   416  # Reference-level explanation
   417  
   418  ## Key encoding
   419  
   420  Different key encodings were evaluated at length in a prior version of this RFC.
   421  The approach recommended, entitled [Index Key Prefix], is described here without
discussion of the tradeoffs involved in the decision; refer to the original RFC
   423  that discussion.
   424  
   425  The Index Key Prefix approach simply allows part of the existing row key to be
   426  used as the partition key. For a partitioned table, the partition key can be any
   427  prefix of the table primary key. Similarly, for a partitioned index, the
   428  partition key can be any prefix of the indexed columns. Since a row’s key is
   429  encoded as the table ID, followed by the index ID, followed by each of the
   430  indexed columns, in order, the partition key appears immediately after the table
   431  and index IDs, and thus rows with the same partition key will be adjacent.
   432  
   433  To further illustrate this, consider the key encodings for the earlier
   434  geographic partitioning example.
   435  
   436  Primary index for `roachmart.users`:
   437  
   438  ```
/TableID/IndexID/<country>/<email> -> [remaining columns]
   440  ```
   441  
   442  Global non-unique secondary index for `roachblog.articles.author_id`:
   443  
   444  ```
/TableID/IndexID/<author_id>/<article_id> -> null
   446  ```
   447  
   448  Partitioned non-unique secondary index for `roachmart.orders.part_id`:
   449  
   450  ```
/TableID/IndexID/<user_country>/<part_id>/<user_email>/<order_id> -> null
   452  ```
   453  
   454  Note that values need not be physically adjacent to belong to the same logical
   455  partition. For example, `'US'`, `'CA'` and `'MX'` are interspersed with other
countries, but all map to the same `north_america` partition in
   457  `roachmart.users`. This comes at a cost, as the `north_america` partition now
   458  requires a minimum of three separate ranges instead of just one. See [Range
   459  splits] below for further discussion.
   460  
   461  
   462  ## SQL syntax
   463  
   464  The `PARTITION BY` clause is usable in `CREATE TABLE`, `ALTER TABLE`, `CREATE
   465  INDEX`, `ALTER INDEX` and variants. See [Example: geographic partitioning] for
   466  the `PARTITION BY LIST` syntax and [Example: date partitioning]  for `PARTITION
   467  BY RANGE`.
   468  
   469  A somewhat formal version of the syntax is presented below:
   470  
   471  ```
   472  CREATE TABLE <table-name> ( <elements...> ) [<interleave>] [<partition-scheme>]
   473  CREATE INDEX [<index-name>] ON <tablename> ( <col-names>... ) [<interleave>] [<partition-scheme>]
   474  ALTER TABLE <table-name> <partition-scheme>
   475  ALTER INDEX <index-name> <partition-scheme>
   476  
   477  Partition scheme:
  PARTITION BY LIST ( <col-names>... ) ( <list-partition> [ , ... ] )
  PARTITION BY RANGE ( <col-names>... ) ( <range-partition> [ , ... ] )

List partition:
  PARTITION <partition-name> VALUES IN ( <list-expr>... ) [ <partition-scheme> ]
   483  
   484  List expression:
   485    DEFAULT
   486    <const-expr>
   487  
   488  Range partition:
   489    PARTITION <partition-name> VALUES FROM (<range-expr>) TO (<range-expr>)
   490  
   491  Range expression:
   492    MINVALUE
   493    MAXVALUE
   494    <const-expr>
   495  ```
   496  
   497  To reduce confusion, expressions in a `VALUES` or `VALUES IN` clause must be
   498  constant, as they are only computed once when the `CREATE` or `ALTER` statement
   499  is executed. Allowing non-constant expressions, like `now()` or `SELECT
   500  country_name FROM countries`, would suggest that the partition updates whenever
   501  the expression changes in value.
   502  
   503  Note that the lower bound of a range partition (`FROM`) is inclusive, while the
   504  upper bound (`TO`) is exclusive. Note also that a `NULL` value in a
   505  range-partitioned column sorts into the first range, which is consistent with
   506  our key encoding ordering and `ORDER BY` behavior.
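
Applied to the `roachblog.articles` partitioning above, these rules place rows
as follows:

```
archived: FROM (MINVALUE) TO ('2017-12-04')
recent:   FROM ('2017-12-04') TO (MAXVALUE)

published = '2017-12-03 23:59:59' -> archived  (TO bound is exclusive)
published = '2017-12-04 00:00:00' -> recent    (FROM bound is inclusive)
published IS NULL                 -> archived  (NULL sorts first)
```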
   507  
   508  ### SELECT FROM PARTITION
   509  
   510  To allow reads to target only selected partitions, we propose to extend table
   511  names (not arbitrary table expressions) with a `PARTITION` clause. For example:
   512  
   513  ```sql
   514  SELECT * FROM roachmart.users PARTITION (australia, north_america)
   515  
   516  SELECT *
   517    FROM
   518      roachblog.articles PARTITION (recent) AS a
   519    JOIN
   520      roachblog.user_views PARTITION (north_america) AS b
   521    ON a.id = b.article_id
   522  ```
   523  
The implementation is purely syntactic sugar and simply transforms each
   525  `PARTITION` clause into an additional constraint. The join query above, for
   526  example, would be rewritten to include `WHERE user_views.country IN ('CA', 'MX',
   527  'US') AND articles.published >= 'recent cutoff'`.
   528  
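For example, given the partitioning of `roachmart.users` above, the first query
desugars to:

```sql
SELECT * FROM roachmart.users
WHERE country IN ('AU', 'NZ', 'CA', 'MX', 'US')
```
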
This is only sugar, but it's an important one. For example, in
   530  geopartitioning, it is expected that there will be many more countries than
   531  partitions, so it is much easier to specify partition restrictions with this
   532  than the de-sugared `WHERE`.
   533  
Note that a list partitioning without a `DEFAULT` partition is effectively an
enum. The user can use this syntax with all partitions specified to force the
planner to turn a query like `SELECT * FROM roachmart.users WHERE email = '...'`
(with no country specified) from a full table scan into a point lookup per
`country` in the partitioning.
   539  
   540  
   541  ### Subpartitioning
   542  
   543  Subpartitioning allows partitioning along several axes simultaneously. The
   544  `PARTITION BY` syntax presented above is recursive so that list partitions can
themselves be partitioned any number of times, using either list or range
   546  partitioning. Note that the subpartition columns must be a prefix of the columns
   547  in the primary key that have not been consumed by parent partitions.
   548  
   549  Suppose RoachMart wanted to age out users who haven’t logged in in several
   550  months to slower hardware, while continuing to partition by country for improved
   551  latency. Subpartitioning would neatly solve their use case:
   552  
   553  ```sql
   554  CREATE TABLE roachmart.users (
   555    id SERIAL,
   556    country STRING,
  last_seen TIMESTAMP,
   558    ...,
   559    PRIMARY KEY (country, last_seen, id)
   560  ) PARTITION BY LIST (country) (
   561    PARTITION australia VALUES IN ('AU', 'NZ') PARTITION BY RANGE (last_seen) (
   562      PARTITION australia_archived VALUES FROM (MINVALUE) TO ('2017-06-04'),
   563      PARTITION australia_recent VALUES FROM ('2017-06-04') TO (MAXVALUE)
   564    ),
   565    PARTITION north_america VALUES IN ('CA', 'MX', 'US') PARTITION BY RANGE (last_seen) (
   566      PARTITION north_america_archived VALUES FROM (MINVALUE) TO ('2017-06-04'),
   567      PARTITION north_america_recent VALUES FROM ('2017-06-04') TO (MAXVALUE)
   568    ),
   569    ...
   570  );
   571  ```
   572  
   573  Subpartition names must be unique within a table, as each table's partitions and
   574  subpartitions share a namespace.
   575  
   576  Other databases also provide less flexible but more convenient syntaxes, like a
   577  `SUBPARTITION TEMPLATE` that prevents repetition of identical subpartition
   578  schemes. Above, a `SUBPARTITION TEMPLATE` could be used to describe the
   579  `last_seen` partitioning scheme exactly once, instead of once for each `country`
   580  partition. We propose to implement only the more general syntax and defer design
   581  of a more convenient syntax until demand exists.
   582  
   583  ## IndexDescriptor changes
   584  
   585  Partitioning information is added to `IndexDescriptor` as below. All tuples are
   586  encoded as `EncDatums` using the value encoding.
   587  
   588  ```protobuf
   589  message IndexDescriptor {
   590    ...
   591    optional PartitioningDescriptor partitioning = 13;
   592  }
   593  
   594  message PartitioningDescriptor {
   595    message List {
   596      optional string name = 1;
   597      repeated bytes values = 2;
   598      optional PartitioningDescriptor subpartition = 3;
   599    }
   600  
   601    message Range {
   602      optional string name = 1;
   603      optional bytes inclusive_lower_bound = 2;
   604      optional bytes exclusive_upper_bound = 3;
   605    }
   606  
   607    optional uint32 num_columns = 1;
   608    repeated List list = 2;
   609    repeated Range range = 3;
   610  }
   611  ```
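
As an illustration, the `roachmart.users` list partitioning might be stored
roughly as follows (text-proto sketch, where `enc(...)` stands for the
value-encoded datum):

```
partitioning {
  num_columns: 1
  list { name: "australia"     values: [enc('AU'), enc('NZ')] }
  list { name: "north_america" values: [enc('CA'), enc('MX'), enc('US')] }
  list { name: "default"       values: [enc(DEFAULT)] }
}
```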
   612  
   613  ## Zone config changes
   614  
   615  Zone configs are currently stored in the `system.zones` table, which maps
   616  database and table IDs to `ZoneConfig` protobufs. We propose to adjust this
   617  `ZoneConfig` protobuf to include configuration for "subzones," where a subzone
   618  represents either an entire index, or a partition of any index. Subzones are not
   619  applicable when the zone does not represent a table.
   620  
   621  Ideally, we'd modify or replace `system.zones` with a table keyed by
   622  `(descriptor_id, index_id, partition_name)`, but a proposal to this effect [was
   623  presented and rejected in a previous version of this RFC][system-subzones]. In
   624  short, replacing `system.zones` requires a complex, backwards-incompatible
   625  migration that is better left to its own RFC.
   626  
   627  Our interim solution adds two fields to the `ZoneConfig` proto:
   628  
   629  ```protobuf
   630  message ZoneConfig {
   631    ...
  repeated SubzoneSpan subzone_spans = 8;
   633    repeated SubzoneConfig subzone_configs = 9;
   634  }
   635  
   636  message SubzoneSpan {
   637    optional roachpb.Span span = 1;
   638    optional uint32 subzone_index = 2;
   639  }
   640  
   641  message SubzoneConfig {
   642    optional uint32 index_id = 1;
   643    optional string partition_name = 2;
  optional ZoneConfig config = 3;
   645  }
   646  ```
   647  
   648  A subzone's config is stored in the `subzone_configs` field as a `SubzoneConfig`
   649  message, which bundles a `ZoneConfig` with identification of the index or
   650  partition that the subzone represents. Every `SubzoneConfig` must specify an
   651  `index_id`, but subzones that apply to the entire index omit the
   652  `partition_name` field.
   653  
   654  A mapping from key span to `SubzoneConfig` is stored in the `subzone_spans`
   655  field to allow efficient lookups of the subzone override for a given key.
   656  Entries in `subzone_spans` are non-overlapping and sorted by start key to allow
   657  for binary search. If an entry's span contains the lookup key, the
   658  `subzone_index` field is the index of the `SubzoneConfig` in the
   659  `subzone_configs` field; if no span contains the lookup key, the table config
   660  (i.e., the outer `ZoneConfig`) applies.
   661  
   662  We could alternatively derive `subzone_spans` from the corresponding
   663  `TableDescriptor` on every call to `GetZoneConfigForKey`, but this would involve
   664  quite a bit of encoding/decoding on a hot code path, as the table descriptor
   665  stores the partition split points using value encoding. Updating zone configs or
   666  partitioning, by contrast, happens infrequently; we'll simply recompute
   667  `subzone_spans` whenever an index or partition zone config is updated, or when
   668  a table's partitioning scheme is updated.
   669  
   670  One case requires special care to handle. Suppose the `archived` partition of
   671  `roachblog.articles` has a custom zone config, but the `roachblog.articles`
   672  table itself does not have a custom zone config. The `system.zones` table will
   673  necessarily have an entry for `roachblog.articles` to store the custom zone
   674  config for the `archived` partition, but that entry will have an otherwise empty
`ZoneConfig`. This must not be taken to mean that the other partitions of the
table should use this empty zone config; rather, the default zone config
applies. We propose to use the `num_replicas` field for this purpose:
   678  `num_replicas = 0` is invalid and therefore indicates that the table in
   679  question does not have an active `ZoneConfig`.
   680  
   681  As mentioned in the examples, the zone config CLI will be adjusted to accept an
   682  index specifier and partitioning flag:
   683  
   684  ```bash
   685  # Before:
$ ./cockroach zone {set|get|rm} DATABASE[.TABLE] -f zone.yml

# After:
$ ./cockroach zone {set|get|rm} DATABASE[.TABLE[@INDEX]] [--partition=PARTITION] -f zone.yml
   690  ```
   691  
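For example, to target a partition of the secondary index defined earlier
(file name illustrative):

```bash
$ ./cockroach zone set roachmart.orders@part_idx --partition=australia -f australia.zone.yml
```
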
   692  Omitting the index but specifying the `--partition` flag (e.g., `./cockroach
   693  zone set db.tbl --partition=p0`) implies a partition of the primary index. Note,
however, that `./cockroach zone set db.tbl` and `./cockroach zone set db.tbl@primary` are
   695  *not* equivalent: the former specifies a table zone config that applies to any
   696  secondary indices, unless more specific overrides exist, while the latter
   697  specifies a primary index zone config that never applies to the table's
   698  secondary indices.
   699  
   700  One downside of this scheme is that the zone configs for all partitions of a
   701  table's indices are stored in one row, which puts an effective limit on the
   702  number of partitions allowed on a table. Additionally, `system.zones` is part of
   703  the unsplittable, gossiped system config span, so all zone configs across all
tables must add up to less than 64MB. Some back-of-the-envelope math suggests 60
bytes per range partition, and 60 bytes plus the size of the values in a list
partition.

For `roachmart.users` with every country allocated among 7 partitions, this
comes to 1167B, yielding an absolute max of 64MB / 60B = ~1,000,000
partitioned tables, or 64MB / 1167B = ~54,000 tables if they were all partitioned
   711  by country. The recommended max number of partitions in a table seems to range
   712  from 100 to 1024 in other partitioning implementations, so this seems
   713  reasonable.
   714  
   715  Similar to the way that the zone config for a table, if present, completely
   716  overrides (with no inheritance) the zone config for a database, a zone config
   717  for a partition overrides the zone config for the table, database, or cluster.
   718  The ergonomics of this will likely be sub-optimal; the user will need to
   719  maintain the denormalization of what is naturally an inheritance hierarchy of
   720  configuration. There is a larger upcoming effort to refactor zone configs which
   721  will address these issues, so the following are out of scope of this RFC:
   722  
   723  - A SQL interface for reading and writing zone configs
   724  
   725  - Auditing of zone config changes
   726  
   727  - Inheritance
   728  
   729  - Moving zone configs out of gossip
   730  
   731  - Raising global and per-table limitations on the number of partitions
   732  
   733  - Direct replica-level control of data placement
   734  
   735  - Partition configs outside of the database+table scoping. This could be useful
   736    for allowing partitioning in a shared CockroachDB offering with static, preset
   737    zone configs.
   738  
   739  - The ability to define a partition and specify a zone config for it at the same
   740    time.
   741  
   742  ## Range splits and schema changes
   743  
The CockroachDB unit of replication and rebalancing is the range, so for a zone
config to target a partition, the partition needs to be on its own range (and
potentially more than one range for list partitioning; see [Range splits]).
   747  
   748  Currently, CockroachDB asynchronously splits each newly created table into its
   749  own range. This happens regardless of whether any zone configs are added to
   750  target that table specifically. Each partition could similarly be asynchronously
   751  split after being defined.
   752  
   753  This may create extra ranges if a table is partitioned, repartitioned, and only
   754  then given zone configs (though this problem goes away when we support range
   755  merges). We address this by lazily splitting a partition only when a zone config
   756  is added for it. This deviates a bit from the table behavior and may violate a
   757  user’s expectation of partitions living on separate ranges, but the tradeoff was
   758  deemed worth it.
   759  
A table’s partitions can have a large influence on SQL planning, so any changes
should be broadcast immediately. Further, our schema changes and table leasing
require that at most 2 versions are in use at a time, so it’s a natural fit for
repartitionings to run as schema changes. This schema change will be responsible
   764  for creating the splits and can be hooked into `system.jobs` to provide
   765  introspection of progress as well as cancellation.
   766  
   767  
   768  ## Query planning changes
   769  
   770  Because of the decision to require partitions be defined over normal,
materialized columns, there are no correctness changes needed to SQL query
   772  planning. Performance, however, needs some work.
   773  
   774  
   775  ### Case study
   776  
   777  Consider the following queries issued by RoachMart to the `roachmart.users`
table. (Ignore the poor security practices; this is meant to be illustrative.)
   779  
   780  When a user visits RoachMart and needs to log in:
   781  
   782  ```sql
SELECT id FROM roachmart.users WHERE email = $1
   784  ```
   785  
   786  Whenever a new page is loaded on the web or a new screen is loaded on the native
   787  app, the `id` is extracted from a cookie or passed via the RoachMart api and
   788  used in the stateless servers to rehydrate the user information:
   789  
   790  ```sql
SELECT * FROM roachmart.users WHERE id = 12345
   792  ```
   793  
   794  Regardless of whether the `roachmart.users` table is partitioned, the first
   795  query is kept speedy by an unpartitioned secondary index on `email`. This may
   796  require a cross-ocean hop to another datacenter but login is an infrequent
operation and so this is okay. (If data sovereignty of emails is a concern, then
a global index is not appropriate, and either a global index on `hash(email)` or
a partitioned index must be used. The details of this are left for a later
sovereignty RFC.)
   801  
   802  The second query is much harder. In an unpartitioned table, the primary key
would be just `(id)` and so this is a point lookup. But RoachMart is a global
   804  company and wants to keep a user’s data in the datacenter nearest them, so
   805  they’ve partitioned `roachmart.users` on `(country, id)`. This means the query
   806  as written above will require a full table scan, which is obviously
   807  unacceptable.
   808  
   809  The best solution to this, and the one we will always recommend first, is for
   810  RoachMart to also specify the user’s `country` in the second query. The login
query will be changed to also return the `country`; this will be saved alongside
`id` in the web cookie or native app’s local data, and passed back whenever
retrieving the user. It’s not even as onerous as it first seems, since the
   814  RoachMart API returns the user’s `id` as a string containing both pieces:
   815  `US|123`.
   816  
   817  In some cases, this will not be possible or will not be desirable, so
   818  CockroachDB has a number of pieces that can be combined to deal with this. NB:
   819  None of these really solve the problem in a satisfactory way, so as mentioned
   820  above we will very strongly urge the user to specify the entire primary key.
   821  
   822  
   823  1. The developer could create a global index on just `id`, but in the common
   824     case this requires a cross-datacenter request for the index lookup.
   825  
   826  2. If `LIMIT 1` is added to the query and no sort is requested, the planner is
   827     free to return the first result it finds and cancel the rest of the work. If
   a uniqueness constraint exists on `id`, the `LIMIT 1` can be assumed. The
   latter is an optimization that may be helpful in general.
   830  
   831  3. A list partitioning without a `DEFAULT` is an enum. The user can use the
   832     [SELECT FROM PARTITION] syntax with all partitions specified. Internally, the
   planner will turn this into an `AND country IN ('CA', 'MX', 'US', …)` clause
   834     in the `WHERE`, which turns the query from a full table scan into a point
   835     lookup per `country` in the partitioning. If `DEFAULT` is present, the
   836     non-default cases could optimistically be checked first. If some future
   837     version of CockroachDB supports enum types, the user would get this behavior
   838     even without using the `SELECT FROM PARTITION` syntax.
   839  
   840  4. This is not a full table scan in other partitioning implementations because
   841     they don’t require the `(country, id)` primary key, instead indexing `(id)`
   842     as normal inside each partition. A query on `id` without the partition
   843     information then becomes a point lookup per partition. This can be simulated
   844     in CockroachDB by introducing and partitioning on a derived `partition_id`
   845     column that is 1:1 with partitions. This is sufficient justification to
   846     prioritize building [computed columns].
   847  
   Concretely, the `roachmart.users` table above could have `PRIMARY KEY
   (continent, country, id)` and `PARTITION BY LIST (continent)` to start, so
   there is only one key value per partition. Later, when/if it is needed, it
   could change to `PARTITION BY LIST (continent, country)`. A sketch of the
   derived `partition_id` approach appears after this list.
   852  
5. Combining 3 and 4 could allow a user to issue one query that optimistically
   tries a point lookup in the local partition before trying a point lookup in
   all partitions.
   855  
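A sketch of the derived `partition_id` approach from item 4, using a
hypothetical computed-column syntax (the real syntax is deferred to the
[computed columns] work):

```sql
CREATE TABLE roachmart.users (
  id SERIAL,
  country STRING,
  -- Hypothetical syntax: a stored column derived 1:1 from partitions.
  partition_id INT AS (CASE
    WHEN country IN ('AU', 'NZ') THEN 1
    WHEN country IN ('CA', 'MX', 'US') THEN 2
    ELSE 0
  END) STORED,
  ...,
  PRIMARY KEY (partition_id, id)
) PARTITION BY LIST (partition_id) (
  PARTITION australia VALUES IN (1),
  PARTITION north_america VALUES IN (2),
  PARTITION default VALUES IN (DEFAULT)
);
```

A query on `id` alone then requires at most one point lookup per `partition_id`
value rather than a full table scan.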
   856  
   857  ## Other SQL changes
   858  
   859  Other implementations surface partitioning information via
   860  `information_schema.partitions`, so we should as well. `SHOW CREATE TABLE` will
   861  also need to learn to display `PARTITION BY` clauses.
   862  
   863  
   864  ## Interleaved tables and partitioning
   865  
   866  [Interleaved tables] and partitioning are designed to work together. Geographic
   867  partitioning is used to locate `roachmart.users` records in the nearest
   868  datacenter and interleaved tables are used to locate the data associated with
that user (orders, etc.) near it. A geographically partitioned user can be moved
   870  with one `UPDATE`, and the user’s orders (plus the rest of the interleave
   871  hierarchy) can be moved with `ON UPDATE CASCADE`.
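
For example, moving a user from the `north_america` partition to `australia`
(assuming the foreign key on `roachmart.orders` is declared with `ON UPDATE
CASCADE`; the email is illustrative):

```sql
UPDATE roachmart.users
SET country = 'AU'
WHERE country = 'US' AND email = 'bob@example.com';
```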
   872  
   873  
   874  # Drawbacks
   875  
   876  Like index selection, column families, interleaved tables, and other
   877  optimization tools, partitioning will require some knowledge of the internals of
   878  the system to use effectively. See, for example, the [locality–resilience
   879  tradeoff] described above.
   880  
   881  
   882  ## Partitioning and index columns
   883  
   884  Since a table can only be partitioned by a prefix of its primary key, a table
   885  destined for partitioning often has an “unnatural” primary key. For example, to
   886  support partitioning the `roachmart.users` table by `country`, the table’s
   887  natural primary key, `id`, must be explicitly prefixed with the partition column
   888  to create a composite primary key of `(country, id)`. The composite primary key
   889  has two notable drawbacks: it does not enforce that `id` is globally unique, and
it does not provide fast lookups on `id`. If uniqueness or fast lookups
   891  are required, the user must explicitly create a unique, unpartitioned secondary
   892  index on `id`.
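
For example (index name illustrative):

```sql
CREATE UNIQUE INDEX users_id_key ON roachmart.users (id);
```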
   893  
   894  We could automatically add this secondary index to preserve uniqueness and fast
   895  lookups, but this would violate user expectations. First, a secondary index
   896  might itself store sensitive information. We want to ensure that operators are
   897  aware of the secondary index so they can specify an appropriate zone config or
   898  even create the index on `hash(id)` (though hashes may not provide sufficient
   899  masking of sensitive information, depending on the entropy of the input and the
   900  type of hash). Second, every secondary index increases the cost of writes
   901  (specifically for inserts and for updates that change an indexed or stored
   902  column). For example, an unpartitioned, unique index on `roachmart.users` would
   903  require cross-ocean hops for writes that would otherwise hit just one
   904  continent.*
   905  
   906  Similarly, we could silently prefix the specified primary key (i.e., the natural
   907  primary key) with the partition columns, but this too would violate user
   908  expectations. Most notably, queries that specify all columns in an index, as
   909  specified at creation or returned by reflection like `information_schema`, are
   910  expected to be point lookups.
   911  
   912  Instead, we can aid the user with detailed guidance in the error message
   913  generated by invalid partitioning schemes.
   914  
   915  *In general, ensuring global uniqueness requires cross-datacenter hops on
   916  writes. In limited cases, like `SERIAL` columns, users can achieve both
   917  uniqueness and fast writes without a secondary unique index by assuming
   918  `unique_rowid()` collisions are sufficiently improbable. The risk with such a
   919  scheme, of course, is that someone can manually insert a colliding value.
   920  
   921  
   922  ## Range splits
   923  
   924  The unit of replication in CockroachDB is the range, so partitioning necessarily
   925  requires splitting at least one range for each partition. In the worst case,
   926  when partitioning by list, partitions containing non-adjacent values will
   927  generate an extra range for *each* non-adjacent value.
   928  
For example, consider the original partitioning specification for
`roachmart.users`, in which `'CA'` and `'MX'` belong to the same partition,
`north_america`, but are bisected by values in other partitions (e.g., `'FJ'`,
which falls in the `default` partition). This forces each country onto its own
range. As more countries are added, the effect is amplified.
   934  
   935  This is unfortunate but should be fine in larger tables. It can be worked around
   936  by introducing and partitioning on a `partition_id` column, which is derived
   937  from country. (Computed columns are a natural choice for this.) The tradeoff
   938  here is that repartitioning will necessitate rewriting rows instead of just
   939  updating range metadata.
   940  
   941  Note that repartitioning could make this worse, especially until we support
   942  range merges. As discussed in [Range splits and schema changes], list
   943  partitioning may create more ranges than expected when partition values are not
   944  adjacent. Additionally, repartitioning may result in small or empty ranges that
   945  cannot be cleaned up until we support range merges.
   946  
   947  
   948  # Future work
   949  
   950  ## Admin UI
   951  
   952  Exposing partitioning information and zone configs in the admin UI is out of
   953  scope for this document. See [#14113], which tracks providing broad insight into
   954  zone constraint violations.
   955  
   956  
   957  ## Bulk load
   958  
   959  Other PARTITION BY implementations can be used to quickly bulk load data into a
   960  table (or bulk remove it) and this appears to be a popular use. We currently
   961  allow only bulk load of an entire table at once, so this may be useful for us to
   962  consider as well, but it’s out of scope for this document.
   963  
   964  
   965  # Alternatives
   966  
   967  The separation between partition specification, which happens via SQL, and zone
   968  configuration, which happens via the CLI, is unfortunate and largely historical
   969  happenstance. A table’s partitioning is rarely updated independently of its zone
   970  config, so if and when we move zone configuration to SQL, we should consider
   971  tightly coupling its interface to table definitions/partitioning.
   972  
   973  
   974  # Appendix: Other PARTITIONING syntaxes
   975  
   976  There is unfortunately no SQL standard for partitioning. As a result, separate
   977  partitioning syntaxes have emerged:
   978  
   979  - MySQL and Oracle specify partitions inline in `CREATE TABLE`, like we do
   980    above.
   981  
   982  - Microsoft SQL Server requires four steps: allocating physical storage called
   983    “filegroups” for each partition with `ALTER DATABASE… ADD FILE`, followed by
   984    `CREATE PARTITION FUNCTION` to define the partition split points, followed by
   985    `CREATE PARTITION SCHEME` to tie the partition function output to the created
   986    file groups, followed by `CREATE TABLE... ON partition_scheme`  to tie the
   987    table to the partitioning scheme.
   988  
- PostgreSQL 10 takes a hybrid approach: the partition columns and scheme (i.e.,
   990    `RANGE` or `LIST`) are specified in the `CREATE TABLE` statement, but the
   991    partition split points are specified by running `CREATE TABLE… PARTITION OF…
   992    FOR VALUES` once for each partition.
   993  
   994  We normally reuse PostgreSQL syntax for compatibility reasons, but we’ve
   995  deliberately rejected it here. Partitioning was not a first class feature before
   996  PostgreSQL 10, which was only released on 2017-10-05, so we shouldn’t have
   997  compatibility issues with existing ORMs and applications. The syntax being
   998  introduced in PostgreSQL 10, shown below, treats a partitioned table roughly as
   999  a collection of tables, which each have their own indexes, constraints, etc. It
does not allow for a global index across all partitions of a table, and the
  1001  global table namespace is polluted with each partition.
  1002  
  1003  We’ve instead chosen to closely follow the MySQL and Oracle syntax, since it
  1004  fits with our model of partitions as mostly-invisible subdivisions of a table.
  1005  
  1006  For reference, we’ve replicated the `roachblog.articles` example in each of the
  1007  three syntaxes.
  1008  
  1009  ## MySQL and Oracle
  1010  
  1011  ```sql
  1012  CREATE TABLE articles (
  1013      id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  1014      published DATE,
  1015      ...
  1016  ) PARTITION BY RANGE (published) (
  1017      PARTITION archived VALUES LESS THAN ('2017-01-25'),
  1018      PARTITION recent VALUES LESS THAN MAXVALUE
  1019  );
  1020  ```
  1021  
  1022  ## PostgreSQL
  1023  
  1024  ```sql
  1025  CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
  1027      published DATE,
  1028      ...
  1029  ) PARTITION BY RANGE (published);
  1030  CREATE TABLE articles_archived PARTITION OF articles
  1031    FOR VALUES FROM (MINVALUE) TO ('2017-01-25');
  1032  CREATE TABLE articles_recent PARTITION OF articles
  1033    FOR VALUES FROM ('2017-01-25') TO (MAXVALUE);
  1034  ```
  1035  
  1036  ## Microsoft SQL Server
  1037  
  1038  ```sql
CREATE PARTITION FUNCTION fun (date) AS RANGE LEFT FOR VALUES ('2017-01-25');
  1040  CREATE PARTITION SCHEME sch AS PARTITION fun TO (filegroups...);
  1041  CREATE TABLE articles (id int PRIMARY KEY, published DATE) ON sch (published);
  1042  ```
  1043  
  1044  [#14113]: https://github.com/cockroachdb/cockroach/issues/14113
  1045  [#18683]: https://github.com/cockroachdb/cockroach/pull/18683
  1046  [#19141]: https://github.com/cockroachdb/cockroach/issues/19141
  1047  [computed columns]: https://github.com/cockroachdb/cockroach/pull/20735
  1048  [example: date partitioning]: #example-date-partitioning
  1049  [example: geographic partitioning]: #example-geographic-partitioning
  1050  [index key prefix]: https://github.com/cockroachdb/cockroach/blob/1f3c72f17546f944490e0a4dcd928fd96a375987/docs/RFCS/sql_partitioning.md#key-encoding
  1051  [interleaved tables]: https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html
  1052  [leases follow the sun]: https://github.com/cockroachdb/cockroach/blob/763d21e6fad69728a523d3cdd8b449c8513094b7/docs/RFCS/20170125_leaseholder_locality.md
  1053  [locality–resilience tradeoff]: #usage-localityresilience-tradeoff
  1054  [partitioning and index columns]: #partitioning-and-index-columns
  1055  [partitioning key selection]: #partitioning-key-selection
  1056  [range splits and schema changes]: #range-splits-and-schema-changes
  1057  [range splits]: #range-splits
  1058  [select from partition]: #select-from-partition
[system-subzones]: https://github.com/cockroachdb/cockroach/blob/1f3c72f17546f944490e0a4dcd928fd96a375987/docs/RFCS/sql_partitioning.md#table-subzones
  1060  [zone config]: https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html