- Feature Name: SQL Table Partitioning
- Status: in-progress
- Start Date: 2017-01-25
- Authors: Nikhil Benesch, Daniel Harrison, David Taylor
- RFC PR: [#18683]


# Summary

Table partitioning provides row-level control over how and where data is stored.

Specifically, `CREATE TABLE`, `ALTER TABLE`, and their `INDEX` counterparts
learn to accept a partition specification, which maps ranges of the index to
named partitions. Zone configs learn to target these partitions so that locality
constraints and replication settings can vary per partition instead of per table
or database.


# Motivation

The magic of CockroachDB is its ability to stitch together physically-disjoint
ranges of data into one logical table, no matter whether the servers are
separated by several racks or several continents.

If the cluster operator has any of the following requirements, however,
CockroachDB’s default range allocation policy might be a poor fit:

- **Latency guarantees.** The operator must ensure that accessing particular
  data from a particular part of the world will be fast. Since the default
  allocation policy optimizes average latency across all ranges, it might
  violate the latency requirements on a particular range in the service of
  balancing the cluster.

- **Data sovereignty.** The operator is subject to regulations that require
  certain data to live on servers in certain geographical regions. The default
  allocation policy will blindly move data from overfull regions to underfull
  regions, potentially violating these regulations.

- **Cost reduction.** The operator needs to store infrequently-accessed data on
  slower hardware (e.g. spinning disks) to save money. The default allocation
  policy will happily shuffle data between nodes with SSDs and nodes with HDDs.

Zone configs can help with the above, provided the constraint applies to an
entire table. To meet latency guarantees, a zone config can constrain an entire
table to a particular data center. To reduce costs, a zone config can constrain
an entire infrequently-accessed table to nodes with slower hardware. Data
sovereignty, however, requires further work.

Often, however, the requirement applies to only a subset of the table. A `users`
table, for example, should store its American users in North America, its
European users in Europe, etc. Maybe an `articles` table should store the most
recent few months of posts on SSDs, but everything else on HDDs. Creating
individual tables (`users_america`, `users_europe`, `articles_recent`,
`articles_archive`, etc.) would allow today’s zone configs to solve the problem,
but would bring back some of the pain of manual sharding that CockroachDB was
designed to avoid. Clients would be responsible for reading data from and
writing data to the right shard and stitching data from multiple shards together
if necessary.

By allowing zone configs to target subsets (i.e., “partitions”) of a table, we
allow operators to meet their latency, regulatory, and cost requirements while
preserving the magic of CockroachDB: that is, a logical table that hides the
complex physical sharding underneath.
# A note on scope

To ensure the design proposed within can be implemented in a reasonable
timeframe, the following items are considered out of scope:

- **Improving the zone config interface.** The YAML/CLI interface for specifying
  zone configs is clunky and will get more confusing when partitions are added,
  but improvements are better left to a comprehensive redesign (which is already
  planned for row-level TTLs, auditing, etc.). Instead, this RFC proposes the
  minimum changes necessary to the existing interface to support partitioning.

- **Data sovereignty.** The design within is necessary but not
  sufficient to meet data sovereignty regulations. In particular, the proposed
  scheme provides little protection against misconfiguration and little
  visibility into, at a given moment, what data is actually stored where.
  Additionally, we'll need a resolution to protected information leaking via
  keys, which make their way into various system ranges (e.g. meta2). The rest
  of the data sovereignty work will be developed in a followup RFC.

- **Engineering work schedule.** Some of this work will make 2.0 and some won’t,
  but determining milestones is intentionally omitted from this RFC.

- **Altering the primary key columns of a table.** The design below relies on
  careful primary key selection, which means tables created without partitioning
  in mind may not have a suitable schema. For now, a workaround is to export and
  reimport the data with a new schema. Eventually, we need to support primary
  key schema changes (see [#19141]).


# Guide-level explanation

## Example: geographic partitioning

Consider a globally-distributed online storefront, RoachMart, that sells and
ships live cockroaches. RoachMart processes several hundred requests per second,
each of which might generate several queries to RoachMart’s nine-node
CockroachDB cluster. (Live cockroaches are very popular these days.) To keep
these requests speedy, RoachMart’s operators want to automatically locate each
user’s data in datacenters near that user.

Today, RoachMart has two data centers: one in North America and one in
Australia. The company expects to open data centers in Europe, Africa, and Asia
soon, so they want to ensure they can migrate users to the new data centers once
they open.

In CockroachDB, partitions are cheap and easy to redefine at will, with one
caveat: partitions must be defined over columns that are a prefix of the primary
key. A `users` table with just a string primary key, like

```sql
CREATE TABLE users (
    email STRING PRIMARY KEY,
    continent STRING,
    ...
);
```

cannot be partitioned by `continent` because `continent` is not part of the
primary key. Instead, the table needs to be specified with a composite primary
key

```sql
CREATE TABLE users (
    email STRING,
    continent STRING,
    ...
    PRIMARY KEY (continent, email)
);
```

where the partition column `continent` appears first in the key.

Note that this usage of composite primary keys is somewhat unusual. See the
[Partitioning key selection] section below for a discussion.

Since it’s not currently possible to change a table’s primary key, the partition
columns included in the primary key when the table is created must be granular
enough to support any partitioning scheme that might be desired in the future.
Even if it were possible, changing the primary key would require rewriting *all*
the data in the table and any tables interleaved beneath, a potentially
intractable operation. See [Partitioning and index columns] for a more detailed
discussion of this drawback.
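One consequence worth noting up front: with `PRIMARY KEY (continent, email)`,
nothing enforces that `email` alone remains globally unique. As a minimal
sketch (the index name is illustrative), uniqueness can be restored with an
unpartitioned unique secondary index; see [Partitioning and index columns] for
the write-latency tradeoffs this introduces:

```sql
-- Sketch: restore the global uniqueness of email, which the composite
-- primary key (continent, email) no longer enforces on its own. This
-- index is unpartitioned, so writes to it may incur cross-datacenter
-- hops; see the Drawbacks section.
CREATE UNIQUE INDEX users_email_key ON users (email);
```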
RoachMart is worried that, someday, they’ll have more than one data center per
continent, so they decide to use `country` instead of `continent` as their
partition column. Using the `PARTITION BY LIST` syntax, they can group
individual countries into larger partitions, one for each of their data centers:

```sql
CREATE TABLE roachmart.users (
    email STRING,
    country STRING,
    ...,
    PRIMARY KEY (country, email)
) PARTITION BY LIST (country) (
    PARTITION australia VALUES IN ('AU', 'NZ'),
    PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
    PARTITION default VALUES IN (DEFAULT)
);
```

As requirements shift (e.g., a new data center opens, an existing data center is
running low on capacity, etc.), a country can be seamlessly migrated to a new
partition with an `ALTER TABLE` command that repartitions the table:

```sql
ALTER TABLE roachmart.users PARTITION BY LIST (country) (
    PARTITION australia VALUES IN ('AU', 'NZ'),
    PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
    PARTITION scandinavia VALUES IN ('DK', 'NO', 'SE'),
    PARTITION default VALUES IN (DEFAULT)
);
```

Each partition is required to have a unique name that logically identifies it
across repartitions. Above, the `default` partition is considered equivalent to
the original `default` partition, though it is missing three countries that are
now in the new `scandinavia` partition.

On their own, partitions are inert. The SQL schema for `roachmart.users` does
nothing to actually locate the `australia` partition in Australia. Applying
functionality to a partition requires zone configs. Just as each database and
table can be targeted by a [zone config] that overrides the number and location
of its replicas, each partition can be targeted by a zone config.

So, RoachMart restarts each of its CockroachDB nodes with a `--locality` flag
that indicates its data center:

```
$ ./cockroach start --locality=datacenter=au1
$ ./cockroach start --locality=datacenter=us1
```

Then they can apply a zone config to each partition that restricts it to the
appropriate locality:

```
$ cat australia.zone.yml
constraints: [+datacenter=au1]
$ cat north_america.zone.yml
constraints: [+datacenter=us1]
$ ./cockroach zone set roachmart.users --partition=australia -f australia.zone.yml
$ ./cockroach zone set roachmart.users --partition=north_america -f north_america.zone.yml
```

The replicate queue on each node will notice the updated zone config and begin
rebalancing the cluster. Users with `country = 'NZ'` will live on a range that
is only replicated within the au1 data center and users with `country = 'US'`
will live on a range that is only replicated within the us1 data center.


### Additional colocated tables

Data in the `users` table is now stored in the correct place, but the typical
RoachMart request accesses more than just the `users` table. In particular,
RoachMart would like to colocate users’ orders. Interleaved tables make this
simple:

```sql
CREATE TABLE roachmart.orders (
    user_country STRING,
    user_email STRING,
    id INT,
    part_id INT,
    ...
    PRIMARY KEY (user_country, user_email, id),
    FOREIGN KEY (user_country, user_email) REFERENCES users
) INTERLEAVE IN PARENT users (user_country, user_email);
```

The key encoding of interleaved tables ensures that the zone config of any
top-level partitions applies to the data interleaved within.
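For example (a sketch with hypothetical data), a typical request for an
Australian user and their orders now touches only ranges pinned to the au1
data center:

```sql
-- Sketch with hypothetical data: the user row and the interleaved
-- order rows share the australia partition, so this join can be
-- served entirely by replicas in the au1 data center.
SELECT u.email, o.id, o.part_id
FROM roachmart.users AS u
JOIN roachmart.orders AS o
  ON o.user_country = u.country AND o.user_email = u.email
WHERE u.country = 'AU' AND u.email = 'mate@example.com';
```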
### Partitioned secondary indices

RoachMart’s warehouse team wants to know how many orders are processed for each
part they sell. They want a secondary index to make the query efficient. Since
the common operation is to look at one country’s orders at a time, they
partition the index on the same columns as the `users` table:

```sql
CREATE INDEX part_idx ON roachmart.orders (user_country, part_id)
    PARTITION BY LIST (user_country) (
        ...
    );
```

## Example: date partitioning

RoachBlog, a free weblog provider, is worried about the amount of data stored in
their `articles` table. The table is growing at the rate of approximately 100GB
per day. (Cockroach enthusiasts are prolific.)

An investigation of their query traffic has revealed that, save for a few
outliers, articles published more than 30 days ago receive virtually no traffic.
They’d like to move those articles to nodes with HDDs to save money on hardware.
Smartly—as if they had anticipated this eventuality—RoachBlog’s engineers
designed the table with a `TIMESTAMP` column leading the primary key:

```sql
CREATE TABLE roachblog.articles (
    id SERIAL,
    published TIMESTAMP,
    author_id INT,
    ...,
    PRIMARY KEY (published, id)
);
```

Listing out every `published` timestamp that should be considered “archived”
would be infeasible, so they use the range partitioning syntax instead:

```sql
ALTER TABLE roachblog.articles PARTITION BY RANGE (published) (
    PARTITION archived VALUES FROM (MINVALUE) TO ('2017-12-04'),
    PARTITION recent VALUES FROM ('2017-12-04') TO (MAXVALUE)
);
```

RoachBlog plans to rerun this `ALTER` statement every week with an updated
partition split point. Repartitioning was designed to be quite cheap to support
this very use case, and indeed no actual data needs to be rewritten.

N.B.: Each time this statement is run it will produce a new split point in the
table, which could leave a lot of small ranges (until we implement range
merges). Documentation will need to caution users about repartitioning too
often.
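For instance, the next weekly run might look like the following (a sketch; the
new cutoff date is illustrative). Because partition bounds must be constant
expressions (see the SQL syntax section below), the application computes the
30-day cutoff itself rather than embedding `now()` in the partition definition:

```sql
-- Hypothetical run one week later: the application has computed the
-- new 30-day cutoff ('2017-12-11') and substituted it as a constant.
ALTER TABLE roachblog.articles PARTITION BY RANGE (published) (
    PARTITION archived VALUES FROM (MINVALUE) TO ('2017-12-11'),
    PARTITION recent VALUES FROM ('2017-12-11') TO (MAXVALUE)
);
```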
As with RoachMart, RoachBlog now need only launch their nodes with appropriate
store attributes

```
$ ./cockroach start --store=path=/mnt/crdb,attrs=ssd
$ ./cockroach start --store=path=/mnt/crdb,attrs=hdd
```

and install the corresponding zone configs:

```
$ cat recent.zone.yml
constraints: [+ssd]
$ cat archived.zone.yml
constraints: [+hdd]
$ ./cockroach zone set roachblog.articles --partition=recent -f recent.zone.yml
$ ./cockroach zone set roachblog.articles --partition=archived -f archived.zone.yml
```


### Global secondary indices

RoachBlog has a page which lists all articles written by a given author. There
are no sovereignty issues with this, so they create a global index to keep this
page fast:

```sql
CREATE INDEX author_idx ON roachblog.articles (author_id);
```

## Partitioning key selection

Typically, a composite key is used when the unique identifier for a row is
naturally formed from more than one column. Consider, for example, a university
course catalog that spans multiple years. Every course has an ID, like CS101,
but that ID is reused every year. The primary key, then, must include both the
course ID and the term it was offered. In this case, you might very reasonably
choose to form a composite key out of three columns:

```sql
CREATE TABLE courses (
    course_id STRING,
    term STRING,
    year INT,
    ...
    PRIMARY KEY (course_id, term, year)
);
```

This is an accepted practice in schema design. You might even choose a
different ordering for your primary key based on expected query patterns; e.g.,
(`year`, `term`, `course_id`) might be a better order if the application
frequently queries for the courses in a given year.

Partitioning stretches this practice by constructing a composite key out of
more columns than are needed to uniquely identify a row. Notice that in the
`courses` table, no proper subset of the primary key columns is sufficient to
uniquely identify a row. In `roachmart.users`, however, `email` alone is
sufficient to identify a row; the `country` column is superfluous and exists
only to facilitate partitioning. The biggest consequence of this oddity is the
potential for user confusion; overspecified keys cause no known technical
problems.

In fact, the guidelines for constructing a partitioning key under this scheme
are quite straightforward: include everything you wish to partition by in the
key, in the order you wish to nest your subpartitions, and follow it with
everything that would have otherwise been in the key.

See the [Partitioning and index columns](#partitioning-and-index-columns)
section for a discussion of why this approach was chosen.
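Applied to the course catalog (a sketch; the partition names and year values
are illustrative): to partition by year, `year` moves to the front of the key,
followed by the columns that would otherwise have formed the key on their own:

```sql
-- Sketch of the guideline: the partitioning column (year) leads the
-- key; the rest of the natural key follows.
CREATE TABLE courses (
    course_id STRING,
    term STRING,
    year INT,
    ...
    PRIMARY KEY (year, term, course_id)
) PARTITION BY LIST (year) (
    PARTITION until_2016 VALUES IN (2015, 2016),
    PARTITION current VALUES IN (2017)
);
```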
## Summary: initial partitioning and repartitioning

- Partitions can be added to a table when it is created, or at any time
  afterward. (However, since the primary key of a table cannot be changed,
  forethought is still required.)

- Partitions are defined over one or more columns. These columns must be a
  prefix of the primary key (or secondary index).

- Indexes can also be partitioned, but are not required to be.

- Each partition is required to have a name that is unique among all partitions
  on that table or index.

- Repartitioning is relatively cheap; data is not rewritten. The table metadata
  is updated and the allocator is left to move the data as necessary. A
  partition with the same name before and after a repartitioning is considered
  the same partition. During a repartitioning, any partition zone config entries
  with no corresponding name in the new partitions are removed.

- Future work to expose zone configs through SQL will likely allow for updating
  partitions and the corresponding zone configs together. In the meantime, users
  can create an empty partition, apply a zone config, then repartition.

- After a table is partitioned, its partitions can be targeted by zone
  constraints via the existing CLI, e.g. `./cockroach zone set
  database.table --partition=partition -f zone.yml`.

- As before, CockroachDB uses the most granular zone config available. Zone
  configs that target a partition are considered more granular than those that
  target a table, which in turn are considered more granular than those that
  target a database. Configs do not inherit, but unspecified fields when a zone
  override is first created are copied from the parent, as is currently the
  case.


## Usage: locality–resilience tradeoff

There exists a tradeoff between making reads/writes fast and surviving failures.
Consider a partition with three replicas of `roachmart.users` for Australian
users. If only one replica is pinned to an Australian datacenter, then reads may
be fast (via [leases follow the sun]) but writes will be slow. If two replicas
are pinned to an Australian datacenter, then reads and writes will be fast (as
long as the cross-ocean link has enough bandwidth that the third replica doesn’t
fall behind and hit the quota pool). If those two replicas are in the same
datacenter, then loss of one datacenter can lead to data unavailability, so some
deployments may want two separate Australian datacenters. If all three replicas
are in Australian datacenters, then three Australian datacenters are needed to
be resilient to a datacenter loss.


# Reference-level explanation

## Key encoding

Different key encodings were evaluated at length in a prior version of this RFC.
The approach recommended, entitled [Index Key Prefix], is described here without
discussion of the tradeoffs involved in the decision; refer to the original RFC
for that discussion.

The Index Key Prefix approach simply allows part of the existing row key to be
used as the partition key. For a partitioned table, the partition key can be any
prefix of the table primary key. Similarly, for a partitioned index, the
partition key can be any prefix of the indexed columns. Since a row’s key is
encoded as the table ID, followed by the index ID, followed by each of the
indexed columns, in order, the partition key appears immediately after the table
and index IDs, and thus rows with the same partition key will be adjacent.

To further illustrate this, consider the key encodings for the earlier
geographic partitioning example.

Primary index for `roachmart.users`:

```
/TableID/IndexID/<country>/<email> -> [...]
```

Global non-unique secondary index for `roachblog.articles.author_id`:

```
/TableID/IndexID/<author_id>/<article_id> -> null
```

Partitioned non-unique secondary index for `roachmart.orders.part_id`:

```
/TableID/IndexID/<user_country>/<part_id>/<order_id> -> null
```

Note that values need not be physically adjacent to belong to the same logical
partition. For example, `'US'`, `'CA'` and `'MX'` are interspersed with other
countries, but all map to the same `north_america` partition in
`roachmart.users`. This comes at a cost, as the `north_america` partition now
requires a minimum of three separate ranges instead of just one. See [Range
splits] below for further discussion.
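To make the interleaving concrete, here is a sketch of the `roachmart.users`
primary index key space in sorted order (emails elided, and assuming at least
one user per listed country; the partition each key falls in is shown in
parentheses):

```
/TableID/IndexID/'AU'/...  (australia)
/TableID/IndexID/'CA'/...  (north_america)
/TableID/IndexID/'FJ'/...  (default)
/TableID/IndexID/'MX'/...  (north_america)
/TableID/IndexID/'NZ'/...  (australia)
/TableID/IndexID/'US'/...  (north_america)
```

Zone configs apply to contiguous key spans, so `north_america` alone needs at
least three ranges here: `'CA'`, `'MX'`, and `'US'` are each separated by keys
belonging to other partitions.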
## SQL syntax

The `PARTITION BY` clause is usable in `CREATE TABLE`, `ALTER TABLE`, `CREATE
INDEX`, `ALTER INDEX` and variants. See [Example: geographic partitioning] for
the `PARTITION BY LIST` syntax and [Example: date partitioning] for `PARTITION
BY RANGE`.

A somewhat formal version of the syntax is presented below:

```
CREATE TABLE <table-name> ( <elements...> ) [<interleave>] [<partition-scheme>]
CREATE INDEX [<index-name>] ON <table-name> ( <col-names>... ) [<interleave>] [<partition-scheme>]
ALTER TABLE <table-name> <partition-scheme>
ALTER INDEX <index-name> <partition-scheme>

Partition scheme:
    PARTITION BY LIST ( <col-names>... ) ( <list-partition> [ , ... ] )
    PARTITION BY RANGE ( <col-names>... ) ( <range-partition> [ , ... ] )

List partition:
    PARTITION <partition-name> VALUES IN ( <list-expr>... ) [<partition-scheme>]

List expression:
    DEFAULT
    <const-expr>

Range partition:
    PARTITION <partition-name> VALUES FROM (<range-expr>) TO (<range-expr>)

Range expression:
    MINVALUE
    MAXVALUE
    <const-expr>
```

To reduce confusion, expressions in a `VALUES` or `VALUES IN` clause must be
constant, as they are only computed once, when the `CREATE` or `ALTER` statement
is executed. Allowing non-constant expressions, like `now()` or `SELECT
country_name FROM countries`, would suggest that the partition updates whenever
the expression changes in value.

Note that the lower bound of a range partition (`FROM`) is inclusive, while the
upper bound (`TO`) is exclusive. Note also that a `NULL` value in a
range-partitioned column sorts into the first range, which is consistent with
our key encoding ordering and `ORDER BY` behavior.

### SELECT FROM PARTITION

To allow reads to target only selected partitions, we propose to extend table
names (not arbitrary table expressions) with a `PARTITION` clause. For example:

```sql
SELECT * FROM roachmart.users PARTITION (australia, north_america);

SELECT *
FROM
    roachblog.articles PARTITION (recent) AS a
JOIN
    roachblog.user_views PARTITION (north_america) AS b
ON a.id = b.article_id;
```

The implementation is purely syntactic sugar and simply transforms each
`PARTITION` clause into an additional constraint. The join query above, for
example, would be rewritten to include `WHERE user_views.country IN ('CA', 'MX',
'US') AND articles.published >= 'recent cutoff'`.

This is only sugar, but it's important sugar. For example, in geopartitioning,
it is expected that there will be many more countries than partitions, so it is
much easier to specify partition restrictions with this syntax than with the
desugared `WHERE` clause.

Note that a list partitioning without a `DEFAULT` partition is effectively an
enum. The user can use this syntax with all partitions specified to force the
planner to turn a query like `SELECT * FROM roachmart.users WHERE email = '...'`
(with no country specified) from a full table scan into a point lookup per
`country` in the partitioning.
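Concretely, the first query above desugars to an ordinary constraint over the
current partitioning (a sketch; the country list tracks whatever the
partitioning specifies at the time):

```sql
-- Desugared form of:
--   SELECT * FROM roachmart.users PARTITION (australia, north_america)
SELECT * FROM roachmart.users
WHERE country IN ('AU', 'NZ', 'CA', 'MX', 'US');
```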
### Subpartitioning

Subpartitioning allows partitioning along several axes simultaneously. The
`PARTITION BY` syntax presented above is recursive so that list partitions can
themselves be partitioned any number of times, using either list or range
partitioning. Note that the subpartition columns must be a prefix of the columns
in the primary key that have not been consumed by parent partitions.

Suppose RoachMart wanted to age out users who haven’t logged in in several
months to slower hardware, while continuing to partition by country for improved
latency. Subpartitioning would neatly solve their use case:

```sql
CREATE TABLE roachmart.users (
    id SERIAL,
    country STRING,
    last_seen TIMESTAMP,
    ...,
    PRIMARY KEY (country, last_seen, id)
) PARTITION BY LIST (country) (
    PARTITION australia VALUES IN ('AU', 'NZ') PARTITION BY RANGE (last_seen) (
        PARTITION australia_archived VALUES FROM (MINVALUE) TO ('2017-06-04'),
        PARTITION australia_recent VALUES FROM ('2017-06-04') TO (MAXVALUE)
    ),
    PARTITION north_america VALUES IN ('CA', 'MX', 'US') PARTITION BY RANGE (last_seen) (
        PARTITION north_america_archived VALUES FROM (MINVALUE) TO ('2017-06-04'),
        PARTITION north_america_recent VALUES FROM ('2017-06-04') TO (MAXVALUE)
    ),
    ...
);
```

Subpartition names must be unique within a table, as each table's partitions and
subpartitions share a namespace.

Other databases also provide less flexible but more convenient syntaxes, like a
`SUBPARTITION TEMPLATE` that prevents repetition of identical subpartition
schemes. Above, a `SUBPARTITION TEMPLATE` could be used to describe the
`last_seen` partitioning scheme exactly once, instead of once for each `country`
partition. We propose to implement only the more general syntax and defer design
of a more convenient syntax until demand exists.
## IndexDescriptor changes

Partitioning information is added to `IndexDescriptor` as below. All tuples are
encoded as `EncDatums` using the value encoding.

```protobuf
message IndexDescriptor {
    ...
    optional PartitioningDescriptor partitioning = 13;
}

message PartitioningDescriptor {
    message List {
        optional string name = 1;
        repeated bytes values = 2;
        optional PartitioningDescriptor subpartition = 3;
    }

    message Range {
        optional string name = 1;
        optional bytes inclusive_lower_bound = 2;
        optional bytes exclusive_upper_bound = 3;
    }

    optional uint32 num_columns = 1;
    repeated List list = 2;
    repeated Range range = 3;
}
```

## Zone config changes

Zone configs are currently stored in the `system.zones` table, which maps
database and table IDs to `ZoneConfig` protobufs. We propose to adjust this
`ZoneConfig` protobuf to include configuration for "subzones," where a subzone
represents either an entire index or a partition of an index. Subzones are not
applicable when the zone does not represent a table.

Ideally, we'd modify or replace `system.zones` with a table keyed by
`(descriptor_id, index_id, partition_name)`, but a proposal to this effect [was
presented and rejected in a previous version of this RFC][system-subzones]. In
short, replacing `system.zones` requires a complex, backwards-incompatible
migration that is better left to its own RFC.

Our interim solution adds two fields to the `ZoneConfig` proto:

```protobuf
message ZoneConfig {
    ...
    repeated SubzoneSpan subzone_spans = 8;
    repeated SubzoneConfig subzone_configs = 9;
}

message SubzoneSpan {
    optional roachpb.Span span = 1;
    optional uint32 subzone_index = 2;
}

message SubzoneConfig {
    optional uint32 index_id = 1;
    optional string partition_name = 2;
    optional ZoneConfig config = 3;
}
```

A subzone's config is stored in the `subzone_configs` field as a `SubzoneConfig`
message, which bundles a `ZoneConfig` with identification of the index or
partition that the subzone represents. Every `SubzoneConfig` must specify an
`index_id`, but subzones that apply to an entire index omit the
`partition_name` field.

A mapping from key span to `SubzoneConfig` is stored in the `subzone_spans`
field to allow efficient lookups of the subzone override for a given key.
Entries in `subzone_spans` are non-overlapping and sorted by start key to allow
for binary search. If an entry's span contains the lookup key, its
`subzone_index` field is the index of the corresponding `SubzoneConfig` in the
`subzone_configs` field; if no span contains the lookup key, the table config
(i.e., the outer `ZoneConfig`) applies.

We could alternatively derive `subzone_spans` from the corresponding
`TableDescriptor` on every call to `GetZoneConfigForKey`, but this would involve
quite a bit of encoding/decoding on a hot code path, as the table descriptor
stores the partition split points using the value encoding. Updating zone
configs or partitioning, by contrast, happens infrequently; we'll simply
recompute `subzone_spans` whenever an index or partition zone config is updated,
or when a table's partitioning scheme is updated.

One case requires special care to handle. Suppose the `archived` partition of
`roachblog.articles` has a custom zone config, but the `roachblog.articles`
table itself does not have a custom zone config. The `system.zones` table will
necessarily have an entry for `roachblog.articles` to store the custom zone
config for the `archived` partition, but that entry will have an otherwise empty
`ZoneConfig`. This must not be taken to mean that the other partitions of the
table should use this empty zone config, but rather that the default zone config
applies. We propose to use the `num_replicas` field for this purpose:
`num_replicas = 0` is invalid and therefore indicates that the table in
question does not have an active `ZoneConfig` of its own.

As mentioned in the examples, the zone config CLI will be adjusted to accept an
index specifier and a `--partition` flag:

```bash
# Before:
$ ./cockroach zone {set|get|rm} DATABASE[.TABLE] -f zone.yml

# After:
$ ./cockroach zone {set|get|rm} DATABASE[.TABLE[@INDEX]] [--partition=PARTITION] -f zone.yml
```

Omitting the index but specifying the `--partition` flag (e.g., `./cockroach
zone set db.tbl --partition=p0`) implies a partition of the primary index. Note,
however, that `./cockroach zone set db.tbl` and `./cockroach zone set
db.tbl@primary` are *not* equivalent: the former specifies a table zone config
that applies to any secondary indices, unless more specific overrides exist,
while the latter specifies a primary index zone config that never applies to the
table's secondary indices.
One downside of this scheme is that the zone configs for all partitions of a
table's indices are stored in one row, which puts an effective limit on the
number of partitions allowed on a table. Additionally, `system.zones` is part of
the unsplittable, gossiped system config span, so all zone configs across all
tables must add up to less than 64MB. Some back-of-the-envelope math suggests 60
bytes per range partition, and 60 bytes plus the size of the values for a list
partition.

For `roachmart.users` with every country allocated between 7 partitions, this
works out to 1167B. That implies an absolute max of 64MB / 60B = ~1,000,000
partitioned tables, or 64MB / 1167B = ~54,000 tables if they were all
partitioned by country. The recommended maximum number of partitions per table
in other partitioning implementations ranges from 100 to 1024, so this seems
reasonable.

Similar to the way that the zone config for a table, if present, completely
overrides (with no inheritance) the zone config for a database, a zone config
for a partition overrides the zone config for the table, database, or cluster.
The ergonomics of this will likely be sub-optimal; the user will need to
maintain the denormalization of what is naturally an inheritance hierarchy of
configuration. There is a larger upcoming effort to refactor zone configs which
will address these issues, so the following are out of scope for this RFC:

- A SQL interface for reading and writing zone configs

- Auditing of zone config changes

- Inheritance

- Moving zone configs out of gossip

- Raising global and per-table limitations on the number of partitions

- Direct replica-level control of data placement

- Partition configs outside of the database+table scoping. This could be useful
  for allowing partitioning in a shared CockroachDB offering with static, preset
  zone configs.

- The ability to define a partition and specify a zone config for it at the same
  time.

## Range splits and schema changes

The CockroachDB unit of replication and rebalancing is the range, so for zone
configs to target a partition, the partition needs to be on its own range (and
potentially more than one range for list partitioning; see [Range splits]).

Currently, CockroachDB asynchronously splits each newly created table into its
own range. This happens regardless of whether any zone configs are added to
target that table specifically. Each partition could similarly be asynchronously
split after being defined.

This may create extra ranges if a table is partitioned, repartitioned, and only
then given zone configs (though this problem goes away when we support range
merges). We address this by lazily splitting a partition only when a zone config
is added for it. This deviates a bit from the table behavior and may violate a
user’s expectation of partitions living on separate ranges, but the tradeoff was
deemed worth it.

A table’s partitions can have a large influence on SQL planning, so any changes
should be broadcast immediately. Further, our schema changes and table leasing
require that at most two versions are in use at a time. So it’s a natural fit
for partitioning changes to run as schema changes. This schema change will be
responsible for creating the splits and can be hooked into `system.jobs` to
provide introspection of progress as well as cancellation.
## Query planning changes

Because of the decision to require that partitions be defined over normal,
materialized columns, there are no correctness changes needed to SQL query
planning. Performance, however, needs some work.


### Case study

Consider the following queries issued by RoachMart to the `roachmart.users`
table. (Ignore the poor security practices; this is meant to be illustrative.)

When a user visits RoachMart and needs to log in:

```sql
SELECT id FROM roachmart.users WHERE email = $1
```

Whenever a new page is loaded on the web or a new screen is loaded in the native
app, the `id` is extracted from a cookie or passed via the RoachMart API and
used in the stateless servers to rehydrate the user information:

```sql
SELECT * FROM roachmart.users WHERE id = 12345
```

Regardless of whether the `roachmart.users` table is partitioned, the first
query is kept speedy by an unpartitioned secondary index on `email`. This may
require a cross-ocean hop to another datacenter, but login is an infrequent
operation and so this is okay. (If data sovereignty of emails is a concern, then
a global index is not appropriate and either a global index on `hash(email)` or
a partitioned index must be used. The details are left for a later sovereignty
RFC.)

The second query is much harder. In an unpartitioned table, the primary key
would be on `(id)` alone and so this is a point lookup. But RoachMart is a
global company and wants to keep a user’s data in the datacenter nearest them,
so they’ve partitioned `roachmart.users` on `(country, id)`. This means the
query as written above will require a full table scan, which is obviously
unacceptable.

The best solution to this, and the one we will always recommend first, is for
RoachMart to also specify the user’s `country` in the second query. The login
query will be changed to also return the `country`, this will be saved alongside
`id` in the web cookie or native app’s local data, and passed back whenever
retrieving the user. It’s not even as onerous as it first seems, especially if
the RoachMart API returns the user’s `id` as a single string containing both
pieces: `US|123`.

In some cases, this will not be possible or will not be desirable, so
CockroachDB has a number of pieces that can be combined to deal with this. N.B.:
None of these really solve the problem in a satisfactory way, so as mentioned
above we will very strongly urge the user to specify the entire primary key.

1. The developer could create a global index on just `id`, but in the common
   case this requires a cross-datacenter request for the index lookup.

2. If `LIMIT 1` is added to the query and no sort is requested, the planner is
   free to return the first result it finds and cancel the rest of the work. If
   a uniqueness constraint exists on `id`, the `LIMIT 1` can be assumed. The
   latter is an optimization that may be helpful in general.

3. A list partitioning without a `DEFAULT` partition is effectively an enum. The
   user can use the [SELECT FROM PARTITION] syntax with all partitions
   specified. Internally, the planner will turn this into an `AND country IN
   ('CA', 'MX', 'US', …)` clause in the `WHERE`, which turns the query from a
   full table scan into a point lookup per `country` in the partitioning. If
   `DEFAULT` is present, the non-default cases could optimistically be checked
   first. If some future version of CockroachDB supports enum types, the user
   would get this behavior even without using the `SELECT FROM PARTITION`
   syntax.

4. This is not a full table scan in other partitioning implementations because
   they don’t require the `(country, id)` primary key, instead indexing `(id)`
   as normal inside each partition. A query on `id` without the partition
   information then becomes a point lookup per partition. This can be simulated
   in CockroachDB by introducing and partitioning on a derived `partition_id`
   column that is 1:1 with partitions (see the sketch after this list). This is
   sufficient justification to prioritize building [computed columns].

   Concretely, the `roachmart.users` table above could have `PRIMARY KEY
   (continent, country, id)` and `PARTITION BY LIST (continent)` to start, so
   there is only one key value per partition. Later, when/if it is needed, it
   could change to `PARTITION BY LIST (continent, country)`.

5. Combining options 3 and 4 could allow a user to issue one query that
   optimistically tries a point lookup in the local partition before trying a
   point lookup in all partitions.
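A sketch of the `partition_id` simulation from option 4, assuming a
computed-column syntax along the lines of the [computed columns] proposal (the
`CASE` mapping and all names here are illustrative):

```sql
CREATE TABLE roachmart.users (
    id SERIAL,
    country STRING,
    -- Derived column that is 1:1 with partitions; hypothetical
    -- computed-column syntax.
    partition_id INT AS (
        CASE WHEN country IN ('AU', 'NZ') THEN 1
             WHEN country IN ('CA', 'MX', 'US') THEN 2
             ELSE 0
        END
    ) STORED,
    ...,
    PRIMARY KEY (partition_id, id)
) PARTITION BY LIST (partition_id) (
    PARTITION australia VALUES IN (1),
    PARTITION north_america VALUES IN (2),
    PARTITION default VALUES IN (DEFAULT)
);
```

A query on `id` alone then needs only one point lookup per `partition_id` value
rather than a full table scan, at the cost that repartitioning now rewrites rows
instead of just updating range metadata (see [Range splits]).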
## Other SQL changes

Other implementations surface partitioning information via
`information_schema.partitions`, so we should as well. `SHOW CREATE TABLE` will
also need to learn to display `PARTITION BY` clauses.


## Interleaved tables and partitioning

[Interleaved tables] and partitioning are designed to work together. Geographic
partitioning is used to locate `roachmart.users` records in the nearest
datacenter, and interleaved tables are used to locate the data associated with
that user (orders, etc.) near it. A geographically partitioned user can be moved
with one `UPDATE`, and the user’s orders (plus the rest of the interleave
hierarchy) can be moved with `ON UPDATE CASCADE`.
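For example (a sketch with hypothetical data, and assuming the foreign key in
`roachmart.orders` was declared with `ON UPDATE CASCADE`), relocating a user
and their entire interleave hierarchy is a single statement:

```sql
-- Moving this user from the north_america partition to scandinavia;
-- the change cascades through the interleaved orders table.
UPDATE roachmart.users
SET country = 'SE'
WHERE country = 'US' AND email = 'sven@example.com';
```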
# Drawbacks

Like index selection, column families, interleaved tables, and other
optimization tools, partitioning will require some knowledge of the internals of
the system to use effectively. See, for example, the [locality–resilience
tradeoff] described above.


## Partitioning and index columns

Since a table can only be partitioned by a prefix of its primary key, a table
destined for partitioning often has an “unnatural” primary key. For example, to
support partitioning the `roachmart.users` table by `country`, the table’s
natural primary key, `id`, must be explicitly prefixed with the partition column
to create a composite primary key of `(country, id)`. The composite primary key
has two notable drawbacks: it does not enforce that `id` is globally unique, and
it does not provide fast lookups on `id`. If uniqueness or fast lookups are
required, the user must explicitly create a unique, unpartitioned secondary
index on `id`.

We could automatically add this secondary index to preserve uniqueness and fast
lookups, but this would violate user expectations. First, a secondary index
might itself store sensitive information. We want to ensure that operators are
aware of the secondary index so they can specify an appropriate zone config or
even create the index on `hash(id)` (though hashes may not provide sufficient
masking of sensitive information, depending on the entropy of the input and the
type of hash). Second, every secondary index increases the cost of writes
(specifically for inserts and for updates that change an indexed or stored
column). For example, an unpartitioned, unique index on `roachmart.users` would
require cross-ocean hops for writes that would otherwise hit just one
continent.*

Similarly, we could silently prefix the specified primary key (i.e., the natural
primary key) with the partition columns, but this too would violate user
expectations. Most notably, queries that specify all columns in an index, as
specified at creation or returned by reflection like `information_schema`, are
expected to be point lookups.

Instead, we can aid the user with detailed guidance in the error messages
generated by invalid partitioning schemes.

*In general, ensuring global uniqueness requires cross-datacenter hops on
writes. In limited cases, like `SERIAL` columns, users can achieve both
uniqueness and fast writes without a unique secondary index by assuming
`unique_rowid()` collisions are sufficiently improbable. The risk with such a
scheme, of course, is that someone can manually insert a colliding value.


## Range splits

The unit of replication in CockroachDB is the range, so partitioning necessarily
requires splitting at least one range for each partition. In the worst case,
when partitioning by list, partitions containing non-adjacent values will
generate an extra range for *each* non-adjacent value.

For example, consider the original partitioning specification for
`roachmart.users`, in which `'CA'` and `'MX'` belong to the same partition,
`north_america`, but are bisected by a value in another partition (e.g. `'FJ'`,
which falls in the `default` partition). This forces each country onto its own
range. As more countries are added, the effect is amplified.

This is unfortunate but should be fine in larger tables. It can be worked around
by introducing and partitioning on a `partition_id` column, which is derived
from `country`. (Computed columns are a natural choice for this.) The tradeoff
here is that repartitioning will necessitate rewriting rows instead of just
updating range metadata.

Note that repartitioning could make this worse, especially until we support
range merges. As discussed in [Range splits and schema changes], list
partitioning may create more ranges than expected when partition values are not
adjacent. Additionally, repartitioning may result in small or empty ranges that
cannot be cleaned up until we support range merges.


# Future work

## Admin UI

Exposing partitioning information and zone configs in the admin UI is out of
scope for this document. See [#14113], which tracks providing broad insight into
zone constraint violations.


## Bulk load

Other `PARTITION BY` implementations can be used to quickly bulk load data into
a table (or bulk remove it), and this appears to be a popular use. We currently
allow only bulk load of an entire table at once, so this may be useful for us to
consider as well, but it’s out of scope for this document.
# Alternatives

The separation between partition specification, which happens via SQL, and zone
configuration, which happens via the CLI, is unfortunate and largely historical
happenstance. A table’s partitioning is rarely updated independently of its zone
config, so if and when we move zone configuration to SQL, we should consider
tightly coupling its interface to table definitions/partitioning.


# Appendix: Other PARTITIONING syntaxes

There is unfortunately no SQL standard for partitioning. As a result, separate
partitioning syntaxes have emerged:

- MySQL and Oracle specify partitions inline in `CREATE TABLE`, like we do
  above.

- Microsoft SQL Server requires four steps: allocating physical storage called
  “filegroups” for each partition with `ALTER DATABASE… ADD FILE`, followed by
  `CREATE PARTITION FUNCTION` to define the partition split points, followed by
  `CREATE PARTITION SCHEME` to tie the partition function output to the created
  filegroups, followed by `CREATE TABLE... ON partition_scheme` to tie the
  table to the partitioning scheme.

- PostgreSQL 10 takes a hybrid approach: the partition columns and scheme (i.e.,
  `RANGE` or `LIST`) are specified in the `CREATE TABLE` statement, but the
  partition split points are specified by running `CREATE TABLE… PARTITION OF…
  FOR VALUES` once for each partition.

We normally reuse PostgreSQL syntax for compatibility reasons, but we’ve
deliberately rejected it here. Partitioning was not a first-class feature before
PostgreSQL 10, which was only released on 2017-10-05, so we shouldn’t have
compatibility issues with existing ORMs and applications. The syntax introduced
in PostgreSQL 10, shown below, treats a partitioned table roughly as a
collection of tables, each of which has its own indexes, constraints, etc. It
does not allow for a global index across all partitions of a table, and the
global table namespace is polluted with each partition.

We’ve instead chosen to closely follow the MySQL and Oracle syntax, since it
fits with our model of partitions as mostly-invisible subdivisions of a table.

For reference, we’ve replicated the `roachblog.articles` example in each of the
three syntaxes.

## MySQL and Oracle

```sql
CREATE TABLE articles (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    published DATE,
    ...
) PARTITION BY RANGE (published) (
    PARTITION archived VALUES LESS THAN ('2017-01-25'),
    PARTITION recent VALUES LESS THAN MAXVALUE
);
```

## PostgreSQL

```sql
CREATE TABLE articles (
    id SERIAL,
    published DATE,
    ...
) PARTITION BY RANGE (published);
CREATE TABLE articles_archived PARTITION OF articles
    FOR VALUES FROM (MINVALUE) TO ('2017-01-25');
CREATE TABLE articles_recent PARTITION OF articles
    FOR VALUES FROM ('2017-01-25') TO (MAXVALUE);
```
## Microsoft SQL Server

```sql
CREATE PARTITION FUNCTION fun (date) AS RANGE LEFT FOR VALUES ('2017-01-25');
CREATE PARTITION SCHEME sch AS PARTITION fun TO (filegroups...);
CREATE TABLE articles (id INT PRIMARY KEY, published DATE) ON sch (published);
```

[#14113]: https://github.com/cockroachdb/cockroach/issues/14113
[#18683]: https://github.com/cockroachdb/cockroach/pull/18683
[#19141]: https://github.com/cockroachdb/cockroach/issues/19141
[computed columns]: https://github.com/cockroachdb/cockroach/pull/20735
[example: date partitioning]: #example-date-partitioning
[example: geographic partitioning]: #example-geographic-partitioning
[index key prefix]: https://github.com/cockroachdb/cockroach/blob/1f3c72f17546f944490e0a4dcd928fd96a375987/docs/RFCS/sql_partitioning.md#key-encoding
[interleaved tables]: https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html
[leases follow the sun]: https://github.com/cockroachdb/cockroach/blob/763d21e6fad69728a523d3cdd8b449c8513094b7/docs/RFCS/20170125_leaseholder_locality.md
[locality–resilience tradeoff]: #usage-localityresilience-tradeoff
[partitioning and index columns]: #partitioning-and-index-columns
[partitioning key selection]: #partitioning-key-selection
[range splits and schema changes]: #range-splits-and-schema-changes
[range splits]: #range-splits
[select from partition]: #select-from-partition
[system-subzones]: https://github.com/cockroachdb/cockroach/blob/1f3c72f17546f944490e0a4dcd928fd96a375987/docs/RFCS/sql_partitioning.md#table-subzones
[zone config]: https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html