- Feature Name: SQL Table Partitioning
- Status: in-progress
- Start Date: 2017-01-25
- Authors: Nikhil Benesch, Daniel Harrison, David Taylor
- RFC PR: [#18683]


# Summary

Table partitioning provides row-level control over how and where data is stored.

Specifically, `CREATE TABLE`, `ALTER TABLE`, and their `INDEX` counterparts
learn to accept a partition specification, which maps ranges of the index to
named partitions. Zone configs learn to target these partitions so that locality
constraints and replication settings can vary per partition instead of per table
or database.


# Motivation

The magic of CockroachDB is its ability to stitch together physically-disjoint
ranges of data into one logical table, no matter whether the servers are
separated by several racks or several continents.

If the cluster operator has any of the following requirements, however,
CockroachDB’s default range allocation policy might be a poor fit:

- **Latency guarantees.** The operator must ensure that accessing particular
  data from a particular part of the world will be fast. Since the default
  allocation policy optimizes average latency across all ranges, it might
  violate the latency requirements on a particular range in the service of
  balancing the cluster.

- **Data sovereignty.** The operator is subject to regulations that require
  certain data to live on servers in certain geographical regions. The default
  allocation policy will blindly move data from overfull regions to underfull
  regions, potentially violating these regulations.

- **Cost reduction.** The operator needs to store infrequently-accessed data on
  slower hardware (e.g. spinning disks) to save money. The default allocation
  policy will happily shuffle data between nodes with SSDs and nodes with HDDs.

Zone configs can help with the above, provided the constraint applies to an
entire table. To meet latency guarantees, a zone config can constrain an entire
table to a particular data center. To reduce costs, a zone config can constrain
an entire infrequently-accessed table to nodes with slower hardware. Data
sovereignty, however, requires further work.

Often, however, the requirement applies to only a subset of the table. A `users`
table, for example, should store its American users in North America, its
European users in Europe, etc. Maybe an `articles` table should store the most
recent few months of posts on SSDs, but everything else on HDDs. Creating
individual tables (`users_america`, `users_europe`, `articles_recent`,
`articles_archive`, etc.) would allow today’s zone configs to solve the problem,
but would bring back some of the pain of manual sharding that CockroachDB was
designed to avoid. Clients would be responsible for reading data from and
writing data to the right shard and stitching data from multiple shards together
if necessary.

By allowing zone configs to target subsets (i.e., “partitions”) of a table, we
allow operators to meet their latency, regulatory, and cost requirements while
preserving the magic of CockroachDB: that is, a logical table that hides the
complex physical sharding underneath.
# A note on scope

To ensure the design proposed within can be implemented in a reasonable
timeframe, the following items are considered out of scope:

- **Improving the zone config interface.** The YAML/CLI interface for specifying
  zone configs is clunky and will get more confusing when partitions are added,
  but improvements are better left to a comprehensive redesign (which is already
  planned for row-level TTLs, auditing, etc.). Instead, this RFC proposes the
  minimum changes necessary to the existing interface to support partitioning.

- **Data sovereignty.** The design within is necessary but not
  sufficient to meet data sovereignty regulations. In particular, the proposed
  scheme provides little protection against misconfiguration and little
  visibility into, at a given moment, what data is actually stored where.
  Additionally, we'll need a resolution to protected information leaking via
  keys, which make their way into various system ranges (e.g. meta2). The rest
  of the data sovereignty work will be developed in a followup RFC.

- **Engineering work schedule.** Some of this work will make 2.0 and some won’t,
  but determining milestones is intentionally omitted from this RFC.

- **Altering the primary key columns of a table.** The design below relies on
  careful primary key selection, which means tables created without partitioning
  in mind may not have a suitable schema. For now, a workaround is to export and
  reimport the data with a new schema. Eventually, we need to support primary
  key schema changes (see [#19141]).


# Guide-level explanation

## Example: geographic partitioning

Consider a globally-distributed online storefront, RoachMart, that sells and
ships live cockroaches. RoachMart processes several hundred requests per second,
each of which might generate several queries to RoachMart’s nine-node
CockroachDB cluster. (Live cockroaches are very popular these days.) To keep
these requests speedy, RoachMart’s operators want to automatically locate each
user’s data in datacenters near that user.

Today, RoachMart has two data centers: one in North America and one in
Australia. The company expects to open data centers in Europe, Africa, and Asia
soon, so they want to ensure they can migrate users to the new data centers once
they open.

In CockroachDB, partitions are cheap and easy to redefine at will, with one
caveat: partitions must be defined over columns that are a prefix of the primary
key. A `users` table with just a string primary key, like

```sql
CREATE TABLE users (
    email STRING PRIMARY KEY,
    continent STRING,
    ...
);
```

cannot be partitioned by `continent` because `continent` is not part of the
primary key. Instead, the table needs to be specified with a composite primary
key

```sql
CREATE TABLE users (
    email STRING,
    continent STRING,
    ...
    PRIMARY KEY (continent, email)
);
```

where the partition column `continent` appears first in the key.

Note that this usage of composite primary keys is somewhat unusual. See the
[Partitioning key selection] section below for a discussion.

Since it’s not currently possible to change a table’s primary key, the partition
columns included in the primary key when the table is created must be granular
enough to support any partitioning scheme that might be desired in the future.
Even if it were possible, changing the primary key would require rewriting *all*
the data in the table and any tables interleaved beneath, a potentially
intractable operation. See [Partitioning and index columns] for a more detailed
discussion of this drawback.
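One consequence worth noting up front: with `PRIMARY KEY (continent, email)`,
nothing enforces that `email` alone remains globally unique. As a minimal
sketch (the index name is illustrative), uniqueness can be restored with an
unpartitioned unique secondary index; see [Partitioning and index columns] for
the write-latency tradeoffs this introduces:

```sql
-- Sketch: restore the global uniqueness of email, which the composite
-- primary key (continent, email) no longer enforces on its own. This
-- index is unpartitioned, so writes to it may incur cross-datacenter
-- hops; see the Drawbacks section.
CREATE UNIQUE INDEX users_email_key ON users (email);
```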
RoachMart is worried that, someday, they’ll have more than one data center per
continent, so they decide to use `country` instead of `continent` as their
partition column. Using the `PARTITION BY LIST` syntax, they can group
individual countries into larger partitions, one for each of their data centers:

```sql
CREATE TABLE roachmart.users (
    email STRING,
    country STRING,
    ...,
    PRIMARY KEY (country, email)
) PARTITION BY LIST (country) (
    PARTITION australia VALUES IN ('AU', 'NZ'),
    PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
    PARTITION default VALUES IN (DEFAULT)
);
```

As requirements shift (e.g., a new data center opens, an existing data center is
running low on capacity, etc.), a country can be seamlessly migrated to a new
partition with an `ALTER TABLE` command that repartitions the table:

```sql
ALTER TABLE roachmart.users PARTITION BY LIST (country) (
    PARTITION australia VALUES IN ('AU', 'NZ'),
    PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
    PARTITION scandinavia VALUES IN ('DK', 'NO', 'SE'),
    PARTITION default VALUES IN (DEFAULT)
);
```

Each partition is required to have a unique name that logically identifies it
across repartitions. Above, the `default` partition is considered equivalent to
the original `default` partition, though it is missing three countries that are
now in the new `scandinavia` partition.

On their own, partitions are inert. The SQL schema for `roachmart.users` does
nothing to actually locate the `australia` partition in Australia. Applying
functionality to a partition requires zone configs. Just as each database and
table can be targeted by a [zone config] that overrides the number and location
of its replicas, each partition can be targeted by a zone config.

So, RoachMart restarts each of its CockroachDB nodes with a `--locality` flag
that indicates its data center:

```
$ ./cockroach start --locality=datacenter=au1
$ ./cockroach start --locality=datacenter=us1
```

Then they can apply a zone config to each partition that restricts it to the
appropriate locality:

```
$ cat australia.zone.yml
constraints: [+datacenter=au1]
$ cat north_america.zone.yml
constraints: [+datacenter=us1]
$ ./cockroach zone set roachmart.users --partition=australia -f australia.zone.yml
$ ./cockroach zone set roachmart.users --partition=north_america -f north_america.zone.yml
```

The replicate queue on each node will notice the updated zone config and begin
rebalancing the cluster. Users with `country = 'NZ'` will live on a range that
is only replicated within the au1 data center and users with `country = 'US'`
will live on a range that is only replicated within the us1 data center.


### Additional colocated tables

Data in the `users` table is now stored in the correct place, but the typical
RoachMart request accesses more than just the `users` table. In particular,
RoachMart would like to colocate users’ orders. Interleaved tables make this
simple:

```sql
CREATE TABLE roachmart.orders (
    user_country STRING,
    user_email STRING,
    id INT,
    part_id INT,
    ...
    PRIMARY KEY (user_country, user_email, id),
    FOREIGN KEY (user_country, user_email) REFERENCES users
) INTERLEAVE IN PARENT users (user_country, user_email);
```

The key encoding of interleaved tables ensures that the zone config of any
top-level partitions applies to the data interleaved within.
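For example (a sketch with hypothetical data), a typical request for an
Australian user and their orders now touches only ranges pinned to the au1
data center:

```sql
-- Sketch with hypothetical data: the user row and the interleaved
-- order rows share the australia partition, so this join can be
-- served entirely by replicas in the au1 data center.
SELECT u.email, o.id, o.part_id
FROM roachmart.users AS u
JOIN roachmart.orders AS o
  ON o.user_country = u.country AND o.user_email = u.email
WHERE u.country = 'AU' AND u.email = 'mate@example.com';
```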
### Partitioned secondary indices

RoachMart’s warehouse team wants to know how many orders are processed for each
part they sell. They want a secondary index to make the query efficient. Since
the common operation is to look at one country’s orders at a time, they
partition the index on the same columns as the `users` table:

```sql
CREATE INDEX part_idx ON roachmart.orders (user_country, part_id)
    PARTITION BY LIST (user_country) (
        ...
    );
```

## Example: date partitioning

RoachBlog, a free weblog provider, is worried about the amount of data stored in
their `articles` table. The table is growing at the rate of approximately 100GB
per day. (Cockroach enthusiasts are prolific.)

An investigation of their query traffic has revealed that, save for a few
outliers, articles published more than 30 days ago receive virtually no traffic.
They’d like to move those articles to nodes with HDDs to save money on hardware.
Smartly—as if they had anticipated this eventuality—RoachBlog’s engineers
designed the table with a `TIMESTAMP` column leading the primary key:

```sql
CREATE TABLE roachblog.articles (
    id SERIAL,
    published TIMESTAMP,
    author_id INT,
    ...,
    PRIMARY KEY (published, id)
);
```

Listing out every `published` timestamp that should be considered “archived”
would be infeasible, so they use the range partitioning syntax instead:

```sql
ALTER TABLE roachblog.articles PARTITION BY RANGE (published) (
    PARTITION archived VALUES FROM (MINVALUE) TO ('2017-12-04'),
    PARTITION recent VALUES FROM ('2017-12-04') TO (MAXVALUE)
);
```

RoachBlog plans to rerun this `ALTER` statement every week with an updated
partition split point. Repartitioning was designed to be quite cheap to support
this very use case, and indeed no actual data needs to be rewritten.

N.B.: Each time this statement is run it will produce a new split point in the
table, which could leave a lot of small ranges (until we implement range
merges). Documentation will need to caution users about repartitioning too
often.
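For instance, the next weekly run might look like the following (a sketch; the
new cutoff date is illustrative). Because partition bounds must be constant
expressions (see the SQL syntax section below), the application computes the
30-day cutoff itself rather than embedding `now()` in the partition definition:

```sql
-- Hypothetical run one week later: the application has computed the
-- new 30-day cutoff ('2017-12-11') and substituted it as a constant.
ALTER TABLE roachblog.articles PARTITION BY RANGE (published) (
    PARTITION archived VALUES FROM (MINVALUE) TO ('2017-12-11'),
    PARTITION recent VALUES FROM ('2017-12-11') TO (MAXVALUE)
);
```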
As with RoachMart, RoachBlog now need only launch their nodes with appropriate
store attributes

```
$ ./cockroach start --store=path=/mnt/crdb,attrs=ssd
$ ./cockroach start --store=path=/mnt/crdb,attrs=hdd
```

and install the corresponding zone configs:

```
$ cat recent.zone.yml
constraints: [+ssd]
$ cat archived.zone.yml
constraints: [+hdd]
$ ./cockroach zone set roachblog.articles --partition=recent -f recent.zone.yml
$ ./cockroach zone set roachblog.articles --partition=archived -f archived.zone.yml
```


### Global secondary indices

RoachBlog has a page which lists all articles written by a given author. There
are no sovereignty issues with this, so they create a global index to keep this
page fast:

```sql
CREATE INDEX author_idx ON roachblog.articles (author_id);
```

## Partitioning key selection

Typically, a composite key is used when the unique identifier for a row is
naturally formed from more than one column. Consider, for example, a university
course catalog that spans multiple years. Every course has an ID, like CS101,
but that ID is reused every year. The primary key, then, must include both the
course ID and the term it was offered. In this case, you might very reasonably
choose to form a composite key out of three columns:

```sql
CREATE TABLE courses (
    course_id STRING,
    term STRING,
    year INT,
    ...
    PRIMARY KEY (course_id, term, year)
);
```

This is an accepted practice in schema design. You might even choose a
different ordering for your primary key based on expected query patterns; e.g.,
(`year`, `term`, `course_id`) might be a better order if the application
frequently queries for the courses in a given year.

Partitioning stretches this practice by constructing a composite key out of
more columns than are needed to uniquely identify a row. Notice that in the
`courses` table, no proper subset of the primary key columns is sufficient to
uniquely identify a row. In `roachmart.users`, however, `email` alone is
sufficient to identify a row; the `country` column is superfluous and exists
only to facilitate partitioning. The biggest consequence of this oddity is the
potential for user confusion; overspecified keys cause no known technical
problems.

In fact, the guidelines for constructing a partitioning key under this scheme
are quite straightforward: include everything you wish to partition by in the
key, in the order you wish to nest your subpartitions, and follow it with
everything that would have otherwise been in the key.

See the [Partitioning and index columns](#partitioning-and-index-columns)
section for a discussion of why this approach was chosen.
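Applied to the course catalog (a sketch; the partition names and year values
are illustrative): to partition by year, `year` moves to the front of the key,
followed by the columns that would otherwise have formed the key on their own:

```sql
-- Sketch of the guideline: the partitioning column (year) leads the
-- key; the rest of the natural key follows.
CREATE TABLE courses (
    course_id STRING,
    term STRING,
    year INT,
    ...
    PRIMARY KEY (year, term, course_id)
) PARTITION BY LIST (year) (
    PARTITION until_2016 VALUES IN (2015, 2016),
    PARTITION current VALUES IN (2017)
);
```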
## Summary: initial partitioning and repartitioning

- Partitions can be added to a table when it is created, or at any time
  afterward. (However, since the primary key of a table cannot be changed,
  forethought is still required.)

- Partitions are defined over one or more columns. These columns must be a
  prefix of the primary key (or secondary index).

- Indexes can also be partitioned, but are not required to be.

- Each partition is required to have a name that is unique among all partitions
  on that table or index.

- Repartitioning is relatively cheap; data is not rewritten. The table metadata
  is updated and the allocator is left to move the data as necessary. A
  partition with the same name before and after a repartitioning is considered
  the same partition. During a repartitioning, any partition zone config entries
  with no corresponding name in the new partitions are removed.

- Future work to expose zone configs through SQL will likely allow for updating
  partitions and the corresponding zone configs together. In the meantime, users
  can create an empty partition, apply a zone config, then repartition.

- After a table is partitioned, its partitions can be targeted by zone
  constraints via the existing CLI, e.g. `./cockroach zone set
  database.table --partition=partition -f zone.yml`.

- As before, CockroachDB uses the most granular zone config available. Zone
  configs that target a partition are considered more granular than those that
  target a table, which in turn are considered more granular than those that
  target a database. Configs do not inherit, but unspecified fields when a zone
  override is first created are copied from the parent, as is currently the
  case.


## Usage: locality–resilience tradeoff

There exists a tradeoff between making reads/writes fast and surviving failures.
Consider a partition with three replicas of `roachmart.users` for Australian
users. If only one replica is pinned to an Australian datacenter, then reads may
be fast (via [leases follow the sun]) but writes will be slow. If two replicas
are pinned to an Australian datacenter, then reads and writes will be fast (as
long as the cross-ocean link has enough bandwidth that the third replica doesn’t
fall behind and hit the quota pool). If those two replicas are in the same
datacenter, then loss of one datacenter can lead to data unavailability, so some
deployments may want two separate Australian datacenters. If all three replicas
are in Australian datacenters, then three Australian datacenters are needed to
be resilient to a datacenter loss.


# Reference-level explanation

## Key encoding

Different key encodings were evaluated at length in a prior version of this RFC.
The approach recommended, entitled [Index Key Prefix], is described here without
discussion of the tradeoffs involved in the decision; refer to the original RFC
for that discussion.

The Index Key Prefix approach simply allows part of the existing row key to be
used as the partition key. For a partitioned table, the partition key can be any
prefix of the table primary key. Similarly, for a partitioned index, the
partition key can be any prefix of the indexed columns. Since a row’s key is
encoded as the table ID, followed by the index ID, followed by each of the
indexed columns, in order, the partition key appears immediately after the table
and index IDs, and thus rows with the same partition key will be adjacent.

To further illustrate this, consider the key encodings for the earlier
geographic partitioning example.

Primary index for `roachmart.users`:

```
/TableID/IndexID/<country>/<email> -> [...]
```

Global non-unique secondary index for `roachblog.articles.author_id`:

```
/TableID/IndexID/<author_id>/<article_id> -> null
```

Partitioned non-unique secondary index for `roachmart.orders.part_id`:

```
/TableID/IndexID/<user_country>/<part_id>/<order_id> -> null
```

Note that values need not be physically adjacent to belong to the same logical
partition. For example, `'US'`, `'CA'` and `'MX'` are interspersed with other
countries, but all map to the same `north_america` partition in
`roachmart.users`. This comes at a cost, as the `north_america` partition now
requires a minimum of three separate ranges instead of just one. See [Range
splits] below for further discussion.
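To make the interleaving concrete, here is a sketch of the `roachmart.users`
primary index key space in sorted order (emails elided, and assuming at least
one user per listed country; the partition each key falls in is shown in
parentheses):

```
/TableID/IndexID/'AU'/...  (australia)
/TableID/IndexID/'CA'/...  (north_america)
/TableID/IndexID/'FJ'/...  (default)
/TableID/IndexID/'MX'/...  (north_america)
/TableID/IndexID/'NZ'/...  (australia)
/TableID/IndexID/'US'/...  (north_america)
```

Zone configs apply to contiguous key spans, so `north_america` alone needs at
least three ranges here: `'CA'`, `'MX'`, and `'US'` are each separated by keys
belonging to other partitions.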
## SQL syntax

The `PARTITION BY` clause is usable in `CREATE TABLE`, `ALTER TABLE`, `CREATE
INDEX`, `ALTER INDEX` and variants. See [Example: geographic partitioning] for
the `PARTITION BY LIST` syntax and [Example: date partitioning] for `PARTITION
BY RANGE`.

A somewhat formal version of the syntax is presented below:

```
CREATE TABLE <table-name> ( <elements...> ) [<interleave>] [<partition-scheme>]
CREATE INDEX [<index-name>] ON <table-name> ( <col-names>... ) [<interleave>] [<partition-scheme>]
ALTER TABLE <table-name> <partition-scheme>
ALTER INDEX <index-name> <partition-scheme>

Partition scheme:
    PARTITION BY LIST ( <col-names>... ) ( <list-partition> [ , ... ] )
    PARTITION BY RANGE ( <col-names>... ) ( <range-partition> [ , ... ] )

List partition:
    PARTITION <partition-name> VALUES IN ( <list-expr>... ) [<partition-scheme>]

List expression:
    DEFAULT
    <const-expr>

Range partition:
    PARTITION <partition-name> VALUES FROM (<range-expr>) TO (<range-expr>)

Range expression:
    MINVALUE
    MAXVALUE
    <const-expr>
```

To reduce confusion, expressions in a `VALUES` or `VALUES IN` clause must be
constant, as they are only computed once, when the `CREATE` or `ALTER` statement
is executed. Allowing non-constant expressions, like `now()` or `SELECT
country_name FROM countries`, would suggest that the partition updates whenever
the expression changes in value.

Note that the lower bound of a range partition (`FROM`) is inclusive, while the
upper bound (`TO`) is exclusive. Note also that a `NULL` value in a
range-partitioned column sorts into the first range, which is consistent with
our key encoding ordering and `ORDER BY` behavior.

### SELECT FROM PARTITION

To allow reads to target only selected partitions, we propose to extend table
names (not arbitrary table expressions) with a `PARTITION` clause. For example:

```sql
SELECT * FROM roachmart.users PARTITION (australia, north_america);

SELECT *
FROM
    roachblog.articles PARTITION (recent) AS a
JOIN
    roachblog.user_views PARTITION (north_america) AS b
ON a.id = b.article_id;
```

The implementation is purely syntactic sugar and simply transforms each
`PARTITION` clause into an additional constraint. The join query above, for
example, would be rewritten to include `WHERE user_views.country IN ('CA', 'MX',
'US') AND articles.published >= 'recent cutoff'`.

This is only sugar, but it's important sugar. For example, in geopartitioning,
it is expected that there will be many more countries than partitions, so it is
much easier to specify partition restrictions with this syntax than with the
desugared `WHERE` clause.

Note that a list partitioning without a `DEFAULT` partition is effectively an
enum. The user can use this syntax with all partitions specified to force the
planner to turn a query like `SELECT * FROM roachmart.users WHERE email = '...'`
(with no country specified) from a full table scan into a point lookup per
`country` in the partitioning.
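Concretely, the first query above desugars to an ordinary constraint over the
current partitioning (a sketch; the country list tracks whatever the
partitioning specifies at the time):

```sql
-- Desugared form of:
--   SELECT * FROM roachmart.users PARTITION (australia, north_america)
SELECT * FROM roachmart.users
WHERE country IN ('AU', 'NZ', 'CA', 'MX', 'US');
```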
### Subpartitioning

Subpartitioning allows partitioning along several axes simultaneously. The
`PARTITION BY` syntax presented above is recursive so that list partitions can
themselves be partitioned any number of times, using either list or range
partitioning. Note that the subpartition columns must be a prefix of the columns
in the primary key that have not been consumed by parent partitions.

Suppose RoachMart wanted to age out users who haven’t logged in in several
months to slower hardware, while continuing to partition by country for improved
latency. Subpartitioning would neatly solve their use case:

```sql
CREATE TABLE roachmart.users (
    id SERIAL,
    country STRING,
    last_seen TIMESTAMP,
    ...,
    PRIMARY KEY (country, last_seen, id)
) PARTITION BY LIST (country) (
    PARTITION australia VALUES IN ('AU', 'NZ') PARTITION BY RANGE (last_seen) (
        PARTITION australia_archived VALUES FROM (MINVALUE) TO ('2017-06-04'),
        PARTITION australia_recent VALUES FROM ('2017-06-04') TO (MAXVALUE)
    ),
    PARTITION north_america VALUES IN ('CA', 'MX', 'US') PARTITION BY RANGE (last_seen) (
        PARTITION north_america_archived VALUES FROM (MINVALUE) TO ('2017-06-04'),
        PARTITION north_america_recent VALUES FROM ('2017-06-04') TO (MAXVALUE)
    ),
    ...
);
```

Subpartition names must be unique within a table, as each table's partitions and
subpartitions share a namespace.

Other databases also provide less flexible but more convenient syntaxes, like a
`SUBPARTITION TEMPLATE` that prevents repetition of identical subpartition
schemes. Above, a `SUBPARTITION TEMPLATE` could be used to describe the
`last_seen` partitioning scheme exactly once, instead of once for each `country`
partition. We propose to implement only the more general syntax and defer design
of a more convenient syntax until demand exists.
## IndexDescriptor changes

Partitioning information is added to `IndexDescriptor` as below. All tuples are
encoded as `EncDatums` using the value encoding.

```protobuf
message IndexDescriptor {
    ...
    optional PartitioningDescriptor partitioning = 13;
}

message PartitioningDescriptor {
    message List {
        optional string name = 1;
        repeated bytes values = 2;
        optional PartitioningDescriptor subpartition = 3;
    }

    message Range {
        optional string name = 1;
        optional bytes inclusive_lower_bound = 2;
        optional bytes exclusive_upper_bound = 3;
    }

    optional uint32 num_columns = 1;
    repeated List list = 2;
    repeated Range range = 3;
}
```

## Zone config changes

Zone configs are currently stored in the `system.zones` table, which maps
database and table IDs to `ZoneConfig` protobufs. We propose to adjust this
`ZoneConfig` protobuf to include configuration for "subzones," where a subzone
represents either an entire index or a partition of an index. Subzones are not
applicable when the zone does not represent a table.

Ideally, we'd modify or replace `system.zones` with a table keyed by
`(descriptor_id, index_id, partition_name)`, but a proposal to this effect [was
presented and rejected in a previous version of this RFC][system-subzones]. In
short, replacing `system.zones` requires a complex, backwards-incompatible
migration that is better left to its own RFC.

Our interim solution adds two fields to the `ZoneConfig` proto:

```protobuf
message ZoneConfig {
    ...
    repeated SubzoneSpan subzone_spans = 8;
    repeated SubzoneConfig subzone_configs = 9;
}

message SubzoneSpan {
    optional roachpb.Span span = 1;
    optional uint32 subzone_index = 2;
}

message SubzoneConfig {
    optional uint32 index_id = 1;
    optional string partition_name = 2;
    optional ZoneConfig config = 3;
}
```

A subzone's config is stored in the `subzone_configs` field as a `SubzoneConfig`
message, which bundles a `ZoneConfig` with identification of the index or
partition that the subzone represents. Every `SubzoneConfig` must specify an
`index_id`, but subzones that apply to an entire index omit the
`partition_name` field.

A mapping from key span to `SubzoneConfig` is stored in the `subzone_spans`
field to allow efficient lookups of the subzone override for a given key.
Entries in `subzone_spans` are non-overlapping and sorted by start key to allow
for binary search. If an entry's span contains the lookup key, its
`subzone_index` field is the index of the corresponding `SubzoneConfig` in the
`subzone_configs` field; if no span contains the lookup key, the table config
(i.e., the outer `ZoneConfig`) applies.

We could alternatively derive `subzone_spans` from the corresponding
`TableDescriptor` on every call to `GetZoneConfigForKey`, but this would involve
quite a bit of encoding/decoding on a hot code path, as the table descriptor
stores the partition split points using the value encoding. Updating zone
configs or partitioning, by contrast, happens infrequently; we'll simply
recompute `subzone_spans` whenever an index or partition zone config is updated,
or when a table's partitioning scheme is updated.

One case requires special care to handle. Suppose the `archived` partition of
`roachblog.articles` has a custom zone config, but the `roachblog.articles`
table itself does not have a custom zone config. The `system.zones` table will
necessarily have an entry for `roachblog.articles` to store the custom zone
config for the `archived` partition, but that entry will have an otherwise empty
`ZoneConfig`. This must not be taken to mean that the other partitions of the
table should use this empty zone config, but rather that the default zone config
applies. We propose to use the `num_replicas` field for this purpose:
`num_replicas = 0` is invalid and therefore indicates that the table in
question does not have an active `ZoneConfig` of its own.

As mentioned in the examples, the zone config CLI will be adjusted to accept an
index specifier and a `--partition` flag:

```bash
# Before:
$ ./cockroach zone {set|get|rm} DATABASE[.TABLE] -f zone.yml

# After:
$ ./cockroach zone {set|get|rm} DATABASE[.TABLE[@INDEX]] [--partition=PARTITION] -f zone.yml
```

Omitting the index but specifying the `--partition` flag (e.g., `./cockroach
zone set db.tbl --partition=p0`) implies a partition of the primary index. Note,
however, that `./cockroach zone set db.tbl` and `./cockroach zone set
db.tbl@primary` are *not* equivalent: the former specifies a table zone config
that applies to any secondary indices, unless more specific overrides exist,
while the latter specifies a primary index zone config that never applies to the
table's secondary indices.
One downside of this scheme is that the zone configs for all partitions of a
table's indices are stored in one row, which puts an effective limit on the
number of partitions allowed on a table. Additionally, `system.zones` is part of
the unsplittable, gossiped system config span, so all zone configs across all
tables must add up to less than 64MB. Some back-of-the-envelope math suggests 60
bytes per range partition, and 60 bytes plus the size of the values for a list
partition.

For `roachmart.users` with every country allocated between 7 partitions, this
works out to 1167B. That implies an absolute max of 64MB / 60B = ~1,000,000
partitioned tables, or 64MB / 1167B = ~54,000 tables if they were all
partitioned by country. The recommended maximum number of partitions per table
in other partitioning implementations ranges from 100 to 1024, so this seems
reasonable.

Similar to the way that the zone config for a table, if present, completely
overrides (with no inheritance) the zone config for a database, a zone config
for a partition overrides the zone config for the table, database, or cluster.
The ergonomics of this will likely be sub-optimal; the user will need to
maintain the denormalization of what is naturally an inheritance hierarchy of
configuration. There is a larger upcoming effort to refactor zone configs which
will address these issues, so the following are out of scope for this RFC:

- A SQL interface for reading and writing zone configs

- Auditing of zone config changes

- Inheritance

- Moving zone configs out of gossip

- Raising global and per-table limitations on the number of partitions

- Direct replica-level control of data placement

- Partition configs outside of the database+table scoping. This could be useful
  for allowing partitioning in a shared CockroachDB offering with static, preset
  zone configs.

- The ability to define a partition and specify a zone config for it at the same
  time.

## Range splits and schema changes

The CockroachDB unit of replication and rebalancing is the range, so for zone
configs to target a partition, the partition needs to be on its own range (and
potentially more than one range for list partitioning; see [Range splits]).

Currently, CockroachDB asynchronously splits each newly created table into its
own range. This happens regardless of whether any zone configs are added to
target that table specifically. Each partition could similarly be asynchronously
split after being defined.

This may create extra ranges if a table is partitioned, repartitioned, and only
then given zone configs (though this problem goes away when we support range
merges). We address this by lazily splitting a partition only when a zone config
is added for it. This deviates a bit from the table behavior and may violate a
user’s expectation of partitions living on separate ranges, but the tradeoff was
deemed worth it.

A table’s partitions can have a large influence on SQL planning, so any changes
should be broadcast immediately. Further, our schema changes and table leasing
require that at most two versions are in use at a time. So it’s a natural fit
for partitioning changes to run as schema changes. This schema change will be
responsible for creating the splits and can be hooked into `system.jobs` to
provide introspection of progress as well as cancellation.
## Query planning changes

Because of the decision to require that partitions be defined over normal,
materialized columns, there are no correctness changes needed to SQL query
planning. Performance, however, needs some work.


### Case study

Consider the following queries issued by RoachMart to the `roachmart.users`
table. (Ignore the poor security practices; this is meant to be illustrative.)

When a user visits RoachMart and needs to log in:

```sql
SELECT id FROM roachmart.users WHERE email = $1
```

Whenever a new page is loaded on the web or a new screen is loaded in the native
app, the `id` is extracted from a cookie or passed via the RoachMart API and
used in the stateless servers to rehydrate the user information:

```sql
SELECT * FROM roachmart.users WHERE id = 12345
```

Regardless of whether the `roachmart.users` table is partitioned, the first
query is kept speedy by an unpartitioned secondary index on `email`. This may
require a cross-ocean hop to another datacenter, but login is an infrequent
operation and so this is okay. (If data sovereignty of emails is a concern, then
a global index is not appropriate and either a global index on `hash(email)` or
a partitioned index must be used. The details are left for a later sovereignty
RFC.)

The second query is much harder. In an unpartitioned table, the primary key
would be on `(id)` alone and so this is a point lookup. But RoachMart is a
global company and wants to keep a user’s data in the datacenter nearest them,
so they’ve partitioned `roachmart.users` on `(country, id)`. This means the
query as written above will require a full table scan, which is obviously
unacceptable.

The best solution to this, and the one we will always recommend first, is for
RoachMart to also specify the user’s `country` in the second query. The login
query will be changed to also return the `country`, this will be saved alongside
`id` in the web cookie or native app’s local data, and passed back whenever
retrieving the user. It’s not even as onerous as it first seems, especially if
the RoachMart API returns the user’s `id` as a single string containing both
pieces: `US|123`.

In some cases, this will not be possible or will not be desirable, so
CockroachDB has a number of pieces that can be combined to deal with this. N.B.:
None of these really solve the problem in a satisfactory way, so as mentioned
above we will very strongly urge the user to specify the entire primary key.

1. The developer could create a global index on just `id`, but in the common
   case this requires a cross-datacenter request for the index lookup.

2. If `LIMIT 1` is added to the query and no sort is requested, the planner is
   free to return the first result it finds and cancel the rest of the work. If
   a uniqueness constraint exists on `id`, the `LIMIT 1` can be assumed. The
   latter is an optimization that may be helpful in general.

3. A list partitioning without a `DEFAULT` partition is effectively an enum. The
   user can use the [SELECT FROM PARTITION] syntax with all partitions
   specified. Internally, the planner will turn this into an `AND country IN
   ('CA', 'MX', 'US', …)` clause in the `WHERE`, which turns the query from a
   full table scan into a point lookup per `country` in the partitioning. If
   `DEFAULT` is present, the non-default cases could optimistically be checked
   first. If some future version of CockroachDB supports enum types, the user
   would get this behavior even without using the `SELECT FROM PARTITION`
   syntax.

4. This is not a full table scan in other partitioning implementations because
   they don’t require the `(country, id)` primary key, instead indexing `(id)`
   as normal inside each partition. A query on `id` without the partition
   information then becomes a point lookup per partition. This can be simulated
   in CockroachDB by introducing and partitioning on a derived `partition_id`
   column that is 1:1 with partitions (see the sketch after this list). This is
   sufficient justification to prioritize building [computed columns].

   Concretely, the `roachmart.users` table above could have `PRIMARY KEY
   (continent, country, id)` and `PARTITION BY LIST (continent)` to start, so
   there is only one key value per partition. Later, when/if it is needed, it
   could change to `PARTITION BY LIST (continent, country)`.

5. Combining options 3 and 4 could allow a user to issue one query that
   optimistically tries a point lookup in the local partition before trying a
   point lookup in all partitions.
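A sketch of the `partition_id` simulation from option 4, assuming a
computed-column syntax along the lines of the [computed columns] proposal (the
`CASE` mapping and all names here are illustrative):

```sql
CREATE TABLE roachmart.users (
    id SERIAL,
    country STRING,
    -- Derived column that is 1:1 with partitions; hypothetical
    -- computed-column syntax.
    partition_id INT AS (
        CASE WHEN country IN ('AU', 'NZ') THEN 1
             WHEN country IN ('CA', 'MX', 'US') THEN 2
             ELSE 0
        END
    ) STORED,
    ...,
    PRIMARY KEY (partition_id, id)
) PARTITION BY LIST (partition_id) (
    PARTITION australia VALUES IN (1),
    PARTITION north_america VALUES IN (2),
    PARTITION default VALUES IN (DEFAULT)
);
```

A query on `id` alone then needs only one point lookup per `partition_id` value
rather than a full table scan, at the cost that repartitioning now rewrites rows
instead of just updating range metadata (see [Range splits]).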
## Other SQL changes

Other implementations surface partitioning information via
`information_schema.partitions`, so we should as well. `SHOW CREATE TABLE` will
also need to learn to display `PARTITION BY` clauses.


## Interleaved tables and partitioning

[Interleaved tables] and partitioning are designed to work together. Geographic
partitioning is used to locate `roachmart.users` records in the nearest
datacenter, and interleaved tables are used to locate the data associated with
that user (orders, etc.) near it. A geographically partitioned user can be moved
with one `UPDATE`, and the user’s orders (plus the rest of the interleave
hierarchy) can be moved with `ON UPDATE CASCADE`.
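For example (a sketch with hypothetical data, and assuming the foreign key in
`roachmart.orders` was declared with `ON UPDATE CASCADE`), relocating a user
and their entire interleave hierarchy is a single statement:

```sql
-- Moving this user from the north_america partition to scandinavia;
-- the change cascades through the interleaved orders table.
UPDATE roachmart.users
SET country = 'SE'
WHERE country = 'US' AND email = 'sven@example.com';
```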
# Drawbacks

Like index selection, column families, interleaved tables, and other
optimization tools, partitioning will require some knowledge of the internals of
the system to use effectively. See, for example, the [locality–resilience
tradeoff] described above.


## Partitioning and index columns

Since a table can only be partitioned by a prefix of its primary key, a table
destined for partitioning often has an “unnatural” primary key. For example, to
support partitioning the `roachmart.users` table by `country`, the table’s
natural primary key, `id`, must be explicitly prefixed with the partition column
to create a composite primary key of `(country, id)`. The composite primary key
has two notable drawbacks: it does not enforce that `id` is globally unique, and
it does not provide fast lookups on `id`. If uniqueness or fast lookups are
required, the user must explicitly create a unique, unpartitioned secondary
index on `id`.

We could automatically add this secondary index to preserve uniqueness and fast
lookups, but this would violate user expectations. First, a secondary index
might itself store sensitive information. We want to ensure that operators are
aware of the secondary index so they can specify an appropriate zone config or
even create the index on `hash(id)` (though hashes may not provide sufficient
masking of sensitive information, depending on the entropy of the input and the
type of hash). Second, every secondary index increases the cost of writes
(specifically for inserts and for updates that change an indexed or stored
column). For example, an unpartitioned, unique index on `roachmart.users` would
require cross-ocean hops for writes that would otherwise hit just one
continent.*

Similarly, we could silently prefix the specified primary key (i.e., the natural
primary key) with the partition columns, but this too would violate user
expectations. Most notably, queries that specify all columns in an index, as
specified at creation or returned by reflection like `information_schema`, are
expected to be point lookups.

Instead, we can aid the user with detailed guidance in the error messages
generated by invalid partitioning schemes.

*In general, ensuring global uniqueness requires cross-datacenter hops on
writes. In limited cases, like `SERIAL` columns, users can achieve both
uniqueness and fast writes without a unique secondary index by assuming
`unique_rowid()` collisions are sufficiently improbable. The risk with such a
scheme, of course, is that someone can manually insert a colliding value.


## Range splits

The unit of replication in CockroachDB is the range, so partitioning necessarily
requires splitting at least one range for each partition. In the worst case,
when partitioning by list, partitions containing non-adjacent values will
generate an extra range for *each* non-adjacent value.

For example, consider the original partitioning specification for
`roachmart.users`, in which `'CA'` and `'MX'` belong to the same partition,
`north_america`, but are bisected by a value in another partition (e.g. `'FJ'`,
which falls in the `default` partition). This forces each country onto its own
range. As more countries are added, the effect is amplified.

This is unfortunate but should be fine in larger tables. It can be worked around
by introducing and partitioning on a `partition_id` column, which is derived
from `country`. (Computed columns are a natural choice for this.) The tradeoff
here is that repartitioning will necessitate rewriting rows instead of just
updating range metadata.

Note that repartitioning could make this worse, especially until we support
range merges. As discussed in [Range splits and schema changes], list
partitioning may create more ranges than expected when partition values are not
adjacent. Additionally, repartitioning may result in small or empty ranges that
cannot be cleaned up until we support range merges.


# Future work

## Admin UI

Exposing partitioning information and zone configs in the admin UI is out of
scope for this document. See [#14113], which tracks providing broad insight into
zone constraint violations.


## Bulk load

Other `PARTITION BY` implementations can be used to quickly bulk load data into
a table (or bulk remove it), and this appears to be a popular use. We currently
allow only bulk load of an entire table at once, so this may be useful for us to
consider as well, but it’s out of scope for this document.
# Alternatives

The separation between partition specification, which happens via SQL, and zone
configuration, which happens via the CLI, is unfortunate and largely historical
happenstance. A table’s partitioning is rarely updated independently of its zone
config, so if and when we move zone configuration to SQL, we should consider
tightly coupling its interface to table definitions/partitioning.


# Appendix: Other PARTITIONING syntaxes

There is unfortunately no SQL standard for partitioning. As a result, separate
partitioning syntaxes have emerged:

- MySQL and Oracle specify partitions inline in `CREATE TABLE`, like we do
  above.

- Microsoft SQL Server requires four steps: allocating physical storage called
  “filegroups” for each partition with `ALTER DATABASE… ADD FILE`, followed by
  `CREATE PARTITION FUNCTION` to define the partition split points, followed by
  `CREATE PARTITION SCHEME` to tie the partition function output to the created
  filegroups, followed by `CREATE TABLE... ON partition_scheme` to tie the
  table to the partitioning scheme.

- PostgreSQL 10 takes a hybrid approach: the partition columns and scheme (i.e.,
  `RANGE` or `LIST`) are specified in the `CREATE TABLE` statement, but the
  partition split points are specified by running `CREATE TABLE… PARTITION OF…
  FOR VALUES` once for each partition.

We normally reuse PostgreSQL syntax for compatibility reasons, but we’ve
deliberately rejected it here. Partitioning was not a first-class feature before
PostgreSQL 10, which was only released on 2017-10-05, so we shouldn’t have
compatibility issues with existing ORMs and applications. The syntax introduced
in PostgreSQL 10, shown below, treats a partitioned table roughly as a
collection of tables, each of which has its own indexes, constraints, etc. It
does not allow for a global index across all partitions of a table, and the
global table namespace is polluted with each partition.

We’ve instead chosen to closely follow the MySQL and Oracle syntax, since it
fits with our model of partitions as mostly-invisible subdivisions of a table.

For reference, we’ve replicated the `roachblog.articles` example in each of the
three syntaxes.

## MySQL and Oracle

```sql
CREATE TABLE articles (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    published DATE,
    ...
) PARTITION BY RANGE (published) (
    PARTITION archived VALUES LESS THAN ('2017-01-25'),
    PARTITION recent VALUES LESS THAN MAXVALUE
);
```

## PostgreSQL

```sql
CREATE TABLE articles (
    id SERIAL,
    published DATE,
    ...
) PARTITION BY RANGE (published);
CREATE TABLE articles_archived PARTITION OF articles
    FOR VALUES FROM (MINVALUE) TO ('2017-01-25');
CREATE TABLE articles_recent PARTITION OF articles
    FOR VALUES FROM ('2017-01-25') TO (MAXVALUE);
```
## Microsoft SQL Server

```sql
CREATE PARTITION FUNCTION fun (date) AS RANGE LEFT FOR VALUES ('2017-01-25');
CREATE PARTITION SCHEME sch AS PARTITION fun TO (filegroups...);
CREATE TABLE articles (id INT PRIMARY KEY, published DATE) ON sch (published);
```

[#14113]: https://github.com/cockroachdb/cockroach/issues/14113
[#18683]: https://github.com/cockroachdb/cockroach/pull/18683
[#19141]: https://github.com/cockroachdb/cockroach/issues/19141
[computed columns]: https://github.com/cockroachdb/cockroach/pull/20735
[example: date partitioning]: #example-date-partitioning
[example: geographic partitioning]: #example-geographic-partitioning
[index key prefix]: https://github.com/cockroachdb/cockroach/blob/1f3c72f17546f944490e0a4dcd928fd96a375987/docs/RFCS/sql_partitioning.md#key-encoding
[interleaved tables]: https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html
[leases follow the sun]: https://github.com/cockroachdb/cockroach/blob/763d21e6fad69728a523d3cdd8b449c8513094b7/docs/RFCS/20170125_leaseholder_locality.md
[locality–resilience tradeoff]: #usage-localityresilience-tradeoff
[partitioning and index columns]: #partitioning-and-index-columns
[partitioning key selection]: #partitioning-key-selection
[range splits and schema changes]: #range-splits-and-schema-changes
[range splits]: #range-splits
[select from partition]: #select-from-partition
[system-subzones]: https://github.com/cockroachdb/cockroach/blob/1f3c72f17546f944490e0a4dcd928fd96a375987/docs/RFCS/sql_partitioning.md#table-subzones
[zone config]: https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html