- Feature Name: SQL Query Planning
- Status: in-progress
- Start Date: 2017-12-13
- Authors: Peter Mattis
- RFC PR: #19135
- Cockroach Issue: (one or more # from the issue tracker)

# Summary

This RFC sketches the outlines of the high-level modules of SQL query
planning, including a full-featured optimizer.

# Motivation

SQL query planning is concerned with transforming the AST of a SQL
query into a physical query plan for execution. Naive execution of a
SQL query can be prohibitively expensive, because SQL specifies the
desired results and not how to achieve them. A given SQL query can
have thousands of alternate query plans with vastly different
execution times. The techniques used to generate and select a good
query plan involve significant engineering challenges.

This RFC is intended to provide guidance for both short term and long
term work on the SQL optimizer, and highlight areas of the current
system that will need to evolve.

# Guide-level explanation

## Overview

SQL query planning is often described in terms of 8 modules:

1. [Stats](#stats)
2. [Prep](#prep)
3. [Rewrite](#rewrite)
4. [Memo](#memo)
5. [Cost Model](#cost-model)
6. [Search](#search-aka-enumeration)
7. [Properties](#properties)
8. [Transformations](#transformations)

Note that Stats, Cost Model, Memo, Properties and Transformations
could be considered modules, while Prep, Rewrite and Search could be
considered phases, though we'll refer to all 8 uniformly as modules in
this document. Memo is a technique for compactly representing the
forest of trees generated during Search. Stats, Properties, Cost Model
and Transformations are modules that power Prep, Rewrite and Search.

```
                     SQL query text
                           |
                     +-----v-----+
                     |   Parse   |
                     +-----+-----+
                           |
                         (ast)
                           |
   +-------+         +-----v-----+   - constant folding, type checking, name resolution
   | Stats +-------->|   Prep    |   - computes initial properties
   +-------+         +-----+-----+   - retrieves and attaches stats
                           |         - done once per PREPARE
                         (expr)
                           |
                     +-----v-----+   - capture placeholder values / timestamps
                +--->|  Rewrite  |   - cost-agnostic transformations, e.g. predicate push-down
+------------+  |    +-----+-----+   - done once per EXECUTE
| Transforms +--+          |
+------------+  |        (expr)
                |          |
                +--->+-----v-----+   - cost-based transformations
+------------+       |  Search   |   - finds lowest cost physical plan
| Cost Model +------>+-----+-----+   - includes DistSQL physical planning
+------------+             |
                    (physical plan)
                           |
                     +-----v-----+
                     | Execution |
                     +-----------+
```

CockroachDB already has implementations of portions of these modules,
except for Stats and Memo. For example, CockroachDB performs name
resolution and type checking, which is part of Prep, and performs
predicate push down through joins, which traditionally happens during
Rewrite. CockroachDB utilizes a primitive cost model during index
selection (a portion of Search) to choose which index to use based on
filters and desired ordering.

In addition to the 8 modules, another aspect of the optimizer that
needs discussion is [Testing](#testing) and test infrastructure.

Lastly, a strawman [Roadmap](#roadmap) is proposed for how to break up
this work over the next several releases.
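To summarize the diagram above as code, here is a toy sketch of how
the phases compose. All names and types are illustrative stand-ins,
not the actual CockroachDB API:

```go
package main

import "fmt"

// Hypothetical stand-in types; the real structures are sketched in
// the sections that follow.
type ast struct{ sql string }
type expr struct{ desc string }
type physicalPlan struct{ desc string }

// parse transforms the SQL query text into an AST.
func parse(sql string) *ast { return &ast{sql: sql} }

// prep performs name resolution and type checking and computes
// initial properties. Done once per PREPARE.
func prep(a *ast) *expr { return &expr{desc: "prep(" + a.sql + ")"} }

// rewrite applies cost-agnostic transformations such as predicate
// push-down. Done once per EXECUTE.
func rewrite(e *expr) *expr { return &expr{desc: "rewrite(" + e.desc + ")"} }

// search applies cost-based transformations and returns the lowest
// cost physical plan.
func search(e *expr) physicalPlan {
	return physicalPlan{desc: "search(" + e.desc + ")"}
}

func main() {
	plan := search(rewrite(prep(parse("SELECT 1"))))
	fmt.Println(plan.desc) // search(rewrite(prep(SELECT 1)))
}
```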
## Glossary

The following terms are introduced/defined in this RFC:

- [**algebraic equivalence**](#properties)
- [**attributes** of expressions](#properties-vs-attributes)
- [**cardinality**](#stats)
- [**decorrelating**](#rewrite), syn. "unnesting"
- [**derived** vs **required** properties](#properties)
- [**enforcer** operator for properties](#properties)
- [**equivalence class**](#memo)
- [**exploration** vs **implementation** transformations](#search-aka-enumeration)
- [**expressions** in queries](#prep)
- [**functional dependencies**](#prep)
- [**logical** vs **physical** properties](#memo)
- [**logical** vs **physical** vs **scalar** operators](#prep)
- [**memo-expressions**](#memo)
- [**operator** in query expressions](#prep)
- [**pattern** in transformations](#memo)
- [**predicate push-down**](#rewrite)
- [**prep** phase](#prep)
- **properties** of expressions [1](#memo) [2](#properties)
- [**pruning** during search](#search-aka-enumeration)
- [**query text**](#modules)
- [**rewrite** phase](#rewrite)
- [**scalar** vs **relational** properties](#properties)
- [**search** phase](#search-aka-enumeration)
- [**selectivity**](#stats)
- [**tracked** vs **computed** properties](#properties)
- [**transformation** of expressions](#rewrite)
- [**unnesting**](#rewrite), syn. "decorrelating"

## Modules

The parse phase is not discussed in this RFC. It handles the
transformation of the *SQL query text* into an abstract syntax tree
(AST).

### Prep

*Prep* (short for "prepare") is the first phase of query optimization,
where the AST is transformed into a form more suitable for
optimization and annotated with information that will be used by later
phases. Prep includes resolving table and column references (i.e. name
resolution) and type checking, both of which are already performed by
CockroachDB.

During Prep, the AST is transformed from the raw output of the parser
into an expression "tree".

```go
type operator int16

type expr struct {
	op              operator
	children        []*expr
	relationalProps *relationalProps // See [relational properties](#tracked-properties)
	scalarProps     *scalarProps     // See [scalar properties](#tracked-properties)
	physicalProps   *physicalProps   // See [physical properties](#tracked-properties)
	private         interface{}
}
```

The term *"expression"* here is based on usage from literature, though
it is mildly confusing as the current SQL code uses "expression" to
refer to scalar expressions. In this document, "expression" refers to
either a relational or a scalar expression. Using a uniform node type
for expressions facilitates transforms used during the Rewrite and
Search phases of optimization.

Each expression has an *operator* and zero or more operands
(`expr.children`). Operators can be *relational* (e.g. `join`) or
*scalar* (e.g. `<`). Relational operators can be *logical* (only
specifies results) or *physical* (specifies both result and a
particular implementation).

During Prep all the columns are given a unique index (number). Column
numbering involves assigning every base column and non-trivial
projection in a query a unique query-specific index.

Giving each column a unique index allows the expression nodes
mentioned above to track input and output columns, or really any set
of columns during Prep and later phases, using a bitmap. The bitmap
representation allows fast determination of compatibility between
expression nodes and is utilized during rewrites and transformations
to determine the legality of such operations.
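To make the bitmap representation concrete, here is a minimal sketch.
The `colSet` type, the 64-column limit and the column numbering are
illustrative assumptions, not the actual implementation, which would
need a sparse fallback for wider queries:

```go
package main

import "fmt"

// colSet is a toy column-set bitmap, assuming a query references at
// most 64 columns.
type colSet uint64

func (s colSet) add(col uint) colSet    { return s | (1 << col) }
func (s colSet) subsetOf(t colSet) bool { return s&^t == 0 }

func main() {
	// For SELECT * FROM a JOIN b USING (x) WHERE a.x < 10, number the
	// columns a.x=0, a.y=1, b.x=2, b.z=3.
	var scanA, scanB, pred colSet
	scanA = scanA.add(0).add(1) // output columns of the scan of a
	scanB = scanB.add(2).add(3) // output columns of the scan of b
	pred = pred.add(0)          // input columns of the predicate a.x < 10

	// The predicate is compatible with the scan of a but not the scan
	// of b, so it can legally be pushed below the join onto a's scan.
	fmt.Println(pred.subsetOf(scanA), pred.subsetOf(scanB)) // true false
}
```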
The Prep phase also computes *logical properties*, such as the input
and output columns of each (sub-)expression, equivalent columns,
not-null columns and functional dependencies.

The functional dependencies for an expression are constraints over one
or more sets of columns. Specific examples of functional dependencies
are projections, where one or more input columns determine an output
column, and "keys", which are sets of columns where no two rows output
by the expression are equal after projection onto that set (e.g. a
unique index for a table where all of the columns are NOT
NULL). Conceptually, the functional dependencies form a graph, though
they are not represented as such in code.

### Rewrite

The second phase of query optimization is *rewrite*. The rewrite phase
performs *transformations* on the logical query tree which are always
beneficial (i.e. cost-agnostic).

A transformation transforms a (part of a) query into another. Note
that there is conceptual overlap with the Search phase, which also
performs transformations on the query. Both phases employ
transformations, yet Search needs to track and cost the alternatives
while Rewrite does not. In the specific context of the rewrite phase,
transformations are commonly called *rewrites*.

During Rewrite, the previous version of an expression is
discarded. During Search, both the original and new expression are
preserved side-by-side as alternatives; see the [section
below](#search-aka-enumeration) for details.

Also note that some of the transformations performed by Rewrite need
not be performed again by Search (decorrelation is the prime
example). The vast majority of transforms performed by Search are not
used by Rewrite.

Rewrite is the phase where e.g. correlated subqueries are
*decorrelated* (synonym: *unnesting*), additional predicates are
inferred and *predicate push down* occurs, and various other
simplifications are made to the relational algebra tree (e.g.
projection & join elimination). As an example of predicate push down,
consider the query:

```sql
SELECT * FROM a JOIN b USING (x) WHERE a.x < 10
```

The naive execution of this query retrieves all rows from `a` and `b`,
joins (i.e. filters) them on the variable `x`, and then filters them
again on `a.x < 10`. Predicate push down attempts to push down the
predicate `a.x < 10` below the join. This can obviously be done for
the scan from `a`:

```sql
SELECT * FROM (SELECT * FROM a WHERE a.x < 10) AS a JOIN b USING (x)
```

Slightly more complicated, we can also generate a new predicate using
the functional dependence that `a.x = b.x` (due to the join
predicate):

```sql
SELECT * FROM
  (SELECT * FROM a WHERE a.x < 10) AS a JOIN
  (SELECT * FROM b WHERE b.x < 10) AS b USING (x)
```

Predicate push down is aided by predicate inference.
Consider the query:

```sql
SELECT * FROM a JOIN b USING (x)
```

Due to the join condition, we can infer the predicates `a.x IS NOT
NULL` and `b.x IS NOT NULL`:

```sql
SELECT * FROM a JOIN b USING (x)
WHERE a.x IS NOT NULL AND b.x IS NOT NULL
```

And predicate push down can push these predicates through the join:

```sql
SELECT * FROM
  (SELECT * FROM a WHERE a.x IS NOT NULL) AS a JOIN
  (SELECT * FROM b WHERE b.x IS NOT NULL) AS b USING (x)
```

### Stats

Table statistics power both the cost model and the search of alternate
query plans. A simple example of where statistics guide the search of
alternate query plans is join ordering:

```sql
SELECT * FROM a JOIN b ON a.x = b.x
```

In the absence of other opportunities, this might be implemented as a
hash join. With a hash join, we want to load the smaller set of rows
(either from `a` or `b`) into the hash table and then query that table
while looping through the larger set of rows. How do we know whether
`a` or `b` is larger? We keep statistics about the *cardinality* of `a`
and `b`, i.e. the (approximate) number of rows in each.

Simple table cardinality is sufficient for the above query but fails
in other queries. Consider:

```sql
SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y > 10
```

Table statistics might indicate that `a` contains 10x more data than
`b`, but the predicate `a.y > 10` is filtering a chunk of the
table. What we care about is whether the result of the scan of `a`
after filtering returns more rows than the scan of `b`. This can be
accomplished by estimating the *selectivity* of the predicate
`a.y > 10` (the fraction of rows it passes) and then multiplying that
selectivity by the cardinality of `a`. The common technique for
estimating selectivity is to collect a histogram on `a.y` (prior to
running the query).

The collection of table statistics occurs prior to receiving the
query. As such, the statistics are necessarily out of date and may be
inaccurate. The system may bound the inaccuracy by recomputing the
stats based on how fast a table is being modified. Or the system may
notice when stat estimations are inaccurate during query execution.

[A separate RFC covers statistics collection in
CockroachDB.](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170908_sql_optimizer_statistics.md)
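To make the selectivity arithmetic concrete, here is a minimal sketch
of estimating rows for `a.y > 10` from an assumed histogram. The
bucket layout and the whole-bucket matching rule are simplifying
assumptions, not how the statistics RFC defines histograms:

```go
package main

import "fmt"

// bucket is a toy histogram bucket: all values in the bucket are
// less than or equal to upperBound.
type bucket struct {
	upperBound int64
	numRows    int64
}

// selectivityGT estimates the fraction of rows with value > v.
// Partially matched buckets are counted in full, which keeps the
// sketch simple at the cost of accuracy.
func selectivityGT(h []bucket, v int64) float64 {
	var total, matched int64
	for _, b := range h {
		total += b.numRows
		if b.upperBound > v {
			matched += b.numRows
		}
	}
	if total == 0 {
		return 0
	}
	return float64(matched) / float64(total)
}

func main() {
	// Hypothetical histogram on a.y, collected before the query ran.
	hist := []bucket{{5, 400}, {10, 350}, {50, 200}, {100, 50}}
	cardinality := int64(1000)
	sel := selectivityGT(hist, 10)
	fmt.Printf("selectivity=%.2f, estimated rows=%d\n",
		sel, int64(sel*float64(cardinality))) // selectivity=0.25, estimated rows=250
}
```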
### Memo

Memo is a data structure for efficiently storing a forest of query
plans. Conceptually, the memo is composed of a numbered set of
**equivalency classes** called **groups** where each group contains a
set of logically equivalent expressions. The different expressions in
a single group are called **memo-expressions** (memo-ized
expressions). While an expression node outside of the memo contains a
list of child expressions, a memo-expression contains a list of child
groups.

By definition, all the memo-expressions in a group share the same
*logical properties*, a concept explored more in depth in the [section
below](#properties). The memo-expression structure mirrors the
expression structure:

```go
type exprID int32
type groupID int32

type memoExpr struct {
	op            operator
	children      []groupID
	physicalProps *physicalProps
	private       interface{}
}

type memoGroup struct {
	exprs           []memoExpr
	relationalProps *relationalProps
	scalarProps     *scalarProps
}
```

Transformations are *not* performed directly on the memo because
transformations operate on trees while the memo models a forest of
trees. Instead, expression fragments are extracted from the memo,
transformed, and re-inserted into the memo. At first glance, this
seems onerous and inefficient, but it allows transformations to be
written more naturally and the extraction of expression fragments can
be performed efficiently.

Extracting an expression fragment for transformation is performed via
a process called *binding*. Binding allows iterating over all of the
expressions matching a pattern that are rooted at a particular
memo-expression. A pattern is specified using the same expression
structure that is to be extracted, with the addition of "pattern-leaf"
and "pattern-tree" placeholders that act as wildcards:

* A **pattern leaf** matches any expression tree, with only the root
  of the tree being retained in the bound expression. It is used when
  the expression is used opaquely by the transformation. In other
  words, the transformation doesn't care what's inside the subtree. It
  is a "leaf" in the sense that it's a leaf in any binding matching a
  pattern.
* A **pattern tree** matches any expression tree and indicates that
  recursive extraction of the full subtree is required. It is
  typically used for scalar expressions when some manipulation of that
  expression is required by the transformation. Note that a pattern
  tree results in all possible subtrees being enumerated, however
  scalar expressions typically don't have many subtrees (if there are
  no subqueries, there is only one subtree). [TODO(peter): what to do
  about subqueries in a scalar context? Iterating over all of the
  subquery expressions doesn't seem right. There is a TODO in `opttoy`
  to cache scalar expressions in `memoGroup`. Need to investigate this
  further.]

To better understand the structure of the memo, consider the query:

```sql
SELECT * FROM a, b WHERE a.x = b.x
```

Converted to the expression structure, which models the extended
relational algebra, the query looks like:

```
inner-join [columns: a.x a.y b.x b.z]
  filters:
    eq
      inputs:
        variable (a.x)
        variable (b.x)
  inputs:
    scan [columns: a.x a.y]
    scan [columns: b.x b.z]
```

Inserting the expression tree into the memo results in:

```
6: [inner-join [1 2 5]]
5: [eq [3 4]]
4: [variable b.x]
3: [variable a.x]
2: [scan b]
1: [scan a]
```

Memo groups are numbered by when they were created and the groups are
topologically sorted for display (this is an implementation detail and
not intended to be prescriptive). In the above example, each group
contains only a single memo-expression.
After performing the join commutativity transformation, the memo would
expand:

```
6: [inner-join [1 2 5]] [inner-join [2 1 5]]
5: [eq [3 4]]
4: [variable b.x]
3: [variable a.x]
2: [scan b]
1: [scan a]
```

Memo groups contain logically equivalent expressions, but two
logically equivalent expressions may not be placed in the same memo
group. This occurs because determining the logical equivalence of two
relational expressions is complex to perform 100% correctly. A
correctness failure (i.e. considering two expressions logically
equivalent when they are not) results in invalid transformations and
invalid plans. Placing two logically equivalent expressions in
different groups has a much gentler failure mode: the memo and search
are less efficient.

Insertion of an expression into the memo is performed by recursively
inserting all of the sub-expressions into the memo and then computing
a **fingerprint** for the memo-expression. The fingerprint for a
memo-expression is simply the expression operator and the list of
child groups. For example, in the memo examples above, the fingerprint
for the first inner-join expression is `[inner-join [1 2 5]]`. The
memo maintains a map from expression fingerprint to memo group which
allows quick determination of whether an expression fragment already
exists in the memo. A small amount of operator-specific normalization
is performed when computing the group fingerprint for a
memo-expression. For example, the left and right inputs of an
inner-join are output in sorted order, which results in the expressions
`[inner-join [1 2 5]]` and `[inner-join [2 1 5]]` having the same
group fingerprint. The operator-specific normalization is
conservative. The common case for placing logically equivalent
expressions in the same group is adherence to the invariant that
transformed expressions are logically equivalent to their input.

```go
type memo struct {
	// Map from memo-expression "group" fingerprint to group ID.
	groupMap map[string]groupID
	groups   []memoGroup
}
```

In addition to memo expressions, memo groups also contain a map from
desired physical properties to optimization state for the group for
those properties. This state is discussed more in
[Search](#search-aka-enumeration).

A **location** within the memo identifies a particular memo-expression
by its group and expression number. When an expression fragment is
extracted from the memo, each `expr` is tagged with the location it
originated from in the memo. This allows subsequent reinsertion of a
transformed expression into the memo to quickly determine which groups
the expression nodes should be added to.

```go
type memoLoc struct {
	group groupID
	expr  exprID
}
...
type expr struct {
	op  operator
	loc memoLoc
	...
}
```

The above depictions of the memo structures are simplified for
explanatory purposes. The actual structures are similar, though
optimized to reduce allocations.
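As a toy illustration of fingerprint-based insertion and the
inner-join normalization described above, here is a minimal sketch.
Operators are strings and all properties are omitted; this is not the
CockroachDB implementation:

```go
package main

import (
	"fmt"
	"sort"
)

type groupID int32

// toyExpr is a stripped-down memoExpr.
type toyExpr struct {
	op       string
	children []groupID
}

// fingerprint is the operator plus the list of child groups. For
// inner-join, the left and right inputs are output in sorted order so
// that [inner-join [1 2 5]] and [inner-join [2 1 5]] map to the same
// group.
func (e toyExpr) fingerprint() string {
	children := append([]groupID(nil), e.children...)
	if e.op == "inner-join" && len(children) >= 2 {
		inputs := children[:2]
		sort.Slice(inputs, func(i, j int) bool { return inputs[i] < inputs[j] })
	}
	return fmt.Sprint(e.op, children)
}

type toyMemo struct {
	groupMap map[string]groupID
	groups   [][]toyExpr
}

// add returns the existing group when an expression with the same
// fingerprint was inserted before, and creates a new group otherwise.
func (m *toyMemo) add(e toyExpr) groupID {
	f := e.fingerprint()
	if g, ok := m.groupMap[f]; ok {
		return g
	}
	g := groupID(len(m.groups))
	m.groups = append(m.groups, []toyExpr{e})
	m.groupMap[f] = g
	return g
}

func main() {
	m := &toyMemo{groupMap: map[string]groupID{}}
	a := m.add(toyExpr{op: "scan a"})
	b := m.add(toyExpr{op: "scan b"})
	eq := m.add(toyExpr{op: "eq on x"})
	j1 := m.add(toyExpr{op: "inner-join", children: []groupID{a, b, eq}})
	j2 := m.add(toyExpr{op: "inner-join", children: []groupID{b, a, eq}})
	fmt.Println(j1 == j2) // true: the commuted join has the same fingerprint
}
```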
### Properties

Properties are meta-information that are maintained at each node in an
expression. Properties power transformations and optimization.

#### Properties vs attributes

The term "property" encompasses information that is well-defined over
any expression in its group: a given scalar property is well-defined
for all scalar operators; a relational property is well-defined for
all relational operators. For example, "nullability" is a property
that is properly defined (and says something meaningful) for any
scalar expression.

In contrast, some bits of information are only relevant for specific
operators. For example, the "join algorithm" is only relevant for join
operators; the "index name" is only relevant for table scan operators,
etc. This operator-specific data is called an *attribute* and is
attached to a particular memo-expression.

#### Logical vs physical properties

Logical properties are maintained for both relational and scalar
operators. A logical property refers to logical information about the
expression such as column equivalencies or functional dependencies.

Physical properties are those that exist outside of the relational
algebra such as row order and data distribution. Physical property
requirements arise both from the query itself (the non-relational
`ORDER BY` operator) and from the selection of specific implementations
during optimization (e.g. a merge-join requires the inputs to be
sorted in a particular order).

By definition, two memo-expressions in the same group have the same
logical properties, and the logical properties are attached to the memo
group. The physical properties for the memo-expressions in a group may
differ. For example, a memo group containing inner-join will also have
hash-join and merge-join implementations which produce the same set of
output rows but in different orders.

#### Relational vs scalar properties

The memo contains memo-expressions with either scalar (e.g. `+`, `<`,
etc.) or relational (e.g. `join`, `project`, etc.) operators,
distinguished as scalar expressions vs relational expressions.

Scalar and relational properties are maintained in separate data
structures, but note that both scalar and relational properties are
considered *logical*.

#### Derived vs required properties

Properties can be *required* or *derived*.

A required property is one specified by the SQL query text. For
example, a DISTINCT clause is a required property on the set of
columns of the corresponding projection -- that the tuple of columns
forms a key (unique values) in the results.

A derived property is one derived by the optimizer for an
expression based on the properties of the children nodes.

For example, in `SELECT k+1 FROM kv`, once the ordering of "k" is
known from kv's descriptor, the same ordering property can be derived
for `k+1`.

During optimization, for each node with required properties the
optimizer will look at the children nodes to check whether their
actual properties (which can be derived) match the requirement. If
they don't, the optimizer must introduce an *enforcer* operator in the
plan that provides the required property.

For example, an `ORDER BY` clause creates a required ordering
property, which can cause the optimizer to add a sort node as an
enforcer of that property.
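A minimal sketch of that enforcement step, under illustrative types
(an ordering as a list of column names, a plan node that records the
ordering it provides):

```go
package main

import "fmt"

// ordering is a toy physical property: column names, in order.
type ordering []string

// satisfies reports whether the derived ordering is a prefix-match
// for the required ordering.
func (derived ordering) satisfies(required ordering) bool {
	if len(derived) < len(required) {
		return false
	}
	for i := range required {
		if derived[i] != required[i] {
			return false
		}
	}
	return true
}

type plan struct {
	op       string
	provided ordering
	input    *plan
}

// enforce returns p itself when it already provides the required
// ordering, and wraps p in a sort enforcer otherwise.
func enforce(p *plan, required ordering) *plan {
	if p.provided.satisfies(required) {
		return p
	}
	return &plan{op: "sort", provided: required, input: p}
}

func main() {
	scan := &plan{op: "scan kv", provided: ordering{"k"}}
	fmt.Println(enforce(scan, ordering{"k"}).op) // "scan kv": no enforcer needed
	fmt.Println(enforce(scan, ordering{"v"}).op) // "sort": enforcer added
}
```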
#### Tracked vs computed properties

A [tracked property](#tracked-properties) is one which is maintained
in a data structure (e.g. `relationalProps`, `scalarProps`,
`physicalProps`). A computed property is one which is computed from an
expression or an expression fragment as needed. For intermediate
nodes, all properties can be computed, which makes tracked properties
akin to a cache. The decision for whether to track or compute a
property is pragmatic. Tracking a property requires overhead whether
the property is used or not, but makes accessing the property in a
transformation fast. Computing a property can be done only when the
property is used, but is not feasible if the computation requires an
entire sub-expression tree (as opposed to a fragment). [Computed
properties](#computed-properties) primarily occur for scalar properties
for which transformations often have the entire scalar expression.

#### Tracked properties

The determination of the properties to track is a key aspect of the
design of an optimizer. Track too many and adding new operators
becomes onerous and maintaining the properties through transformations
becomes expensive. Track too few and certain transformations become
difficult.

Relational properties:

* Output columns [`intset`]. The set of columns output by an
  expression. Used to determine if a predicate is compatible with an
  expression.
* Outer columns [`intset`]. The set of columns that are used by the
  operator but not defined in the underlying expression tree (i.e. not
  supplied by the inputs to the current expression). Synonym: *free
  vars*.
* Not-NULL columns [`intset`]. Column nullability is associated with
  keys, which are a factor in many transformations such as join
  elimination and group-by simplification.
* Keys [`[]intset`]. A set of columns for which no two rows are equal
  after projection onto that set. The simplest example of a key is the
  primary key for a table. Note that a key requires all of the columns
  in the key to be not-NULL.
* Weak keys [`[]intset`]. A set of columns where no two rows
  containing non-NULL values are equal after projection onto that
  set. A UNIQUE index on a table is a weak key and possibly a key if
  all of the columns are not-NULL. Weak keys are tracked because they
  can become keys at higher levels of a query due to null-intolerant
  predicates.
* Foreign keys [`map[intset]intset`]. A set of columns that uniquely
  identify a single row in another relation. In practice, this is a
  map from one set of columns to another set of columns.
* Equivalency groups [`[]intset`]. A set of column groups (sets) where
  all columns in a group are equal to each other.
* Constant columns [`intset`]. Columns for which we know we have a
  single value.

Scalar properties:

* Input columns [`intset`]. The set of columns used by the scalar
  expression. Used to determine if a scalar expression is compatible
  with the output columns of a relational expression.
* Defined columns [`intset`]. The set of columns defined by the scalar
  expression.
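As a rough sketch (not the actual definitions), the relational and
scalar properties above might be laid out as follows, assuming the
bitmap representation from the Prep section for `intset`:

```go
// intset is assumed to be a bitmap of column indexes.
type intset uint64

type relationalProps struct {
	outputCols   intset            // columns output by the expression
	outerCols    intset            // "free vars": used but not supplied by inputs
	notNullCols  intset            // columns known to be not-NULL
	keys         []intset          // each set uniquely identifies a row
	weakKeys     []intset          // unique over rows containing no NULLs
	foreignKeys  map[intset]intset // referencing columns -> referenced columns
	equivGroups  []intset          // groups of mutually equal columns
	constantCols intset            // columns known to hold a single value
}

type scalarProps struct {
	inputCols   intset // columns used by the scalar expression
	definedCols intset // columns defined by the scalar expression
}
```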
Physical properties:

* Column ordering. Specified by a top-level projection.
* Row ordering. Specified by an `ORDER BY` clause or required by a
  physical operator (e.g. merge-join). Row ordering is enforced by the
  **sort** operator.
* Rewindability. Required by multi-use CTEs. Every reference to the
  CTE in the query needs to return the same results. A read-only query
  has this property by default, though care must be taken with regards
  to the [Halloween
  Problem](https://en.wikipedia.org/wiki/Halloween_Problem) if the
  read-only query exists in the context of a DML query. A CTE
  containing a DML statement (such as `INSERT` or `UPDATE`) needs to
  have its results materialized in temporary storage and thus provide
  *rewindability*. This property is enforced using a **spool**
  operator.

Note that this list of properties is not exhaustive. In particular,
there are a large number of scalar properties for which it isn't clear
if the property should be tracked or computed when necessary. For
example, null-tolerance (does a predicate ever return true for a NULL
value) can be computed from a scalar expression when needed. It is an
open question as to whether it is utilized frequently enough that it
should be tracked.

Tracking is a bit more than caching of computed properties: we can't
compute certain relational properties without the entire
sub-expression. Keys are an example: if you have a deeply nested join,
in order to compute the keys after performing a join associativity
transform, you would need to have the entire expression tree. By
tracking the keys property and maintaining it at each relational
expression, we only need the fragment of the expression needed by the
transform.

#### Computed properties

Computed properties are used primarily in conjunction with scalar
expressions. The properties are computed rather than tracked because
we usually have the full scalar expression, vs just a fragment for
relational expressions.

Computed scalar properties:

* Injectivity. An injective expression preserves distinctness: it
  never maps distinct elements of its domain to the same element of
  its codomain. `exp(x) = e^x` is injective.
* Monotonicity. A monotonic expression preserves ordering. The
  preservation may be positive or negative (maintains order or inverts
  order) and strict or weak (maintains uniqueness or invalidates it).
  `floor(x)` is a positive-weak monotonic expression. `-x` is a
  negative-strict monotonic expression.
* Null-intolerance. A null-intolerant expression is a predicate which
  never returns `true` for a `NULL` input column. Null-intolerance is
  used to infer nullability of columns. `x = y` (where `x` and `y` are
  columns) is a null-intolerant expression.
* Contains-aggregate. Does the scalar expression contain any aggregate
  functions?
* Contains-subquery. Does the scalar expression contain any
  subqueries?
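A computed property is just a walk over the full scalar expression.
For example, contains-subquery might look like the following sketch,
where `expr` is the node type from the Prep section and `subqueryOp`
is an assumed operator constant:

```go
// containsSubquery computes the contains-subquery property on demand
// by recursively walking the scalar expression.
func containsSubquery(e *expr) bool {
	if e.op == subqueryOp {
		return true
	}
	for _, child := range e.children {
		if containsSubquery(child) {
			return true
		}
	}
	return false
}
```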
### Transformations

Transformations convert an input expression tree into zero or more
logically equivalent trees. Transformations utilize properties in
order to determine the validity of the transformation. Transforms are
configured with an expression pattern, a check method and an apply
method. The expression pattern is used to identify locations within
the full expression where the transform can be applied. The check
method performs additional checks to determine the validity of a
transformation. And the apply method applies the transformation,
generating zero or more logically equivalent expressions.

Transformations are categorized as *exploration* or
*implementation*. An exploration transformation creates a logical
expression from an existing logical expression. An implementation
transform creates a physical expression from a logical
expression. Note that both exploration and implementation transforms
take as input logical expressions.

Some examples of transformations:

* Join commutativity swaps the order of the inputs to an inner join:
  `[join a b] -> [join b a]`.
* Join associativity reorders the children of a parent and child join:
  `[join [join a b] c]` -> `[join [join a c] b]`
* Join elimination removes unnecessary joins based on projected
  columns and foreign keys.
* Distinct/group-by elimination removes unnecessary distinct/group-by
  operations based on keys.
* Decorrelation replaces correlated subqueries with semi-join,
  anti-join and apply operators.
* Scan to index scan transforms the logical scan operator into one or
  more index scans on covering indexes.
* Inner join to merge-join transforms a logical inner join operator
  into a merge-join operator.

An example transformation is join commutativity. The pattern for join
commutativity is an inner-join:

```
inner-join
 |
 +-- pattern leaf  // left input
 |
 +-- pattern leaf  // right input
 |
 +-- pattern leaf  // join condition
```

An inner-join always has 3 children: the left and right inputs and the
join condition. Join commutativity only needs to swap the left and
right inputs, and thus specifies a pattern leaf for all 3 children.

The actual join commutativity transform is straightforward:

```go
// This is demonstration code, the real implementation will be mildly
// more complex in order to reduce heap allocations.
func (joinCommutativity) apply(e *expr) *expr {
	return &expr{
		op: innerJoinOp,
		children: []*expr{
			e.children[1],
			e.children[0],
			e.children[2],
		},
		props: e.props,
	}
}
```

Note that join commutativity is the simplest transform. More
sophisticated transforms have to perform complex checks for whether
they can be applied to an expression and for generating the resulting
transformed expression. For a slightly more complex example, join
associativity sorts the join conditions between the upper and lower
joins and checks to see if it is creating an undesirable cross-join.
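Putting the pattern/check/apply configuration together, a transform
might be packaged roughly as follows. This interface is a sketch using
the `expr` node from the Prep section; the real shape may differ:

```go
// transform is a sketch of how a transform could be packaged.
type transform interface {
	// pattern returns the expression pattern that identifies where in
	// the full expression the transform can be applied.
	pattern() *expr
	// check performs additional validity checks on a matched expression.
	check(e *expr) bool
	// apply generates zero or more logically equivalent expressions
	// from a matched expression.
	apply(e *expr) []*expr
}
```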
Implicit in the join commutativity example above is that
transformations are written in code. An alternative is to create a
domain specific language for expressing transformations. The benefit
of such a language is the potential for more compact and expressive
transformations. The downside is the need to write a compiler for the
DSL. The current decision is to eschew a DSL for transformations as
the work involved seems strictly greater than writing transformations
in Go. In particular, a DSL would require both the author and reviewer
to learn the DSL. And a DSL doesn't necessarily ease writing a
transformation. Complex transformations may require extensions to the
DSL and the DSL compiler and thus not simplify writing the
transformation at all. In the short and medium term, the set of
transformations is expected to remain small as energies go into
fleshing out other query planning modules. The decision about a DSL
for transformations should be revisited as the transformation set
grows or in the light of experimentation with a DSL that proves its
worth.

### Cost model

The cost model takes as input a physical query plan and computes an
estimated "cost" to execute the plan. The unit of "cost" can be
arbitrary, though it is desirable if it has some real world meaning
such as expected execution time. What is required is for the costs of
different query plans to be comparable. A SQL optimizer is seeking to
find the shortest expected execution time for a query and uses cost as
a proxy for execution time.

Cost is roughly calculated by estimating how much time each node in
the expression tree will use to process all results and modelling how
data flows through the expression tree. [Table statistics](#stats) are
used to power cardinality estimates of base relations, which in turn
power cardinality estimates of intermediate relations. This is
accomplished by propagating histograms of column values from base
relations up through intermediate nodes (e.g. combining histograms
from the two join inputs into a single histogram). Operator-specific
computations model the network, disk and CPU costs. The cost model
should also take into account data layout and the specific operating
environment. For example, network RTT in one cluster might be vastly
different than in another.

The operator-specific computations model the work performed by the
operator. A hash-join needs to model whether temporary disk will be
needed based on the estimated size of the inputs.

Because the cost for a query plan is an estimate, there is an
associated error. This error might be implicit in the cost, or could
be explicitly tracked. One advantage to explicitly tracking the
expected error is that it can allow selecting a higher cost but lower
expected error plan over a lower cost but higher expected error
plan. Where does the error come from? One source is the innate
inaccuracy of stats: selectivity estimation might be wildly off due to
an outlier value. Another source is the accumulated build up of
estimation errors the higher up in the query tree an expression
sits. Lastly, the cost model is making an estimation for the execution
time of an operation such as a network RTT. This estimate can also be
wildly inaccurate due to bursts of activity.

Search finds the lowest cost plan using dynamic programming. That
imposes a restriction on the cost model: it must exhibit optimal
substructure. An optimal solution can be constructed from optimal
solutions of its subproblems.
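A minimal sketch of a cost model with optimal substructure: the cost
of a node is an operator-specific amount of work plus the costs of its
inputs, so an optimal plan is built from optimal subplans. The
per-operator formulas and constants here are purely illustrative:

```go
package main

import (
	"fmt"
	"math"
)

type physPlan struct {
	op     string
	rows   float64 // estimated output cardinality, derived from stats
	inputs []*physPlan
}

// planCost sums operator-specific work over the plan tree.
func planCost(p *physPlan) float64 {
	var c float64
	switch p.op {
	case "scan":
		c = p.rows
	case "sort":
		c = p.rows * math.Log2(p.rows+1)
	case "hash-join":
		// Build a hash table from one input and probe with the other,
		// ignoring spilling to temporary disk.
		c = p.inputs[0].rows + p.inputs[1].rows
	}
	for _, in := range p.inputs {
		c += planCost(in) // optimal substructure: add optimal input costs
	}
	return c
}

func main() {
	a := &physPlan{op: "scan", rows: 10000}
	b := &physPlan{op: "scan", rows: 100}
	join := &physPlan{op: "hash-join", rows: 100, inputs: []*physPlan{a, b}}
	fmt.Println(planCost(join)) // 10000 + 100 (scans) + 10100 (join) = 20200
}
```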
### Search (a.k.a. Enumeration)

Search is the final phase of optimization where many alternative
logical and physical query plans are explored in order to find the
best physical query plan. The output of Search is a physical query
plan to execute. Note that in this context, a physical query plan
refers to a query plan for which the leaves of the tree are table
scans or index scans. In the long term, DistSQL planning will be
incorporated into Search, though in the short term it may be kept
separate.

In order to avoid a combinatorial explosion in the number of
expression trees, Search utilizes the Memo structure. Due to the large
number of possible plans for some queries, Search cannot explore all
of them and thus requires *pruning* heuristics. For example, Search
can cost query plans early and stop exploring a branch of plans if the
cost is greater than the current best cost so far.

Search begins with a Memo populated with the expression provided by
Rewrite. Search is modelled as a series of tasks that optimize an
expression. Conceptually, the tasks form a dependency tree very much
like the dependency tree formed by tools like make. Each task has a
count of its unfinished dependencies and a pointer to its parent
task. When a task is run it is passed its parent task, and as part of
running it can add additional dependencies to its parent, thus making
the tree of dependencies dynamic. After a task is run, it decrements
its parent task's dependency count and schedules the parent for
execution if it was the last dependency. Note that new tasks are only
created if new expressions were added to the memo. Search would not
terminate if we continually created new expressions via
transformations, but that would also indicate that we have unbounded
growth in expressions. In practice, Search will have some limits on
the number of steps it performs or the time it can take.

The initial task for Search is to optimize the "root" group. The tasks
described are the standard Cascades-style search tasks:

1. `OptimizeGroup(reqProps)`. Implements the group (via
   `ImplementGroup`) which generates implementations for the
   expressions in the group, then selects the plan with the least
   estimated cost. Enforcers (e.g. sort) are added as needed.

2. `ImplementGroup`. Explores the group (via `ExploreGroup`) which
   generates more logical expressions in the group and in child
   groups, then generates implementations for all of the logical
   expressions (via `ImplementGroupExpr`). `ImplementGroup` itself
   does not perform any transformations, but acts as a synchronization
   point for dependent tasks.

3. `ImplementGroupExpr`. Implements all of the child groups (via
   `ImplementGroup`), then applies any applicable implementation
   transformations (via `Transform`) to the forest of expressions
   rooted at the specified memo-expression. Example transformation:
   inner-join to merge-join and hash-join.

4. `ExploreGroup`. Explores each expression in the group (via
   `ExploreGroupExpr`). `ExploreGroup` itself does not perform any
   transformations, but acts as a synchronization point for dependent
   tasks.

5. `ExploreGroupExpr`. Explores all of the child groups (via
   `ExploreGroup`), then applies any applicable exploration
   transformations (via `Transform`) to the forest of expressions
   rooted at the specified memo-expression. Example transformations:
   join commutativity and join associativity.

6. `Transform`. Applies a transform to the forest of expressions
   rooted at a particular memo-expression. There are two flavors of
   transformation task: exploration and implementation. The primary
   difference is the state transition after the task finishes. An
   exploration transform task recursively schedules exploration of the
   group it is associated with. An implementation transform task
   schedules optimization of the group.
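The dependency-counted task machinery might look like the following
sketch; the concrete tasks (`OptimizeGroup`, `ImplementGroup`, ...)
are elided and all names are illustrative:

```go
// task has a count of unfinished dependencies and a pointer to its
// parent task.
type task struct {
	parent *task
	deps   int         // count of unfinished dependencies
	run    func(*task) // may add new dependencies, making the tree dynamic
}

type scheduler struct {
	runnable []*task
}

func (s *scheduler) schedule(t *task) {
	s.runnable = append(s.runnable, t)
}

// loop runs tasks until none remain. After a task runs, its parent's
// dependency count is decremented and the parent is scheduled once
// its last dependency finishes.
func (s *scheduler) loop() {
	for len(s.runnable) > 0 {
		t := s.runnable[len(s.runnable)-1]
		s.runnable = s.runnable[:len(s.runnable)-1]
		t.run(t)
		if p := t.parent; p != nil {
			if p.deps--; p.deps == 0 {
				s.schedule(p)
			}
		}
	}
}
```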
A search *stage* is configured by a set of exploration and
implementation transforms, and a *budget*. The budget is used to prune
branches of the search tree which appear undesirable. The initial
search stage has a limited set of exploration and implementation
transforms (perhaps 0 exploration transforms), an unlimited budget,
and aims to quickly find a workable, though possibly slow, plan. Each
subsequent stage uses the cost from the best plan of the previous
stage for pruning. [TODO(peter): my understanding of how this will
work is slightly fuzzy. My usage of the term budget might be
off. Perhaps better to describe it as "max cost".]

Full featured optimizers can contain hundreds of
transformations. Checking whether each transformation is applicable at
each node would be prohibitively expensive, so the transformations are
indexed by the root operator of their pattern. Transformations are
further categorized as exploration and implementation and divided
amongst the search stages based on generality and expected benefit.

Search is naturally parallelizable, yet exploiting that parallelism
involves synchronization overhead. Parallelization also can allow one
query to utilize more planning resources than other queries. Rather
than support parallelization of search, energy will instead be
directed at making search and transformations fast and memory
efficient.

### Testing

Historically, SQL databases have introduced subtle bugs that have
lasted for years via invalid transformations. Search should be
designed for testability. One example of this is to allow verification
that all of the alternate plans generated by Search actually produce
the same result.

In addition to testing the alternative query plans, there is utility
in generating a large number of valid SQL statements. The existing
Random Syntax Generator does one level of this by generating
syntactically valid SQL. An additional level would be to generate
semantically valid queries, which might be more feasible by random
generation at the expression level.

The relational algebra expression trees should provide a textual
format to ease testing using infrastructure similar to the existing
logic tests, where test files define queries and expected results.

Optimization is concerned with making queries faster and it is quite
disturbing to users when inadvertent regressions occur. A large test
suite needs to be developed over time which ensures that the addition
of new transformations or improvements to the various modules do not
cause regressions in the chosen plans.

Generating actual table data with various data distributions for
testing purposes would be both onerous and slow. Table statistics are
a key factor in the decisions performed by search. In order to
adequately test how the behavior of search changes with changing table
statistics, we need an easy mechanism for injecting fake statistics.
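A minimal sketch of the plan equivalence check mentioned above;
`physicalPlan` and `exec` are assumed stand-ins for a plan produced by
Search and a function that executes it, returning the resulting rows
in textual form:

```go
import (
	"reflect"
	"sort"
	"testing"
)

// checkPlansEquivalent executes every alternate plan generated by
// Search for a single query and verifies they all produce the same
// rows.
func checkPlansEquivalent(t *testing.T, plans []physicalPlan,
	exec func(physicalPlan) []string) {
	var expected []string
	for i, p := range plans {
		rows := exec(p)
		sort.Strings(rows) // ignore row order unless an ordering is required
		if i == 0 {
			expected = rows
			continue
		}
		if !reflect.DeepEqual(rows, expected) {
			t.Errorf("plan %d returned different results", i)
		}
	}
}
```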
## Roadmap

The above outline sketches a large amount of work. How do we get there
from here? A strawman proposal divides the work into several
releases. The farther out the proposed work, the fuzzier the proposal
becomes.

### 2.0

* Stats. Stats are not dependent on other planning modules but are a
  prerequisite to cost-based transformations. Stats are only generated
  explicitly via `CREATE STATISTICS`.

* Prep. Introduce the expression tree. Construct the expression tree
  from the existing AST output by the parser. Use the AST-based type
  checking and name resolution. The existing AST-based planning code
  will be left in place and a parallel world of expression-based
  planning will be erected. The new planning code will not be used in
  this release.

* Rewrite. Predicate inference and predicate push down.

* Memo. Introduce the memo structure.

* Testing. Use ugly hacks to hook up a hobbled version of something as
  an alternate query planner. Perhaps a flag to pass queries through
  the expression format and memo and then translate them back into the
  AST in order to use the legacy planner.

### 2.1

* Stats. Automatically gather stats on PKs and index columns.

* Prep. Perform name resolution and type checking on the expression
  tree. Support non-recursive CTEs. Fall back to the legacy planning
  code for unsupported queries.

* Rewrite. Transform correlated subqueries into apply
  variants. Transform common apply variants into joins.

* Execution. Nested-loop-join, semi-join, anti-join and apply
  processors.

* Cost model. Basic cost model that is powered by stats.

* Search. Task-based single stage search. No pruning. Use existing
  DistSQL planning. Facility for time-travel debugging of the search
  process and inspecting the memo state (e.g. logical and physical
  properties). Global and per-session disablement of individual
  transforms.

* Transforms. Join elimination, distinct/group-by elimination, join
  commutativity, join associativity, index selection, and scalar
  normalization.

* Testing. Random generation of table data based on schema and query
  to exercise corner conditions. Random sampling and execution of
  alternate query plans to verify equivalence. Test suite for plan
  selection using injected stats.

### 2.2

* Stats. Support more advanced statistics (e.g. filtered statistics).

* Prep. Support 100% of queries, enabling the deletion of the legacy
  planning code.

* Cost model. Make the cost model more sophisticated by taking into
  account measurements of network bandwidth and latency. Validate the
  cost model against actual queries.

* Search. Add multiple stages with pruning heuristics. Integrate
  DistSQL planning.

* Transforms. Pull group-by above a join. Push group-by below a
  join. Split group-by into local and global components. Simplify
  outer joins.

* Execution. Stream-group-by.

## Unresolved questions

* Flesh out understanding of where physical properties such as
  ordering can be imposed by the query itself. For example, a
  top-level `ORDER BY` clause definitely imposes ordering. But so does
  an `ORDER BY` clause that is the immediate sub-expression of
  `LIMIT/OFFSET`, `DISTINCT ON`, `WITH ORDINALITY`,
  `{INSERT,UPSERT,DELETE,UPDATE}` and `CREATE TABLE ... AS ...`. We
  also need to pay attention to `ORDER BY INDEX` and `ORDER BY PRIMARY
  KEY`, though those clauses likely degenerate into `ORDER
  BY`. Are there other places we need to pay attention to physical
  properties? Are there other physical properties to capture at
  intermediate nodes?

* Which parts of query planning can be performed during PREPARE vs
  EXECUTE? Most (all?) of the transformations that are part of Rewrite
  can be performed during PREPARE.
  For example, predicate push-down
  and decorrelation do not require placeholder values. And some parts
  of Search, such as join enumeration, can be performed during
  PREPARE. The parts that are restricted to EXECUTE are certain parts
  of index selection and thus the costing of query plans.

* The performance of the query planner itself is important because
  query planning occurs for every query executed. What sorts of fast
  paths are possible for simple queries?

* Window functions.

* Describe the max1row operator and why it is necessary.

## Appendix

### Expr/Memo examples

Consider the query:

```sql
SELECT v, k FROM kv WHERE k < 3
```

Building the expression tree results in:

```
project [out=(0,1)]
  columns: kv.v:1 kv.k:0
  projections:
    variable (kv.v) [in=(1)]
    variable (kv.k) [in=(0)]
  inputs:
    select [out=(0,1)]
      columns: kv.k:0* kv.v:1
      filters:
        lt [in=(0)]
          inputs:
            variable (kv.k) [in=(0)]
            const (3)
      inputs:
        scan [out=(0,1)]
          columns: kv.k:0 kv.v:1
```

Some points to notice above. The relational operators (`project`,
`select` and `scan`) track their output column set as a bitmap
(i.e. `out=(0,1)`). Scalar expressions such as `variable` and `lt`
track their required input columns. Relational operators have a slice
of children where the interpretation of the children is operator
specific. The `project` operator has 2 children: a relational input
and a list of projections. Note that the order of projections is
important and they are stored using an `ordered-list` operator in the
memo. The `select` operator also has 2 children: a relational input
and a list of filters.

Inserting the expression tree into the memo results in:

```
8: [project [5 7]]
7: [ordered-list [6 2]]
6: [variable kv.v]
5: [select [1 4]]
4: [lt [2 3]]
3: [const 3]
2: [variable kv.k]
1: [scan kv]
```

Here we can see more clearly the child structure of the various
relational operators. The `select` expression in group 5 has 2
children: groups 1 and 4. Group 1 is a `scan` and group 4 is the
filter.

As another example, consider the query:

```sql
SELECT k, v FROM (SELECT v, k FROM kv)
```

Inserting into the memo we get:

```
7: [project [5 6]]
6: [ordered-list [3 2]]
5: [project [1 4]]
4: [ordered-list [2 3]]
3: [variable kv.k]
2: [variable kv.v]
1: [scan kv]
```

Notice that the variables (`kv.k` and `kv.v`) are only present once in
the memo and their groups are shared by both projection lists.