- Feature Name: SQL Query Planning
- Status: in-progress
- Start Date: 2017-12-13
- Authors: Peter Mattis
- RFC PR: #19135
- Cockroach Issue: (one or more # from the issue tracker)

# Summary

This RFC sketches the outlines of the high-level modules of SQL query
planning, including a full-featured optimizer.

# Motivation

SQL query planning is concerned with transforming the AST of a SQL
query into a physical query plan for execution. Naive execution of a
SQL query can be prohibitively expensive, because SQL specifies the
desired results and not how to achieve them. A given SQL query can
have thousands of alternate query plans with vastly different
execution times. The techniques used to generate and select a good
query plan involve significant engineering challenges.

This RFC is intended to provide guidance for both short term and long
term work on the SQL optimizer, and highlight areas of the current
system that will need to evolve.

# Guide-level explanation

## Overview

SQL query planning is often described in terms of 8 modules:

1. [Stats](#stats)
2. [Prep](#prep)
3. [Rewrite](#rewrite)
4. [Memo](#memo)
5. [Cost Model](#cost-model)
6. [Search](#search-aka-enumeration)
7. [Properties](#properties)
8. [Transformations](#transformations)

Note that Stats, Cost Model, Memo, Properties and Transformations
could be considered modules, while Prep, Rewrite and Search could be
considered phases, though we'll refer to all 8 uniformly as modules in
this document. Memo is a technique for compactly representing the
forest of trees generated during Search. Stats, Properties, Cost Model
and Transformations are modules that power Prep, Rewrite and Search.

```
                     SQL query text
                           |
                     +-----v-----+
                     |   Parse   |
                     +-----+-----+
                           |
                         (ast)
                           |
   +-------+         +-----v-----+   - constant folding, type checking, name resolution
   | Stats +-------->|   Prep    |   - computes initial properties
   +-------+         +-----+-----+   - retrieves and attaches stats
                           |         - done once per PREPARE
                         (expr)
                           |
                     +-----v-----+   - capture placeholder values / timestamps
                +--->|  Rewrite  |   - cost-agnostic transformations, e.g. predicate push-down
+------------+  |    +-----+-----+   - done once per EXECUTE
| Transforms +--+          |
+------------+  |        (expr)
                |          |
                +--->+-----v-----+   - cost-based transformations
+------------+       |  Search   |   - finds lowest cost physical plan
| Cost Model +------>+-----+-----+   - includes DistSQL physical planning
+------------+             |
                    (physical plan)
                           |
                     +-----v-----+
                     | Execution |
                     +-----------+
```

CockroachDB already has implementations of portions of these modules,
except for Stats and Memo. For example, CockroachDB performs name
resolution and type checking, which is part of Prep, and performs
predicate push down through joins, which traditionally happens during
Rewrite. CockroachDB utilizes a primitive cost model during index
selection (a portion of Search) to choose which index to use based on
filters and desired ordering.

In addition to the 8 modules, another aspect of the optimizer that
needs discussion is [Testing](#testing) and test infrastructure.

Lastly, a strawman [Roadmap](#roadmap) is proposed for how to break up
this work over the next several releases.
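To summarize the diagram above as code, here is a toy sketch of how
the phases compose. All names and types are illustrative stand-ins,
not the actual CockroachDB API:

```go
package main

import "fmt"

// Hypothetical stand-in types; the real structures are sketched in
// the sections that follow.
type ast struct{ sql string }
type expr struct{ desc string }
type physicalPlan struct{ desc string }

// parse transforms the SQL query text into an AST.
func parse(sql string) *ast { return &ast{sql: sql} }

// prep performs name resolution and type checking and computes
// initial properties. Done once per PREPARE.
func prep(a *ast) *expr { return &expr{desc: "prep(" + a.sql + ")"} }

// rewrite applies cost-agnostic transformations such as predicate
// push-down. Done once per EXECUTE.
func rewrite(e *expr) *expr { return &expr{desc: "rewrite(" + e.desc + ")"} }

// search applies cost-based transformations and returns the lowest
// cost physical plan.
func search(e *expr) physicalPlan {
	return physicalPlan{desc: "search(" + e.desc + ")"}
}

func main() {
	plan := search(rewrite(prep(parse("SELECT 1"))))
	fmt.Println(plan.desc) // search(rewrite(prep(SELECT 1)))
}
```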
## Glossary

The following terms are introduced/defined in this RFC:

- [**algebraic equivalence**](#properties)
- [**attributes** of expressions](#properties-vs-attributes)
- [**cardinality**](#stats)
- [**decorrelating**](#rewrite), syn. "unnesting"
- [**derived** vs **required** properties](#properties)
- [**enforcer** operator for properties](#properties)
- [**equivalence class**](#memo)
- [**exploration** vs **implementation** transformations](#search-aka-enumeration)
- [**expressions** in queries](#prep)
- [**functional dependencies**](#prep)
- [**logical** vs **physical** properties](#memo)
- [**logical** vs **physical** vs **scalar** operators](#prep)
- [**memo-expressions**](#memo)
- [**operator** in query expressions](#prep)
- [**pattern** in transformations](#memo)
- [**predicate push-down**](#rewrite)
- [**prep** phase](#prep)
- **properties** of expressions [1](#memo) [2](#properties)
- [**pruning** during search](#search-aka-enumeration)
- [**query text**](#modules)
- [**rewrite** phase](#rewrite)
- [**scalar** vs **relational** properties](#properties)
- [**search** phase](#search-aka-enumeration)
- [**selectivity**](#stats)
- [**tracked** vs **computed** properties](#properties)
- [**transformation** of expressions](#rewrite)
- [**unnesting**](#rewrite), syn. "decorrelating"

## Modules

The parse phase is not discussed in this RFC. It handles the
transformation of the *SQL query text* into an abstract syntax tree
(AST).

### Prep

*Prep* (short for "prepare") is the first phase of query optimization,
where the AST is transformed into a form more suitable for
optimization and annotated with information that will be used by later
phases. Prep includes resolving table and column references (i.e. name
resolution) and type checking, both of which are already performed by
CockroachDB.

During Prep, the AST is transformed from the raw output of the parser
into an expression "tree".

```go
type operator int16

type expr struct {
	op              operator
	children        []*expr
	relationalProps *relationalProps // See [relational properties](#tracked-properties)
	scalarProps     *scalarProps     // See [scalar properties](#tracked-properties)
	physicalProps   *physicalProps   // See [physical properties](#tracked-properties)
	private         interface{}
}
```

The term *"expression"* here is based on usage from literature, though
it is mildly confusing as the current SQL code uses "expression" to
refer to scalar expressions. In this document, "expression" refers to
either a relational or a scalar expression. Using a uniform node type
for expressions facilitates transforms used during the Rewrite and
Search phases of optimization.

Each expression has an *operator* and zero or more operands
(`expr.children`). Operators can be *relational* (e.g. `join`) or
*scalar* (e.g. `<`). Relational operators can be *logical* (only
specifies results) or *physical* (specifies both result and a
particular implementation).

During Prep all the columns are given a unique index (number). Column
numbering involves assigning every base column and non-trivial
projection in a query a unique query-specific index.

Giving each column a unique index allows the expression nodes
mentioned above to track input and output columns, or really any set
of columns during Prep and later phases, using a bitmap. The bitmap
representation allows fast determination of compatibility between
expression nodes and is utilized during rewrites and transformations
to determine the legality of such operations.
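To make the bitmap representation concrete, here is a minimal sketch.
The `colSet` type, the 64-column limit and the column numbering are
illustrative assumptions, not the actual implementation, which would
need a sparse fallback for wider queries:

```go
package main

import "fmt"

// colSet is a toy column-set bitmap, assuming a query references at
// most 64 columns.
type colSet uint64

func (s colSet) add(col uint) colSet    { return s | (1 << col) }
func (s colSet) subsetOf(t colSet) bool { return s&^t == 0 }

func main() {
	// For SELECT * FROM a JOIN b USING (x) WHERE a.x < 10, number the
	// columns a.x=0, a.y=1, b.x=2, b.z=3.
	var scanA, scanB, pred colSet
	scanA = scanA.add(0).add(1) // output columns of the scan of a
	scanB = scanB.add(2).add(3) // output columns of the scan of b
	pred = pred.add(0)          // input columns of the predicate a.x < 10

	// The predicate is compatible with the scan of a but not the scan
	// of b, so it can legally be pushed below the join onto a's scan.
	fmt.Println(pred.subsetOf(scanA), pred.subsetOf(scanB)) // true false
}
```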
The Prep phase also computes *logical properties*, such as the input
and output columns of each (sub-)expression, equivalent columns,
not-null columns and functional dependencies.

The functional dependencies for an expression are constraints over one
or more sets of columns. Specific examples of functional dependencies
are projections, where one or more input columns determine an output
column, and "keys", which are sets of columns where no two rows output
by the expression are equal after projection onto that set (e.g. a
unique index for a table where all of the columns are NOT
NULL). Conceptually, the functional dependencies form a graph, though
they are not represented as such in code.

### Rewrite

The second phase of query optimization is *rewrite*. The rewrite phase
performs *transformations* on the logical query tree which are always
beneficial (i.e. cost-agnostic).

A transformation transforms a (part of a) query into another. Note
that there is conceptual overlap with the Search phase, which also
performs transformations on the query. Both phases employ
transformations, yet Search needs to track and cost the alternatives
while Rewrite does not. In the specific context of the rewrite phase,
transformations are commonly called *rewrites*.

During Rewrite, the previous version of an expression is
discarded. During Search, both the original and new expression are
preserved side-by-side as alternatives; see the [section
below](#search-aka-enumeration) for details.

Also note that some of the transformations performed by Rewrite need
not be performed again by Search (decorrelation is the prime
example). The vast majority of transforms performed by Search are not
used by Rewrite.

Rewrite is the phase where e.g. correlated subqueries are
*decorrelated* (synonym: *unnesting*), additional predicates are
inferred and *predicate push down* occurs, and various other
simplifications are made to the relational algebra tree (e.g.
projection & join elimination). As an example of predicate push down,
consider the query:

```sql
SELECT * FROM a JOIN b USING (x) WHERE a.x < 10
```

The naive execution of this query retrieves all rows from `a` and `b`,
joins (i.e. filters) them on the variable `x`, and then filters them
again on `a.x < 10`. Predicate push down attempts to push down the
predicate `a.x < 10` below the join. This can obviously be done for
the scan from `a`:

```sql
SELECT * FROM (SELECT * FROM a WHERE a.x < 10) AS a JOIN b USING (x)
```

Slightly more complicated, we can also generate a new predicate using
the functional dependence that `a.x = b.x` (due to the join
predicate):

```sql
SELECT * FROM
  (SELECT * FROM a WHERE a.x < 10) AS a JOIN
  (SELECT * FROM b WHERE b.x < 10) AS b USING (x)
```

Predicate push down is aided by predicate inference.
Consider the query:

```sql
SELECT * FROM a JOIN b USING (x)
```

Due to the join condition, we can infer the predicates `a.x IS NOT
NULL` and `b.x IS NOT NULL`:

```sql
SELECT * FROM a JOIN b USING (x)
WHERE a.x IS NOT NULL AND b.x IS NOT NULL
```

And predicate push down can push these predicates through the join:

```sql
SELECT * FROM
  (SELECT * FROM a WHERE a.x IS NOT NULL) AS a JOIN
  (SELECT * FROM b WHERE b.x IS NOT NULL) AS b USING (x)
```

### Stats

Table statistics power both the cost model and the search of alternate
query plans. A simple example of where statistics guide the search of
alternate query plans is join ordering:

```sql
SELECT * FROM a JOIN b ON a.x = b.x
```

In the absence of other opportunities, this might be implemented as a
hash join. With a hash join, we want to load the smaller set of rows
(either from `a` or `b`) into the hash table and then query that table
while looping through the larger set of rows. How do we know whether
`a` or `b` is larger? We keep statistics about the *cardinality* of `a`
and `b`, i.e. the (approximate) number of rows in each.

Simple table cardinality is sufficient for the above query but fails
in other queries. Consider:

```sql
SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y > 10
```

Table statistics might indicate that `a` contains 10x more data than
`b`, but the predicate `a.y > 10` is filtering a chunk of the
table. What we care about is whether the result of the scan of `a`
after filtering returns more rows than the scan of `b`. This can be
accomplished by estimating the *selectivity* of the predicate
`a.y > 10` (the fraction of rows it passes) and then multiplying that
selectivity by the cardinality of `a`. The common technique for
estimating selectivity is to collect a histogram on `a.y` (prior to
running the query).

The collection of table statistics occurs prior to receiving the
query. As such, the statistics are necessarily out of date and may be
inaccurate. The system may bound the inaccuracy by recomputing the
stats based on how fast a table is being modified. Or the system may
notice when stat estimations are inaccurate during query execution.

[A separate RFC covers statistics collection in
CockroachDB.](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170908_sql_optimizer_statistics.md)
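To make the selectivity arithmetic concrete, here is a minimal sketch
of estimating rows for `a.y > 10` from an assumed histogram. The
bucket layout and the whole-bucket matching rule are simplifying
assumptions, not how the statistics RFC defines histograms:

```go
package main

import "fmt"

// bucket is a toy histogram bucket: all values in the bucket are
// less than or equal to upperBound.
type bucket struct {
	upperBound int64
	numRows    int64
}

// selectivityGT estimates the fraction of rows with value > v.
// Partially matched buckets are counted in full, which keeps the
// sketch simple at the cost of accuracy.
func selectivityGT(h []bucket, v int64) float64 {
	var total, matched int64
	for _, b := range h {
		total += b.numRows
		if b.upperBound > v {
			matched += b.numRows
		}
	}
	if total == 0 {
		return 0
	}
	return float64(matched) / float64(total)
}

func main() {
	// Hypothetical histogram on a.y, collected before the query ran.
	hist := []bucket{{5, 400}, {10, 350}, {50, 200}, {100, 50}}
	cardinality := int64(1000)
	sel := selectivityGT(hist, 10)
	fmt.Printf("selectivity=%.2f, estimated rows=%d\n",
		sel, int64(sel*float64(cardinality))) // selectivity=0.25, estimated rows=250
}
```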
### Memo

Memo is a data structure for efficiently storing a forest of query
plans. Conceptually, the memo is composed of a numbered set of
**equivalency classes** called **groups** where each group contains a
set of logically equivalent expressions. The different expressions in
a single group are called **memo-expressions** (memo-ized
expressions). While an expression node outside of the memo contains a
list of child expressions, a memo-expression contains a list of child
groups.

By definition, all the memo-expressions in a group share the same
*logical properties*, a concept explored more in depth in the [section
below](#properties). The memo-expression structure mirrors the
expression structure:

```go
type exprID int32
type groupID int32

type memoExpr struct {
	op            operator
	children      []groupID
	physicalProps *physicalProps
	private       interface{}
}

type memoGroup struct {
	exprs           []memoExpr
	relationalProps *relationalProps
	scalarProps     *scalarProps
}
```

Transformations are *not* performed directly on the memo because
transformations operate on trees while the memo models a forest of
trees. Instead, expression fragments are extracted from the memo,
transformed, and re-inserted into the memo. At first glance, this
seems onerous and inefficient, but it allows transformations to be
written more naturally and the extraction of expression fragments can
be performed efficiently.

Extracting an expression fragment for transformation is performed via
a process called *binding*. Binding allows iterating over all of the
expressions matching a pattern that are rooted at a particular
memo-expression. A pattern is specified using the same expression
structure that is to be extracted, with the addition of "pattern-leaf"
and "pattern-tree" placeholders that act as wildcards:

* A **pattern leaf** matches any expression tree, with only the root
  of the tree being retained in the bound expression. It is used when
  the expression is used opaquely by the transformation. In other
  words, the transformation doesn't care what's inside the subtree. It
  is a "leaf" in the sense that it's a leaf in any binding matching a
  pattern.
* A **pattern tree** matches any expression tree and indicates that
  recursive extraction of the full subtree is required. It is
  typically used for scalar expressions when some manipulation of that
  expression is required by the transformation. Note that a pattern
  tree results in all possible subtrees being enumerated, however
  scalar expressions typically don't have many subtrees (if there are
  no subqueries, there is only one subtree). [TODO(peter): what to do
  about subqueries in a scalar context? Iterating over all of the
  subquery expressions doesn't seem right. There is a TODO in `opttoy`
  to cache scalar expressions in `memoGroup`. Need to investigate this
  further.]

To better understand the structure of the memo, consider the query:

```sql
SELECT * FROM a, b WHERE a.x = b.x
```

Converted to the expression structure, which models the extended
relational algebra, the query looks like:

```
inner-join [columns: a.x a.y b.x b.z]
  filters:
    eq
      inputs:
        variable (a.x)
        variable (b.x)
  inputs:
    scan [columns: a.x a.y]
    scan [columns: b.x b.z]
```

Inserting the expression tree into the memo results in:

```
6: [inner-join [1 2 5]]
5: [eq [3 4]]
4: [variable b.x]
3: [variable a.x]
2: [scan b]
1: [scan a]
```

Memo groups are numbered by when they were created and the groups are
topologically sorted for display (this is an implementation detail and
not intended to be prescriptive). In the above example, each group
contains only a single memo-expression.
After performing the join commutativity transformation, the memo would
expand:

```
6: [inner-join [1 2 5]] [inner-join [2 1 5]]
5: [eq [3 4]]
4: [variable b.x]
3: [variable a.x]
2: [scan b]
1: [scan a]
```

Memo groups contain logically equivalent expressions, but two
logically equivalent expressions may not be placed in the same memo
group. This occurs because determining the logical equivalence of two
relational expressions is complex to perform 100% correctly. A
correctness failure (i.e. considering two expressions logically
equivalent when they are not) results in invalid transformations and
invalid plans. Placing two logically equivalent expressions in
different groups has a much gentler failure mode: the memo and search
are less efficient.

Insertion of an expression into the memo is performed by recursively
inserting all of the sub-expressions into the memo and then computing
a **fingerprint** for the memo-expression. The fingerprint for a
memo-expression is simply the expression operator and the list of
child groups. For example, in the memo examples above, the fingerprint
for the first inner-join expression is `[inner-join [1 2 5]]`. The
memo maintains a map from expression fingerprint to memo group which
allows quick determination of whether an expression fragment already
exists in the memo. A small amount of operator-specific normalization
is performed when computing the group fingerprint for a
memo-expression. For example, the left and right inputs of an
inner-join are output in sorted order, which results in the expressions
`[inner-join [1 2 5]]` and `[inner-join [2 1 5]]` having the same
group fingerprint. The operator-specific normalization is
conservative. The common case for placing logically equivalent
expressions in the same group is adherence to the invariant that
transformed expressions are logically equivalent to their input.

```go
type memo struct {
	// Map from memo-expression "group" fingerprint to group ID.
	groupMap map[string]groupID
	groups   []memoGroup
}
```

In addition to memo expressions, memo groups also contain a map from
desired physical properties to optimization state for the group for
those properties. This state is discussed more in
[Search](#search-aka-enumeration).

A **location** within the memo identifies a particular memo-expression
by its group and expression number. When an expression fragment is
extracted from the memo, each `expr` is tagged with the location it
originated from in the memo. This allows subsequent reinsertion of a
transformed expression into the memo to quickly determine which groups
the expression nodes should be added to.

```go
type memoLoc struct {
	group groupID
	expr  exprID
}
...
type expr struct {
	op  operator
	loc memoLoc
	...
}
```

The above depictions of the memo structures are simplified for
explanatory purposes. The actual structures are similar, though
optimized to reduce allocations.
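As a toy illustration of fingerprint-based insertion and the
inner-join normalization described above, here is a minimal sketch.
Operators are strings and all properties are omitted; this is not the
CockroachDB implementation:

```go
package main

import (
	"fmt"
	"sort"
)

type groupID int32

// toyExpr is a stripped-down memoExpr.
type toyExpr struct {
	op       string
	children []groupID
}

// fingerprint is the operator plus the list of child groups. For
// inner-join, the left and right inputs are output in sorted order so
// that [inner-join [1 2 5]] and [inner-join [2 1 5]] map to the same
// group.
func (e toyExpr) fingerprint() string {
	children := append([]groupID(nil), e.children...)
	if e.op == "inner-join" && len(children) >= 2 {
		inputs := children[:2]
		sort.Slice(inputs, func(i, j int) bool { return inputs[i] < inputs[j] })
	}
	return fmt.Sprint(e.op, children)
}

type toyMemo struct {
	groupMap map[string]groupID
	groups   [][]toyExpr
}

// add returns the existing group when an expression with the same
// fingerprint was inserted before, and creates a new group otherwise.
func (m *toyMemo) add(e toyExpr) groupID {
	f := e.fingerprint()
	if g, ok := m.groupMap[f]; ok {
		return g
	}
	g := groupID(len(m.groups))
	m.groups = append(m.groups, []toyExpr{e})
	m.groupMap[f] = g
	return g
}

func main() {
	m := &toyMemo{groupMap: map[string]groupID{}}
	a := m.add(toyExpr{op: "scan a"})
	b := m.add(toyExpr{op: "scan b"})
	eq := m.add(toyExpr{op: "eq on x"})
	j1 := m.add(toyExpr{op: "inner-join", children: []groupID{a, b, eq}})
	j2 := m.add(toyExpr{op: "inner-join", children: []groupID{b, a, eq}})
	fmt.Println(j1 == j2) // true: the commuted join has the same fingerprint
}
```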
### Properties

Properties are meta-information that are maintained at each node in an
expression. Properties power transformations and optimization.

#### Properties vs attributes

The term "property" encompasses information that is well-defined over
any expression in its group: a given scalar property is well-defined
for all scalar operators; a relational property is well-defined for
all relational operators. For example, "nullability" is a property
that is properly defined (and says something meaningful) for any
scalar expression.

In contrast, some bits of information are only relevant for specific
operators. For example, the "join algorithm" is only relevant for join
operators; the "index name" is only relevant for table scan operators,
etc. This operator-specific data is called an *attribute* and is
attached to a particular memo-expression.

#### Logical vs physical properties

Logical properties are maintained for both relational and scalar
operators. A logical property refers to logical information about the
expression such as column equivalencies or functional dependencies.

Physical properties are those that exist outside of the relational
algebra such as row order and data distribution. Physical property
requirements arise both from the query itself (the non-relational
`ORDER BY` operator) and from the selection of specific implementations
during optimization (e.g. a merge-join requires the inputs to be
sorted in a particular order).

By definition, two memo-expressions in the same group have the same
logical properties, and the logical properties are attached to the memo
group. The physical properties for the memo-expressions in a group may
differ. For example, a memo group containing inner-join will also have
hash-join and merge-join implementations which produce the same set of
output rows but in different orders.

#### Relational vs scalar properties

The memo contains memo-expressions with either scalar (e.g. `+`, `<`,
etc.) or relational (e.g. `join`, `project`, etc.) operators,
distinguished as scalar expressions vs relational expressions.

Scalar and relational properties are maintained in separate data
structures, but note that both scalar and relational properties are
considered *logical*.

#### Derived vs required properties

Properties can be *required* or *derived*.

A required property is one specified by the SQL query text. For
example, a DISTINCT clause is a required property on the set of
columns of the corresponding projection -- that the tuple of columns
forms a key (unique values) in the results.

A derived property is one derived by the optimizer for an
expression based on the properties of the children nodes.

For example, in `SELECT k+1 FROM kv`, once the ordering of "k" is
known from kv's descriptor, the same ordering property can be derived
for `k+1`.

During optimization, for each node with required properties the
optimizer will look at the children nodes to check whether their
actual properties (which can be derived) match the requirement. If
they don't, the optimizer must introduce an *enforcer* operator in the
plan that provides the required property.

For example, an `ORDER BY` clause creates a required ordering
property, which can cause the optimizer to add a sort node as an
enforcer of that property.
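A minimal sketch of that enforcement step, under illustrative types
(an ordering as a list of column names, a plan node that records the
ordering it provides):

```go
package main

import "fmt"

// ordering is a toy physical property: column names, in order.
type ordering []string

// satisfies reports whether the derived ordering is a prefix-match
// for the required ordering.
func (derived ordering) satisfies(required ordering) bool {
	if len(derived) < len(required) {
		return false
	}
	for i := range required {
		if derived[i] != required[i] {
			return false
		}
	}
	return true
}

type plan struct {
	op       string
	provided ordering
	input    *plan
}

// enforce returns p itself when it already provides the required
// ordering, and wraps p in a sort enforcer otherwise.
func enforce(p *plan, required ordering) *plan {
	if p.provided.satisfies(required) {
		return p
	}
	return &plan{op: "sort", provided: required, input: p}
}

func main() {
	scan := &plan{op: "scan kv", provided: ordering{"k"}}
	fmt.Println(enforce(scan, ordering{"k"}).op) // "scan kv": no enforcer needed
	fmt.Println(enforce(scan, ordering{"v"}).op) // "sort": enforcer added
}
```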
#### Tracked vs computed properties

A [tracked property](#tracked-properties) is one which is maintained
in a data structure (e.g. `relationalProps`, `scalarProps`,
`physicalProps`). A computed property is one which is computed from an
expression or an expression fragment as needed. For intermediate
nodes, all properties can be computed, which makes tracked properties
akin to a cache. The decision for whether to track or compute a
property is pragmatic. Tracking a property requires overhead whether
the property is used or not, but makes accessing the property in a
transformation fast. Computing a property can be done only when the
property is used, but is not feasible if the computation requires an
entire sub-expression tree (as opposed to a fragment). [Computed
properties](#computed-properties) primarily occur for scalar properties
for which transformations often have the entire scalar expression.

#### Tracked properties

The determination of the properties to track is a key aspect of the
design of an optimizer. Track too many and adding new operators
becomes onerous and maintaining the properties through transformations
becomes expensive. Track too few and certain transformations become
difficult.

Relational properties:

* Output columns [`intset`]. The set of columns output by an
  expression. Used to determine if a predicate is compatible with an
  expression.
* Outer columns [`intset`]. The set of columns that are used by the
  operator but not defined in the underlying expression tree (i.e. not
  supplied by the inputs to the current expression). Synonym: *free
  vars*.
* Not-NULL columns [`intset`]. Column nullability is associated with
  keys, which are a factor in many transformations such as join
  elimination and group-by simplification.
* Keys [`[]intset`]. A set of columns for which no two rows are equal
  after projection onto that set. The simplest example of a key is the
  primary key for a table. Note that a key requires all of the columns
  in the key to be not-NULL.
* Weak keys [`[]intset`]. A set of columns where no two rows
  containing non-NULL values are equal after projection onto that
  set. A UNIQUE index on a table is a weak key and possibly a key if
  all of the columns are not-NULL. Weak keys are tracked because they
  can become keys at higher levels of a query due to null-intolerant
  predicates.
* Foreign keys [`map[intset]intset`]. A set of columns that uniquely
  identify a single row in another relation. In practice, this is a
  map from one set of columns to another set of columns.
* Equivalency groups [`[]intset`]. A set of column groups (sets) where
  all columns in a group are equal to each other.
* Constant columns [`intset`]. Columns for which we know we have a
  single value.

Scalar properties:

* Input columns [`intset`]. The set of columns used by the scalar
  expression. Used to determine if a scalar expression is compatible
  with the output columns of a relational expression.
* Defined columns [`intset`]. The set of columns defined by the scalar
  expression.
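As a rough sketch (not the actual definitions), the relational and
scalar properties above might be laid out as follows, assuming the
bitmap representation from the Prep section for `intset`:

```go
// intset is assumed to be a bitmap of column indexes.
type intset uint64

type relationalProps struct {
	outputCols   intset            // columns output by the expression
	outerCols    intset            // "free vars": used but not supplied by inputs
	notNullCols  intset            // columns known to be not-NULL
	keys         []intset          // each set uniquely identifies a row
	weakKeys     []intset          // unique over rows containing no NULLs
	foreignKeys  map[intset]intset // referencing columns -> referenced columns
	equivGroups  []intset          // groups of mutually equal columns
	constantCols intset            // columns known to hold a single value
}

type scalarProps struct {
	inputCols   intset // columns used by the scalar expression
	definedCols intset // columns defined by the scalar expression
}
```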
Physical properties:

* Column ordering. Specified by a top-level projection.
* Row ordering. Specified by an `ORDER BY` clause or required by a
  physical operator (e.g. merge-join). Row ordering is enforced by the
  **sort** operator.
* Rewindability. Required by multi-use CTEs. Every reference to the
  CTE in the query needs to return the same results. A read-only query
  has this property by default, though care must be taken with regards
  to the [Halloween
  Problem](https://en.wikipedia.org/wiki/Halloween_Problem) if the
  read-only query exists in the context of a DML query. A CTE
  containing a DML statement (such as `INSERT` or `UPDATE`) needs to
  have its results materialized in temporary storage and thus provide
  *rewindability*. This property is enforced using a **spool**
  operator.

Note that this list of properties is not exhaustive. In particular,
there are a large number of scalar properties for which it isn't clear
if the property should be tracked or computed when necessary. For
example, null-tolerance (does a predicate ever return true for a NULL
value) can be computed from a scalar expression when needed. It is an
open question as to whether it is utilized frequently enough that it
should be tracked.

Tracking is a bit more than caching of computed properties: we can't
compute certain relational properties without the entire
sub-expression. Keys are an example: if you have a deeply nested join,
in order to compute the keys after performing a join associativity
transform, you would need to have the entire expression tree. By
tracking the keys property and maintaining it at each relational
expression, we only need the fragment of the expression needed by the
transform.

#### Computed properties

Computed properties are used primarily in conjunction with scalar
expressions. The properties are computed rather than tracked because
we usually have the full scalar expression, vs just a fragment for
relational expressions.

Computed scalar properties:

* Injectivity. An injective expression preserves distinctness: it
  never maps distinct elements of its domain to the same element of
  its codomain. `exp(x) = e^x` is injective.
* Monotonicity. A monotonic expression preserves ordering. The
  preservation may be positive or negative (maintains order or inverts
  order) and strict or weak (maintains uniqueness or invalidates it).
  `floor(x)` is a positive-weak monotonic expression. `-x` is a
  negative-strict monotonic expression.
* Null-intolerance. A null-intolerant expression is a predicate which
  never returns `true` for a `NULL` input column. Null-intolerance is
  used to infer nullability of columns. `x = y` (where `x` and `y` are
  columns) is a null-intolerant expression.
* Contains-aggregate. Does the scalar expression contain any aggregate
  functions?
* Contains-subquery. Does the scalar expression contain any
  subqueries?
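A computed property is just a walk over the full scalar expression.
For example, contains-subquery might look like the following sketch,
where `expr` is the node type from the Prep section and `subqueryOp`
is an assumed operator constant:

```go
// containsSubquery computes the contains-subquery property on demand
// by recursively walking the scalar expression.
func containsSubquery(e *expr) bool {
	if e.op == subqueryOp {
		return true
	}
	for _, child := range e.children {
		if containsSubquery(child) {
			return true
		}
	}
	return false
}
```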
### Transformations

Transformations convert an input expression tree into zero or more
logically equivalent trees. Transformations utilize properties in
order to determine the validity of the transformation. Transforms are
configured with an expression pattern, a check method and an apply
method. The expression pattern is used to identify locations within
the full expression where the transform can be applied. The check
method performs additional checks to determine the validity of a
transformation. And the apply method applies the transformation,
generating zero or more logically equivalent expressions.

Transformations are categorized as *exploration* or
*implementation*. An exploration transformation creates a logical
expression from an existing logical expression. An implementation
transform creates a physical expression from a logical
expression. Note that both exploration and implementation transforms
take as input logical expressions.

Some examples of transformations:

* Join commutativity swaps the order of the inputs to an inner join:
  `[join a b] -> [join b a]`.
* Join associativity reorders the children of a parent and child join:
  `[join [join a b] c]` -> `[join [join a c] b]`
* Join elimination removes unnecessary joins based on projected
  columns and foreign keys.
* Distinct/group-by elimination removes unnecessary distinct/group-by
  operations based on keys.
* Decorrelation replaces correlated subqueries with semi-join,
  anti-join and apply operators.
* Scan to index scan transforms the logical scan operator into one or
  more index scans on covering indexes.
* Inner join to merge-join transforms a logical inner join operator
  into a merge-join operator.

An example transformation is join commutativity. The pattern for join
commutativity is an inner-join:

```
inner-join
 |
 +-- pattern leaf  // left input
 |
 +-- pattern leaf  // right input
 |
 +-- pattern leaf  // join condition
```

An inner-join always has 3 children: the left and right inputs and the
join condition. Join commutativity only needs to swap the left and
right inputs, and thus specifies a pattern leaf for all 3 children.

The actual join commutativity transform is straightforward:

```go
// This is demonstration code, the real implementation will be mildly
// more complex in order to reduce heap allocations.
func (joinCommutativity) apply(e *expr) *expr {
	return &expr{
		op: innerJoinOp,
		children: []*expr{
			e.children[1],
			e.children[0],
			e.children[2],
		},
		props: e.props,
	}
}
```

Note that join commutativity is the simplest transform. More
sophisticated transforms have to perform complex checks for whether
they can be applied to an expression and for generating the resulting
transformed expression. For a slightly more complex example, join
associativity sorts the join conditions between the upper and lower
joins and checks to see if it is creating an undesirable cross-join.
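Putting the pattern/check/apply configuration together, a transform
might be packaged roughly as follows. This interface is a sketch using
the `expr` node from the Prep section; the real shape may differ:

```go
// transform is a sketch of how a transform could be packaged.
type transform interface {
	// pattern returns the expression pattern that identifies where in
	// the full expression the transform can be applied.
	pattern() *expr
	// check performs additional validity checks on a matched expression.
	check(e *expr) bool
	// apply generates zero or more logically equivalent expressions
	// from a matched expression.
	apply(e *expr) []*expr
}
```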
Implicit in the join commutativity example above is that
transformations are written in code. An alternative is to create a
domain specific language for expressing transformations. The benefit
of such a language is the potential for more compact and expressive
transformations. The downside is the need to write a compiler for the
DSL. The current decision is to eschew a DSL for transformations as
the work involved seems strictly greater than writing transformations
in Go. In particular, a DSL would require both the author and reviewer
to learn the DSL. And a DSL doesn't necessarily ease writing a
transformation. Complex transformations may require extensions to the
DSL and the DSL compiler and thus not simplify writing the
transformation at all. In the short and medium term, the set of
transformations is expected to remain small as energies go into
fleshing out other query planning modules. The decision about a DSL
for transformations should be revisited as the transformation set
grows or in the light of experimentation with a DSL that proves its
worth.

### Cost model

The cost model takes as input a physical query plan and computes an
estimated "cost" to execute the plan. The unit of "cost" can be
arbitrary, though it is desirable if it has some real world meaning
such as expected execution time. What is required is for the costs of
different query plans to be comparable. A SQL optimizer is seeking to
find the shortest expected execution time for a query and uses cost as
a proxy for execution time.

Cost is roughly calculated by estimating how much time each node in
the expression tree will use to process all results and modelling how
data flows through the expression tree. [Table statistics](#stats) are
used to power cardinality estimates of base relations, which in turn
power cardinality estimates of intermediate relations. This is
accomplished by propagating histograms of column values from base
relations up through intermediate nodes (e.g. combining histograms
from the two join inputs into a single histogram). Operator-specific
computations model the network, disk and CPU costs. The cost model
should also take into account data layout and the specific operating
environment. For example, network RTT in one cluster might be vastly
different than in another.

The operator-specific computations model the work performed by the
operator. A hash-join needs to model whether temporary disk will be
needed based on the estimated size of the inputs.

Because the cost for a query plan is an estimate, there is an
associated error. This error might be implicit in the cost, or could
be explicitly tracked. One advantage to explicitly tracking the
expected error is that it can allow selecting a higher cost but lower
expected error plan over a lower cost but higher expected error
plan. Where does the error come from? One source is the innate
inaccuracy of stats: selectivity estimation might be wildly off due to
an outlier value. Another source is the accumulated build up of
estimation errors the higher up in the query tree an expression
sits. Lastly, the cost model is making an estimation for the execution
time of an operation such as a network RTT. This estimate can also be
wildly inaccurate due to bursts of activity.

Search finds the lowest cost plan using dynamic programming. That
imposes a restriction on the cost model: it must exhibit optimal
substructure. An optimal solution can be constructed from optimal
solutions of its subproblems.
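A minimal sketch of a cost model with optimal substructure: the cost
of a node is an operator-specific amount of work plus the costs of its
inputs, so an optimal plan is built from optimal subplans. The
per-operator formulas and constants here are purely illustrative:

```go
package main

import (
	"fmt"
	"math"
)

type physPlan struct {
	op     string
	rows   float64 // estimated output cardinality, derived from stats
	inputs []*physPlan
}

// planCost sums operator-specific work over the plan tree.
func planCost(p *physPlan) float64 {
	var c float64
	switch p.op {
	case "scan":
		c = p.rows
	case "sort":
		c = p.rows * math.Log2(p.rows+1)
	case "hash-join":
		// Build a hash table from one input and probe with the other,
		// ignoring spilling to temporary disk.
		c = p.inputs[0].rows + p.inputs[1].rows
	}
	for _, in := range p.inputs {
		c += planCost(in) // optimal substructure: add optimal input costs
	}
	return c
}

func main() {
	a := &physPlan{op: "scan", rows: 10000}
	b := &physPlan{op: "scan", rows: 100}
	join := &physPlan{op: "hash-join", rows: 100, inputs: []*physPlan{a, b}}
	fmt.Println(planCost(join)) // 10000 + 100 (scans) + 10100 (join) = 20200
}
```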
### Search (a.k.a. Enumeration)

Search is the final phase of optimization where many alternative
logical and physical query plans are explored in order to find the
best physical query plan. The output of Search is a physical query
plan to execute. Note that in this context, a physical query plan
refers to a query plan for which the leaves of the tree are table
scans or index scans. In the long term, DistSQL planning will be
incorporated into Search, though in the short term it may be kept
separate.

In order to avoid a combinatorial explosion in the number of
expression trees, Search utilizes the Memo structure. Due to the large
number of possible plans for some queries, Search cannot explore all
of them and thus requires *pruning* heuristics. For example, Search
can cost query plans early and stop exploring a branch of plans if the
cost is greater than the current best cost so far.

Search begins with a Memo populated with the expression provided by
Rewrite. Search is modelled as a series of tasks that optimize an
expression. Conceptually, the tasks form a dependency tree very much
like the dependency tree formed by tools like make. Each task has a
count of its unfinished dependencies and a pointer to its parent
task. When a task is run it is passed its parent task, and as part of
running it can add additional dependencies to its parent, thus making
the tree of dependencies dynamic. After a task is run, it decrements
its parent task's dependency count and schedules the parent for
execution if it was the last dependency. Note that new tasks are only
created if new expressions were added to the memo. Search would not
terminate if we continually created new expressions via
transformations, but that would also indicate that we have unbounded
growth in expressions. In practice, Search will have some limits on
the number of steps it performs or the time it can take.

The initial task for Search is to optimize the "root" group. The tasks
described are the standard Cascades-style search tasks:

1. `OptimizeGroup(reqProps)`. Implements the group (via
   `ImplementGroup`) which generates implementations for the
   expressions in the group, then selects the plan with the least
   estimated cost. Enforcers (e.g. sort) are added as needed.

2. `ImplementGroup`. Explores the group (via `ExploreGroup`) which
   generates more logical expressions in the group and in child
   groups, then generates implementations for all of the logical
   expressions (via `ImplementGroupExpr`). `ImplementGroup` itself
   does not perform any transformations, but acts as a synchronization
   point for dependent tasks.

3. `ImplementGroupExpr`. Implements all of the child groups (via
   `ImplementGroup`), then applies any applicable implementation
   transformations (via `Transform`) to the forest of expressions
   rooted at the specified memo-expression. Example transformation:
   inner-join to merge-join and hash-join.

4. `ExploreGroup`. Explores each expression in the group (via
   `ExploreGroupExpr`). `ExploreGroup` itself does not perform any
   transformations, but acts as a synchronization point for dependent
   tasks.

5. `ExploreGroupExpr`. Explores all of the child groups (via
   `ExploreGroup`), then applies any applicable exploration
   transformations (via `Transform`) to the forest of expressions
   rooted at the specified memo-expression. Example transformations:
   join commutativity and join associativity.

6. `Transform`. Applies a transform to the forest of expressions
   rooted at a particular memo-expression. There are two flavors of
   transformation task: exploration and implementation. The primary
   difference is the state transition after the task finishes. An
   exploration transform task recursively schedules exploration of the
   group it is associated with. An implementation transform task
   schedules optimization of the group.
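The dependency-counted task machinery might look like the following
sketch; the concrete tasks (`OptimizeGroup`, `ImplementGroup`, ...)
are elided and all names are illustrative:

```go
// task has a count of unfinished dependencies and a pointer to its
// parent task.
type task struct {
	parent *task
	deps   int         // count of unfinished dependencies
	run    func(*task) // may add new dependencies, making the tree dynamic
}

type scheduler struct {
	runnable []*task
}

func (s *scheduler) schedule(t *task) {
	s.runnable = append(s.runnable, t)
}

// loop runs tasks until none remain. After a task runs, its parent's
// dependency count is decremented and the parent is scheduled once
// its last dependency finishes.
func (s *scheduler) loop() {
	for len(s.runnable) > 0 {
		t := s.runnable[len(s.runnable)-1]
		s.runnable = s.runnable[:len(s.runnable)-1]
		t.run(t)
		if p := t.parent; p != nil {
			if p.deps--; p.deps == 0 {
				s.schedule(p)
			}
		}
	}
}
```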
A search *stage* is configured by a set of exploration and
implementation transforms, and a *budget*. The budget is used to prune
branches of the search tree which appear undesirable. The initial
search stage has a limited set of exploration and implementation
transforms (perhaps 0 exploration transforms), an unlimited budget,
and aims to quickly find a workable, though possibly slow, plan. Each
subsequent stage uses the cost from the best plan of the previous
stage for pruning. [TODO(peter): my understanding of how this will
work is slightly fuzzy. My usage of the term budget might be
off. Perhaps better to describe it as "max cost".]

Full featured optimizers can contain hundreds of
transformations. Checking whether each transformation is applicable at
each node would be prohibitively expensive, so the transformations are
indexed by the root operator of their pattern. Transformations are
further categorized as exploration and implementation and divided
amongst the search stages based on generality and expected benefit.

Search is naturally parallelizable, yet exploiting that parallelism
involves synchronization overhead. Parallelization also can allow one
query to utilize more planning resources than other queries. Rather
than support parallelization of search, energy will instead be
directed at making search and transformations fast and memory
efficient.

### Testing

Historically, SQL databases have introduced subtle bugs that have
lasted for years via invalid transformations. Search should be
designed for testability. One example of this is to allow verification
that all of the alternate plans generated by Search actually produce
the same result.

In addition to testing the alternative query plans, there is utility
in generating a large number of valid SQL statements. The existing
Random Syntax Generator does one level of this by generating
syntactically valid SQL. An additional level would be to generate
semantically valid queries, which might be more feasible by random
generation at the expression level.

The relational algebra expression trees should provide a textual
format to ease testing using infrastructure similar to the existing
logic tests, where test files define queries and expected results.

Optimization is concerned with making queries faster and it is quite
disturbing to users when inadvertent regressions occur. A large test
suite needs to be developed over time which ensures that the addition
of new transformations or improvements to the various modules do not
cause regressions in the chosen plans.

Generating actual table data with various data distributions for
testing purposes would be both onerous and slow. Table statistics are
a key factor in the decisions performed by search. In order to
adequately test how the behavior of search changes with changing table
statistics, we need an easy mechanism for injecting fake statistics.
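A minimal sketch of the plan equivalence check mentioned above;
`physicalPlan` and `exec` are assumed stand-ins for a plan produced by
Search and a function that executes it, returning the resulting rows
in textual form:

```go
import (
	"reflect"
	"sort"
	"testing"
)

// checkPlansEquivalent executes every alternate plan generated by
// Search for a single query and verifies they all produce the same
// rows.
func checkPlansEquivalent(t *testing.T, plans []physicalPlan,
	exec func(physicalPlan) []string) {
	var expected []string
	for i, p := range plans {
		rows := exec(p)
		sort.Strings(rows) // ignore row order unless an ordering is required
		if i == 0 {
			expected = rows
			continue
		}
		if !reflect.DeepEqual(rows, expected) {
			t.Errorf("plan %d returned different results", i)
		}
	}
}
```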
## Roadmap

The above outline sketches a large amount of work. How do we get there
from here? A strawman proposal divides the work into several
releases. The farther out the proposed work, the fuzzier the proposal
becomes.

### 2.0

* Stats. Stats are not dependent on other planning modules but are a
  prerequisite to cost-based transformations. Stats are only generated
  explicitly via `CREATE STATISTICS`.

* Prep. Introduce the expression tree. Construct the expression tree
  from the existing AST output by the parser. Use the AST-based type
  checking and name resolution. The existing AST-based planning code
  will be left in place and a parallel world of expression-based
  planning will be erected. The new planning code will not be used in
  this release.

* Rewrite. Predicate inference and predicate push down.

* Memo. Introduce the memo structure.

* Testing. Use ugly hacks to hook up a hobbled version of something as
  an alternate query planner. Perhaps a flag to pass queries through
  the expression format and memo and then translate them back into the
  AST in order to use the legacy planner.

### 2.1

* Stats. Automatically gather stats on PKs and index columns.

* Prep. Perform name resolution and type checking on the expression
  tree. Support non-recursive CTEs. Fall back to the legacy planning
  code for unsupported queries.

* Rewrite. Transform correlated subqueries into apply
  variants. Transform common apply variants into joins.

* Execution. Nested-loop-join, semi-join, anti-join and apply
  processors.

* Cost model. Basic cost model that is powered by stats.

* Search. Task-based single stage search. No pruning. Use existing
  DistSQL planning. Facility for time-travel debugging of the search
  process and inspecting the memo state (e.g. logical and physical
  properties). Global and per-session disablement of individual
  transforms.

* Transforms. Join elimination, distinct/group-by elimination, join
  commutativity, join associativity, index selection, and scalar
  normalization.

* Testing. Random generation of table data based on schema and query
  to exercise corner conditions. Random sampling and execution of
  alternate query plans to verify equivalence. Test suite for plan
  selection using injected stats.

### 2.2

* Stats. Support more advanced statistics (e.g. filtered statistics).

* Prep. Support 100% of queries, enabling the deletion of the legacy
  planning code.

* Cost model. Make the cost model more sophisticated by taking into
  account measurements of network bandwidth and latency. Validate the
  cost model against actual queries.

* Search. Add multiple stages with pruning heuristics. Integrate
  DistSQL planning.

* Transforms. Pull group-by above a join. Push group-by below a
  join. Split group-by into local and global components. Simplify
  outer joins.

* Execution. Stream-group-by.

## Unresolved questions

* Flesh out understanding of where physical properties such as
  ordering can be imposed by the query itself. For example, a
  top-level `ORDER BY` clause definitely imposes ordering. But so does
  an `ORDER BY` clause that is the immediate sub-expression of
  `LIMIT/OFFSET`, `DISTINCT ON`, `WITH ORDINALITY`,
  `{INSERT,UPSERT,DELETE,UPDATE}` and `CREATE TABLE ... AS ...`. We
  also need to pay attention to `ORDER BY INDEX` and `ORDER BY PRIMARY
  KEY`, though those clauses likely degenerate into `ORDER
  BY`. Are there other places we need to pay attention to physical
  properties? Are there other physical properties to capture at
  intermediate nodes?

* Which parts of query planning can be performed during PREPARE vs
  EXECUTE? Most (all?) of the transformations that are part of Rewrite
  can be performed during PREPARE.
  For example, predicate push-down
  and decorrelation do not require placeholder values. And some parts
  of Search, such as join enumeration, can be performed during
  PREPARE. The parts that are restricted to EXECUTE are certain parts
  of index selection and thus the costing of query plans.

* The performance of the query planner itself is important because
  query planning occurs for every query executed. What sorts of fast
  paths are possible for simple queries?

* Window functions.

* Describe the max1row operator and why it is necessary.

## Appendix

### Expr/Memo examples

Consider the query:

```sql
SELECT v, k FROM kv WHERE k < 3
```

Building the expression tree results in:

```
project [out=(0,1)]
  columns: kv.v:1 kv.k:0
  projections:
    variable (kv.v) [in=(1)]
    variable (kv.k) [in=(0)]
  inputs:
    select [out=(0,1)]
      columns: kv.k:0* kv.v:1
      filters:
        lt [in=(0)]
          inputs:
            variable (kv.k) [in=(0)]
            const (3)
      inputs:
        scan [out=(0,1)]
          columns: kv.k:0 kv.v:1
```

Some points to notice above. The relational operators (`project`,
`select` and `scan`) track their output column set as a bitmap
(i.e. `out=(0,1)`). Scalar expressions such as `variable` and `lt`
track their required input columns. Relational operators have a slice
of children where the interpretation of the children is operator
specific. The `project` operator has 2 children: a relational input
and a list of projections. Note that the order of projections is
important and they are stored using an `ordered-list` operator in the
memo. The `select` operator also has 2 children: a relational input
and a list of filters.

Inserting the expression tree into the memo results in:

```
8: [project [5 7]]
7: [ordered-list [6 2]]
6: [variable kv.v]
5: [select [1 4]]
4: [lt [2 3]]
3: [const 3]
2: [variable kv.k]
1: [scan kv]
```

Here we can see more clearly the child structure of the various
relational operators. The `select` expression in group 5 has 2
children: groups 1 and 4. Group 1 is a `scan` and group 4 is the
filter.

As another example, consider the query:

```sql
SELECT k, v FROM (SELECT v, k FROM kv)
```

Inserting into the memo we get:

```
7: [project [5 6]]
6: [ordered-list [3 2]]
5: [project [1 4]]
4: [ordered-list [2 3]]
3: [variable kv.k]
2: [variable kv.v]
1: [scan kv]
```

Notice that the variables (`kv.k` and `kv.v`) are only present once in
the memo and their groups are shared by both projection lists.