github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20171213_sql_query_planning.md (about)

     1  - Feature Name: SQL Query Planning
     2  - Status: in-progress
     3  - Start Date: 2017-12-13
     4  - Authors: Peter Mattis
     5  - RFC PR: #19135
     6  - Cockroach Issue: (one or more # from the issue tracker)
     7  
     8  # Summary
     9  
    10  This RFC sketches the outlines of the high-level modules of a SQL
    11  query planning including a full-featured optimizer.
    12  
    13  # Motivation
    14  
    15  SQL query planning is concerned with transforming the AST of a SQL
    16  query into a physical query plan for execution. Naive execution of a
    17  SQL query can be prohibitively expensive, because SQL specifies the
    18  desired results and not how to achieve them. A given SQL query can
    19  have thousands of alternate query plans with vastly different
    20  execution times. The techniques used to generate and select a good
    21  query plan involve significant engineering challenges.
    22  
    23  This RFC is intended to provide guidance for both short term and long
    24  term work on the SQL optimizer, and highlight areas of the current
    25  system that will need to evolve.
    26  
    27  # Guide-level explanation
    28  
    29  ## Overview
    30  
    31  SQL query planning is often described in terms of 8 modules:
    32  
    33  1. [Stats](#stats)
    34  2. [Prep](#prep)
    35  3. [Rewrite](#rewrite)
    36  4. [Memo](#memo)
    37  5. [Cost Model](#cost-model)
    38  6. [Search](#search-aka-enumeration)
    39  7. [Properties](#properties)
    40  8. [Transformations](#transformations)
    41  
    42  Note that Stats, Cost Model, Memo, Properties and Transformations
    43  could be considered modules, while Prep, Rewrite and Search could be
    44  considered phases, though we'll refer to all 8 uniformly as modules in
    45  this document. Memo is a technique for compactly representing the
    46  forest of trees generated during Search. Stats, Properties, Cost Model
    47  and Transformations are modules that power Prep, Rewrite and Search.
    48  
    49  ```
    50                     SQL query text
    51                           |
    52                     +-----v-----+
    53                     |   Parse   |
    54                     +-----+-----+
    55                           |
    56                         (ast)
    57                           |
    58       +-------+     +-----v-----+  - constant folding, type checking, name resolution
    59       | Stats +----->   Prep    |  - computes initial properties
    60       +-------+     +-----+-----+  - retrieves and attaches stats
    61                           |        - done once per PREPARE
    62                        (expr)
    63                           |
    64                     +-----v-----+  - capture placeholder values / timestamps
    65                  +-->  Rewrite  |  - cost-agnostic transformations, eg. predicate push-down
    66  +------------+  |  +-----+-----+  - done once per EXECUTE
    67  | Transforms +--+        |
    68  +------------+  |     (expr)
    69                  |        |
    70                  +-->-----v-----+  - cost-based transformations
    71  +------------+     |  Search   |  - finds lowest cost physical plan
    72  | Cost Model +----->-----+-----+  - includes DistSQL physical planning
    73  +------------+           |
    74                    (physical plan)
    75                           |
    76                     +-----v-----+
    77                     | Execution |
    78                     +-----------+
    79  ```
    80  
    81  
    82  CockroachDB already has implementations of portions of these modules
    83  except for Stats and Memo. For example, CockroachDB performs name
    84  resolution and type checking which is part of Prep, and performs
    85  predicate push down through joins which traditionally happens during
    86  Rewrite. CockroachDB utilizes a primitive Cost model during index
    87  selection (a portion of Search) to choose which index to use based on
    88  filters and desired ordering.
    89  
    90  In addition to the 8 modules, another aspect of the optimizer that
    91  needs discussion is [Testing](#testing) and test infrastructure.
    92  
    93  Lastly, a strawman [Roadmap](#roadmap) is proposed for how to break up
    94  this work over the next several releases.
    95  
    96  ## Glossary
    97  
    98  The following terms are introduced/defined in this RFC:
    99  
   100  - [**algebraic equivalence**](#properties)
   101  - [**attributes** of expressions](#properties-vs-attributes)
   102  - [**cardinality**](#stats)
   103  - [**decorrelating**](#rewrite), syn. "unnesting"
   104  - [**derived** vs **required** properties](#properties)
   105  - [**enforcer** operator for properties](#properties)
   106  - [**equivalence class**](#memo)
   107  - [**exploration** vs **implementation** transformations](#search)
   108  - [**expressions** in queries](#prep)
   109  - [**functional dependencies**](#prep)
   110  - [**logical** vs **physical** properties](#memo)
   111  - [**logical** vs **physical** vs **scalar** operators](#prep)
   112  - [**memo-expressions**](#memo)
   113  - [**operator** in query expressions](#prep)
   114  - [**pattern** in transformations](#memo)
   115  - [**predicate push-down**](#rewrite)
   116  - [**prep** phase](#prep)
   117  - **properties** of expressions [1](#memo) [2](#properties)
   118  - [**pruning** during search](#search)
   119  - [**query text**](#modules)
   120  - [**rewrite** phase](#rewrite)
   121  - [**scalar** vs **relational** properties](#properties)
   122  - [**search** phase](#search)
   123  - [**selectivity**](#stats)
   124  - [**tracked** vs **computed** properties](#properties)
   125  - [**transformation** of expressions](#rewrite)
   126  - [**unnesting**](#rewrite), syn. "decorrelating"
   127  
   128  ## Modules
   129  
   130  The parse phase is not discussed in this RFC. It handles the
   131  transformation of the *SQL query text* into an abstract syntax tree
   132  (AST).
   133  
   134  ### Prep
   135  
   136  *Prep* (short for "prepare") is the first phase of query optimization
   137  where the AST is transformed into a form more suitable for
   138  optimization and annotated with information that will be used by later
   139  phases. Prep includes resolving table and column references (i.e. name
   140  resolution) and type checking, both of which are already performed by
   141  CockroachDB.
   142  
   143  During Prep, the AST is transformed from the raw output of the parser
   144  into an expression "tree".
   145  
   146  ```go
   147  type operator int16
   148  
   149  type expr struct {
   150    op              operator
   151    children        []*expr
   152    relationalProps *relationalProps // See [relational properties](#tracked_properties)
   153    scalarProps     *scalarProps     // See [scalar properties](#tracked_properties)
   154    physicalProps   *physicalProps   // See [physical properties](#tracked_properties)
   155    private         interface{}
   156  }
   157  ```
   158  
   159  The term *"expression"* here is based on usage from literature, though
   160  it is mildly confusing as the current SQL code uses "expression" to
   161  refer to scalar expressions. In this document, "expression" refers to
   162  either a relational or a scalar expression. Using a uniform node type
   163  for expressions facilitates transforms used during the Rewrite and
   164  Search phases of optimization.
   165  
   166  Each expression has an *operator* and zero or more operands
   167  (`expr.children`). Operators can be *relational* (e.g. `join`) or
   168  *scalar* (e.g. `<`). Relational operators can be *logical* (only
   169  specifies results) or *physical* (specifies both result and a
   170  particular implementation).
   171  
   172  During Prep all the columns are given a unique index (number). Column
   173  numbering involves assigning every base column and non-trivial
   174  projection in a query a unique query-specific index.
   175  
   176  Giving each column a unique index allows the expression nodes
   177  mentioned above to track input and output columns, or really any set
   178  of columns during Prep and later phases, using a bitmap. The bitmap
   179  representation allows fast determination of compatibility between
   180  expression nodes and is utilized during rewrites and transformations
   181  to determine the legality of such operations.
   182  
   183  The Prep phase also computes *logical properties*, such as the input
   184  and output columns of each (sub-)expression, equivalent columns,
   185  not-null columns and functional dependencies.
   186  
   187  The functional dependencies for an expression are constraints over one
   188  or more sets of columns. Specific examples of functional dependencies
   189  are the projections, where 1 or more input columns determine an output
   190  column, and "keys" which are a set of columns where no two rows output
   191  by the expression are equal after projection on to that set (e.g. a
   192  unique index for a table where all of the columns are NOT
   193  NULL). Conceptually, the functional dependencies form a graph, though
   194  they are not represented as such in code.
   195  
   196  ### Rewrite
   197  
   198  The second phase of query optimization is *rewrite*. The rewrite phase
   199  performs *transformations* on the logical query tree which are always
   200  beneficial (i.e. cost-agnostic).
   201  
   202  A transformation transforms a (part of a) query into another. Note
   203  that there is conceptual overlap with the Search phase which also
   204  performs transformations on the query. Both phases employ
   205  transformations, yet Search needs to track and cost the alternatives
   206  while Rewrite does not. In the specific context of the rewrite phase,
   207  transformations are commonly called *rewrites*.
   208  
   209  During Rewrite, the previous version of an expression is
   210  discarded. During Search, both the original and new expression are
   211  preserved side-by-side as alternatives, see the [section
   212  below](#search) for details.
   213  
   214  Also note that some of the transformations performed by Rewrite need
   215  not be performed again by Search (decorrelation is the prime
   216  example). The vast majority of transforms performed by Search are not
   217  used by Rewrite.
   218  
   219  Rewrite is the phase where e.g. correlated subqueries are
   220  *decorrelated* (synonym: *unnesting*), additional predicates are
   221  inferred and *predicate push down* occurs, and various other
   222  simplifications to the relational algebra tree (e.g. projection & join
   223  elimination). As an example of predicate push down, consider the
   224  query:
   225  
   226  ```sql
   227  SELECT * FROM a, b USING (x) WHERE a.x < 10
   228  ```
   229  
   230  The naive execution of this query retrieves all rows from `a` and `b`,
   231  joins (i.e. filters) them on the variable `x`, and then filters them
   232  again on `a.x < 10`. Predicate push down attempts to push down the
   233  predicate `a.x < 10` below the join. This can obviously be done for
   234  the scan from `a`:
   235  
   236  ```sql
   237  SELECT * FROM (SELECT * FROM a WHERE a.x < 10), b USING (x)
   238  ```
   239  
   240  Slightly more complicated, we can also generate a new predicate using
   241  the functional dependence that `a.x = b.x` (due to the join
   242  predicate):
   243  
   244  ```sql
   245  SELECT * FROM
   246    (SELECT * FROM a WHERE a.x < 10),
   247    (SELECT * FROM b WHERE b.x < 10) USING (x)
   248  ```
   249  
   250  Predicate push down is aided by predicate inference. Consider the query:
   251  
   252  ```sql
   253  SELECT * FROM a, b USING (x)
   254  ```
   255  
   256  Due to the join condition, we can infer the predicates `a.x IS NOT
   257  NULL` and `b.x IS NOT NULL`:
   258  
   259  ```sql
   260  SELECT * FROM a, b USING (x)
   261    WHERE a.x IS NOT NULL AND b.x IS NOT NULL
   262  ```
   263  
   264  And predicate push down can push these predicates through the join:
   265  
   266  ```sql
   267  SELECT * FROM
   268    (SELECT * FROM a WHERE a.x IS NOT NULL),
   269    (SELECT * FROM b WHERE b.x IS NOT NULL) USING (x)
   270  ```
   271  
   272  ### Stats
   273  
   274  Table statistics power both the cost model and the search of alternate
   275  query plans. A simple example of where stastistics guide the search of
   276  alternate query plans is in join ordering:
   277  
   278  ```sql
   279  SELECT * FROM a JOIN b
   280  ```
   281  
   282  In the absence of other opportunities, this might be implemented as a
   283  hash join. With a hash join, we want to load the smaller set of rows
   284  (either from `a` or `b`) into the hash table and then query that table
   285  while looping through the larger set of rows. How do we know whether
   286  `a` or `b` is larger? We keep statistics about the *cardinality* of `a`
   287  and `b`, i.e. the (approximate) number of different values.
   288  
   289  Simple table cardinality is sufficient for the above query but fails
   290  in other queries. Consider:
   291  
   292  ```sql
   293  SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y > 10
   294  ```
   295  
   296  Table statistics might indicate that `a` contains 10x more data than
   297  `b`, but the predicate `a.y > 10` is filtering a chunk of the
   298  table. What we care about is whether the result of the scan of `a`
   299  after filtering returns more rows than the scan of `b`. This can be
   300  accomplished by making a determination of the *selectivity* of the
   301  predicate `a.y > 10` (the % of rows it will filter) and then
   302  multiplying that selectivity by the cardinality of `a`. The common
   303  technique for estimating selectivity is to collect a histogram on
   304  `a.y` (prior to running the query).
   305  
   306  The collection of table statistics occurs prior to receiving the
   307  query. As such, the statistics are necessarily out of date and may be
   308  inaccurate. The system may bound the inaccuracy by recomputing the
   309  stats based on how fast a table is being modified. Or the system may
   310  notice when stat estimations are inaccurate during query execution.
   311  
   312  [A separate RFC covers statistics collection in
   313  CockroachDB.](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170908_sql_optimizer_statistics.md)
   314  
   315  ### Memo
   316  
   317  Memo is a data structure for efficiently storing a forest of query
   318  plans. Conceptually, the memo is composed of a numbered set of
   319  **equivalency classes** called **groups** where each group contains a
   320  set of logically equivalent expressions. The different expressions in
   321  a single group are called **memo-expressions** (memo-ized
   322  expressions). While an expression node outside of the memo contains a
   323  list of child expressions, a memo-expression contains a list of child
   324  groups.
   325  
   326  By definition, all the memo-expressions in a group share the same
   327  *logical properties*, a concept explored more in depth in the [section
   328  below](#properties). The memo-expression structure mirrors the
   329  expression structure:
   330  
   331  ```go
   332  type exprID int32
   333  type groupID int32
   334  
   335  type memoExpr struct {
   336    op            operator
   337    children      []groupID
   338    physicalProps *physicalProps
   339    private       interface{}
   340  }
   341  
   342  type memoGroup struct {
   343    exprs           []memoExpr
   344    relationalProps *relationalProps
   345    scalarProps     *scalarProps
   346  }
   347  ```
   348  
   349  Transformations are *not* performed directly on the memo because
   350  transformations operate on trees while the memo models a forest of
   351  trees. Instead, expression fragments are extracted from the memo,
   352  transformed, and re-inserted into the memo. At first glance, this
   353  seems onerous and inefficient, but it allows transformations to be
   354  rewritten more naturally and the extraction of expression fragments
   355  can be performed efficiently.
   356  
   357  Extracting an expression fragment for transformation is performed via
   358  a process called *binding*. Binding allows iterating over all of the
   359  expressions matching a pattern that are rooted at a particular
   360  memo-expression. A pattern is specified using the same expression
   361  structure that is to be extracted, with the addition of "pattern-leaf"
   362  and "pattern-tree" placeholders that act as wildcards:
   363  
   364  * A **pattern leaf** matches any expression tree, with only the root
   365    of the tree being retained in the bound expression. It is used when
   366    the expression is used opaquely by the transformation. In other
   367    words, the transformation doesn't care what's inside the subtree. It
   368    is a "leaf" in the sense that it's a leaf in any binding matching a
   369    pattern.
   370  * A **pattern tree** matches any expression tree and indicates that
   371    recursive extraction of the full subtree is required. It is
   372    typically used for scalar expressions when some manipulation of that
   373    expression is required by the transformation. Note that a pattern
   374    tree results in all possible subtrees being enumerated, however
   375    scalar expressions typically don't have many subtrees (if there are
   376    no subqueries, there is only one subtree). [TODO(peter): what to do
   377    about subqueries in a scalar context? Iterating over all of the
   378    subquery expressions doesn't seem right. There is a TODO in `opttoy`
   379    to cache scalar expressions in `memoGroup`. Need to investigate this
   380    further.]
   381  
   382  To better understand the structure of the memo, consider the query:
   383  
   384  ```sql
   385  SELECT * FROM a, b WHERE a.x = b.x
   386  ```
   387  
   388  Converted to the expression structure which models the extended
   389  relational algebra the query looks like:
   390  
   391  ```
   392  inner-join [columns: a.x a.y b.x b.z]
   393    filters:
   394      eq
   395        inputs:
   396          variable (a.x)
   397          variable (b.x)
   398    inputs:
   399      scan [columns: a.x a.y]
   400      scan [columns: b.x b.z]
   401  ```
   402  
   403  Inserting the expression tree into the memo results in:
   404  
   405  ```
   406  6: [inner-join [1 2 5]]
   407  5: [eq [3 4]]
   408  4: [variable b.x]
   409  3: [variable a.x]
   410  2: [scan b]
   411  1: [scan a]
   412  ```
   413  
   414  Memo groups are numbered by when they were created and the groups are
   415  topologically sorted for display (this is an implementation detail and
   416  not intended to be prescriptive). In the above example, each group
   417  contains only a single memo-expression. After performing the join
   418  commutativity transformation, the memo would expand:
   419  
   420  ```
   421  6: [inner-join [1 2 5]] [inner-join [2 1 5]]
   422  5: [eq [3 4]]
   423  4: [variable b.x]
   424  3: [variable a.x]
   425  2: [scan b]
   426  1: [scan a]
   427  ```
   428  
   429  Memo groups contain logically equivalent expressions, but two
   430  logically equivalent expression may not be placed in the same memo
   431  group. This occurs because determining logical equivalency of two
   432  relational expressions is complex to perform 100% correctly. A
   433  correctness failure (i.e. considering two expressions logically
   434  equivalent when they are not) results in invalid transformations and
   435  invalid plans. Placing two logically equivalent expressions in
   436  different groups has a much gentler failure mode: the memo and search
   437  are less efficient.
   438  
   439  Insertion of an expression into the memo is performed by recursively
   440  inserting all of the sub-expressions into the memo and then computing
   441  a **fingerprint** for the memo-expression. The fingerprint for a
   442  memo-expression is simply the expression operator and the list of
   443  child groups. For example, in the memo examples above, the fingerprint
   444  for the first inner-join expression is `[inner-join [1 2 5]]`. The
   445  memo maintains a map from expression fingerprint to memo group which
   446  allows quick determination if an expression fragment already exists in
   447  the memo. A small amount of operator-specific normalization is
   448  performed when computing the group fingerprint for a
   449  memo-expression. For example, the left and right inputs of an
   450  inner-join are output in sorted order which results in the expressions
   451  `[inner-join [1 2 5]]` and `[inner-join [2 1 5]]` having the same
   452  group fingerprint. The operator-specific normalization is
   453  conservative. The common case for placing logically equivalent
   454  expressions in the same group is adherence to the invariant that
   455  transformed expressions are logically equivalent to their input.
   456  
   457  ```go
   458  type memo struct {
   459    // Map from memo-expression "group" fingerprint to group ID.
   460    groupMap map[string]groupID
   461    groups   []memoGroup
   462  }
   463  ```
   464  
   465  In addition to memo expressions, memo groups also contain a map from
   466  desired physical properties to optimization state for the group for
   467  those properties. This state is discussed more in
   468  [Search](#search-aka-enumeration-or-transformation).
   469  
   470  A **location** within the memo identifies a particular memo-expression
   471  by its group and expression number. When an expression fragment is
   472  extracted from the memo, each `expr` is tagged with the location it
   473  originated from in the memo. This allows subsequent reinsertion of a
   474  transformed expression into the memo to quickly determine which groups
   475  the expression nodes should be added to.
   476  
   477  ```go
   478  type memoLoc struct {
   479    group groupID
   480    expr  exprID
   481  }
   482  ...
   483  type expr struct {
   484    op  operator
   485    loc memoLoc
   486    ...
   487  }
   488  ```
   489  
   490  The above depictions of the memo structures are simplified for
   491  explanatory purposes. The actual structures are similar, though
   492  optimized to reduce allocations.
   493  
   494  ### Properties
   495  
   496  Properties are meta-information that are maintained at each node in an
   497  expression. Properties power transformations and optimization.
   498  
   499  #### Properties vs attributes
   500  
   501  The term "property" encompasses information that is well-defined over
   502  any expression in its group: a given scalar property is well-defined
   503  for all scalar operators; a relational property is well-defined for
   504  all relational operators. For example, "nullability" is a property
   505  that is properly defined (and says something meaningful for) any any
   506  scalar expression.
   507  
   508  In contrast, some bits of information are only relevant for specific
   509  operators. For example, the "join algorithm" is only relevant for join
   510  operators; the "index name" is only relevant for table scan operators,
   511  etc. This operator-specific data is called an *attribute* and is
   512  attached to a particular memo-expression.
   513  
   514  #### Logical vs physical properties
   515  
   516  Logical properties are maintained for both relational and scalar
   517  operators. A logical property refers to logical information about the
   518  expression such as column equivalencies or functional dependencies.
   519  
   520  Physical properties are those that exist outside of the relational
   521  algebra such as row order and data distribution. Physical property
   522  requirements arise from both the query itself (the non-relational
   523  `ORDER BY` operator) and by the selection of specific implementations
   524  during optimization (e.g. a merge-join requires the inputs to be
   525  sorted in a particular order).
   526  
   527  By definition, two memo-expressions in the same group have the same
   528  logical properties and the logical properties are attached to the memo
   529  group. The physical properties for the memo-expressions in a group may
   530  differ. For example, a memo group containing inner-join will also have
   531  hash-join and merge-join implementations which produce the same set of
   532  output rows but in different orders.
   533  
   534  #### Relational vs scalar properties
   535  
   536  The memo contains memo-expressions with either scalar (e.g. `+`, `<`,
   537  etc.) or relational (e.g. `join`, `project`, etc.) operators,
   538  distinguished as scalar expressions vs relational expressions.
   539  
   540  Scalar and relational properties are maintained in separate data
   541  structures, but note that both scalar and relational properties are
   542  considered *logical*.
   543  
   544  #### Derived vs required properties
   545  
   546  Properties can be *required* or *derived*.
   547  
   548  A required property is one specified by the SQL query text.  For
   549  example, a DISTINCT clause is a required property on the set of
   550  columns of the corresponding projection -- that the tuple of columns
   551  forms a key (unique values) in the results.
   552  
   553  A derived property is one derived by the optimizer for an
   554  expression based on the properties of the children nodes.
   555  
   556  For example, in `SELECT k+1 FROM kv`, once the ordering of "k" is
   557  known from kv's descriptor, the same ordering property can be derived
   558  for `k+1`.
   559  
   560  During optimization, for each node with required properties the
   561  optimizer will look at the children node to check whether their actual
   562  properties (which can be derived) match the requirement. If they don't
   563  the optimizer must introduce an *enforcer* operator in the plan that
   564  provides the required property.
   565  
   566  For example, an `ORDER BY` clause creates a required ordering property
   567  can cause the optimizer to add a sort node as an enforcer of that
   568  property.
   569  
   570  #### Tracked vs computed properties
   571  
   572  A [tracked property](#tracked_properties) is one which is maintained
   573  in a data structure (e.g. `relationalProps`, `scalarProps`,
   574  `physicalProps`). A computed property is one which is computed from an
   575  expression or an expression fragment as needed. For intermediate
   576  nodes, all properties can be computed which makes tracked properties
   577  akin to a cache. The decision for whether to track or compute a
   578  property is pragmatic. Tracking a property requires overhead whether
   579  the property is used or not, but makes accessing the property in a
   580  transformation fast. Computing a property can be done only when the
   581  property is used, but is not feasible if the computation requires an
   582  entire sub-expression tree (as opposed to a fragment). [Computed
   583  properties](#computed_properties) primarly occur for scalar properties
   584  for which transformations often have the entire scalar expression.
   585  
   586  #### Tracked properties
   587  
   588  The determination of the properties to track is a key aspect of the
   589  design of an optimizer. Track too many and adding new operators
   590  becomes onerous and maintaining the properties through transformations
   591  becomes expensive. Track too few and certain transformations become
   592  difficult.
   593  
   594  Relational properties:
   595  
   596  * Output columns [`intset`]. The set of columns output by an
   597    expression. Used to determine if a predicate is compatible with an
   598    expression.
   599  * Outer columns [`intset`]. The set of columns that are used by the
   600    operator but not defined in the underlying expression tree (i.e. not
   601    supplied by the inputs to the current expression). Synonym: *free
   602    vars*.
   603  * Not-NULL columns [`intset`]. Column nullability is associated with
   604    keys which are a factor in many transformations such as join
   605    elimination, group-by simplification
   606  * Keys [`[]intset`]. A set of columns for which no two rows are equal
   607    after projection onto that set. The simplest example of a key is the
   608    primary key for a table. Note that a key requires all of the columns
   609    in the key to be not-NULL.
   610  * Weak keys [`[]intset`]. A set of columns where no two rows
   611    containing non-NULL values are equal after projection onto that
   612    set. A UNIQUE index on a table is a weak key and possibly a key if
   613    all of the columns are not-NULL. Weak keys are tracked because they
   614    can become keys at higher levels of a query due to null-intolerant
   615    predicates.
   616  * Foreign keys [`map[intset]intset`]. A set of columns that uniquely
   617    identify a single row in another relation. In practice, this is a
   618    map from one set of columns to another set of columns.
   619  * Equivalency groups [`[]intset`]. A set of column groups (sets) where all columns
   620    in a group are equal with each other.
   621  * Constant columns [`intset`]. Columns for which we know we have a
   622    single value.
   623  
   624  Scalar properties:
   625  
   626  * Input columns [`intset`]. The set of columns used by the scalar
   627    expression. Used to determine if a scalar expression is compatible
   628    with the output columns of a relational expression.
   629  * Defined columns [`intset`]. The set of columns defined by the scalar
   630    expression.
   631  
   632  Physical properties:
   633  
   634  * Column ordering. Specified by a top-level projection.
   635  * Row ordering. Specified by an `ORDER BY` clause or required by a
   636    physical operator (e.g. merge-join). Row ordering is enforced by the
   637    **sort** operator.
   638  * Rewindability. Required by multi-use CTEs. Every reference to the
   639    CTE in the query needs to return the same results. A read-only query
   640    has this property by default, though care must be taken with regards
   641    to the [Halloween
   642    Problem](https://en.wikipedia.org/wiki/Halloween_Problem) if the
   643    read-only query exists in the context of a DML query. A CTE
   644    containing a DML (such as `INSERT` or `UPDATE`) needs to have its
   645    results materialized in temporary storage and thus provide
   646    *rewindability*. This property is enforced using a **spool**
   647    operator.
   648  
   649  Note that this list of properties is not exhaustive. In particular,
   650  there are a large number of scalar properties for which it isn't clear
   651  if the property should be tracked or computed when necessary. For
   652  example, null-tolerance (does a predicate ever return true for a NULL
   653  value) can be computed from a scalar expression when needed. It is an
   654  open question as to whether it is utilized frequently enough that it
   655  should be tracked.
   656  
   657  Tracking is a bit more than caching of computed properties: we can't
   658  compute certain relational properties without the entire
   659  sub-expression. Keys are an example: if you have a deeply nested join,
   660  in order to compute the keys after performing a join associativity
   661  transform, you would need to have the entire expression tree. By
   662  tracking the keys property and maintaining it at each relational
   663  expression, we only need the fragment of the expression needed by the
   664  transform.
   665  
   666  ### Computed properties
   667  
   668  Computed properties are used primarily in conjunction with scalar
   669  expressions. The properties are computed rather than tracked because
   670  we usually have the full scalar expression vs just a fragment for
   671  relational expressions.
   672  
   673  Computed scalar properties:
   674  
   675  * Injectivity. An injective expression preserves distinctness: it
   676    never maps distinct elements of its domain to the same element of
   677    its codomain. `exp(x) = e^x` is injective.
   678  * Monotonicity. An monotonic expression preserves ordering. The
   679    preservation may be positive or negative (maintains order or inverts
   680    order) and strict or weak (maintains uniqueness or invalidates it).
   681    `floor(x)` is a positive-weak monotonic expression. `-x` is a
   682    negative-strict monotonic expression.
   683  * Null-intolerance. A null-intolerant expression is a predicate which
   684    never returns `true` for a `NULL` input column. Null-intolerance is
   685    used to infer nullability of columns. `x = y` (where `x` and `y` are
   686    columns) is a null-intolerant expression.
   687  * Contains-aggregate. Does the scalar expression contain any aggregate
   688    functions?
   689  * Contains-subquery. Does the scalar expression contain any
   690    subqueries?
   691  
   692  ### Transformations
   693  
   694  Transformations convert an input expression tree into zero or more
   695  logically equivalent trees. Transformations utilize properties in
   696  order to determine the validity of the transformation. Transforms are
   697  configured with an expression pattern, a check method and an apply
   698  method. The expression pattern is used to identify locations within
   699  the full expression where the transform can be applied. The check
   700  method performs additional checks to determine the validity of a
   701  transformation. And the apply method applies the transformation,
   702  generating zero or more logically equivalent expressions.
   703  
   704  Transformations are categorized as *exploration* or
   705  *implementation*. An exploration transformation creates a logical
   706  expression from an existing logical expression. An implementation
   707  transform creates a physical expression from a logical
   708  expression. Note that both exploration and implementation transforms
   709  take as input logical expressions.
   710  
   711  Some examples of transformations:
   712  
   713  * Join commutativity swaps the order of the inputs to an inner join:
   714    `[join a b] -> [join b a]`.
   715  * Join associativity reorders the children of a parent and child join:
   716    `[join [join a b] c]` -> `[join [join a c] b]`
   717  * Join elimination removes unnecessary joins based on projected
   718    columns and foreign keys.
   719  * Distinct/group-by elimination removes unnecessary distinct/group-by
   720    operations based on keys.
   721  * Decorrelation replaces correlated subqueries with semi-join,
   722    anti-join and apply operators.
   723  * Scan to index scan transforms the logical scan operator into one or
   724    more index scans on covering indexes.
   725  * Inner join to merge-join transforms a logical inner join operator
   726    into a merge-join operator.
   727  
   728  An example transformation is join commutativity. The pattern for join
   729  commutativity is an inner-join:
   730  
   731  ```
   732  inner-join
   733   |
   734   +-- pattern leaf  // left input
   735   |
   736   +-- pattern leaf  // right input
   737   |
   738   +-- pattern leaf  // join condition
   739  ```
   740  
   741  An inner-join always has 3 children: the left and right inputs and the
   742  join condition. Join commutativity only needs to swap the left and
   743  right inputs an this specifies pattern leaf for all 3 children.
   744  
   745  The actual join commutativity transform is straightforward:
   746  
   747  ```go
   748  // This is demonstration code, the real implementation will be mildly
   749  // more complex in order to reduce heap allocations.
   750  func (joinCommutativity) apply(e *expr) *expr {
   751    return &expr{
   752      op: innerJoinOp,
   753      children: []*expr{
   754        e.children[1],
   755        e.children[0],
   756        e.children[2],
   757      }
   758      props: e.props,
   759    }
   760  }
   761  ```
   762  
   763  Note that join commutativity is the simplest transform. More
   764  sophisticated transforms have to perform complex checks for whether
   765  they can be applied to an expression and for generating the resulting
   766  transformed expression. For a slightly more complex example, join
   767  associativity sorts the join conditions between the upper and lower
   768  joins and checks to see if it is creating an undesirable cross-join.
   769  
   770  Implicit in the join commutativity example above is that
   771  transformations are written in code. An alternative is to create a
   772  domain specific language for expressing transformations. The benefit
   773  of such a language is the potential for more compact and expressive
   774  transformations. The downside is the need to write a compiler for the
   775  DSL. The current decision is to eschew a DSL for transformations as
   776  the work involved seems strictly greater than writing transformations
   777  in Go. In particular, a DSL would require both the author and reviewer
   778  to learn the DSL. And a DSL doesn't necessarily ease writing a
   779  transformation. Complex transformations may require extensions to the
   780  DSL and the DSL compiler and thus not simplify writing the
   781  transformation at all. In the short and medium term, the set of
   782  transformations is expected to remain small as energies go into
   783  fleshing out other query planning modules. The decision about a DSL
   784  for transformations should be revisited as the transformation set
   785  grows or in the light of experimentation with a DSL that proves its
   786  worth.
   787  
   788  ### Cost model
   789  
   790  The cost model takes as input a physical query plan and computes an
   791  estimated "cost" to execute the plan. The unit of "cost" can be
   792  arbitrary, though it is desirable if it has some real world meaning
   793  such as expected execution time. What is required is for the costs of
   794  different query plans to be comparable. A SQL optimizer is seeking to
   795  find the shortest expected execution time for a query and uses cost as
   796  a proxy for execution time.
   797  
   798  Cost is roughly calculated by estimating how much time each node in
   799  the expression tree will use to process all results and modelling how
   800  data flows through the expression tree. [Table statistics](#stats) are
   801  used to power cardinality estimates of base relations which in term
   802  power cardinality estimates of intermediate relations. This is
   803  accomplished by propagating histograms of column values from base
   804  relations up through intermediate nodes (e.g. combining histograms
   805  from the two join inputs into a single histogram). Operator-specific
   806  computations model the network, disk and CPU costs. The cost model
   807  should include data layout and the specific operating environment. For
   808  example, network RTT in one cluster might be vastly different than
   809  another.
   810  
   811  The operator-specific computations model the work performed by the
   812  operator. A hash-join needs to model if temporary disk will be needed
   813  based on the estimated size of the inputs.
   814  
   815  Because the cost for a query plan is an estimate, there is an
   816  associated error. This error might be implicit in the cost, or could
   817  be explicitly tracked. One advantage to explicitly tracking the
   818  expected error is that it can allow selecting a higher cost but lower
   819  expected error plan over a lower cost but higher expected error
   820  plan. Where does the error come from? One source is the innate
   821  inaccuracy of stats: selectivity estimation might be wildly off due to
   822  an outlier value. Another source is the accumulated build up of
   823  estimation errors the higher up in the query tree. Lastly, the cost
   824  model is making an estimation for the execution time of an operation
   825  such as a network RTT. This estimate can also be wildly inaccurate due
   826  to bursts of activity.
   827  
   828  Search finds the lowest cost plan using dynamic programming. That
   829  imposes a restriction on the cost model: it must exhibit optimal
   830  substructure. An optimal solution can be constructed from optimal
   831  solutions of its subproblems.
   832  
   833  ### Search (a.k.a. Enumeration)
   834  
   835  Search is the final phase of optimization where many alternative
   836  logical and physical query plans are explored in order to find the
   837  best physical query plan. The output of Search is a physical query
   838  plan to execute. Note that in this context, a physical query plan
   839  refers to a query plan for which the leaves of the tree are table
   840  scans or index scans. In the long term, DistSQL planning will be
   841  incorporated into Search, though in the short term it may be kept
   842  separate.
   843  
   844  In order to avoid a combinatorial explosion in the number of
   845  expression trees, Search utilizes the Memo structure. Due to the large
   846  number of possible plans for some queries, Search cannot explore all
   847  of them and thus requires *pruning* heuristics. For example, Search
   848  can cost query plans early and stop exploring a branch of plans if the
   849  cost is greater than the current best cost so far.
   850  
   851  Search begins with a Memo populated with the expression provided by
   852  Rewrite. Search is modelled as a series of tasks that optimize an
   853  expression. Conceptually, the tasks form a dependency tree very much
   854  like the dependency tree formed by tools like make. Each task has a
   855  count of its unfinished dependencies and a pointer to its parent
   856  task. When a task is run it is passed its parent task and as part of
   857  running it can add additional dependencies to its parent, thus making
   858  the tree of dependencies dynamic. After a task is run, it decrements
   859  its parent tasks and schedules it for execution if it was the last
   860  dependency. Note that new tasks are only created if new expressions
   861  were added to the memo. Search will not terminate if we continually
   862  created new expressions via transformations, but that would also
   863  indicate that we have an unbounded growth in expressions. In practice,
   864  Search will have some limits on the number of steps it performs or
   865  time it can take.
   866  
   867  The initial task for Search is to optimize the "root" group. The tasks
   868  described are the standard Cascades-style search tasks:
   869  
   870  1. `OptimizeGroup(reqProps)`. Implements the group (via
   871     `ImplementGroup`) which generates implementations for the
   872     expressions in the group, then selects the plan with the least
   873     estimated cost. Enforcers (e.g. sort) are added as needed.
   874  
   875  2. `ImplementGroup`. Explores the group (via `ExploreGroup`) which
   876     generates more logical expressions in the group and in child
   877     groups, then generates implementations for all of the logical
   878     expressions (via `ImplementGroupExpr`). `ImplementGroup` itself
   879     does not perform any transformations, but acts as a synchronization
   880     point for dependent tasks.
   881  
   882  3. `ImplementGroupExpr`. Implements all of the child groups (via
   883     `ImplementGroup`), then applies any applicable implementation
   884     transformations (via `Transform`) to the forest of expressions
   885     rooted at the specified memo-expression. Example transformation:
   886     inner-join to merge-join and hash-join.
   887  
   888  4. `ExploreGroup`. Explores each expression in the group (via
   889     `ExploreGroupExpr`). `ExploreGroup` itself does not perform any
   890     transformations, but acts as a synchronization point for dependent
   891     tasks.
   892  
   893  5. `ExploreGroupExpr`. Explores all of the child groups (via
   894     `ExploreGroup`), then applies any applicable exploration
   895     transformations (via `Transform`) to the forest of expressions
   896     rooted at the specified memo-expression. Example transformations:
   897     join commutativity and join associativity.
   898  
   899  6. `Transform`. Applies a transform to the forest of expressions
   900     rooted at a particular memo-expression. There are two flavors of
   901     transformation task: exploration and implementation. The primary
   902     difference is the state transition after the task finishes. An
   903     exploration transform task recursively schedules exploration of the
   904     group it is associated with. An implementation transform task
   905     schedules optimization of the group.
   906  
   907  A search *stage* is configured by a set of exploration and
   908  implementation transforms, and a *budget*. The budget is used to prune
   909  branches of the search tree which appear undesirable. The initial
   910  search stage has a limited set of exploration and implementation
   911  transforms (perhaps 0 exploration transforms), an unlimited budget,
   912  and aims to quickly find a workable, though possibly slow, plan. Each
   913  subsequent stage uses the cost from the best plan of the previous
   914  stage for pruning. [TODO(peter): my understanding of how this will
   915  work is slightly fuzzy. My usage of the term budget might be
   916  off. Perhaps better to describe it as "max cost".]
   917  
   918  Full featured optimizers can contain hundreds of
   919  transformations. Checking whether each transformation is applicable at
   920  each node would be prohibitively expensive, so the transformations are
   921  indexed by the root operator of their pattern. Transformations are
   922  further categorized as exploration and implementation and divided
   923  amongst the search stages based on generality and expected benefit.
   924  
   925  Search is naturally parallelizable, yet exploiting that parallelism
   926  involves synchronization overhead. Parallelization also can allow one
   927  query to utilize more planning resources than other queries. Rather
   928  than support parallelization of search, energy will instead be
   929  directed at making search and transformations fast and memory
   930  efficient.
   931  
   932  ### Testing
   933  
   934  Historically, SQL databases have introduced subtle bugs that have
   935  lasted for years through invalid transformations. Search should be
   936  designed for testability. One example of this is to allow verification
   937  that all of the alternate plans generated by Search actually produce
   938  the same result.
   939  
   940  In addition to testing the alternative query plans, there is utility
   941  in generating a large number of valid SQL statements. The existing
   942  Random Syntax Generator does one level of this by generating
   943  syntactically valid SQL. An additional level would be to generate
   944  semantically valid queries which might be more feasible by random
   945  generation at the expression level.
   946  
   947  The relational algebra expression trees should provide a textual
   948  format to ease testing using infrastructure similar to the existing
   949  logic tests where test files define queries and expected results.
   950  
   951  Optimization is concerned with making queries faster and it is quite
   952  disturbing to users when inadvertent regressions occur. A large test
   953  suite needs to be developed over time which ensures that the addition
   954  of new transformations or improvements to the various modules do not
   955  cause regressions in the chosen plans.
   956  
   957  Generating actual table data with various data distributions for
   958  testing purposes would be both onerous and slow. Table statistics are
   959  a key factor in the decisions performed by search. In order to
   960  adequately test how the behavior of search changes with changing table
   961  statistics, we need an easy mechanism for injecting fake statistics.
   962  
   963  ## Roadmap
   964  
   965  The above outline sketches a large amount of work. How do we get there
   966  from here? A strawman proposal divides the work into several
   967  releases. The farther out the proposed work, the fuzzier the proposal
   968  becomes.
   969  
   970  ### 2.0
   971  
   972  * Stats. Stats are not dependent on other planning modules but are a
   973    prerequisite to cost-based transformations. Stats are only generated
   974    explicitly via `CREATE STATISTICS`.
   975  
   976  * Prep. Introduce the expression tree. Construct the expression tree
   977    from the existing AST output by the parser. Use the AST-based type
   978    checking and name resolution. The existing AST-based planning code
   979    will be left in place and a parallel world of expression-based
   980    planning will be erected. The new planning code will not be used in
   981    this release.
   982  
   983  * Rewrite. Predicate inference and predicate push down.
   984  
   985  * Memo. Introduce the memo structure.
   986  
   987  * Testing. Use ugly hacks to hook up a hobbled version of something as
   988    an alternate query planner. Perhaps a flag to pass queries through
   989    the expression format and memo and then translate them back into the
   990    AST in order to use the legacy planner.
   991  
   992  ### 2.1
   993  
   994  * Stats. Automatically gather stats on PKs and index columns.
   995  
   996  * Prep. Perform name resolution and type checking on the expression
   997    tree. Support non-recursive CTEs. Fall-back to legacy planning code
   998    for unsupported queries.
   999  
  1000  * Rewrite. Transform correlated subqueries into apply
  1001    variants. Transform common apply variants into joins.
  1002  
  1003  * Execution. Nested-loop-join, semi-join, anti-join and apply
  1004    processors.
  1005  
  1006  * Cost model. Basic cost model that is powered by stats.
  1007  
  1008  * Search. Task-based single stage search. No pruning. Use existing
  1009    DistSQL planning. Facility for time-travel debugging of the search
  1010    process and inspecting the memo state (e.g. logical and physical
  1011    properties). Global and per-session disablement of individual
  1012    transforms.
  1013  
  1014  * Transforms. Join elimination, distinct/group-by elimination, join
  1015    commutativity, join associativity, index selection, and scalar
  1016    normalization.
  1017  
  1018  * Testing. Random generation of table data based on schema and query
  1019    to exercise corner conditions. Random sampling and execution of
  1020    alternate query plans to verify equivalence. Test suite for plan
  1021    selection using injected stats.
  1022  
  1023  ### 2.2
  1024  
  1025  * Stats. Support more advanced statistics (e.g. filtered statistics).
  1026  
  1027  * Prep. Support 100% of queries, enabling the deletion of the legacy
  1028    planning code.
  1029  
  1030  * Cost model. Make the cost model more sophisticated by taking into
  1031    account measurements of network bandwidth and latency. Validate cost
  1032    model against actual queries.
  1033  
  1034  * Search. Add multiple stages with pruning heuristics. Integrate
  1035    DistSQL planning.
  1036  
  1037  * Transforms. Pull group-by above a join. Push group-by below a
  1038    join. Split group-by into local and global components. Simplify
  1039    outer joins.
  1040  
  1041  * Execution. Stream-group-by.
  1042  
  1043  ## Unresolved questions
  1044  
  1045  * Flesh out understanding of where physical properties such as
  1046    ordering can be imposed by the query itself. For example, a
  1047    top-level `ORDER BY` clause definitely imposes ordering. But so does
  1048    an `ORDER BY` clause that is the immediate sub-expression of
  1049    `LIMIT/OFFSET`, `DISTINCT ON`, `WITH ORDINALITY`,
  1050    `{INSERT,UPSERT,DELETE,UPDATE}` and `CREATE TABLE ... AS ...`. We
  1051    also need to pay attention to `ORDER BY INDEX` and `ORDER BY PRIMARY
  1052    KEY`, though those clauses likely degenerate into `ORDER
  1053    BY`. Are there other places we need to pay attention to physical
  1054    properties?  Are there other physical properties to capture at
  1055    intermediate nodes?
  1056  
  1057  * Which parts of query planning can be performed during PREPARE vs
  1058    EXECUTE? Most (all?) of the transformations that are part of Rewrite
  1059    can be performed during PREPARE. For example, predicate push-down
  1060    and decorrelation do not require placeholder values. And some parts
  1061    of Search, such as join enumeration, can be performed during
  1062    PREPARE. The part that is restricted to EXECUTE are certain parts of
  1063    index selection and thus costing of query plans.
  1064  
  1065  * The performance of the query planner itself is important because
  1066    query planning occurs for every query executed. What sorts of fast
  1067    paths are possible for simple queries?
  1068  
  1069  * Window functions.
  1070  
  1071  * Describe max1row operator and why it is necessary.
  1072  
  1073  ## Appendix
  1074  
  1075  ### Expr/Memo examples
  1076  
  1077  Consider the query:
  1078  
  1079  ```sql
  1080  SELECT v, k FROM kv WHERE k < 3
  1081  ```
  1082  
  1083  Building the expression tree results in:
  1084  
  1085  ```
  1086  project [out=(0,1)]
  1087    columns: kv.v:1 kv.k:0
  1088    projections:
  1089      variable (kv.v) [in=(1)]
  1090      variable (kv.k) [in=(0)]
  1091    inputs:
  1092      select [out=(0,1)]
  1093        columns: kv.k:0* kv.v:1
  1094        filters:
  1095          lt [in=(0)]
  1096            inputs:
  1097              variable (kv.k) [in=(0)]
  1098              const (3)
  1099        inputs:
  1100          scan [out=(0,1)]
  1101            columns: kv.k:0 kv.v:1
  1102  ```
  1103  
  1104  Some points to notice above. The relational operators (`project`,
  1105  `select` and `scan`) track their output column set as a bitmap
  1106  (i.e. `out=(0,1)`). Scalar expressions such as `variable` and `lt`
  1107  track their required input columns. Relational operators have a slice
  1108  of children where the interpretation of the children is operator
  1109  specific. The `project` operator has 2 children: a relational input
  1110  and a list of projections. Note that the order of projections is
  1111  important and are stored using an `ordered-list` operator in the
  1112  memo. The `select` operator also has 2 children: a relational input
  1113  and a list of filters.
  1114  
  1115  Inserting the expression tree into the memo results in:
  1116  
  1117  ```
  1118  8: [project [5 7]]
  1119  7: [ordered-list [6 2]]
  1120  6: [variable kv.v]
  1121  5: [select [1 4]]
  1122  4: [lt [2 3]]
  1123  3: [const 3]
  1124  2: [variable kv.k]
  1125  1: [scan kv]
  1126  ```
  1127  
  1128  Here we can see more clearly the child structure of the various
  1129  relational operators. The `select` expression in group 5 has 2
  1130  children: groups 1 and 4. Group 1 is a `scan` and group 4 is the
  1131  filter.
  1132  
  1133  As another example, consider the query:
  1134  
  1135  ```sql
  1136  SELECT k, v FROM (SELECT v, k FROM kv)
  1137  ```
  1138  
  1139  Inserting into the memo we get:
  1140  
  1141  ```
  1142  7: [project [5 6]]
  1143  6: [ordered-list [3 2]]
  1144  5: [project [1 4]]
  1145  4: [ordered-list [2 3]]
  1146  3: [variable kv.k]
  1147  2: [variable kv.v]
  1148  1: [scan kv]
  1149  ```
  1150  
  1151  Notice that the variables (`kv.k` and `kv.v`) are only present once in
  1152  the memo and their groups are shared by both projection lists.