github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/pkg/sql/opt/doc.go (about)

     1  // Copyright 2018 The Cockroach Authors.
     2  //
     3  // Use of this software is governed by the Business Source License
     4  // included in the file licenses/BSL.txt.
     5  //
     6  // As of the Change Date specified in that file, in accordance with
     7  // the Business Source License, use of this software will be governed
     8  // by the Apache License, Version 2.0, included in the file
     9  // licenses/APL.txt.
    10  
    11  /*
    12  Package opt contains the Cockroach SQL optimizer. The optimizer transforms the
    13  AST of a SQL query into a physical query plan for execution. Naive execution of
    14  a SQL query can be prohibitively expensive, because SQL specifies the desired
    15  results and not how to achieve them. A given SQL query can have thousands of
    16  equivalent query plans with vastly different execution times. The Cockroach
    17  optimizer is cost-based, meaning that it enumerates some or all of these
    18  alternate plans and chooses the one with the lowest estimated cost.
    19  
    20  Overview
    21  
    22  SQL query planning is often described in terms of 8 modules:
    23  
    24  1. Properties
    25  
    26  2. Stats
    27  
    28  3. Cost Model
    29  
    30  4. Memo
    31  
    32  5. Transforms
    33  
    34  6. Prep
    35  
    36  7. Rewrite
    37  
    38  8. Search
    39  
    40  Note Prep, Rewrite and Search could be considered phases, though this document
    41  will refer to all 8 uniformly as modules. Memo is a technique for compactly
    42  representing the forest of trees generated during Search. Stats, Properties,
    43  Cost Model and Transformations are modules that power the Prep, Rewrite and
    44  Search phases.
    45  
    46                         SQL query text
    47                               |
    48                         +-----v-----+  - parse SQL text according to grammar
    49                         |   Parse   |  - report syntax errors
    50                         +-----+-----+
    51                               |
    52                             (ast)
    53                               |
    54                         +-----v-----+  - fold constants, check types, resolve
    55                         |  Analyze  |    names
    56                         +-----+-----+  - annotate tree with semantic info
    57                               |        - report semantic errors
    58                             (ast+)
    59           +-------+           |
    60           | Stats +----->-----v-----+  - normalize tree with cost-agnostic
    61           +-------+     |   Prep    |    transforms (placeholders present)
    62                      +-->-----+-----+  - compute initial properties
    63                      |        |        - retrieve and attach stats
    64                      |     (expr)      - done once per PREPARE
    65                      |        |
    66      +------------+  |  +-----v-----+  - capture placeholder values / timestamps
    67      | Transforms |--+-->  Rewrite  |  - normalize tree with cost-agnostic
    68      +------------+  |  +-----+-----+    transforms (placeholders not present)
    69                      |        |        - done once per EXECUTE
    70                      |     (expr)
    71                      |        |
    72                      +-->-----v-----+  - generate equivalent expression trees
    73      +------------+     |  Search   |  - find lowest cost physical plan
    74      | Cost Model +----->-----+-----+  - includes DistSQL physical planning
    75      +------------+           |
    76                        (physical plan)
    77                               |
    78                         +-----v-----+
    79                         | Execution |
    80                         +-----------+
    81  
    82  The opt-related packages implement portions of these modules, while other parts
    83  are implemented elsewhere. For example, other sql packages are used to perform
    84  name resolution and type checking which are part of the Analyze phase.
    85  
    86  Parse
    87  
    88  The parse phase is not discussed in this document. It transforms the SQL query
    89  text into an abstract syntax tree (AST).
    90  
    91  Analyze
    92  
    93  The analyze phase ensures that the AST obeys all SQL semantic rules, and
    94  annotates the AST with information that will be used by later phases. In
    95  addition, some simple transforms are applied to the AST in order to simplify
    96  handling in later phases. Semantic rules are many and varied; this document
    97  describes a few major categories of semantic checks and rewrites.
    98  
    99  "Name resolution" binds table, column, and other references. Each name must be
   100  matched to the appropriate schema object, and an error reported if no matching
   101  object can be found. Name binding can result in AST annotations that make it
   102  easy for other components to find the target object, or rewrites that replace
   103  unbound name nodes with new nodes that are easier to handle (e.g. IndexedVar).
   104  
   105  "Constant folding" rewrites expressions that have constant inputs. For example,
   106  1+1 would be folded to 2. Cockroach's typing rules assume that constants have
   107  been folded, as there are some expressions that would otherwise produce a
   108  semantic error if they are not first folded.
   109  
   110  "Type inference" automatically determines the return data type of various SQL
   111  expressions, based on the types of inputs, as well as the context in which the
   112  expression is used. The AST is annotated with the resolved types for later use.
   113  
   114  "Type checking" ensures that all inputs to SQL expressions and statements have
   115  legal static types. For example, the CONCAT function only accepts zero or more
   116  arguments that are statically typed as strings. Violation of the typing rules
   117  produces a semantic error.
   118  
   119  Properties
   120  
   121  Properties are meta-information that are computed (and sometimes stored) for
   122  each node in an expression. Properties power transformations and optimization.
   123  
   124  "Logical properties" describe the structure and content of data returned by an
   125  expression, such as whether relational output columns can contain nulls, or the
   126  data type of a scalar expression. Two expressions which are logically
   127  equivalent according to the rules of the relational algebra will return the
   128  same set of rows and columns, and will have the same set of logical properties.
   129  However, the order of the rows, naming of the columns, and other presentational
   130  aspects of the result are not governed by the logical properties.
   131  
   132  "Physical properties" are interesting characteristics of an expression that
   133  impact its layout, presentation, or location, but not its logical content.
   134  Examples include row order, column naming, and data distribution (physical
   135  location of data ranges). Physical properties exist outside of the relational
   136  algebra, and arise from both the SQL query itself (e.g. the non-relational
   137  ORDER BY operator) and by the selection of specific implementations during
   138  optimization (e.g. a merge join requires the inputs to be sorted in a
   139  particular order).
   140  
   141  Properties can be "required" or "derived". A required property is one specified
   142  by the SQL query text. For example, a DISTINCT clause is a required property on
   143  the set of columns of the corresponding projection -- that the tuple of columns
   144  forms a key (unique values) in the results. A derived property is one derived
   145  by the optimizer for an expression based on the properties of the child
   146  expressions. For example:
   147  
   148    SELECT k+1 FROM kv
   149  
   150  Once the ordering of "k" is known from kv's descriptor, the same ordering
   151  property can be derived for k+1. During optimization, for each expression with
   152  required properties, the optimizer will look at child expressions to check
   153  whether their actual properties (which can be derived) match the requirement.
   154  If they don't, the optimizer must introduce an "enforcer" operator in the plan
   155  that provides the required property. For example, an ORDER BY clause creates a
   156  required ordering property that can cause the optimizer to add a Sort operator
   157  as an enforcer of that property.
   158  
   159  Stats
   160  
   161  Table statistics power both the cost model and the search of alternate query
   162  plans. A simple example of where statistics guide the search of alternate query
   163  plans is in join ordering:
   164  
   165  	SELECT * FROM a JOIN b
   166  
   167  In the absence of other opportunities, this might be implemented as a hash
   168  join. With a hash join, we want to load the smaller set of rows (either from a
   169  or b) into the hash table and then query that table while looping through the
   170  larger set of rows. How do we know whether a or b is larger? We keep statistics
   171  about the cardinality of a and b, i.e. the (approximate) number of different
   172  values.
   173  
   174  Simple table cardinality is sufficient for the above query but fails in other
   175  queries. Consider:
   176  
   177  	SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y > 10
   178  
   179  Table statistics might indicate that a contains 10x more data than b, but the
   180  predicate a.y > 10 is filtering a chunk of the table. What we care about is
   181  whether the result of the scan *after* filtering returns more rows than the
   182  scan of b. This can be accomplished by making a determination of the
   183  selectivity of the predicate a.y > 10 (the % of rows it will filter) and then
   184  multiplying that selectivity by the cardinality of a. The common technique for
   185  estimating selectivity is to collect a histogram on a.y.
   186  
   187  The collection of table statistics occurs prior to receiving the query. As
   188  such, the statistics are necessarily out of date and may be inaccurate. The
   189  system may bound the inaccuracy by recomputing the stats based on how fast a
   190  table is being modified. Or the system may notice when stat estimations are
   191  inaccurate during query execution.
   192  
   193  Cost Model
   194  
   195  The cost model takes an expression as input and computes an estimated "cost"
   196  to execute that expression. The unit of "cost" can be arbitrary, though it is
   197  desirable if it has some real world meaning such as expected execution time.
   198  What is required is for the costs of different query plans to be comparable.
   199  The optimizer seeks to find the shortest expected execution time for a query
   200  and uses cost as a proxy for execution time.
   201  
   202  Cost is roughly calculated by estimating how much time each node in the
   203  expression tree will use to process all results and modeling how data flows
   204  through the expression tree. Table statistics are used to power cardinality
   205  estimates of base relations which in term power cardinality estimates of
   206  intermediate relations. This is accomplished by propagating histograms of
   207  column values from base relations up through intermediate nodes (e.g. combining
   208  histograms from the two join inputs into a single histogram). Operator-specific
   209  computations model the network, disk and CPU costs. The cost model should
   210  include data layout and the specific operating environment. For example,
   211  network RTT in one cluster might be vastly different than another.
   212  
   213  Because the cost for a query plan is an estimate, there is an associated error.
   214  This error might be implicit in the cost, or could be explicitly tracked. One
   215  advantage to explicitly tracking the expected error is that it can allow
   216  selecting a higher cost but lower expected error plan over a lower cost but
   217  higher expected error plan. Where does the error come from? One source is the
   218  innate inaccuracy of stats: selectivity estimation might be wildly off due to
   219  an outlier value. Another source is the accumulated build up of estimation
   220  errors the higher up in the query tree. Lastly, the cost model is making an
   221  estimation for the execution time of an operation such as a network RTT. This
   222  estimate can also be wildly inaccurate due to bursts of activity.
   223  
   224  Search finds the lowest cost plan using dynamic programming. That imposes a
   225  restriction on the cost model: it must exhibit optimal substructure. An optimal
   226  solution can be constructed from optimal solutions of its sub-problems.
   227  
   228  Memo
   229  
   230  Memo is a data structure for efficiently storing a forest of query plans.
   231  Conceptually, the memo is composed of a numbered set of equivalency classes
   232  called groups where each group contains a set of logically equivalent
   233  expressions. The different expressions in a single group are called memo
   234  expressions (memo-ized expressions). A memo expression has a list of child
   235  groups as its children rather than a list of individual expressions. The
   236  forest is composed of every possible combination of parent expression with
   237  its children, recursively applied.
   238  
   239  Memo expressions can be relational (e.g. join) or scalar (e.g. <). Operators
   240  are always both logical (specify results) and physical (specify results and a
   241  particular implementation). This means that even a "raw" unoptimized expression
   242  tree can be executed (naively). Both relational and scalar operators are
   243  uniformly represented as nodes in memo expression trees, which facilitates tree
   244  pattern matching and replacement.
   245  
   246  Because memo groups contain logically equivalent expressions, all the memo
   247  expressions in a group share the same logical properties. However, it's
   248  possible for two logically equivalent expressions to be placed in different
   249  memo groups. This occurs because determining logical equivalency of two
   250  relational expressions is too complex to perform 100% correctly. A correctness
   251  failure (i.e. considering two expressions logically equivalent when they are
   252  not) results in invalid transformations and invalid plans. But placing two
   253  logically equivalent expressions in different groups has a much gentler failure
   254  mode: the memo and transformations are less efficient. Expressions within the
   255  memo may have different physical properties. For example, a memo group might
   256  contain both hash join and merge join expressions which produce the same set of
   257  output rows, but produce them in different orders.
   258  
   259  Expressions are inserted into the memo by the factory, which ensure that
   260  expressions have been fully normalized before insertion (see the Prep section
   261  for more details). A new group is created only when unique normalized
   262  expressions are created by the factory during construction or rewrite of the
   263  tree. Uniqueness is determined by computing the fingerprint for a memo
   264  expression, which is simply the expression operator and its list of child
   265  groups. For example, consider this query:
   266  
   267  	SELECT * FROM a, b WHERE a.x = b.x
   268  
   269  After insertion into the memo, the memo would contain these six groups:
   270  
   271  	6: [inner-join [1 2 5]]
   272  	5: [eq [3 4]]
   273  	4: [variable b.x]
   274  	3: [variable a.x]
   275  	2: [scan b]
   276  	1: [scan a]
   277  
   278  The fingerprint for the inner-join expression is [inner-join [1 2 5]]. The
   279  memo maintains a map from expression fingerprint to memo group which allows
   280  quick determination of whether the normalized form of an expression already
   281  exists in the memo.
   282  
   283  The normalizing factory will never add more than one expression to a memo
   284  group. But the explorer (see Search section for more details) does add
   285  denormalized expressions to existing memo groups, since oftentimes one of these
   286  equivalent, but denormalized expressions will have a lower cost than the
   287  initial normalized expression added by the factory. For example, the join
   288  commutativity transformation expands the memo like this:
   289  
   290  	6: [inner-join [1 2 5]] [inner-join [2 1 5]]
   291  	5: [eq [3 4]]
   292  	4: [variable b.x]
   293  	3: [variable a.x]
   294  	2: [scan b]
   295  	1: [scan a]
   296  
   297  Notice that there are now two expressions in memo group 6. The coster (see Cost
   298  Model section for more details) will estimate the execution cost of each
   299  expression, and the optimizer will select the lowest cost alternative.
   300  
   301  Transforms
   302  
   303  Transforms convert an input expression tree into zero or more logically
   304  equivalent trees. Transforms consist of two parts: a "match pattern" and a
   305  "replace pattern". Together, the match pattern and replace pattern are called a
   306  "rule". Transform rules are categorized as "normalization" or "exploration"
   307  rules.
   308  
   309  If an expression in the tree matches the match pattern, then a new expression
   310  will be constructed according to the replace pattern. Note that "replace" means
   311  the new expression is a logical replacement for the existing expression, not
   312  that the existing expression needs to physically be replaced. Depending on the
   313  context, the existing expression may be discarded, or it may be retained side-
   314  by-side with the new expression in the memo group.
   315  
   316  Normalization rules are cost-agnostic, as they are always considered to be
   317  beneficial. All normalization rules are implemented by the normalizing factory,
   318  which does its best to map all logically equivalent expression trees to a
   319  single canonical form from which searches can branch out. See the Prep section
   320  for more details.
   321  
   322  Exploration rules generate equivalent expression trees that must be costed in
   323  order to determine the lowest cost alternative. All exploration rules are
   324  implemented by the explorer, which is optimized to efficiently enumerate all
   325  possible expression tree combinations in the memo in order to look for rule
   326  matches. When it finds a match, the explorer applies the rule and adds an
   327  equivalent expression to the existing memo group. See the Search section for
   328  more details.
   329  
   330  Some examples of transforms:
   331  
   332  	Join commutativity
   333  	Swaps the order of the inputs to an inner join.
   334  		SELECT * FROM a, b => SELECT * FROM b, a
   335  
   336  	Join associativity
   337  	Reorders the children of a parent and child join
   338  		SELECT * FROM (SELECT * FROM a, b), c
   339  		=>
   340  		SELECT * FROM (SELECT * FROM a, c), b
   341  
   342  	Predicate pushdown
   343  	Moves predicates below joins
   344  		SELECT * FROM a, b USING (x) WHERE a.x < 10
   345  		=>
   346  		SELECT * FROM (SELECT * FROM a WHERE a.x < 10), b USING (x)
   347  
   348  	Join elimination
   349  	Removes unnecessary joins based on projected columns and foreign keys.
   350  		SELECT a.x FROM a, b USING (x)
   351  		=>
   352  		SELECT a.x FROM a
   353  
   354  	Distinct/group-by elimination
   355  	Removes unnecessary distinct/group-by operations based on keys.
   356  		SELECT DISTINCT a.x FROM a
   357  		=>
   358  		SELECT a.x FROM a
   359  
   360  	Predicate inference
   361  	Adds predicates based on filter conditions.
   362  		SELECT * FROM a, b USING (x)
   363  		=>
   364  		SELECT * FROM a, b USING (x) WHERE a.x IS NOT NULL AND b.x IS NOT NULL
   365  
   366  	Decorrelation
   367  	Replaces correlated subqueries with semi-join, anti-join and apply ops.
   368  
   369  	Scan to index scan
   370  	Transforms scan operator into one or more index scans on covering indexes.
   371  
   372  	Inner join to merge join
   373  	Generates alternate merge-join operator from default inner-join operator.
   374  
   375  Much of the optimizer's rule matching and application code is generated by a
   376  tool called Optgen, short for "optimizer generator". Optgen is a domain-
   377  specific language (DSL) that provides an intuitive syntax for defining
   378  transform rules. Here is an example:
   379  
   380    [NormalizeEq]
   381    (Eq
   382      $left:^(Variable)
   383      $right:(Variable)
   384    )
   385    =>
   386    (Eq $right $left)
   387  
   388  The expression above the arrow is the match pattern and the expression below
   389  the arrow is the replace pattern. This example rule will match Eq expressions
   390  which have a left input which is not a Variable operator and a right input
   391  which is a Variable operator. The replace pattern will trigger a replacement
   392  that reverses the two inputs. In addition, custom match and replace functions
   393  can be defined in order to run arbitrary Go code.
   394  
   395  Prep
   396  
   397  Prep (short for "prepare") is the first phase of query optimization, in which
   398  the annotated AST is transformed into a single normalized "expression tree".
   399  The optimizer directly creates the expression tree in the memo data structure
   400  rather than first constructing an intermediate data structure. A forest of
   401  equivalent trees will be generated in later phases, but at the end of the prep
   402  phase, the memo contains just one normalized tree that is logically equivalent
   403  to the SQL query.
   404  
   405  During the prep phase, placeholder values are not yet known, so normalization
   406  cannot go as far as it can during later phases. However, this also means that
   407  the resulting expression tree can be cached in response to a PREPARE statement,
   408  and then be reused as a starting point each time an EXECUTE statement provides
   409  new placeholder values.
   410  
   411  The memo expression tree is constructed by the normalizing factory, which does
   412  its best to map all logically equivalent expression trees to a single canonical
   413  form from which searches can branch out. The factory has an interface similar
   414  to this:
   415  
   416  	ConstructConst(value PrivateID) GroupID
   417  	ConstructAnd(conditions ListID) GroupID
   418  	ConstructInnerJoin(left GroupID, right GroupID, on GroupID) GroupID
   419  
   420  The factory methods construct a memo expression tree bottom-up, with each memo
   421  group becoming an input to operators higher in the tree.
   422  
   423  As each expression is constructed by the factory, it transitively applies
   424  normalization rules defined for that expression type. This may result in the
   425  construction of a different type of expression than what was requested. If,
   426  after normalization, the expression is already part of the memo, then
   427  construction is a no-op. Otherwise, a new memo group is created, with the
   428  normalized expression as its first and only expression.
   429  
   430  By applying normalization rules as the expression tree is constructed, the
   431  factory can avoid creating intermediate expressions; often, "replacement" of
   432  an existing expression means it's never created to begin with.
   433  
   434  During Prep, all columns used by the SQL query are given a numeric index that
   435  is unique across the query. Column numbering involves assigning every base
   436  column and non-trivial projection in a query a unique, query-specific index.
   437  Giving each column a unique index allows the expression nodes mentioned above
   438  to track input and output columns, or really any set of columns during Prep and
   439  later phases, using a bitmap (FastIntSet). The bitmap representation allows
   440  fast determination of compatibility between expression nodes and is utilized by
   441  transforms to determine the legality of such operations.
   442  
   443  The Prep phase also computes logical properties, such as the input and output
   444  columns of each (sub-)expression, equivalent columns, not-null columns and
   445  functional dependencies. These properties are computed bottom-up as part of
   446  constructing the expression tree.
   447  
   448  Rewrite
   449  
   450  Rewrite is the second phase of query optimization. Placeholder values are
   451  available starting at this phase, so new normalization rules will typically
   452  match once constant values are substituted for placeholders. As mentioned in
   453  the previous section, the expression tree produced by the Prep phase can be
   454  cached and serve as the starting point for the Rewrite phase. In addition, the
   455  Rewrite phase takes a set of physical properties that are required from the
   456  result, such as row ordering and column naming.
   457  
   458  The Rewrite and Search phases have significant overlap. Both phases perform
   459  transformations on the expression tree. However, Search preserves the matched
   460  expression side-by-side with the new expression, while Rewrite simply discards
   461  the matched expression, since the new expression is assumed to always be
   462  better. In addition, the application of exploration rules may trigger
   463  additional normalization rules, which may in turn trigger additional
   464  exploration rules.
   465  
   466  Together, the Rewrite and Search phases are responsible for finding the
   467  expression that can provide the required set of physical properties at the
   468  lowest possible execution cost. That mandate is recursively applied; in other
   469  words, each subtree is also optimized with respect to a set of physical
   470  properties required by its parent, and the goal is to find the lowest cost
   471  equivalent expression. An example of an "interior" optimization goal is a merge
   472  join that requires its inner child to return its rows in a specific order. The
   473  same group can be (and sometimes is) optimized multiple times, but with
   474  different required properties each time.
   475  
   476  Search
   477  
   478  Search is the final phase of optimization. Search begins with a single
   479  normalized tree that was created by the earlier phases. For each group, the
   480  "explorer" component generates alternative expressions that are logically
   481  equivalent to the normalized expression, but which may have very different
   482  execution plans. The "coster" component computes the estimated cost for each
   483  alternate expression. The optimizer remembers the "best expression" for each
   484  group, for each set of physical properties required of that group.
   485  
   486  Optimization of a group proceeds in two phases:
   487  
   488  1. Compute the cost of any previously generated expressions. That set initially
   489  contains only the group's normalized expression, but exploration may yield
   490  additional expressions. Costing a parent expression requires that the children
   491  first be costed, so costing triggers a recursive traversal of the memo groups.
   492  
   493  2. Invoke the explorer to generate new equivalent expressions for the group.
   494  Those new expressions are costed once the optimizer loops back to the first
   495  phase.
   496  
   497  In order to avoid a combinatorial explosion in the number of expression trees,
   498  the optimizer utilizes the memo structure. Due to the large number of possible
   499  plans for some queries, the optimizer cannot always explore all of them.
   500  Therefore, it proceeds in multiple iterative "passes", until either it hits
   501  some configured time or resource limit, or until an exhaustive search is
   502  complete. As long as the search is allowed to complete, the best plan will be
   503  found, just as in Volcano and Cascades.
   504  
   505  The optimizer uses several techniques to maximize the chance that it finds the
   506  best plan early on:
   507  
   508  - As with Cascades, the search is highly directed, interleaving exploration
   509  with costing in order to prune parts of the tree that cannot yield a better
   510  plan. This contrasts with Volcano, which first generates all possible plans in
   511  one global phase (exploration), and then determines the lowest cost plan in
   512  another global phase (costing).
   513  
   514  - The optimizer uses a simple hill climbing heuristic to make greedy progress
   515  towards the best plan. During a given pass, the optimizer visits each group and
   516  performs costing and exploration for that group. As long as doing that yields a
   517  lower cost expression for the group, the optimizer will repeat those steps.
   518  This finds a local maxima for each group during the current pass.
   519  
   520  In order to avoid costing or exploring parts of the search space that cannot
   521  yield a better plan, the optimizer performs aggressive "branch and bound
   522  pruning". Each group expression is optimized with respect to a "budget"
   523  parameter. As soon as this budget is exceeded, optimization of that expression
   524  terminates. It's not uncommon for large sections of the search space to never
   525  be costed or explored due to this pruning. Example:
   526  
   527  	innerJoin
   528  		left:  cost = 50
   529  		right: cost = 75
   530  		on:    cost = 25
   531  
   532  If the current best expression for the group has a cost of 100, then the
   533  optimizer does not need to cost or explore the "on" child of the join, and
   534  does not need to cost the join itself. This is because the combined cost of
   535  the left and right children already exceeds 100.
   536  */
   537  package opt