// Copyright 2018 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

/*
Package lang implements a language called Optgen, short for "optimizer
generator". Optgen is a domain-specific language (DSL) that provides an
intuitive syntax for defining, matching, and replacing nodes in a target
expression tree. Here is an example:

  [NormalizeEq]
  (Eq
    $left:^(Variable)
    $right:(Variable)
  )
  =>
  (Eq $right $left)

The expression above the arrow is called the "match pattern" and the expression
below the arrow is called the "replace pattern". If a node in the target
expression tree matches the match pattern, then it will be replaced by a node
that is constructed according to the replace pattern. Together, the match
pattern and replace pattern are called a "rule".

In addition to rules, the Optgen language includes "definitions". Each
definition names and describes one of the nodes that the target expression tree
may contain. Match and replace patterns can recognize and construct these
nodes. Here is an example:

  define Eq {
    Left  Expr
    Right Expr
  }

The following sections provide more detail on the Optgen language syntax and
semantics, as well as some implementation notes.

Definitions

Optgen language input files may contain any number of definitions, in any
order. Each definition describes a node in the target expression tree. A
definition has a name and a set of "fields" which describe the node's children.
A definition may have zero fields, in which case it describes a node with zero
children, which is always a "leaf" in the expression tree.

A field definition consists of two parts - the field's name and its type. The
Optgen parser treats the field's type as an opaque identifier; it's up to other
components to interpret it. However, typically the field type refers to either
some primitive type (like string or int), or else refers to the name of some
other operator or group of operators.

Here is the syntax for an operator definition:

  define <name> {
    <field-1-name> <field-1-type>
    <field-2-name> <field-2-type>
    ...
  }

And here is an example:

  define Join {
    Left  Expr
    Right Expr
    On    Expr
  }

Definition Tags

A "definition tag" is an opaque identifier that describes some property of the
defined node. Definitions can have multiple tags or no tags at all, and the
same tag can be attached to multiple definitions. Tags can be used to group
definitions according to some shared property or logical grouping. For example,
arithmetic or boolean operators might be grouped together. Match patterns can
then reference those tags in order to match groups of nodes (see "Matching
Names" section).

Here is the definition tagging syntax:

  [<tag-1-name>, <tag-2-name>, ...]
  define <name> {
  }

And here is an example:

  [Comparison, Inequality]
  define Lt {
    Left  Expr
    Right Expr
  }

Rules

Optgen language input files may contain any number of rules, in any order. Each
rule has a unique name and consists of a match pattern and a corresponding
replace pattern. A rule's match pattern is tested against every node in the
target expression tree, bottom-up. Each matching node is replaced by a node
constructed according to the replace pattern. The replacement node is itself
tested against every rule, and so on, until no further rules match.

Note that this is just a conceptual description. Optgen does not actually do
any of this matching or replacing itself. Other components use the Optgen
library to generate code. These components are free to match however they want,
and to replace nodes or keep the new and old nodes side-by-side (as with a
typical optimizer MEMO structure).
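
The conceptual description above can be sketched in Go. The sketch below
hand-codes the NormalizeEq rule from the introduction and applies it bottom-up
until no rule matches. The Node and Rule types and the Rewrite function are
hypothetical names invented for this illustration; they are not part of the
lang package, and real generated code may look quite different.

```go
package main

import "fmt"

// Node is a hypothetical expression tree node used only for illustration.
type Node struct {
	Op       string
	Children []*Node
}

// Rule is a hand-written stand-in for a compiled rule. It returns the
// replacement node and true if the match pattern matched.
type Rule func(n *Node) (*Node, bool)

// normalizeEq mirrors the NormalizeEq rule:
//   (Eq $left:^(Variable) $right:(Variable)) => (Eq $right $left)
func normalizeEq(n *Node) (*Node, bool) {
	if n.Op != "Eq" || len(n.Children) != 2 {
		return nil, false
	}
	left, right := n.Children[0], n.Children[1]
	if left.Op == "Variable" || right.Op != "Variable" {
		return nil, false
	}
	return &Node{Op: "Eq", Children: []*Node{right, left}}, true
}

// Rewrite rewrites the children first (bottom-up), then tests every rule
// against the node. A replacement is rewritten again, so the process stops
// only when no rule matches. Rules that never converge would loop forever.
func Rewrite(n *Node, rules []Rule) *Node {
	for i, c := range n.Children {
		n.Children[i] = Rewrite(c, rules)
	}
	for _, r := range rules {
		if repl, ok := r(n); ok {
			return Rewrite(repl, rules)
		}
	}
	return n
}

// String prints the node in a parenthesized form similar to Optgen patterns.
func (n *Node) String() string {
	s := "(" + n.Op
	for _, c := range n.Children {
		s += " " + c.String()
	}
	return s + ")"
}

func main() {
	tree := &Node{Op: "Eq", Children: []*Node{{Op: "Const"}, {Op: "Variable"}}}
	fmt.Println(Rewrite(tree, []Rule{normalizeEq}))
}
```

This sketch replaces nodes in place; a memo-based optimizer would instead keep
the old and new nodes side-by-side.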

Similar to define statements, a rule may have a set of tags associated with it.
Rule tags logically group rules, and can also serve as directives to the code
generator.

Here is the partial rule syntax (see Syntax section for full syntax):

  [<rule-name>, <tag-1-name>, <tag-2-name>, ...]
  (<match-opname>
    <match-expr>
    <match-expr>
    ...
  )
  =>
  (<replace-opname>
    <replace-expr>
    <replace-expr>
    ...
  )

Match Patterns

The top-level match pattern matches the name and children of one or more nodes
in the target expression tree. For example:

  (Eq * *)

The "*" character is the "wildcard matcher", which matches a child of any kind.
Therefore, this pattern matches any node named "Eq" that has at least two
children. Matchers can be nested within one another in order to match children,
grandchildren, etc. For example:

  (Eq (Variable) (Const))

This pattern matches an "Eq" node with a "Variable" node as its left child
and a "Const" node as its right child.

Binding

Child patterns within match and replace patterns can be "bound" to a named
variable. These variables can then be referenced later in the match pattern or
in the replace pattern. This is a critical part of the Optgen language, since
virtually every pattern constructs its replacement pattern based on parts of
the match pattern. For example:

  [EliminateSelect]
  (Select $input:* (True)) => $input

The $input variable is bound to the first child of the "Select" node. If the
second child is a "True" node, then the "Select" node will be replaced by its
input. Variables can also be passed as arguments to custom matchers, which are
described below.

Matching Names

In addition to simple name matching, a node matcher can match tag names. Any
node type which has the named tag is matched. For example:

  [Inequality]
  define Lt {
    Left  Expr
    Right Expr
  }

  [Inequality]
  define Gt {
    Left  Expr
    Right Expr
  }

  (Inequality (Variable) (Const))

This pattern matches either "Lt" or "Gt" nodes. This is useful for writing
patterns that apply to multiple kinds of nodes, without need for duplicate
patterns that only differ in matched node name.

The node matcher also supports multiple names in the match list, separated by
'|' characters. The node's name is allowed to match any of the names in the
list. For example:

  (Eq | Ne | Inequality)

This pattern matches "Eq", "Ne", "Lt", or "Gt" nodes.
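
One way to picture name matching is as a lookup that accepts either the node's
own name or any of its tags. The Go sketch below assumes a hypothetical
tagsByOp table built from the definitions above and a hypothetical
MatchesNames helper; neither exists in the lang package.

```go
package main

import "fmt"

// tagsByOp is a hypothetical table mapping each defined operator to its
// definition tags. Eq and Ne are assumed to carry no tags here.
var tagsByOp = map[string][]string{
	"Eq": nil,
	"Ne": nil,
	"Lt": {"Inequality"},
	"Gt": {"Inequality"},
}

// MatchesNames reports whether op matches any entry in names, where each
// entry may be either an operator name or a tag name.
func MatchesNames(op string, names ...string) bool {
	for _, name := range names {
		if name == op {
			return true
		}
		for _, tag := range tagsByOp[op] {
			if tag == name {
				return true
			}
		}
	}
	return false
}

func main() {
	// Corresponds to the name list in (Eq | Ne | Inequality).
	for _, op := range []string{"Eq", "Ne", "Lt", "Gt", "Plus"} {
		fmt.Println(op, MatchesNames(op, "Eq", "Ne", "Inequality"))
	}
}
```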

Matching Primitive Types

String and numeric constant nodes in the tree can be matched against literals.
A literal string or number in a match pattern is interpreted as a matcher of
that type, and will be tested for equality with the child node. For example:

  [EliminateConcat]
  (Concat $left:* (Const "")) => $left

If Concat's right operand is a constant expression with the empty string as its
value, then the pattern matches. Similarly, a constant numeric expression can be
matched like this:

  [LimitScan]
  (Limit (Scan $def:*) (Const 1)) => (ScanOneRow $def)

Matching Lists

Nodes can have a child that is a list of nodes rather than a single node. As an
example, a function call node might have two children: the name of the function
and the list of arguments to the function:

  define FuncCall {
    Name Expr
    Args ExprList
  }

There are several kinds of list matchers, each of which uses a variant of the
list matching bracket syntax. The ellipses signify that 0 or more items can
match at either the beginning or end of the list. The item pattern can be any
legal match pattern, and can be bound to a variable.

  [ ... <item pattern> ... ]

- ANY: Matches if any item in the list matches the item pattern. If multiple
items match, then the list matcher binds to the first match.

  [ ... $item:* ... ]

- FIRST: Matches if the first item in the list matches the item pattern (and
there is at least one item in the list).

  [ $item:* ... ]

- LAST: Matches if the last item in the list matches the item pattern (and
there is at least one item).

  [ ... $item:* ]

- SINGLE: Matches if there is exactly one item in the list, and it matches the
item pattern.

  [ $item:* ]

- EMPTY: Matches if there are zero items in the list.

  []

Following is a more complete example. The ANY list matcher in the example
searches through the Filter child's list, looking for a Subquery node. If a
matching node is found, then the list matcher succeeds and binds the node to
the $item variable.

  (Select
    $input:*
    (Filter [ ... $item:(Subquery) ... ])
  )
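
The five list matchers can be restated as small Go functions. This is only a
sketch of the semantics described above: list items are reduced to their
operator names, and ItemPattern and the Match functions are hypothetical names
that do not come from the lang package.

```go
package main

import "fmt"

// ItemPattern is a hypothetical stand-in for a compiled item pattern. Items
// are represented by their operator names to keep the sketch short.
type ItemPattern func(item string) bool

// MatchAny models [ ... <item> ... ]: it binds the first matching item.
func MatchAny(list []string, pat ItemPattern) (string, bool) {
	for _, item := range list {
		if pat(item) {
			return item, true
		}
	}
	return "", false
}

// MatchFirst models [ <item> ... ].
func MatchFirst(list []string, pat ItemPattern) (string, bool) {
	if len(list) > 0 && pat(list[0]) {
		return list[0], true
	}
	return "", false
}

// MatchLast models [ ... <item> ].
func MatchLast(list []string, pat ItemPattern) (string, bool) {
	if n := len(list); n > 0 && pat(list[n-1]) {
		return list[n-1], true
	}
	return "", false
}

// MatchSingle models [ <item> ].
func MatchSingle(list []string, pat ItemPattern) (string, bool) {
	if len(list) == 1 && pat(list[0]) {
		return list[0], true
	}
	return "", false
}

// MatchEmpty models [].
func MatchEmpty(list []string) bool {
	return len(list) == 0
}

func main() {
	isSubquery := func(item string) bool { return item == "Subquery" }
	filter := []string{"Eq", "Subquery", "Lt"}
	fmt.Println(MatchAny(filter, isSubquery))
	fmt.Println(MatchFirst(filter, isSubquery))
}
```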

Custom Matching

When the supported matching syntax is insufficient, Optgen provides an escape
mechanism. Custom matching functions can invoke Go functions, passing
previously bound variables as arguments, and checking the boolean result for a
match. For example:

  [EliminateFilters]
  (Filters $items:* & (IsEmptyList $items)) => (True)

This pattern passes the $items child node to the IsEmptyList function. If that
returns true, then the pattern matches.

Custom matching functions can appear anywhere that other matchers can, and can
be combined with other matchers using boolean operators (see the Boolean
Expressions section for more details). While variable references are the most
common argument, it is also legal to nest function invocations:

  (Project
    $input:*
    $projections:* & ^(IsEmpty (FindUnusedColumns $projections))
  )

Boolean Expressions

Multiple match expressions of any type can be combined using the boolean &
(AND) operator. All must match in order for the overall match to succeed:

  (Not
    $input:(Comparison) & (Inequality) & (CanInvert $input)
  )

The boolean ^ (NOT) operator negates the result of a boolean match expression.
It can be used with any kind of matcher, including custom match functions:

  (JoinApply
    $left:^(Select)
    $right:* & ^(IsCorrelated $right $left)
    $on:*
  )

This pattern matches only if the left child is not a Select node, and if the
IsCorrelated custom function returns false.
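
The & and ^ operators behave like ordinary predicate combinators. The Go sketch
below models that behavior with a hypothetical Matcher function type; the names
Op, And, and Not are invented for this example and say nothing about how a code
generator actually emits these tests.

```go
package main

import "fmt"

// Node is a hypothetical expression tree node.
type Node struct {
	Op       string
	Children []*Node
}

// Matcher is a hypothetical compiled match expression.
type Matcher func(n *Node) bool

// Op matches a node by operator name, like (Select).
func Op(name string) Matcher {
	return func(n *Node) bool { return n.Op == name }
}

// And succeeds only if every operand matches, like a & b & c.
func And(ms ...Matcher) Matcher {
	return func(n *Node) bool {
		for _, m := range ms {
			if !m(n) {
				return false
			}
		}
		return true
	}
}

// Not negates the result of its operand, like ^a.
func Not(m Matcher) Matcher {
	return func(n *Node) bool { return !m(n) }
}

func main() {
	// isLeaf stands in for a custom match function written in Go.
	isLeaf := Matcher(func(n *Node) bool { return len(n.Children) == 0 })
	// Roughly equivalent to the match expression: ^(Select) & (IsLeaf $n)
	m := And(Not(Op("Select")), isLeaf)
	fmt.Println(m(&Node{Op: "Scan"}), m(&Node{Op: "Select"}))
}
```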

Replace Patterns

Once a matching node is found, the replace pattern produces a single
substitution node. The most common replace pattern involves constructing one or
more new nodes, often with child nodes that were bound in the match pattern.
A construction expression specifies the name of the node as its first operand
and its children as subsequent arguments. Construction expressions can be
nested within one another to any depth. For example:

  [HoistSelectExists]
  (Select
    $input:*
    $filter:(Exists $subquery:*)
  )
  =>
  (SemiJoinApply
    $input
    $subquery
    (True)
  )

The replace pattern in this rule constructs a new SemiJoinApply node, with its
first two children bound in the match pattern. The third child is a newly
constructed True node.

The replace pattern can also consist of a single variable reference, in the
case where the substitution node was already present in the match pattern:

  [EliminateAnd]
  (And $left:* (True)) => $left

Custom Construction

When Optgen syntax cannot easily produce a result, custom construction
functions allow results to be derived in Go code. If a construction
expression's name is not recognized as a node name, then it is assumed to be
the name of a custom function. For example:

  [MergeSelectJoin]
  (Select
    (InnerJoin $r:* $s:* $on:*)
    $filter:*
  )
  =>
  (InnerJoin
    $r
    $s
    (ConcatFilters $on $filter)
  )

Here, the ConcatFilters custom function is invoked in order to concatenate two
filter lists together. Function parameters can include nodes, lists (see the
Constructing Lists section), operator names (see the Name Parameters section),
and the results of nested custom function calls. While custom functions
typically return a node, they can return other types if they are parameters to
other custom functions.

Constructing Lists

Lists can be constructed and passed as parameters to node construction
expressions or custom replace functions. A list consists of multiple items that
can be of any parameter type, including nodes, strings, custom function
invocations, or lists. Here is an example:

  [MergeSelectJoin]
  (Select
    (InnerJoin $left:* $right:* $on:*)
    $filters:*
  )
  =>
  (InnerJoin
    $left
    $right
    (And [$on $filters])
  )

Dynamic Construction

Sometimes the name of a constructed node can be one of several choices. The
built-in "OpName" function can be used to dynamically construct the right kind
of node. For example:

  [NormalizeVar]
  (Eq | Ne
    $left:^(Variable)
    $right:(Variable)
  )
  =>
  ((OpName) $right $left)

In this pattern, the name of the constructed result is either Eq or Ne,
depending on which is matched. When the OpName function has no arguments, it
is bound to the name of the node matched at the top level. The OpName
function can also take a single variable reference argument. In that case, it
refers to the name of the node bound to that variable:

  [PushDownSelect]
  (Select
    $input:(Join $left:* $right:* $on:*)
    $filter:* & ^(IsCorrelated $filter $right)
  )
  =>
  ((OpName $input)
    (Select $left $filter)
    $right
    $on
  )

In this pattern, Join is a tag that refers to a group of nodes. The replace
expression will construct a node having the same name as the matched join node.

Name Parameters

The OpName built-in function can also be a parameter to a custom match or
replace function which needs to know which name matched. For example:

  [FoldBinaryNull]
  (Binary $left:* (Null) & ^(HasNullableArgs (OpName)))
  =>
  (Null)

The name of the matched Binary node (e.g. Plus, In, Contains) is passed to the
HasNullableArgs function as a symbolic identifier. Here is an example that uses
a custom replace function and the OpName function with an argument:

  [NegateComparison]
  (Not $input:(Comparison $left:* $right:*))
  =>
  (InvertComparison (OpName $input) $left $right)

As described in the previous section, adding the argument enables OpName to
return a name that was matched deeper in the pattern.

In addition to a name returned by the OpName function, custom match and replace
functions can accept literal operator names as parameters. The Minus operator
name is passed as a parameter to two functions in this example:

  [FoldMinus]
  (UnaryMinus
    (Minus $left:* $right:*) & (OverloadExists Minus $right $left)
  )
  =>
  (ConstructBinary Minus $right $left)

Type Inference

Expressions in both the match and replace patterns are assigned a data type
that describes the kind of data that will be returned by the expression. These
types are inferred using a combination of top-down and bottom-up type inference
rules. For example:

  define Select {
    Input  Expr
    Filter Expr
  }

  (Select $input:(LeftJoin | RightJoin) $filter:*) => $input

The type of $input is inferred as "LeftJoin | RightJoin" by bubbling up the type
of the bound expression. That type is propagated to the $input reference in the
replace pattern. By contrast, the type of the * expression is inferred to be
"Expr" using a top-down type inference rule, since the second argument to the
Select operator is known to have type "Expr".

When multiple types are inferred for an expression using different type
inference rules, the more restrictive type is assigned to the expression. For
example:

  (Select $input:* & (LeftJoin)) => $input

Here, the left input to the And expression was inferred to have type "Expr" and
the right input to have type "LeftJoin". Since "LeftJoin" is the more
restrictive type, the And expression and the $input binding are typed as
"LeftJoin".

Type inference detects and reports type contradictions, which occur when
multiple incompatible types are inferred for an expression. For example:

  (Select $input:(InnerJoin) & (LeftJoin)) => $input

Because the input cannot be both an InnerJoin and a LeftJoin, Optgen reports a
type contradiction error.

Syntax

This section describes the Optgen language syntax in a variant of extended
Backus-Naur form. The non-terminals correspond to methods in the parser. The
terminals correspond to tokens returned by the scanner. Whitespace and
comment tokens can be freely interleaved between other tokens in the
grammar.

  root         = tags (define | rule)
  tags         = '[' IDENT (',' IDENT)* ']'

  define       = 'define' define-name '{' define-field* '}'
  define-name  = IDENT
  define-field = field-name field-type
  field-name   = IDENT
  field-type   = IDENT

  rule         = match '=>' replace
  match        = func
  replace      = func | ref
  func         = '(' func-name arg* ')'
  func-name    = names | func
  names        = name ('|' name)*
  arg          = bind and | ref | and
  and          = expr ('&' and)?
  expr         = func | not | list | any | name | STRING | NUMBER
  not          = '^' expr
  list         = '[' list-child* ']'
  list-child   = list-any | arg
  list-any     = '...'
  bind         = '$' label ':' and
  ref          = '$' label
  any          = '*'
  name         = IDENT
  label        = IDENT

Here are the pseudo-regex definitions for the lexical tokens that aren't
represented as single-quoted strings above:

  STRING     = " [^"\n]* "
  NUMBER     = UnicodeDigit+
  IDENT      = (UnicodeLetter | '_') (UnicodeLetter | '_' | UnicodeNumber)*
  COMMENT    = '#' .* \n
  WHITESPACE = UnicodeSpace+

The support directory contains syntax coloring files for several editors,
including Vim, TextMate, and Visual Studio Code. JetBrains editors (e.g.
GoLand) can also import TextMate bundles to provide syntax coloring.

Components

The lang package contains a scanner that breaks input files into lexical
tokens, a parser that assembles an abstract syntax tree (AST) from the tokens,
and a compiler that performs semantic checks and creates a rudimentary symbol
table.

The compiled rules and definitions become the input to a separate code
generation package which generates parts of the CockroachDB SQL optimizer.
However, the Optgen language itself is not specific to CockroachDB or SQL, and
can be used in other contexts. For example, the Optgen language parser
generates its own AST expressions using itself (compiler bootstrapping).
*/
package lang