github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/pkg/sql/opt/optgen/lang/doc.go

// Copyright 2018 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

/*
Package lang implements a language called Optgen, short for "optimizer
generator". Optgen is a domain-specific language (DSL) that provides an
intuitive syntax for defining, matching, and replacing nodes in a target
expression tree. Here is an example:

  [NormalizeEq]
  (Eq
    $left:^(Variable)
    $right:(Variable)
  )
  =>
  (Eq $right $left)

The expression above the arrow is called the "match pattern" and the expression
below the arrow is called the "replace pattern". If a node in the target
expression tree matches the match pattern, then it will be replaced by a node
that is constructed according to the replace pattern. Together, the match
pattern and replace pattern are called a "rule".

In addition to rules, the Optgen language includes "definitions". Each
definition names and describes one of the nodes that the target expression tree
may contain. Match and replace patterns can recognize and construct these
nodes. Here is an example:

  define Eq {
    Left  Expr
    Right Expr
  }

The following sections provide more detail on the Optgen language syntax and
semantics, as well as some implementation notes.

Definitions

Optgen language input files may contain any number of definitions, in any
order. Each definition describes a node in the target expression tree. A
definition has a name and a set of "fields" which describe the node's children.
A definition may have zero fields, in which case it describes a node with zero
children, which is always a "leaf" in the expression tree.

A field definition consists of two parts - the field's name and its type. The
Optgen parser treats the field's type as an opaque identifier; it's up to other
components to interpret it. However, typically the field type refers to either
some primitive type (like string or int), or else refers to the name of some
other operator or group of operators.

Here is the syntax for an operator definition:

  define <name> {
    <field-1-name> <field-1-type>
    <field-2-name> <field-2-type>
    ...
  }

And here is an example:

  define Join {
    Left  Expr
    Right Expr
    On    Expr
  }

Definition Tags

A "definition tag" is an opaque identifier that describes some property of the
defined node. Definitions can have multiple tags or no tags at all, and the
same tag can be attached to multiple definitions. Tags can be used to group
definitions according to some shared property or logical grouping. For example,
arithmetic or boolean operators might be grouped together. Match patterns can
then reference those tags in order to match groups of nodes (see "Matching
Names" section).

Here is the definition tagging syntax:

  [<tag-1-name>, <tag-2-name>, ...]
  define <name> {
  }

And here is an example:

  [Comparison, Inequality]
  define Lt {
    Left  Expr
    Right Expr
  }

Rules

Optgen language input files may contain any number of rules, in any order. Each
rule has a unique name and consists of a match pattern and a corresponding
replace pattern. A rule's match pattern is tested against every node in the
target expression tree, bottom-up. Each matching node is replaced by a node
constructed according to the replace pattern. The replacement node is itself
tested against every rule, and so on, until no further rules match.
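
The bottom-up, rewrite-until-nothing-matches behavior described above can be
sketched in a few lines of Go. This is only an illustration under stated
assumptions: the Node and Rule types, the Rewrite function, and the two
hand-written rules are hypothetical names invented for this sketch. They are
not part of the lang package and are not code generated by Optgen.

```go
package main

// Node is a hypothetical stand-in for a node in the target expression tree.
type Node struct {
	Op       string
	Children []*Node
}

// String renders the node in the parenthesized form used by Optgen patterns.
func (n *Node) String() string {
	s := "(" + n.Op
	for _, c := range n.Children {
		s += " " + c.String()
	}
	return s + ")"
}

// Rule is a hypothetical hand-written rule. It returns the replacement node
// and true if the node matches, or nil and false otherwise.
type Rule func(n *Node) (*Node, bool)

// Rewrite rewrites the children first (bottom-up) and then tests the node
// against every rule. A replacement is rewritten again, so the process only
// stops once no rule matches. It modifies the tree in place and terminates
// only if the rule set cannot keep matching forever.
func Rewrite(n *Node, rules []Rule) *Node {
	for i, c := range n.Children {
		n.Children[i] = Rewrite(c, rules)
	}
	for _, r := range rules {
		if repl, ok := r(n); ok {
			return Rewrite(repl, rules)
		}
	}
	return n
}

// eliminateSelect mimics: (Select $input:* (True)) => $input
func eliminateSelect(n *Node) (*Node, bool) {
	if n.Op == "Select" && len(n.Children) == 2 && n.Children[1].Op == "True" {
		return n.Children[0], true
	}
	return nil, false
}

// normalizeEq mimics: (Eq $left:^(Variable) $right:(Variable)) => (Eq $right $left)
func normalizeEq(n *Node) (*Node, bool) {
	if n.Op == "Eq" && len(n.Children) == 2 &&
		n.Children[0].Op != "Variable" && n.Children[1].Op == "Variable" {
		swapped := &Node{Op: "Eq", Children: []*Node{n.Children[1], n.Children[0]}}
		return swapped, true
	}
	return nil, false
}
```

Calling Rewrite with both rules on a Select node whose filter is (True) first
normalizes any Eq nodes beneath it and then replaces the Select node with its
rewritten input.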

Note that this is just a conceptual description. Optgen does not actually do
any of this matching or replacing itself. Other components use the Optgen
library to generate code. These components are free to match however they want,
and to replace nodes or keep the new and old nodes side-by-side (as with a
typical optimizer MEMO structure).

Similar to define statements, a rule may have a set of tags associated with it.
Rule tags logically group rules, and can also serve as directives to the code
generator.

Here is the partial rule syntax (see Syntax section for full syntax):

  [<rule-name>, <tag-1-name>, <tag-2-name>, ...]
  (<match-opname>
    <match-expr>
    <match-expr>
    ...
  )
  =>
  (<replace-opname>
    <replace-expr>
    <replace-expr>
    ...
  )

Match Patterns

The top-level match pattern matches the name and children of one or more nodes
in the target expression tree. For example:

  (Eq * *)

The "*" character is the "wildcard matcher", which matches a child of any kind.
Therefore, this pattern matches any node named "Eq" that has at least two
children. Matchers can be nested within one another in order to match children,
grandchildren, etc. For example:

  (Eq (Variable) (Const))

This pattern matches an "Eq" node with a "Variable" node as its left child
and a "Const" node as its right child.

Binding

Child patterns within match and replace patterns can be "bound" to a named
variable. These variables can then be referenced later in the match pattern or
in the replace pattern. This is a critical part of the Optgen language, since
virtually every pattern constructs its replacement pattern based on parts of
the match pattern. For example:

  [EliminateSelect]
  (Select $input:* (True)) => $input

The $input variable is bound to the first child of the "Select" node.
If the second child is a "True" node, then the "Select" node will be replaced
by its input. Variables can also be passed as arguments to custom matchers,
which are described below.

Matching Names

In addition to simple name matching, a node matcher can match tag names. Any
node type which has the named tag is matched. For example:

  [Inequality]
  define Lt {
    Left  Expr
    Right Expr
  }

  [Inequality]
  define Gt {
    Left  Expr
    Right Expr
  }

  (Inequality (Variable) (Const))

This pattern matches either "Lt" or "Gt" nodes. This is useful for writing
patterns that apply to multiple kinds of nodes, without need for duplicate
patterns that only differ in matched node name.

The node matcher also supports multiple names in the match list, separated by
'|' characters. The node's name is allowed to match any of the names in the
list. For example:

  (Eq | Ne | Inequality)

This pattern matches "Eq", "Ne", "Lt", or "Gt" nodes.

Matching Primitive Types

String and numeric constant nodes in the tree can be matched against literals.
A literal string or number in a match pattern is interpreted as a matcher of
that type, and will be tested for equality with the child node. For example:

  [EliminateConcat]
  (Concat $left:* (Const "")) => $left

If Concat's right operand is a constant expression with the empty string as its
value, then the pattern matches. Similarly, a constant numeric expression can be
matched like this:

  [LimitScan]
  (Limit (Scan $def:*) (Const 1)) => (ScanOneRow $def)

Matching Lists

Nodes can have a child that is a list of nodes rather than a single node.
As an example, a function call node might have two children: the name of the
function and the list of arguments to the function:

  define FuncCall {
    Name Expr
    Args ExprList
  }

There are several kinds of list matchers, each of which uses a variant of the
list matching bracket syntax. The ellipses signify that 0 or more items can
match at either the beginning or end of the list. The item pattern can be any
legal match pattern, and can be bound to a variable.

  [ ... <item pattern> ... ]

- ANY: Matches if any item in the list matches the item pattern. If multiple
items match, then the list matcher binds to the first match.

  [ ... $item:* ... ]

- FIRST: Matches if the first item in the list matches the item pattern (and
there is at least one item in the list).

  [ $item:* ... ]

- LAST: Matches if the last item in the list matches the item pattern (and
there is at least one item).

  [ ... $item:* ]

- SINGLE: Matches if there is exactly one item in the list, and it matches the
item pattern.

  [ $item:* ]

- EMPTY: Matches if there are zero items in the list.

  []

Following is a more complete example. The ANY list matcher in the example
searches through the Filter child's list, looking for a Subquery node. If a
matching node is found, then the list matcher succeeds and binds the node to
the $item variable.

  (Select
    $input:*
    (Filter [ ... $item:(Subquery) ... ])
  )

Custom Matching

When the supported matching syntax is insufficient, Optgen provides an escape
mechanism. Custom matching functions can invoke Go functions, passing
previously bound variables as arguments, and checking the boolean result for a
match.
For example:

  [EliminateFilters]
  (Filters $items:* & (IsEmptyList $items)) => (True)

This pattern passes the $items child node to the IsEmptyList function. If that
returns true, then the pattern matches.

Custom matching functions can appear anywhere that other matchers can, and can
be combined with other matchers using boolean operators (see the Boolean
Expressions section for more details). While variable references are the most
common argument, it is also legal to nest function invocations:

  (Project
    $input:*
    $projections:* & ^(IsEmpty (FindUnusedColumns $projections))
  )

Boolean Expressions

Multiple match expressions of any type can be combined using the boolean &
(AND) operator. All must match in order for the overall match to succeed:

  (Not
    $input:(Comparison) & (Inequality) & (CanInvert $input)
  )

The boolean ^ (NOT) operator negates the result of a boolean match expression.
It can be used with any kind of matcher, including custom match functions:

  (JoinApply
    $left:^(Select)
    $right:* & ^(IsCorrelated $right $left)
    $on:*
  )

This pattern matches only if the left child is not a Select node, and if the
IsCorrelated custom function returns false.

Replace Patterns

Once a matching node is found, the replace pattern produces a single
substitution node. The most common replace pattern involves constructing one or
more new nodes, often with child nodes that were bound in the match pattern.
A construction expression specifies the name of the node as its first operand
and its children as subsequent arguments. Construction expressions can be
nested within one another to any depth.
For example:

  [HoistSelectExists]
  (Select
    $input:*
    $filter:(Exists $subquery:*)
  )
  =>
  (SemiJoinApply
    $input
    $subquery
    (True)
  )

The replace pattern in this rule constructs a new SemiJoinApply node, with its
first two children bound in the match pattern. The third child is a newly
constructed True node.

The replace pattern can also consist of a single variable reference, in the
case where the substitution node was already present in the match pattern:

  [EliminateAnd]
  (And $left:* (True)) => $left

Custom Construction

When Optgen syntax cannot easily produce a result, custom construction
functions allow results to be derived in Go code. If a construction
expression's name is not recognized as a node name, then it is assumed to be
the name of a custom function. For example:

  [MergeSelectJoin]
  (Select
    (InnerJoin $r:* $s:* $on:*)
    $filter:*
  )
  =>
  (InnerJoin
    $r
    $s
    (ConcatFilters $on $filter)
  )

Here, the ConcatFilters custom function is invoked in order to concatenate two
filter lists together. Function parameters can include nodes, lists (see the
Constructing Lists section), operator names (see the Name Parameters section),
and the results of nested custom function calls. While custom functions
typically return a node, they can return other types if they are parameters to
other custom functions.

Constructing Lists

Lists can be constructed and passed as parameters to node construction
expressions or custom replace functions. A list consists of multiple items that
can be of any parameter type, including nodes, strings, custom function
invocations, or lists.
Here is an example:

  [MergeSelectJoin]
  (Select
    (InnerJoin $left:* $right:* $on:*)
    $filters:*
  )
  =>
  (InnerJoin
    $left
    $right
    (And [$on $filters])
  )

Dynamic Construction

Sometimes the name of a constructed node can be one of several choices. The
built-in "OpName" function can be used to dynamically construct the right kind
of node. For example:

  [NormalizeVar]
  (Eq | Ne
    $left:^(Variable)
    $right:(Variable)
  )
  =>
  ((OpName) $right $left)

In this pattern, the name of the constructed result is either Eq or Ne,
depending on which is matched. When the OpName function has no arguments, then
it is bound to the name of the node matched at the top-level. The OpName
function can also take a single variable reference argument. In that case, it
refers to the name of the node bound to that variable:

  [PushDownSelect]
  (Select
    $input:(Join $left:* $right:* $on:*)
    $filter:* & ^(IsCorrelated $filter $right)
  )
  =>
  ((OpName $input)
    (Select $left $filter)
    $right
    $on
  )

In this pattern, Join is a tag that refers to a group of nodes. The replace
expression will construct a node having the same name as the matched join node.

Name Parameters

The OpName built-in function can also be a parameter to a custom match or
replace function which needs to know which name matched. For example:

  [FoldBinaryNull]
  (Binary $left:* (Null) & ^(HasNullableArgs (OpName)))
  =>
  (Null)

The name of the matched Binary node (e.g. Plus, In, Contains) is passed to the
HasNullableArgs function as a symbolic identifier.
Here is an example that uses a custom replace function and the OpName function
with an argument:

  [NegateComparison]
  (Not $input:(Comparison $left:* $right:*))
  =>
  (InvertComparison (OpName $input) $left $right)

As described in the previous section, adding the argument enables OpName to
return a name that was matched deeper in the pattern.

In addition to a name returned by the OpName function, custom match and replace
functions can accept literal operator names as parameters. The Minus operator
name is passed as a parameter to two functions in this example:

  [FoldMinus]
  (UnaryMinus
    (Minus $left:* $right:*) & (OverloadExists Minus $right $left)
  )
  =>
  (ConstructBinary Minus $right $left)

Type Inference

Expressions in both the match and replace patterns are assigned a data type
that describes the kind of data that will be returned by the expression. These
types are inferred using a combination of top-down and bottom-up type inference
rules. For example:

  define Select {
    Input  Expr
    Filter Expr
  }

  (Select $input:(LeftJoin | RightJoin) $filter:*) => $input

The type of $input is inferred as "LeftJoin | RightJoin" by bubbling up the type
of the bound expression. That type is propagated to the $input reference in the
replace pattern. By contrast, the type of the * expression is inferred to be
"Expr" using a top-down type inference rule, since the second argument to the
Select operator is known to have type "Expr".

When multiple types are inferred for an expression using different type
inference rules, the more restrictive type is assigned to the expression. For
example:

  (Select $input:* & (LeftJoin)) => $input

Here, the left input to the And expression was inferred to have type "Expr" and
the right input to have type "LeftJoin".
Since "LeftJoin" is the more restrictive type, the And expression and the
$input binding are typed as "LeftJoin".

Type inference detects and reports type contradictions, which occur when
multiple incompatible types are inferred for an expression. For example:

  (Select $input:(InnerJoin) & (LeftJoin)) => $input

Because the input cannot be both an InnerJoin and a LeftJoin, Optgen reports a
type contradiction error.

Syntax

This section describes the Optgen language syntax in a variant of extended
Backus-Naur form. The non-terminals correspond to methods in the parser. The
terminals correspond to tokens returned by the scanner. Whitespace and
comment tokens can be freely interleaved between other tokens in the
grammar.

  root         = tags (define | rule)
  tags         = '[' IDENT (',' IDENT)* ']'

  define       = 'define' define-name '{' define-field* '}'
  define-name  = IDENT
  define-field = field-name field-type
  field-name   = IDENT
  field-type   = IDENT

  rule         = match '=>' replace
  match        = func
  replace      = func | ref
  func         = '(' func-name arg* ')'
  func-name    = names | func
  names        = name ('|' name)*
  arg          = bind | ref | and
  and          = expr ('&' and)?
  expr         = func | not | list | any | name | STRING | NUMBER
  not          = '^' expr
  list         = '[' list-child* ']'
  list-child   = list-any | arg
  list-any     = '...'
  bind         = '$' label ':' and
  ref          = '$' label
  any          = '*'
  name         = IDENT
  label        = IDENT

Here are the pseudo-regex definitions for the lexical tokens that aren't
represented as single-quoted strings above:

  STRING     = " [^"\n]* "
  NUMBER     = UnicodeDigit+
  IDENT      = (UnicodeLetter | '_') (UnicodeLetter | '_' | UnicodeNumber)*
  COMMENT    = '#' .* \n
  WHITESPACE = UnicodeSpace+

The support directory contains syntax coloring files for several editors,
including Vim, TextMate, and Visual Studio Code. JetBrains editors (e.g.
GoLand) can also import TextMate bundles to provide syntax coloring.

Components

The lang package contains a scanner that breaks input files into lexical
tokens, a parser that assembles an abstract syntax tree (AST) from the tokens,
and a compiler that performs semantic checks and creates a rudimentary symbol
table.

The compiled rules and definitions become the input to a separate code
generation package which generates parts of the CockroachDB SQL optimizer.
However, the Optgen language itself is not Cockroach or SQL specific, and can
be used in other contexts. For example, the Optgen language parser generates
its own AST expressions using itself (compiler bootstrapping).
*/
package lang