github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/pkg/sql/opt/doc.go (about) 1 // Copyright 2018 The Cockroach Authors. 2 // 3 // Use of this software is governed by the Business Source License 4 // included in the file licenses/BSL.txt. 5 // 6 // As of the Change Date specified in that file, in accordance with 7 // the Business Source License, use of this software will be governed 8 // by the Apache License, Version 2.0, included in the file 9 // licenses/APL.txt. 10 11 /* 12 Package opt contains the Cockroach SQL optimizer. The optimizer transforms the 13 AST of a SQL query into a physical query plan for execution. Naive execution of 14 a SQL query can be prohibitively expensive, because SQL specifies the desired 15 results and not how to achieve them. A given SQL query can have thousands of 16 equivalent query plans with vastly different execution times. The Cockroach 17 optimizer is cost-based, meaning that it enumerates some or all of these 18 alternate plans and chooses the one with the lowest estimated cost. 19 20 Overview 21 22 SQL query planning is often described in terms of 8 modules: 23 24 1. Properties 25 26 2. Stats 27 28 3. Cost Model 29 30 4. Memo 31 32 5. Transforms 33 34 6. Prep 35 36 7. Rewrite 37 38 8. Search 39 40 Note Prep, Rewrite and Search could be considered phases, though this document 41 will refer to all 8 uniformly as modules. Memo is a technique for compactly 42 representing the forest of trees generated during Search. Stats, Properties, 43 Cost Model and Transformations are modules that power the Prep, Rewrite and 44 Search phases. 45 46 SQL query text 47 | 48 +-----v-----+ - parse SQL text according to grammar 49 | Parse | - report syntax errors 50 +-----+-----+ 51 | 52 (ast) 53 | 54 +-----v-----+ - fold constants, check types, resolve 55 | Analyze | names 56 +-----+-----+ - annotate tree with semantic info 57 | - report semantic errors 58 (ast+) 59 +-------+ | 60 | Stats +----->-----v-----+ - normalize tree with cost-agnostic 61 +-------+ | Prep | transforms (placeholders present) 62 +-->-----+-----+ - compute initial properties 63 | | - retrieve and attach stats 64 | (expr) - done once per PREPARE 65 | | 66 +------------+ | +-----v-----+ - capture placeholder values / timestamps 67 | Transforms |--+--> Rewrite | - normalize tree with cost-agnostic 68 +------------+ | +-----+-----+ transforms (placeholders not present) 69 | | - done once per EXECUTE 70 | (expr) 71 | | 72 +-->-----v-----+ - generate equivalent expression trees 73 +------------+ | Search | - find lowest cost physical plan 74 | Cost Model +----->-----+-----+ - includes DistSQL physical planning 75 +------------+ | 76 (physical plan) 77 | 78 +-----v-----+ 79 | Execution | 80 +-----------+ 81 82 The opt-related packages implement portions of these modules, while other parts 83 are implemented elsewhere. For example, other sql packages are used to perform 84 name resolution and type checking which are part of the Analyze phase. 85 86 Parse 87 88 The parse phase is not discussed in this document. It transforms the SQL query 89 text into an abstract syntax tree (AST). 90 91 Analyze 92 93 The analyze phase ensures that the AST obeys all SQL semantic rules, and 94 annotates the AST with information that will be used by later phases. In 95 addition, some simple transforms are applied to the AST in order to simplify 96 handling in later phases. Semantic rules are many and varied; this document 97 describes a few major categories of semantic checks and rewrites. 98 99 "Name resolution" binds table, column, and other references. Each name must be 100 matched to the appropriate schema object, and an error reported if no matching 101 object can be found. Name binding can result in AST annotations that make it 102 easy for other components to find the target object, or rewrites that replace 103 unbound name nodes with new nodes that are easier to handle (e.g. IndexedVar). 104 105 "Constant folding" rewrites expressions that have constant inputs. For example, 106 1+1 would be folded to 2. Cockroach's typing rules assume that constants have 107 been folded, as there are some expressions that would otherwise produce a 108 semantic error if they are not first folded. 109 110 "Type inference" automatically determines the return data type of various SQL 111 expressions, based on the types of inputs, as well as the context in which the 112 expression is used. The AST is annotated with the resolved types for later use. 113 114 "Type checking" ensures that all inputs to SQL expressions and statements have 115 legal static types. For example, the CONCAT function only accepts zero or more 116 arguments that are statically typed as strings. Violation of the typing rules 117 produces a semantic error. 118 119 Properties 120 121 Properties are meta-information that are computed (and sometimes stored) for 122 each node in an expression. Properties power transformations and optimization. 123 124 "Logical properties" describe the structure and content of data returned by an 125 expression, such as whether relational output columns can contain nulls, or the 126 data type of a scalar expression. Two expressions which are logically 127 equivalent according to the rules of the relational algebra will return the 128 same set of rows and columns, and will have the same set of logical properties. 129 However, the order of the rows, naming of the columns, and other presentational 130 aspects of the result are not governed by the logical properties. 131 132 "Physical properties" are interesting characteristics of an expression that 133 impact its layout, presentation, or location, but not its logical content. 134 Examples include row order, column naming, and data distribution (physical 135 location of data ranges). Physical properties exist outside of the relational 136 algebra, and arise from both the SQL query itself (e.g. the non-relational 137 ORDER BY operator) and by the selection of specific implementations during 138 optimization (e.g. a merge join requires the inputs to be sorted in a 139 particular order). 140 141 Properties can be "required" or "derived". A required property is one specified 142 by the SQL query text. For example, a DISTINCT clause is a required property on 143 the set of columns of the corresponding projection -- that the tuple of columns 144 forms a key (unique values) in the results. A derived property is one derived 145 by the optimizer for an expression based on the properties of the child 146 expressions. For example: 147 148 SELECT k+1 FROM kv 149 150 Once the ordering of "k" is known from kv's descriptor, the same ordering 151 property can be derived for k+1. During optimization, for each expression with 152 required properties, the optimizer will look at child expressions to check 153 whether their actual properties (which can be derived) match the requirement. 154 If they don't, the optimizer must introduce an "enforcer" operator in the plan 155 that provides the required property. For example, an ORDER BY clause creates a 156 required ordering property that can cause the optimizer to add a Sort operator 157 as an enforcer of that property. 158 159 Stats 160 161 Table statistics power both the cost model and the search of alternate query 162 plans. A simple example of where statistics guide the search of alternate query 163 plans is in join ordering: 164 165 SELECT * FROM a JOIN b 166 167 In the absence of other opportunities, this might be implemented as a hash 168 join. With a hash join, we want to load the smaller set of rows (either from a 169 or b) into the hash table and then query that table while looping through the 170 larger set of rows. How do we know whether a or b is larger? We keep statistics 171 about the cardinality of a and b, i.e. the (approximate) number of different 172 values. 173 174 Simple table cardinality is sufficient for the above query but fails in other 175 queries. Consider: 176 177 SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y > 10 178 179 Table statistics might indicate that a contains 10x more data than b, but the 180 predicate a.y > 10 is filtering a chunk of the table. What we care about is 181 whether the result of the scan *after* filtering returns more rows than the 182 scan of b. This can be accomplished by making a determination of the 183 selectivity of the predicate a.y > 10 (the % of rows it will filter) and then 184 multiplying that selectivity by the cardinality of a. The common technique for 185 estimating selectivity is to collect a histogram on a.y. 186 187 The collection of table statistics occurs prior to receiving the query. As 188 such, the statistics are necessarily out of date and may be inaccurate. The 189 system may bound the inaccuracy by recomputing the stats based on how fast a 190 table is being modified. Or the system may notice when stat estimations are 191 inaccurate during query execution. 192 193 Cost Model 194 195 The cost model takes an expression as input and computes an estimated "cost" 196 to execute that expression. The unit of "cost" can be arbitrary, though it is 197 desirable if it has some real world meaning such as expected execution time. 198 What is required is for the costs of different query plans to be comparable. 199 The optimizer seeks to find the shortest expected execution time for a query 200 and uses cost as a proxy for execution time. 201 202 Cost is roughly calculated by estimating how much time each node in the 203 expression tree will use to process all results and modeling how data flows 204 through the expression tree. Table statistics are used to power cardinality 205 estimates of base relations which in term power cardinality estimates of 206 intermediate relations. This is accomplished by propagating histograms of 207 column values from base relations up through intermediate nodes (e.g. combining 208 histograms from the two join inputs into a single histogram). Operator-specific 209 computations model the network, disk and CPU costs. The cost model should 210 include data layout and the specific operating environment. For example, 211 network RTT in one cluster might be vastly different than another. 212 213 Because the cost for a query plan is an estimate, there is an associated error. 214 This error might be implicit in the cost, or could be explicitly tracked. One 215 advantage to explicitly tracking the expected error is that it can allow 216 selecting a higher cost but lower expected error plan over a lower cost but 217 higher expected error plan. Where does the error come from? One source is the 218 innate inaccuracy of stats: selectivity estimation might be wildly off due to 219 an outlier value. Another source is the accumulated build up of estimation 220 errors the higher up in the query tree. Lastly, the cost model is making an 221 estimation for the execution time of an operation such as a network RTT. This 222 estimate can also be wildly inaccurate due to bursts of activity. 223 224 Search finds the lowest cost plan using dynamic programming. That imposes a 225 restriction on the cost model: it must exhibit optimal substructure. An optimal 226 solution can be constructed from optimal solutions of its sub-problems. 227 228 Memo 229 230 Memo is a data structure for efficiently storing a forest of query plans. 231 Conceptually, the memo is composed of a numbered set of equivalency classes 232 called groups where each group contains a set of logically equivalent 233 expressions. The different expressions in a single group are called memo 234 expressions (memo-ized expressions). A memo expression has a list of child 235 groups as its children rather than a list of individual expressions. The 236 forest is composed of every possible combination of parent expression with 237 its children, recursively applied. 238 239 Memo expressions can be relational (e.g. join) or scalar (e.g. <). Operators 240 are always both logical (specify results) and physical (specify results and a 241 particular implementation). This means that even a "raw" unoptimized expression 242 tree can be executed (naively). Both relational and scalar operators are 243 uniformly represented as nodes in memo expression trees, which facilitates tree 244 pattern matching and replacement. 245 246 Because memo groups contain logically equivalent expressions, all the memo 247 expressions in a group share the same logical properties. However, it's 248 possible for two logically equivalent expressions to be placed in different 249 memo groups. This occurs because determining logical equivalency of two 250 relational expressions is too complex to perform 100% correctly. A correctness 251 failure (i.e. considering two expressions logically equivalent when they are 252 not) results in invalid transformations and invalid plans. But placing two 253 logically equivalent expressions in different groups has a much gentler failure 254 mode: the memo and transformations are less efficient. Expressions within the 255 memo may have different physical properties. For example, a memo group might 256 contain both hash join and merge join expressions which produce the same set of 257 output rows, but produce them in different orders. 258 259 Expressions are inserted into the memo by the factory, which ensure that 260 expressions have been fully normalized before insertion (see the Prep section 261 for more details). A new group is created only when unique normalized 262 expressions are created by the factory during construction or rewrite of the 263 tree. Uniqueness is determined by computing the fingerprint for a memo 264 expression, which is simply the expression operator and its list of child 265 groups. For example, consider this query: 266 267 SELECT * FROM a, b WHERE a.x = b.x 268 269 After insertion into the memo, the memo would contain these six groups: 270 271 6: [inner-join [1 2 5]] 272 5: [eq [3 4]] 273 4: [variable b.x] 274 3: [variable a.x] 275 2: [scan b] 276 1: [scan a] 277 278 The fingerprint for the inner-join expression is [inner-join [1 2 5]]. The 279 memo maintains a map from expression fingerprint to memo group which allows 280 quick determination of whether the normalized form of an expression already 281 exists in the memo. 282 283 The normalizing factory will never add more than one expression to a memo 284 group. But the explorer (see Search section for more details) does add 285 denormalized expressions to existing memo groups, since oftentimes one of these 286 equivalent, but denormalized expressions will have a lower cost than the 287 initial normalized expression added by the factory. For example, the join 288 commutativity transformation expands the memo like this: 289 290 6: [inner-join [1 2 5]] [inner-join [2 1 5]] 291 5: [eq [3 4]] 292 4: [variable b.x] 293 3: [variable a.x] 294 2: [scan b] 295 1: [scan a] 296 297 Notice that there are now two expressions in memo group 6. The coster (see Cost 298 Model section for more details) will estimate the execution cost of each 299 expression, and the optimizer will select the lowest cost alternative. 300 301 Transforms 302 303 Transforms convert an input expression tree into zero or more logically 304 equivalent trees. Transforms consist of two parts: a "match pattern" and a 305 "replace pattern". Together, the match pattern and replace pattern are called a 306 "rule". Transform rules are categorized as "normalization" or "exploration" 307 rules. 308 309 If an expression in the tree matches the match pattern, then a new expression 310 will be constructed according to the replace pattern. Note that "replace" means 311 the new expression is a logical replacement for the existing expression, not 312 that the existing expression needs to physically be replaced. Depending on the 313 context, the existing expression may be discarded, or it may be retained side- 314 by-side with the new expression in the memo group. 315 316 Normalization rules are cost-agnostic, as they are always considered to be 317 beneficial. All normalization rules are implemented by the normalizing factory, 318 which does its best to map all logically equivalent expression trees to a 319 single canonical form from which searches can branch out. See the Prep section 320 for more details. 321 322 Exploration rules generate equivalent expression trees that must be costed in 323 order to determine the lowest cost alternative. All exploration rules are 324 implemented by the explorer, which is optimized to efficiently enumerate all 325 possible expression tree combinations in the memo in order to look for rule 326 matches. When it finds a match, the explorer applies the rule and adds an 327 equivalent expression to the existing memo group. See the Search section for 328 more details. 329 330 Some examples of transforms: 331 332 Join commutativity 333 Swaps the order of the inputs to an inner join. 334 SELECT * FROM a, b => SELECT * FROM b, a 335 336 Join associativity 337 Reorders the children of a parent and child join 338 SELECT * FROM (SELECT * FROM a, b), c 339 => 340 SELECT * FROM (SELECT * FROM a, c), b 341 342 Predicate pushdown 343 Moves predicates below joins 344 SELECT * FROM a, b USING (x) WHERE a.x < 10 345 => 346 SELECT * FROM (SELECT * FROM a WHERE a.x < 10), b USING (x) 347 348 Join elimination 349 Removes unnecessary joins based on projected columns and foreign keys. 350 SELECT a.x FROM a, b USING (x) 351 => 352 SELECT a.x FROM a 353 354 Distinct/group-by elimination 355 Removes unnecessary distinct/group-by operations based on keys. 356 SELECT DISTINCT a.x FROM a 357 => 358 SELECT a.x FROM a 359 360 Predicate inference 361 Adds predicates based on filter conditions. 362 SELECT * FROM a, b USING (x) 363 => 364 SELECT * FROM a, b USING (x) WHERE a.x IS NOT NULL AND b.x IS NOT NULL 365 366 Decorrelation 367 Replaces correlated subqueries with semi-join, anti-join and apply ops. 368 369 Scan to index scan 370 Transforms scan operator into one or more index scans on covering indexes. 371 372 Inner join to merge join 373 Generates alternate merge-join operator from default inner-join operator. 374 375 Much of the optimizer's rule matching and application code is generated by a 376 tool called Optgen, short for "optimizer generator". Optgen is a domain- 377 specific language (DSL) that provides an intuitive syntax for defining 378 transform rules. Here is an example: 379 380 [NormalizeEq] 381 (Eq 382 $left:^(Variable) 383 $right:(Variable) 384 ) 385 => 386 (Eq $right $left) 387 388 The expression above the arrow is the match pattern and the expression below 389 the arrow is the replace pattern. This example rule will match Eq expressions 390 which have a left input which is not a Variable operator and a right input 391 which is a Variable operator. The replace pattern will trigger a replacement 392 that reverses the two inputs. In addition, custom match and replace functions 393 can be defined in order to run arbitrary Go code. 394 395 Prep 396 397 Prep (short for "prepare") is the first phase of query optimization, in which 398 the annotated AST is transformed into a single normalized "expression tree". 399 The optimizer directly creates the expression tree in the memo data structure 400 rather than first constructing an intermediate data structure. A forest of 401 equivalent trees will be generated in later phases, but at the end of the prep 402 phase, the memo contains just one normalized tree that is logically equivalent 403 to the SQL query. 404 405 During the prep phase, placeholder values are not yet known, so normalization 406 cannot go as far as it can during later phases. However, this also means that 407 the resulting expression tree can be cached in response to a PREPARE statement, 408 and then be reused as a starting point each time an EXECUTE statement provides 409 new placeholder values. 410 411 The memo expression tree is constructed by the normalizing factory, which does 412 its best to map all logically equivalent expression trees to a single canonical 413 form from which searches can branch out. The factory has an interface similar 414 to this: 415 416 ConstructConst(value PrivateID) GroupID 417 ConstructAnd(conditions ListID) GroupID 418 ConstructInnerJoin(left GroupID, right GroupID, on GroupID) GroupID 419 420 The factory methods construct a memo expression tree bottom-up, with each memo 421 group becoming an input to operators higher in the tree. 422 423 As each expression is constructed by the factory, it transitively applies 424 normalization rules defined for that expression type. This may result in the 425 construction of a different type of expression than what was requested. If, 426 after normalization, the expression is already part of the memo, then 427 construction is a no-op. Otherwise, a new memo group is created, with the 428 normalized expression as its first and only expression. 429 430 By applying normalization rules as the expression tree is constructed, the 431 factory can avoid creating intermediate expressions; often, "replacement" of 432 an existing expression means it's never created to begin with. 433 434 During Prep, all columns used by the SQL query are given a numeric index that 435 is unique across the query. Column numbering involves assigning every base 436 column and non-trivial projection in a query a unique, query-specific index. 437 Giving each column a unique index allows the expression nodes mentioned above 438 to track input and output columns, or really any set of columns during Prep and 439 later phases, using a bitmap (FastIntSet). The bitmap representation allows 440 fast determination of compatibility between expression nodes and is utilized by 441 transforms to determine the legality of such operations. 442 443 The Prep phase also computes logical properties, such as the input and output 444 columns of each (sub-)expression, equivalent columns, not-null columns and 445 functional dependencies. These properties are computed bottom-up as part of 446 constructing the expression tree. 447 448 Rewrite 449 450 Rewrite is the second phase of query optimization. Placeholder values are 451 available starting at this phase, so new normalization rules will typically 452 match once constant values are substituted for placeholders. As mentioned in 453 the previous section, the expression tree produced by the Prep phase can be 454 cached and serve as the starting point for the Rewrite phase. In addition, the 455 Rewrite phase takes a set of physical properties that are required from the 456 result, such as row ordering and column naming. 457 458 The Rewrite and Search phases have significant overlap. Both phases perform 459 transformations on the expression tree. However, Search preserves the matched 460 expression side-by-side with the new expression, while Rewrite simply discards 461 the matched expression, since the new expression is assumed to always be 462 better. In addition, the application of exploration rules may trigger 463 additional normalization rules, which may in turn trigger additional 464 exploration rules. 465 466 Together, the Rewrite and Search phases are responsible for finding the 467 expression that can provide the required set of physical properties at the 468 lowest possible execution cost. That mandate is recursively applied; in other 469 words, each subtree is also optimized with respect to a set of physical 470 properties required by its parent, and the goal is to find the lowest cost 471 equivalent expression. An example of an "interior" optimization goal is a merge 472 join that requires its inner child to return its rows in a specific order. The 473 same group can be (and sometimes is) optimized multiple times, but with 474 different required properties each time. 475 476 Search 477 478 Search is the final phase of optimization. Search begins with a single 479 normalized tree that was created by the earlier phases. For each group, the 480 "explorer" component generates alternative expressions that are logically 481 equivalent to the normalized expression, but which may have very different 482 execution plans. The "coster" component computes the estimated cost for each 483 alternate expression. The optimizer remembers the "best expression" for each 484 group, for each set of physical properties required of that group. 485 486 Optimization of a group proceeds in two phases: 487 488 1. Compute the cost of any previously generated expressions. That set initially 489 contains only the group's normalized expression, but exploration may yield 490 additional expressions. Costing a parent expression requires that the children 491 first be costed, so costing triggers a recursive traversal of the memo groups. 492 493 2. Invoke the explorer to generate new equivalent expressions for the group. 494 Those new expressions are costed once the optimizer loops back to the first 495 phase. 496 497 In order to avoid a combinatorial explosion in the number of expression trees, 498 the optimizer utilizes the memo structure. Due to the large number of possible 499 plans for some queries, the optimizer cannot always explore all of them. 500 Therefore, it proceeds in multiple iterative "passes", until either it hits 501 some configured time or resource limit, or until an exhaustive search is 502 complete. As long as the search is allowed to complete, the best plan will be 503 found, just as in Volcano and Cascades. 504 505 The optimizer uses several techniques to maximize the chance that it finds the 506 best plan early on: 507 508 - As with Cascades, the search is highly directed, interleaving exploration 509 with costing in order to prune parts of the tree that cannot yield a better 510 plan. This contrasts with Volcano, which first generates all possible plans in 511 one global phase (exploration), and then determines the lowest cost plan in 512 another global phase (costing). 513 514 - The optimizer uses a simple hill climbing heuristic to make greedy progress 515 towards the best plan. During a given pass, the optimizer visits each group and 516 performs costing and exploration for that group. As long as doing that yields a 517 lower cost expression for the group, the optimizer will repeat those steps. 518 This finds a local maxima for each group during the current pass. 519 520 In order to avoid costing or exploring parts of the search space that cannot 521 yield a better plan, the optimizer performs aggressive "branch and bound 522 pruning". Each group expression is optimized with respect to a "budget" 523 parameter. As soon as this budget is exceeded, optimization of that expression 524 terminates. It's not uncommon for large sections of the search space to never 525 be costed or explored due to this pruning. Example: 526 527 innerJoin 528 left: cost = 50 529 right: cost = 75 530 on: cost = 25 531 532 If the current best expression for the group has a cost of 100, then the 533 optimizer does not need to cost or explore the "on" child of the join, and 534 does not need to cost the join itself. This is because the combined cost of 535 the left and right children already exceeds 100. 536 */ 537 package opt