vitess.io/vitess@v0.16.2/doc/design-docs/V3HighLevelDesign.md (about) 1 # V3 high level design 2 3 # Objectives 4 5 The goal of this document is to describe the guiding principles that will be used to implement the full SQL feature set of Vitess’s V3 API. We start off with the high level concepts on SQL and its features, and iterate our way into some concrete design proposals. 6 7 ### Prerequisites 8 9 Before reading this doc you must be familiar with [vindexes](https://github.com/vitessio/vitess/blob/main/doc/V3VindexDesign.md), which is used as foundation for the arguments presented here. 10 11 # Background 12 13 VTGate was initially conceived as a router that encapsulated the logic of deciding where to send the application’s queries. The primary objective was to isolate the application from maintenance events like reparenting, server restarts, and resharding. As of the V2 API, this problem is pretty much solved. 14 15 One drawback of the V2 API is that it requires the application to be aware of the sharding key. This key had to be sent with every query. So, unless an application was written from the ground up with this awareness, the task of migrating it to use vitess was rather daunting. 16 17 The question was asked: could VTGate be changed to not require the sharding key in its API? Could VTGate make the sharded database look like a single database? 18 19 This was how V3 was born. A working proof of concept was developed. This showed that we can automatically route queries that targeted single shards. We could also route queries that targeted multiple shards as long as the results could be combined without any additional processing. 20 21 The next question was: Can we support more SQL constructs? How far can we go? Can we go all the way and support everything? 22 23 These are the questions we’ll try to answer in the subsequent sections. 24 25 # TL;DR 26 27 This section gives a preview of what this document tries to achieve. 28 29 ## Current V3 30 31 The current V3 functionality is capable of correctly serving simple SQL statements that reference a single table. For example, if id is the sharding column for a: 32 33 `select a.col from a where a.id=:id` will get routed to the correct shard. 34 35 `select a.col from a where a.id in ::list` will get broken out into shard-specific in statements and routed to those specific shards 36 37 `select a.col from a where a.col=:col` will get sent to all shards and results will be combined and returned 38 39 The current V3 also supports simple DMLs. More complex SQLs like joins, aggregations and subqueries are not supported. 40 41 ## Next phase: support more queries 42 43 If the first phase of this document is implemented, VTGate will be able to support the following constructs: 44 45 ### Joins 46 47 `select a.col, b.col from a join b on b.id=a.id where a.id=:id` will get routed to the correct shard. Correspondingly, IN clauses will also will get correctly broken out and routed. 48 49 `select a.col, b.col from a join b on b.id=a.id `will get sent to all shards. 50 51 If the data is on multiple shard, VTGate will perform the join: 52 53 `select a.col, b.col from a join b on b.id2=a.id where a.id=:id` 54 55 will get rewritten as two queries, one dependent on the other: 56 57 ``` 58 select a.col, a.id from a where a.id=:id -> yield "_a_id" from col 1 59 select b.col from b where b.id2=:_a_id 60 ``` 61 62 Joins can group tables, as long as the order is correctly specified: 63 64 `select a.col, b.col, c.col from a join b on b.id=a.id join c on c.id2=a.id where a.id=:id` 65 66 will get rewritten as: 67 68 ``` 69 select a.col, b.col, a.id from a join b on b.id=a.id where a.id=:id 70 select c.col from where c.id2=:_a_id 71 ``` 72 73 Joins can also group other where clauses and wire-up cross-table dependencies: 74 75 `select a.col, b.col from a join b on b.id2=a.id where a.id=:id and b.col > a.col` 76 77 will get rewritten as: 78 79 ``` 80 select a.col, a.id from a where a.id=:id 81 select b.col from b where b.id2=:_a_id and b.col > :_a_col 82 ``` 83 84 Joins can also be cascased: 85 86 `select a.col, b.col, c.col from a join b on b.i2=a.id join c on c.id3=b.id2 where a.id=:id` 87 88 will get rewritten as: 89 90 ``` 91 select a.col, a.id from a where a.id=:id 92 select b.col, b.id2 from b where b.id2=_a_id 93 select c.col from c where c.id3=_b_id2 94 ``` 95 96 LEFT JOIN will also be supported for all the above constructs. However, WHERE clauses that add additional constraints to the nullable parts of a left join will not be supported; They cannot be pushed down. 97 98 ### Subqueries 99 100 Correlated subqueries that can be executed as part of the main query will be supported: 101 102 `select a.id from a where exists (select b.id from b where b.id=a.id) and a.id=:id` will get routed to the correct shard. 103 104 Correlated subqueries can also be sent as scatter queries, as long as the join constraints are met. 105 106 Subqueries in the FROM clause can be joined with other tables: 107 108 `select a.id, a.c, b.col from (select id, count(*) c from a) as a join b on b.id=a.id` 109 110 will get rewritten as: 111 112 ``` 113 select a.id, a.c from (select id, count(*) c from a) as a 114 select b.col from b where b.id=:_a_id 115 ``` 116 117 Subqueries in the WHERE clause and SELECT list will be supported only if they can be grouped. They will not be broken out. 118 119 ### Aggregation 120 121 Aggregation will be supported for single-shard queries, but they can contain joins and subqueries. 122 123 You can also use aggregation for scatter queries if a unique vindex is in the list of result columns, like this one: 124 125 `select id, count(*) from a` 126 127 ### Sorting 128 129 Sorting will be supported as long as it can be meaningfully pushed down into the grouped parts. For example, 130 131 `select a.col, b.col from a join be where b.id=a.id order by a.id, b.id` 132 133 will be rewritten as: 134 135 ``` 136 select a.col from a order by a.id order by a.id 137 select b.col from b where b.id=:_a_id order by b.id 138 ``` 139 140 ### All combinations 141 142 The above features are designed to be orthogonal to each other. This means that VTGate can support simultaneous combinations of the above features, and with arbitrary levels of nesting and cascading. 143 144 ## Going beyond 145 146 Once the above phase is complete, we can iterate on performance features as well as add support for more constructs. It will depend on customer demand. 147 148 # High level concepts 149 150 The V3 proof of concept drew parallels between database indexes and vitess’s indexing scheme. A database deals with tables, but VTGate deals with databases. It turns out that a table and a database have a surprising number of similarities. The basic guarantee of an index is that it lets you limit the number for rows you have to inspect in order to compute the result of a query. In the same way, vindexes give you the guarantee that it will be sufficient to send your queries to a subset of all the shards. 151 152 As a quick recap, a vindex is a cross-shard index maintained by vitess. Given a column constraint, the vindex is one that’s capable of telling which shards contain the rows needed to satisfy a query. In this document, we’ll use the term ‘unique vindex’ quite often. A unique vindex is one that can produce only one shard as target for a column equality constraint. However, there may still be many rows that match that constraint. But they’ll always be within that one shard. 153 154 The following table further illustrates the similarities of the concepts: 155 156 <table> 157 <tr> 158 <td>RDBMS</td> 159 <td>Vitess</td> 160 </tr> 161 <tr> 162 <td>Table</td> 163 <td>Database</td> 164 </tr> 165 <tr> 166 <td>Primary Key</td> 167 <td>Sharding Key</td> 168 </tr> 169 <tr> 170 <td>Index</td> 171 <td>Vindex</td> 172 </tr> 173 <tr> 174 <td>Foreign Key</td> 175 <td>Shared Vindex</td> 176 </tr> 177 </table> 178 179 180 This coincidence is not accidental. For example, if you used a vindex to route a query to a shard, it’s likely that the database also has an index on that same column, with properties similar to the vindex. If you have a shared vindex between two tables, it’s very likely that there exists a similar foreign key relationship between the tables. 181 182 The foundation for these similarities comes from the fact that a table that was once in a single database is now partitioned out. So, the indexes for those tables have to also grow out of the original database. 183 184 There is another strong parallel between a table and a database: A SELECT statement and a table are interchangeable. On one hand, all the data in table t can be represented as `‘select * from t’. `On the other hand, you can use a SELECT statement as stand-in for a table. So, many of the approaches used to optimize SQL for scanning tables can be re-used to orchestrate queries that span multiple shards or databases. The main difference is that you can treat bigger SQL chunks as tables, and can outsource that work to the database that hosts them. 185 186 But then, there’s only so much you can outsource. As soon as a query exceeds a certain level of complexity, where the required data needs to be pulled from more than one database, then any subsequent work needs to be done by VTGate. If we want to support the full SQL syntax, VTGate will have to eventually become a full-fledged SQL engine. 187 188 VTGate’s SQL engine will likely not be as good as what the underlying database can do, at least initially. So, until such time, the goal will be to push as much of the work down to the databases, and do as little as possible with what comes out. 189 190 ## Limitations 191 192 The above arguments can more or less be used as proof that we can theoretically support the full SQL syntax. In the worst case scenario, VTGate will have to individually scan each table involved in the query and perform the rest of the work by itself. 193 194 But there are some things it cannot do: 195 196 ### Cross-shard consistency 197 198 There is currently no way to get a system-wide consistent snapshot of all the keyspaces and shards. If reads are being performed during in-flight cross-shard transactions, then VTGate may see such data as committed in one database, and not committed in the other. 199 200 We could rely on MySQL’s MVCC feature that will give you a view of each database as of when a transaction was started, but this is still not perfect because there is no guarantee about the exact timing of each BEGIN, and there are also replication lag related inconsistencies. We currently don’t support read transactions for replicas, but it can be added. 201 202 The final option is to lock those rows before reading them. It’s a very high price to pay, but it will work. 203 204 ### Environmental differences 205 206 VTGate will be running on a different machine than the databases. So, functions that rely on the current environment will not produce consistent results. For example, time functions will likely have variations depending on clock skew. 207 208 # SQL for dummies 209 210 An upfront clarification: we’ll be only focusing on SELECT statements in this document, because that’s where all the complexities are. Also, Vitess currently allows only single-shard DMLs. So, most of this analysis will not be applicable for them. 211 212 ## Dissecting SELECT 213 214 A single table SELECT is a relatively simple thing. Let’s dissect the following statement: 215 216 ``` 217 select count(*), a+1, b 218 from t 219 where c=10 220 group by a+1, b 221 having count(*) > 10 222 order by count(*) desc 223 limit 10 224 ``` 225 226 The steps performed by the SQL engine are as follows: 227 228 1. Scan t (FROM) 229 2. Filter rows where c=10 (WHERE) 230 3. Remove all columns other than a & b, and compute a+1 (SELECT) 231 4. Count and group unique a+1, b values (GROUP BY) 232 5. Filter counts that are > 10 (HAVING) 233 6. order the rows by descending count (ORDER BY) 234 7. Return the top 10 results (LIMIT) 235 236 The one thing that’s not obvious in a SELECT is that the select list is not the first clause that’s executed. It’s the FROM. 237 238 ## SELECT as a subquery 239 240 If a SELECT statement was used just to fetch results, things would be very simple. However, it can be used in other parts of a query, and can produce 5 kinds of values: 241 242 1. A simple value: `select * from t where id = (select 1 from dual)` 243 2. A value tuple: multiple values of the same type: `select * from t1 where id in (select val from t2)` 244 3. A row tuple: multiple values of different types. A row tuple can also be used as a value: `select * from t1 where (a,b) = (select b,c from t2 where pk=1)` 245 4. Rows: results of a regular query. Note that rows can be treated as a value tuple of row tuples. For example, this statement is valid: `select * from t1 where (a,b) in (select b,c from t2)` 246 5. A virtual table: If you don’t strip out the field info from the original query, then a select can act as a table, where the field names act as column names. For example, this statement is valid: `select * from (select a, b from t2) as t where t.a=1` 247 248 This opens the door to some powerful expressibility, the other edge of the sword being indefinite complexity. 249 250 Subqueries can be found in the following clauses: 251 252 1. FROM 253 2. SELECT list 254 3. WHERE 255 256 In a way, subqueries are brain-dead, including the correlated ones. You just execute your main query as specified by the sequence in the ‘Dissecting SELECT’ section. Any time you encounter a subquery, you just evaluate it as an expression on-the-fly and use the resulting value. The only complication of a correlated subquery is that it references a value from the outer Result to produce its own Result. 257 258 An immediate optimization: If a subquery is not correlated, we know that its value is the same for every row of the outer query. So, such a subquery can be "pulled out" and evaluated only once, or we could use lazy evaluation and caching. But we’re venturing into optimizer territory, which we should refrain from for now. 259 260 Based on the above properties, we can make some observations: 261 262 * A subquery in a FROM clause cannot reference columns of the SELECT list. This is because of the execution order: the SELECT list depends on what comes out of the FROM clause, not the other way around. However, such a subquery can still reference columns from an outer SELECT. But MySQL is more restrictive, and prevents all correlation for subqueries in the FROM clause. 263 * A subquery in a SELECT list cannot affect the original number of rows returned. All it can do is supply the value for the column it was meant to compute the result for. If there was no result, the value would be NULL. A subquery in a SELECT list must produce a single scalar value. 264 * A subquery in a WHERE clause cannot change any values of the original result, or produce new rows. It can only cause rows to be eliminated. 265 266 It’s easy to get lost in the world of subqueries, and it may get confusing when you start to think about how they interact with joins. But you can retain sanity by following these principles while analyzing a complex query: 267 268 * Tables involved in a join are within the same scope, and can reference each other's columns. 269 * Subqueries are in an inner scope. An inner query can use values from the outer query. Other than this, all it can do is produce a value. 270 * Other parts of the outer query cannot reference anything about the subquery. It’s a black box in that respect. 271 272 Correlated subqueries exhibit the properties of a join. So, there are situations where they could be converted to actual joins by the optimizer. 273 274 It’s now time to drill down into the details of the operators. The notations below are not very formal. They’re there mainly to illustrate the concepts. 275 276 ## What kind of machine is SQL? 277 278 Primer on automata theory: [https://en.wikipedia.org/wiki/Automata_theory](https://en.wikipedia.org/wiki/Automata_theory). 279 280 Is SQL a turing machine? The answer is no, because it can only make forward progress, and always terminates. 281 282 Strangely, there are no references on the internet about what kind of machine SQL is. Looking at whether it can be a finite state machine, the answer seems to be no, because of subqueries. 283 284 So, it appears to be a Pushdown Automaton (PDA). This determination is important because it will help us decide the data structure needed to represent an execution plan. 285 286 Specifically, a tree structure is sufficient to represent a PDA. 287 288 ## The operands 289 290 This is the list of ‘values’ that a query engine deals with in order to satisfy a query: 291 292 1. Result: This is the core data type that every operator deals with. It’s a complex data structure. Sometimes, the primitives may access only a part of the Result. So, the interchangeability rules of primitives will be different based on which parts of the Result each of them accesses. 293 2. Value expression list: This is used in various places in a statement. Most notably, for converting a table scan into the list of values to be returned. 294 3. Conditional expression: This is used to filter out results. 295 4. Aggregate expression list: This has some similarities with a value expression list, but it’s worth differentiating. 296 5. Scalar values: Used all over. 297 298 The above operands are a simplified way of looking at things. For example, a conditional expression is a separate world that has its own set of operators and values, and they follow a different set of algebraic rules. But we need to divide and conquer. 299 300 You’ll see how these operands are used in the description of the operators. 301 302 ## The operators 303 304 If one were to ask what are the minimal set of primitives needed to serve SQL queries, we can use [relational algebra](https://www.tutorialspoint.com/dbms/relational_algebra.htm) as starting point. But databases have extended beyond those basic operations. They allow duplicate tuples and ordering. So, here’s a more practical set: 305 306 1. Scan (FROM). 307 2. Join: Join two results as cross-product (JOIN). 308 3. LeftJoin: The algebraic rules for an outer join are so different from a Join that this needs to be a separate operator (LEFT JOIN). 309 4. Filter: Removing rows based on a conditional expression (WHERE, HAVING). 310 5. Select: Evaluating new values based on a value expression list (SELECT list). 311 6. Aggregate: Grouping rows based on an aggregate expression list (GROUP BY, UNION). 312 7. Sort (ORDER BY). 313 8. Limit (LIMIT). 314 9. Merge: Merge two results into one (UNION ALL). 315 316 Windowing functions are out of scope for this doc. 317 318 Even the above primitives are fairly theoretical; One would rarely perform a Join without also simultaneously applying a join condition. 319 320 For a given SELECT statement, the order of these operators are fixed. They are as follows: 321 322 `Scan <-> (Join & LeftJoin) -> Filter(where) -> Select -> Aggregate -> Filter(having) -> Sort -> Limit -> Merge` 323 324 However, following the above order is almost never optimal. This is where we have to study how these operators interact, and try to find ways to resequence them for higher efficiency. 325 326 It’s important to note that most of these operators take a Result as input and produce one as output. This means that they can be indefinitely combined. This is achieved by re-SELECTing from a Result. It’s usually rare that people perform such nested selects, but there are legitimate use cases, typically for computing aggregates of aggregates. 327 328 Additionally, Filter and Select can perform their own scans (subqueries) as part of their expressions. This adds another dimension to the overall complexity. 329 330 ### Scan 331 332 This is the most basic operation: given a table name, the operation produces a result, which is a list of all the rows in the table, with field names: 333 334 `Scan(tableName) -> Result` 335 336 Rethinking this for VTGate, the corresponding operation would be to route a query that targets a single database, and the output would be the result. 337 338 `Route(query, keyspace, shard) -> Result` 339 340 It’s implied that the query is the sql+bindvars. 341 342 *If every operator takes a Result as input, who produces the first Result?* 343 344 Scan is in this category. Without this operator, there is nothing that can map a query to a Result. Consequently, this is also the lowest level building block that all other operations depend on. 345 346 It turns out that this is the only part of VTGate that differs from a traditional SQL engine. But it’s a powerful difference. For well tuned apps, the entire query will most of the time be just passed through to one of the underlying databases. 347 348 The optimizer is likely to be very different, because its main task will be to figure out how to break up a query into parts that can be outsourced to different databases. For example, if there’s a three-way join and if two of those pieces can be sent to a single shard, then it has to split that query correctly into two. Because of this, we need the ability to translate a relational expression back into a query, or somehow find a way to reuse the parts of the original query that pertain to the relational expression. 349 350 ### Filter 351 352 The filter operator removes rows from the Result based on a condition: 353 354 `Filter(Result, condition) -> Result` 355 356 A condition is a boolean expression. The operands can be: 357 358 * A column of the Result 359 * A column of an outer Result (to support correlated subqueries) 360 * Boolean values or expressions 361 * Scalar values or expressions 362 * The result of a subquery 363 364 The operators are: 365 366 * Pure boolean operators like AND, OR, etc. 367 * Comparison operators like =, <, etc. 368 * Other complex operators and functions like IF, etc. 369 370 Inferred properties of Filter: 371 372 * The filter operation is applied to each row individually. There is no interaction between rows. 373 * A row is never modified. 374 * No new rows are produced. 375 * It does not change the order of the rows in the Result. This is not a requirement, but it’s a property worth maintaining. 376 * The AND clauses of a filter can be broken into parts and applied separately. This is useful for pushing down WHERE clauses into scans. 377 378 It would have been nice if the Result was the only variable input to Filter. However, this is not true if one of the operands is a subquery. 379 380 *There are two schools of thought when it comes to boolean algebra: Those that allow interconversion between scalar and boolean values, and those that don’t. The former considers any non-zero scalar value to be TRUE, and zero as FALSE. The latter requires the scalar value to be tested using a comparison operator before it becomes a boolean value.* 381 382 *The first school of thought allows statements like this:* 383 384 `select * from t where a` 385 386 *The second school of thought requires you to rewrite it as:* 387 388 `select * from t where a != 0` 389 390 *Vitess’s grammar currently doesn’t allow this interconversion, but MySQL does. From a practical perspective, this difference is not significant because it results in a minor syntactic inconvenience. One could argue that ‘a’ could be a boolean column. But boolean values are not common in databases, and there are other alternatives. So, Vitess doesn’t support boolean column types. It assumes that everything from a database (or Result) is a scalar.* 391 392 *Not allowing interconversion greatly simplifies the grammar. Also, the boolean engine and expressions engine can be kept separate, which is another simplification.* 393 394 ### Select 395 396 The Select operation produces a new Result by performing arithmetic operations to the input Result. 397 398 `Select(Result, expression list) -> Result` 399 400 An expression is a subset of a boolean expression. The operands can be: 401 402 * A column of the Result 403 * A column of an outer Result 404 * Scalar values or expressions 405 * The result of a subquery 406 407 The operators can be: 408 409 * Arithmetic operators like +, - 410 * Functions and other complex expressions 411 412 Properties of a Select: 413 414 * A Select produces the same number of rows as the input 415 * A Select may not produce the same columns as its input 416 * The expressions are applied to each row individually. There is no interaction between rows. 417 * The order of the result is not changed. This is not a requirement, but it’s a property worth maintaining. 418 * The different items (columns) of a Select can be broken into parts and independently computed. 419 420 ### Aggregate 421 422 Aggregate splits the Result column into two groups: the aggregate columns and the group by columns. It merges the rows whose group by columns are equal, and simultaneously applies aggregation functions to the aggregate columns. 423 424 `Aggregate(Result, aggrExpressionList(columns)) -> Result` 425 426 Any columns that are not in the aggregate expression list are implicitly in the group by list. 427 428 There are two special cases: 429 430 * There are no aggregate columns. In this case, the operation degenerates into a dedup (UNION). 431 * There are no group by columns. In this case, one and only one row is produced. 432 433 The SQL syntax allows expressions to be present in aggregate functions, but that can be broken down into Select->Aggregate. 434 435 Properties of Aggregate: 436 437 * It produces no more rows than the input. 438 * It produces the same number of columns as the input. 439 * Aggregate can be efficient if the rows are sorted by the group by columns. If so, it also retains the sort order. 440 * Aggregate can also be efficient if there are no group by columns. Then only one row is produced in the Result. 441 * The different items of an Aggregate cannot be independently computed. 442 443 ### Sort 444 445 Sort reorders the rows of a Result based on a list of columns with ascending or descending properties. 446 447 `Sort(Result, (columns asc/desc)) -> Result` 448 449 Properties: 450 451 * The number of rows produced is the same as the input 452 * No rows are changed 453 * An outer Sort supersedes an inner Sort 454 * Sorting can be broken into parts as long as the resulting row is built in the same order as the values referenced by the sort. For example, if we join a with b, and sorting is by a.col and b.col, then we can sort a by a.col and join with b by sorting b by b.col. 455 456 ### Limit 457 458 Limit returns at the most N rows from the specified starting point. 459 460 `Limit(Result, start, count) -> Result` 461 462 Limit has algebraic properties that are similar to Filter. 463 464 Properties: 465 466 * A limit doesn’t change the rows 467 * An outer limit supersedes an inner Limit 468 * A Limit is not interchangeable with anything other than Select. 469 * A Limit without a Sort has non-deterministic results. 470 471 ### Join 472 473 Join produces the cross-product of two Results. 474 475 `Join(Result, Result) -> Result` 476 477 This primitive is too inefficient in most situations. Here are some practical composites: 478 479 `InnerJoin(Result, Result, condition) ⇔ Filter(Join(Result, Result), condition)` 480 481 The use case for InnerJoin is easy to explain, because pushing down a condition usually leads to valuable efficiency improvements. 482 483 `QueryJoin(query, query, condition) ⇔ Filter(Join(Scan(query), Scan(query)), condition)` 484 485 The QueryJoin is relevant mainly because VTGate may not get good at performing joins for a long time. So, it may have to rely on sending nested loop queries to the underlying databases instead. However, there will be situations where a join will have to be performed on two Results. Such queries will be out of VTGate’s ability to execute until it implements the `Join(Result, Result)` primitive. 486 487 Properties: 488 489 * Rows are not modified 490 * Lots of rows can be produced 491 * Interchangeable with Filter as long as the condition can be split properly. This will be a significant driving factor for the optimizer. 492 493 ### LeftJoin 494 495 LeftJoin requires a join condition. If the condition is not satisfied, the left side of the Result is still produced, with NULL RHS values. 496 497 `LeftJoin(LeftResult, RightResult, condition) -> Result` 498 499 It’s important to differentiate LeftJoin from Join because LeftJoin is not freely interchangeable with Filter or other Joins. 500 501 ### Merge 502 503 Merge merges two results into one. There is no particular order in the rows. 504 505 `Merge(Result, Result) -> Result` 506 507 Properties: 508 509 * Rows are not modified 510 * Number of rows is the sum of the two Results’ number of rows 511 512 ## Example from hell 513 514 Now that all operators are defined, we have to see if they’re complete. Let’s try this query: 515 516 ``` 517 01: (select count(*), ta1.col+1, tb1.col 518 02: from ta1 join tb1 on ta1.id=tb1.id 519 03: where ta1.col2=10 and ta1.col2 > any 520 04: (select c from tc1 where tc1.id=ta1.id) 521 05: group by ta1.col+1, tb1.col 522 06: having count(*) > 10 523 07: order by count(*) desc 524 08: limit 10) 525 09:union 526 10: (select count(*), ta2.col+1, tb2.col 527 11: from ta2 join tb2 on ta2.id=tb2.id 528 12: where ta2.col2=10 529 13: group by ta2.col+1, tb2.col) 530 14:order by count(*) desc 531 15:limit 5 532 ``` 533 534 The above query can be represented as a single expression. But let’s start with something that has intermediate results first: 535 536 ``` 537 // query 1 538 02: JoinResult1 = InnerJoin(Scan(ta1), Scan(tb1), ta1.id=tb1.id) 539 // Note that the next operation is within the scope of JoinResult1, so its columns can be used. Nevertheless, it’s an external input. 540 04: SubqueryCScanResult = Select( 541 Filter( 542 Scan(tc1), 543 tc1.id=JoinResult1.ta1.id, 544 ), 545 tc1.c) 546 03: FilterResult1 = Filter( 547 JoinResult1, 548 ta1.col2=10 and ta1.col2 > any(SubqueryCScanResult), 549 ) 550 01: SelectResult1 = Select(FilterResult1, (1, ta1.col+1, tb1.col)) 551 05: GroupResult1 = Aggregate(SelectResult1, count(‘1’)) 552 06: HavingResult1 = Filter(GroupResult1, ‘1’ > 10) 553 07: SortResult1 = Sort(HavingResult1, (‘1’ desc)) 554 08: Result1 = Limit(SortResult1, 0, 10) 555 // query 2 556 11: JoinResult2 = InnerJoin(Scan(ta2), Scan(tb2), ta2.id=tb2.id) 557 12: FilterResult2 = Filter(JoinResult2, ta2.col2=10) 558 10: SelectResult2 = Select(FilterResult2, (1, ta2.col+1, tb2.col)) 559 13: Result2 = Aggregate(SelectResult2, count(‘1’)) 560 // union 561 09: UnionResult = Aggregate(Merge(Result1, Result2), NULL) 562 14: OrderResult = Sort(UnionResult, (‘1’ desc)) 563 15: Result = Limit(OrderResult, 0, 5) 564 ``` 565 566 Now, combine everything into one expression, and voila: 567 568 ``` 569 Result = Limit( 570 Sort( 571 Aggregate( 572 Merge( 573 Limit( 574 Sort( 575 Filter( 576 Aggregate( 577 Select( 578 Filter( 579 JoinResult1 = InnerJoin( 580 Scan(ta1), 581 Scan(tb1), 582 ta1.id=tb1.id, 583 ), 584 ta1.col=10 and ta1.col > any( 585 Select( 586 Filter( 587 Scan(tc1), 588 tc1.id=JoinResult1.ta1.id, 589 ), 590 tc1.c, 591 ), 592 ), 593 ), 594 (1, ta1.col+1, tb1.col), 595 ), 596 count(‘1’), // ‘1’ is column 1 597 ), 598 ‘1’ > 10, 599 ), 600 (‘1’ desc), 601 ), 602 0, 10, 603 ), 604 Aggregate( 605 Filter( 606 InnerJoin( 607 Scan(ta2), 608 Scan(tb2), 609 ta2.id=tb2.id, 610 ), 611 ta2.col=10, 612 ), 613 count(‘1’), 614 ), 615 ), 616 NULL, 617 ), 618 (‘1’ desc), 619 ), 620 0, 5, 621 ) 622 ``` 623 624 There’s a lot of nesting, but the expressions are mostly straightforward by the fact that each result feeds into the other. The only exception is the correlated subquery that uses a value from JoinResult1 from the outer query to perform its operation. You can also see that the lowest level operations are all Scans. 625 626 If the primitives defined at the beginning were all implemented, if we had an infinitely fast CPU, or an infinite amount of time, and we also had an infinite amount of memory, the above ‘plan’ will produce the correct result. 627 628 However, SQL engines are all about having limited CPU, memory and disk. So, they have to figure out ways to rearrange those operations differently so that the minimum amount of time and resources is used to compute the result. The next section will cover those techniques. 629 630 # Approaches and tradeoffs 631 632 ## Standard approaches 633 634 If you analyze the various operators, the expensive ones are: 635 636 * Scan: you want to avoid full table scans, unless it’s a necessity. This is ironically the lowest level operation. 637 * Sort: this is a potentially expensive operation if the result set is large. 638 * Aggregation: is also potentially expensive 639 640 All other operations are considered ‘cheap’ because they can be applied as the rows come. To optimize for the above three operations, databases combine the primitives. The most popular one is the index scan, which allows you to combine Scan, Filter and Sort, all as one operation. Aggregation can also benefit from an index scan if the GROUP BY columns are part of an index. If an index cannot be used, the database resorts to the more traditional approach of full table scans, file sorts, etc. 641 642 ## Preserving the original representation 643 644 In the case of VTGate, the ‘Route’ operation is capable of performing all 9 functions, as long as all the rows needed to fulfill the query reside in a single database. However, those functions have to be expressed as an SQL query. Most often, the original query could just be passed-through to a single shard. So, the one important question is: If we converted a query to its relational representation, can we then convert it back to the original query? The answer is no, or at least not trivially; The relational operators don’t always map one-to-one with the constructs of an SQL statement. Here’s a specific example: 645 646 `select count(*), a+1, b from t group by a+1, b` 647 648 This gets expressed as: 649 650 ``` 651 Aggregate( 652 Select( 653 Scan(t), 654 (1, a+1, b), 655 ), 656 Count(‘1’), // ‘1’ is column 1 657 ) 658 ``` 659 660 So, it may not be a good idea to immediately convert the parsed query into the relational representation. When we analyze the query, we should try to retain the original text and see if they could be recombined without loss of readability. 661 662 The AST produced by the parser can be reversed back into a representation that’s accurate and close to the original query. This is what VTTablet already does. So, we can use the AST as the intermediate representation that can be retained until we can figure out what can be delegated to the underlying databases. 663 664 Once we have identified the parts that can be pushed down, what is left can be converted to use the operations that VTGate can perform. 665 666 ## Cleverness 667 668 Relational databases initially optimized their queries based on a standard set of rules that used the schema info as the only additional input. These were known as rule-based optimizers. However, some queries would fail for certain values due to pathological corner cases. 669 670 In order to address such corner cases, databases decided to venture into cost-based optimizers. This was somewhat of a slippery slope. On the one hand, the feature was very convenient, because one could throw any kind of ad-hoc query to a database, and it would figure out the best way to get the results. However, every once in awhile, the database would make the wrong optimization decision. This led to iterations to accommodate the new corner cases. Unfortunately, there is no end to this, and databases are still iterating. Beyond a certain point, this becomes a losing battle because the cost of computing the best plan starts to exceed the cost of executing the query using a less optimal one. 671 672 The cost-based optimizers have become so complex that a novice programmer can’t reliably predict the execution plan for a query. It’s also hard to know why an engine chose a certain plan. 673 674 An escape hatch was provided: query hints. If the database failed to find the most optimal plan, the application can specify hints in the query. Once hints are provided, the optimizer is bypassed, and the application dictates the plan for the query. The fundamental question is: Who should know the data? This question opens a lot of arguments. The main argument in favor of why the app should be responsible for the query plans is as follows: even if the database can decide the most optimal plan, the app has to know the cost of a query. It needs to know this because it makes a decision on whether it’s going to send 1QPS or 1MQPS of that query. If the plan needs to change, it means that the cost is also changing, which means that the app needs to be revisited. This is why an optimizer that is too clever has diminishing returns, and is sometimes counter-productive. 675 676 The rise of NoSQL databases is further evidence that engineers are generally not very sold on how clever the database optimizers are. 677 678 One concern with the application dictating the plan is that we may constantly worry that it might not be the best one. To alleviate this, we can develop analyzers, possibly backed by a machine-learning backend, that can suggest rewrites. But it will be up to the application to accept them. 679 680 So, we’re going to attempt this brain-dead approach for optimizations in VTGate: By looking at a query, and knowing the vschema, one should be able to predict what VTGate will do with it. If, for some reason, the plan doesn’t work out, then the query must be rewritten to represent the alternate plan. There is a possibility that this approach may fail; Some database users have grown accustomed to the optimizer doing the work for them. If it happens, this decision can be revisited. 681 682 ### What does this all mean 683 684 This just means that the VTGate optimizer will be very simple and transparent. For example, joins will always be performed left-to-right. When you look at a query, you’ll know the join order by just looking at it. VTGate will not try to rewrite subqueries into joins. If a join plan is more efficient than a subquery, then the app must write it that way. 685 686 Of course, if there are optimizations that make common sense, those will be implemented. For example, if a subquery is not correlated, then it can be pulled out and executed only once. 687 688 We should also be sensitive to situations where the natural expression of a query does not match the best plan. 689 690 No matter what, once a query is outsourced to a database, it’s subject to the optimizer decisions of the database. So, the app may still have to give hints for such situations. 691 692 ## Opcodes vs tree 693 694 The execution plan can be represented either using opcodes or as a tree. The opcode approach represents the plan as a list of actions to be performed, whereas a tree represents the same plan as one action that depends on other actions, which recursively ends up with leaf nodes that perform the routing. 695 696 Opcodes can be very flexible (turing complete). However, they require the introduction of intermediate variables. Let’s look at this example: 697 698 `a - (b+c)` 699 700 The opcode approach would express the expression as: 701 702 ``` 703 temp = b+c 704 result = a - temp 705 ``` 706 707 The tree approach would represent the action as: 708 709 `Minus(a, Plus(b, c))` 710 711 In the theory section, we classified SQL as a pushdown automaton, which means that a tree structure is sufficient to represent a statement. 712 713 Using opcodes may still help us come up with better optimizations. However, given that we don’t plan on going overboard on optimizations, the more readable tree representation is preferable. 714 715 ## Prioritization 716 717 Since we have powerful databases underneath us, our first priority is to harness anything they can do first. The next priority will be to do things that VTGate can perform on-the-fly, without accumulating results. And finally, we’ll look at features that require complex memory management or intermediate file storage. The features will be implemented as follows: 718 719 1. Single shard or single keyspace joins. 720 2. Scatter joins that can be merged in any order. 721 3. Cross-shard joins that don’t require post-processing. 722 4. Sort, Aggregate and Limit, provided the results can be pre-sorted by the underlying database(s). 723 5. Filter and Select expression engines. 724 6. All other constructs. 725 726 One hurdle to overcome with #4 is collation. Substantial work may have to be done to make the VTGate collation behavior match MySQL. 727 728 ## Streaming vs non-streaming 729 730 VTGate has two query APIs: 731 732 * Non-streaming API: This API has deadlines and row count limits. It collects all the results first and then returns them. 733 * Streaming API: This API has no deadlines or row count limits. It streams the results as they come. 734 735 This difference will eventually cause a divergence in what VTGate can do for streaming vs non-streaming queries. Since the entire result is available in-memory for non-streaming queries, we’ll be able to perform sorts and aggregations after obtaining the results. However, this will not be possible for streaming queries. 736 737 For the short term, we’ll maintain parity by only implementing operations that both APIs can perform. This actually means priorities 1 through 5. So, it may be a long time before the two diverge. 738 739 ## Column names 740 741 The vschema currently doesn’t know all the column names of the tables. VTGate cannot resolve joins if it doesn’t have this info. We have two options: 742 743 1. Extend vschema to also contain the column names of tables. 744 2. Require join queries to always qualify column names by the table. 745 746 The first option is going to be a maintenance problem. It introduces a dependency that is hard to maintain. For now, we’ll go with option #2. It is a syntactic inconvenience, but let’s hope that users find this acceptable. We could allow exceptions for vindex columns, but that’s likely to be confusing. So, it’s better to require this for all columns. For subqueries (that are themselves not joins), we can make an exception and assume that unqualified column names are addressing the inner table. You’ll need to use qualified names only for addressing columns of an outer query. 747 748 ## Redundant constructs 749 750 SQL has many redundant constructs. There’s diminishing return in supporting all of them. So, the following constructs will not be supported: 751 752 * `‘,’` join: This operator was the old way of joining tables. The newer SQL recommends using actual JOIN keywords. Allowing both forms of JOIN will be a big source of confusion because these operators also have different preferences. 753 * RIGHT JOIN: This is same as a reversed LEFT JOIN, except it’s less readable. 754 * NATURAL joins: These are rarely used in practice because the join condition applies to all columns. 755 * JOIN USING: This is just syntactic sugar for JOIN ON a.id=b.id 756 * WHERE clause join condition: A JOIN without an ON clause is supposed to be a cross product of the two tables. However, old-style ‘,’ users have continued to specify join conditions in the WHERE clause. The initial version of VTGate will only use the ON clause to decide how to route parts of a query. This decision can be revisited if things are too inflexible. 757 758 ## Limiting results 759 760 VTGate currently doesn’t have any limits on how many rows it can hold at any given time. This is dangerous because a scatter query that hits a large number of shards could easily blow up the memory. 761 762 Also, as we add support for joins, we’ll have a new way of accumulating a large number of rows. 763 764 So, the VTTablet level protection to limit the maximum number of rows fetched may not be sufficient any more. 765 766 We’ll add a new maxrows variable to VTGate. If the number of rows accumulated exceeds this count, we’ll return an error. 767 768 Note that this limit only applies to non-streaming queries. Streaming queries will continue to work as before. 769 770 # Design 771 772 Recapitulating what we’ve covered so far: 773 774 * The primary function of the optimizer is to push down as much of the query components as possible. 775 * We’ll not support redundant constructs for joins. 776 * If a query has to be split across multiple shards or keyspaces, we’ll expect the application to dictate the order of operations. This will be left to right, unless parentheses define a different order. 777 * For joins, all column names must be qualified by their table. 778 * We’ll preserve the original representation of the query to the extent possible. 779 * We’ll not attempt to flatten subqueries. However, they can still get pushed down with the outer query if they’re constrained by the same keyspace id. 780 781 ## Symbol table and scoping rules 782 783 Once we start allowing joins and subqueries, we have a whole bunch of table aliases and relationships to deal with. We have to contend with name clashes, self-joins, as well as scoping rules. In a way, the vschema has acted as a static symbol table so far. But that’s not going to be enough any more. 784 785 The core of the symbol table will contain a map whose key will be a table alias, and the elements will be [similar to the table in vschema](https://github.com/vitessio/vitess/blob/0b3de7c4a2de8daec545f040639b55a835361685/go/vt/vtgate/vindexes/vschema.go#L82). However, it will also contain a column list that will be built as the query is parsed. 786 787 ### A simple example 788 789 Let’s start with the following vschema that has two tables t1 & t2. Here’s a simplified view: 790 791 <table> 792 <tr> 793 <td>tables</td> 794 <td>columns</td> 795 <td>vindexes</td> 796 </tr> 797 <tr> 798 <td>t1</td> 799 <td>id</td> 800 <td>id_idx: hash</td> 801 </tr> 802 <tr> 803 <td></td> 804 <td>a</td> 805 <td>a_idx: lookup_unique</td> 806 </tr> 807 <tr> 808 <td>t2</td> 809 <td>t2id</td> 810 <td>t2id_idx: hash</td> 811 </tr> 812 <tr> 813 <td></td> 814 <td>b</td> 815 <td>b_idx: lookup_unique</td> 816 </tr> 817 </table> 818 819 820 For the following query: 821 822 `select a, b from t1 where id = :id and c = 5` 823 824 The starting symbol table will be identical to the vschema. However, as the query is analyzed, b and c will get added to t1 as additional columns: 825 826 <table> 827 <tr> 828 <td>tables</td> 829 <td>columns</td> 830 <td>vindexes</td> 831 </tr> 832 <tr> 833 <td>t1</td> 834 <td>id</td> 835 <td>id_idx: hash</td> 836 </tr> 837 <tr> 838 <td></td> 839 <td>a</td> 840 <td>a_idx: lookup_unique</td> 841 </tr> 842 <tr> 843 <td></td> 844 <td>b</td> 845 <td></td> 846 </tr> 847 <tr> 848 <td></td> 849 <td>c</td> 850 <td></td> 851 </tr> 852 </table> 853 854 855 If a symbol table contains only one database table, then all unqualified column references are assumed to implicitly reference columns of that table. 856 857 *Ideally, we should be validating column references against a known column list for each table. However, as mentioned before, the maintenance of that list comes with its own problems. Instead, we trust that a non-ambiguous column reference implies that the table (or result) does have such a column. In the worst case where such a column doesn’t exist, the query will fail. There are no circumstances where this assumption can lead to incorrect results.* 858 859 The bind vars section of the symbol table will contain an entry for id. 860 861 ### Self-join 862 863 For the following query (using the previous vschema): 864 865 `select t1.a, t2.c from t1 join t1 as t2 where t2.a = t1.id` 866 867 The symbol table will look like this: 868 869 <table> 870 <tr> 871 <td>tables</td> 872 <td>columns</td> 873 <td>vindexes</td> 874 </tr> 875 <tr> 876 <td>t1</td> 877 <td>id</td> 878 <td>id_idx: hash</td> 879 </tr> 880 <tr> 881 <td></td> 882 <td>a</td> 883 <td>a_idx: lookup_unique</td> 884 </tr> 885 <tr> 886 <td>t2</td> 887 <td>id</td> 888 <td>id_idx: hash</td> 889 </tr> 890 <tr> 891 <td></td> 892 <td>a</td> 893 <td>a_idx: lookup_unique</td> 894 </tr> 895 <tr> 896 <td></td> 897 <td>c</td> 898 <td></td> 899 </tr> 900 </table> 901 902 903 In the symbol table, t2 starts off as identical to t1, and is different from t2 that’s in the original vschema. But t2 eventually adds c to its list of columns. There is currently no value in remembering that t1 and t2 refer to the same underlying table. Maybe a future use case will arise that will introduce this need. 904 905 ### Subqueries 906 907 MySQL allows table names (or aliases) in subqueries to hide an outer name. Due to this, we have to extend the symbol table to support scoping. This is done by changing it into a linked list, where each node represents a scope. A symbol is searched in the list order until it’s found. 908 909 Let’s look at this intentionally confusing example: 910 911 ``` 912 select t1.a, t2.b from t1 join t2 where t2.t2id in ( 913 select b from t1 as t2 where id = t1.c 914 ) 915 ``` 916 917 The outerscope symbol table will look like this: 918 919 <table> 920 <tr> 921 <td>t1</td> 922 <td>id</td> 923 <td>id_idx: hash</td> 924 </tr> 925 <tr> 926 <td></td> 927 <td>a</td> 928 <td>a_idx: lookup_unique</td> 929 </tr> 930 <tr> 931 <td></td> 932 <td>c</td> 933 <td></td> 934 </tr> 935 <tr> 936 <td>t2</td> 937 <td>t2id</td> 938 <td>t2id_idx: hash</td> 939 </tr> 940 <tr> 941 <td></td> 942 <td>b</td> 943 <td>b_idx: lookup_unique</td> 944 </tr> 945 </table> 946 947 948 The innerscope will look like this: 949 950 <table> 951 <tr> 952 <td>t2</td> 953 <td>id</td> 954 <td>id_idx: hash</td> 955 </tr> 956 <tr> 957 <td></td> 958 <td>a</td> 959 <td>a_idx: lookup_unique</td> 960 </tr> 961 <tr> 962 <td></td> 963 <td>b</td> 964 <td></td> 965 </tr> 966 </table> 967 968 969 Column c for t1 was added because the inner query made a qualified reference to it. The inner scope table gets a column b added because the innerscope query made an implicit reference, and there was only one item in that scope. 970 971 Also, the inner query cannot access any columns of the outer t2 because it has its own t2 symbol that hides the outer one. 972 973 A corollary: since subquery nesting can have many levels, an inner query can still reference results from any of the outer scopes. For example, this is a valid query: 974 975 ``` 976 select t1.a, t1.b 977 from t1 978 where t1.id in ( 979 select t2.id 980 from t2 981 where t2.id in ( 982 select t3.id 983 from t3 984 where t3.id = t1.a 985 ) 986 ) 987 ``` 988 989 ### The FROM clause 990 991 The FROM clause is an exception. This clause creates a new symbol table (scope), with no initial outerscope, even if it was part of a subquery. At the end of the analysis, the newly created symbol table is added to the outerscope, if there was one. 992 993 This means that subqueries in the FROM clause are not allowed to refer to elements of any outer query. 994 995 The result of such a subquery will be a new virtual table with its own vindex columns, etc. This is facilitated by the fact that the SQL syntax requires you to name the subquery with an alias. Figuring out possible vindexes for the new virtual table is useful because routing rules could be applied for other tables that may join with the subquery. Here’s an example: 996 997 ``` 998 select sq.a, t2.b 999 from 1000 (select a from t1 where id = :id) as sq 1001 join t2 on t2.b = sq.a 1002 ``` 1003 1004 The main symbol table will look like this: 1005 1006 <table> 1007 <tr> 1008 <td>tables</td> 1009 <td>columns</td> 1010 <td>vindexes</td> 1011 </tr> 1012 <tr> 1013 <td>sq</td> 1014 <td>a</td> 1015 <td>a_idx: lookup_unique</td> 1016 </tr> 1017 <tr> 1018 <td>t2</td> 1019 <td>t2id</td> 1020 <td>t2id_idx: hash</td> 1021 </tr> 1022 <tr> 1023 <td></td> 1024 <td>b</td> 1025 <td>b_idx: lookup_unique</td> 1026 </tr> 1027 </table> 1028 1029 1030 The symbol table for the `‘from t1’` subquery will be in its own scope. Its analysis will yield a new virtual table sq, which will be the one added to the main symbol table. The analyzer should hopefully be smart enough to deduce that column ‘a’ of the subquery can be used for routing because it’s a faithful representation of the column in the underlying table t1. 1031 1032 *Conceptually speaking, there’s no strong reason to prevent the FROM clause from referencing columns of an outer query. MySQL probably disallows this because it leads to confusing scoping rules. Let’s take a look at this example:* 1033 1034 ``` 1035 select id from a where id in ( 1036 select b.id, a.id 1037 from 1038 (select id from b where b.id=a.id) b 1039 join d as a on a.id=b.id) 1040 ``` 1041 1042 *If we allowed an outer scope, then b.id=a.id becomes ambiguous. We don’t know if it should refer to the outer ‘a’, or the one that is yet to be created as part of the current scope. Needless to say, one shouldn’t be writing such unreadable queries.* 1043 1044 ### ON clauses 1045 1046 By the fact that the ON clause is part of a FROM clause, you cannot refer to an element of an outer query. This is an invalid query: 1047 1048 `select id from a where id in (select b.id from b join c on b.id=a.id)` 1049 1050 But this is valid: 1051 1052 `select id from a where id in (select b.id from b join c where b.id=a.id)` 1053 1054 Furthermore, you can only refer to elements that participate in the current JOIN. But the following query is still valid: 1055 1056 `select * from a join b on b.id = a.id join c on c.id = a.id` 1057 1058 Because of associativity rules, c is actually joined with ‘a join b’. 1059 1060 But this query is not valid: 1061 1062 `select * from a join (b join c on c.id = a.id)` 1063 1064 The ON clause participants are only b and c because of the parenthesis. So, you’re not allowed to reference a.id in that join. In other words, analysis of the ON clause requires a list of participants. The analyzer will need to check with this list before looking inside the symbol table, or we could exploit Go interfaces here by sending a subset symbol table that has the same interface as the original one. 1065 1066 ON clauses are allowed to have correlated subqueries, but only with tables that are the ON clause participants. This statement is valid: 1067 1068 `select * from a join b on b.id in (select c.id from c where c.id=a.id)` 1069 1070 ### Naming result columns 1071 1072 Result columns get referenced in the following places: 1073 1074 * For addressing the results of a subquery in the FROM clause 1075 * GROUP BY, HAVING and ORDER BY clauses 1076 1077 The WHERE clause is not allowed to reference them because it happens before the SELECT expressions are evaluated. 1078 1079 Column names are usually built from the select expression itself. Therefore, ‘a+1’ is a valid column name. If such a column is in a FROM subquery, it can be addressed using back-quotes as ``a+1``. However, you can reference such expressions directly in GROUP BY, HAVING and ORDER BY clauses. You can also reference a result by using its column number (starting with 1). 1080 1081 Conceptually, this is similar to the subquery column naming. GROUP BY, etc. are post-processing operations that are applied to the result of a query. 1082 1083 *MySQL allows duplicate column names. Using such column names for joins and subqueries leads to unpredictable results. Because of this, we could go `either way`:* 1084 1085 * *Allow duplicates like MySQL and remain equally ambiguous* 1086 * *Disallow duplicates* 1087 1088 Here’s an example of how MySQL names columns: 1089 1090 ``` 1091 > select id, (select val from a), 3, 3, 3 as `4` from a; 1092 +------+------------------------+---+---+---+ 1093 | id | (select val from a) | 3 | 3 | 4 | 1094 +------+------------------------+---+---+---+ 1095 | 1 | a | 3 | 3 | 3 | 1096 +------+------------------------+---+---+---+ 1097 MariaDB [a]> select `(select val from a)` from (select id, (select val from a), 3, 3, 3 as `4` from a) as t; 1098 +------------------------+ 1099 | (select val from a) | 1100 +------------------------+ 1101 | a | 1102 +------------------------+ 1103 1 row in set (0.00 sec) 1104 MariaDB [a]> select `(select val from a)` from (select id, (select val from a), 3, 3, 3 as `4` from a) as t; 1105 ERROR 1054 (42S22): Unknown column '(select val from a)' in 'field list' 1106 ``` 1107 1108 The select expression is produced verbatim as column name, and duplicates are allowed. 1109 1110 These quirks don’t matter much if VTGate just has to return these results back to the client without generating its own values. However, the result column names need to be known for analyzing dependencies when we split queries. So, VTGate will do a fake mimic of what MySQL does: Take a guess at the column name by rebuilding a string representation of the AST for that select expression. If there’s a lucky match, well and good. If not, the query analysis will fail, and the user will have to provide a non-ambiguous alias for such columns. In practice, this is not a real problem because users also generally dislike using verbatim expressions as column names when performing complex joins. 1111 1112 So, once the SELECT expression is analyzed, the symbol table (for that scope) will acquire an entry for the result column names. Additionally, if a result column is a faithful reference to an underlying table column, it will contain this additional annotation, which can be used for optimizing other operations. 1113 1114 ### Bind variables 1115 1116 When VTGate has to perform a join, it needs to use the results of one query to feed into the other. For example, let’s look at this query: 1117 1118 `select t1.a, t2.b from t1 join t2 on t2.id = t1.id` 1119 1120 If t1 and t2 are on different keyspaces, VTGate has to perform the following operations: 1121 1122 `select t1.a from t1` 1123 1124 For each row of this result, it then has to perform: 1125 1126 `select t2.b from t2 where t2.id = :_t1_a` 1127 1128 The mechanism for this happens by generating a bind variable named ‘_t1_a1’ from the first result. The bind variable name can be generated using a functional algorithm: 1129 1130 `‘_’ + tableAlias + ‘_’ + columnName` 1131 1132 The scoping rule works in our favor here because this type of name generation correctly handles naming conflicts. 1133 1134 *MySQL allows multiple columns in a result to have the same name, and allows using of such ambiguous names in joins, etc. Fortunately, this also works in our favor because we’ll be no worse if we also allowed it.* 1135 1136 For the sake of simplicity, we’ll not allow dependencies to use non-standard column names. For example, we’ll fail a query like this: 1137 1138 `select * from (select a, count(*) from t1) t where t.count(*) > 0` 1139 1140 This should instead be rewritten as: 1141 1142 `select * from (select a, count(*) c from t1) t where t.c > 0` 1143 1144 This way, we can build a bind variable like `‘:_t_c’` for the outer query if we have split the above query into two. 1145 1146 There is a subtle difference between column names produced by a result vs. the column names in the symbol table. The WHERE clause constructs reference columns in the symbol table. It’s only the post-processing constructs like GROUP BY, etc. that reference column names produced by the result. This could pose a problem when a SELECT gets pushed down that defines a new alias that has the same column name as an underlying table. Will such a construct hide, and therefore, corrupt the meaning of a join? Let’s look at an example: 1147 1148 `select a.col1 as col2, b.col1 from a join b on b.col2=a.col2` 1149 1150 Rewritten, this would become 1151 1152 ``` 1153 select a.col1 as col2, a.col2 from a 1154 Join (by producing _a_col2 from column 2) 1155 select b.col1 from b where b.col2 = :_a_col2 1156 ``` 1157 1158 Because we reference column numbers from the first query, we bypass naming ambiguities. However, we still need to know the source column for each expression so we don’t have to unnecessarily fetch identical results. In other words, the following query: 1159 1160 `select a.col2 as col2, b.col1 from a join b on b.col2=a.col2` 1161 1162 should become: 1163 1164 ``` 1165 select a.col2 as col2 from a 1166 Join (by producing _a_col2 from column 1) 1167 select b.col1 from b where b.col2 = :_a_col2 1168 ``` 1169 1170 This means that select expressions need to know if they’re a faithful reference of an underlying table column, and this information will be used for wiring up dependencies. 1171 1172 The other situation where result names interact with the symbol table is during a subquery in the FROM clause. Fortunately, we create a separate virtual table entry for it in the symbol table for this situation. This allows us to keep the two worlds from colliding. 1173 1174 ## Pushdown 1175 1176 Pushdown is achieved by analyzing the query in the execution order and seeing if the next stage can be pushed down into the database. For VTGate, the execution order is slightly different from the standard one: Route replaces Scan, and also, the Route primitive can perform a scatter. 1177 1178 In terms of relational primitives, a scatter is a bunch of `Route` operations followed by a `Merge`. We’ll call it `RouteMerge`. So, what you can push down into a simple `Route` may not be allowed for a `RouteMerge`. Here’s the VTGate-specific order of operations: 1179 1180 `Route/RouteMerge <-> (Join & LeftJoin) -> Filter(where) -> Select -> Aggregate -> Filter(having) -> Sort -> Limit -> Merge` 1181 1182 The rest of the analysis tries to push the subsequent operations into Route or RouteMerge. Whatever cannot be pushed down is the work that VTGate has to do by itself. The fact that we won’t try to flatten subqueries simplifies our analysis a little bit. 1183 1184 *MySQL allows you to Sort and Limit the result of a UNION. This is something that the Vitess grammar doesn’t allow right now. We’ll eventually need to support this construct, which will add `Aggregate -> Sort -> Limit` as things that can further happen after the last `Merge`.* 1185 1186 The overall strategy is as follows: 1187 1188 1. Identify groups: Form groups of tables (or subqueries) in the FROM clause that can stay together for Route operations. 1189 2. Pushdown higher-level operations into the various groups: This step uses the AST as input and produces an execution tree where the leaf nodes are all Route operations. 1190 3. Wire up dependencies: This step computes the dependencies between different routes, and potentially modifies the queries in the routes so that additional values are returned that may be needed by other routes. 1191 4. Produce the execution plan. 1192 1193 ### Starting primitives 1194 1195 In order to align ourselves with our priorities, we’ll start off with a limited set of primitives, and then we can expand from there. 1196 1197 VTGate already has `Route` and `RouteMerge` as primitives. To this list, let’s add `Join` and `LeftJoin`. Using these primitives, we should be able to cover priorities 1-3 (mentioned in the [Prioritization](https://github.com/vitessio/vitess/blob/main/doc/V3HighLevelDesign.md#prioritization) section). So, any constructs that will require VTGate to do additional work will not be supported. Here’s a recap of what each primitive must do: 1198 1199 * `Route`: Sends a query to a single shard or unsharded keyspace. 1200 * `RouteMerge`: Sends a (mostly) identical query to multiple shards and returns the combined results in no particular order. 1201 * `Join`: Executes the LHS operation, which could be any primitive. For each row returned, it builds additional bind vars using the result, and uses them to execute the RHS. For each row of the RHS, it builds a new row that combines both the results. 1202 * `LeftJoin`: Same as Join, except that a row is returned even if the RHS fails to return a row. 1203 1204 ### FROM clause 1205 1206 The first step of the pushdown algorithm is to identify tables that can stay together. Requiring the ON clause to specify the join condition helps simplify this analysis. If this rule is followed in the query, we can determine this grouping by just analyzing the FROM clause. 1207 1208 The premise is as follows: A join can be pushed down if all the rows needed to satisfy a query are within the same database. If all the tables are in an unsharded keyspace, then this constraint is automatically true. If the tables are in a sharded keyspace, then the join condition must guarantee this. This is best explained with an example. Let’s start with a sharded keyspace with user and extra as tables, and analyze this query: 1209 1210 `select user.name, extra.info from user join extra on extra.id=user.id` 1211 1212 The join algorithm will first scan user: 1213 1214 `select user.name, user.id from user` 1215 1216 Then for each user.id (:user_id), it will execute: 1217 1218 `select extra.info from extra where extra.id = :user_id` 1219 1220 If user and extra had the same vindex on id, and if it was unique, then we know that all the necessary rows to satisfy the second query belong to the same kesypace_id as the row that came from user. So, they will be in the same shard. This means that it’s safe to push down the join. 1221 1222 Note that this push-down will work for RouteMerge queries also. 1223 1224 Given that the unsharded keyspace is an easy special case, we’ll focus mainly on the analysis of sharded keyspaces. Any place where we mention ‘same shard’ or ‘same keyspace id’, such condition will equally apply to an unsharded keyspace. 1225 1226 *Rule: For a join to be groupable, the join condition must be an equality condition where the participant columns use the same vindex, and the vindex must be unique.* 1227 1228 Some examples of what kind of AST will be produced by joins: 1229 1230 ``` 1231 a join b join c 1232 J 1233 / \ 1234 J c 1235 / \ 1236 a b 1237 A join (b join c) 1238 J 1239 / \ 1240 a J 1241 / \ 1242 b c 1243 a left join b on c1 left join c on c2 1244 L:c2 1245 / \ 1246 L:c1 c 1247 / \ 1248 a b 1249 ``` 1250 1251 For left joins, the ON clause is mandatory and enforced by the grammar. Because of this, the placement of ON clauses can also dictate grouping, which is not the case for normal joins: 1252 1253 ``` 1254 a left join b left join c on c1 on c2 1255 L:c2 1256 / \ 1257 a L:c1 1258 / \ 1259 b c 1260 ``` 1261 1262 By looking at the above tree, we can deduce that c1 can reference b and c, while c2 can reference a, b and c. 1263 1264 As mentioned before, when the required data is in multiple shards or keyspaces, VTGate will perform the join, and the order will be dictated by the SQL statement, as represented by the above execution trees. The join conditions themselves may try to dictate a different order. However, that will not be used to reorder the joins. 1265 1266 For example: 1267 1268 `from a join (b join c on c.id=b.id) on c.id=a.id` 1269 1270 is likely better executed this way: 1271 1272 `from a join c on c.id=a.id join b on b.id=c.id` 1273 1274 or this way: 1275 1276 `from b join c on c.id=b.id join a on a.id=c.id` 1277 1278 But VTGate will not perform such reordering tricks. 1279 1280 The work of the analyzer is to recursively walk the JOIN tree. The decisions are made from bottom to top. 1281 1282 The lowest level node of a parse tree is a ‘table’. This could be a real table or the result of a subquery that can be treated as a table. Knowing the table name, we can deduce the keyspace to which it belongs. So, the output of this analysis is the creation of the table entry in the symbol table, and the node being marked as a RouteMerge, because the rows for that table are scattered across all shards until additional constraints are applied. 1283 1284 When analyzing a join node, we first extract the ON clause conditions as a list, and we see if these requirements are satisfied: 1285 1286 * The node types have to be Route or RouteMerge 1287 * They should be for the same keyspace (sharded) 1288 * There should be at least one join condition between the two nodes: left.col = right.col 1289 * The columns referenced in the join must use the same vindex 1290 * The vindex must be unique 1291 * Other conditions in the ON clause must also be executable under the new group. This basically means that if there are subqueries, they should be correlated in such a way that the required rows can all be found within the new group. The algorithm for this analysis is explained in the WHERE clause section. 1292 1293 Two nodes can also be grouped (without a join condition) if they are both constrained by the same keyspace id (Route nodes). 1294 1295 The above rules work for both JOIN and LEFT JOIN nodes. 1296 1297 ###### Why does grouping work for LEFT JOIN? 1298 1299 *It’s easy to explain why grouping works for the simple case:* 1300 1301 `a left join b on b.id=a.id` 1302 1303 *In the above case, the only rows that are inspected in b are those where b.id=a.id. So, the join can be grouped. But what about this join:* 1304 1305 `(a left join b on a.id=b.id) join (c left join d on c.id=d.id) on b.id=d.id` 1306 1307 *In the above case, rows from b or c could be NULL. Fortunately, in SQL, NULL != NULL. So, the outer-scope join will succeed only if rows from b and c are not NULL. If they’re not NULL, they’re guaranteed to be from the same shard. So, this makes them groupable. In fact, in the above case, the left joins in the inner scope are unnecessary. They could have just been normal joins. However, things would be very different if one had used the null-safe equal operator (<=>) for joins. But we’ll not treat null-safe equal as a valid join operator.* 1308 1309 When two nodes are grouped, the current join condition becomes the root of the new group, and it gets a routing property: 1310 1311 * If it’s a JOIN, the new property is the more restrictive of the two nodes. So, if one of them is a Route, then the new node is also a Route. 1312 * For a LEFT JOIN, the new property is the same as the LHS node. 1313 1314 If the grouping conditions are not met, then the node remains a join node. In this case, we have to see if the ON clause conditions can be pushed down into the left and/or right nodes. By the fact that the current join is split into two, the ON clause cannot be pushed as is. Instead, we use associativity rules to our benefit and merge the ON clause conditions into the WHERE clauses of the underlying nodes. The rules are the same as the ones described for a normal WHERE clause. 1315 1316 But left joins are slightly different, because the join condition is applied *to the RHS only*. Also, the condition cannot be further pushed into other nested left joins, because they will change the meaning of the statement. For example: 1317 1318 `from a left join (b left join c on c.id=b.id) on c.id=a.id` 1319 1320 is same as: 1321 1322 `a LeftJoin (b LeftJoin c where c.id=b.id) where c.id=a.id)` 1323 1324 But it’s not the same as: 1325 1326 `a LeftJoin (b left join c where c.id=b.id and c.id=a.id)` 1327 1328 If the above joins were normal joins, then all three would be equivalent. 1329 1330 *The RHS-only rule for left joins has some non-intuitive repercussions. For example, this query will return all rows of table ‘a’:* 1331 1332 `select a.id, b.id from a left join b on (a.id=5 and b.id=a.id)` 1333 1334 *But b.id will be mostly null, except when a.id and b.id were 5.* 1335 1336 An ON clause can contain other constraints that are not joins. If so, they’re treated just like a WHERE clause constraint. So, a keyspace id constraint in an ON clause can change a RouteMerge to a Route. This means that we have to look at how to combine Route nodes with others. 1337 1338 A subquery in the FROM clause can result in any of the four primitive node. If it’s a join node, then the ON clause cannot be pushed down because it depends on the result of the join. If it was a Route or RouteMerge, the ON clause is pushable. Here’s how: 1339 1340 `from a join (select id, count(*) from b where b.id=1) t on t.id=a.id` 1341 1342 becomes 1343 1344 ``` 1345 J 1346 / \ 1347 a (select id, count(*) from b where b,id=1) t where t.id=a.id 1348 ``` 1349 1350 Here’s a tabular version of the above algorithm: 1351 1352 <table> 1353 <tr> 1354 <td>LHS</td> 1355 <td>Join type</td> 1356 <td>RHS</td> 1357 <td>Result</td> 1358 </tr> 1359 <tr> 1360 <td>Route</td> 1361 <td>Join</td> 1362 <td>Route</td> 1363 <td>Route</td> 1364 </tr> 1365 <tr> 1366 <td>Route</td> 1367 <td>Join</td> 1368 <td>RouteMerge</td> 1369 <td>Route</td> 1370 </tr> 1371 <tr> 1372 <td>RouteMerge</td> 1373 <td>Join</td> 1374 <td>Route</td> 1375 <td>Route (RHS)</td> 1376 </tr> 1377 <tr> 1378 <td>RouteMerge</td> 1379 <td>Join</td> 1380 <td>RouteMerge</td> 1381 <td>RouteMerge</td> 1382 </tr> 1383 <tr> 1384 <td>Route</td> 1385 <td>LeftJoin</td> 1386 <td>Route</td> 1387 <td>Route (LHS)</td> 1388 </tr> 1389 <tr> 1390 <td>Route</td> 1391 <td>LeftJoin</td> 1392 <td>RouteMerge</td> 1393 <td>Route (LHS)</td> 1394 </tr> 1395 <tr> 1396 <td>RouteMerge</td> 1397 <td>LeftJoin</td> 1398 <td>Route</td> 1399 <td>RouteMerge</td> 1400 </tr> 1401 <tr> 1402 <td>RouteMerge</td> 1403 <td>LeftJoin</td> 1404 <td>RouteMerge</td> 1405 <td>RouteMerge</td> 1406 </tr> 1407 <tr> 1408 <td>Join/LeftJoin</td> 1409 <td>All joins</td> 1410 <td>All primitives</td> 1411 <td>New join</td> 1412 </tr> 1413 <tr> 1414 <td>All primitives</td> 1415 <td>All joins</td> 1416 <td>Join/LeftJoin</td> 1417 <td>New join</td> 1418 </tr> 1419 </table> 1420 1421 1422 At the end of the analysis, the output is a symbol table and a reduced tree representing the grouped joins. After this phase, the groups are finalized. They will not be further split or merged. 1423 1424 Here’s an example: 1425 1426 `a join b on b.id=a.id join c on c.id=b.id2 join d on d.id=a.id` 1427 1428 will be represented as: 1429 1430 ``` 1431 J2 1432 / \ 1433 J1 d where d.id=a.id 1434 / \ 1435 a join b on b.id=a.id c where c.id=b.id2 1436 ``` 1437 1438 ### WHERE clause 1439 1440 The ON clause analysis is similar to the WHERE clause, which is why it wasn’t covered in detail in the previous section. The main difference between WHERE and ON is that the ON clause can cause groups to merge. If two groups cannot be merged, the ON clauses get converted to WHERE clauses and get pushed down using the same rules. The only exception is the forced RHS push in the case of left joins. 1441 1442 *The symbol table used by the ON clause is a subset of the one for the WHERE clause. So, it’s safe to move on ON clause condition into a WHERE clause. Attempting to move a WHERE clause condition to an ON clause may require more care.* 1443 1444 The first step here is to convert the parse tree into a list of conditions that can be put back together using AND clauses. Then we analyze each condition separately and decide if it can be pushed into a Route or RouteMerge. 1445 1446 We look at the columns that each condition references, and we identify the rightmost group that the expression references and we push the where clause there. Any references outside of that group will later be changed to bind vars. By the fact that the where clause was pushed to the right most group gives us the guarantee that values will be available for the rest of the column references. 1447 1448 Each time a WHERE clause gets successfully pushed into a Route or RouteMerge, we re-evaluate the routing plan to see if it can be improved. The rules for this are described in the vindex document. 1449 1450 There are situations where we cannot push down: 1451 1452 ###### LEFT JOIN 1453 1454 A WHERE clause cannot be pushed inside a LEFT JOIN, because it has to be applied after 1455 1456 the ON clause is processed, which is not until `LeftJoin` returns the results. 1457 1458 *There are some practical use cases where you’d want to add conditions in the WHERE clause, typically for NULL checks. To support this in the future, we can use this technique that converts a LEFT JOIN into a normal JOIN:* 1459 1460 `from a left join b on b.id=a.id where b.id is NULL` 1461 1462 *After split, the query on b should have become:* 1463 1464 `from b where b.id=a.id` 1465 1466 *We cannot push down the null check in the above case. But we can, if we rewrote it as:* 1467 1468 `from (select 1) as t left join b on b.id=a.id where b.id is NULL` 1469 1470 *The (select 1) acts as a fake driver row for the left join, which allows us to tack on the where clause and also change the `LeftJoin` to a `Join`.* 1471 1472 ###### Subqueries 1473 1474 To be able to push a subquery, we have to determine that the subquery does not escape out of the shard of the parent query. For this, we identify the target group using the same method as above. Then we check to see if the subquery can get a Route plan where the input is the same keyspace id as the group’s query. If this is true, then the where clause can be pushed down just like a normal condition. 1475 1476 If this condition is not met, the subquery becomes a separate group. This complexity is beyond what can be expressed using the four primitives we have. So, we’ll address such constructs when we add more primitives to the mix. 1477 1478 When recombining WHERE clauses, additional parenthesis will be needed to prevent lower precedence operators from changing the intended order. 1479 1480 ###### An example 1481 1482 `from a join b on b.id=a.id join c on c.id=b.id where cond1(a.col, b.col) and cond2(a.col, c.col)` 1483 1484 If a, b and c where in different groups, the output would be: 1485 1486 ``` 1487 J 1488 / \ 1489 J c where (c.id=b.id) and (cond2(a.col, c.col)) 1490 / \ 1491 a b where (b.id=a.id) and (cond1(a.col, b.col)) 1492 ``` 1493 1494 The cond2 expression gets pushed into the where clause for table ‘c’ because it’s the right-most group that’s referenced by the condition. External references will be changed to appropriate bind variables by the rewiring phase. 1495 1496 *Once VTGate acquires the ability to perform its own filters, should we stop pushing these conditions into the dependent queries and do it ourselves instead? The answer will usually be no. You almost always want to push down filters. This is because it will let the underlying database scan fewer rows, or choose better indexes. The more restrictive the query is, the better.* 1497 1498 At the end of the analysis, if no WHERE clauses remain, we can go to the next step. If not, the query is too complex, and we error out. 1499 1500 ### SELECT clause, non-aggregate case 1501 1502 The SELECT clause can represent two relational primitives: a Select or Aggregate (or both). In the case of an Aggregate, it works in conjunction with the GROUP BY clause. So, it will be simpler if the analysis branched off depending on whether the SELECT clause has aggregates or not. 1503 1504 Before starting to process select, we have to perform the pre-step of naming the results. During this process, we generate a name for each item in the SELECT clause. Additionally, if a column is a faithful reference to an underlying table column, we add this qualification. This information will be used for future analysis. 1505 1506 The important property of a non-aggregate SELECT clause is that the column expressions don’t depend on each other. Therefore, they can be split and pushed down past joins into different groups. 1507 1508 If an expression references columns from a single group, then it can be pushed down into that group. 1509 1510 If an expression references columns from more than one group, then the expression cannot be pushed down, we fail the query. 1511 1512 *It is possible to push-down combo expressions using the same technique we use for WHERE clauses. For example:* 1513 1514 `select a.col+b.col from a join b where b.foo=a.bar` 1515 1516 *can be resolved with the following join* 1517 1518 ``` 1519 select a.col, a.bar from a // produce _a_col & _a_bar 1520 Join 1521 select :_a_col+b.col from b where b.foo=:_a_bar 1522 ``` 1523 1524 *However, such queries are not very efficient for VTTablet. It can’t cache field info. Such push-downs don’t benefit MySQL either. So, it’s better to wait till VTGate implements the ability to evaluate expressions.* 1525 1526 SELECT can also contain subqueries. The same restriction as WHERE clauses can be applied here: If the keyspace id of the subquery is not the same as the outer query’s, we fail the query. Otherwise, we push it down. 1527 1528 ###### ‘*’ expressions 1529 1530 Since VTGate does not know all the columns of a table, a ‘SELECT *’ may not be resolvable. We can consider disallowing it altogether. However, here are situations where we could allow it: 1531 1532 * If the underlying primitive is a Route or RouteMerge, we could just push it down without knowing what it means. 1533 * If the underlying ‘table’ is a subquery, we know the full column list. In such cases, we know how to expand the ‘*’ into an actual list. 1534 1535 ###### Additional expressions for joins 1536 1537 Note: This phase could be postponed till the time we do the wire-up. 1538 1539 After the SELECT expressions are pushed down, we have to see if there are dependencies across joins that are still not addressed. If there are WHERE clause dependencies from one group to another that are not listed in the SELECT expression, we need to add those to the list of expressions. This is where we use the info for each select column to see if it already references a column we need for the RHS query. Here’s an example of the metadata in a Join primitive: 1540 1541 `select a.col1, b.col1, a.col3 from a join b on b.col2=a.col2` 1542 1543 will be represented as 1544 1545 ``` 1546 Join: 1547 LHS: select a.col1, a.col3, a.col2 from a 1548 Bindvars: "_a_col2": 2 // Col 2 produces bind var _a_col2 1549 RHS: select b.col1 from b where b.col2=:_a_col2 1550 Select: [ LHS(0), RHS(0), LHS(1) ] 1551 ``` 1552 1553 Consequently, this means that the lower level routes fetch columns that should not be returned in the result. This is the reason why a Join primitive has to also specify the columns that have to be returned in the final result. 1554 1555 *It’s safe to assume that no further changes will be made to the SELECT list, and consider it finalized. Subsequent clauses will only be able to operate on the outcome of the result. This fact is not material right now. But it may become significant when VTGate starts to perform other operations. Specifically, VTGate may split the SELECT clause into a Select and Aggregate and push the Select part down. If the select list changes due to other external dependencies, then the Aggregate cannot be pushed down.* 1556 1557 *MySQL allows you to address columns that are not in the SELECT list in GROUP BY and HAVING clauses, which can further affect the Select list. This is not standard SQL.* 1558 1559 ###### Handling empty lists 1560 1561 The converse of the above rule is the case where the result references only columns from one side of a join. In such cases, the other side of the join doesn’t have a SELECT expression yet. For such situations, we create a dummy ‘select 1’ to make the query work. Example: 1562 1563 `select a.col from a join b on b.col=a.col` 1564 1565 will be represented as: 1566 1567 ``` 1568 Join: 1569 LHS: select a.col from a 1570 Bindvars: "_a_col": 0 1571 RHS: select 1 from b where b.col=:_a_col 1572 Select: [LHS(0)] 1573 ``` 1574 1575 ###### Pushing past Filter 1576 1577 If/when VTGate implements filters, we may have to ask if a SELECT clause can be pushed past such filters into a Route primitive. The answer is yes, because Select and Filter are orthogonal operations. However, we may prefer performing the Select at the VTGate level instead of pushing it down. 1578 1579 ### SELECT with aggregates 1580 1581 If a SELECT clause uses aggregate functions, then it can be pushed down if the underlying node is a Route. As usual, subquery escape analysis will still need to be applied. 1582 1583 If the underlying node is a RouteMerge, then it can still be pushed down if one of the non-aggregate result columns has a unique vindex. Such a constraint guarantees that all the rows of each group are within one shard. 1584 1585 You generally can’t push an Aggregate past a join. Theoretically, there are situations where you can do this, but such opportunities are rare. So, it’s not worth considering these optimizations. 1586 1587 ### GROUP BY 1588 1589 The act of pushing an Aggregate down also necessitates that the associated GROUP BY clause be pushed. It’s just the other side of the same coin. 1590 1591 The GROUP BY clause is slightly unusual because its contents can actually be automatically inferred in most cases. More often than not, the SQL engine has to validate that the GROUP BY columns match the inferred values. 1592 1593 It is possible that a GROUP BY to be present while a SELECT doesn’t have aggregates. It’s a common technique that people use to dedupe rows in the result. In such cases, the same rules as an Aggregate apply to the GROUP BY clause. 1594 1595 *As mentioned previously, MySQL analyzes the GROUP BY clause using the query’s scope. It also does not enforce that the GROUP BY columns are consistent with the SELECT. For example, this is valid in MySQL:* 1596 1597 `select a.col1, a.col2 from a group by a.col1` 1598 1599 *This also valid:* 1600 1601 `select a.col1 from a group by a.col1, a.col2` 1602 1603 *Both of the above constructs are invalid by SQL standards. We could piggy-back on this lax attitude and just pass-through the GROUP BY as is. After all, these are not very meaningful constructs. If MySQL allows them, why should we work against them?* 1604 1605 ### HAVING 1606 1607 We reach this phase only if SELECT and GROUP BY were successfully pushed. 1608 1609 The difference between HAVING and WHERE is that the HAVING clause applies filters after the result is computed. So, it can only reference columns from the result. 1610 1611 For each condition in the HAVING clause, we find out which result columns it references. If all those columns originate from the same group, then the clause is pushable into that group. 1612 1613 For conditions that span multiple groups, we can apply the bind variable dependency technique just like we did for the WHERE clause. If we decide to do this, we’ll need a separate bind variable naming scheme, because these refer to result columns and not table columns. 1614 1615 As usual, subquery processing will need to be done as needed. 1616 1617 Note: HAVING clauses are usually very simple. So, there may be diminishing return in allowing complexities here. However, if we’re just reusing code, we can allow these. Otherwise, we shouldn’t go overboard. 1618 1619 ### ORDER BY 1620 1621 Even though this clause near the end of the execution chain, it’s a popular one. It’s also an expensive one. So, we have to make every effort to push the work down into the underlying routes. 1622 1623 We’ll start with the usual restriction that this clause will only reference the result columns. 1624 1625 If the underlying node is a Route, then the clause can just be pushed down. 1626 1627 If the underlying node is a RouteMerge, then it cannot be pushed down. 1628 1629 If the underlying node is a join, then we perform the following analysis: Take each expression and identify the group it originates from. If it references more than one group, then we cannot push that expression down. If the group order is no less than the current group order, the expression can be pushed. We continue this until we reach an expression that is not pushable, or we successfully push everything. Here are two examples: 1630 1631 `select a.col, b.col, c.col from a join b join c order by a.col, b.col, c.col` 1632 1633 This will produce: 1634 1635 ``` 1636 select a.col from a order by a.col // group 1 1637 select b.col from b order by b.col // group 2 1638 select c.col from c order by c.col // group 3 1639 ``` 1640 1641 Example 2: 1642 1643 `select a.col, b.col, c.col from a join b join c order by a.col, c.col, b.col` 1644 1645 For the above case, a.col is from group 1, c.col is from group 3, but b.col is from group 2, which is less than group 3. So, the last expression is not pushable. 1646 1647 ### LIMIT 1648 1649 The LIMIT clause has two forms: 1650 1651 LIMIT m, n: Fetch m+n rows, but return rows m through n. 1652 1653 LIMIT n: same as LIMIT 0, n 1654 1655 If the underlying node is a Route, then the LIMIT can be pushed down. 1656 1657 If the underlying node is a RouteMerge, then we fail the query. This is because ORDER BY cannot be pushed into a RouteMerge, and a LIMIT without an ORDER BY is not very meaningful. Also, the order of rows is unpredictable for RouteMerge. So, it’s even less valuable to apply a limit. 1658 1659 In the case of a join, the query could be passed down without a limit and we can do the filtering on the entire result. Maybe this is not worth supporting. 1660 1661 ## Wire-up 1662 1663 If we’ve reached this phase, it means that we have successfully broken up the original query and pushed the different parts into route primitives that are now wired together by joins. 1664 1665 However, these parts would still be referring to each other. This section will describe how those dependencies will get resolved. 1666 1667 In order to resolve dependencies correctly, we need the symbol tables that were used for analyzing each context. With the symbol table at hand, we walk the AST for each group, and look up the referenced symbols in the symbol table. If the symbol belongs to the same group, then nothing needs to be done. If the reference is outside, we perform the following steps: 1668 1669 1. Change the external reference to the corresponding bind var name. 1670 2. If that column is not already in the select list of the other group, we add it. 1671 3. Mark the column number to be exported as the bind var name that we chose. 1672 1673 This step has a caveat: if the SELECT list contained an Aggregate, then we cannot add such a column. However, we’re currently protected by the constraint that Aggregate cannot be pushed into a join, and external dependencies can only happen for joins. But this is something to keep in mind as we add more primitives and flexibilities. 1674 1675 ### Accessing the symbol table 1676 1677 Currently, the AST contains only information parsed from the query. When we analyze the query, we build various symbol tables. However, they would go away as we finish each analysis. 1678 1679 Since we need them again for resolving dependencies, we have to store them somewhere. We also have to retrieve the right one for each subquery. This gives rise to a dependency issue: Subqueries can be anywhere in the parse tree. So, it’s unnatural to have an outside data structure that points to various parts of the AST. 1680 1681 So, we make a compromise: we extend the Select structure and add a Symbols (any) member to it. Every time we finish analyzing a subquery, we set the Symbols member to point to the symbol table. With this compromise, the rest of the code base will flow naturally while resolving dependencies. 1682 1683 ## Plan generation 1684 1685 Now, all we have to do is generate the execution plan. During this step, we just generate the code for the AST of each route node, and perform additional substitutions needed to perform the correct routing, as dictated by the vindexes chosen for each route. 1686 1687 A Route node will become one of: SelectUnsharded, SelectEqual, SelectKeyrange 1688 1689 A RouteMerge will become one of: SelectScatter, SelectIN 1690 1691 # Future primitives 1692 1693 Looking at how far we can go with just four primitives is pretty amazing. But we can’t stop there. There are a few more low-hanging fruits we can get to, and then we can also start looking at the high-hanging ones. 1694 1695 ## MergeSort and MergeAggregate 1696 1697 If the ORDER BY clause can be pushed down for a Route, then we’re not far away from supporting it for a RouteMerge also. The results from each shard are already coming sorted. Merge-sorting such results becomes trivial. 1698 1699 Along the same lines, if we ordered the rows of an Aggregate by the columns it’s grouping, we can push down such a query through a RouteMerge, and perform the final aggregation as the results come. 1700 1701 The one challenge is collation. We’ll need to make sure that our collation algorithm can match what MySQL does. 1702 1703 ## Subquery 1704 1705 Uncorrelated subqueries can be trivially ‘pulled out’ to be executed separately, and their results can be pushed down into the query that needs the result. 1706 1707 Here’s an example: 1708 1709 `select * from a where col in (select col from lookaside)` 1710 1711 This can be expressed as: 1712 1713 ``` 1714 Subquery: 1715 _lookaside: select col from lookaside 1716 query: select * from a where col in ::_lookaside 1717 ``` 1718 1719 ## Select & Filter 1720 1721 If we built an expression engine, then we can do our own evaluation and filtering when something is not pushable. Adding support for this will automatically include correlated subqueries, because the result of the underlying query can be used to execute the subquery during the Select or Filter stage. Example: 1722 1723 `select a.id from a where a.id = (select id from b where b.col=a.col)` 1724 1725 Plan: 1726 1727 ``` 1728 Filter: 1729 query: select a.id, a.col from a 1730 bindvars: "_a_col": 1 1731 constraint: a.id = ConvertVal: Route: 1732 select id from b where b.col=:_a_col 1733 ``` 1734 1735 ## Other expensive primitives 1736 1737 The rest of the primitives have potentially unbounded memory consequences. For these, we’ll have to explore map-reduce based approaches.