vitess.io/vitess@v0.16.2/doc/design-docs/V3HighLevelDesign.md (about)

     1  # V3 high level design
     2  
     3  # Objectives
     4  
     5  The goal of this document is to describe the guiding principles that will be used to implement the full SQL feature set of Vitess’s V3 API. We start off with the high level concepts on SQL and its features, and iterate our way into some concrete design proposals.
     6  
     7  ### Prerequisites
     8  
     9  Before reading this doc you must be familiar with [vindexes](https://github.com/vitessio/vitess/blob/main/doc/V3VindexDesign.md), which are used as the foundation for the arguments presented here.
    10  
    11  # Background
    12  
    13  VTGate was initially conceived as a router that encapsulated the logic of deciding where to send the application’s queries. The primary objective was to isolate the application from maintenance events like reparenting, server restarts, and resharding. As of the V2 API, this problem is pretty much solved.
    14  
    15  One drawback of the V2 API is that it requires the application to be aware of the sharding key. This key had to be sent with every query. So, unless an application was written from the ground up with this awareness, the task of migrating it to use vitess was rather daunting.
    16  
    17  The question was asked: could VTGate be changed to not require the sharding key in its API? Could VTGate make the sharded database look like a single database?
    18  
    19  This was how V3 was born. A working proof of concept was developed. This showed that we can automatically route queries that targeted single shards. We could also route queries that targeted multiple shards as long as the results could be combined without any additional processing.
    20  
    21  The next question was: Can we support more SQL constructs? How far can we go? Can we go all the way and support everything?
    22  
    23  These are the questions we’ll try to answer in the subsequent sections.
    24  
    25  # TL;DR
    26  
    27  This section gives a preview of what this document tries to achieve.
    28  
    29  ## Current V3
    30  
    31  The current V3 functionality is capable of correctly serving simple SQL statements that reference a single table. For example, if `id` is the sharding column for table `a`:
    32  
    33  `select a.col from a where a.id=:id` will get routed to the correct shard.
    34  
    35  `select a.col from a where a.id in ::list` will get broken out into shard-specific IN statements and routed to those specific shards.
    36  
    37  `select a.col from a where a.col=:col` will get sent to all shards, and the results will be combined and returned.
    38  
    39  The current V3 also supports simple DMLs. More complex constructs like joins, aggregations, and subqueries are not supported.
    40  
    41  ## Next phase: support more queries
    42  
    43  If the first phase of this document is implemented, VTGate will be able to support the following constructs:
    44  
    45  ### Joins
    46  
    47  `select a.col, b.col from a join b on b.id=a.id where a.id=:id` will get routed to the correct shard. Correspondingly, IN clauses will also get correctly broken out and routed.
    48  
    49  `select a.col, b.col from a join b on b.id=a.id` will get sent to all shards.
    50  
    51  If the data is on multiple shards, VTGate will perform the join:
    52  
    53  `select a.col, b.col from a join b on b.id2=a.id where a.id=:id`
    54  
    55  will get rewritten as two queries, one dependent on the other:
    56  
    57  ```
    58  select a.col, a.id from a where a.id=:id -> yield "_a_id" from col 1
    59  select b.col from b where b.id2=:_a_id
    60  ```
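This dependent, two-step execution amounts to a nested-loop join driven by VTGate. A minimal sketch, using two in-memory SQLite databases to stand in for the shards (table contents are hypothetical):

```python
import sqlite3

# Two in-memory databases stand in for two shards.
shard_a = sqlite3.connect(":memory:")
shard_a.execute("create table a (id integer, col text)")
shard_a.execute("insert into a values (1, 'a1'), (2, 'a2')")

shard_b = sqlite3.connect(":memory:")
shard_b.execute("create table b (id2 integer, col text)")
shard_b.execute("insert into b values (1, 'b1'), (1, 'b1x'), (2, 'b2')")

rows = []
# Step 1: route the first query to the shard owning a.id=:id.
for a_col, a_id in shard_a.execute(
        "select a.col, a.id from a where a.id = :id", {"id": 1}):
    # Step 2: bind a.id as :_a_id and send the dependent query.
    for (b_col,) in shard_b.execute(
            "select b.col from b where b.id2 = :_a_id", {"_a_id": a_id}):
        rows.append((a_col, b_col))

print(rows)  # [('a1', 'b1'), ('a1', 'b1x')]
```

Each `a.id` value coming out of the first query is bound to `:_a_id` in the second, exactly as in the rewrite above.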
    61  
    62  Joins can group tables, as long as the order is correctly specified:
    63  
    64  `select a.col, b.col, c.col from a join b on b.id=a.id join c on c.id2=a.id where a.id=:id`
    65  
    66  will get rewritten as:
    67  
    68  ```
    69  select a.col, b.col, a.id from a join b on b.id=a.id where a.id=:id
    70  select c.col from c where c.id2=:_a_id
    71  ```
    72  
    73  Joins can also group other where clauses and wire-up cross-table dependencies:
    74  
    75  `select a.col, b.col from a join b on b.id2=a.id where a.id=:id and b.col > a.col`
    76  
    77  will get rewritten as:
    78  
    79  ```
    80  select a.col, a.id from a where a.id=:id
    81  select b.col from b where b.id2=:_a_id and b.col > :_a_col
    82  ```
    83  
    84  Joins can also be cascaded:
    85  
    86  `select a.col, b.col, c.col from a join b on b.id2=a.id join c on c.id3=b.id2 where a.id=:id`
    87  
    88  will get rewritten as:
    89  
    90  ```
    91  select a.col, a.id from a where a.id=:id
    92  select b.col, b.id2 from b where b.id2=:_a_id
    93  select c.col from c where c.id3=:_b_id2
    94  ```
    95  
    96  LEFT JOIN will also be supported for all the above constructs. However, WHERE clauses that add additional constraints to the nullable parts of a left join will not be supported; they cannot be pushed down.
    97  
    98  ### Subqueries
    99  
   100  Correlated subqueries that can be executed as part of the main query will be supported:
   101  
   102  `select a.id from a where exists (select b.id from b where b.id=a.id) and a.id=:id` will get routed to the correct shard.
   103  
   104  Correlated subqueries can also be sent as scatter queries, as long as the join constraints are met.
   105  
   106  Subqueries in the FROM clause can be joined with other tables:
   107  
   108  `select a.id, a.c, b.col from (select id, count(*) c from a) as a join b on b.id=a.id`
   109  
   110  will get rewritten as:
   111  
   112  ```
   113  select a.id, a.c from (select id, count(*) c from a) as a
   114  select b.col from b where b.id=:_a_id
   115  ```
   116  
   117  Subqueries in the WHERE clause and SELECT list will be supported only if they can be grouped. They will not be broken out.
   118  
   119  ### Aggregation
   120  
    121  Aggregation will be supported for single-shard queries, which may themselves contain joins and subqueries.
   122  
   123  You can also use aggregation for scatter queries if a unique vindex is in the list of result columns, like this one:
   124  
   125  `select id, count(*) from a`
   126  
   127  ### Sorting
   128  
   129  Sorting will be supported as long as it can be meaningfully pushed down into the grouped parts. For example,
   130  
    131  `select a.col, b.col from a join b where b.id=a.id order by a.id, b.id`
   132  
   133  will be rewritten as:
   134  
   135  ```
    136  select a.col, a.id from a order by a.id
   137  select b.col from b where b.id=:_a_id order by b.id
   138  ```
   139  
   140  ### All combinations
   141  
    142  The above features are designed to be orthogonal to each other. This means that VTGate can support simultaneous combinations of the above features, with arbitrary levels of nesting and cascading.
   143  
   144  ## Going beyond
   145  
   146  Once the above phase is complete, we can iterate on performance features as well as add support for more constructs. It will depend on customer demand.
   147  
   148  # High level concepts
   149  
    150  The V3 proof of concept drew parallels between database indexes and vitess’s indexing scheme. A database deals with tables, but VTGate deals with databases. It turns out that a table and a database have a surprising number of similarities. The basic guarantee of an index is that it lets you limit the number of rows you have to inspect in order to compute the result of a query. In the same way, vindexes guarantee that it is sufficient to send your queries to a subset of all the shards.
   151  
    152  As a quick recap, a vindex is a cross-shard index maintained by vitess. Given a column constraint, a vindex can tell you which shards contain the rows needed to satisfy a query. In this document, we’ll use the term ‘unique vindex’ quite often. A unique vindex is one that can produce only one shard as the target for a column equality constraint. There may still be many rows that match that constraint, but they’ll always be within that one shard.
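To make the ‘unique vindex’ idea concrete, here is a toy sketch of one as a deterministic hash from a column value to a single shard number. This is an illustration only: the shard count and hash choice are assumptions, and real Vitess vindexes map values to keyspace IDs rather than shard numbers.

```python
import hashlib

NUM_SHARDS = 4  # assumption for this sketch

def unique_vindex(value):
    """Map a column value to exactly one shard (toy hash vindex)."""
    digest = hashlib.md5(str(value).encode()).digest()
    return digest[0] % NUM_SHARDS

# An equality constraint like `where id = 42` resolves to one shard...
shard = unique_vindex(42)
assert 0 <= shard < NUM_SHARDS

# ...and the mapping is deterministic, so every row with id = 42,
# however many there are, lives on that same shard.
assert unique_vindex(42) == shard
```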
   153  
   154  The following table further illustrates the similarities of the concepts:
   155  
   156  <table>
   157    <tr>
   158      <td>RDBMS</td>
   159      <td>Vitess</td>
   160    </tr>
   161    <tr>
   162      <td>Table</td>
   163      <td>Database</td>
   164    </tr>
   165    <tr>
   166      <td>Primary Key</td>
   167      <td>Sharding Key</td>
   168    </tr>
   169    <tr>
   170      <td>Index</td>
   171      <td>Vindex</td>
   172    </tr>
   173    <tr>
   174      <td>Foreign Key</td>
   175      <td>Shared Vindex</td>
   176    </tr>
   177  </table>
   178  
   179  
   180  This coincidence is not accidental. For example, if you used a vindex to route a query to a shard, it’s likely that the database also has an index on that same column, with properties similar to the vindex. If you have a shared vindex between two tables, it’s very likely that there exists a similar foreign key relationship between the tables.
   181  
   182  The foundation for these similarities comes from the fact that a table that was once in a single database is now partitioned out. So, the indexes for those tables have to also grow out of the original database.
   183  
    184  There is another strong parallel between a table and a database: A SELECT statement and a table are interchangeable. On one hand, all the data in table t can be represented as `select * from t`. On the other hand, you can use a SELECT statement as a stand-in for a table. So, many of the approaches used to optimize SQL for scanning tables can be re-used to orchestrate queries that span multiple shards or databases. The main difference is that you can treat bigger SQL chunks as tables, and can outsource that work to the database that hosts them.
   185  
   186  But then, there’s only so much you can outsource. As soon as a query exceeds a certain level of complexity, where the required data needs to be pulled from more than one database, then any subsequent work needs to be done by VTGate. If we want to support the full SQL syntax, VTGate will have to eventually become a full-fledged SQL engine.
   187  
   188  VTGate’s SQL engine will likely not be as good as what the underlying database can do, at least initially. So, until such time, the goal will be to push as much of the work down to the databases, and do as little as possible with what comes out.
   189  
   190  ## Limitations
   191  
   192  The above arguments can more or less be used as proof that we can theoretically support the full SQL syntax. In the worst case scenario, VTGate will have to individually scan each table involved in the query and perform the rest of the work by itself.
   193  
   194  But there are some things it cannot do:
   195  
   196  ### Cross-shard consistency
   197  
   198  There is currently no way to get a system-wide consistent snapshot of all the keyspaces and shards. If reads are being performed during in-flight cross-shard transactions, then VTGate may see such data as committed in one database, and not committed in the other.
   199  
   200  We could rely on MySQL’s MVCC feature that will give you a view of each database as of when a transaction was started, but this is still not perfect because there is no guarantee about the exact timing of each BEGIN, and there are also replication lag related inconsistencies. We currently don’t support read transactions for replicas, but it can be added.
   201  
   202  The final option is to lock those rows before reading them. It’s a very high price to pay, but it will work.
   203  
   204  ### Environmental differences
   205  
   206  VTGate will be running on a different machine than the databases. So, functions that rely on the current environment will not produce consistent results. For example, time functions will likely have variations depending on clock skew.
   207  
   208  # SQL for dummies
   209  
   210  An upfront clarification: we’ll be only focusing on SELECT statements in this document, because that’s where all the complexities are. Also, Vitess currently allows only single-shard DMLs. So, most of this analysis will not be applicable for them.
   211  
   212  ## Dissecting SELECT
   213  
   214  A single table SELECT is a relatively simple thing. Let’s dissect the following statement:
   215  
   216  ```
   217  select count(*), a+1, b
   218  from t
   219  where c=10
   220  group by a+1, b
   221  having count(*) > 10
   222  order by count(*) desc
   223  limit 10
   224  ```
   225  
   226  The steps performed by the SQL engine are as follows:
   227  
   228  1. Scan t (FROM)
   229  2. Filter rows where c=10 (WHERE)
   230  3. Remove all columns other than a & b, and compute a+1 (SELECT)
   231  4. Count and group unique a+1, b values (GROUP BY)
   232  5. Filter counts that are > 10 (HAVING)
    233  6. Order the rows by descending count (ORDER BY)
   234  7. Return the top 10 results (LIMIT)
   235  
   236  The one thing that’s not obvious in a SELECT is that the select list is not the first clause that’s executed. It’s the FROM.
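The seven steps above can be sketched directly over plain Python rows. The data and constants are hypothetical (and smaller than in the example, so the output stays short), but the step order is the same:

```python
from collections import Counter

# Hypothetical rows of t (columns a, b, c). The constants are shrunk
# (count(*) > 1, limit 2) to keep the data short.
t = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 1, "b": "x", "c": 10},
    {"a": 2, "b": "y", "c": 10},
    {"a": 3, "b": "z", "c": 99},  # dropped by WHERE
]

rows = t                                             # 1. FROM: scan t
rows = [r for r in rows if r["c"] == 10]             # 2. WHERE: c=10
keys = [(r["a"] + 1, r["b"]) for r in rows]          # 3. SELECT: compute a+1, keep b
groups = Counter(keys)                               # 4. GROUP BY a+1, b with count(*)
groups = {k: n for k, n in groups.items() if n > 1}  # 5. HAVING count(*) > 1
result = sorted(((n,) + k for k, n in groups.items()),
                key=lambda row: -row[0])             # 6. ORDER BY count(*) desc
result = result[:2]                                  # 7. LIMIT 2

print(result)  # [(2, 2, 'x')]
```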
   237  
   238  ## SELECT as a subquery
   239  
   240  If a SELECT statement was used just to fetch results, things would be very simple. However, it can be used in other parts of a query, and can produce 5 kinds of values:
   241  
   242  1. A simple value: `select * from t where id = (select 1 from dual)`
   243  2. A value tuple: multiple values of the same type: `select * from t1 where id in (select val from t2)`
   244  3. A row tuple: multiple values of different types. A row tuple can also be used as a value: `select * from t1 where (a,b) = (select b,c from t2 where pk=1)`
   245  4. Rows: results of a regular query. Note that rows can be treated as a value tuple of row tuples. For example, this statement is valid: `select * from t1 where (a,b) in (select b,c from t2)`
   246  5. A virtual table: If you don’t strip out the field info from the original query, then a select can act as a table, where the field names act as column names. For example, this statement is valid: `select * from (select a, b from t2) as t where t.a=1`
   247  
   248  This opens the door to some powerful expressibility, the other edge of the sword being indefinite complexity.
   249  
   250  Subqueries can be found in the following clauses:
   251  
   252  1. FROM
   253  2. SELECT list
   254  3. WHERE
   255  
   256  In a way, subqueries are brain-dead, including the correlated ones. You just execute your main query as specified by the sequence in the ‘Dissecting SELECT’ section. Any time you encounter a subquery, you just evaluate it as an expression on-the-fly and use the resulting value. The only complication of a correlated subquery is that it references a value from the outer Result to produce its own Result.
   257  
   258  An immediate optimization: If a subquery is not correlated, we know that its value is the same for every row of the outer query. So, such a subquery can be "pulled out" and evaluated only once, or we could use lazy evaluation and caching. But we’re venturing into optimizer territory, which we should refrain from for now.
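A sketch of that pull-out: the subquery (standing in for something like `select max(x) from t2`, a hypothetical example) is evaluated once and its cached value reused for every outer row:

```python
calls = 0

def subquery():
    """Stands in for an uncorrelated subquery, e.g. (select max(x) from t2)."""
    global calls
    calls += 1
    return 10

# Pulled out: evaluated once, before scanning the outer rows.
cached = subquery()

outer = [(1, 5), (2, 10), (3, 15)]
result = [r for r in outer if r[1] >= cached]

assert result == [(2, 10), (3, 15)]
assert calls == 1  # not re-evaluated per outer row
```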
   259  
   260  Based on the above properties, we can make some observations:
   261  
   262  * A subquery in a FROM clause cannot reference columns of the SELECT list. This is because of the execution order: the SELECT list depends on what comes out of the FROM clause, not the other way around. However, such a subquery can still reference columns from an outer SELECT. But MySQL is more restrictive, and prevents all correlation for subqueries in the FROM clause.
   263  * A subquery in a SELECT list cannot affect the original number of rows returned. All it can do is supply the value for the column it was meant to compute the result for. If there was no result, the value would be NULL. A subquery in a SELECT list must produce a single scalar value.
   264  * A subquery in a WHERE clause cannot change any values of the original result, or produce new rows. It can only cause rows to be eliminated.
   265  
   266  It’s easy to get lost in the world of subqueries, and it may get confusing when you start to think about how they interact with joins. But you can retain sanity by following these principles while analyzing a complex query:
   267  
   268  * Tables involved in a join are within the same scope, and can reference each other's columns.
   269  * Subqueries are in an inner scope. An inner query can use values from the outer query. Other than this, all it can do is produce a value.
   270  * Other parts of the outer query cannot reference anything about the subquery. It’s a black box in that respect.
   271  
   272  Correlated subqueries exhibit the properties of a join. So, there are situations where they could be converted to actual joins by the optimizer.
   273  
   274  It’s now time to drill down into the details of the operators. The notations below are not very formal. They’re there mainly to illustrate the concepts.
   275  
   276  ## What kind of machine is SQL?
   277  
   278  Primer on automata theory: [https://en.wikipedia.org/wiki/Automata_theory](https://en.wikipedia.org/wiki/Automata_theory).
   279  
    280  Is SQL a Turing machine? The answer is no, because it can only make forward progress, and always terminates.
   281  
   282  Strangely, there are no references on the internet about what kind of machine SQL is. Looking at whether it can be a finite state machine, the answer seems to be no, because of subqueries.
   283  
   284  So, it appears to be a Pushdown Automaton (PDA). This determination is important because it will help us decide the data structure needed to represent an execution plan.
   285  
   286  Specifically, a tree structure is sufficient to represent a PDA.
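As an illustration of why a tree suffices, an execution plan can be modeled as nodes that each hold their children and produce a Result; evaluating the plan is a depth-first walk of the tree. A pared-down, hypothetical node set:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Row = Tuple
Result = List[Row]

@dataclass
class Scan:
    rows: Result                      # leaf: produces the first Result
    def execute(self) -> Result:
        return list(self.rows)

@dataclass
class Filter:
    child: object                     # subtree producing the input Result
    predicate: Callable[[Row], bool]
    def execute(self) -> Result:
        return [r for r in self.child.execute() if self.predicate(r)]

@dataclass
class Limit:
    child: object
    count: int
    def execute(self) -> Result:
        return self.child.execute()[: self.count]

# Plan tree for: select * from t where c = 10 limit 1
plan = Limit(Filter(Scan([(1, 10), (2, 10), (3, 99)]),
                    lambda r: r[1] == 10),
             1)
print(plan.execute())  # [(1, 10)]
```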
   287  
   288  ## The operands
   289  
   290  This is the list of ‘values’ that a query engine deals with in order to satisfy a query:
   291  
   292  1. Result: This is the core data type that every operator deals with. It’s a complex data structure. Sometimes, the primitives may access only a part of the Result. So, the interchangeability rules of primitives will be different based on which parts of the Result each of them accesses.
   293  2. Value expression list: This is used in various places in a statement. Most notably, for converting a table scan into the list of values to be returned.
   294  3. Conditional expression: This is used to filter out results.
   295  4. Aggregate expression list: This has some similarities with a value expression list, but it’s worth differentiating.
   296  5. Scalar values: Used all over.
   297  
   298  The above operands are a simplified way of looking at things. For example, a conditional expression is a separate world that has its own set of operators and values, and they follow a different set of algebraic rules. But we need to divide and conquer.
   299  
   300  You’ll see how these operands are used in the description of the operators.
   301  
   302  ## The operators
   303  
    304  If one were to ask what minimal set of primitives is needed to serve SQL queries, we can use [relational algebra](https://www.tutorialspoint.com/dbms/relational_algebra.htm) as a starting point. But databases have extended beyond those basic operations. They allow duplicate tuples and ordering. So, here’s a more practical set:
   305  
   306  1. Scan (FROM).
   307  2. Join: Join two results as cross-product (JOIN).
   308  3. LeftJoin: The algebraic rules for an outer join are so different from a Join that this needs to be a separate operator (LEFT JOIN).
   309  4. Filter: Removing rows based on a conditional expression (WHERE, HAVING).
   310  5. Select: Evaluating new values based on a value expression list (SELECT list).
   311  6. Aggregate: Grouping rows based on an aggregate expression list (GROUP BY, UNION).
   312  7. Sort (ORDER BY).
   313  8. Limit (LIMIT).
   314  9. Merge: Merge two results into one (UNION ALL).
   315  
   316  Windowing functions are out of scope for this doc.
   317  
    318  Even the above primitives are fairly theoretical; one would rarely perform a Join without also simultaneously applying a join condition.
   319  
    320  For a given SELECT statement, the order of these operators is fixed. They are as follows:
   321  
   322  `Scan <-> (Join & LeftJoin) -> Filter(where) -> Select -> Aggregate -> Filter(having) -> Sort -> Limit -> Merge`
   323  
   324  However, following the above order is almost never optimal. This is where we have to study how these operators interact, and try to find ways to resequence them for higher efficiency.
   325  
    326  It’s important to note that most of these operators take a Result as input and produce one as output. This means that they can be indefinitely combined. This is achieved by re-SELECTing from a Result. Such nested selects are rare in practice, but there are legitimate use cases, typically for computing aggregates of aggregates.
   327  
   328  Additionally, Filter and Select can perform their own scans (subqueries) as part of their expressions. This adds another dimension to the overall complexity.
   329  
   330  ### Scan
   331  
   332  This is the most basic operation: given a table name, the operation produces a result, which is a list of all the rows in the table, with field names:
   333  
   334  `Scan(tableName) -> Result`
   335  
   336  Rethinking this for VTGate, the corresponding operation would be to route a query that targets a single database, and the output would be the result.
   337  
   338  `Route(query, keyspace, shard) -> Result`
   339  
   340  It’s implied that the query is the sql+bindvars.
   341  
   342  *If every operator takes a Result as input, who produces the first Result?*
   343  
    344  Scan answers this question. Without it, there is nothing that can map a query to a Result. Consequently, it is also the lowest level building block that all other operations depend on.
   345  
   346  It turns out that this is the only part of VTGate that differs from a traditional SQL engine. But it’s a powerful difference. For well tuned apps, the entire query will most of the time be just passed through to one of the underlying databases.
   347  
   348  The optimizer is likely to be very different, because its main task will be to figure out how to break up a query into parts that can be outsourced to different databases. For example, if there’s a three-way join and if two of those pieces can be sent to a single shard, then it has to split that query correctly into two. Because of this, we need the ability to translate a relational expression back into a query, or somehow find a way to reuse the parts of the original query that pertain to the relational expression.
   349  
   350  ### Filter
   351  
   352  The filter operator removes rows from the Result based on a condition:
   353  
   354  `Filter(Result, condition) -> Result`
   355  
   356  A condition is a boolean expression. The operands can be:
   357  
   358  * A column of the Result
   359  * A column of an outer Result (to support correlated subqueries)
   360  * Boolean values or expressions
   361  * Scalar values or expressions
   362  * The result of a subquery
   363  
   364  The operators are:
   365  
   366  * Pure boolean operators like AND, OR, etc.
   367  * Comparison operators like =, <, etc.
   368  * Other complex operators and functions like IF, etc.
   369  
   370  Inferred properties of Filter:
   371  
   372  * The filter operation is applied to each row individually. There is no interaction between rows.
   373  * A row is never modified.
   374  * No new rows are produced.
   375  * It does not change the order of the rows in the Result. This is not a requirement, but it’s a property worth maintaining.
   376  * The AND clauses of a filter can be broken into parts and applied separately. This is useful for pushing down WHERE clauses into scans.
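The last property can be checked directly: applying the AND parts of a condition as separate Filters yields the same rows as applying the combined condition once (hypothetical rows):

```python
def filter_rows(rows, predicate):
    """The Filter operator: keep rows for which the condition holds."""
    return [r for r in rows if predicate(r)]

rows = [(1, 5), (2, 10), (3, 10), (4, 20)]

# One combined condition...
combined = filter_rows(rows, lambda r: r[1] == 10 and r[0] > 2)

# ...equals the AND parts applied separately, in sequence. This is what
# makes it safe to push one part down into a scan and apply the rest later.
split = filter_rows(filter_rows(rows, lambda r: r[1] == 10),
                    lambda r: r[0] > 2)

assert combined == split == [(3, 10)]
```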
   377  
   378  It would have been nice if the Result was the only variable input to Filter. However, this is not true if one of the operands is a subquery.
   379  
   380  *There are two schools of thought when it comes to boolean algebra: Those that allow interconversion between scalar and boolean values, and those that don’t. The former considers any non-zero scalar value to be TRUE, and zero as FALSE. The latter requires the scalar value to be tested using a comparison operator before it becomes a boolean value.*
   381  
   382  *The first school of thought allows statements like this:*
   383  
   384  `select * from t where a`
   385  
   386  *The second school of thought requires you to rewrite it as:*
   387  
   388  `select * from t where a != 0`
   389  
   390  *Vitess’s grammar currently doesn’t allow this interconversion, but MySQL does. From a practical perspective, this difference is not significant because it results in a minor syntactic inconvenience. One could argue that ‘a’ could be a boolean column. But boolean values are not common in databases, and there are other alternatives. So, Vitess doesn’t support boolean column types. It assumes that everything from a database (or Result) is a scalar.*
   391  
   392  *Not allowing interconversion greatly simplifies the grammar. Also, the boolean engine and expressions engine can be kept separate, which is another simplification.*
   393  
   394  ### Select
   395  
    396  The Select operation produces a new Result by performing arithmetic operations on the input Result.
   397  
   398  `Select(Result, expression list) -> Result`
   399  
   400  An expression is a subset of a boolean expression. The operands can be:
   401  
   402  * A column of the Result
   403  * A column of an outer Result
   404  * Scalar values or expressions
   405  * The result of a subquery
   406  
   407  The operators can be:
   408  
   409  * Arithmetic operators like +, -
   410  * Functions and other complex expressions
   411  
   412  Properties of a Select:
   413  
   414  * A Select produces the same number of rows as the input
   415  * A Select may not produce the same columns as its input
   416  * The expressions are applied to each row individually. There is no interaction between rows.
   417  * The order of the result is not changed. This is not a requirement, but it’s a property worth maintaining.
   418  * The different items (columns) of a Select can be broken into parts and independently computed.
   419  
   420  ### Aggregate
   421  
    422  Aggregate splits the Result columns into two groups: the aggregate columns and the group by columns. It merges the rows whose group by columns are equal, and simultaneously applies aggregation functions to the aggregate columns.
   423  
   424  `Aggregate(Result, aggrExpressionList(columns)) -> Result`
   425  
   426  Any columns that are not in the aggregate expression list are implicitly in the group by list.
   427  
   428  There are two special cases:
   429  
   430  * There are no aggregate columns. In this case, the operation degenerates into a dedup (UNION).
   431  * There are no group by columns. In this case, one and only one row is produced.
   432  
   433  The SQL syntax allows expressions to be present in aggregate functions, but that can be broken down into Select->Aggregate.
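For instance, `select sum(a+1) from t` can be decomposed into a Select that computes a+1 per row, followed by an Aggregate (with no group by columns) that sums the computed column. A sketch over plain Python rows:

```python
def select(rows, exprs):
    """Select: evaluate a value expression list against each row."""
    return [tuple(e(r) for e in exprs) for r in rows]

def aggregate_sum(rows, col):
    """Aggregate with no group by columns: one and only one row."""
    return [(sum(r[col] for r in rows),)]

t = [(1,), (2,), (3,)]  # hypothetical table t with a single column a

# select sum(a+1) from t, decomposed as Select -> Aggregate:
step1 = select(t, [lambda r: r[0] + 1])  # Select computes a+1 per row
result = aggregate_sum(step1, 0)         # Aggregate sums the computed column

assert result == [(9,)]
```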
   434  
   435  Properties of Aggregate:
   436  
   437  * It produces no more rows than the input.
   438  * It produces the same number of columns as the input.
   439  * Aggregate can be efficient if the rows are sorted by the group by columns. If so, it also retains the sort order.
   440  * Aggregate can also be efficient if there are no group by columns. Then only one row is produced in the Result.
   441  * The different items of an Aggregate cannot be independently computed.
   442  
   443  ### Sort
   444  
   445  Sort reorders the rows of a Result based on a list of columns with ascending or descending properties.
   446  
   447  `Sort(Result, (columns asc/desc)) -> Result`
   448  
   449  Properties:
   450  
   451  * The number of rows produced is the same as the input
   452  * No rows are changed
   453  * An outer Sort supersedes an inner Sort
   454  * Sorting can be broken into parts as long as the resulting row is built in the same order as the values referenced by the sort. For example, if we join a with b, and sorting is by a.col and b.col, then we can sort a by a.col and join with b by sorting b by b.col.
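The last bullet can be verified concretely: sorting a by a.col, and emitting each a row's b matches sorted by b.col, produces the same order as sorting the full join by (a.col, b.col). A sketch with hypothetical rows:

```python
# Hypothetical rows: a as (a.id, a.col); b rows grouped by the join
# key a.id, each entry being a b.col value.
a = [(1, 20), (2, 10)]
b = {1: [9, 3], 2: [5]}

# Push the sort down: scan a sorted by a.col, and for each a row fetch
# its b matches sorted by b.col, keeping that order in the joined rows.
pushed = []
for a_id, a_col in sorted(a, key=lambda r: r[1]):
    for b_col in sorted(b[a_id]):
        pushed.append((a_col, b_col))

# The result matches sorting the whole join by (a.col, b.col).
joined = [(a_col, b_col) for a_id, a_col in a for b_col in b[a_id]]
assert pushed == sorted(joined)
```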
   455  
   456  ### Limit
   457  
    458  Limit returns at most N rows from the specified starting point.
   459  
   460  `Limit(Result, start, count) -> Result`
   461  
   462  Limit has algebraic properties that are similar to Filter.
   463  
   464  Properties:
   465  
   466  * A limit doesn’t change the rows
   467  * An outer limit supersedes an inner Limit
   468  * A Limit is not interchangeable with anything other than Select.
   469  * A Limit without a Sort has non-deterministic results.
   470  
   471  ### Join
   472  
   473  Join produces the cross-product of two Results.
   474  
   475  `Join(Result, Result) -> Result`
   476  
   477  This primitive is too inefficient in most situations. Here are some practical composites:
   478  
   479  `InnerJoin(Result, Result, condition) ⇔ Filter(Join(Result, Result), condition)`
   480  
   481  The use case for InnerJoin is easy to explain, because pushing down a condition usually leads to valuable efficiency improvements.
   482  
   483  `QueryJoin(query, query, condition) ⇔ Filter(Join(Scan(query), Scan(query)), condition)`
   484  
   485  The QueryJoin is relevant mainly because VTGate may not get good at performing joins for a long time. So, it may have to rely on sending nested loop queries to the underlying databases instead. However, there will be situations where a join will have to be performed on two Results. Such queries will be out of VTGate’s ability to execute until it implements the `Join(Result, Result)` primitive.
   486  
   487  Properties:
   488  
   489  * Rows are not modified
   490  * Lots of rows can be produced
   491  * Interchangeable with Filter as long as the condition can be split properly. This will be a significant driving factor for the optimizer.
   492  
   493  ### LeftJoin
   494  
   495  LeftJoin requires a join condition. If the condition is not satisfied, the left side of the Result is still produced, with NULL RHS values.
   496  
   497  `LeftJoin(LeftResult, RightResult, condition) -> Result`
   498  
   499  It’s important to differentiate LeftJoin from Join because LeftJoin is not freely interchangeable with Filter or other Joins.
   500  
   501  ### Merge
   502  
   503  Merge merges two results into one. There is no particular order in the rows.
   504  
   505  `Merge(Result, Result) -> Result`
   506  
   507  Properties:
   508  
   509  * Rows are not modified
   510  * Number of rows is the sum of the two Results’ number of rows
   511  
   512  ## Example from hell
   513  
   514  Now that all operators are defined, we have to see if they’re complete. Let’s try this query:
   515  
   516  ```
   517  01: (select count(*), ta1.col+1, tb1.col 
   518  02: from ta1 join tb1 on ta1.id=tb1.id
   519  03: where ta1.col2=10 and ta1.col2 > any
   520  04:   (select c from tc1 where tc1.id=ta1.id)
   521  05: group by ta1.col+1, tb1.col
   522  06: having count(*) > 10
   523  07: order by count(*) desc
   524  08: limit 10)
   525  09:union
   526  10: (select count(*), ta2.col+1, tb2.col 
   527  11: from ta2 join tb2 on ta2.id=tb2.id
   528  12: where ta2.col2=10
   529  13: group by ta2.col+1, tb2.col)
   530  14:order by count(*) desc
   531  15:limit 5
   532  ```
   533  
   534  The above query can be represented as a single expression. But let’s start with something that has intermediate results first:
   535  
   536  ```
   537  // query 1
   538  02: JoinResult1 = InnerJoin(Scan(ta1), Scan(tb1), ta1.id=tb1.id)
   539  // Note that the next operation is within the scope of JoinResult1, so its columns can be used. Nevertheless, it’s an external input.
   540  04: SubqueryCScanResult = Select(
   541        Filter(
   542          Scan(tc1),
   543          tc1.id=JoinResult1.ta1.id,
   544        ),
   545        tc1.c)
   546  03: FilterResult1 = Filter(
   547        JoinResult1,
   548        ta1.col2=10 and ta1.col2 > any(SubqueryCScanResult),
   549      )
   550  01: SelectResult1 = Select(FilterResult1, (1, ta1.col+1, tb1.col))
   551  05: GroupResult1 = Aggregate(SelectResult1, count(‘1’))
   552  06: HavingResult1 = Filter(GroupResult1, ‘1’ > 10)
   553  07: SortResult1 = Sort(HavingResult1, (‘1’ desc))
   554  08: Result1 = Limit(SortResult1, 0, 10)
   555  // query 2
   556  11: JoinResult2 = InnerJoin(Scan(ta2), Scan(tb2), ta2.id=tb2.id)
   557  12: FilterResult2 = Filter(JoinResult2, ta2.col2=10)
   558  10: SelectResult2 = Select(FilterResult2, (1, ta2.col+1, tb2.col))
   559  13: Result2 = Aggregate(SelectResult2, count(‘1’))
   560  // union
   561  09: UnionResult = Aggregate(Merge(Result1, Result2), NULL)
   562  14: OrderResult = Sort(UnionResult, (‘1’ desc))
   563  15: Result = Limit(OrderResult, 0, 5)
   564  ```
   565  
   566  Now, combine everything into one expression, and voila:
   567  
   568  ```
   569  Result = Limit(
   570    Sort(
   571      Aggregate(
   572        Merge(
   573          Limit(
   574            Sort(
   575              Filter(
   576                Aggregate(
   577                  Select(
   578                    Filter(
   579                      JoinResult1 = InnerJoin(
   580                        Scan(ta1),
   581                        Scan(tb1),
   582                        ta1.id=tb1.id,
   583                      ),
   584                        ta1.col2=10 and ta1.col2 > any(
   585                        Select(
   586                          Filter(
   587                            Scan(tc1),
   588                            tc1.id=JoinResult1.ta1.id,
   589                          ),
   590                          tc1.c,
   591                        ),
   592                      ),
   593                    ),
   594                    (1, ta1.col+1, tb1.col),
   595                  ),
   596                  count(‘1’), // ‘1’ is column 1
   597                ),
   598                ‘1’ > 10,
   599              ),
   600              (‘1’ desc),
   601            ),
   602            0, 10,
   603          ),
   604          Aggregate(
   605            Filter(
   606              InnerJoin(
   607                Scan(ta2),
   608                Scan(tb2),
   609                ta2.id=tb2.id,
   610              ),
   611              ta2.col2=10,
   612            ),
   613            count(‘1’),
   614          ),
   615        ),
   616        NULL,
   617      ),
   618      (‘1’ desc),
   619    ),
   620    0, 5,
   621  )
   622  ```
   623  
   624  There’s a lot of nesting, but the expressions are mostly straightforward: each result feeds into the next. The only exception is the correlated subquery, which uses a value from JoinResult1 in the outer query to perform its operation. You can also see that the lowest-level operations are all Scans.
   625  
   626  If the primitives defined at the beginning were all implemented, and we had an infinitely fast CPU (or an infinite amount of time) and an infinite amount of memory, the above ‘plan’ would produce the correct result.
   627  
   628  However, SQL engines are all about having limited CPU, memory and disk. So, they have to figure out ways to rearrange those operations differently so that the minimum amount of time and resources is used to compute the result. The next section will cover those techniques.
   629  
   630  # Approaches and tradeoffs
   631  
   632  ## Standard approaches
   633  
   634  If you analyze the various operators, the expensive ones are:
   635  
   636  * Scan: you want to avoid full table scans, unless it’s a necessity. This is ironically the lowest level operation.
   637  * Sort: this is a potentially expensive operation if the result set is large.
   638  * Aggregation: this is also potentially expensive
   639  
   640  All other operations are considered ‘cheap’ because they can be applied as the rows come. To optimize for the above three operations, databases combine the primitives. The most popular one is the index scan, which allows you to combine Scan, Filter and Sort, all as one operation. Aggregation can also benefit from an index scan if the GROUP BY columns are part of an index. If an index cannot be used, the database resorts to the more traditional approach of full table scans, file sorts, etc.
   641  
   642  ## Preserving the original representation
   643  
   644  In the case of VTGate, the ‘Route’ operation is capable of performing all 9 functions, as long as all the rows needed to fulfill the query reside in a single database. However, those functions have to be expressed as an SQL query. Most often, the original query could just be passed through to a single shard. So, the one important question is: if we converted a query to its relational representation, can we then convert it back to the original query? The answer is no, or at least not trivially; the relational operators don’t always map one-to-one with the constructs of an SQL statement. Here’s a specific example:
   645  
   646  `select count(*), a+1, b from t group by a+1, b`
   647  
   648  This gets expressed as:
   649  
   650  ```
   651  Aggregate(
   652    Select(
   653      Scan(t),
   654      (1, a+1, b),
   655    ),
   656    Count(‘1’),  // ‘1’ is column 1
   657  )
   658  ```
   659  
   660  So, it may not be a good idea to immediately convert the parsed query into the relational representation. When we analyze the query, we should try to retain the original text and see if its parts can be recombined without loss of readability.
   661  
   662  The AST produced by the parser can be reversed back into a representation that’s accurate and close to the original query. This is what VTTablet already does. So, we can use the AST as the intermediate representation that can be retained until we can figure out what can be delegated to the underlying databases.
   663  
   664  Once we have identified the parts that can be pushed down, what is left can be converted to use the operations that VTGate can perform.
   665  
   666  ## Cleverness
   667  
   668  Relational databases initially optimized their queries based on a standard set of rules that used the schema info as the only additional input. These were known as rule-based optimizers. However, some queries would fail for certain values due to pathological corner cases.
   669  
   670  In order to address such corner cases, databases decided to venture into cost-based optimizers. This was somewhat of a slippery slope. On the one hand, the feature was very convenient, because one could throw any kind of ad-hoc query at a database, and it would figure out the best way to get the results. However, every once in a while, the database would make the wrong optimization decision. This led to iterations to accommodate the new corner cases. Unfortunately, there is no end to this, and databases are still iterating. Beyond a certain point, this becomes a losing battle because the cost of computing the best plan starts to exceed the cost of executing the query using a less optimal one.
   671  
   672  The cost-based optimizers have become so complex that a novice programmer can’t reliably predict the execution plan for a query. It’s also hard to know why an engine chose a certain plan.
   673  
   674  An escape hatch was provided: query hints. If the database failed to find the most optimal plan, the application can specify hints in the query. Once hints are provided, the optimizer is bypassed, and the application dictates the plan for the query. The fundamental question is: Who should know the data? This question opens a lot of arguments. The main argument in favor of why the app should be responsible for the query plans is as follows: even if the database can decide the most optimal plan, the app has to know the cost of a query. It needs to know this because it makes a decision on whether it’s going to send 1QPS or 1MQPS of that query. If the plan needs to change, it means that the cost is also changing, which means that the app needs to be revisited. This is why an optimizer that is too clever has diminishing returns, and is sometimes counter-productive.
   675  
   676  The rise of NoSQL databases is further evidence that engineers are generally not very sold on how clever the database optimizers are.
   677  
   678  One concern with the application dictating the plan is that we may constantly worry that it might not be the best one. To alleviate this, we can develop analyzers, possibly backed by a machine-learning backend, that can suggest rewrites. But it will be up to the application to accept them.
   679  
   680  So, we’re going to attempt this brain-dead approach for optimizations in VTGate: by looking at a query, and knowing the vschema, one should be able to predict what VTGate will do with it. If, for some reason, the plan doesn’t work out, then the query must be rewritten to represent the alternate plan. There is a possibility that this approach may fail; some database users have grown accustomed to the optimizer doing the work for them. If that happens, this decision can be revisited.
   681  
   682  ### What does this all mean
   683  
   684  This just means that the VTGate optimizer will be very simple and transparent. For example, joins will always be performed left-to-right. When you look at a query, you’ll know the join order by just looking at it. VTGate will not try to rewrite subqueries into joins. If a join plan is more efficient than a subquery, then the app must write it that way.
   685  
   686  Of course, if there are optimizations that make common sense, those will be implemented. For example, if a subquery is not correlated, then it can be pulled out and executed only once.
   687  
   688  We should also be sensitive to situations where the natural expression of a query does not match the best plan.
   689  
   690  No matter what, once a query is outsourced to a database, it’s subject to the optimizer decisions of the database. So, the app may still have to give hints for such situations.
   691  
   692  ## Opcodes vs tree
   693  
   694  The execution plan can be represented either using opcodes or as a tree. The opcode approach represents the plan as a list of actions to be performed, whereas a tree represents the same plan as one action that depends on other actions, which recursively ends up with leaf nodes that perform the routing.
   695  
   696  Opcodes can be very flexible (Turing complete). However, they require the introduction of intermediate variables. Let’s look at this example:
   697  
   698  `a - (b+c)`
   699  
   700  The opcode approach would express the expression as:
   701  
   702  ```
   703  temp = b+c
   704  result = a - temp
   705  ```
   706  
   707  The tree approach would represent the action as:
   708  
   709  `Minus(a, Plus(b, c))`
   710  
   711  In the theory section, we classified SQL as a pushdown automaton, which means that a tree structure is sufficient to represent a statement.
   712  
   713  Using opcodes may still help us come up with better optimizations. However, given that we don’t plan on going overboard on optimizations, the more readable tree representation is preferable.
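
A minimal sketch of such a tree in Go (the types are illustrative, not anything in the Vitess codebase): each node evaluates its children recursively, so no intermediate variables are needed.

```go
package main

// Expr is a hypothetical expression-tree node, mirroring the
// Minus(a, Plus(b, c)) representation above.
type Expr interface{ Eval() int }

// Const is a leaf value.
type Const int

func (c Const) Eval() int { return int(c) }

// Minus evaluates its children and subtracts right from left.
type Minus struct{ L, R Expr }

func (m Minus) Eval() int { return m.L.Eval() - m.R.Eval() }

// Plus evaluates its children and adds them.
type Plus struct{ L, R Expr }

func (p Plus) Eval() int { return p.L.Eval() + p.R.Eval() }
```

With this shape, `Minus{Const(10), Plus{Const(3), Const(4)}}` expresses `a - (b+c)` for a=10, b=3, c=4 without naming a temporary.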
   714  
   715  ## Prioritization
   716  
   717  Since we have powerful databases underneath us, our first priority is to harness anything they can do first. The next priority will be to do things that VTGate can perform on-the-fly, without accumulating results. And finally, we’ll look at features that require complex memory management or intermediate file storage. The features will be implemented as follows:
   718  
   719  1. Single shard or single keyspace joins.
   720  2. Scatter joins that can be merged in any order.
   721  3. Cross-shard joins that don’t require post-processing.
   722  4. Sort, Aggregate and Limit, provided the results can be pre-sorted by the underlying database(s).
   723  5. Filter and Select expression engines.
   724  6. All other constructs.
   725  
   726  One hurdle to overcome with #4 is collation. Substantial work may have to be done to make the VTGate collation behavior match MySQL.
   727  
   728  ## Streaming vs non-streaming
   729  
   730  VTGate has two query APIs:
   731  
   732  * Non-streaming API: This API has deadlines and row count limits. It collects all the results first and then returns them.
   733  * Streaming API: This API has no deadlines or row count limits. It streams the results as they come.
   734  
   735  This difference will eventually cause a divergence in what VTGate can do for streaming vs non-streaming queries. Since the entire result is available in-memory for non-streaming queries, we’ll be able to perform sorts and aggregations after obtaining the results. However, this will not be possible for streaming queries.
   736  
   737  For the short term, we’ll maintain parity by only implementing operations that both APIs can perform. This actually means priorities 1 through 5. So, it may be a long time before the two diverge.
   738  
   739  ## Column names
   740  
   741  The vschema currently doesn’t know all the column names of the tables. VTGate cannot resolve joins if it doesn’t have this info. We have two options:
   742  
   743  1. Extend vschema to also contain the column names of tables.
   744  2. Require join queries to always qualify column names by the table.
   745  
   746  The first option is going to be a maintenance problem. It introduces a dependency that is hard to maintain. For now, we’ll go with option #2. It is a syntactic inconvenience, but let’s hope that users find this acceptable. We could allow exceptions for vindex columns, but that’s likely to be confusing. So, it’s better to require this for all columns. For subqueries (that are themselves not joins), we can make an exception and assume that unqualified column names are addressing the inner table. You’ll need to use qualified names only for addressing columns of an outer query.
   747  
   748  ## Redundant constructs
   749  
   750  SQL has many redundant constructs. There’s diminishing return in supporting all of them. So, the following constructs will not be supported:
   751  
   752  * `‘,’` join: This operator was the old way of joining tables. The newer SQL recommends using actual JOIN keywords. Allowing both forms of JOIN would be a big source of confusion because the two operators also have different precedence.
   753  * RIGHT JOIN: This is the same as a reversed LEFT JOIN, except it’s less readable.
   754  * NATURAL joins: These are rarely used in practice because the join condition is implicitly derived from all identically named columns.
   755  * JOIN USING: This is just syntactic sugar for JOIN ON a.id=b.id
   756  * WHERE clause join condition: A JOIN without an ON clause is supposed to be a cross product of the two tables. However, old-style ‘,’ users have continued to specify join conditions in the WHERE clause. The initial version of VTGate will only use the ON clause to decide how to route parts of a query. This decision can be revisited if things are too inflexible.
   757  
   758  ## Limiting results
   759  
   760  VTGate currently doesn’t have any limits on how many rows it can hold at any given time. This is dangerous because a scatter query that hits a large number of shards could easily blow up the memory.
   761  
   762  Also, as we add support for joins, we’ll have a new way of accumulating a large number of rows.
   763  
   764  So, the VTTablet level protection to limit the maximum number of rows fetched may not be sufficient any more.
   765  
   766  We’ll add a new maxrows variable to VTGate. If the number of rows accumulated exceeds this count, we’ll return an error.
   767  
   768  Note that this limit only applies to non-streaming queries. Streaming queries will continue to work as before.
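
A sketch of how the maxrows check might look for the non-streaming path (the function name and error text are illustrative, not VTGate’s actual code): rows are accumulated shard by shard, and the query fails as soon as the total crosses the limit.

```go
package main

import "fmt"

// Row is a hypothetical result row; the shape is illustrative.
type Row []interface{}

// accumulate gathers per-shard results for a non-streaming query and
// errors out once the proposed maxrows limit is exceeded.
func accumulate(shardResults [][]Row, maxrows int) ([]Row, error) {
	var all []Row
	for _, rows := range shardResults {
		all = append(all, rows...)
		if len(all) > maxrows {
			return nil, fmt.Errorf("in-memory row count exceeded limit of %d", maxrows)
		}
	}
	return all, nil
}
```

Checking after each shard’s result (rather than only at the end) bounds how far past the limit the accumulation can grow.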
   769  
   770  # Design
   771  
   772  Recapitulating what we’ve covered so far:
   773  
   774  * The primary function of the optimizer is to push down as much of the query components as possible.
   775  * We’ll not support redundant constructs for joins.
   776  * If a query has to be split across multiple shards or keyspaces, we’ll expect the application to dictate the order of operations. This will be left to right, unless parentheses define a different order. 
   777  * For joins, all column names must be qualified by their table.
   778  * We’ll preserve the original representation of the query to the extent possible.
   779  * We’ll not attempt to flatten subqueries. However, they can still get pushed down with the outer query if they’re constrained by the same keyspace id.
   780  
   781  ## Symbol table and scoping rules
   782  
   783  Once we start allowing joins and subqueries, we have a whole bunch of table aliases and relationships to deal with. We have to contend with name clashes, self-joins, as well as scoping rules. In a way, the vschema has acted as a static symbol table so far. But that’s not going to be enough any more.
   784  
   785  The core of the symbol table will contain a map whose key will be a table alias, and the elements will be [similar to the table in vschema](https://github.com/vitessio/vitess/blob/0b3de7c4a2de8daec545f040639b55a835361685/go/vt/vtgate/vindexes/vschema.go#L82). However, it will also contain a column list that will be built as the query is parsed.
   786  
   787  ### A simple example
   788  
   789  Let’s start with the following vschema that has two tables t1 & t2. Here’s a simplified view:
   790  
   791  <table>
   792    <tr>
   793      <td>tables</td>
   794      <td>columns</td>
   795      <td>vindexes</td>
   796    </tr>
   797    <tr>
   798      <td>t1</td>
   799      <td>id</td>
   800      <td>id_idx: hash</td>
   801    </tr>
   802    <tr>
   803      <td></td>
   804      <td>a</td>
   805      <td>a_idx: lookup_unique</td>
   806    </tr>
   807    <tr>
   808      <td>t2</td>
   809      <td>t2id</td>
   810      <td>t2id_idx: hash</td>
   811    </tr>
   812    <tr>
   813      <td></td>
   814      <td>b</td>
   815      <td>b_idx: lookup_unique</td>
   816    </tr>
   817  </table>
   818  
   819  
   820  For the following query:
   821  
   822  `select a, b from t1 where id = :id and c = 5`
   823  
   824  The starting symbol table will be identical to the vschema. However, as the query is analyzed, b and c will get added to t1 as additional columns:
   825  
   826  <table>
   827    <tr>
   828      <td>tables</td>
   829      <td>columns</td>
   830      <td>vindexes</td>
   831    </tr>
   832    <tr>
   833      <td>t1</td>
   834      <td>id</td>
   835      <td>id_idx: hash</td>
   836    </tr>
   837    <tr>
   838      <td></td>
   839      <td>a</td>
   840      <td>a_idx: lookup_unique</td>
   841    </tr>
   842    <tr>
   843      <td></td>
   844      <td>b</td>
   845      <td></td>
   846    </tr>
   847    <tr>
   848      <td></td>
   849      <td>c</td>
   850      <td></td>
   851    </tr>
   852  </table>
   853  
   854  
   855  If a symbol table contains only one database table, then all unqualified column references are assumed to implicitly reference columns of that table.
   856  
   857  *Ideally, we should be validating column references against a known column list for each table. However, as mentioned before, the maintenance of that list comes with its own problems. Instead, we trust that a non-ambiguous column reference implies that the table (or result) does have such a column. In the worst case where such a column doesn’t exist, the query will fail. There are no circumstances where this assumption can lead to incorrect results.*
   858  
   859  The bind vars section of the symbol table will contain an entry for id.
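
The column accretion described above can be sketched with a simplified symbol table (the types and names here are illustrative, not the actual implementation):

```go
package main

// tableInfo is a simplified symbol-table entry: the column list grows
// as the analyzer encounters references. Names are illustrative, not
// the actual VTGate types.
type tableInfo struct {
	Columns  []string
	Vindexes map[string]string // column -> vindex name, from the vschema
}

type symtab map[string]*tableInfo

// addColumn records a column against its table the first time it is
// referenced in the query; duplicates are ignored.
func (st symtab) addColumn(table, col string) {
	t := st[table]
	for _, c := range t.Columns {
		if c == col {
			return
		}
	}
	t.Columns = append(t.Columns, col)
}
```

Starting from the vschema entry for t1 (id and a), analyzing `select a, b from t1 where id = :id and c = 5` would call `addColumn` for b and c, producing the four-column entry shown in the table above.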
   860  
   861  ### Self-join
   862  
   863  For the following query (using the previous vschema):
   864  
   865  `select t1.a, t2.c from t1 join t1 as t2 where t2.a = t1.id`
   866  
   867  The symbol table will look like this:
   868  
   869  <table>
   870    <tr>
   871      <td>tables</td>
   872      <td>columns</td>
   873      <td>vindexes</td>
   874    </tr>
   875    <tr>
   876      <td>t1</td>
   877      <td>id</td>
   878      <td>id_idx: hash</td>
   879    </tr>
   880    <tr>
   881      <td></td>
   882      <td>a</td>
   883      <td>a_idx: lookup_unique</td>
   884    </tr>
   885    <tr>
   886      <td>t2</td>
   887      <td>id</td>
   888      <td>id_idx: hash</td>
   889    </tr>
   890    <tr>
   891      <td></td>
   892      <td>a</td>
   893      <td>a_idx: lookup_unique</td>
   894    </tr>
   895    <tr>
   896      <td></td>
   897      <td>c</td>
   898      <td></td>
   899    </tr>
   900  </table>
   901  
   902  
   903  In the symbol table, t2 starts off as identical to t1, and is different from the t2 in the original vschema. But t2 eventually adds c to its list of columns. There is currently no value in remembering that t1 and t2 refer to the same underlying table. Maybe a future use case will arise that will introduce this need.
   904  
   905  ### Subqueries
   906  
   907  MySQL allows table names (or aliases) in subqueries to hide an outer name. Due to this, we have to extend the symbol table to support scoping. This is done by changing it into a linked list, where each node represents a scope. A symbol is searched in the list order until it’s found.
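
The linked-list lookup can be sketched as follows (illustrative types): each subquery pushes a new scope, and resolution walks outward until the alias is found.

```go
package main

// scope is one node of the linked-list symbol table: each subquery
// pushes a new scope whose lookups fall back to the outer one.
// Names are illustrative, not the actual VTGate types.
type scope struct {
	tables map[string]bool // table aliases visible in this scope
	outer  *scope
}

// resolve walks the scope list in order and returns the innermost
// scope that defines the alias, or nil if none does.
func (s *scope) resolve(alias string) *scope {
	for cur := s; cur != nil; cur = cur.outer {
		if cur.tables[alias] {
			return cur
		}
	}
	return nil
}
```

For the example below, resolving t2 from the inner scope finds the inner alias and never reaches the outer t2, which is exactly the hiding behavior described.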
   908  
   909  Let’s look at this intentionally confusing example:
   910  
   911  ```
   912  select t1.a, t2.b from t1 join t2 where t2.t2id in (
   913    select b from t1 as t2 where id = t1.c
   914  )
   915  ```
   916  
   917  The outerscope symbol table will look like this:
   918  
   919  <table>
   920    <tr>
   921      <td>t1</td>
   922      <td>id</td>
   923      <td>id_idx: hash</td>
   924    </tr>
   925    <tr>
   926      <td></td>
   927      <td>a</td>
   928      <td>a_idx: lookup_unique</td>
   929    </tr>
   930    <tr>
   931      <td></td>
   932      <td>c</td>
   933      <td></td>
   934    </tr>
   935    <tr>
   936      <td>t2</td>
   937      <td>t2id</td>
   938      <td>t2id_idx: hash</td>
   939    </tr>
   940    <tr>
   941      <td></td>
   942      <td>b</td>
   943      <td>b_idx: lookup_unique</td>
   944    </tr>
   945  </table>
   946  
   947  
   948  The innerscope will look like this:
   949  
   950  <table>
   951    <tr>
   952      <td>t2</td>
   953      <td>id</td>
   954      <td>id_idx: hash</td>
   955    </tr>
   956    <tr>
   957      <td></td>
   958      <td>a</td>
   959      <td>a_idx: lookup_unique</td>
   960    </tr>
   961    <tr>
   962      <td></td>
   963      <td>b</td>
   964      <td></td>
   965    </tr>
   966  </table>
   967  
   968  
   969  Column c for t1 was added because the inner query made a qualified reference to it. The inner-scope t2 gets column b added because the inner query made an implicit reference, and there was only one table in that scope.
   970  
   971  Also, the inner query cannot access any columns of the outer t2 because it has its own t2 symbol that hides the outer one.
   972  
   973  A corollary: since subquery nesting can have many levels, an inner query can still reference results from any of the outer scopes. For example, this is a valid query:
   974  
   975  ```
   976  select t1.a, t1.b
   977  from t1
   978  where t1.id in (
   979    select t2.id
   980    from t2
   981    where t2.id in (
   982      select t3.id
   983      from t3
   984      where t3.id = t1.a
   985      )
   986    )
   987  ```
   988  
   989  ### The FROM clause
   990  
   991  The FROM clause is an exception. This clause creates a new symbol table (scope), with no initial outerscope, even if it was part of a subquery. At the end of the analysis, the newly created symbol table is added to the outerscope, if there was one.
   992  
   993  This means that subqueries in the FROM clause are not allowed to refer to elements of any outer query.
   994  
   995  The result of such a subquery will be a new virtual table with its own vindex columns, etc. This is facilitated by the fact that the SQL syntax requires you to name the subquery with an alias. Figuring out possible vindexes for the new virtual table is useful because routing rules could be applied for other tables that may join with the subquery. Here’s an example:
   996  
   997  ```
   998  select sq.a, t2.b
   999  from
  1000    (select a from t1 where id = :id) as sq
  1001    join t2 on t2.b = sq.a
  1002  ```
  1003  
  1004  The main symbol table will look like this:
  1005  
  1006  <table>
  1007    <tr>
  1008      <td>tables</td>
  1009      <td>columns</td>
  1010      <td>vindexes</td>
  1011    </tr>
  1012    <tr>
  1013      <td>sq</td>
  1014      <td>a</td>
  1015      <td>a_idx: lookup_unique</td>
  1016    </tr>
  1017    <tr>
  1018      <td>t2</td>
  1019      <td>t2id</td>
  1020      <td>t2id_idx: hash</td>
  1021    </tr>
  1022    <tr>
  1023      <td></td>
  1024      <td>b</td>
  1025      <td>b_idx: lookup_unique</td>
  1026    </tr>
  1027  </table>
  1028  
  1029  
  1030  The symbol table for the `‘from t1’` subquery will be in its own scope. Its analysis will yield a new virtual table sq, which will be the one added to the main symbol table. The analyzer should hopefully be smart enough to deduce that column ‘a’ of the subquery can be used for routing because it’s a faithful representation of the column in the underlying table t1.
  1031  
  1032  *Conceptually speaking, there’s no strong reason to prevent the FROM clause from referencing columns of an outer query. MySQL probably disallows this because it leads to confusing scoping rules. Let’s take a look at this example:*
  1033  
  1034  ```
  1035  select id from a where id in (
  1036    select b.id, a.id
  1037    from
  1038      (select id from b where b.id=a.id) b
  1039      join d as a on a.id=b.id)
  1040  ```
  1041  
  1042  *If we allowed an outer scope, then b.id=a.id becomes ambiguous. We don’t know if it should refer to the outer ‘a’, or the one that is yet to be created as part of the current scope. Needless to say, one shouldn’t be writing such unreadable queries.*
  1043  
  1044  ### ON clauses
  1045  
  1046  By the fact that the ON clause is part of a FROM clause, you cannot refer to an element of an outer query. This is an invalid query:
  1047  
  1048  `select id from a where id in (select b.id from b join c on b.id=a.id)`
  1049  
  1050  But this is valid:
  1051  
  1052  `select id from a where id in (select b.id from b join c where b.id=a.id)`
  1053  
  1054  Furthermore, you can only refer to elements that participate in the current JOIN. But the following query is still valid:
  1055  
  1056  `select * from a join b on b.id = a.id join c on c.id = a.id`
  1057  
  1058  Because of associativity rules, c is actually joined with ‘a join b’.
  1059  
  1060  But this query is not valid:
  1061  
  1062  `select * from a join (b join c on c.id = a.id)`
  1063  
  1064  The ON clause participants are only b and c because of the parentheses. So, you’re not allowed to reference a.id in that join. In other words, analysis of the ON clause requires a list of participants. The analyzer will need to check this list before looking inside the symbol table, or we could exploit Go interfaces here by sending a subset symbol table that has the same interface as the original one.
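
The interface idea could be shaped like this (illustrative names, not actual VTGate types): the ON-clause analyzer receives a wrapper that satisfies the same lookup interface as the full symbol table but answers only for the participant tables.

```go
package main

// resolver is the lookup interface the ON-clause analyzer consults.
// All names here are illustrative, not actual VTGate types.
type resolver interface {
	Lookup(alias string) bool
}

// allTables is the full symbol table for the current scope.
type allTables map[string]bool

func (t allTables) Lookup(alias string) bool { return t[alias] }

// subset presents the same interface as the full symbol table, but
// answers only for the tables participating in the current JOIN.
type subset struct {
	full         resolver
	participants map[string]bool
}

func (s subset) Lookup(alias string) bool {
	return s.participants[alias] && s.full.Lookup(alias)
}
```

The analyzer code stays unchanged: it takes a `resolver` and never knows whether it was handed the full table or the restricted view.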
  1065  
  1066  ON clauses are allowed to have correlated subqueries, but only with tables that are the ON clause participants. This statement is valid:
  1067  
  1068  `select * from a join b on b.id in (select c.id from c where c.id=a.id)`
  1069  
  1070  ### Naming result columns
  1071  
  1072  Result columns get referenced in the following places:
  1073  
  1074  * For addressing the results of a subquery in the FROM clause
  1075  * GROUP BY, HAVING and ORDER BY clauses
  1076  
  1077  The WHERE clause is not allowed to reference them because it happens before the SELECT expressions are evaluated.
  1078  
  1079  Column names are usually built from the select expression itself. Therefore, ‘a+1’ is a valid column name. If such a column is in a FROM subquery, it can be addressed using back-quotes as ``a+1``. However, you can reference such expressions directly in GROUP BY, HAVING and ORDER BY clauses. You can also reference a result by using its column number (starting with 1).
  1080  
  1081  Conceptually, this is similar to the subquery column naming. GROUP BY, etc. are post-processing operations that are applied to the result of a query.
  1082  
  1083  *MySQL allows duplicate column names. Using such column names for joins and subqueries leads to unpredictable results. Because of this, we could go either way:*
  1084  
  1085  * *Allow duplicates like MySQL and remain equally ambiguous*
  1086  * *Disallow duplicates*
  1087  
  1088  Here’s an example of how MySQL names columns:
  1089  
  1090  ```
  1091  > select id, (select val    from a), 3, 3, 3 as `4` from a;
  1092  +------+------------------------+---+---+---+
  1093  | id   | (select val    from a) | 3 | 3 | 4 |
  1094  +------+------------------------+---+---+---+
  1095  |    1 | a                      | 3 | 3 | 3 |
  1096  +------+------------------------+---+---+---+
  1097  MariaDB [a]> select `(select val    from a)` from (select id, (select val    from a), 3, 3, 3 as `4` from a) as t;
  1098  +------------------------+
  1099  | (select val    from a) |
  1100  +------------------------+
  1101  | a                      |
  1102  +------------------------+
  1103  1 row in set (0.00 sec)
  1104  MariaDB [a]> select `(select val from a)` from (select id, (select val    from a), 3, 3, 3 as `4` from a) as t;
  1105  ERROR 1054 (42S22): Unknown column '(select val from a)' in 'field list'
  1106  ```
  1107  
  1108  The select expression is reproduced verbatim as the column name, and duplicates are allowed.
  1109  
  1110  These quirks don’t matter much if VTGate just has to return these results back to the client without generating its own values. However, the result column names need to be known for analyzing dependencies when we split queries. So, VTGate will approximate what MySQL does: take a guess at the column name by rebuilding a string representation of the AST for that select expression. If there’s a lucky match, well and good. If not, the query analysis will fail, and the user will have to provide a non-ambiguous alias for such columns. In practice, this is not a real problem because users generally dislike using verbatim expressions as column names when performing complex joins.
  1111  
  1112  So, once the SELECT expression is analyzed, the symbol table (for that scope) will acquire an entry for the result column names. Additionally, if a result column is a faithful reference to an underlying table column, it will contain this additional annotation, which can be used for optimizing other operations.
  1113  
  1114  ### Bind variables
  1115  
  1116  When VTGate has to perform a join, it needs to use the results of one query to feed into the other. For example, let’s look at this query:
  1117  
  1118  `select t1.a, t2.b from t1 join t2 on t2.id = t1.id`
  1119  
  1120  If t1 and t2 are on different keyspaces, VTGate has to perform the following operations:
  1121  
`select t1.a, t1.id from t1`

For each row of this result, it then has to perform:

`select t2.b from t2 where t2.id = :_t1_id`

The mechanism for this works by generating a bind variable named ‘_t1_id’ from the first result. The bind variable name can be generated using a functional algorithm:

`‘_’ + tableAlias + ‘_’ + columnName`
  1131  
  1132  The scoping rule works in our favor here because this type of name generation correctly handles naming conflicts.
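As a minimal sketch, the naming scheme can be expressed as a one-line helper (the function name is illustrative, not Vitess’s actual code):

```go
package main

import "fmt"

// bindVarName builds the bind variable name that carries a column value
// from one route to another, using the ‘_’ + tableAlias + ‘_’ + columnName
// scheme described above.
func bindVarName(tableAlias, columnName string) string {
	return "_" + tableAlias + "_" + columnName
}

func main() {
	// For `select t2.b from t2 where t2.id = :_t1_id`, the value comes
	// from the id column of the t1 result:
	fmt.Println(bindVarName("t1", "id")) // _t1_id
}
```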
  1133  
*MySQL allows multiple columns in a result to have the same name, and allows the use of such ambiguous names in joins, etc. Fortunately, this also works in our favor: we’ll be no worse off if we also allow it.*
  1135  
  1136  For the sake of simplicity, we’ll not allow dependencies to use non-standard column names. For example, we’ll fail a query like this:
  1137  
  1138  `select * from (select a, count(*) from t1) t where t.count(*) > 0`
  1139  
  1140  This should instead be rewritten as:
  1141  
  1142  `select * from (select a, count(*) c from t1) t where t.c > 0`
  1143  
This way, we can build a bind variable like `:_t_c` for the outer query if we have split the above query into two.
  1145  
  1146  There is a subtle difference between column names produced by a result vs. the column names in the symbol table. The WHERE clause constructs reference columns in the symbol table. It’s only the post-processing constructs like GROUP BY, etc. that reference column names produced by the result. This could pose a problem when a SELECT gets pushed down that defines a new alias that has the same column name as an underlying table. Will such a construct hide, and therefore, corrupt the meaning of a join? Let’s look at an example:
  1147  
  1148  `select a.col1 as col2, b.col1 from a join b on b.col2=a.col2`
  1149  
  1150  Rewritten, this would become
  1151  
  1152  ```
  1153  select a.col1 as col2, a.col2 from a
  1154  Join (by producing _a_col2 from column 2)
  1155  select b.col1 from b where b.col2 = :_a_col2
  1156  ```
  1157  
  1158  Because we reference column numbers from the first query, we bypass naming ambiguities. However, we still need to know the source column for each expression so we don’t have to unnecessarily fetch identical results. In other words, the following query:
  1159  
  1160  `select a.col2 as col2, b.col1 from a join b on b.col2=a.col2`
  1161  
  1162  should become:
  1163  
  1164  ```
  1165  select a.col2 as col2 from a
  1166  Join (by producing _a_col2 from column 1)
  1167  select b.col1 from b where b.col2 = :_a_col2
  1168  ```
  1169  
  1170  This means that select expressions need to know if they’re a faithful reference of an underlying table column, and this information will be used for wiring up dependencies.
  1171  
  1172  The other situation where result names interact with the symbol table is during a subquery in the FROM clause. Fortunately, we create a separate virtual table entry for it in the symbol table for this situation. This allows us to keep the two worlds from colliding.
  1173  
  1174  ## Pushdown
  1175  
  1176  Pushdown is achieved by analyzing the query in the execution order and seeing if the next stage can be pushed down into the database. For VTGate, the execution order is slightly different from the standard one: Route replaces Scan, and also, the Route primitive can perform a scatter.
  1177  
  1178  In terms of relational primitives, a scatter is a bunch of `Route` operations followed by a `Merge`. We’ll call it  `RouteMerge`. So, what you can push down into a simple `Route` may not be allowed for a `RouteMerge`. Here’s the VTGate-specific order of operations:
  1179  
  1180  `Route/RouteMerge <-> (Join & LeftJoin) -> Filter(where) -> Select -> Aggregate -> Filter(having) -> Sort -> Limit -> Merge`
  1181  
  1182  The rest of the analysis tries to push the subsequent operations into Route or RouteMerge. Whatever cannot be pushed down is the work that VTGate has to do by itself. The fact that we won’t try to flatten subqueries simplifies our analysis a little bit.
  1183  
  1184  *MySQL allows you to Sort and Limit the result of a UNION. This is something that the Vitess grammar doesn’t allow right now. We’ll eventually need to support this construct, which will add `Aggregate -> Sort -> Limit` as things that can further happen after the last `Merge`.*
  1185  
  1186  The overall strategy is as follows:
  1187  
  1188  1. Identify groups: Form groups of tables (or subqueries) in the FROM clause that can stay together for Route operations.
  1189  2. Pushdown higher-level operations into the various groups: This step uses the AST as input and produces an execution tree where the leaf nodes are all Route operations.
  1190  3. Wire up dependencies: This step computes the dependencies between different routes, and potentially modifies the queries in the routes so that additional values are returned that may be needed by other routes.
  1191  4. Produce the execution plan.
  1192  
  1193  ### Starting primitives
  1194  
  1195  In order to align ourselves with our priorities, we’ll start off with a limited set of primitives, and then we can expand from there.
  1196  
  1197  VTGate already has `Route` and `RouteMerge` as primitives. To this list, let’s add `Join` and `LeftJoin`. Using these primitives, we should be able to cover priorities 1-3 (mentioned in the [Prioritization](https://github.com/vitessio/vitess/blob/main/doc/V3HighLevelDesign.md#prioritization) section). So, any constructs that will require VTGate to do additional work will not be supported. Here’s a recap of what each primitive must do:
  1198  
  1199  * `Route`: Sends a query to a single shard or unsharded keyspace.
  1200  * `RouteMerge`: Sends a (mostly) identical query to multiple shards and returns the combined results in no particular order.
  1201  * `Join`: Executes the LHS operation, which could be any primitive. For each row returned, it builds additional bind vars using the result, and uses them to execute the RHS. For each row of the RHS, it builds a new row that combines both the results.
  1202  * `LeftJoin`: Same as Join, except that a row is returned even if the RHS fails to return a row.
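Expressed as code, `Join` and `LeftJoin` amount to a nested-loop join across routes. The following is a hedged sketch with in-memory rows; the `Row`, `Exec`, and `Join` types are hypothetical stand-ins for the real primitives, not Vitess’s actual structs:

```go
package main

import "fmt"

// Row is a result row; nil cells stand in for SQL NULL.
type Row []interface{}

// Exec stands in for executing a route primitive with the given bind variables.
type Exec func(bindVars map[string]interface{}) []Row

// Join sketches the Join/LeftJoin primitives described above.
// BindVars maps a bind variable name to the LHS column index that feeds it.
type Join struct {
	Left, Right Exec
	BindVars    map[string]int
	LeftOuter   bool // LeftJoin: keep the LHS row even if the RHS returns nothing
	RightCols   int  // number of RHS columns, needed for NULL padding
}

// Execute runs the nested loop: for each LHS row, build the bind vars,
// execute the RHS, and combine the rows.
func (j *Join) Execute() []Row {
	var out []Row
	for _, lrow := range j.Left(nil) {
		bv := make(map[string]interface{})
		for name, col := range j.BindVars {
			bv[name] = lrow[col]
		}
		rrows := j.Right(bv)
		if len(rrows) == 0 && j.LeftOuter {
			rrows = []Row{make(Row, j.RightCols)} // an all-NULL RHS row
		}
		for _, rrow := range rrows {
			out = append(out, append(append(Row{}, lrow...), rrow...))
		}
	}
	return out
}

func main() {
	// LHS: select t1.a, t1.id from t1
	left := func(map[string]interface{}) []Row {
		return []Row{{"a1", 1}, {"a2", 2}}
	}
	// RHS: select t2.b from t2 where t2.id = :_t1_id
	right := func(bv map[string]interface{}) []Row {
		if bv["_t1_id"] == 1 {
			return []Row{{"b1"}}
		}
		return nil
	}
	j := &Join{Left: left, Right: right, BindVars: map[string]int{"_t1_id": 1},
		LeftOuter: true, RightCols: 1}
	fmt.Println(j.Execute()) // [[a1 1 b1] [a2 2 <nil>]]
}
```

Note how `LeftOuter` is the only difference between the two primitives: it pads the RHS with NULLs instead of dropping the LHS row.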
  1203  
  1204  ### FROM clause
  1205  
  1206  The first step of the pushdown algorithm is to identify tables that can stay together. Requiring the ON clause to specify the join condition helps simplify this analysis. If this rule is followed in the query, we can determine this grouping by just analyzing the FROM clause.
  1207  
  1208  The premise is as follows: A join can be pushed down if all the rows needed to satisfy a query are within the same database. If all the tables are in an unsharded keyspace, then this constraint is automatically true. If the tables are in a sharded keyspace, then the join condition must guarantee this. This is best explained with an example. Let’s start with a sharded keyspace with user and extra as tables, and analyze this query:
  1209  
  1210  `select user.name, extra.info from user join extra on extra.id=user.id`
  1211  
  1212  The join algorithm will first scan user:
  1213  
  1214  `select user.name, user.id from user`
  1215  
  1216  Then for each user.id (:user_id), it will execute:
  1217  
  1218  `select extra.info from extra where extra.id = :user_id`
  1219  
If user and extra had the same vindex on id, and if it was unique, then we know that all the necessary rows to satisfy the second query belong to the same keyspace_id as the row that came from user. So, they will be in the same shard. This means that it’s safe to push down the join.
  1221  
  1222  Note that this push-down will work for RouteMerge queries also.
  1223  
Given that the unsharded keyspace is an easy special case, we’ll focus mainly on the analysis of sharded keyspaces. Wherever we mention ‘same shard’ or ‘same keyspace id’, the condition applies equally to an unsharded keyspace.
  1225  
  1226  *Rule: For a join to be groupable, the join condition must be an equality condition where the participant columns use the same vindex, and the vindex must be unique.*
  1227  
  1228  Some examples of what kind of AST will be produced by joins:
  1229  
  1230  ```
  1231  a join b join c
  1232      J
  1233     / \
  1234    J   c
  1235   / \
  1236  a   b
a join (b join c)
  1238    J
  1239   / \
  1240  a   J
  1241     / \
  1242    b   c
  1243  a left join b on c1 left join c on c2
  1244      L:c2
  1245     / \
  1246   L:c1 c
  1247   / \
  1248  a   b
  1249  ```
  1250  
  1251  For left joins, the ON clause is mandatory and enforced by the grammar. Because of this, the placement of ON clauses can also dictate grouping, which is not the case for normal joins:
  1252  
  1253  ```
  1254  a left join b left join c on c1 on c2
  1255    L:c2
  1256   / \
  1257  a   L:c1
  1258     / \
  1259    b   c
  1260  ```
  1261  
  1262  By looking at the above tree, we can deduce that c1 can reference b and c, while c2 can reference a, b and c.
  1263  
  1264  As mentioned before, when the required data is in multiple shards or keyspaces, VTGate will perform the join, and the order will be dictated by the SQL statement, as represented by the above execution trees. The join conditions themselves may try to dictate a different order. However, that will not be used to reorder the joins.
  1265  
  1266  For example:
  1267  
  1268  `from a join (b join c on c.id=b.id) on c.id=a.id`
  1269  
  1270  is likely better executed this way:
  1271  
  1272  `from a join c on c.id=a.id join b on b.id=c.id`
  1273  
  1274  or this way:
  1275  
  1276  `from b join c on c.id=b.id join a on a.id=c.id`
  1277  
  1278  But VTGate will not perform such reordering tricks.
  1279  
  1280  The work of the analyzer is to recursively walk the JOIN tree. The decisions are made from bottom to top.
  1281  
  1282  The lowest level node of a parse tree is a ‘table’. This could be a real table or the result of a subquery that can be treated as a table. Knowing the table name, we can deduce the keyspace to which it belongs. So, the output of this analysis is the creation of the table entry in the symbol table, and the node being marked as a RouteMerge, because the rows for that table are scattered across all shards until additional constraints are applied.
  1283  
  1284  When analyzing a join node, we first extract the ON clause conditions as a list, and we see if these requirements are satisfied:
  1285  
  1286  * The node types have to be Route or RouteMerge
  1287  * They should be for the same keyspace (sharded)
  1288  * There should be at least one join condition between the two nodes: left.col = right.col
  1289  * The columns referenced in the join must use the same vindex
  1290  * The vindex must be unique
  1291  * Other conditions in the ON clause must also be executable under the new group. This basically means that if there are subqueries, they should be correlated in such a way that the required rows can all be found within the new group. The algorithm for this analysis is explained in the WHERE clause section.
  1292  
  1293  Two nodes can also be grouped (without a join condition) if they are both constrained by the same keyspace id (Route nodes).
  1294  
  1295  The above rules work for both JOIN and LEFT JOIN nodes.
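The groupability checks above can be summarized in code. This is a sketch under simplifying assumptions (node and condition fields are hypothetical, and the special case of two nodes constrained by the same keyspace id is omitted):

```go
package main

import "fmt"

// Node is a simplified planner node: Op is "Route", "RouteMerge", or a join.
type Node struct {
	Op       string
	Keyspace string
}

// JoinCond describes an equality join condition left.col = right.col,
// annotated with the vindexes of the two columns.
type JoinCond struct {
	LeftVindex, RightVindex string
	VindexUnique            bool
}

// groupable applies the rules above: both sides must be Route/RouteMerge
// in the same keyspace, joined on an equality whose columns share the
// same unique vindex.
func groupable(l, r Node, cond *JoinCond) bool {
	if l.Op != "Route" && l.Op != "RouteMerge" {
		return false
	}
	if r.Op != "Route" && r.Op != "RouteMerge" {
		return false
	}
	if l.Keyspace != r.Keyspace || cond == nil {
		return false
	}
	return cond.LeftVindex == cond.RightVindex && cond.VindexUnique
}

func main() {
	user := Node{Op: "RouteMerge", Keyspace: "user_ks"}
	extra := Node{Op: "RouteMerge", Keyspace: "user_ks"}
	// extra.id = user.id, both columns using the same unique vindex:
	fmt.Println(groupable(user, extra, &JoinCond{"user_index", "user_index", true})) // true
	// Different vindexes: the join cannot be grouped.
	fmt.Println(groupable(user, extra, &JoinCond{"user_index", "name_index", true})) // false
}
```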
  1296  
  1297  ###### Why does grouping work for LEFT JOIN?
  1298  
  1299  *It’s easy to explain why grouping works for the simple case:*
  1300  
  1301  `a left join b on b.id=a.id`
  1302  
  1303  *In the above case, the only rows that are inspected in b are those where b.id=a.id. So, the join can be grouped. But what about this join:*
  1304  
  1305  `(a left join b on a.id=b.id) join (c left join d on c.id=d.id) on b.id=d.id`
  1306  
  1307  *In the above case, rows from b or c could be NULL. Fortunately, in SQL, NULL != NULL. So, the outer-scope join will succeed only if rows from b and c are not NULL. If they’re not NULL, they’re guaranteed to be from the same shard. So, this makes them groupable. In fact, in the above case, the left joins in the inner scope are unnecessary. They could have just been normal joins. However, things would be very different if one had used the null-safe equal operator (<=>) for joins. But we’ll not treat null-safe equal as a valid join operator.*
  1308  
  1309  When two nodes are grouped, the current join condition becomes the root of the new group, and it gets a routing property:
  1310  
  1311  * If it’s a JOIN, the new property is the more restrictive of the two nodes. So, if one of them is a Route, then the new node is also a Route.
  1312  * For a LEFT JOIN, the new property is the same as the LHS node.
  1313  
If the grouping conditions are not met, then the node remains a join node. In this case, we have to see if the ON clause conditions can be pushed down into the left and/or right nodes. Because the current join is split into two, the ON clause cannot be pushed as is. Instead, we use associativity rules to our benefit and merge the ON clause conditions into the WHERE clauses of the underlying nodes. The rules are the same as the ones described for a normal WHERE clause.
  1315  
  1316  But left joins are slightly different, because the join condition is applied *to the RHS only*. Also, the condition cannot be further pushed into other nested left joins, because they will change the meaning of the statement. For example:
  1317  
  1318  `from a left join (b left join c on c.id=b.id) on c.id=a.id`
  1319  
  1320  is same as:
  1321  
`a LeftJoin (b LeftJoin c where c.id=b.id) where c.id=a.id`
  1323  
  1324  But it’s not the same as:
  1325  
`a LeftJoin (b LeftJoin c where c.id=b.id and c.id=a.id)`
  1327  
  1328  If the above joins were normal joins, then all three would be equivalent.
  1329  
  1330  *The RHS-only rule for left joins has some non-intuitive repercussions. For example, this query will return all rows of table ‘a’:*
  1331  
  1332  `select a.id, b.id from a left join b on (a.id=5 and b.id=a.id)`
  1333  
  1334  *But b.id will be mostly null, except when a.id and b.id were 5.*
  1335  
  1336  An ON clause can contain other constraints that are not joins. If so, they’re treated just like a WHERE clause constraint. So, a keyspace id constraint in an ON clause can change a RouteMerge to a Route. This means that we have to look at how to combine Route nodes with others.
  1337  
A subquery in the FROM clause can result in any of the four primitive nodes. If it’s a join node, then the ON clause cannot be pushed down, because it depends on the result of the join. If it’s a Route or RouteMerge, the ON clause is pushable. Here’s how:
  1339  
  1340  `from a join (select id, count(*) from b where b.id=1) t on t.id=a.id`
  1341  
  1342  becomes
  1343  
  1344  ```
  1345    J
  1346   / \
a   (select id, count(*) from b where b.id=1) t where t.id=a.id
  1348  ```
  1349  
  1350  Here’s a tabular version of the above algorithm:
  1351  
  1352  <table>
  1353    <tr>
  1354      <td>LHS</td>
  1355      <td>Join type</td>
  1356      <td>RHS</td>
  1357      <td>Result</td>
  1358    </tr>
  1359    <tr>
  1360      <td>Route</td>
  1361      <td>Join</td>
  1362      <td>Route</td>
  1363      <td>Route</td>
  1364    </tr>
  1365    <tr>
  1366      <td>Route</td>
  1367      <td>Join</td>
  1368      <td>RouteMerge</td>
  1369      <td>Route</td>
  1370    </tr>
  1371    <tr>
  1372      <td>RouteMerge</td>
  1373      <td>Join</td>
  1374      <td>Route</td>
  1375      <td>Route (RHS)</td>
  1376    </tr>
  1377    <tr>
  1378      <td>RouteMerge</td>
  1379      <td>Join</td>
  1380      <td>RouteMerge</td>
  1381      <td>RouteMerge</td>
  1382    </tr>
  1383    <tr>
  1384      <td>Route</td>
  1385      <td>LeftJoin</td>
  1386      <td>Route</td>
  1387      <td>Route (LHS)</td>
  1388    </tr>
  1389    <tr>
  1390      <td>Route</td>
  1391      <td>LeftJoin</td>
  1392      <td>RouteMerge</td>
  1393      <td>Route (LHS)</td>
  1394    </tr>
  1395    <tr>
  1396      <td>RouteMerge</td>
  1397      <td>LeftJoin</td>
  1398      <td>Route</td>
  1399      <td>RouteMerge</td>
  1400    </tr>
  1401    <tr>
  1402      <td>RouteMerge</td>
  1403      <td>LeftJoin</td>
  1404      <td>RouteMerge</td>
  1405      <td>RouteMerge</td>
  1406    </tr>
  1407    <tr>
  1408      <td>Join/LeftJoin</td>
  1409      <td>All joins</td>
  1410      <td>All primitives</td>
  1411      <td>New join</td>
  1412    </tr>
  1413    <tr>
  1414      <td>All primitives</td>
  1415      <td>All joins</td>
  1416      <td>Join/LeftJoin</td>
  1417      <td>New join</td>
  1418    </tr>
  1419  </table>
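The table above can be encoded as a small function. This sketch assumes the grouping conditions of the previous section have already been checked (the function and operator names are illustrative):

```go
package main

import "fmt"

// resultOp derives the routing property of a combined node from its two
// sides, per the table above. lhs and rhs are "Route", "RouteMerge",
// "Join", or "LeftJoin"; joinType is "Join" or "LeftJoin".
func resultOp(lhs, joinType, rhs string) string {
	// If either side is already a join, the result is a new join node.
	if lhs == "Join" || lhs == "LeftJoin" || rhs == "Join" || rhs == "LeftJoin" {
		return "NewJoin"
	}
	// A LEFT JOIN takes the LHS property.
	if joinType == "LeftJoin" {
		return lhs
	}
	// A JOIN takes the more restrictive of the two: Route wins over RouteMerge.
	if lhs == "Route" || rhs == "Route" {
		return "Route"
	}
	return "RouteMerge"
}

func main() {
	fmt.Println(resultOp("RouteMerge", "Join", "Route"))     // Route
	fmt.Println(resultOp("Route", "LeftJoin", "RouteMerge")) // Route
	fmt.Println(resultOp("RouteMerge", "LeftJoin", "Route")) // RouteMerge
}
```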
  1420  
  1421  
  1422  At the end of the analysis, the output is a symbol table and a reduced tree representing the grouped joins. After this phase, the groups are finalized. They will not be further split or merged.
  1423  
  1424  Here’s an example:
  1425  
  1426  `a join b on b.id=a.id join c on c.id=b.id2 join d on d.id=a.id`
  1427  
  1428  will be represented as:
  1429  
  1430  ```
  1431                          J2
  1432                         /  \
  1433                        J1   d where d.id=a.id
  1434                       /  \    
  1435  a join b on b.id=a.id    c where c.id=b.id2
  1436  ```
  1437  
  1438  ### WHERE clause
  1439  
  1440  The ON clause analysis is similar to the WHERE clause, which is why it wasn’t covered in detail in the previous section. The main difference between WHERE and ON is that the ON clause can cause groups to merge. If two groups cannot be merged, the ON clauses get converted to WHERE clauses and get pushed down using the same rules. The only exception is the forced RHS push in the case of left joins.
  1441  
*The symbol table used by the ON clause is a subset of the one for the WHERE clause. So, it’s safe to move an ON clause condition into a WHERE clause. Attempting to move a WHERE clause condition to an ON clause may require more care.*
  1443  
  1444  The first step here is to convert the parse tree into a list of conditions that can be put back together using AND clauses. Then we analyze each condition separately and decide if it can be pushed into a Route or RouteMerge.
  1445  
We look at the columns that each condition references and identify the rightmost group among them, and we push the condition into that group. Any references outside of that group will later be changed to bind vars. Pushing the condition to the rightmost group guarantees that the values for all the other column references will already be available.
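Since joins are executed left to right, the target-selection rule reduces to picking the highest group number a condition references. A sketch (the group-numbering convention is illustrative):

```go
package main

import "fmt"

// pushTarget picks the group a WHERE condition should be pushed into:
// the rightmost (highest-ordered) of the groups whose columns it
// references. By then, bind variables for all earlier groups exist.
func pushTarget(referencedGroups []int) int {
	max := 0
	for _, g := range referencedGroups {
		if g > max {
			max = g
		}
	}
	return max
}

func main() {
	// cond2(a.col, c.col) references groups 1 (a) and 3 (c),
	// so it is pushed into c's query with a.col as a bind var:
	fmt.Println(pushTarget([]int{1, 3})) // 3
}
```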
  1447  
  1448  Each time a WHERE clause gets successfully pushed into a Route or RouteMerge, we re-evaluate the routing plan to see if it can be improved. The rules for this are described in the vindex document.
  1449  
  1450  There are situations where we cannot push down:
  1451  
  1452  ###### LEFT JOIN
  1453  
A WHERE clause cannot be pushed inside a LEFT JOIN, because it has to be applied after the ON clause is processed, which is not until `LeftJoin` returns the results.
  1457  
  1458  *There are some practical use cases where you’d want to add conditions in the WHERE clause, typically for NULL checks. To support this in the future, we can use this technique that converts a LEFT JOIN into a normal JOIN:*
  1459  
  1460  `from a left join b on b.id=a.id where b.id is NULL`
  1461  
  1462  *After split, the query on b should have become:*
  1463  
  1464  `from b where b.id=a.id`
  1465  
  1466  *We cannot push down the null check in the above case. But we can, if we rewrote it as:*
  1467  
  1468  `from (select 1) as t left join b on b.id=a.id where b.id is NULL`
  1469  
  1470  *The (select 1) acts as a fake driver row for the left join, which allows us to tack on the where clause and also change the `LeftJoin` to a `Join`.*
  1471  
  1472  ###### Subqueries
  1473  
  1474  To be able to push a subquery, we have to determine that the subquery does not escape out of the shard of the parent query. For this, we identify the target group using the same method as above. Then we check to see if the subquery can get a Route plan where the input is the same keyspace id as the group’s query. If this is true, then the where clause can be pushed down just like a normal condition.
  1475  
  1476  If this condition is not met, the subquery becomes a separate group. This complexity is beyond what can be expressed using the four primitives we have. So, we’ll address such constructs when we add more primitives to the mix.
  1477  
  1478  When recombining WHERE clauses, additional parenthesis will be needed to prevent lower precedence operators from changing the intended order.
  1479  
  1480  ###### An example
  1481  
  1482  `from a join b on b.id=a.id join c on c.id=b.id where cond1(a.col, b.col) and cond2(a.col, c.col)`
  1483  
If a, b and c were in different groups, the output would be:
  1485  
  1486  ```
  1487      J
  1488     / \
  1489    J   c where (c.id=b.id) and (cond2(a.col, c.col))
  1490   / \
  1491  a   b where (b.id=a.id) and (cond1(a.col, b.col))
  1492  ```
  1493  
  1494  The cond2 expression gets pushed into the where clause for table ‘c’ because it’s the right-most group that’s referenced by the condition. External references will be changed to appropriate bind variables by the rewiring phase.
  1495  
  1496  *Once VTGate acquires the ability to perform its own filters, should we stop pushing these conditions into the dependent queries and do it ourselves instead? The answer will usually be no. You almost always want to push down filters. This is because it will let the underlying database scan fewer rows, or choose better indexes. The more restrictive the query is, the better.*
  1497  
  1498  At the end of the analysis, if no WHERE clauses remain, we can go to the next step. If not, the query is too complex, and we error out.
  1499  
  1500  ### SELECT clause, non-aggregate case
  1501  
The SELECT clause can represent two relational primitives: a Select or an Aggregate (or both). In the case of an Aggregate, it works in conjunction with the GROUP BY clause. So, it’s simpler if the analysis branches off depending on whether the SELECT clause has aggregates or not.
  1503  
  1504  Before starting to process select, we have to perform the pre-step of naming the results. During this process, we generate a name for each item in the SELECT clause. Additionally, if a column is a faithful reference to an underlying table column, we add this qualification. This information will be used for future analysis.
  1505  
  1506  The important property of a non-aggregate SELECT clause is that the column expressions don’t depend on each other. Therefore, they can be split and pushed down past joins into different groups.
  1507  
  1508  If an expression references columns from a single group, then it can be pushed down into that group.
  1509  
If an expression references columns from more than one group, then the expression cannot be pushed down, and we fail the query.
  1511  
  1512  *It is possible to push-down combo expressions using the same technique we use for WHERE clauses. For example:*
  1513  
  1514  `select a.col+b.col from a join b where b.foo=a.bar`
  1515  
  1516  *can be resolved with the following join*
  1517  
  1518  ```
  1519  select a.col, a.bar from a // produce _a_col & _a_bar
  1520  Join
  1521  select :_a_col+b.col from b where b.foo=:_a_bar
  1522  ```
  1523  
  1524  *However, such queries are not very efficient for VTTablet. It can’t cache field info. Such push-downs don’t benefit MySQL either. So, it’s better to wait till VTGate implements the ability to evaluate expressions.*
  1525  
  1526  SELECT can also contain subqueries. The same restriction as WHERE clauses can be applied here: If the keyspace id of the subquery is not the same as the outer query’s, we fail the query. Otherwise, we push it down.
  1527  
  1528  ###### ‘*’ expressions
  1529  
Since VTGate does not know all the columns of a table, a ‘SELECT *’ may not be resolvable. We could consider disallowing it altogether. However, there are situations where we could allow it:
  1531  
  1532  * If the underlying primitive is a Route or RouteMerge, we could just push it down without knowing what it means.
  1533  * If the underlying ‘table’ is a subquery, we know the full column list. In such cases, we know how to expand the ‘*’ into an actual list.
  1534  
  1535  ###### Additional expressions for joins
  1536  
  1537  Note: This phase could be postponed till the time we do the wire-up.
  1538  
  1539  After the SELECT expressions are pushed down, we have to see if there are dependencies across joins that are still not addressed. If there are WHERE clause dependencies from one group to another that are not listed in the SELECT expression, we need to add those to the list of expressions. This is where we use the info for each select column to see if it already references a column we need for the RHS query. Here’s an example of the metadata in a Join primitive:
  1540  
  1541  `select a.col1, b.col1, a.col3 from a join b on b.col2=a.col2`
  1542  
  1543  will be represented as
  1544  
  1545  ```
  1546  Join:
  1547    LHS: select a.col1, a.col3, a.col2 from a
  1548    Bindvars: "_a_col2": 2 // Col 2 produces bind var _a_col2
  1549    RHS: select b.col1 from b where b.col2=:_a_col2
  1550    Select: [ LHS(0), RHS(0), LHS(1) ]
  1551  ```
  1552  
  1553  Consequently, this means that the lower level routes fetch columns that should not be returned in the result. This is the reason why a Join primitive has to also specify the columns that have to be returned in the final result.
  1554  
  1555  *It’s safe to assume that no further changes will be made to the SELECT list, and consider it finalized. Subsequent clauses will only be able to operate on the outcome of the result. This fact is not material right now. But it may become significant when VTGate starts to perform other operations. Specifically, VTGate may split the SELECT clause into a Select and Aggregate and push the Select part down. If the select list changes due to other external dependencies, then the Aggregate cannot be pushed down.*
  1556  
  1557  *MySQL allows you to address columns that are not in the SELECT list in GROUP BY and HAVING clauses, which can further affect the Select list. This is not standard SQL.*
  1558  
  1559  ###### Handling empty lists
  1560  
  1561  The converse of the above rule is the case where the result references only columns from one side of a join. In such cases, the other side of the join doesn’t have a SELECT expression yet. For such situations, we create a dummy ‘select 1’ to make the query work. Example:
  1562  
  1563  `select a.col from a join b on b.col=a.col`
  1564  
  1565  will be represented as:
  1566  
  1567  ```
  1568  Join:
  1569    LHS: select a.col from a
  1570    Bindvars: "_a_col": 0
  1571    RHS: select 1 from b where b.col=:_a_col
  1572    Select: [LHS(0)]
  1573  ```
  1574  
  1575  ###### Pushing past Filter
  1576  
  1577  If/when VTGate implements filters, we may have to ask if a SELECT clause can be pushed past such filters into a Route primitive. The answer is yes, because Select and Filter are orthogonal operations. However, we may prefer performing the Select at the VTGate level instead of pushing it down.
  1578  
  1579  ### SELECT with aggregates
  1580  
  1581  If a SELECT clause uses aggregate functions, then it can be pushed down if the underlying node is a Route. As usual, subquery escape analysis will still need to be applied.
  1582  
  1583  If the underlying node is a RouteMerge, then it can still be pushed down if one of the non-aggregate result columns has a unique vindex. Such a constraint guarantees that all the rows of each group are within one shard.
  1584  
  1585  You generally can’t push an Aggregate past a join. Theoretically, there are situations where you can do this, but such opportunities are rare. So, it’s not worth considering these optimizations.
  1586  
  1587  ### GROUP BY
  1588  
  1589  The act of pushing an Aggregate down also necessitates that the associated GROUP BY clause be pushed. It’s just the other side of the same coin.
  1590  
  1591  The GROUP BY clause is slightly unusual because its contents can actually be automatically inferred in most cases. More often than not, the SQL engine has to validate that the GROUP BY columns match the inferred values.
  1592  
It is possible for a GROUP BY to be present while the SELECT doesn’t have aggregates. It’s a common technique people use to dedupe rows in the result. In such cases, the same rules as an Aggregate apply to the GROUP BY clause.
  1594  
  1595  *As mentioned previously, MySQL analyzes the GROUP BY clause using the query’s scope. It also does not enforce that the GROUP BY columns are consistent with the SELECT. For example, this is valid in MySQL:*
  1596  
  1597  `select a.col1, a.col2 from a group by a.col1`
  1598  
*This is also valid:*
  1600  
  1601  `select a.col1 from a group by a.col1, a.col2`
  1602  
  1603  *Both of the above constructs are invalid by SQL standards. We could piggy-back on this lax attitude and just pass-through the GROUP BY as is. After all, these are not very meaningful constructs. If MySQL allows them, why should we work against them?*
  1604  
  1605  ### HAVING
  1606  
  1607  We reach this phase only if SELECT and GROUP BY were successfully pushed.
  1608  
  1609  The difference between HAVING and WHERE is that the HAVING clause applies filters after the result is computed. So, it can only reference columns from the result.
  1610  
  1611  For each condition in the HAVING clause, we find out which result columns it references. If all those columns originate from the same group, then the clause is pushable into that group.
  1612  
  1613  For conditions that span multiple groups, we can apply the bind variable dependency technique just like we did for the WHERE clause. If we decide to do this, we’ll need a separate bind variable naming scheme, because these refer to result columns and not table columns.
  1614  
As usual, subquery escape analysis will need to be applied where needed.
  1616  
Note: HAVING clauses are usually very simple. So, there may be diminishing returns in allowing complexities here. However, if we’re just reusing code, we can allow them. Otherwise, we shouldn’t go overboard.
  1618  
  1619  ### ORDER BY
  1620  
Even though this clause is near the end of the execution chain, it’s a popular one. It’s also an expensive one. So, we have to make every effort to push the work down into the underlying routes.
  1622  
  1623  We’ll start with the usual restriction that this clause will only reference the result columns.
  1624  
  1625  If the underlying node is a Route, then the clause can just be pushed down.
  1626  
  1627  If the underlying node is a RouteMerge, then it cannot be pushed down.
  1628  
If the underlying node is a join, then we perform the following analysis: take each expression and identify the group it originates from. If it references more than one group, then we cannot push that expression down. Otherwise, if its group order is no less than that of the previous expression, the expression can be pushed. We continue this until we reach an expression that is not pushable, or we successfully push everything. Here are two examples:
  1630  
  1631  `select a.col, b.col, c.col from a join b join c order by a.col, b.col, c.col`
  1632  
  1633  This will produce:
  1634  
  1635  ```
  1636  select a.col from a order by a.col // group 1
  1637  select b.col from b order by b.col // group 2
  1638  select c.col from c order by c.col // group 3
  1639  ```
  1640  
  1641  Example 2:
  1642  
  1643  `select a.col, b.col, c.col from a join b join c order by a.col, c.col, b.col`
  1644  
  1645  For the above case, a.col is from group 1, c.col is from group 3, but b.col is from group 2, which is less than group 3. So, the last expression is not pushable.
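The left-to-right walk described above can be sketched as a small function. The group numbering matches the examples; the function is illustrative, not planner code:

```go
package main

import "fmt"

// pushableOrderBy takes, for each ORDER BY expression in turn, the join
// group it references (0 means the expression spans more than one
// group). Expressions are pushable as long as the group numbers never
// decrease; the return value is how many leading expressions can be
// pushed down.
func pushableOrderBy(orderByGroups []int) int {
	pushed := 0
	current := 1
	for _, g := range orderByGroups {
		if g == 0 || g < current {
			break // spans groups, or refers back to an earlier group
		}
		current = g
		pushed++
	}
	return pushed
}

func main() {
	fmt.Println(pushableOrderBy([]int{1, 2, 3})) // example 1: 3 (all pushable)
	fmt.Println(pushableOrderBy([]int{1, 3, 2})) // example 2: 2 (stops at b.col)
}
```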
  1646  
  1647  ### LIMIT
  1648  
  1649  The LIMIT clause has two forms:
  1650  
  1651  LIMIT m, n: Fetch m+n rows, but skip the first m and return the next n.
  1652  
  1653  LIMIT n: same as LIMIT 0, n.
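For clarity, here is a minimal sketch of the two forms’ semantics, with strings standing in for result rows:

```go
package main

import "fmt"

// applyLimit mirrors MySQL's LIMIT offset, count: skip the first
// `offset` rows and return at most `count` of the remainder. Note the
// executor still has to fetch offset+count rows from the route.
func applyLimit(rows []string, offset, count int) []string {
	if offset >= len(rows) {
		return nil
	}
	end := offset + count
	if end > len(rows) {
		end = len(rows)
	}
	return rows[offset:end]
}

func main() {
	rows := []string{"r1", "r2", "r3", "r4", "r5"}
	fmt.Println(applyLimit(rows, 2, 2)) // LIMIT 2, 2 -> [r3 r4]
	fmt.Println(applyLimit(rows, 0, 3)) // LIMIT 3    -> [r1 r2 r3]
}
```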
  1654  
  1655  If the underlying node is a Route, then the LIMIT can be pushed down.
  1656  
  1657  If the underlying node is a RouteMerge, then we fail the query. This is because ORDER BY cannot be pushed into a RouteMerge, and a LIMIT without an ORDER BY is not very meaningful. Also, the order of rows is unpredictable for RouteMerge. So, it’s even less valuable to apply a limit.
  1658  
  1659  In the case of a join, the query could be passed down without a limit, and we could apply the limit to the joined result at the VTGate level. Maybe this is not worth supporting.
  1660  
  1661  ## Wire-up
  1662  
  1663  If we’ve reached this phase, it means that we have successfully broken up the original query and pushed the different parts into route primitives that are now wired together by joins.
  1664  
  1665  However, these parts would still be referring to each other. This section will describe how those dependencies will get resolved.
  1666  
  1667  In order to resolve dependencies correctly, we need the symbol tables that were used for analyzing each context. With the symbol table at hand, we walk the AST for each group, and look up the referenced symbols in the symbol table. If the symbol belongs to the same group, then nothing needs to be done. If the reference is outside, we perform the following steps:
  1668  
  1669  1. Change the external reference to the corresponding bind var name.
  1670  2. If that column is not already in the select list of the other group, we add it.
  1671  3. Mark the column number to be exported as the bind var name that we chose.
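Steps 1-3 amount to bookkeeping on the producing route’s select list. A hypothetical sketch, with types and names invented for illustration:

```go
package main

import "fmt"

// route is a stand-in for a route primitive's build-time state.
type route struct {
	selectCols []string       // the route's current select list
	exports    map[int]string // column index -> bind var name it feeds
}

// exportColumn resolves an external reference to table.col against this
// route: it returns the bind var name the consuming group should use
// (step 1), adds the column to the select list if it isn't already
// there (step 2), and marks the column number for export (step 3).
func (r *route) exportColumn(table, col string) string {
	name := fmt.Sprintf("_%s_%s", table, col)
	qualified := table + "." + col
	for i, c := range r.selectCols {
		if c == qualified {
			r.exports[i] = name
			return name
		}
	}
	r.selectCols = append(r.selectCols, qualified)
	r.exports[len(r.selectCols)-1] = name
	return name
}

func main() {
	r := &route{selectCols: []string{"a.id"}, exports: map[int]string{}}
	fmt.Println(r.exportColumn("a", "col")) // _a_col
	fmt.Println(r.selectCols)               // [a.id a.col]
}
```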
  1672  
  1673  This step has a caveat: if the SELECT list contains an Aggregate, then we cannot add such a column. However, we’re currently protected by the constraint that Aggregate cannot be pushed into a join, and external dependencies can only happen for joins. But this is something to keep in mind as we add more primitives and flexibility.
  1674  
  1675  ### Accessing the symbol table
  1676  
  1677  Currently, the AST contains only information parsed from the query. When we analyze the query, we build various symbol tables. However, they would go away as we finish each analysis.
  1678  
  1679  Since we need them again for resolving dependencies, we have to store them somewhere. We also have to retrieve the right one for each subquery. This gives rise to a dependency issue: Subqueries can be anywhere in the parse tree. So, it’s unnatural to have an outside data structure that points to various parts of the AST.
  1680  
  1681  So, we make a compromise: we extend the Select structure and add a Symbols (any) member to it. Every time we finish analyzing a subquery, we set the Symbols member to point to the symbol table. With this compromise, the rest of the code base will flow naturally while resolving dependencies.
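A minimal sketch of the compromise, with illustrative types in place of the actual sqlparser definitions:

```go
package main

import "fmt"

// Select stands in for the parsed SELECT node. The analyzer fills in
// the opaque Symbols field once this scope is analyzed, so dependency
// resolution can find the right symbol table for any subquery without
// an external side structure pointing into the AST.
type Select struct {
	// ... parsed fields elided ...
	Symbols interface{} // set by the analyzer; opaque to the parser
}

// symtab is a stand-in for the analyzer's symbol table.
type symtab struct {
	tables map[string]int // table alias -> join group
}

func main() {
	sel := &Select{}
	// After analysis, the symbol table is attached to the node itself.
	sel.Symbols = &symtab{tables: map[string]int{"a": 1, "b": 2}}
	fmt.Println(sel.Symbols.(*symtab).tables["b"]) // 2
}
```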
  1682  
  1683  ## Plan generation
  1684  
  1685  Now, all we have to do is generate the execution plan. During this step, we just generate the code for the AST of each route node, and perform additional substitutions needed to perform the correct routing, as dictated by the vindexes chosen for each route.
  1686  
  1687  A Route node will become one of SelectUnsharded, SelectEqual, or SelectKeyrange.
  1688  
  1689  A RouteMerge will become one of SelectScatter or SelectIN.
  1690  
  1691  # Future primitives
  1692  
  1693  Looking at how far we can go with just four primitives is pretty amazing. But we can’t stop there. There is more low-hanging fruit we can get to, and then we can also start looking at the higher-hanging kind.
  1694  
  1695  ## MergeSort and MergeAggregate
  1696  
  1697  If the ORDER BY clause can be pushed down for a Route, then we’re not far away from supporting it for a RouteMerge also. The results from each shard already arrive sorted. Merge-sorting such results becomes trivial.
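The merge step itself is a standard two-way (or k-way) merge. A sketch over two pre-sorted shard results, with ints standing in for ORDER BY keys and collation ignored:

```go
package main

import "fmt"

// mergeSorted combines two already-sorted shard results into one sorted
// stream. This is the core of merge-sorting a RouteMerge whose ORDER BY
// was pushed into each shard query.
func mergeSorted(a, b []int) []int {
	out := make([]int, 0, len(a)+len(b))
	for len(a) > 0 && len(b) > 0 {
		if a[0] <= b[0] {
			out = append(out, a[0])
			a = a[1:]
		} else {
			out = append(out, b[0])
			b = b[1:]
		}
	}
	out = append(out, a...) // at most one of a, b is non-empty here
	return append(out, b...)
}

func main() {
	// Each shard returned its rows already ordered.
	fmt.Println(mergeSorted([]int{1, 4, 7}, []int{2, 3, 9})) // [1 2 3 4 7 9]
}
```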
  1698  
  1699  Along the same lines, if we ordered the rows of an Aggregate by the columns it’s grouping, we can push down such a query through a RouteMerge, and perform the final aggregation as the results come in.
  1700  
  1701  The one challenge is collation. We’ll need to make sure that our collation algorithm can match what MySQL does.
  1702  
  1703  ## Subquery
  1704  
  1705  Uncorrelated subqueries can be trivially ‘pulled out’ to be executed separately, and their results can be pushed down into the query that needs the result.
  1706  
  1707  Here’s an example:
  1708  
  1709  `select * from a where col in (select col from lookaside)`
  1710  
  1711  This can be expressed as:
  1712  
  1713  ```
  1714  Subquery:
  1715    _lookaside: select col from lookaside
  1716    query: select * from a where col in ::_lookaside
  1717  ```
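At execution time, the pulled-out plan runs in two steps: the inner query first, then the outer query with the inner result bound as a list bind variable. A sketch, with a fake executor standing in for real route execution and ints standing in for rows:

```go
package main

import "fmt"

// execSubqueryPlan runs the pulled-out subquery plan from the example:
// execute the lookaside query, bind its result as _lookaside, then
// execute the outer query. exec is a stand-in for route execution.
func execSubqueryPlan(exec func(query string, bindVars map[string]interface{}) []int) []int {
	bindVars := map[string]interface{}{}
	bindVars["_lookaside"] = exec("select col from lookaside", nil)
	return exec("select * from a where col in ::_lookaside", bindVars)
}

func main() {
	// Fake executor: lookaside holds {1, 3}; table a holds col values 1..4.
	fake := func(query string, bindVars map[string]interface{}) []int {
		if query == "select col from lookaside" {
			return []int{1, 3}
		}
		seen := map[int]bool{}
		for _, w := range bindVars["_lookaside"].([]int) {
			seen[w] = true
		}
		var out []int
		for _, v := range []int{1, 2, 3, 4} {
			if seen[v] {
				out = append(out, v)
			}
		}
		return out
	}
	fmt.Println(execSubqueryPlan(fake)) // [1 3]
}
```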
  1718  
  1719  ## Select & Filter
  1720  
  1721  If we build an expression engine, we can do our own evaluation and filtering when something is not pushable. Adding support for this will automatically include correlated subqueries, because the result of the underlying query can be used to execute the subquery during the Select or Filter stage. Example:
  1722  
  1723  `select a.id from a where a.id = (select id from b where b.col=a.col)`
  1724  
  1725  Plan:
  1726  
  1727  ```
  1728  Filter:
  1729    query: select a.id, a.col from a
  1730    bindvars: "_a_col": 1
  1731    constraint: a.id = ConvertVal: Route:
  1732      select id from b where b.col=:_a_col
  1733  ```
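A sketch of how such a Filter could execute, with a callback standing in for the inner route (`select id from b where b.col=:_a_col`) and simple structs standing in for rows:

```go
package main

import "fmt"

// row is a stand-in for one row of the outer result (a.id, a.col).
type row struct{ id, col int }

// filterCorrelated evaluates the Filter stage from the plan above: for
// each outer row, bind its a.col value, execute the inner route, and
// keep the row only if the constraint a.id = (subquery) holds.
func filterCorrelated(outer []row, innerQuery func(aCol int) int) []int {
	var ids []int
	for _, r := range outer {
		if r.id == innerQuery(r.col) {
			ids = append(ids, r.id)
		}
	}
	return ids
}

func main() {
	bByCol := map[int]int{10: 1, 20: 5} // fake table b: b.col -> b.id
	outer := []row{{id: 1, col: 10}, {id: 2, col: 20}}
	fmt.Println(filterCorrelated(outer, func(c int) int { return bByCol[c] })) // [1]
}
```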
  1734  
  1735  ## Other expensive primitives
  1736  
  1737  The rest of the primitives have potentially unbounded memory consequences. For these, we’ll have to explore map-reduce based approaches.