vitess.io/vitess@v0.16.2/doc/design-docs/V3HighLevelDesign.md

vitess.io/vitess@v0.16.2/doc/design-docs/V3HighLevelDesign.md (about)

1 # V3 high level design
2
3 # Objectives
4
5 The goal of this document is to describe the guiding principles that will be used to implement the full SQL feature set of Vitess’s V3 API. We start off with the high level concepts on SQL and its features, and iterate our way into some concrete design proposals.
6
7 ### Prerequisites
8
9 Before reading this doc you must be familiar with [vindexes](https://github.com/vitessio/vitess/blob/main/doc/V3VindexDesign.md), which is used as foundation for the arguments presented here.
10
11 # Background
12
13 VTGate was initially conceived as a router that encapsulated the logic of deciding where to send the application’s queries. The primary objective was to isolate the application from maintenance events like reparenting, server restarts, and resharding. As of the V2 API, this problem is pretty much solved.
14
15 One drawback of the V2 API is that it requires the application to be aware of the sharding key. This key had to be sent with every query. So, unless an application was written from the ground up with this awareness, the task of migrating it to use vitess was rather daunting.
16
17 The question was asked: could VTGate be changed to not require the sharding key in its API? Could VTGate make the sharded database look like a single database?
18
19 This was how V3 was born. A working proof of concept was developed. This showed that we can automatically route queries that targeted single shards. We could also route queries that targeted multiple shards as long as the results could be combined without any additional processing.
20
21 The next question was: Can we support more SQL constructs? How far can we go? Can we go all the way and support everything?
22
23 These are the questions we’ll try to answer in the subsequent sections.
24
25 # TL;DR
26
27 This section gives a preview of what this document tries to achieve.
28
29 ## Current V3
30
31 The current V3 functionality is capable of correctly serving simple SQL statements that reference a single table. For example, if id is the sharding column for a:
32
33 `select a.col from a where a.id=:id` will get routed to the correct shard.
34
35 `select a.col from a where a.id in ::list` will get broken out into shard-specific in statements and routed to those specific shards
36
37 `select a.col from a where a.col=:col` will get sent to all shards and results will be combined and returned
38
39 The current V3 also supports simple DMLs. More complex SQLs like joins, aggregations and subqueries are not supported.
40
41 ## Next phase: support more queries
42
43 If the first phase of this document is implemented, VTGate will be able to support the following constructs:
44
45 ### Joins
46
47 `select a.col, b.col from a join b on b.id=a.id where a.id=:id` will get routed to the correct shard. Correspondingly, IN clauses will also will get correctly broken out and routed.
48
49 `select a.col, b.col from a join b on b.id=a.id `will get sent to all shards.
50
51 If the data is on multiple shard, VTGate will perform the join:
52
53 `select a.col, b.col from a join b on b.id2=a.id where a.id=:id`
54
55 will get rewritten as two queries, one dependent on the other:
56
57 ```
58 select a.col, a.id from a where a.id=:id -> yield "_a_id" from col 1
59 select b.col from b where b.id2=:_a_id
60 ```
61
62 Joins can group tables, as long as the order is correctly specified:
63
64 `select a.col, b.col, c.col from a join b on b.id=a.id join c on c.id2=a.id where a.id=:id`
65
66 will get rewritten as:
67
68 ```
69 select a.col, b.col, a.id from a join b on b.id=a.id where a.id=:id
70 select c.col from where c.id2=:_a_id
71 ```
72
73 Joins can also group other where clauses and wire-up cross-table dependencies:
74
75 `select a.col, b.col from a join b on b.id2=a.id where a.id=:id and b.col > a.col`
76
77 will get rewritten as:
78
79 ```
80 select a.col, a.id from a where a.id=:id
81 select b.col from b where b.id2=:_a_id and b.col > :_a_col
82 ```
83
84 Joins can also be cascased:
85
86 `select a.col, b.col, c.col from a join b on b.i2=a.id join c on c.id3=b.id2 where a.id=:id`
87
88 will get rewritten as:
89
90 ```
91 select a.col, a.id from a where a.id=:id
92 select b.col, b.id2 from b where b.id2=_a_id
93 select c.col from c where c.id3=_b_id2
94 ```
95
96 LEFT JOIN will also be supported for all the above constructs. However, WHERE clauses that add additional constraints to the nullable parts of a left join will not be supported; They cannot be pushed down.
97
98 ### Subqueries
99
100 Correlated subqueries that can be executed as part of the main query will be supported:
101
102 `select a.id from a where exists (select b.id from b where b.id=a.id) and a.id=:id` will get routed to the correct shard.
103
104 Correlated subqueries can also be sent as scatter queries, as long as the join constraints are met.
105
106 Subqueries in the FROM clause can be joined with other tables:
107
108 `select a.id, a.c, b.col from (select id, count(*) c from a) as a join b on b.id=a.id`
109
110 will get rewritten as:
111
112 ```
113 select a.id, a.c from (select id, count(*) c from a) as a
114 select b.col from b where b.id=:_a_id
115 ```
116
117 Subqueries in the WHERE clause and SELECT list will be supported only if they can be grouped. They will not be broken out.
118
119 ### Aggregation
120
121 Aggregation will be supported for single-shard queries, but they can contain joins and subqueries.
122
123 You can also use aggregation for scatter queries if a unique vindex is in the list of result columns, like this one:
124
125 `select id, count(*) from a`
126
127 ### Sorting
128
129 Sorting will be supported as long as it can be meaningfully pushed down into the grouped parts. For example,
130
131 `select a.col, b.col from a join be where b.id=a.id order by a.id, b.id`
132
133 will be rewritten as:
134
135 ```
136 select a.col from a order by a.id order by a.id
137 select b.col from b where b.id=:_a_id order by b.id
138 ```
139
140 ### All combinations
141
142 The above features are designed to be orthogonal to each other. This means that VTGate can support simultaneous combinations of the above features, and with arbitrary levels of nesting and cascading.
143
144 ## Going beyond
145
146 Once the above phase is complete, we can iterate on performance features as well as add support for more constructs. It will depend on customer demand.
147
148 # High level concepts
149
150 The V3 proof of concept drew parallels between database indexes and vitess’s indexing scheme. A database deals with tables, but VTGate deals with databases. It turns out that a table and a database have a surprising number of similarities. The basic guarantee of an index is that it lets you limit the number for rows you have to inspect in order to compute the result of a query. In the same way, vindexes give you the guarantee that it will be sufficient to send your queries to a subset of all the shards.
151
152 As a quick recap, a vindex is a cross-shard index maintained by vitess. Given a column constraint, the vindex is one that’s capable of telling which shards contain the rows needed to satisfy a query. In this document, we’ll use the term ‘unique vindex’ quite often. A unique vindex is one that can produce only one shard as target for a column equality constraint. However, there may still be many rows that match that constraint. But they’ll always be within that one shard.
153
154 The following table further illustrates the similarities of the concepts:
155
156 <table>
157 <tr>
158 <td>RDBMS</td>
159 <td>Vitess</td>
160 </tr>
161 <tr>
162 <td>Table</td>
163 <td>Database</td>
164 </tr>
165 <tr>
166 <td>Primary Key</td>
167 <td>Sharding Key</td>
168 </tr>
169 <tr>
170 <td>Index</td>
171 <td>Vindex</td>
172 </tr>
173 <tr>
174 <td>Foreign Key</td>
175 <td>Shared Vindex</td>
176 </tr>
177 </table>
178
179
180 This coincidence is not accidental. For example, if you used a vindex to route a query to a shard, it’s likely that the database also has an index on that same column, with properties similar to the vindex. If you have a shared vindex between two tables, it’s very likely that there exists a similar foreign key relationship between the tables.
181
182 The foundation for these similarities comes from the fact that a table that was once in a single database is now partitioned out. So, the indexes for those tables have to also grow out of the original database.
183
184 There is another strong parallel between a table and a database: A SELECT statement and a table are interchangeable. On one hand, all the data in table t can be represented as `‘select * from t’. `On the other hand, you can use a SELECT statement as stand-in for a table. So, many of the approaches used to optimize SQL for scanning tables can be re-used to orchestrate queries that span multiple shards or databases. The main difference is that you can treat bigger SQL chunks as tables, and can outsource that work to the database that hosts them.
185
186 But then, there’s only so much you can outsource. As soon as a query exceeds a certain level of complexity, where the required data needs to be pulled from more than one database, then any subsequent work needs to be done by VTGate. If we want to support the full SQL syntax, VTGate will have to eventually become a full-fledged SQL engine.
187
188 VTGate’s SQL engine will likely not be as good as what the underlying database can do, at least initially. So, until such time, the goal will be to push as much of the work down to the databases, and do as little as possible with what comes out.
189
190 ## Limitations
191
192 The above arguments can more or less be used as proof that we can theoretically support the full SQL syntax. In the worst case scenario, VTGate will have to individually scan each table involved in the query and perform the rest of the work by itself.
193
194 But there are some things it cannot do:
195
196 ### Cross-shard consistency
197
198 There is currently no way to get a system-wide consistent snapshot of all the keyspaces and shards. If reads are being performed during in-flight cross-shard transactions, then VTGate may see such data as committed in one database, and not committed in the other.
199
200 We could rely on MySQL’s MVCC feature that will give you a view of each database as of when a transaction was started, but this is still not perfect because there is no guarantee about the exact timing of each BEGIN, and there are also replication lag related inconsistencies. We currently don’t support read transactions for replicas, but it can be added.
201
202 The final option is to lock those rows before reading them. It’s a very high price to pay, but it will work.
203
204 ### Environmental differences
205
206 VTGate will be running on a different machine than the databases. So, functions that rely on the current environment will not produce consistent results. For example, time functions will likely have variations depending on clock skew.
207
208 # SQL for dummies
209
210 An upfront clarification: we’ll be only focusing on SELECT statements in this document, because that’s where all the complexities are. Also, Vitess currently allows only single-shard DMLs. So, most of this analysis will not be applicable for them.
211
212 ## Dissecting SELECT
213
214 A single table SELECT is a relatively simple thing. Let’s dissect the following statement:
215
216 ```
217 select count(*), a+1, b
218 from t
219 where c=10
220 group by a+1, b
221 having count(*) > 10
222 order by count(*) desc
223 limit 10
224 ```
225
226 The steps performed by the SQL engine are as follows:
227
228 1. Scan t (FROM)
229 2. Filter rows where c=10 (WHERE)
230 3. Remove all columns other than a & b, and compute a+1 (SELECT)
231 4. Count and group unique a+1, b values (GROUP BY)
232 5. Filter counts that are > 10 (HAVING)
233 6. order the rows by descending count (ORDER BY)
234 7. Return the top 10 results (LIMIT)
235
236 The one thing that’s not obvious in a SELECT is that the select list is not the first clause that’s executed. It’s the FROM.
237
238 ## SELECT as a subquery
239
240 If a SELECT statement was used just to fetch results, things would be very simple. However, it can be used in other parts of a query, and can produce 5 kinds of values:
241
242 1. A simple value: `select * from t where id = (select 1 from dual)`
243 2. A value tuple: multiple values of the same type: `select * from t1 where id in (select val from t2)`
244 3. A row tuple: multiple values of different types. A row tuple can also be used as a value: `select * from t1 where (a,b) = (select b,c from t2 where pk=1)`
245 4. Rows: results of a regular query. Note that rows can be treated as a value tuple of row tuples. For example, this statement is valid: `select * from t1 where (a,b) in (select b,c from t2)`
246 5. A virtual table: If you don’t strip out the field info from the original query, then a select can act as a table, where the field names act as column names. For example, this statement is valid: `select * from (select a, b from t2) as t where t.a=1`
247
248 This opens the door to some powerful expressibility, the other edge of the sword being indefinite complexity.
249
250 Subqueries can be found in the following clauses:
251
252 1. FROM
253 2. SELECT list
254 3. WHERE
255
256 In a way, subqueries are brain-dead, including the correlated ones. You just execute your main query as specified by the sequence in the ‘Dissecting SELECT’ section. Any time you encounter a subquery, you just evaluate it as an expression on-the-fly and use the resulting value. The only complication of a correlated subquery is that it references a value from the outer Result to produce its own Result.
257
258 An immediate optimization: If a subquery is not correlated, we know that its value is the same for every row of the outer query. So, such a subquery can be "pulled out" and evaluated only once, or we could use lazy evaluation and caching. But we’re venturing into optimizer territory, which we should refrain from for now.
259
260 Based on the above properties, we can make some observations:
261
262 * A subquery in a FROM clause cannot reference columns of the SELECT list. This is because of the execution order: the SELECT list depends on what comes out of the FROM clause, not the other way around. However, such a subquery can still reference columns from an outer SELECT. But MySQL is more restrictive, and prevents all correlation for subqueries in the FROM clause.
263 * A subquery in a SELECT list cannot affect the original number of rows returned. All it can do is supply the value for the column it was meant to compute the result for. If there was no result, the value would be NULL. A subquery in a SELECT list must produce a single scalar value.
264 * A subquery in a WHERE clause cannot change any values of the original result, or produce new rows. It can only cause rows to be eliminated.
265
266 It’s easy to get lost in the world of subqueries, and it may get confusing when you start to think about how they interact with joins. But you can retain sanity by following these principles while analyzing a complex query:
267
268 * Tables involved in a join are within the same scope, and can reference each other's columns.
269 * Subqueries are in an inner scope. An inner query can use values from the outer query. Other than this, all it can do is produce a value.
270 * Other parts of the outer query cannot reference anything about the subquery. It’s a black box in that respect.
271
272 Correlated subqueries exhibit the properties of a join. So, there are situations where they could be converted to actual joins by the optimizer.
273
274 It’s now time to drill down into the details of the operators. The notations below are not very formal. They’re there mainly to illustrate the concepts.
275
276 ## What kind of machine is SQL?
277
278 Primer on automata theory: [https://en.wikipedia.org/wiki/Automata_theory](https://en.wikipedia.org/wiki/Automata_theory).
279
280 Is SQL a turing machine? The answer is no, because it can only make forward progress, and always terminates.
281
282 Strangely, there are no references on the internet about what kind of machine SQL is. Looking at whether it can be a finite state machine, the answer seems to be no, because of subqueries.
283
284 So, it appears to be a Pushdown Automaton (PDA). This determination is important because it will help us decide the data structure needed to represent an execution plan.
285
286 Specifically, a tree structure is sufficient to represent a PDA.
287
288 ## The operands
289
290 This is the list of ‘values’ that a query engine deals with in order to satisfy a query:
291
292 1. Result: This is the core data type that every operator deals with. It’s a complex data structure. Sometimes, the primitives may access only a part of the Result. So, the interchangeability rules of primitives will be different based on which parts of the Result each of them accesses.
293 2. Value expression list: This is used in various places in a statement. Most notably, for converting a table scan into the list of values to be returned.
294 3. Conditional expression: This is used to filter out results.
295 4. Aggregate expression list: This has some similarities with a value expression list, but it’s worth differentiating.
296 5. Scalar values: Used all over.
297
298 The above operands are a simplified way of looking at things. For example, a conditional expression is a separate world that has its own set of operators and values, and they follow a different set of algebraic rules. But we need to divide and conquer.
299
300 You’ll see how these operands are used in the description of the operators.
301
302 ## The operators
303
304 If one were to ask what are the minimal set of primitives needed to serve SQL queries, we can use [relational algebra](https://www.tutorialspoint.com/dbms/relational_algebra.htm) as starting point. But databases have extended beyond those basic operations. They allow duplicate tuples and ordering. So, here’s a more practical set:
305
306 1. Scan (FROM).
307 2. Join: Join two results as cross-product (JOIN).
308 3. LeftJoin: The algebraic rules for an outer join are so different from a Join that this needs to be a separate operator (LEFT JOIN).
309 4. Filter: Removing rows based on a conditional expression (WHERE, HAVING).
310 5. Select: Evaluating new values based on a value expression list (SELECT list).
311 6. Aggregate: Grouping rows based on an aggregate expression list (GROUP BY, UNION).
312 7. Sort (ORDER BY).
313 8. Limit (LIMIT).
314 9. Merge: Merge two results into one (UNION ALL).
315
316 Windowing functions are out of scope for this doc.
317
318 Even the above primitives are fairly theoretical; One would rarely perform a Join without also simultaneously applying a join condition.
319
320 For a given SELECT statement, the order of these operators are fixed. They are as follows:
321
322 `Scan <-> (Join & LeftJoin) -> Filter(where) -> Select -> Aggregate -> Filter(having) -> Sort -> Limit -> Merge`
323
324 However, following the above order is almost never optimal. This is where we have to study how these operators interact, and try to find ways to resequence them for higher efficiency.
325
326 It’s important to note that most of these operators take a Result as input and produce one as output. This means that they can be indefinitely combined. This is achieved by re-SELECTing from a Result. It’s usually rare that people perform such nested selects, but there are legitimate use cases, typically for computing aggregates of aggregates.
327
328 Additionally, Filter and Select can perform their own scans (subqueries) as part of their expressions. This adds another dimension to the overall complexity.
329
330 ### Scan
331
332 This is the most basic operation: given a table name, the operation produces a result, which is a list of all the rows in the table, with field names:
333
334 `Scan(tableName) -> Result`
335
336 Rethinking this for VTGate, the corresponding operation would be to route a query that targets a single database, and the output would be the result.
337
338 `Route(query, keyspace, shard) -> Result`
339
340 It’s implied that the query is the sql+bindvars.
341
342 *If every operator takes a Result as input, who produces the first Result?*
343
344 Scan is in this category. Without this operator, there is nothing that can map a query to a Result. Consequently, this is also the lowest level building block that all other operations depend on.
345
346 It turns out that this is the only part of VTGate that differs from a traditional SQL engine. But it’s a powerful difference. For well tuned apps, the entire query will most of the time be just passed through to one of the underlying databases.
347
348 The optimizer is likely to be very different, because its main task will be to figure out how to break up a query into parts that can be outsourced to different databases. For example, if there’s a three-way join and if two of those pieces can be sent to a single shard, then it has to split that query correctly into two. Because of this, we need the ability to translate a relational expression back into a query, or somehow find a way to reuse the parts of the original query that pertain to the relational expression.
349
350 ### Filter
351
352 The filter operator removes rows from the Result based on a condition:
353
354 `Filter(Result, condition) -> Result`
355
356 A condition is a boolean expression. The operands can be:
357
358 * A column of the Result
359 * A column of an outer Result (to support correlated subqueries)
360 * Boolean values or expressions
361 * Scalar values or expressions
362 * The result of a subquery
363
364 The operators are:
365
366 * Pure boolean operators like AND, OR, etc.
367 * Comparison operators like =, <, etc.
368 * Other complex operators and functions like IF, etc.
369
370 Inferred properties of Filter:
371
372 * The filter operation is applied to each row individually. There is no interaction between rows.
373 * A row is never modified.
374 * No new rows are produced.
375 * It does not change the order of the rows in the Result. This is not a requirement, but it’s a property worth maintaining.
376 * The AND clauses of a filter can be broken into parts and applied separately. This is useful for pushing down WHERE clauses into scans.
377
378 It would have been nice if the Result was the only variable input to Filter. However, this is not true if one of the operands is a subquery.
379
380 *There are two schools of thought when it comes to boolean algebra: Those that allow interconversion between scalar and boolean values, and those that don’t. The former considers any non-zero scalar value to be TRUE, and zero as FALSE. The latter requires the scalar value to be tested using a comparison operator before it becomes a boolean value.*
381
382 *The first school of thought allows statements like this:*
383
384 `select * from t where a`
385
386 *The second school of thought requires you to rewrite it as:*
387
388 `select * from t where a != 0`
389
390 *Vitess’s grammar currently doesn’t allow this interconversion, but MySQL does. From a practical perspective, this difference is not significant because it results in a minor syntactic inconvenience. One could argue that ‘a’ could be a boolean column. But boolean values are not common in databases, and there are other alternatives. So, Vitess doesn’t support boolean column types. It assumes that everything from a database (or Result) is a scalar.*
391
392 *Not allowing interconversion greatly simplifies the grammar. Also, the boolean engine and expressions engine can be kept separate, which is another simplification.*
393
394 ### Select
395
396 The Select operation produces a new Result by performing arithmetic operations to the input Result.
397
398 `Select(Result, expression list) -> Result`
399
400 An expression is a subset of a boolean expression. The operands can be:
401
402 * A column of the Result
403 * A column of an outer Result
404 * Scalar values or expressions
405 * The result of a subquery
406
407 The operators can be:
408
409 * Arithmetic operators like +, -
410 * Functions and other complex expressions
411
412 Properties of a Select:
413
414 * A Select produces the same number of rows as the input
415 * A Select may not produce the same columns as its input
416 * The expressions are applied to each row individually. There is no interaction between rows.
417 * The order of the result is not changed. This is not a requirement, but it’s a property worth maintaining.
418 * The different items (columns) of a Select can be broken into parts and independently computed.
419
420 ### Aggregate
421
422 Aggregate splits the Result column into two groups: the aggregate columns and the group by columns. It merges the rows whose group by columns are equal, and simultaneously applies aggregation functions to the aggregate columns.
423
424 `Aggregate(Result, aggrExpressionList(columns)) -> Result`
425
426 Any columns that are not in the aggregate expression list are implicitly in the group by list.
427
428 There are two special cases:
429
430 * There are no aggregate columns. In this case, the operation degenerates into a dedup (UNION).
431 * There are no group by columns. In this case, one and only one row is produced.
432
433 The SQL syntax allows expressions to be present in aggregate functions, but that can be broken down into Select->Aggregate.
434
435 Properties of Aggregate:
436
437 * It produces no more rows than the input.
438 * It produces the same number of columns as the input.
439 * Aggregate can be efficient if the rows are sorted by the group by columns. If so, it also retains the sort order.
440 * Aggregate can also be efficient if there are no group by columns. Then only one row is produced in the Result.
441 * The different items of an Aggregate cannot be independently computed.
442
443 ### Sort
444
445 Sort reorders the rows of a Result based on a list of columns with ascending or descending properties.
446
447 `Sort(Result, (columns asc/desc)) -> Result`
448
449 Properties:
450
451 * The number of rows produced is the same as the input
452 * No rows are changed
453 * An outer Sort supersedes an inner Sort
454 * Sorting can be broken into parts as long as the resulting row is built in the same order as the values referenced by the sort. For example, if we join a with b, and sorting is by a.col and b.col, then we can sort a by a.col and join with b by sorting b by b.col.
455
456 ### Limit
457
458 Limit returns at the most N rows from the specified starting point.
459
460 `Limit(Result, start, count) -> Result`
461
462 Limit has algebraic properties that are similar to Filter.
463
464 Properties:
465
466 * A limit doesn’t change the rows
467 * An outer limit supersedes an inner Limit
468 * A Limit is not interchangeable with anything other than Select.
469 * A Limit without a Sort has non-deterministic results.
470
471 ### Join
472
473 Join produces the cross-product of two Results.
474
475 `Join(Result, Result) -> Result`
476
477 This primitive is too inefficient in most situations. Here are some practical composites:
478
479 `InnerJoin(Result, Result, condition) ⇔ Filter(Join(Result, Result), condition)`
480
481 The use case for InnerJoin is easy to explain, because pushing down a condition usually leads to valuable efficiency improvements.
482
483 `QueryJoin(query, query, condition) ⇔ Filter(Join(Scan(query), Scan(query)), condition)`
484
485 The QueryJoin is relevant mainly because VTGate may not get good at performing joins for a long time. So, it may have to rely on sending nested loop queries to the underlying databases instead. However, there will be situations where a join will have to be performed on two Results. Such queries will be out of VTGate’s ability to execute until it implements the `Join(Result, Result)` primitive.
486
487 Properties:
488
489 * Rows are not modified
490 * Lots of rows can be produced
491 * Interchangeable with Filter as long as the condition can be split properly. This will be a significant driving factor for the optimizer.
492
493 ### LeftJoin
494
495 LeftJoin requires a join condition. If the condition is not satisfied, the left side of the Result is still produced, with NULL RHS values.
496
497 `LeftJoin(LeftResult, RightResult, condition) -> Result`
498
499 It’s important to differentiate LeftJoin from Join because LeftJoin is not freely interchangeable with Filter or other Joins.
500
501 ### Merge
502
503 Merge merges two results into one. There is no particular order in the rows.
504
505 `Merge(Result, Result) -> Result`
506
507 Properties:
508
509 * Rows are not modified
510 * Number of rows is the sum of the two Results’ number of rows
511
512 ## Example from hell
513
514 Now that all operators are defined, we have to see if they’re complete. Let’s try this query:
515
516 ```
517 01: (select count(*), ta1.col+1, tb1.col
518 02: from ta1 join tb1 on ta1.id=tb1.id
519 03: where ta1.col2=10 and ta1.col2 > any
520 04: (select c from tc1 where tc1.id=ta1.id)
521 05: group by ta1.col+1, tb1.col
522 06: having count(*) > 10
523 07: order by count(*) desc
524 08: limit 10)
525 09:union
526 10: (select count(*), ta2.col+1, tb2.col
527 11: from ta2 join tb2 on ta2.id=tb2.id
528 12: where ta2.col2=10
529 13: group by ta2.col+1, tb2.col)
530 14:order by count(*) desc
531 15:limit 5
532 ```
533
534 The above query can be represented as a single expression. But let’s start with something that has intermediate results first:
535
536 ```
537 // query 1
538 02: JoinResult1 = InnerJoin(Scan(ta1), Scan(tb1), ta1.id=tb1.id)
539 // Note that the next operation is within the scope of JoinResult1, so its columns can be used. Nevertheless, it’s an external input.
540 04: SubqueryCScanResult = Select(
541 Filter(
542 Scan(tc1),
543 tc1.id=JoinResult1.ta1.id,
544 ),
545 tc1.c)
546 03: FilterResult1 = Filter(
547 JoinResult1,
548 ta1.col2=10 and ta1.col2 > any(SubqueryCScanResult),
549 )
550 01: SelectResult1 = Select(FilterResult1, (1, ta1.col+1, tb1.col))
551 05: GroupResult1 = Aggregate(SelectResult1, count(‘1’))
552 06: HavingResult1 = Filter(GroupResult1, ‘1’ > 10)
553 07: SortResult1 = Sort(HavingResult1, (‘1’ desc))
554 08: Result1 = Limit(SortResult1, 0, 10)
555 // query 2
556 11: JoinResult2 = InnerJoin(Scan(ta2), Scan(tb2), ta2.id=tb2.id)
557 12: FilterResult2 = Filter(JoinResult2, ta2.col2=10)
558 10: SelectResult2 = Select(FilterResult2, (1, ta2.col+1, tb2.col))
559 13: Result2 = Aggregate(SelectResult2, count(‘1’))
560 // union
561 09: UnionResult = Aggregate(Merge(Result1, Result2), NULL)
562 14: OrderResult = Sort(UnionResult, (‘1’ desc))
563 15: Result = Limit(OrderResult, 0, 5)
564 ```
565
566 Now, combine everything into one expression, and voila:
567
568 ```
569 Result = Limit(
570 Sort(
571 Aggregate(
572 Merge(
573 Limit(
574 Sort(
575 Filter(
576 Aggregate(
577 Select(
578 Filter(
579 JoinResult1 = InnerJoin(
580 Scan(ta1),
581 Scan(tb1),
582 ta1.id=tb1.id,
583 ),
584 ta1.col=10 and ta1.col > any(
585 Select(
586 Filter(
587 Scan(tc1),
588 tc1.id=JoinResult1.ta1.id,
589 ),
590 tc1.c,
591 ),
592 ),
593 ),
594 (1, ta1.col+1, tb1.col),
595 ),
596 count(‘1’), // ‘1’ is column 1
597 ),
598 ‘1’ > 10,
599 ),
600 (‘1’ desc),
601 ),
602 0, 10,
603 ),
604 Aggregate(
605 Filter(
606 InnerJoin(
607 Scan(ta2),
608 Scan(tb2),
609 ta2.id=tb2.id,
610 ),
611 ta2.col=10,
612 ),
613 count(‘1’),
614 ),
615 ),
616 NULL,
617 ),
618 (‘1’ desc),
619 ),
620 0, 5,
621 )
622 ```
623
624 There’s a lot of nesting, but the expressions are mostly straightforward by the fact that each result feeds into the other. The only exception is the correlated subquery that uses a value from JoinResult1 from the outer query to perform its operation. You can also see that the lowest level operations are all Scans.
625
626 If the primitives defined at the beginning were all implemented, if we had an infinitely fast CPU, or an infinite amount of time, and we also had an infinite amount of memory, the above ‘plan’ will produce the correct result.
627
628 However, SQL engines are all about having limited CPU, memory and disk. So, they have to figure out ways to rearrange those operations differently so that the minimum amount of time and resources is used to compute the result. The next section will cover those techniques.
629
630 # Approaches and tradeoffs
631
632 ## Standard approaches
633
634 If you analyze the various operators, the expensive ones are:
635
636 * Scan: you want to avoid full table scans, unless it’s a necessity. This is ironically the lowest level operation.
637 * Sort: this is a potentially expensive operation if the result set is large.
638 * Aggregation: is also potentially expensive
639
640 All other operations are considered ‘cheap’ because they can be applied as the rows come. To optimize for the above three operations, databases combine the primitives. The most popular one is the index scan, which allows you to combine Scan, Filter and Sort, all as one operation. Aggregation can also benefit from an index scan if the GROUP BY columns are part of an index. If an index cannot be used, the database resorts to the more traditional approach of full table scans, file sorts, etc.
641
642 ## Preserving the original representation
643
644 In the case of VTGate, the ‘Route’ operation is capable of performing all 9 functions, as long as all the rows needed to fulfill the query reside in a single database. However, those functions have to be expressed as an SQL query. Most often, the original query could just be passed-through to a single shard. So, the one important question is: If we converted a query to its relational representation, can we then convert it back to the original query? The answer is no, or at least not trivially; The relational operators don’t always map one-to-one with the constructs of an SQL statement. Here’s a specific example:
645
646 `select count(*), a+1, b from t group by a+1, b`
647
648 This gets expressed as:
649
650 ```
651 Aggregate(
652 Select(
653 Scan(t),
654 (1, a+1, b),
655 ),
656 Count(‘1’), // ‘1’ is column 1
657 )
658 ```
659
660 So, it may not be a good idea to immediately convert the parsed query into the relational representation. When we analyze the query, we should try to retain the original text and see if they could be recombined without loss of readability.
661
662 The AST produced by the parser can be reversed back into a representation that’s accurate and close to the original query. This is what VTTablet already does. So, we can use the AST as the intermediate representation that can be retained until we can figure out what can be delegated to the underlying databases.
663
664 Once we have identified the parts that can be pushed down, what is left can be converted to use the operations that VTGate can perform.
665
666 ## Cleverness
667
668 Relational databases initially optimized their queries based on a standard set of rules that used the schema info as the only additional input. These were known as rule-based optimizers. However, some queries would fail for certain values due to pathological corner cases.
669
670 In order to address such corner cases, databases decided to venture into cost-based optimizers. This was somewhat of a slippery slope. On the one hand, the feature was very convenient, because one could throw any kind of ad-hoc query to a database, and it would figure out the best way to get the results. However, every once in awhile, the database would make the wrong optimization decision. This led to iterations to accommodate the new corner cases. Unfortunately, there is no end to this, and databases are still iterating. Beyond a certain point, this becomes a losing battle because the cost of computing the best plan starts to exceed the cost of executing the query using a less optimal one.
671
672 The cost-based optimizers have become so complex that a novice programmer can’t reliably predict the execution plan for a query. It’s also hard to know why an engine chose a certain plan.
673
674 An escape hatch was provided: query hints. If the database failed to find the most optimal plan, the application can specify hints in the query. Once hints are provided, the optimizer is bypassed, and the application dictates the plan for the query. The fundamental question is: Who should know the data? This question opens a lot of arguments. The main argument in favor of why the app should be responsible for the query plans is as follows: even if the database can decide the most optimal plan, the app has to know the cost of a query. It needs to know this because it makes a decision on whether it’s going to send 1QPS or 1MQPS of that query. If the plan needs to change, it means that the cost is also changing, which means that the app needs to be revisited. This is why an optimizer that is too clever has diminishing returns, and is sometimes counter-productive.
675
676 The rise of NoSQL databases is further evidence that engineers are generally not very sold on how clever the database optimizers are.
677
678 One concern with the application dictating the plan is that we may constantly worry that it might not be the best one. To alleviate this, we can develop analyzers, possibly backed by a machine-learning backend, that can suggest rewrites. But it will be up to the application to accept them.
679
680 So, we’re going to attempt this brain-dead approach for optimizations in VTGate: By looking at a query, and knowing the vschema, one should be able to predict what VTGate will do with it. If, for some reason, the plan doesn’t work out, then the query must be rewritten to represent the alternate plan. There is a possibility that this approach may fail; Some database users have grown accustomed to the optimizer doing the work for them. If it happens, this decision can be revisited.
681
682 ### What does this all mean
683
684 This just means that the VTGate optimizer will be very simple and transparent. For example, joins will always be performed left-to-right. When you look at a query, you’ll know the join order by just looking at it. VTGate will not try to rewrite subqueries into joins. If a join plan is more efficient than a subquery, then the app must write it that way.
685
686 Of course, if there are optimizations that make common sense, those will be implemented. For example, if a subquery is not correlated, then it can be pulled out and executed only once.
687
688 We should also be sensitive to situations where the natural expression of a query does not match the best plan.
689
690 No matter what, once a query is outsourced to a database, it’s subject to the optimizer decisions of the database. So, the app may still have to give hints for such situations.
691
692 ## Opcodes vs tree
693
694 The execution plan can be represented either using opcodes or as a tree. The opcode approach represents the plan as a list of actions to be performed, whereas a tree represents the same plan as one action that depends on other actions, which recursively ends up with leaf nodes that perform the routing.
695
696 Opcodes can be very flexible (turing complete). However, they require the introduction of intermediate variables. Let’s look at this example:
697
698 `a - (b+c)`
699
700 The opcode approach would express the expression as:
701
702 ```
703 temp = b+c
704 result = a - temp
705 ```
706
707 The tree approach would represent the action as:
708
709 `Minus(a, Plus(b, c))`
710
711 In the theory section, we classified SQL as a pushdown automaton, which means that a tree structure is sufficient to represent a statement.
712
713 Using opcodes may still help us come up with better optimizations. However, given that we don’t plan on going overboard on optimizations, the more readable tree representation is preferable.
714
715 ## Prioritization
716
717 Since we have powerful databases underneath us, our first priority is to harness anything they can do first. The next priority will be to do things that VTGate can perform on-the-fly, without accumulating results. And finally, we’ll look at features that require complex memory management or intermediate file storage. The features will be implemented as follows:
718
719 1. Single shard or single keyspace joins.
720 2. Scatter joins that can be merged in any order.
721 3. Cross-shard joins that don’t require post-processing.
722 4. Sort, Aggregate and Limit, provided the results can be pre-sorted by the underlying database(s).
723 5. Filter and Select expression engines.
724 6. All other constructs.
725
726 One hurdle to overcome with #4 is collation. Substantial work may have to be done to make the VTGate collation behavior match MySQL.
727
728 ## Streaming vs non-streaming
729
730 VTGate has two query APIs:
731
732 * Non-streaming API: This API has deadlines and row count limits. It collects all the results first and then returns them.
733 * Streaming API: This API has no deadlines or row count limits. It streams the results as they come.
734
735 This difference will eventually cause a divergence in what VTGate can do for streaming vs non-streaming queries. Since the entire result is available in-memory for non-streaming queries, we’ll be able to perform sorts and aggregations after obtaining the results. However, this will not be possible for streaming queries.
736
737 For the short term, we’ll maintain parity by only implementing operations that both APIs can perform. This actually means priorities 1 through 5. So, it may be a long time before the two diverge.
738
739 ## Column names
740
741 The vschema currently doesn’t know all the column names of the tables. VTGate cannot resolve joins if it doesn’t have this info. We have two options:
742
743 1. Extend vschema to also contain the column names of tables.
744 2. Require join queries to always qualify column names by the table.
745
746 The first option is going to be a maintenance problem. It introduces a dependency that is hard to maintain. For now, we’ll go with option #2. It is a syntactic inconvenience, but let’s hope that users find this acceptable. We could allow exceptions for vindex columns, but that’s likely to be confusing. So, it’s better to require this for all columns. For subqueries (that are themselves not joins), we can make an exception and assume that unqualified column names are addressing the inner table. You’ll need to use qualified names only for addressing columns of an outer query.
747
748 ## Redundant constructs
749
750 SQL has many redundant constructs. There’s diminishing return in supporting all of them. So, the following constructs will not be supported:
751
752 * `‘,’` join: This operator was the old way of joining tables. The newer SQL recommends using actual JOIN keywords. Allowing both forms of JOIN will be a big source of confusion because these operators also have different preferences.
753 * RIGHT JOIN: This is same as a reversed LEFT JOIN, except it’s less readable.
754 * NATURAL joins: These are rarely used in practice because the join condition applies to all columns.
755 * JOIN USING: This is just syntactic sugar for JOIN ON a.id=b.id
756 * WHERE clause join condition: A JOIN without an ON clause is supposed to be a cross product of the two tables. However, old-style ‘,’ users have continued to specify join conditions in the WHERE clause. The initial version of VTGate will only use the ON clause to decide how to route parts of a query. This decision can be revisited if things are too inflexible.
757
758 ## Limiting results
759
760 VTGate currently doesn’t have any limits on how many rows it can hold at any given time. This is dangerous because a scatter query that hits a large number of shards could easily blow up the memory.
761
762 Also, as we add support for joins, we’ll have a new way of accumulating a large number of rows.
763
764 So, the VTTablet level protection to limit the maximum number of rows fetched may not be sufficient any more.
765
766 We’ll add a new maxrows variable to VTGate. If the number of rows accumulated exceeds this count, we’ll return an error.
767
768 Note that this limit only applies to non-streaming queries. Streaming queries will continue to work as before.
769
770 # Design
771
772 Recapitulating what we’ve covered so far:
773
774 * The primary function of the optimizer is to push down as much of the query components as possible.
775 * We’ll not support redundant constructs for joins.
776 * If a query has to be split across multiple shards or keyspaces, we’ll expect the application to dictate the order of operations. This will be left to right, unless parentheses define a different order.
777 * For joins, all column names must be qualified by their table.
778 * We’ll preserve the original representation of the query to the extent possible.
779 * We’ll not attempt to flatten subqueries. However, they can still get pushed down with the outer query if they’re constrained by the same keyspace id.
780
781 ## Symbol table and scoping rules
782
783 Once we start allowing joins and subqueries, we have a whole bunch of table aliases and relationships to deal with. We have to contend with name clashes, self-joins, as well as scoping rules. In a way, the vschema has acted as a static symbol table so far. But that’s not going to be enough any more.
784
785 The core of the symbol table will contain a map whose key will be a table alias, and the elements will be [similar to the table in vschema](https://github.com/vitessio/vitess/blob/0b3de7c4a2de8daec545f040639b55a835361685/go/vt/vtgate/vindexes/vschema.go#L82). However, it will also contain a column list that will be built as the query is parsed.
786
787 ### A simple example
788
789 Let’s start with the following vschema that has two tables t1 & t2. Here’s a simplified view:
790
791 <table>
792 <tr>
793 <td>tables</td>
794 <td>columns</td>
795 <td>vindexes</td>
796 </tr>
797 <tr>
798 <td>t1</td>
799 <td>id</td>
800 <td>id_idx: hash</td>
801 </tr>
802 <tr>
803 <td></td>
804 <td>a</td>
805 <td>a_idx: lookup_unique</td>
806 </tr>
807 <tr>
808 <td>t2</td>
809 <td>t2id</td>
810 <td>t2id_idx: hash</td>
811 </tr>
812 <tr>
813 <td></td>
814 <td>b</td>
815 <td>b_idx: lookup_unique</td>
816 </tr>
817 </table>
818
819
820 For the following query:
821
822 `select a, b from t1 where id = :id and c = 5`
823
824 The starting symbol table will be identical to the vschema. However, as the query is analyzed, b and c will get added to t1 as additional columns:
825
826 <table>
827 <tr>
828 <td>tables</td>
829 <td>columns</td>
830 <td>vindexes</td>
831 </tr>
832 <tr>
833 <td>t1</td>
834 <td>id</td>
835 <td>id_idx: hash</td>
836 </tr>
837 <tr>
838 <td></td>
839 <td>a</td>
840 <td>a_idx: lookup_unique</td>
841 </tr>
842 <tr>
843 <td></td>
844 <td>b</td>
845 <td></td>
846 </tr>
847 <tr>
848 <td></td>
849 <td>c</td>
850 <td></td>
851 </tr>
852 </table>
853
854
855 If a symbol table contains only one database table, then all unqualified column references are assumed to implicitly reference columns of that table.
856
857 *Ideally, we should be validating column references against a known column list for each table. However, as mentioned before, the maintenance of that list comes with its own problems. Instead, we trust that a non-ambiguous column reference implies that the table (or result) does have such a column. In the worst case where such a column doesn’t exist, the query will fail. There are no circumstances where this assumption can lead to incorrect results.*
858
859 The bind vars section of the symbol table will contain an entry for id.
860
861 ### Self-join
862
863 For the following query (using the previous vschema):
864
865 `select t1.a, t2.c from t1 join t1 as t2 where t2.a = t1.id`
866
867 The symbol table will look like this:
868
869 <table>
870 <tr>
871 <td>tables</td>
872 <td>columns</td>
873 <td>vindexes</td>
874 </tr>
875 <tr>
876 <td>t1</td>
877 <td>id</td>
878 <td>id_idx: hash</td>
879 </tr>
880 <tr>
881 <td></td>
882 <td>a</td>
883 <td>a_idx: lookup_unique</td>
884 </tr>
885 <tr>
886 <td>t2</td>
887 <td>id</td>
888 <td>id_idx: hash</td>
889 </tr>
890 <tr>
891 <td></td>
892 <td>a</td>
893 <td>a_idx: lookup_unique</td>
894 </tr>
895 <tr>
896 <td></td>
897 <td>c</td>
898 <td></td>
899 </tr>
900 </table>
901
902
903 In the symbol table, t2 starts off as identical to t1, and is different from t2 that’s in the original vschema. But t2 eventually adds c to its list of columns. There is currently no value in remembering that t1 and t2 refer to the same underlying table. Maybe a future use case will arise that will introduce this need.
904
905 ### Subqueries
906
907 MySQL allows table names (or aliases) in subqueries to hide an outer name. Due to this, we have to extend the symbol table to support scoping. This is done by changing it into a linked list, where each node represents a scope. A symbol is searched in the list order until it’s found.
908
909 Let’s look at this intentionally confusing example:
910
911 ```
912 select t1.a, t2.b from t1 join t2 where t2.t2id in (
913 select b from t1 as t2 where id = t1.c
914 )
915 ```
916
917 The outerscope symbol table will look like this:
918
919 <table>
920 <tr>
921 <td>t1</td>
922 <td>id</td>
923 <td>id_idx: hash</td>
924 </tr>
925 <tr>
926 <td></td>
927 <td>a</td>
928 <td>a_idx: lookup_unique</td>
929 </tr>
930 <tr>
931 <td></td>
932 <td>c</td>
933 <td></td>
934 </tr>
935 <tr>
936 <td>t2</td>
937 <td>t2id</td>
938 <td>t2id_idx: hash</td>
939 </tr>
940 <tr>
941 <td></td>
942 <td>b</td>
943 <td>b_idx: lookup_unique</td>
944 </tr>
945 </table>
946
947
948 The innerscope will look like this:
949
950 <table>
951 <tr>
952 <td>t2</td>
953 <td>id</td>
954 <td>id_idx: hash</td>
955 </tr>
956 <tr>
957 <td></td>
958 <td>a</td>
959 <td>a_idx: lookup_unique</td>
960 </tr>
961 <tr>
962 <td></td>
963 <td>b</td>
964 <td></td>
965 </tr>
966 </table>
967
968
969 Column c for t1 was added because the inner query made a qualified reference to it. The inner scope table gets a column b added because the innerscope query made an implicit reference, and there was only one item in that scope.
970
971 Also, the inner query cannot access any columns of the outer t2 because it has its own t2 symbol that hides the outer one.
972
973 A corollary: since subquery nesting can have many levels, an inner query can still reference results from any of the outer scopes. For example, this is a valid query:
974
975 ```
976 select t1.a, t1.b
977 from t1
978 where t1.id in (
979 select t2.id
980 from t2
981 where t2.id in (
982 select t3.id
983 from t3
984 where t3.id = t1.a
985 )
986 )
987 ```
988
989 ### The FROM clause
990
991 The FROM clause is an exception. This clause creates a new symbol table (scope), with no initial outerscope, even if it was part of a subquery. At the end of the analysis, the newly created symbol table is added to the outerscope, if there was one.
992
993 This means that subqueries in the FROM clause are not allowed to refer to elements of any outer query.
994
995 The result of such a subquery will be a new virtual table with its own vindex columns, etc. This is facilitated by the fact that the SQL syntax requires you to name the subquery with an alias. Figuring out possible vindexes for the new virtual table is useful because routing rules could be applied for other tables that may join with the subquery. Here’s an example:
996
997 ```
998 select sq.a, t2.b
999 from
1000 (select a from t1 where id = :id) as sq
1001 join t2 on t2.b = sq.a
1002 ```
1003
1004 The main symbol table will look like this:
1005
1006 <table>
1007 <tr>
1008 <td>tables</td>
1009 <td>columns</td>
1010 <td>vindexes</td>
1011 </tr>
1012 <tr>
1013 <td>sq</td>
1014 <td>a</td>
1015 <td>a_idx: lookup_unique</td>
1016 </tr>
1017 <tr>
1018 <td>t2</td>
1019 <td>t2id</td>
1020 <td>t2id_idx: hash</td>
1021 </tr>
1022 <tr>
1023 <td></td>
1024 <td>b</td>
1025 <td>b_idx: lookup_unique</td>
1026 </tr>
1027 </table>
1028
1029
1030 The symbol table for the `‘from t1’` subquery will be in its own scope. Its analysis will yield a new virtual table sq, which will be the one added to the main symbol table. The analyzer should hopefully be smart enough to deduce that column ‘a’ of the subquery can be used for routing because it’s a faithful representation of the column in the underlying table t1.
1031
1032 *Conceptually speaking, there’s no strong reason to prevent the FROM clause from referencing columns of an outer query. MySQL probably disallows this because it leads to confusing scoping rules. Let’s take a look at this example:*
1033
1034 ```
1035 select id from a where id in (
1036 select b.id, a.id
1037 from
1038 (select id from b where b.id=a.id) b
1039 join d as a on a.id=b.id)
1040 ```
1041
1042 *If we allowed an outer scope, then b.id=a.id becomes ambiguous. We don’t know if it should refer to the outer ‘a’, or the one that is yet to be created as part of the current scope. Needless to say, one shouldn’t be writing such unreadable queries.*
1043
1044 ### ON clauses
1045
1046 By the fact that the ON clause is part of a FROM clause, you cannot refer to an element of an outer query. This is an invalid query:
1047
1048 `select id from a where id in (select b.id from b join c on b.id=a.id)`
1049
1050 But this is valid:
1051
1052 `select id from a where id in (select b.id from b join c where b.id=a.id)`
1053
1054 Furthermore, you can only refer to elements that participate in the current JOIN. But the following query is still valid:
1055
1056 `select * from a join b on b.id = a.id join c on c.id = a.id`
1057
1058 Because of associativity rules, c is actually joined with ‘a join b’.
1059
1060 But this query is not valid:
1061
1062 `select * from a join (b join c on c.id = a.id)`
1063
1064 The ON clause participants are only b and c because of the parenthesis. So, you’re not allowed to reference a.id in that join. In other words, analysis of the ON clause requires a list of participants. The analyzer will need to check with this list before looking inside the symbol table, or we could exploit Go interfaces here by sending a subset symbol table that has the same interface as the original one.
1065
1066 ON clauses are allowed to have correlated subqueries, but only with tables that are the ON clause participants. This statement is valid:
1067
1068 `select * from a join b on b.id in (select c.id from c where c.id=a.id)`
1069
1070 ### Naming result columns
1071
1072 Result columns get referenced in the following places:
1073
1074 * For addressing the results of a subquery in the FROM clause
1075 * GROUP BY, HAVING and ORDER BY clauses
1076
1077 The WHERE clause is not allowed to reference them because it happens before the SELECT expressions are evaluated.
1078
1079 Column names are usually built from the select expression itself. Therefore, ‘a+1’ is a valid column name. If such a column is in a FROM subquery, it can be addressed using back-quotes as ``a+1``. However, you can reference such expressions directly in GROUP BY, HAVING and ORDER BY clauses. You can also reference a result by using its column number (starting with 1).
1080
1081 Conceptually, this is similar to the subquery column naming. GROUP BY, etc. are post-processing operations that are applied to the result of a query.
1082
1083 *MySQL allows duplicate column names. Using such column names for joins and subqueries leads to unpredictable results. Because of this, we could go `either way`:*
1084
1085 * *Allow duplicates like MySQL and remain equally ambiguous*
1086 * *Disallow duplicates*
1087
1088 Here’s an example of how MySQL names columns:
1089
1090 ```
1091 > select id, (select val from a), 3, 3, 3 as `4` from a;
1092 +------+------------------------+---+---+---+
1093 | id | (select val from a) | 3 | 3 | 4 |
1094 +------+------------------------+---+---+---+
1095 | 1 | a | 3 | 3 | 3 |
1096 +------+------------------------+---+---+---+
1097 MariaDB [a]> select `(select val from a)` from (select id, (select val from a), 3, 3, 3 as `4` from a) as t;
1098 +------------------------+
1099 | (select val from a) |
1100 +------------------------+
1101 | a |
1102 +------------------------+
1103 1 row in set (0.00 sec)
1104 MariaDB [a]> select `(select val from a)` from (select id, (select val from a), 3, 3, 3 as `4` from a) as t;
1105 ERROR 1054 (42S22): Unknown column '(select val from a)' in 'field list'
1106 ```
1107
1108 The select expression is produced verbatim as column name, and duplicates are allowed.
1109
1110 These quirks don’t matter much if VTGate just has to return these results back to the client without generating its own values. However, the result column names need to be known for analyzing dependencies when we split queries. So, VTGate will do a fake mimic of what MySQL does: Take a guess at the column name by rebuilding a string representation of the AST for that select expression. If there’s a lucky match, well and good. If not, the query analysis will fail, and the user will have to provide a non-ambiguous alias for such columns. In practice, this is not a real problem because users also generally dislike using verbatim expressions as column names when performing complex joins.
1111
1112 So, once the SELECT expression is analyzed, the symbol table (for that scope) will acquire an entry for the result column names. Additionally, if a result column is a faithful reference to an underlying table column, it will contain this additional annotation, which can be used for optimizing other operations.
1113
1114 ### Bind variables
1115
1116 When VTGate has to perform a join, it needs to use the results of one query to feed into the other. For example, let’s look at this query:
1117
1118 `select t1.a, t2.b from t1 join t2 on t2.id = t1.id`
1119
1120 If t1 and t2 are on different keyspaces, VTGate has to perform the following operations:
1121
1122 `select t1.a from t1`
1123
1124 For each row of this result, it then has to perform:
1125
1126 `select t2.b from t2 where t2.id = :_t1_a`
1127
1128 The mechanism for this happens by generating a bind variable named ‘_t1_a1’ from the first result. The bind variable name can be generated using a functional algorithm:
1129
1130 `‘_’ + tableAlias + ‘_’ + columnName`
1131
1132 The scoping rule works in our favor here because this type of name generation correctly handles naming conflicts.
1133
1134 *MySQL allows multiple columns in a result to have the same name, and allows using of such ambiguous names in joins, etc. Fortunately, this also works in our favor because we’ll be no worse if we also allowed it.*
1135
1136 For the sake of simplicity, we’ll not allow dependencies to use non-standard column names. For example, we’ll fail a query like this:
1137
1138 `select * from (select a, count(*) from t1) t where t.count(*) > 0`
1139
1140 This should instead be rewritten as:
1141
1142 `select * from (select a, count(*) c from t1) t where t.c > 0`
1143
1144 This way, we can build a bind variable like `‘:_t_c’` for the outer query if we have split the above query into two.
1145
1146 There is a subtle difference between column names produced by a result vs. the column names in the symbol table. The WHERE clause constructs reference columns in the symbol table. It’s only the post-processing constructs like GROUP BY, etc. that reference column names produced by the result. This could pose a problem when a SELECT gets pushed down that defines a new alias that has the same column name as an underlying table. Will such a construct hide, and therefore, corrupt the meaning of a join? Let’s look at an example:
1147
1148 `select a.col1 as col2, b.col1 from a join b on b.col2=a.col2`
1149
1150 Rewritten, this would become
1151
1152 ```
1153 select a.col1 as col2, a.col2 from a
1154 Join (by producing _a_col2 from column 2)
1155 select b.col1 from b where b.col2 = :_a_col2
1156 ```
1157
1158 Because we reference column numbers from the first query, we bypass naming ambiguities. However, we still need to know the source column for each expression so we don’t have to unnecessarily fetch identical results. In other words, the following query:
1159
1160 `select a.col2 as col2, b.col1 from a join b on b.col2=a.col2`
1161
1162 should become:
1163
1164 ```
1165 select a.col2 as col2 from a
1166 Join (by producing _a_col2 from column 1)
1167 select b.col1 from b where b.col2 = :_a_col2
1168 ```
1169
1170 This means that select expressions need to know if they’re a faithful reference of an underlying table column, and this information will be used for wiring up dependencies.
1171
1172 The other situation where result names interact with the symbol table is during a subquery in the FROM clause. Fortunately, we create a separate virtual table entry for it in the symbol table for this situation. This allows us to keep the two worlds from colliding.
1173
1174 ## Pushdown
1175
1176 Pushdown is achieved by analyzing the query in the execution order and seeing if the next stage can be pushed down into the database. For VTGate, the execution order is slightly different from the standard one: Route replaces Scan, and also, the Route primitive can perform a scatter.
1177
1178 In terms of relational primitives, a scatter is a bunch of `Route` operations followed by a `Merge`. We’ll call it `RouteMerge`. So, what you can push down into a simple `Route` may not be allowed for a `RouteMerge`. Here’s the VTGate-specific order of operations:
1179
1180 `Route/RouteMerge <-> (Join & LeftJoin) -> Filter(where) -> Select -> Aggregate -> Filter(having) -> Sort -> Limit -> Merge`
1181
1182 The rest of the analysis tries to push the subsequent operations into Route or RouteMerge. Whatever cannot be pushed down is the work that VTGate has to do by itself. The fact that we won’t try to flatten subqueries simplifies our analysis a little bit.
1183
1184 *MySQL allows you to Sort and Limit the result of a UNION. This is something that the Vitess grammar doesn’t allow right now. We’ll eventually need to support this construct, which will add `Aggregate -> Sort -> Limit` as things that can further happen after the last `Merge`.*
1185
1186 The overall strategy is as follows:
1187
1188 1. Identify groups: Form groups of tables (or subqueries) in the FROM clause that can stay together for Route operations.
1189 2. Pushdown higher-level operations into the various groups: This step uses the AST as input and produces an execution tree where the leaf nodes are all Route operations.
1190 3. Wire up dependencies: This step computes the dependencies between different routes, and potentially modifies the queries in the routes so that additional values are returned that may be needed by other routes.
1191 4. Produce the execution plan.
1192
1193 ### Starting primitives
1194
1195 In order to align ourselves with our priorities, we’ll start off with a limited set of primitives, and then we can expand from there.
1196
1197 VTGate already has `Route` and `RouteMerge` as primitives. To this list, let’s add `Join` and `LeftJoin`. Using these primitives, we should be able to cover priorities 1-3 (mentioned in the [Prioritization](https://github.com/vitessio/vitess/blob/main/doc/V3HighLevelDesign.md#prioritization) section). So, any constructs that will require VTGate to do additional work will not be supported. Here’s a recap of what each primitive must do:
1198
1199 * `Route`: Sends a query to a single shard or unsharded keyspace.
1200 * `RouteMerge`: Sends a (mostly) identical query to multiple shards and returns the combined results in no particular order.
1201 * `Join`: Executes the LHS operation, which could be any primitive. For each row returned, it builds additional bind vars using the result, and uses them to execute the RHS. For each row of the RHS, it builds a new row that combines both the results.
1202 * `LeftJoin`: Same as Join, except that a row is returned even if the RHS fails to return a row.
1203
1204 ### FROM clause
1205
1206 The first step of the pushdown algorithm is to identify tables that can stay together. Requiring the ON clause to specify the join condition helps simplify this analysis. If this rule is followed in the query, we can determine this grouping by just analyzing the FROM clause.
1207
1208 The premise is as follows: A join can be pushed down if all the rows needed to satisfy a query are within the same database. If all the tables are in an unsharded keyspace, then this constraint is automatically true. If the tables are in a sharded keyspace, then the join condition must guarantee this. This is best explained with an example. Let’s start with a sharded keyspace with user and extra as tables, and analyze this query:
1209
1210 `select user.name, extra.info from user join extra on extra.id=user.id`
1211
1212 The join algorithm will first scan user:
1213
1214 `select user.name, user.id from user`
1215
1216 Then for each user.id (:user_id), it will execute:
1217
1218 `select extra.info from extra where extra.id = :user_id`
1219
1220 If user and extra had the same vindex on id, and if it was unique, then we know that all the necessary rows to satisfy the second query belong to the same kesypace_id as the row that came from user. So, they will be in the same shard. This means that it’s safe to push down the join.
1221
1222 Note that this push-down will work for RouteMerge queries also.
1223
1224 Given that the unsharded keyspace is an easy special case, we’ll focus mainly on the analysis of sharded keyspaces. Any place where we mention ‘same shard’ or ‘same keyspace id’, such condition will equally apply to an unsharded keyspace.
1225
1226 *Rule: For a join to be groupable, the join condition must be an equality condition where the participant columns use the same vindex, and the vindex must be unique.*
1227
1228 Some examples of what kind of AST will be produced by joins:
1229
1230 ```
1231 a join b join c
1232 J
1233 / \
1234 J c
1235 / \
1236 a b
1237 A join (b join c)
1238 J
1239 / \
1240 a J
1241 / \
1242 b c
1243 a left join b on c1 left join c on c2
1244 L:c2
1245 / \
1246 L:c1 c
1247 / \
1248 a b
1249 ```
1250
1251 For left joins, the ON clause is mandatory and enforced by the grammar. Because of this, the placement of ON clauses can also dictate grouping, which is not the case for normal joins:
1252
1253 ```
1254 a left join b left join c on c1 on c2
1255 L:c2
1256 / \
1257 a L:c1
1258 / \
1259 b c
1260 ```
1261
1262 By looking at the above tree, we can deduce that c1 can reference b and c, while c2 can reference a, b and c.
1263
1264 As mentioned before, when the required data is in multiple shards or keyspaces, VTGate will perform the join, and the order will be dictated by the SQL statement, as represented by the above execution trees. The join conditions themselves may try to dictate a different order. However, that will not be used to reorder the joins.
1265
1266 For example:
1267
1268 `from a join (b join c on c.id=b.id) on c.id=a.id`
1269
1270 is likely better executed this way:
1271
1272 `from a join c on c.id=a.id join b on b.id=c.id`
1273
1274 or this way:
1275
1276 `from b join c on c.id=b.id join a on a.id=c.id`
1277
1278 But VTGate will not perform such reordering tricks.
1279
1280 The work of the analyzer is to recursively walk the JOIN tree. The decisions are made from bottom to top.
1281
1282 The lowest level node of a parse tree is a ‘table’. This could be a real table or the result of a subquery that can be treated as a table. Knowing the table name, we can deduce the keyspace to which it belongs. So, the output of this analysis is the creation of the table entry in the symbol table, and the node being marked as a RouteMerge, because the rows for that table are scattered across all shards until additional constraints are applied.
1283
1284 When analyzing a join node, we first extract the ON clause conditions as a list, and we see if these requirements are satisfied:
1285
1286 * The node types have to be Route or RouteMerge
1287 * They should be for the same keyspace (sharded)
1288 * There should be at least one join condition between the two nodes: left.col = right.col
1289 * The columns referenced in the join must use the same vindex
1290 * The vindex must be unique
1291 * Other conditions in the ON clause must also be executable under the new group. This basically means that if there are subqueries, they should be correlated in such a way that the required rows can all be found within the new group. The algorithm for this analysis is explained in the WHERE clause section.
1292
1293 Two nodes can also be grouped (without a join condition) if they are both constrained by the same keyspace id (Route nodes).
1294
1295 The above rules work for both JOIN and LEFT JOIN nodes.
1296
1297 ###### Why does grouping work for LEFT JOIN?
1298
1299 *It’s easy to explain why grouping works for the simple case:*
1300
1301 `a left join b on b.id=a.id`
1302
1303 *In the above case, the only rows that are inspected in b are those where b.id=a.id. So, the join can be grouped. But what about this join:*
1304
1305 `(a left join b on a.id=b.id) join (c left join d on c.id=d.id) on b.id=d.id`
1306
1307 *In the above case, rows from b or c could be NULL. Fortunately, in SQL, NULL != NULL. So, the outer-scope join will succeed only if rows from b and c are not NULL. If they’re not NULL, they’re guaranteed to be from the same shard. So, this makes them groupable. In fact, in the above case, the left joins in the inner scope are unnecessary. They could have just been normal joins. However, things would be very different if one had used the null-safe equal operator (<=>) for joins. But we’ll not treat null-safe equal as a valid join operator.*
1308
1309 When two nodes are grouped, the current join condition becomes the root of the new group, and it gets a routing property:
1310
1311 * If it’s a JOIN, the new property is the more restrictive of the two nodes. So, if one of them is a Route, then the new node is also a Route.
1312 * For a LEFT JOIN, the new property is the same as the LHS node.
1313
1314 If the grouping conditions are not met, then the node remains a join node. In this case, we have to see if the ON clause conditions can be pushed down into the left and/or right nodes. By the fact that the current join is split into two, the ON clause cannot be pushed as is. Instead, we use associativity rules to our benefit and merge the ON clause conditions into the WHERE clauses of the underlying nodes. The rules are the same as the ones described for a normal WHERE clause.
1315
1316 But left joins are slightly different, because the join condition is applied *to the RHS only*. Also, the condition cannot be further pushed into other nested left joins, because they will change the meaning of the statement. For example:
1317
1318 `from a left join (b left join c on c.id=b.id) on c.id=a.id`
1319
1320 is same as:
1321
1322 `a LeftJoin (b LeftJoin c where c.id=b.id) where c.id=a.id)`
1323
1324 But it’s not the same as:
1325
1326 `a LeftJoin (b left join c where c.id=b.id and c.id=a.id)`
1327
1328 If the above joins were normal joins, then all three would be equivalent.
1329
1330 *The RHS-only rule for left joins has some non-intuitive repercussions. For example, this query will return all rows of table ‘a’:*
1331
1332 `select a.id, b.id from a left join b on (a.id=5 and b.id=a.id)`
1333
1334 *But b.id will be mostly null, except when a.id and b.id were 5.*
1335
1336 An ON clause can contain other constraints that are not joins. If so, they’re treated just like a WHERE clause constraint. So, a keyspace id constraint in an ON clause can change a RouteMerge to a Route. This means that we have to look at how to combine Route nodes with others.
1337
1338 A subquery in the FROM clause can result in any of the four primitive node. If it’s a join node, then the ON clause cannot be pushed down because it depends on the result of the join. If it was a Route or RouteMerge, the ON clause is pushable. Here’s how:
1339
1340 `from a join (select id, count(*) from b where b.id=1) t on t.id=a.id`
1341
1342 becomes
1343
1344 ```
1345 J
1346 / \
1347 a (select id, count(*) from b where b,id=1) t where t.id=a.id
1348 ```
1349
1350 Here’s a tabular version of the above algorithm:
1351
1352 <table>
1353 <tr>
1354 <td>LHS</td>
1355 <td>Join type</td>
1356 <td>RHS</td>
1357 <td>Result</td>
1358 </tr>
1359 <tr>
1360 <td>Route</td>
1361 <td>Join</td>
1362 <td>Route</td>
1363 <td>Route</td>
1364 </tr>
1365 <tr>
1366 <td>Route</td>
1367 <td>Join</td>
1368 <td>RouteMerge</td>
1369 <td>Route</td>
1370 </tr>
1371 <tr>
1372 <td>RouteMerge</td>
1373 <td>Join</td>
1374 <td>Route</td>
1375 <td>Route (RHS)</td>
1376 </tr>
1377 <tr>
1378 <td>RouteMerge</td>
1379 <td>Join</td>
1380 <td>RouteMerge</td>
1381 <td>RouteMerge</td>
1382 </tr>
1383 <tr>
1384 <td>Route</td>
1385 <td>LeftJoin</td>
1386 <td>Route</td>
1387 <td>Route (LHS)</td>
1388 </tr>
1389 <tr>
1390 <td>Route</td>
1391 <td>LeftJoin</td>
1392 <td>RouteMerge</td>
1393 <td>Route (LHS)</td>
1394 </tr>
1395 <tr>
1396 <td>RouteMerge</td>
1397 <td>LeftJoin</td>
1398 <td>Route</td>
1399 <td>RouteMerge</td>
1400 </tr>
1401 <tr>
1402 <td>RouteMerge</td>
1403 <td>LeftJoin</td>
1404 <td>RouteMerge</td>
1405 <td>RouteMerge</td>
1406 </tr>
1407 <tr>
1408 <td>Join/LeftJoin</td>
1409 <td>All joins</td>
1410 <td>All primitives</td>
1411 <td>New join</td>
1412 </tr>
1413 <tr>
1414 <td>All primitives</td>
1415 <td>All joins</td>
1416 <td>Join/LeftJoin</td>
1417 <td>New join</td>
1418 </tr>
1419 </table>
1420
1421
1422 At the end of the analysis, the output is a symbol table and a reduced tree representing the grouped joins. After this phase, the groups are finalized. They will not be further split or merged.
1423
1424 Here’s an example:
1425
1426 `a join b on b.id=a.id join c on c.id=b.id2 join d on d.id=a.id`
1427
1428 will be represented as:
1429
1430 ```
1431 J2
1432 / \
1433 J1 d where d.id=a.id
1434 / \
1435 a join b on b.id=a.id c where c.id=b.id2
1436 ```
1437
1438 ### WHERE clause
1439
1440 The ON clause analysis is similar to the WHERE clause, which is why it wasn’t covered in detail in the previous section. The main difference between WHERE and ON is that the ON clause can cause groups to merge. If two groups cannot be merged, the ON clauses get converted to WHERE clauses and get pushed down using the same rules. The only exception is the forced RHS push in the case of left joins.
1441
1442 *The symbol table used by the ON clause is a subset of the one for the WHERE clause. So, it’s safe to move on ON clause condition into a WHERE clause. Attempting to move a WHERE clause condition to an ON clause may require more care.*
1443
1444 The first step here is to convert the parse tree into a list of conditions that can be put back together using AND clauses. Then we analyze each condition separately and decide if it can be pushed into a Route or RouteMerge.
1445
1446 We look at the columns that each condition references, and we identify the rightmost group that the expression references and we push the where clause there. Any references outside of that group will later be changed to bind vars. By the fact that the where clause was pushed to the right most group gives us the guarantee that values will be available for the rest of the column references.
1447
1448 Each time a WHERE clause gets successfully pushed into a Route or RouteMerge, we re-evaluate the routing plan to see if it can be improved. The rules for this are described in the vindex document.
1449
1450 There are situations where we cannot push down:
1451
1452 ###### LEFT JOIN
1453
1454 A WHERE clause cannot be pushed inside a LEFT JOIN, because it has to be applied after
1455
1456 the ON clause is processed, which is not until `LeftJoin` returns the results.
1457
1458 *There are some practical use cases where you’d want to add conditions in the WHERE clause, typically for NULL checks. To support this in the future, we can use this technique that converts a LEFT JOIN into a normal JOIN:*
1459
1460 `from a left join b on b.id=a.id where b.id is NULL`
1461
1462 *After split, the query on b should have become:*
1463
1464 `from b where b.id=a.id`
1465
1466 *We cannot push down the null check in the above case. But we can, if we rewrote it as:*
1467
1468 `from (select 1) as t left join b on b.id=a.id where b.id is NULL`
1469
1470 *The (select 1) acts as a fake driver row for the left join, which allows us to tack on the where clause and also change the `LeftJoin` to a `Join`.*
1471
1472 ###### Subqueries
1473
1474 To be able to push a subquery, we have to determine that the subquery does not escape out of the shard of the parent query. For this, we identify the target group using the same method as above. Then we check to see if the subquery can get a Route plan where the input is the same keyspace id as the group’s query. If this is true, then the where clause can be pushed down just like a normal condition.
1475
1476 If this condition is not met, the subquery becomes a separate group. This complexity is beyond what can be expressed using the four primitives we have. So, we’ll address such constructs when we add more primitives to the mix.
1477
1478 When recombining WHERE clauses, additional parenthesis will be needed to prevent lower precedence operators from changing the intended order.
1479
1480 ###### An example
1481
1482 `from a join b on b.id=a.id join c on c.id=b.id where cond1(a.col, b.col) and cond2(a.col, c.col)`
1483
1484 If a, b and c where in different groups, the output would be:
1485
1486 ```
1487 J
1488 / \
1489 J c where (c.id=b.id) and (cond2(a.col, c.col))
1490 / \
1491 a b where (b.id=a.id) and (cond1(a.col, b.col))
1492 ```
1493
1494 The cond2 expression gets pushed into the where clause for table ‘c’ because it’s the right-most group that’s referenced by the condition. External references will be changed to appropriate bind variables by the rewiring phase.
1495
1496 *Once VTGate acquires the ability to perform its own filters, should we stop pushing these conditions into the dependent queries and do it ourselves instead? The answer will usually be no. You almost always want to push down filters. This is because it will let the underlying database scan fewer rows, or choose better indexes. The more restrictive the query is, the better.*
1497
1498 At the end of the analysis, if no WHERE clauses remain, we can go to the next step. If not, the query is too complex, and we error out.
1499
1500 ### SELECT clause, non-aggregate case
1501
1502 The SELECT clause can represent two relational primitives: a Select or Aggregate (or both). In the case of an Aggregate, it works in conjunction with the GROUP BY clause. So, it will be simpler if the analysis branched off depending on whether the SELECT clause has aggregates or not.
1503
1504 Before starting to process select, we have to perform the pre-step of naming the results. During this process, we generate a name for each item in the SELECT clause. Additionally, if a column is a faithful reference to an underlying table column, we add this qualification. This information will be used for future analysis.
1505
1506 The important property of a non-aggregate SELECT clause is that the column expressions don’t depend on each other. Therefore, they can be split and pushed down past joins into different groups.
1507
1508 If an expression references columns from a single group, then it can be pushed down into that group.
1509
1510 If an expression references columns from more than one group, then the expression cannot be pushed down, we fail the query.
1511
1512 *It is possible to push-down combo expressions using the same technique we use for WHERE clauses. For example:*
1513
1514 `select a.col+b.col from a join b where b.foo=a.bar`
1515
1516 *can be resolved with the following join*
1517
1518 ```
1519 select a.col, a.bar from a // produce _a_col & _a_bar
1520 Join
1521 select :_a_col+b.col from b where b.foo=:_a_bar
1522 ```
1523
1524 *However, such queries are not very efficient for VTTablet. It can’t cache field info. Such push-downs don’t benefit MySQL either. So, it’s better to wait till VTGate implements the ability to evaluate expressions.*
1525
1526 SELECT can also contain subqueries. The same restriction as WHERE clauses can be applied here: If the keyspace id of the subquery is not the same as the outer query’s, we fail the query. Otherwise, we push it down.
1527
1528 ###### ‘*’ expressions
1529
1530 Since VTGate does not know all the columns of a table, a ‘SELECT *’ may not be resolvable. We can consider disallowing it altogether. However, here are situations where we could allow it:
1531
1532 * If the underlying primitive is a Route or RouteMerge, we could just push it down without knowing what it means.
1533 * If the underlying ‘table’ is a subquery, we know the full column list. In such cases, we know how to expand the ‘*’ into an actual list.
1534
1535 ###### Additional expressions for joins
1536
1537 Note: This phase could be postponed till the time we do the wire-up.
1538
1539 After the SELECT expressions are pushed down, we have to see if there are dependencies across joins that are still not addressed. If there are WHERE clause dependencies from one group to another that are not listed in the SELECT expression, we need to add those to the list of expressions. This is where we use the info for each select column to see if it already references a column we need for the RHS query. Here’s an example of the metadata in a Join primitive:
1540
1541 `select a.col1, b.col1, a.col3 from a join b on b.col2=a.col2`
1542
1543 will be represented as
1544
1545 ```
1546 Join:
1547 LHS: select a.col1, a.col3, a.col2 from a
1548 Bindvars: "_a_col2": 2 // Col 2 produces bind var _a_col2
1549 RHS: select b.col1 from b where b.col2=:_a_col2
1550 Select: [ LHS(0), RHS(0), LHS(1) ]
1551 ```
1552
1553 Consequently, this means that the lower level routes fetch columns that should not be returned in the result. This is the reason why a Join primitive has to also specify the columns that have to be returned in the final result.
1554
1555 *It’s safe to assume that no further changes will be made to the SELECT list, and consider it finalized. Subsequent clauses will only be able to operate on the outcome of the result. This fact is not material right now. But it may become significant when VTGate starts to perform other operations. Specifically, VTGate may split the SELECT clause into a Select and Aggregate and push the Select part down. If the select list changes due to other external dependencies, then the Aggregate cannot be pushed down.*
1556
1557 *MySQL allows you to address columns that are not in the SELECT list in GROUP BY and HAVING clauses, which can further affect the Select list. This is not standard SQL.*
1558
1559 ###### Handling empty lists
1560
1561 The converse of the above rule is the case where the result references only columns from one side of a join. In such cases, the other side of the join doesn’t have a SELECT expression yet. For such situations, we create a dummy ‘select 1’ to make the query work. Example:
1562
1563 `select a.col from a join b on b.col=a.col`
1564
1565 will be represented as:
1566
1567 ```
1568 Join:
1569 LHS: select a.col from a
1570 Bindvars: "_a_col": 0
1571 RHS: select 1 from b where b.col=:_a_col
1572 Select: [LHS(0)]
1573 ```
1574
1575 ###### Pushing past Filter
1576
1577 If/when VTGate implements filters, we may have to ask if a SELECT clause can be pushed past such filters into a Route primitive. The answer is yes, because Select and Filter are orthogonal operations. However, we may prefer performing the Select at the VTGate level instead of pushing it down.
1578
1579 ### SELECT with aggregates
1580
1581 If a SELECT clause uses aggregate functions, then it can be pushed down if the underlying node is a Route. As usual, subquery escape analysis will still need to be applied.
1582
1583 If the underlying node is a RouteMerge, then it can still be pushed down if one of the non-aggregate result columns has a unique vindex. Such a constraint guarantees that all the rows of each group are within one shard.
1584
1585 You generally can’t push an Aggregate past a join. Theoretically, there are situations where you can do this, but such opportunities are rare. So, it’s not worth considering these optimizations.
1586
1587 ### GROUP BY
1588
1589 The act of pushing an Aggregate down also necessitates that the associated GROUP BY clause be pushed. It’s just the other side of the same coin.
1590
1591 The GROUP BY clause is slightly unusual because its contents can actually be automatically inferred in most cases. More often than not, the SQL engine has to validate that the GROUP BY columns match the inferred values.
1592
1593 It is possible that a GROUP BY to be present while a SELECT doesn’t have aggregates. It’s a common technique that people use to dedupe rows in the result. In such cases, the same rules as an Aggregate apply to the GROUP BY clause.
1594
1595 *As mentioned previously, MySQL analyzes the GROUP BY clause using the query’s scope. It also does not enforce that the GROUP BY columns are consistent with the SELECT. For example, this is valid in MySQL:*
1596
1597 `select a.col1, a.col2 from a group by a.col1`
1598
1599 *This also valid:*
1600
1601 `select a.col1 from a group by a.col1, a.col2`
1602
1603 *Both of the above constructs are invalid by SQL standards. We could piggy-back on this lax attitude and just pass-through the GROUP BY as is. After all, these are not very meaningful constructs. If MySQL allows them, why should we work against them?*
1604
1605 ### HAVING
1606
1607 We reach this phase only if SELECT and GROUP BY were successfully pushed.
1608
1609 The difference between HAVING and WHERE is that the HAVING clause applies filters after the result is computed. So, it can only reference columns from the result.
1610
1611 For each condition in the HAVING clause, we find out which result columns it references. If all those columns originate from the same group, then the clause is pushable into that group.
1612
1613 For conditions that span multiple groups, we can apply the bind variable dependency technique just like we did for the WHERE clause. If we decide to do this, we’ll need a separate bind variable naming scheme, because these refer to result columns and not table columns.
1614
1615 As usual, subquery processing will need to be done as needed.
1616
1617 Note: HAVING clauses are usually very simple. So, there may be diminishing return in allowing complexities here. However, if we’re just reusing code, we can allow these. Otherwise, we shouldn’t go overboard.
1618
1619 ### ORDER BY
1620
1621 Even though this clause near the end of the execution chain, it’s a popular one. It’s also an expensive one. So, we have to make every effort to push the work down into the underlying routes.
1622
1623 We’ll start with the usual restriction that this clause will only reference the result columns.
1624
1625 If the underlying node is a Route, then the clause can just be pushed down.
1626
1627 If the underlying node is a RouteMerge, then it cannot be pushed down.
1628
1629 If the underlying node is a join, then we perform the following analysis: Take each expression and identify the group it originates from. If it references more than one group, then we cannot push that expression down. If the group order is no less than the current group order, the expression can be pushed. We continue this until we reach an expression that is not pushable, or we successfully push everything. Here are two examples:
1630
1631 `select a.col, b.col, c.col from a join b join c order by a.col, b.col, c.col`
1632
1633 This will produce:
1634
1635 ```
1636 select a.col from a order by a.col // group 1
1637 select b.col from b order by b.col // group 2
1638 select c.col from c order by c.col // group 3
1639 ```
1640
1641 Example 2:
1642
1643 `select a.col, b.col, c.col from a join b join c order by a.col, c.col, b.col`
1644
1645 For the above case, a.col is from group 1, c.col is from group 3, but b.col is from group 2, which is less than group 3. So, the last expression is not pushable.
1646
1647 ### LIMIT
1648
1649 The LIMIT clause has two forms:
1650
1651 LIMIT m, n: Fetch m+n rows, but return rows m through n.
1652
1653 LIMIT n: same as LIMIT 0, n
1654
1655 If the underlying node is a Route, then the LIMIT can be pushed down.
1656
1657 If the underlying node is a RouteMerge, then we fail the query. This is because ORDER BY cannot be pushed into a RouteMerge, and a LIMIT without an ORDER BY is not very meaningful. Also, the order of rows is unpredictable for RouteMerge. So, it’s even less valuable to apply a limit.
1658
1659 In the case of a join, the query could be passed down without a limit and we can do the filtering on the entire result. Maybe this is not worth supporting.
1660
1661 ## Wire-up
1662
1663 If we’ve reached this phase, it means that we have successfully broken up the original query and pushed the different parts into route primitives that are now wired together by joins.
1664
1665 However, these parts would still be referring to each other. This section will describe how those dependencies will get resolved.
1666
1667 In order to resolve dependencies correctly, we need the symbol tables that were used for analyzing each context. With the symbol table at hand, we walk the AST for each group, and look up the referenced symbols in the symbol table. If the symbol belongs to the same group, then nothing needs to be done. If the reference is outside, we perform the following steps:
1668
1669 1. Change the external reference to the corresponding bind var name.
1670 2. If that column is not already in the select list of the other group, we add it.
1671 3. Mark the column number to be exported as the bind var name that we chose.
1672
1673 This step has a caveat: if the SELECT list contained an Aggregate, then we cannot add such a column. However, we’re currently protected by the constraint that Aggregate cannot be pushed into a join, and external dependencies can only happen for joins. But this is something to keep in mind as we add more primitives and flexibilities.
1674
1675 ### Accessing the symbol table
1676
1677 Currently, the AST contains only information parsed from the query. When we analyze the query, we build various symbol tables. However, they would go away as we finish each analysis.
1678
1679 Since we need them again for resolving dependencies, we have to store them somewhere. We also have to retrieve the right one for each subquery. This gives rise to a dependency issue: Subqueries can be anywhere in the parse tree. So, it’s unnatural to have an outside data structure that points to various parts of the AST.
1680
1681 So, we make a compromise: we extend the Select structure and add a Symbols (any) member to it. Every time we finish analyzing a subquery, we set the Symbols member to point to the symbol table. With this compromise, the rest of the code base will flow naturally while resolving dependencies.
1682
1683 ## Plan generation
1684
1685 Now, all we have to do is generate the execution plan. During this step, we just generate the code for the AST of each route node, and perform additional substitutions needed to perform the correct routing, as dictated by the vindexes chosen for each route.
1686
1687 A Route node will become one of: SelectUnsharded, SelectEqual, SelectKeyrange
1688
1689 A RouteMerge will become one of: SelectScatter, SelectIN
1690
1691 # Future primitives
1692
1693 Looking at how far we can go with just four primitives is pretty amazing. But we can’t stop there. There are a few more low-hanging fruits we can get to, and then we can also start looking at the high-hanging ones.
1694
1695 ## MergeSort and MergeAggregate
1696
1697 If the ORDER BY clause can be pushed down for a Route, then we’re not far away from supporting it for a RouteMerge also. The results from each shard are already coming sorted. Merge-sorting such results becomes trivial.
1698
1699 Along the same lines, if we ordered the rows of an Aggregate by the columns it’s grouping, we can push down such a query through a RouteMerge, and perform the final aggregation as the results come.
1700
1701 The one challenge is collation. We’ll need to make sure that our collation algorithm can match what MySQL does.
1702
1703 ## Subquery
1704
1705 Uncorrelated subqueries can be trivially ‘pulled out’ to be executed separately, and their results can be pushed down into the query that needs the result.
1706
1707 Here’s an example:
1708
1709 `select * from a where col in (select col from lookaside)`
1710
1711 This can be expressed as:
1712
1713 ```
1714 Subquery:
1715 _lookaside: select col from lookaside
1716 query: select * from a where col in ::_lookaside
1717 ```
1718
1719 ## Select & Filter
1720
1721 If we built an expression engine, then we can do our own evaluation and filtering when something is not pushable. Adding support for this will automatically include correlated subqueries, because the result of the underlying query can be used to execute the subquery during the Select or Filter stage. Example:
1722
1723 `select a.id from a where a.id = (select id from b where b.col=a.col)`
1724
1725 Plan:
1726
1727 ```
1728 Filter:
1729 query: select a.id, a.col from a
1730 bindvars: "_a_col": 1
1731 constraint: a.id = ConvertVal: Route:
1732 select id from b where b.col=:_a_col
1733 ```
1734
1735 ## Other expensive primitives
1736
1737 The rest of the primitives have potentially unbounded memory consequences. For these, we’ll have to explore map-reduce based approaches.