github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20171206_single_use_common_table_expressions.md (about) 1 - Single-Use Common Table Expressions 2 - Status: completed 3 - Start Date: 2017-11-30 4 - Authors: Jordan Lewis 5 - RFC PR: #20374 6 - Cockroach Issue: #7029 7 8 # Summary 9 10 Implement a simple subset of common table expressions that permits naming 11 result sets and using each of them at most once in a statement. Full support 12 for common table expression (henceforth CTEs) requires temporary table 13 infrastructure, which is currently missing from CockroachDB. This proposal aims 14 to fast-track the implementation of the subset of CTEs that doesn't require 15 temporary tables, providing our users with partial compatibility and query 16 readability boons at little cost. 17 18 The supported subset will be known as single-use CTEs, and consists of: 19 - Full `WITH ... AS` syntax support, with unlimited clauses 20 - Each clause can be referenced as a data source at most once 21 22 Features that are not present in single-use CTEs: 23 - Using a CTE more than once in a data source will not be included because 24 implementing it requires temporary tables 25 - Correlated CTEs will not be supported because we don't yet support correlated 26 subqueries in general. 27 - `WITH RECURSIVE` will not be included because implementing it requires 28 temporary tables and a complex new execution strategy. 29 30 # Motivation 31 32 Common table expressions (CTEs), or `WITH` clauses, are a standard SQL 1999 33 feature that permit naming the result sets of intermediate statements for use 34 in a single larger `SELECT`, `INSERT/UPSERT`, `UPDATE` or `DELETE` statement. 35 In full CTEs, the named statements can be referred to unlimited times 36 throughout the rest of the statement. To preserve the referential integrity of 37 the names of the statements, the execution engine must ensure that each impure 38 clause (either one that modifies data or one that uses an impure builtin 39 function) is not executed more than once. CTEs increase the expressivity of SQL 40 by adding some syntactic sugar and new access patterns: 41 42 1. Statements with many subqueries can be made more readable by extracting the 43 subqueries into named, top-level CTE clauses 44 2. The results of `INSERT/UPSERT/UPDATE/DELETE` statements that use `RETURNING` 45 can be captured in a named result set, which is not possible with ordinary 46 subuqeries 47 3. Recursive statements for tasks like graph traversal can be written with a CTE 48 clause that references itself, using the `WITH RECURSIVE` syntax 49 4. Statements can reference a named CTE clause more than once, enabling patterns 50 that join complex datasets to themselves 51 52 CTEs are a frequently requested feature for CockroachDB, both for compatibility 53 and developer quality-of-life reasons. Implementing CTEs in full is quite 54 involved, because of the requirement that each CTE clause is referentially 55 transparent. Impure statements can have arbitrarily large result sets, so 56 temporarily accumulating these result sets for use by other statements requires 57 infrastructure for temporary tables to ensure that CockroachDB doesn't run out 58 of memory while executing them. 59 60 However, many CTE use cases only need the syntactic sugar provided in points 1 61 and 2 above. None of the queries mentioned in CockroachDB's CTE issue #7029 use 62 features missing from the proposed subset, for example. Also, several ORMs 63 including ActiveRecord use single-use CTEs as part of their schema introspection 64 routines. 65 66 Therefore, this proposal aims to implement just the syntactic sugar in points 1 67 and 2 above, additionally imposing the restriction that each CTE clause may not be 68 referred to as a data source more than once. We believe that providing this 69 subset of CTEs will be so beneficial for our users that we shouldn't delay 70 implementing the feature until after temporary tables are available at some 71 undetermined time in the future. 72 73 An 74 [initial implementation](https://github.com/cockroachdb/cockroach/pull/20359) 75 of this subset was straightforward and weighed in at less than 300 lines of new 76 non-test code. 77 78 # Guide-level explanation 79 80 Implementing this proposal would enable the `WITH` statement syntax for 81 unlimited named result sets per statement, as long as each result set is not 82 accessed more than once. 83 84 The following syntax sketch aims to show the salient features of the CTE syntax: 85 86 ``` 87 WITH name1 AS (<dml_with_results>), 88 name2 AS (<dml_with_results> that can reference name1), 89 name3 AS (<dml_with_results> that can reference name1 and name2), 90 ... 91 nameN AS (<dml_with_results> that can reference all names above) 92 <dml> that can reference all names above 93 ``` 94 95 where `<dml_with_results>` is any `SELECT` or `INSERT`, `UPSERT`, `UPDATE` or 96 `DELETE` that uses the `RETURNING` clause, and `<dml>` is any of the above 97 statements with or without `RETURNING`. 98 99 The following example demonstrates the query-factoring capability of CTEs: 100 101 ``` 102 --- Original query: 103 104 INSERT INTO v SELECT * FROM 105 (SELECT c FROM u JOIN 106 (SELECT a, b FROM t WHERE a < 5) AS x(a, b) 107 ON u.a = x.a WHERE b > 10) 108 AS y 109 110 --- CTE'd equivalent of the above: 111 WITH x(a, b) AS (SELECT a, b FROM t WHERE a < 5), 112 y AS (SELECT c from u JOIN x ON u.a = x.a WHERE b > 10) 113 INSERT INTO v SELECT * from y 114 ``` 115 116 The second version is more readable, since the subquery nesting is replaced 117 with a lateral set of result set declarations. 118 119 Here's an example with `RETURNING` clauses in the CTE clauses: 120 ``` 121 WITH x AS (INSERT INTO t(a) VALUES(1) RETURNING a), 122 y(c) AS (DELETE FROM u WHERE a IN (x) RETURNING b), 123 z AS (SELECT a FROM v WHERE a < 10) 124 SELECT * FROM z JOIN y ON z.a = y.c; 125 ``` 126 127 In this example, the outputs of an `INSERT` and `DELETE` statement are each 128 used as a named result set, something that's not possible with ordinary 129 subqueries. 130 131 Each CTE clause can itself have nested `WITH` clauses or subqueries, in which 132 case the names from the outer CTE are still made available to the inner 133 queries. For example: 134 135 ``` 136 WITH x AS (SELECT c FROM a), 137 y AS (SELECT d FROM b WHERE e IN (SELECT p from c WHERE q IN x) 138 SELECT * FROM y WHERE d > 5; 139 ``` 140 141 In this case, the subquery in clause `y` can still reference the clause `x` 142 from the outer CTE. 143 144 Each clause can only reference named result sets that were defined before in 145 the statement, and clauses can't reference themselves. Additionally, each 146 result set can only be used as a data source once by subsequent CTE clauses and 147 the main statement clause. For example, the following CTEs would not be 148 supported by the proposal: 149 150 ``` 151 --- Sum the integers from 1 to 10 152 --- Not supported: clauses can't reference themselves. 153 WITH RECURSIVE t(n) AS ( 154 SELECT 1 155 UNION ALL 156 SELECT n+1 FROM t 157 ) 158 SELECT SUM(n) FROM t LIMIT 10; 159 160 --- Not supported: can't reference a clause more than once. 161 WITH x(a) AS (SELECT a FROM t), 162 y(b) AS (SELECT a + 1 FROM x) 163 SELECT * FROM x JOIN y ON x.a = y.b; 164 ``` 165 166 As a more realistic example, implementing this proposal would permit 167 CockroachDB to execute the complete sample query suggested by the popular CTE 168 blog post 169 [The Best Postgres Feature You're Not Using – CTEs Aka WITH Clauses](http://www.craigkerstiens.com/2013/11/18/best-postgres-feature-youre-not-using/): 170 171 ```sql 172 --- Initial query to grab project title and tasks per user 173 WITH users_tasks AS ( 174 SELECT 175 users.id as user_id, 176 users.email, 177 array_agg(tasks.name) as task_list, 178 projects.title 179 FROM 180 users, 181 tasks, 182 project 183 WHERE 184 users.id = tasks.user_id 185 projects.title = tasks.project_id 186 GROUP BY 187 users.email, 188 projects.title 189 ), 190 191 --- Calculates the total tasks per each project 192 total_tasks_per_project AS ( 193 SELECT 194 project_id, 195 count(*) as task_count 196 FROM tasks 197 GROUP BY project_id 198 ), 199 200 --- Calculates the projects per each user 201 tasks_per_project_per_user AS ( 202 SELECT 203 user_id, 204 project_id, 205 count(*) as task_count 206 FROM tasks 207 GROUP BY user_id, project_id 208 ), 209 210 --- Gets user ids that have over 50% of tasks assigned 211 overloaded_users AS ( 212 SELECT tasks_per_project_per_user.user_id, 213 214 FROM tasks_per_project_per_user, 215 total_tasks_per_project 216 WHERE tasks_per_project_per_user.task_count > (total_tasks_per_project / 2) 217 ) 218 219 SELECT 220 email, 221 task_list, 222 title 223 FROM 224 users_tasks, 225 overloaded_users 226 WHERE 227 users_tasks.user_id = overloaded_users.user_id 228 ``` 229 230 This query is executable by the single-use CTE implementation since every named 231 result set is used at most once. As you can see, trying to represent this 232 massive query as a single nested set of subselects would be much more difficult 233 to read. This improvement in readability is a major reason why CTEs are 234 popular, and something that we could easily provide today without waiting for 235 temporary tables. 236 237 # Reference-level explanation 238 239 CockroachDB already supports using arbitrary plans as plan data sources, so the 240 bulk of the implementation of the single-use CTEs is adding syntactic sugar and 241 a name resolution mechanism for CTE subclauses. 242 243 Specifically, the planner will be augmented to include a naming environment 244 that is composed of one frame per CTE statement. Each frame will consist of a 245 mapping from CTE clause name to a query plan for executing the statement 246 corresponding to the name, as well as any column aliases the user might have 247 provided. The environment will be treated as a stack. Planning a CTE pushes 248 a new frame onto the stack, and finishing planning that CTE pops the frame. 249 250 Data source resolution will be augmented to search through the naming 251 environment from the top of the stack to the bottom before searching for 252 tables. CTE names can shadow table names and other CTE names in an outer scope, 253 but an individual CTE statement can't contain more than one clause with any 254 given name. 255 256 To enforce this proposal's restrictions, the naming environment will also 257 include a flag on each named clause that is set when it is used as a data 258 source in another clause or the main statement. This flag will allow the 259 planner to detect when a query tries to reference a table more than once and 260 return a suitable error. 261 262 Because of this proposal's restrictions, temporary table infrastructure is not 263 necessary as each CTE clause will stream its output to the plan that references 264 it just like an ordinary CockroachDB plan tree. 265 266 Performance of planning ordinary changes will not be meaningfully affected, 267 since the naming environment doesn't have to get checked if its empty. 268 269 ## Drawbacks 270 271 Despite the fact that completing this proposal would provide strictly more 272 functionality to our users, it might be risky to ship an incomplete version of 273 common table expressions from an optics perspective. We wouldn't want to give 274 users the impression that we don't care about fully implementing features that 275 we claim to support. 276 277 This risk can be mitigated by setting expectations carefully in the docs and 278 feature release notes. As long as we don't claim to have full CTE support, 279 people won't be unduly surprised when they can't use some of the more complex 280 functionality that CTEs offer. 281 282 ## Rationale and Alternatives 283 284 This design is a simple incremental step toward providing common table 285 expressions. Whether or not we choose to ship this partial CTE implementation, 286 it's a good idea to start with this simpler set of functionality to establish a 287 baseline for testing. 288 289 As an alternative, we could punt on CTEs entirely until temporary tables are 290 available for internal use, and then deliver a full implementation of CTEs all 291 at once. 292 293 The impact of waiting to implement this functionality is that we might turn 294 away potential users that expect to be able to use CTEs. 295 296 ## Unresolved questions 297 298 None. 299 300 ## Future work 301 302 Implementing the rest of CTEs has 2 stages: temporary storage to enable 303 multi-use clauses, and execution engine changes to enable `WITH RECURSIVE`. 304 305 Temporary storage may be less involved than fully implementing temporary 306 tables. DistSQL processors can be configured to use temporary storage in a 307 processor if necessary, so it's possible that all this will take is plugging 308 the temporary storage infrastructure into the local execution engine in front 309 of clauses that need to be used more than once, or waiting until the two 310 engines are merged. 311 312 `WITH RECURSIVE` will take some additional thought, and possibly another RFC.