github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20171206_single_use_common_table_expressions.md (about)

     1  - Single-Use Common Table Expressions
     2  - Status: completed
     3  - Start Date: 2017-11-30
     4  - Authors: Jordan Lewis
     5  - RFC PR: #20374
     6  - Cockroach Issue: #7029
     7  
     8  # Summary
     9  
    10  Implement a simple subset of common table expressions that permits naming
    11  result sets and using each of them at most once in a statement. Full support
    12  for common table expression (henceforth CTEs) requires temporary table
    13  infrastructure, which is currently missing from CockroachDB. This proposal aims
    14  to fast-track the implementation of the subset of CTEs that doesn't require
    15  temporary tables, providing our users with partial compatibility and query
    16  readability boons at little cost.
    17  
    18  The supported subset will be known as single-use CTEs, and consists of:
    19  - Full `WITH ... AS` syntax support, with unlimited clauses
    20  - Each clause can be referenced as a data source at most once
    21  
    22  Features that are not present in single-use CTEs:
    23  - Using a CTE more than once in a data source will not be included because
    24    implementing it requires temporary tables
    25  - Correlated CTEs will not be supported because we don't yet support correlated
    26    subqueries in general.
    27  - `WITH RECURSIVE` will not be included because implementing it requires
    28    temporary tables and a complex new execution strategy.
    29  
    30  # Motivation
    31  
    32  Common table expressions (CTEs), or `WITH` clauses, are a standard SQL 1999
    33  feature that permit naming the result sets of intermediate statements for use
    34  in a single larger `SELECT`, `INSERT/UPSERT`, `UPDATE` or `DELETE` statement.
    35  In full CTEs, the named statements can be referred to unlimited times
    36  throughout the rest of the statement. To preserve the referential integrity of
    37  the names of the statements, the execution engine must ensure that each impure
    38  clause (either one that modifies data or one that uses an impure builtin
    39  function) is not executed more than once. CTEs increase the expressivity of SQL
    40  by adding some syntactic sugar and new access patterns:
    41  
    42  1. Statements with many subqueries can be made more readable by extracting the
    43     subqueries into named, top-level CTE clauses
    44  2. The results of `INSERT/UPSERT/UPDATE/DELETE` statements that use `RETURNING`
    45     can be captured in a named result set, which is not possible with ordinary
    46     subuqeries
    47  3. Recursive statements for tasks like graph traversal can be written with a CTE
    48     clause that references itself, using the `WITH RECURSIVE` syntax
    49  4. Statements can reference a named CTE clause more than once, enabling patterns
    50     that join complex datasets to themselves
    51  
    52  CTEs are a frequently requested feature for CockroachDB, both for compatibility
    53  and developer quality-of-life reasons. Implementing CTEs in full is quite
    54  involved, because of the requirement that each CTE clause is referentially
    55  transparent. Impure statements can have arbitrarily large result sets, so
    56  temporarily accumulating these result sets for use by other statements requires
    57  infrastructure for temporary tables to ensure that CockroachDB doesn't run out
    58  of memory while executing them.
    59  
    60  However, many CTE use cases only need the syntactic sugar provided in points 1
    61  and 2 above. None of the queries mentioned in CockroachDB's CTE issue #7029 use
    62  features missing from the proposed subset, for example. Also, several ORMs
    63  including ActiveRecord use single-use CTEs as part of their schema introspection
    64  routines.
    65  
    66  Therefore, this proposal aims to implement just the syntactic sugar in points 1
    67  and 2 above, additionally imposing the restriction that each CTE clause may not be
    68  referred to as a data source more than once. We believe that providing this
    69  subset of CTEs will be so beneficial for our users that we shouldn't delay
    70  implementing the feature until after temporary tables are available at some
    71  undetermined time in the future.
    72  
    73  An
    74  [initial implementation](https://github.com/cockroachdb/cockroach/pull/20359)
    75  of this subset was straightforward and weighed in at less than 300 lines of new
    76  non-test code.
    77  
    78  # Guide-level explanation
    79  
    80  Implementing this proposal would enable the `WITH` statement syntax for
    81  unlimited named result sets per statement, as long as each result set is not
    82  accessed more than once.
    83  
    84  The following syntax sketch aims to show the salient features of the CTE syntax:
    85  
    86  ```
    87  WITH name1 AS (<dml_with_results>),
    88       name2 AS (<dml_with_results> that can reference name1),
    89       name3 AS (<dml_with_results> that can reference name1 and name2),
    90       ...
    91       nameN AS (<dml_with_results> that can reference all names above)
    92  <dml> that can reference all names above
    93  ```
    94  
    95  where `<dml_with_results>` is any `SELECT` or `INSERT`, `UPSERT`, `UPDATE` or
    96  `DELETE` that uses the `RETURNING` clause, and `<dml>` is any of the above
    97  statements with or without `RETURNING`.
    98  
    99  The following example demonstrates the query-factoring capability of CTEs:
   100  
   101  ```
   102  --- Original query:
   103  
   104  INSERT INTO v SELECT * FROM
   105    (SELECT c FROM u JOIN
   106      (SELECT a, b FROM t WHERE a < 5) AS x(a, b)
   107      ON u.a = x.a WHERE b > 10)
   108    AS y
   109  
   110  --- CTE'd equivalent of the above:
   111  WITH x(a, b) AS (SELECT a, b FROM t WHERE a < 5),
   112       y       AS (SELECT c from u JOIN x ON u.a = x.a WHERE b > 10)
   113  INSERT INTO v SELECT * from y
   114  ```
   115  
   116  The second version is more readable, since the subquery nesting is replaced
   117  with a lateral set of result set declarations.
   118  
   119  Here's an example with `RETURNING` clauses in the CTE clauses:
   120  ```
   121  WITH x AS (INSERT INTO t(a) VALUES(1) RETURNING a),
   122       y(c) AS (DELETE FROM u WHERE a IN (x) RETURNING b),
   123       z AS (SELECT a FROM v WHERE a < 10)
   124  SELECT * FROM z JOIN y ON z.a = y.c;
   125  ```
   126  
   127  In this example, the outputs of an `INSERT` and `DELETE` statement are each
   128  used as a named result set, something that's not possible with ordinary
   129  subqueries.
   130  
   131  Each CTE clause can itself have nested `WITH` clauses or subqueries, in which
   132  case the names from the outer CTE are still made available to the inner
   133  queries. For example:
   134  
   135  ```
   136  WITH x AS (SELECT c FROM a),
   137       y AS (SELECT d FROM b WHERE e IN (SELECT p from c WHERE q IN x)
   138  SELECT * FROM y WHERE d > 5;
   139  ```
   140  
   141  In this case, the subquery in clause `y` can still reference the clause `x`
   142  from the outer CTE.
   143  
   144  Each clause can only reference named result sets that were defined before in
   145  the statement, and clauses can't reference themselves. Additionally, each
   146  result set can only be used as a data source once by subsequent CTE clauses and
   147  the main statement clause. For example, the following CTEs would not be
   148  supported by the proposal:
   149  
   150  ```
   151  --- Sum the integers from 1 to 10
   152  --- Not supported: clauses can't reference themselves.
   153  WITH RECURSIVE t(n) AS (
   154      SELECT 1
   155    UNION ALL
   156      SELECT n+1 FROM t
   157  )
   158  SELECT SUM(n) FROM t LIMIT 10;
   159  
   160  --- Not supported: can't reference a clause more than once.
   161  WITH x(a) AS (SELECT a FROM t),
   162       y(b) AS (SELECT a + 1 FROM x)
   163  SELECT * FROM x JOIN y ON x.a = y.b;
   164  ```
   165  
   166  As a more realistic example, implementing this proposal would permit
   167  CockroachDB to execute the complete sample query suggested by the popular CTE
   168  blog post
   169  [The Best Postgres Feature You're Not Using – CTEs Aka WITH Clauses](http://www.craigkerstiens.com/2013/11/18/best-postgres-feature-youre-not-using/):
   170  
   171  ```sql
   172  --- Initial query to grab project title and tasks per user
   173  WITH users_tasks AS (
   174    SELECT
   175           users.id as user_id,
   176           users.email,
   177           array_agg(tasks.name) as task_list,
   178           projects.title
   179    FROM
   180         users,
   181         tasks,
   182         project
   183    WHERE
   184          users.id = tasks.user_id
   185          projects.title = tasks.project_id
   186    GROUP BY
   187             users.email,
   188             projects.title
   189  ),
   190  
   191  --- Calculates the total tasks per each project
   192  total_tasks_per_project AS (
   193    SELECT
   194           project_id,
   195           count(*) as task_count
   196    FROM tasks
   197    GROUP BY project_id
   198  ),
   199  
   200  --- Calculates the projects per each user
   201  tasks_per_project_per_user AS (
   202    SELECT
   203           user_id,
   204           project_id,
   205           count(*) as task_count
   206    FROM tasks
   207    GROUP BY user_id, project_id
   208  ),
   209  
   210  --- Gets user ids that have over 50% of tasks assigned
   211  overloaded_users AS (
   212    SELECT tasks_per_project_per_user.user_id,
   213  
   214    FROM tasks_per_project_per_user,
   215         total_tasks_per_project
   216    WHERE tasks_per_project_per_user.task_count > (total_tasks_per_project / 2)
   217  )
   218  
   219  SELECT
   220         email,
   221         task_list,
   222         title
   223  FROM
   224       users_tasks,
   225       overloaded_users
   226  WHERE
   227        users_tasks.user_id = overloaded_users.user_id
   228  ```
   229  
   230  This query is executable by the single-use CTE implementation since every named
   231  result set is used at most once. As you can see, trying to represent this
   232  massive query as a single nested set of subselects would be much more difficult
   233  to read. This improvement in readability is a major reason why CTEs are
   234  popular, and something that we could easily provide today without waiting for
   235  temporary tables.
   236  
   237  # Reference-level explanation
   238  
   239  CockroachDB already supports using arbitrary plans as plan data sources, so the
   240  bulk of the implementation of the single-use CTEs is adding syntactic sugar and
   241  a name resolution mechanism for CTE subclauses.
   242  
   243  Specifically, the planner will be augmented to include a naming environment
   244  that is composed of one frame per CTE statement. Each frame will consist of a
   245  mapping from CTE clause name to a query plan for executing the statement
   246  corresponding to the name, as well as any column aliases the user might have
   247  provided. The environment will be treated as a stack. Planning a CTE pushes
   248  a new frame onto the stack, and finishing planning that CTE pops the frame.
   249  
   250  Data source resolution will be augmented to search through the naming
   251  environment from the top of the stack to the bottom before searching for
   252  tables. CTE names can shadow table names and other CTE names in an outer scope,
   253  but an individual CTE statement can't contain more than one clause with any
   254  given name.
   255  
   256  To enforce this proposal's restrictions, the naming environment will also
   257  include a flag on each named clause that is set when it is used as a data
   258  source in another clause or the main statement. This flag will allow the
   259  planner to detect when a query tries to reference a table more than once and
   260  return a suitable error.
   261  
   262  Because of this proposal's restrictions, temporary table infrastructure is not
   263  necessary as each CTE clause will stream its output to the plan that references
   264  it just like an ordinary CockroachDB plan tree.
   265  
   266  Performance of planning ordinary changes will not be meaningfully affected,
   267  since the naming environment doesn't have to get checked if its empty.
   268  
   269  ## Drawbacks
   270  
   271  Despite the fact that completing this proposal would provide strictly more
   272  functionality to our users, it might be risky to ship an incomplete version of
   273  common table expressions from an optics perspective. We wouldn't want to give
   274  users the impression that we don't care about fully implementing features that
   275  we claim to support.
   276  
   277  This risk can be mitigated by setting expectations carefully in the docs and
   278  feature release notes. As long as we don't claim to have full CTE support,
   279  people won't be unduly surprised when they can't use some of the more complex
   280  functionality that CTEs offer.
   281  
   282  ## Rationale and Alternatives
   283  
   284  This design is a simple incremental step toward providing common table
   285  expressions. Whether or not we choose to ship this partial CTE implementation,
   286  it's a good idea to start with this simpler set of functionality to establish a
   287  baseline for testing.
   288  
   289  As an alternative, we could punt on CTEs entirely until temporary tables are
   290  available for internal use, and then deliver a full implementation of CTEs all
   291  at once.
   292  
   293  The impact of waiting to implement this functionality is that we might turn
   294  away potential users that expect to be able to use CTEs.
   295  
   296  ## Unresolved questions
   297  
   298  None.
   299  
   300  ## Future work
   301  
   302  Implementing the rest of CTEs has 2 stages: temporary storage to enable
   303  multi-use clauses, and execution engine changes to enable `WITH RECURSIVE`.
   304  
   305  Temporary storage may be less involved than fully implementing temporary
   306  tables. DistSQL processors can be configured to use temporary storage in a
   307  processor if necessary, so it's possible that all this will take is plugging
   308  the temporary storage infrastructure into the local execution engine in front
   309  of clauses that need to be used more than once, or waiting until the two
   310  engines are merged.
   311  
   312  `WITH RECURSIVE` will take some additional thought, and possibly another RFC.