github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20170926_sql_aux_data.md

github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20170926_sql_aux_data.md (about)

     1  - Feature Name: Auxiliary leaf data in SQL abstract trees
     2  - Status: in-progress
     3  - Start Date: 2017-09-04
     4  - Authors: knz, jordan
     5  - RFC PR: [#18204](https://github.com/cockroachdb/cockroach/pull/18204)
     6  - Cockroach Issue: (one or more # from the issue tracker)
     7  
     8  # Summary
     9  
    10  This RFC proposes to extract leaf data from the tree data
    11  structures used to represent SQL syntax and logical plans, and instead:
    12  
    13  - host the values in slices in a context data structure passed as
    14    argument to the functions where the values are needed/used;
    15  - in the tree, instead store an index into that slice.
    16  
    17  Why: this generally increases performance in several areas, and
    18  incidentally+serendipitously removes the cause for a sore point that
    19  prevented progress on the IR RFC (#10055).
    20  
    21  How: this is a mechanical, easy to review code substitution in the
    22  `sql/parser` and `sql` packages.
    23  
    24  Impact: performance + enables further IR work.
    25  
    26  # Motivation
    27  
    28  tl;dr: hosting leaf data outside of the logical tree makes
    29  things generally simpler and cleaner, which is desirable overall.
    30  
    31  Why is this so? Granted, it is hard to recognize this to be true in
    32  general. Concretely, there are four different sorts of leaf
    33  data items in SQL trees, and for each of them the motivation for this
    34  RFC can be phrased in a different way, which I detail below. However,
    35  after reading the four motivations the reader can satisfy themselves
    36  that the "tl'dr" summary above is adequate.
    37  
    38  The four sorts are, in decreasing order of "how well they justify this
    39  RFC and the corresponding changes":
    40  
    41  - placeholders
    42  - subqueries
    43  - column data references ("indexed vars")
    44  - datums
    45  
    46  ## Motivation for placeholders
    47  
    48  Currently in CockroachDB placeholders in the input (`$1`, `$2` etc. in
    49  prepared queries) are replaced by a `Placeholder` instance in the
    50  tree:
    51  
    52  ```go
    53  type Placeholder struct {
    54  	Name string
    55  	typeAnnotation
    56  }
    57  ```
    58  
    59  During type checking, the `typeAnnotation` member gets filled in with
    60  the inferred/specified type for the placeholder. This field is
    61  subsequently used by the `ResolvedType()` method whenever the type is
    62  needed.
    63  
    64  Then, when the prepared query is executed, the entire AST is
    65  *rewritten* so that each `Placeholder` instance is replaced by the
    66  `Datum` for the value provided by the client. Since CockroachDB avoids
    67  in-place modifications to AST nodes, this rewrite requires
    68  re-allocating a fresh object on the heap for every ancestor of a
    69  placeholder up to the root of the query AST.
    70  
    71  What's wrong with this?
    72  
    73  - when a placeholder is used multiple times (e.g. `select $1 where x >
    74    $1`), the type information is stored multiple times in the AST.
    75  - the tree rewrite at each execute of a prepared query is expensive
    76    and puts pressure on Go's heap allocator and GC.
    77  
    78  ## Motivation for subqueries
    79  
    80  Subqueries in SQL can be of two sorts:
    81  
    82  - subqueries as data sources, e.g. `select * from (select * from
    83    kv)` - these are simply nested selects and are treated very
    84    efficiently: their logical plan is simply embedded as the suitable
    85    operand in the enclosing query's own plan.
    86  - subqueries as expressions. For example with `select * from kv limit
    87    (select v from kv where k=2)`. Here the value of the result of
    88    evaluating the subquery must be known before the rest of the plan
    89    can be executed.
    90  
    91  This latter case is currently handled as follows:
    92  
    93  - when a query plan is initially constructed, first all the
    94    expressions are traversed, and any sub-query expression is replaced
    95    by an instance of `sql.subquery` (or `parser.SubqueryPlaceholder`
    96    once #18094 is merged). *This object is embedded as leaf
    97    data in the expression tree!*
    98  - when execution of a plan starts, again all expressions are
    99    traversed, and any sub-query is executed in turn to completion, and
   100    the query's results are used to *replace* the `subquery` object in
   101    the expression tree. Since CockroachDB avoids in-place modifications
   102    to AST nodes, this rewrite requires re-allocating a fresh object on
   103    the heap for every ancestor of a subquery up to the root of the
   104    query AST.
   105  - the EXPLAIN statement must also traverse all expressions in a logical
   106    plan to "fish out" the logical plans of subqueries and embed them in the EXPLAIN output.
   107  
   108  What's wrong with this?
   109  
   110  - the various stages of subquery handling are really algorithms of the
   111    form "for every subquery, *no matter where it appears*, do X". It is
   112    silly to have to recursively traverse all expressions, *including
   113    those that definitely do not contain subqueries*, to apply these
   114    algorithms.
   115  - the rewrite of the subquery results, when applicable, is expensive
   116    towards Go's heap allocator and GC.
   117  
   118  ## Motivation for column data referencs ("indexed vars")
   119  
   120  Currently in CockroachDB an early transform called "name resolution"
   121  will replace any column reference by name (e.g. "`k`" in `select k
   122  from foo`) or by ordinal reference (e.g. `@1` in `select @1 from foo`)
   123  by an instance of `IndexedVar`:
   124  
   125  ```go
   126  type IndexedVar struct {
   127  	Idx       int
   128  	container IndexedVarContainer
   129  }
   130  ```
   131  
   132  This object contains the index of the column inside the "current data
   133  source context" (usually: the current table; this is only more complex
   134  with joins and UPSERT). It *also* contains a pointer to some other object, somewhere, that is able to:
   135  
   136  - serve the type of the column (via the `ResolvedType()` method);
   137  - serve the value of the column for the current row (via the `Eval()` method);
   138  - serve a representation of the column reference (via the `Format()` method).
   139  
   140  This `container` fields requires some care during transforms: if an
   141  expression is migrated from one level of a logical plan to another (a
   142  common occurrence during optimizations), the `container` field must be
   143  suitably rewritten.
   144  
   145  This happens pretty often actually (`BindIfUnbound()` and `Rebind()`). Two aspects of note:
   146  
   147  - during initial name resolution, `BindIfUnbound()` is called, and
   148    this actually overwrites `container` in-place. This incidentally
   149    violates the rule that ASTs should be immutable once constructed!
   150    And has caused bugs (and probably is causing bugs, due to our
   151    inability to assert immutability).
   152  - `Rebind()` is called many times for the same expression in
   153    moderately complex queries (most notoriously during filter
   154    propagation), each time requires a full traversal of the entire
   155    expression tree, and if an expression is found requires a
   156    substitution. And as you can expect if you've read the two sections
   157    above: since CockroachDB avoids in-place modifications to AST nodes,
   158    this rewrite requires re-allocating a fresh object on the heap for
   159    every ancestor of a subquery up to the root of the query AST.
   160  
   161  What's wrong with this?
   162  
   163  - `BindIfUnbound()` violates the immutability contract;
   164  - if the same column reference is used multiple times at the same
   165    level of a logical plan, there are multiple redundant references
   166    (`container` copies) to the data source.
   167  - `Rebind()` exerts pressure on the Go heap allocator and GC.
   168  
   169  ## Motivation for Datums
   170  
   171  (This motivating section is a bit less evident for the casual reader
   172  with less experience with CockroachDB's SQL codebase. It is also not
   173  terribly important since the 3 sections above already motivate the
   174  common underlying pattern. Feel free to skip to the next sub-section.)
   175  
   176  CockroachDB currently defines 17 elementary Go types to hold SQL
   177  values (`DInt`, `DString`, `DInterval`, etc.), increasing to 18
   178  when #18171 merges, and possibly increasing further as we extend pg
   179  compatibility. Moreover, we are eventually planning to let users
   180  define their own value types.
   181  
   182  It is certainly instructive to ask: "Why?"
   183  
   184  For example, TiDB uses a single Go struct for all the values, storing
   185  the actual value in different fields of that struct depending on the
   186  semantic type of the SQL expression. (This is not where this RFC is
   187  going but this observation highlights that the current approach in
   188  CockroachDB is not trivially necessary.)
   189  
   190  There are two motivations for using separate Go types:
   191  
   192  - Go requires that values that are composed of a reference to
   193    something else (e.g. `DString`, or a slice reference as in `DByte`),
   194    and values that are composed of simple "value bits" (e.g. `DInt`,
   195    `DFloat`) are stored in variables that have distinct Go types,
   196    because Go's GC needs to differentiate statically value and
   197    reference variables.
   198  - in many expression transformation algorithms in CockroachDB we must
   199    discriminate in conditionals based on the semantic type of SQL
   200    values. It so happens that Go is rather good at branching on type
   201    tags ("type switches" in Go's jargon) so this vaguely suggests
   202    separate Go types for each SQL value type.
   203  
   204  What's wrong with this?
   205  
   206  - `Datum` types are actually embedded in SQL expression trees using an
   207    interface reference (`Datum` and/or `Expr`). This means that even
   208    "value" types are never embedded as-is in the expression tree and
   209    must be allocated on the heap instead. Worse even, a SQL value that
   210    is itself a reference (e.g. `DBytes`) then requires a double
   211    indirection: once to retrieve the Datum object from the Expr
   212    reference, then another one to access the data behind that datum's
   213    reference. Given the ubiquity of values this merits some attention
   214    if a simplification is possible.
   215  - the value types are also used to report values in *result rows*
   216    during plan execution, behind a `Datum` reference. In Go this means
   217    that each time a row tuple is constructed, Go must assemble a
   218    reference to the value together with the vtable pointer of the
   219    `Datum` interface, to construct a `Datum` reference in each position
   220    of the result tuple. Then whenever this value is consumed somewhere
   221    else (either at a different level in the logical plan or when the
   222    results are converted towards pgwire), Go must check that the type
   223    cast is valid, this constitutes run-time overhead. One should note
   224    here that this type dance is entirely *unnecessary*: from one row to
   225    the next, the type of the SQL value for a given result column is
   226    *always the same* (*). Storing then checking the Go type information in
   227    memory for every cell of every row in a plan's result column is
   228    horrendously redundant.
   229  
   230  (*) Regarding the handling of NULL values: NULL is a value that
   231  inhabits every type. That is, `NULL::int` and `NULL::string` are two
   232  different things in SQL. Even in a column where NULL *values* can
   233  occur, all values in the columns should still be of the same type.
   234  This is not currently true in CockroachDB, some suggestions are made
   235  below to arrive there.
   236  
   237  ## Synthesis: what's the underlying problematic pattern?
   238  
   239  Discriminating different things in a tree data structure in Go using
   240  different Go types implementing a common "node interface" is a
   241  textbook design pattern. It is simple to understand, simple to
   242  implement, simple to recognize, and "just works".
   243  
   244  However it breaks down when any of the following conditions apply:
   245  
   246  - the application domain strongly favors/recommends immutable trees,
   247    but the logical values in the trees require replacement between
   248    uses. This requires an expensive "rewrite via new allocation" traversal.
   249  - some algorithms need to perform an action for every node in a tree of a
   250    specific type, without knowning in advance where these values are,
   251    and this types happens to be rather uncommon in trees in
   252    practice. This causes "work for nothing": the common case pays the
   253    price of full tree traversals even when the application domain tells us
   254    in advance that these full traversals are usually unnecessary.
   255  - when many objects *are known to have the same type* (e.g. values for
   256    the same column in many rows, multiple instances of the same
   257    placeholders in different places of an expression) it is wasteful to
   258    duplicate this type information across all instances.
   259  
   260  In CockroachDB, all three conditions apply. So there's a problem.
   261  
   262  # Guide-level explanation
   263  
   264  When this proposal is implemented, CockroachDB will replace the
   265  "embedding" of special values in IR trees by a simple integer value,
   266  to serve as index in an array of appropriate type outside of the tree.
   267  
   268  For example, before:
   269  
   270  ```go
   271  e := BinExpr{Left: IndexedVar{Idx:1}, Right: IndexedVar{Idx:2}, Op: Add}
   272  
   273  // Link IndexedVars to containers.
   274  ivarHelper := MakeIndexedVarHelper(container)
   275  e = ivarHelper.Rebind(e)
   276  
   277  // Use the Expr.
   278  fmt.Println(e.String()) // shows "x + y" instead of "@1 + @2"
   279  
   280  // This uses:
   281  // type IndexedVar struct {Idx int; container IndexedVarContainer}
   282  // type Expr interface { Format(buf *bytes.Buffer); String() string };
   283  // (X).String() calls (X).Format() for every X implementing Expr;
   284  // (iv *IndexedVar).Format() calls iv.container.FormatIndexedVar(iv.Idx).
   285  ```
   286  
   287  After:
   288  
   289  ```go
   290  e := BinExpr{Left: IndexedVarIdx(1), Right: IndexedVarIdx(2), Op: Add}
   291  
   292  // Use the Expr.
   293  fmt.Println(Format(e, container)) // shows "x + y" instead of "@1 + @2"
   294  
   295  // This uses:
   296  type IndexedVarIdx int
   297  type Expr { Format(buf *bytes.Buffer, container IndexedVarContainer) ... }
   298  func Format(e, container) {
   299    e.Format(buf, container)
   300  }
   301  func (iv IndexedVarIdx) Format(buf, container) {
   302    container.FormatIndexedVar(int(iv))
   303  }
   304  ```
   305  
   306  In general, the tree structure will store only an integer, and the
   307  resolution of that integer to the "thing" that it logically refers to
   308  is only performed at the point of use, not stored in the tree
   309  directly. This enables minimal data storage in the tree and efficient
   310  substitutions of the corresponding values without having to mutate the
   311  tree.
   312  
   313  # Reference-level explanation
   314  
   315  ## Detailed design
   316  
   317  | Current code | New code | Notes |
   318  |--------------|----------|-------|
   319  | `type IndexedVar struct {...}` | `type IndexedVarIdx int` | |
   320  | `type Placeholder struct {name, typ}` | `type PlaceholderIdx string` | (1) |
   321  | `type subquery struct {...}` | `type subquery int` | |
   322  | `type Datum interface { Expr; ... }` | `type Datum int` | (4) |
   323  | `(NodeFormatter).Format(buf, f)` | `(NodeFormatter).Format(ctx, buf)` | (2) (3) |
   324  | `(TypedExpr).ResolvedType()` | `(TypedExpr).ResolvedType(ctx)` | (3) |
   325  | `(Datum).AmbiguousFormat()` | `(Datum).AmbiguousFormat(ctx)` | (3) |
   326  | `(Datum).Prev()` | `(Datum).Prev(ctx)` | (3) |
   327  | `(Datum).Next()` | `(Datum).Prev(ctx)` | (3) |
   328  | `(Datum).IsMin()` | `(Datum).IsMin(ctx)` | (3) |
   329  | `(Datum).IsMax()` | `(Datum).IsMax(ctx)` | (3) |
   330  | `(Datum).min()` | `(Datum).min(ctx)` | (3) |
   331  | `(Datum).max()` | `(Datum).max(ctx)` | (3) |
   332  | `(Datum).Size()` | `(Datum).Size(ctx)` | (3) |
   333  
   334  Notes:
   335  
   336  1. about placeholders: CockroachDB currently identifies placeholders
   337     by name. This is because the Postgres protocol, in principle,
   338     allows placeholders with arbitrary names not just numbers. In
   339     practice however, we never encountered a client that does so, and
   340     the CockroachDB code even contains an assertion on the initial
   341     lexing of placeholder names to force them to be numerical. Perhaps
   342     it is time to drop the idea to name placeholders and instead number
   343     them, which will in turn make the data structure even smaller and
   344     more efficient to use (lookups using an array instead of a map).
   345     We can note here that placeholders are always "dense" in practice
   346     (all the placeholder between $1 and $max are used), so this
   347     optimization will not create memory inefficiencies.
   348  
   349  2. the `FmtFlags` argument is replaced by a `FormattingContext` struct
   350     reference which contains both the formatting flags / overrides and
   351     (a reference to) the semantic context.
   352  
   353  3. the new semantic context reference passed through the recursive
   354     interface API gives the method implementations access to the arrays
   355     where they can look up the values from the numeric
   356     Datum/subquery/indexedvar/placeholder references.
   357  
   358  4. Regarding the handling of NULL. If we wish to keep the current
   359     implementation semantics which treats NULL values in trees as
   360     always untyped (`DNull` never has a type other than itself), we can
   361     use the value -1 to encode it as an integer. If we wish to change
   362     this and make NULL a member of every type, and have datum slots
   363     have a type next to a NULL value, then the index can refer to a
   364     value slot with no data, and we can separately introduce a bitmap
   365     of which datums in the value slots should be interpreted as SQL
   366     NULLs.
   367  
   368  ## Opportunity for logical plans
   369  
   370  Currently in CockroachDB the logical links between stages of logical
   371  plans are implemented using simple Go references. For example, a
   372  `joinNode` is a struct with two members `left` and `right` each of
   373  type `planNode`, an interface, and this can be dereferenced in memory
   374  to get access to another Go struct.
   375  
   376  Meanwhile, most of the "interesting" optimization algorithms for SQL
   377  queries make use of the notion of *equivalence classes* for logical
   378  plans: two (sub-)trees in a logical plans that are semantically
   379  equivalent (same columns, same result rows) can be substituted for one
   380  another "for free", and the different candidate equivalent plans
   381  should be kept in memory side by side while the optimization
   382  measures/decides which one to keep.
   383  
   384  This strongly suggests to implement the link between logical plans not
   385  as a simple Go reference to another plan, but instead as a reference
   386  to a "equivalence class" which in data structure terms would be
   387  something like a "set" (probably an array in practice), itself
   388  containing references to actual trees.
   389  
   390  The issue here is that the particular data structure(s) to be used to
   391  represent equivalence classes may change between optimization
   392  algorithms. We wouldn't want to change the code to traverse logical
   393  plans and otherwise manipulate them (i.e. change the IR language that
   394  defines logical plans) every time we consider a different way to
   395  represent equivalence classes.
   396  
   397  However, behold! The pattern presented in this RFC can be once again
   398  reused here. A link to a logical sub-tree in a logical plan can be
   399  stored as an integer. Then the algorithms that traverse the logical
   400  plan can take a context argument, and use that context to look up
   401  sub-tree from these integer values. This enables decoupling the
   402  implementation of equivalence classes from the implementation of
   403  logical plan traversals that are not particular to optimization
   404  algorithms.
   405  
   406  ## Drawbacks
   407  
   408  - Slight deviation from 100-level data structure textbooks. (Although
   409    the new proposed pattern is not really unknown to 200- or 300-level
   410    compiler courses).
   411  - There's a new additional function argument to some of the expression
   412    methods. The potential corresponding run-time overhead is expected
   413    to be offset by the overall cache utilization gains of storing much
   414    less redundant information side-by-side in memory, together with the
   415    lower GC activity.
   416  
   417  ## Rationale and Alternatives
   418  
   419  The motivation section above provides the majority of the rationale
   420  for this change. Even if the motivation section for Datums isn't as
   421  transparent as the three others, there is impetus to do something
   422  about Datums suggested by the query compilation RFC (#16686).
   423  
   424  Additionally, another motivation emerges from the IR RFC (#10055): we
   425  need leaf data using Go types that cannot be expressed easily using
   426  basic types (e.g. when the types belong to external Go packages, like
   427  we do for `DDecimal`). This causes a difficulty when planning to
   428  implement/deploy a code generator from a [simple type
   429  language](20170517_algebraic_data_types.md): the code generator should
   430  then be extended in several non-trival ways to support allocating,
   431  manipulating and traversing objects at run-time which it knows little
   432  about; the input definition language must be extended to specify these
   433  external types; and the testing and validation story becomes much more
   434  murky. This was a major unresolved question in the IR RFC in #10055
   435  until this point. By subjecting all non-trivial leaf data to the
   436  treatment advertised in this RFC, this complexity is side-stepped
   437  entirely.
   438  
   439  Several alternatives were considered:
   440  
   441  - do nothing: further SQL development and run-time performance slowly
   442    grinds to a halt (hyperbolic perhaps, but it highlights the trend).
   443  - rewrite CockroachDB's SQL layer in a different programming language
   444    which enables programmers to control allocations and avoid paying
   445    run-time overhead for typing when it's known to be safe. This
   446    alternative, if considered, would necessarily be a longer-term
   447    endeavour. Also even if the generated code would be overall much
   448    more heap-efficient with tree rewrites, it would still require
   449    algorithms to traverse trees even when it's not needed.
   450  - extend the ADT RFC to support a much richer type system, and express
   451    all the tree data using that.  This more or less amounts to
   452    re-inventing a programming language, which is a lot more
   453    effort. Also it does not really solve the problem of redundant
   454    data. Also even if the generated code would be overall much more
   455    heap-efficient with tree rewrites, it would still require algorithms
   456    to traverse trees even when it's not needed.
   457  - when expressions are analyzed, store a copy of the reference to the
   458    "special" nodes (placeholders/subqueries/datums/indexedvars) into
   459    the semantic context. This way algorithms that need to do "for every
   460    X, do Y" can find all the X's in the semantic context and avoid
   461    expression traversal. This does not eliminate information redundancy
   462    however, nor the overhead of type casts.
   463  
   464  The reviewers are invited to suggest additional alternatives if they
   465  can see any!
   466  
   467  ## Unresolved questions
   468  
   469  None.