github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20170926_sql_aux_data.md (about) 1 - Feature Name: Auxiliary leaf data in SQL abstract trees 2 - Status: in-progress 3 - Start Date: 2017-09-04 4 - Authors: knz, jordan 5 - RFC PR: [#18204](https://github.com/cockroachdb/cockroach/pull/18204) 6 - Cockroach Issue: (one or more # from the issue tracker) 7 8 # Summary 9 10 This RFC proposes to extract leaf data from the tree data 11 structures used to represent SQL syntax and logical plans, and instead: 12 13 - host the values in slices in a context data structure passed as 14 argument to the functions where the values are needed/used; 15 - in the tree, instead store an index into that slice. 16 17 Why: this generally increases performance in several areas, and 18 incidentally+serendipitously removes the cause for a sore point that 19 prevented progress on the IR RFC (#10055). 20 21 How: this is a mechanical, easy to review code substitution in the 22 `sql/parser` and `sql` packages. 23 24 Impact: performance + enables further IR work. 25 26 # Motivation 27 28 tl;dr: hosting leaf data outside of the logical tree makes 29 things generally simpler and cleaner, which is desirable overall. 30 31 Why is this so? Granted, it is hard to recognize this to be true in 32 general. Concretely, there are four different sorts of leaf 33 data items in SQL trees, and for each of them the motivation for this 34 RFC can be phrased in a different way, which I detail below. However, 35 after reading the four motivations the reader can satisfy themselves 36 that the "tl'dr" summary above is adequate. 37 38 The four sorts are, in decreasing order of "how well they justify this 39 RFC and the corresponding changes": 40 41 - placeholders 42 - subqueries 43 - column data references ("indexed vars") 44 - datums 45 46 ## Motivation for placeholders 47 48 Currently in CockroachDB placeholders in the input (`$1`, `$2` etc. in 49 prepared queries) are replaced by a `Placeholder` instance in the 50 tree: 51 52 ```go 53 type Placeholder struct { 54 Name string 55 typeAnnotation 56 } 57 ``` 58 59 During type checking, the `typeAnnotation` member gets filled in with 60 the inferred/specified type for the placeholder. This field is 61 subsequently used by the `ResolvedType()` method whenever the type is 62 needed. 63 64 Then, when the prepared query is executed, the entire AST is 65 *rewritten* so that each `Placeholder` instance is replaced by the 66 `Datum` for the value provided by the client. Since CockroachDB avoids 67 in-place modifications to AST nodes, this rewrite requires 68 re-allocating a fresh object on the heap for every ancestor of a 69 placeholder up to the root of the query AST. 70 71 What's wrong with this? 72 73 - when a placeholder is used multiple times (e.g. `select $1 where x > 74 $1`), the type information is stored multiple times in the AST. 75 - the tree rewrite at each execute of a prepared query is expensive 76 and puts pressure on Go's heap allocator and GC. 77 78 ## Motivation for subqueries 79 80 Subqueries in SQL can be of two sorts: 81 82 - subqueries as data sources, e.g. `select * from (select * from 83 kv)` - these are simply nested selects and are treated very 84 efficiently: their logical plan is simply embedded as the suitable 85 operand in the enclosing query's own plan. 86 - subqueries as expressions. For example with `select * from kv limit 87 (select v from kv where k=2)`. Here the value of the result of 88 evaluating the subquery must be known before the rest of the plan 89 can be executed. 90 91 This latter case is currently handled as follows: 92 93 - when a query plan is initially constructed, first all the 94 expressions are traversed, and any sub-query expression is replaced 95 by an instance of `sql.subquery` (or `parser.SubqueryPlaceholder` 96 once #18094 is merged). *This object is embedded as leaf 97 data in the expression tree!* 98 - when execution of a plan starts, again all expressions are 99 traversed, and any sub-query is executed in turn to completion, and 100 the query's results are used to *replace* the `subquery` object in 101 the expression tree. Since CockroachDB avoids in-place modifications 102 to AST nodes, this rewrite requires re-allocating a fresh object on 103 the heap for every ancestor of a subquery up to the root of the 104 query AST. 105 - the EXPLAIN statement must also traverse all expressions in a logical 106 plan to "fish out" the logical plans of subqueries and embed them in the EXPLAIN output. 107 108 What's wrong with this? 109 110 - the various stages of subquery handling are really algorithms of the 111 form "for every subquery, *no matter where it appears*, do X". It is 112 silly to have to recursively traverse all expressions, *including 113 those that definitely do not contain subqueries*, to apply these 114 algorithms. 115 - the rewrite of the subquery results, when applicable, is expensive 116 towards Go's heap allocator and GC. 117 118 ## Motivation for column data referencs ("indexed vars") 119 120 Currently in CockroachDB an early transform called "name resolution" 121 will replace any column reference by name (e.g. "`k`" in `select k 122 from foo`) or by ordinal reference (e.g. `@1` in `select @1 from foo`) 123 by an instance of `IndexedVar`: 124 125 ```go 126 type IndexedVar struct { 127 Idx int 128 container IndexedVarContainer 129 } 130 ``` 131 132 This object contains the index of the column inside the "current data 133 source context" (usually: the current table; this is only more complex 134 with joins and UPSERT). It *also* contains a pointer to some other object, somewhere, that is able to: 135 136 - serve the type of the column (via the `ResolvedType()` method); 137 - serve the value of the column for the current row (via the `Eval()` method); 138 - serve a representation of the column reference (via the `Format()` method). 139 140 This `container` fields requires some care during transforms: if an 141 expression is migrated from one level of a logical plan to another (a 142 common occurrence during optimizations), the `container` field must be 143 suitably rewritten. 144 145 This happens pretty often actually (`BindIfUnbound()` and `Rebind()`). Two aspects of note: 146 147 - during initial name resolution, `BindIfUnbound()` is called, and 148 this actually overwrites `container` in-place. This incidentally 149 violates the rule that ASTs should be immutable once constructed! 150 And has caused bugs (and probably is causing bugs, due to our 151 inability to assert immutability). 152 - `Rebind()` is called many times for the same expression in 153 moderately complex queries (most notoriously during filter 154 propagation), each time requires a full traversal of the entire 155 expression tree, and if an expression is found requires a 156 substitution. And as you can expect if you've read the two sections 157 above: since CockroachDB avoids in-place modifications to AST nodes, 158 this rewrite requires re-allocating a fresh object on the heap for 159 every ancestor of a subquery up to the root of the query AST. 160 161 What's wrong with this? 162 163 - `BindIfUnbound()` violates the immutability contract; 164 - if the same column reference is used multiple times at the same 165 level of a logical plan, there are multiple redundant references 166 (`container` copies) to the data source. 167 - `Rebind()` exerts pressure on the Go heap allocator and GC. 168 169 ## Motivation for Datums 170 171 (This motivating section is a bit less evident for the casual reader 172 with less experience with CockroachDB's SQL codebase. It is also not 173 terribly important since the 3 sections above already motivate the 174 common underlying pattern. Feel free to skip to the next sub-section.) 175 176 CockroachDB currently defines 17 elementary Go types to hold SQL 177 values (`DInt`, `DString`, `DInterval`, etc.), increasing to 18 178 when #18171 merges, and possibly increasing further as we extend pg 179 compatibility. Moreover, we are eventually planning to let users 180 define their own value types. 181 182 It is certainly instructive to ask: "Why?" 183 184 For example, TiDB uses a single Go struct for all the values, storing 185 the actual value in different fields of that struct depending on the 186 semantic type of the SQL expression. (This is not where this RFC is 187 going but this observation highlights that the current approach in 188 CockroachDB is not trivially necessary.) 189 190 There are two motivations for using separate Go types: 191 192 - Go requires that values that are composed of a reference to 193 something else (e.g. `DString`, or a slice reference as in `DByte`), 194 and values that are composed of simple "value bits" (e.g. `DInt`, 195 `DFloat`) are stored in variables that have distinct Go types, 196 because Go's GC needs to differentiate statically value and 197 reference variables. 198 - in many expression transformation algorithms in CockroachDB we must 199 discriminate in conditionals based on the semantic type of SQL 200 values. It so happens that Go is rather good at branching on type 201 tags ("type switches" in Go's jargon) so this vaguely suggests 202 separate Go types for each SQL value type. 203 204 What's wrong with this? 205 206 - `Datum` types are actually embedded in SQL expression trees using an 207 interface reference (`Datum` and/or `Expr`). This means that even 208 "value" types are never embedded as-is in the expression tree and 209 must be allocated on the heap instead. Worse even, a SQL value that 210 is itself a reference (e.g. `DBytes`) then requires a double 211 indirection: once to retrieve the Datum object from the Expr 212 reference, then another one to access the data behind that datum's 213 reference. Given the ubiquity of values this merits some attention 214 if a simplification is possible. 215 - the value types are also used to report values in *result rows* 216 during plan execution, behind a `Datum` reference. In Go this means 217 that each time a row tuple is constructed, Go must assemble a 218 reference to the value together with the vtable pointer of the 219 `Datum` interface, to construct a `Datum` reference in each position 220 of the result tuple. Then whenever this value is consumed somewhere 221 else (either at a different level in the logical plan or when the 222 results are converted towards pgwire), Go must check that the type 223 cast is valid, this constitutes run-time overhead. One should note 224 here that this type dance is entirely *unnecessary*: from one row to 225 the next, the type of the SQL value for a given result column is 226 *always the same* (*). Storing then checking the Go type information in 227 memory for every cell of every row in a plan's result column is 228 horrendously redundant. 229 230 (*) Regarding the handling of NULL values: NULL is a value that 231 inhabits every type. That is, `NULL::int` and `NULL::string` are two 232 different things in SQL. Even in a column where NULL *values* can 233 occur, all values in the columns should still be of the same type. 234 This is not currently true in CockroachDB, some suggestions are made 235 below to arrive there. 236 237 ## Synthesis: what's the underlying problematic pattern? 238 239 Discriminating different things in a tree data structure in Go using 240 different Go types implementing a common "node interface" is a 241 textbook design pattern. It is simple to understand, simple to 242 implement, simple to recognize, and "just works". 243 244 However it breaks down when any of the following conditions apply: 245 246 - the application domain strongly favors/recommends immutable trees, 247 but the logical values in the trees require replacement between 248 uses. This requires an expensive "rewrite via new allocation" traversal. 249 - some algorithms need to perform an action for every node in a tree of a 250 specific type, without knowning in advance where these values are, 251 and this types happens to be rather uncommon in trees in 252 practice. This causes "work for nothing": the common case pays the 253 price of full tree traversals even when the application domain tells us 254 in advance that these full traversals are usually unnecessary. 255 - when many objects *are known to have the same type* (e.g. values for 256 the same column in many rows, multiple instances of the same 257 placeholders in different places of an expression) it is wasteful to 258 duplicate this type information across all instances. 259 260 In CockroachDB, all three conditions apply. So there's a problem. 261 262 # Guide-level explanation 263 264 When this proposal is implemented, CockroachDB will replace the 265 "embedding" of special values in IR trees by a simple integer value, 266 to serve as index in an array of appropriate type outside of the tree. 267 268 For example, before: 269 270 ```go 271 e := BinExpr{Left: IndexedVar{Idx:1}, Right: IndexedVar{Idx:2}, Op: Add} 272 273 // Link IndexedVars to containers. 274 ivarHelper := MakeIndexedVarHelper(container) 275 e = ivarHelper.Rebind(e) 276 277 // Use the Expr. 278 fmt.Println(e.String()) // shows "x + y" instead of "@1 + @2" 279 280 // This uses: 281 // type IndexedVar struct {Idx int; container IndexedVarContainer} 282 // type Expr interface { Format(buf *bytes.Buffer); String() string }; 283 // (X).String() calls (X).Format() for every X implementing Expr; 284 // (iv *IndexedVar).Format() calls iv.container.FormatIndexedVar(iv.Idx). 285 ``` 286 287 After: 288 289 ```go 290 e := BinExpr{Left: IndexedVarIdx(1), Right: IndexedVarIdx(2), Op: Add} 291 292 // Use the Expr. 293 fmt.Println(Format(e, container)) // shows "x + y" instead of "@1 + @2" 294 295 // This uses: 296 type IndexedVarIdx int 297 type Expr { Format(buf *bytes.Buffer, container IndexedVarContainer) ... } 298 func Format(e, container) { 299 e.Format(buf, container) 300 } 301 func (iv IndexedVarIdx) Format(buf, container) { 302 container.FormatIndexedVar(int(iv)) 303 } 304 ``` 305 306 In general, the tree structure will store only an integer, and the 307 resolution of that integer to the "thing" that it logically refers to 308 is only performed at the point of use, not stored in the tree 309 directly. This enables minimal data storage in the tree and efficient 310 substitutions of the corresponding values without having to mutate the 311 tree. 312 313 # Reference-level explanation 314 315 ## Detailed design 316 317 | Current code | New code | Notes | 318 |--------------|----------|-------| 319 | `type IndexedVar struct {...}` | `type IndexedVarIdx int` | | 320 | `type Placeholder struct {name, typ}` | `type PlaceholderIdx string` | (1) | 321 | `type subquery struct {...}` | `type subquery int` | | 322 | `type Datum interface { Expr; ... }` | `type Datum int` | (4) | 323 | `(NodeFormatter).Format(buf, f)` | `(NodeFormatter).Format(ctx, buf)` | (2) (3) | 324 | `(TypedExpr).ResolvedType()` | `(TypedExpr).ResolvedType(ctx)` | (3) | 325 | `(Datum).AmbiguousFormat()` | `(Datum).AmbiguousFormat(ctx)` | (3) | 326 | `(Datum).Prev()` | `(Datum).Prev(ctx)` | (3) | 327 | `(Datum).Next()` | `(Datum).Prev(ctx)` | (3) | 328 | `(Datum).IsMin()` | `(Datum).IsMin(ctx)` | (3) | 329 | `(Datum).IsMax()` | `(Datum).IsMax(ctx)` | (3) | 330 | `(Datum).min()` | `(Datum).min(ctx)` | (3) | 331 | `(Datum).max()` | `(Datum).max(ctx)` | (3) | 332 | `(Datum).Size()` | `(Datum).Size(ctx)` | (3) | 333 334 Notes: 335 336 1. about placeholders: CockroachDB currently identifies placeholders 337 by name. This is because the Postgres protocol, in principle, 338 allows placeholders with arbitrary names not just numbers. In 339 practice however, we never encountered a client that does so, and 340 the CockroachDB code even contains an assertion on the initial 341 lexing of placeholder names to force them to be numerical. Perhaps 342 it is time to drop the idea to name placeholders and instead number 343 them, which will in turn make the data structure even smaller and 344 more efficient to use (lookups using an array instead of a map). 345 We can note here that placeholders are always "dense" in practice 346 (all the placeholder between $1 and $max are used), so this 347 optimization will not create memory inefficiencies. 348 349 2. the `FmtFlags` argument is replaced by a `FormattingContext` struct 350 reference which contains both the formatting flags / overrides and 351 (a reference to) the semantic context. 352 353 3. the new semantic context reference passed through the recursive 354 interface API gives the method implementations access to the arrays 355 where they can look up the values from the numeric 356 Datum/subquery/indexedvar/placeholder references. 357 358 4. Regarding the handling of NULL. If we wish to keep the current 359 implementation semantics which treats NULL values in trees as 360 always untyped (`DNull` never has a type other than itself), we can 361 use the value -1 to encode it as an integer. If we wish to change 362 this and make NULL a member of every type, and have datum slots 363 have a type next to a NULL value, then the index can refer to a 364 value slot with no data, and we can separately introduce a bitmap 365 of which datums in the value slots should be interpreted as SQL 366 NULLs. 367 368 ## Opportunity for logical plans 369 370 Currently in CockroachDB the logical links between stages of logical 371 plans are implemented using simple Go references. For example, a 372 `joinNode` is a struct with two members `left` and `right` each of 373 type `planNode`, an interface, and this can be dereferenced in memory 374 to get access to another Go struct. 375 376 Meanwhile, most of the "interesting" optimization algorithms for SQL 377 queries make use of the notion of *equivalence classes* for logical 378 plans: two (sub-)trees in a logical plans that are semantically 379 equivalent (same columns, same result rows) can be substituted for one 380 another "for free", and the different candidate equivalent plans 381 should be kept in memory side by side while the optimization 382 measures/decides which one to keep. 383 384 This strongly suggests to implement the link between logical plans not 385 as a simple Go reference to another plan, but instead as a reference 386 to a "equivalence class" which in data structure terms would be 387 something like a "set" (probably an array in practice), itself 388 containing references to actual trees. 389 390 The issue here is that the particular data structure(s) to be used to 391 represent equivalence classes may change between optimization 392 algorithms. We wouldn't want to change the code to traverse logical 393 plans and otherwise manipulate them (i.e. change the IR language that 394 defines logical plans) every time we consider a different way to 395 represent equivalence classes. 396 397 However, behold! The pattern presented in this RFC can be once again 398 reused here. A link to a logical sub-tree in a logical plan can be 399 stored as an integer. Then the algorithms that traverse the logical 400 plan can take a context argument, and use that context to look up 401 sub-tree from these integer values. This enables decoupling the 402 implementation of equivalence classes from the implementation of 403 logical plan traversals that are not particular to optimization 404 algorithms. 405 406 ## Drawbacks 407 408 - Slight deviation from 100-level data structure textbooks. (Although 409 the new proposed pattern is not really unknown to 200- or 300-level 410 compiler courses). 411 - There's a new additional function argument to some of the expression 412 methods. The potential corresponding run-time overhead is expected 413 to be offset by the overall cache utilization gains of storing much 414 less redundant information side-by-side in memory, together with the 415 lower GC activity. 416 417 ## Rationale and Alternatives 418 419 The motivation section above provides the majority of the rationale 420 for this change. Even if the motivation section for Datums isn't as 421 transparent as the three others, there is impetus to do something 422 about Datums suggested by the query compilation RFC (#16686). 423 424 Additionally, another motivation emerges from the IR RFC (#10055): we 425 need leaf data using Go types that cannot be expressed easily using 426 basic types (e.g. when the types belong to external Go packages, like 427 we do for `DDecimal`). This causes a difficulty when planning to 428 implement/deploy a code generator from a [simple type 429 language](20170517_algebraic_data_types.md): the code generator should 430 then be extended in several non-trival ways to support allocating, 431 manipulating and traversing objects at run-time which it knows little 432 about; the input definition language must be extended to specify these 433 external types; and the testing and validation story becomes much more 434 murky. This was a major unresolved question in the IR RFC in #10055 435 until this point. By subjecting all non-trivial leaf data to the 436 treatment advertised in this RFC, this complexity is side-stepped 437 entirely. 438 439 Several alternatives were considered: 440 441 - do nothing: further SQL development and run-time performance slowly 442 grinds to a halt (hyperbolic perhaps, but it highlights the trend). 443 - rewrite CockroachDB's SQL layer in a different programming language 444 which enables programmers to control allocations and avoid paying 445 run-time overhead for typing when it's known to be safe. This 446 alternative, if considered, would necessarily be a longer-term 447 endeavour. Also even if the generated code would be overall much 448 more heap-efficient with tree rewrites, it would still require 449 algorithms to traverse trees even when it's not needed. 450 - extend the ADT RFC to support a much richer type system, and express 451 all the tree data using that. This more or less amounts to 452 re-inventing a programming language, which is a lot more 453 effort. Also it does not really solve the problem of redundant 454 data. Also even if the generated code would be overall much more 455 heap-efficient with tree rewrites, it would still require algorithms 456 to traverse trees even when it's not needed. 457 - when expressions are analyzed, store a copy of the reference to the 458 "special" nodes (placeholders/subqueries/datums/indexedvars) into 459 the semantic context. This way algorithms that need to do "for every 460 X, do Y" can find all the X's in the semantic context and avoid 461 expression traversal. This does not eliminate information redundancy 462 however, nor the overhead of type casts. 463 464 The reviewers are invited to suggest additional alternatives if they 465 can see any! 466 467 ## Unresolved questions 468 469 None.