github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20171025_interleaved_table_joins.md (about) 1 - Feature Name: Interleaved Table Joins 2 - Status: completed 3 - Start Date: 2017-10-03 4 - Authors: Richard Wu 5 - RFC PR: [#19028](https://github.com/cockroachdb/cockroach/pull/19028) 6 - Cockroach Issue: [#18948](https://github.com/cockroachdb/cockroach/issues/18948) 7 8 # Table of contents 9 10 * [Table of contents](#table-of-contents) 11 * [Summary](#summary) 12 * [Motivation](#motivation) 13 * [Guide-level explanation](#guide-level-explanation) 14 * [Terminology](#terminology) 15 * [CockroachDB User](#cockroachdb-user) 16 * [CockroachDB Contributor](#cockroachdb-contributor) 17 * [Examples](#examples) 18 * [Reference-level explanation](#reference-level-explanation) 19 * [Detailed design](#detailed-design) 20 * [Processors implementation](#processors-implementation) 21 * [Reading component](#reading-component) 22 * [Joining component](#joining-component) 23 * [Planning implementation](#planning-implementation) 24 * [Logical planning](#logical-planning) 25 * [Local execution engine](#local-execution-engine) 26 * [Distributed execution engine (DistSQL)](#distributed-execution-engine-distsql) 27 * [Drawbacks](#drawbacks) 28 * [Rationale and Alternatives](#rationale-and-alternatives) 29 * [[1] Scanning the interleaved hierarchy](#1-scanning-the-interleaved-hierarchy) 30 * [[2] Logical planning > physical planning](#2-logical-planning--physical-planning) 31 * [[3] Generalizing planning](#3-generalizing-planning) 32 * [[3a] Common ancestor joins](#3a-common-ancestor-joins) 33 * [[3b] Prefix joins](#3b-prefix-joins) 34 * [Canonical prefix join example](#canonical-prefix-join-example) 35 * [Approach 1 (2-pass, one scan)](#approach-1-2-pass-one-scan) 36 * [Approach 2 (2-pass, least memory, O(2k) scans)](#approach-2-2-pass-least-memory-o2k-scans) 37 * [[3c] Subset joins](#3c-subset-joins) 38 * [[4] Multi-table joins](#4-multi-table-joins) 39 * [[5] Theta Joins](#5-theta-joins) 40 * [[6] InterleaveReader and MergeJoiner](#6-interleavereader-and-mergejoiner) 41 * [Planning](#planning) 42 * [Reading component](#reading-component-1) 43 * [Joining component](#joining-component-1) 44 * [Drawbacks of this approach](#drawbacks-of-this-approach) 45 * [[7] Index joins](#7-index-joins) 46 * [[8] Avoiding splits inside interleaves](#8-avoiding-splits-inside-interleaves) 47 * [Unresolved questions](#unresolved-questions) 48 49 # Summary 50 51 We [currently permit 52 users](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html) to 53 specify if a child table should be interleaved on-disk under its specified 54 parent table. This feature was adopted from [Google Spanner's interleaved 55 tables](https://cloud.google.com/spanner/docs/schema-and-data-model#creating_interleaved_tables) 56 and was meant to be an optimization the user could opt in for tables that are 57 often queried in a parent-child relationship (i.e. one-to-one, one-to-many). 58 Refer to the [RFC on interleaved 59 tables](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160624_sql_interleaved_tables.md) 60 for more details. 61 62 Beyond co-locating the child's key-value (kv) pairs with the parent's kv pairs 63 in the underlying Key-Value (KV) storage (which provides the significant optimization 64 of 1 phase-commit (1PC) commits for free), we do not currently take advantage 65 of this fact to optimize queries. 66 67 One type of query we can optimize for interleaved tables is joins between the 68 parent and children tables. In the context of DistSQL, instead of spawning two 69 `TableReader`s, one scan for the parent and one scan for the child table for a 70 total of two scans, and a `Joiner` processor, we can do one scan over the 71 relevant interleaved rows, then perform a join on the table rows. 72 73 This optimization has significant upside on performance, potentially lowering 74 both the number of scans - interleaved parent and child tables are 75 scanned simultaneously instead of in two separate scans - and the volume of 76 inter-node gRPC traffic for the join - rows routed by [the hash on their 77 join 78 columns](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160421_distributed_sql.md#processors) 79 cross the network to their respective join processor. 80 81 This RFC identifies two implementation phases for interleaved table joins: 82 1. Mechanism that performs the joint scan as well as the join (Processors implementation phase) 83 2. Planning interleaved table joins for a given query (Planning implementation phase) 84 85 # Motivation 86 87 We currently highlight [some of the 88 benefits](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html#benefits) 89 for interleaved tables such as joins, yet we do not currently execute them 90 differently for interleaved tables compared to non-interleaved tables. In 91 fact, since we do two separate table scans for the parent and child table in 92 an interleaved configuration, there is a performance impact to scan disjoint 93 KV ranges as [noted in the interleave table 94 docs themselves](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html#tradeoffs). 95 96 To illustrate the current behavior, let us interleave table `child` into 97 table `parent`: 98 ```sql 99 CREATE DATABASE foo; 100 101 CREATE TABLE IF NOT EXISTS parent (id INT PRIMARY KEY); 102 103 CREATE TABLE IF NOT EXISTS child (id INT, parent_id INT, PRIMARY KEY (parent_id, id)) INTERLEAVE IN PARENT parent (parent_id); 104 105 EXPLAIN SELECT * FROM child JOIN parent ON child.parent_id = parent.id; 106 ``` 107 which results in the following logical plan 108 ``` 109 +-------+------+----------------+--------------------+ 110 | Level | Type | Field | Description | 111 +-------+------+----------------+--------------------+ 112 | 0 | join | | | 113 | 0 | | type | inner | 114 | 0 | | equality | (parent_id) = (id) | 115 | 0 | | mergeJoinOrder | +"(parent_id=id)" | 116 | 1 | scan | | | 117 | 1 | | table | child@primary | 118 | 1 | | spans | ALL | 119 | 1 | scan | | | 120 | 1 | | table | parent@primary | 121 | 1 | | spans | ALL | 122 +-------+------+----------------+--------------------+ 123 ``` 124 125 Users expecting a performance boost for parent-child joins (even simple ones 126 like the above) are instead experiencing a performance hit. 127 128 # Guide-level explanation 129 130 It is worth noting how SQL rows are mapped to KV pairs, specifically how data 131 for a given table is represented as key-value pairs on the primary index (refer to [this blog 132 post](https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/)). 133 134 ## Terminology 135 136 - **logical plan**: what an SQL query is transformed into after parsing, type 137 checking, name resolution, and query optimization. A logical plan is 138 concretely represented as a (logical) plan tree (tree of `planNode`s) in 139 CockroachDB. 140 - **physical plan**: what [the distributed execution 141 engine](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160421_distributed_sql.md) 142 transforms the logical plan into. It is employed to generate and schedule 143 flows which are composed of processors which are the functional units that 144 defer computation to the data nodes. 145 - **parent-child relationship**: when table `C` is interleaved into table `P`, 146 then `C` is the child table of the parent table `P`. `P` and `C` have a 147 parent-child relationship where `C` is a direct "descendant" of `P` (and `P` 148 a direct "ancestor" of `C`). 149 - **interleaved table**: child table that is interleaved into its parent primary index 150 via the `CREATE TABLE ... INTERLEAVE IN PARENT` rule. This is concretely an 151 **interleaved primary index** (see next point). 152 - **interleaved index**: the primary or secondary index of a child table that 153 is interleaved into the parent table's primary index. Interleaved secondary 154 indexes are created via the `CREATE INDEX ... INTERLEAVE IN PARENT` rule. 155 - **interleave prefix**: the primary key (consisting of 1+ columns) of the 156 parent table and its corresponding prefix of the interleaved index of the 157 child table. In the [Motivation](#motivation) example, the interleave prefix 158 is `(child.parent_id)` or `(parent.id)`. Note the prefix may or may not be a 159 proper prefix (it is possible for the child table to have no additional 160 primary key columns). 161 - **(grand-)\*children**: the set of all direct and indirect descendants of the 162 specified table. That is: all tables/indexes that are interleaved together 163 with the specified (ancestor) table except itself. 164 - **(grand-)\*parent**: similar to **(grand-)\*children** except with ancestors 165 instead of descendants. 166 - **root table**: parent table that is __not__ interleaved into another 167 table (it is the root of its interleaved hierarchy). 168 - **interleaved hierarchy**: root table and all its (grand-)\*children. For a 169 given database, there may exist 0 or more interleaved hierarchies. 170 - **interleaf (pl. interleaves)**: a parent row with a primary key and all (grand-)\*children 171 rows with the same interleave prefix. 172 173 ## CockroachDB User 174 175 From a CockroachDB user perspective, there is no apparent feature change. Join 176 performance will be improved for joins between interleaved tables and their 177 parent tables. Since we do not employ any significant query rewriting or 178 optimization, the implementation of how we recognize interleaved table joins in 179 the Planning phase will heavily dictate how we advertise this performance 180 improvement. 181 182 Specifically, the first iteration will permit more efficient joins between 183 tables with a parent-child relationship. Any table that is often joined with 184 a parent table (e.g. via a foreign key) should be interleaved in the the parent table. 185 186 In general, tables that are consistently subject to hierarchical querying patterns 187 (i.e. queried together via multi-way joins) can see improvements with interleaving. 188 The [current docs](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html) 189 do a good job highlighting the benefits and tradeoffs of interleaved tables 190 in the general case. See [Drawbacks](#drawbacks) for caveats on advising users 191 with respect to interleaved tables. 192 193 ## CockroachDB Contributor 194 195 For CockroachDB contributors and developers, changes to the codebase are mapped 196 to their respective phases. Firstly, it is important to understand how the 197 interleaved indexes are mapped in storage. Refer to the [RFC on interleaved 198 tables](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160624_sql_interleaved_tables.md). 199 200 As for how each phase of this feature affects the codebase: 201 1. The Processors implementation phase will introduce a new processor, the 202 `InterleaveReaderJoiner` processor that functionally combines two `TableReader`s 203 (corresponding to each parent and child table) and a `MergeJoiner` 204 (since interleave prefixes are defined on the parent table's primary key, 205 there is an ordering guarantee we can take advantage of): 206 - The `InterleaveReaderJoiner` will first scan the interleaved hierarchy 207 (scoped to the relevant key span if there is a filter on the primary 208 key). It can be configured with the `TableDescriptor`s and 209 `IndexDescriptor`s of the parent and child table such that it can perform 210 a single-pass scan of the two tables with `RowFetcher`. See [\[1\] 211 Scanning the interleaved 212 hierarchy](#1-scanning-the-interleaved-hierarchy) for discussion on why 213 this is the case. 214 - There is the caveat that each `InterleaveReaderJoiner` must read full 215 interleaves (and cannot partially read an interleaf), otherwise there may 216 potentially be missing joined rows (see the [Reference-level explanation 217 section](#Reference-level-explanation)). 218 - After the rows are extracted, the joining logic is identical to that of 219 `MergeJoiner`. We can abstract the joining logic from `MergeJoiner` and incorporate 220 it into `InterleaveReaderJoiner`. 221 - See [InterleaveReader and MergeJoiner](#6-interleavereader-and-mergejoiner) for an 222 alternative design where the reader and joiner are separated into two 223 processors. 224 2. The Planning implementation phase will introduce changes to the logical plan 225 whereby interleaved joins are acknowledged. Note choosing to introduce 226 interleaved joins at the logical plan level as opposed to the physical plan 227 level is further discussed in [\[2\] Logical planning > physical 228 planning](#2-logical-planning--physical-planning). 229 - `joinNode`s will be annotated in the logical plan if it is the root of 230 a plan sub-tree, have two `scanNode` children/leaves, and: 231 - Each of the `scanNode`s correspond to a scan of the parent and child 232 tables 233 - The `joinNode` is an **equality** join on the parent's primary key (the interleave 234 prefix) (see [\[5\] Theta joins](#5-theta-joins) for inequality joins) 235 In the general case, the table with the join columns could be a common 236 ancestor (see [\[3a\] Common ancestor joins 237 Alternatives](#3a-common-ancestor-joins)) and/or the join columns could 238 be a prefix of the interleave prefix (see [\[3b\] Prefix 239 joins](#3b-prefix-joins)). 240 - Local execution engine will carry out the join as it would without interleaved 241 joins (until we decide to push down interleaved join logic to the local nodes, see 242 [\[2\] Logical planning > physical planning](#2-logical-planning--physical-planning). 243 - Distributed execution engine will plan an `InterleaveReaderJoiner` each 244 node (along with routers and streams) that has data for the relevant span 245 when a `joinNode` with an interleaved join algorithm is encountered. 246 247 Note every type of join (inner, left, right, full) can easily be supported since 248 `InterleaveReaderJoiner` employs the same joining logic as `MergeJoiner`. 249 250 ## Examples 251 252 For the query described above where we do a simple join between table `parent` 253 and table `child` where `child` is `INTERLEAVED INTO PARENT` `parent` 254 ```sql 255 SELECT * FROM child JOIN parent ON child.parent_id = parent.id 256 ``` 257 258 we previously had the logical plan 259 ``` 260 +-------+------+----------------+--------------------+ 261 | Level | Type | Field | Description | 262 +-------+------+----------------+--------------------+ 263 | 0 | join | | | 264 | 0 | | type | inner | 265 | 0 | | equality | (parent_id) = (id) | 266 | 0 | | mergeJoinOrder | +"(parent_id=id)" | 267 | 1 | scan | | | 268 | 1 | | table | child@primary | 269 | 1 | | spans | ALL | 270 | 1 | scan | | | 271 | 1 | | table | parent@primary | 272 | 1 | | spans | ALL | 273 +-------+------+----------------+--------------------+ 274 ``` 275 which is a logical plan with a `joinNode` root and two `scanNode` leaves. This 276 could very well be a `planNode` sub-tree in a more complex query. Regardless, 277 one can annotate the `joinNode` to identify the interleaved join which will 278 produce a query plan resembling 279 ``` 280 +-------+------+------------------+--------------------+ 281 | Level | Type | Field | Description | 282 +-------+------+------------------+--------------------+ 283 | 0 | join | | | 284 | 0 | | type | inner | 285 | 0 | | equality | (parent_id) = (id) | 286 | 0 | | mergeJoinOrder | +"(parent_id=id)" | 287 | 0 | | interleavePrefix | parent_id, id | 288 | 0 | | algorithmHint | interleave 289 | 1 | scan | | | 290 | 1 | | table | child@primary | 291 | 1 | | spans | ALL | 292 | 1 | scan | | | 293 | 1 | | table | parent@primary | 294 | 1 | | spans | ALL | 295 +-------+------+------------------+--------------------+ 296 ``` 297 The local execution engine can recognize the additional `interleavePrefix` 298 field and `algorithmHint = interleave` once it implements interleaved join logic. The distributed 299 execution engine can produce a physical plan with an `InterleaveReaderJoiner` 300 processor for every such tree pattern. 301 302 # Reference-level explanation 303 304 This RFC will be scoped down to: 305 1. joins between **two** tables in the interleaved hierarchy 306 2. planning for interleaved joins between parent-child tables (see future work 307 on [\[3a\] Common ancestor joins](#3a-common-ancestor-joins)) 308 3. the join columns must be exactly the interleave prefix (see future work on [\[3b\] 309 Prefix joins](#3b-prefix-joins) and [\[3c\] Subset 310 joins](#3c-subset-joins)) 311 312 Furthermore, we will discuss the two implementation phases (Processors and 313 Planning) separately since they can be orthogonally implemented. 314 315 ## Detailed design 316 317 ### Processors implementation 318 319 For the first iteration, we only care about equality joins on the entire 320 interleave prefix. If you imagine the following interleaved hierarchy 321 in storage (where each row represents a table row), where the primary 322 key of of `parent` is `pk`, the primary key of `childN` is `(pk, ckN)`, 323 and the interleaved index/interleave prefix is `(pk)`: 324 ``` 325 <pk>/.../parent_r1 326 <pk>/<ck1>/.../child1_r1 327 ... 328 <pk>/<ck1>/.../child1_r2 329 <pk>/<ck2>/.../child2_r1 330 <pk>/<ck3>/.../child3_r1 331 ... 332 <pk>/.../parent_r2 333 <pk>/<ck1>/.../child1_r3 334 <pk>/<ck2>/.../child2_r2 335 <pk>/<ck2>/.../child2_r3 336 <pk>/<ck2>/.../child2_r4 337 ... 338 ``` 339 Let's suppose we want to join `parent` and `child2` 340 ``` 341 SELECT * FROM parent JOIN child2 ON parent.pk = child2.pk 342 ``` 343 The expected output is 344 ``` 345 parent_r1 U child2_r1 346 parent_r2 U child2_r2 347 parent_r2 U child2_r3 348 parent_r2 U child2_r4 349 ... 350 ``` 351 352 We can scan top to bottom of the interleaved hierarchy and separate out 353 `parent_rN` rows and `child2_rN` rows into two streams. 354 355 #### Reading component 356 357 Looking through the implementation of `RowFetcher`, specifically the call chain 358 ``` 359 RowFetcher --> RowFetcher.NextKey --> RowFetcher.ReadIndexKey --> DecodeIndexKey (table.go) 360 ``` 361 it doesn't seem like `RowFetcher` is very permissive when it comes to allow 362 1-pass through on an interleaved hierarchy. Specfically, [`DecodeIndexKey` in 363 `table.go`](https://github.com/cockroachdb/cockroach/blob/de7337dc5ca5b4e5ee17e812c817e4bba5a449ca/pkg/sql/sqlbase/table.go#L778L780) 364 will need to be refactored so that it returns `true` for KV pairs if 365 it matches either the target `parent` index or `child2` index. This is 366 generalized to a set membership problem with a set of `N` `(tableID, indexID)` tuples if we 367 emit multiple interleaved tables in one scan or to support [multi-table 368 joins](#4-multi-table-joins)). 369 370 The refactor necessary is as follows: 371 - A new `RowFetcher` that produces rows from any of the `N` tables 372 (identified as unique `TableDescriptor`-`IndexDescriptor` pairs) 373 - `NextRow` will need to be able to return a `RowResponse` with the 374 `EncDatumRow` as well as the row's `TableDescriptor` and `IndexDescriptor`. 375 - The equivalence signature is defined as the sequence of ancestor table and 376 index IDs encoded into a `[]byte`. For example the equivalence signature 377 for the __primary index__ on `child2` interleaved into the primary index of 378 `parent` as well as __any rows__ that belong to this primary index of `child2` is 379 ``` 380 /parent/1/#/child2/1 381 ``` 382 - We can pre-compute this equivalence signature for each table. When we 383 proceed to decode each index key, we compute the equivalence signature 384 and check if it corresponds to any of the tables. 385 - Instead of trying to refactor `RowFetcher` and introducing additional 386 overhead, it is prudent to separately implement this `RowFetcher`. We 387 can eventually merge `RowFetcher` into `RowFetcher` after it is 388 determined the overhead is marginal for the 1 table case. 389 390 The [outstanding PR for `RowFetcher`](https://github.com/cockroachdb/cockroach/pull/19228) 391 has the full implementation details. 392 393 #### Joining component 394 395 The `InterleaveReaderJoiner` can branch on the 396 `TableDescriptor`-`IndexDescriptor` pair to distinguish between `parent` and 397 `child2` rows. Each join batch is defined as rows that are joined for a given 398 interleave prefix. If the `InterleaveReaderJoiner` observes a `parent` row, it 399 can start a new join batch since each unique interleave prefix is a primary key 400 on `parent` which is also unique. It memoizes this `parent` row. If a `child2` 401 row is retrieved, it joins it with the most recent `parent` with 402 [`joinerBase.render`](https://github.com/cockroachdb/cockroach/blob/c86f16f89c154797ed07012f66d4aa49b1947624/pkg/sql/distsqlrun/joinerbase.go#L175#L193). 403 This implies that `InterleaveReaderJoiner` will need to [nest `joinerBase` 404 similar to 405 `mergeJoiner`](https://github.com/cockroachdb/cockroach/blob/c86f16f89c154797ed07012f66d4aa49b1947624/pkg/sql/distsqlrun/mergejoiner.go#L33). 406 407 The joining logic becomes more complicated on joins not on the full interleave prefix 408 as detailed for [prefix joins](#3b-prefix-joins) and [subset joins](#3c-subset-joins). 409 410 ### Planning implementation 411 412 There are three areas (that each vary in the amount of work required) that are 413 relevant to implementing the planning of interleaved table joins: 414 1. Logical planning & annotating `joinNode`s 415 2. Local execution engine (little work required) 416 3. Distributed execution engine & spawning appropriate flows 417 418 #### Logical planning 419 420 Annotating `joinNode`s can be accomplished after a logical plan has been 421 created and during optimization (since one can consider identifying interleaved 422 table joins an optimization; by precedent, we did this in the `optimizePlan` 423 stage of the planner [for merge 424 joins](https://github.com/cockroachdb/cockroach/pull/17214/files)). We 425 introduce a general `algorithmHint` field on `joinNode`s to annotate. 426 427 Annotation can be accomplished when we `expandPlan` (within `optimizePlan`) and do type 428 switching on the nodes we encounter. Specifically, we'd have a helper function to 429 peek into the left and right sources to see if they satisfy the conditions 430 for an interleaved join. This is already being done for identifying merge 431 joins and merge join order: in fact this can be a cookie cutter derivative 432 of the [planning changes in the DistSQL merge join 433 PR](https://github.com/cockroachdb/cockroach/pull/17214/files#diff-03ffd4efde59eae13665287cc1193d9a). 434 435 The conditions for an interleaved join in this first iteration are: 436 1. Each of the `scanNode`s correspond to a scan of the parent and child tables 437 2. The `joinNode` is an **equality** join on the parent's primary key (the 438 interleave prefix) 439 440 We first check if the `planNode` attach to `left` and `right` are `scanNode`s. 441 We then corroborate their `InterleaveDescriptor`s (nested in `IndexDescriptor` which 442 is nested in `TableDescriptor`). Note that `InterleaveDescriptor` does in 443 fact keep track of all ancestors, thus it's rather simple to do a interleaved hierarchy traversal 444 if we want to do [common ancestor joins](#3a-common-ancestor-joins) in the future. 445 If the first ancestor's (the parent) table ID of `left` or `right` 446 corresponds to the table ID of the other table, then we've satisfied condition #1. 447 448 Note that only a scan of the interleaved index of the child table can be 449 incorporated in an interleaved join. For example, if only the child table's primary index 450 is interleaved into the parent table but the `scanNode` is a scan over one of the child's 451 secondary indexes, then an interleaved join is not appropriate. If the secondary 452 index is interleaved then we can should also [optimize index joins](#7-index-joins). 453 454 We then check if the `joinPredicate` on the join nodes have equality indices 455 on the (exact) columns in the interleave prefix. 456 457 If both conditions are met, we set `algorithmHint` to "parent-child interleaved join". 458 The annotated plan tree is then sent off to the execution engine. 459 460 #### Local execution engine 461 462 Thankfully, there is not much to do here for the first iteration of this RFC! 463 464 The [idea of spawning the `InterleaveReaderJoiner` in the local execution 465 engine](#2-logical-planning--physical-planning) is out of this RFC's scope. 466 467 #### Distributed execution engine (DistSQL) 468 469 The bulk of the logic that will setup the interleave 470 join will be in 471 [`createPlanForJoin`](https://github.com/cockroachdb/cockroach/blob/10e3751071c3b80540486d4a2f11c2d322501d42/pkg/sql/distsql_physical_planner.go#L1851). 472 If the interleave join algorithm has been selected by the logical planner, it 473 will skip [creating the individual 474 plans](https://github.com/cockroachdb/cockroach/blob/10e3751071c3b80540486d4a2f11c2d322501d42/pkg/sql/distsql_physical_planner.go#L1874L1878) 475 for the `left` and `right` `scanNode`s and invoking `MergePlans`. Instead, the 476 descriptors on the `scanNode`s will be used to construct the 477 `InterleaveReaderJoiner` processor. 478 We pass down `TableDescriptor`s, `IndexDescriptor`s and any other arguments 479 our `RowFetcher` requires to perform a single-pass read for the two tables. 480 481 The final physical plan for a three-node cluster looks something like 482 483  484 485 First the union of the spans from the two `scanNode`s (by invoking 486 [`MergeSpans`](https://github.com/cockroachdb/cockroach/blob/master/pkg/roachpb/merge_spans.go#L42)) 487 are passed into 488 [`PartitionSpans`](https://github.com/cockroachdb/cockroach/blob/62b7495302a8e06b4be3780e349d10d31e378bd7/pkg/sql/distsql_physical_planner.go#L488). 489 This will return a slice of `SpanPartition`s of length `n`, which corresponds 490 to the number of data nodes. We will eventually spawn `n` 491 `InterleaveReaderJoiner`s, one for each node (`n = 3` in the diagram above). 492 Side note: any scan over an interleaved index will always [default to 493 the span of the root table in the interleaved 494 hierarchy](https://github.com/cockroachdb/cockroach/blob/de7337dc5ca5b4e5ee17e812c817e4bba5a449ca/pkg/sql/sqlbase/table.go#L239L243) so 495 it's not strictly necessary to take the union. For future-proofing's sake, 496 we do this since it's a trivial performance impact to do a union over a set of spans during planning. 497 498 These spans will need to be "fixed" in order to prevent partial interleave 499 reads since range splits can happen in between interleaves. 500 If an interleaf is only partially read on a given node (for example, the parent 501 row is read but none of the children rows are read because they overflow into 502 the next span partition on a different node), then we will miss a few joined 503 rows. We need to first extract the parent's prefix key then call `PrefixEnd()`. 504 To do this, we need to figure out the cumulative. That is for a given `child` key 505 ``` 506 /<tableid>/<indexid>/<parent-interleave-prefix>/<interleave sentinel>/<rest-of-child-primary-key>/... 507 ``` 508 where the number of segments (`/.../`) for `<parent-interleave-prefix>` (i.e. 509 number of columns in the prefix) is `shared_prefix_len` in the child table's 510 `InterleaveDescriptor`. 511 For example, if we wanted to fix this span (generated from `PartitionSpans`) 512 (`#` is the `<interleave sentinel>`) 513 ``` 514 StartKey: /parent/1/2/#/child/2 515 EndKey: /parent/1/42/#/child/4 516 ``` 517 to 518 ``` 519 StartKey: /parent/1/2/#/child/2 520 EndKey: /parent/1/43 521 ``` 522 such that we know for certain we read all `child` rows with interleave prefix 523 `5`, then we need to first extract `/parent/1/42` from 524 `/parent/1/42/#/child/4`. We can call `encoding.PeekLength()` `3` times (once 525 for each segment) to find the total prefix length then take the prefix of 526 `/parent/1/42/#/child/4` of that length. More generally, for a descendant 527 `EndKey` 528 ``` 529 /<1st-tableid>/<1st-indexid>/<1st-index-columns>/#/<2nd-tableid>/<2nd-indexid>/2nd-index-columns/#/... 530 ``` 531 the number of times we call `encoding.PeekLength()` is 532 ``` 533 3 * count(interleave ancestors) + sum(shared_prefix_len) - 1 534 ``` 535 where the `- 1` is to not include the last `<interleave sentinel>`. 536 537 If the above "fixing" happens too often, it begs the question of avoiding 538 splits inside interleaves as much as possible. This [was mentioned briefly in 539 the initial RFC of interleaved 540 tables](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160624_sql_interleaved_tables.md). 541 See [Avoiding splits inside interleaves](#8-avoiding-splits-inside-interleaves) 542 for more detail. 543 544 ## Drawbacks 545 546 Everything discussed in this PR is more efficient than the way we do joins currently. 547 This is because we do one scan over the interleaved hierarchy instead of two 548 separate scans for each the two table in the join. 549 550 In terms of future caveats, interleaved table joins are ill-advised if the majority 551 of queries are joins on a small subset of the interleaved hierarchy. In the scope of 552 this RFC (where we focus on joins between two tables in the interleaved hierarchy), 553 one should only have a parent and child table in a given interleaved hierarchy. 554 555 Once multi-table interleaved joins are implemented (see [\[4\] Multi-table 556 joins](#4-multi-table-joins)), any scan queries 557 (whether on individual tables or joins) should be on the majority of the tables 558 in the interleaved hierarchy (to optimize the amount of relevant data read when 559 doing [the interleaved hierarchy 560 scan](#1-scanning-the-interleaved-hierarchy)). 561 562 All this being said, we should be __very cautious__ prescribing usage patterns 563 for interleaved tables. We currently do not support interleave schema changes 564 (i.e. un-interleaving a table from its parent). It would be inconsistent of 565 us if we prescribe that interleaved tables should only be used in parent-child 566 relationships presently, but amend our prescription to include deeply-nested, 567 hierarchical relationships once we implement multi-table joins and common 568 ancestor joins. 569 570 ## Rationale and Alternatives 571 572 ### [1] Scanning the interleaved hierarchy 573 574 After discussions with @jordanlewis who has been working with indexes of interleaved 575 parent and child tables, a couple of key points were brought up that will dictate 576 how we do the reading component of `InterleaveReaderJoiner` (and `RowFetcher`): 577 1. Currently, one cannot selectively scan all parent rows without scanning 578 interleaved (grand-)\*children. Similarly, one cannot selectively scan 579 all the child rows without scanning its (grand-)\*parents and 580 (grand-)\*children. 581 - For example, a query for all rows of table `parent` with primary key 582 between `2` and `42` might look up the span `parent@primary /2-/43`. This 583 scan at the `KVFetcher` level iterates through all `child` rows and 584 [decodes their index 585 keys](https://github.com/cockroachdb/cockroach/blob/62b7495302a8e06b4be3780e349d10d31e378bd7/pkg/sql/sqlbase/rowfetcher.go#L154#L155) 586 with interleave prefix also between `2` and `42`. It then decides whether 587 or not the `child` row is a `parent` row via a [comparison on its 588 `TableID` and 589 `IndexID`](https://github.com/cockroachdb/cockroach/blob/62b7495302a8e06b4be3780e349d10d31e378bd7/pkg/sql/sqlbase/table.go#L778L780). 590 - One key observation by @jordanlewis is that a secondary index can 591 be scanned and index joined with the primary rows of the parent and/or 592 child table to obtain a scan of each table. 593 - This secondary index should not be maintained automatically: a user 594 should create this secondary index if need be. 595 - The planner will be need to know when to use this secondary index and 596 an index join to perform these types of scans: index selection can be 597 augmented to impose a penalty for scans over parent interleaves. 598 - Benchmark opportunity: secondary indexes do incur a small cost on 599 `INSERT/UPDATE`s. They are also located on separate ranges (and possibly 600 separate machines) which will require sending rows over the wire to the 601 `Joiner`s. This is potentially worse than doing an entire hierarchy 602 scan and filtering for relevant rows. 603 2. If a filter is given for the parent primary key, then we can scope the scan. 604 This however still picks up all the (grand-)\*children rows of the parent. 605 606 ### [2] Logical planning > physical planning 607 608 For planning interleaved table joins, one can either introduce the branching 609 point (that is, the point at which an `InterleavedRaederJoiner` is introduced) 610 in the physical plan (processor-level in the distributed execution engine) or 611 in the logical plan (which affects both execution engines). 612 613 Currently the local execution engine only scans and propagate rows in the 614 computed spans on the data nodes to the gateway node. Note however that these 615 scans occur at the leaves of plan tree (`scanNode`s are always leaves). More 616 precisely, the `InterleaveReaderJoiner` logic described for the distributed 617 execution engine can theoretically be accomplished in the local execution 618 engine: the local implementation spawns and runs an `InterleaveReaderJoiner` 619 processor and returns the joined results. If we never do quite fuse the local 620 and distributed backends into one entity, annotating a `joinNode` to perform an 621 interleaved join if possible will give us the **flexibility** to optimize both 622 pathways. 623 624 This decision to annotate logical plans to allow for more efficient joins [has 625 precedent when we added ordering information for merge joins to 626 `joinNode`s](https://github.com/cockroachdb/cockroach/pull/17214) for the 627 distributed execution engine. 628 629 ### [3] Generalizing planning 630 631 As noted in the [Guide-level explanation section](#Guide-level-explanation), 632 one can identify the plan tree pattern that has a `joinNode` as the root and 633 two `scanNode` leaves that correspond to the parent and child tables. For the 634 simple case, one can verify that the join column(s) are on the primary key of 635 the parent table (and a prefix of the interleaved index of the child table, but 636 this is mandated by `INTERLEAVE` anyways), but this can of course be 637 generalized. 638 639 #### [3a] Common ancestor joins 640 641 When we have a parent and direct child relationship and join, we really have a 642 join on a common ancestor's primary key (the parent) between the parent and the 643 child. If we imagine the interleave relationships for a given database as a 644 forest where every top-level parent is the root of a tree (and these trees can 645 be arbitrarily deep), then a **common ancestor** for two nodes `A` and `B` is 646 defined as as node whose subtrees contain `A` and `B`. 647 648 Imagine that we wanted to perform a join between two children tables on their 649 parent's primary key. For example, suppose we have table `parent, child1, 650 child2`: 651 ```sql 652 CREATE TABLE parent (id INT PRIMARY KEY); 653 654 CREATE TABLE child1 (id1 INT, pid INT, PRIMARY KEY (pid, id1)) INTERLEAVE IN PARENT parent (pid); 655 656 CREATE TABLE child2 (id2 INT, pid INT, PRIMARY KEY (pid, id2)) INTERLEAVE IN PARENT parent (pid); 657 658 SELECT * FROM child1 JOIN child2 ON child1.pid = child2.pid; 659 ``` 660 The forest representation (with one tree) looks like 661 ``` 662 parent 663 / \ 664 / \ 665 / \ 666 child1 child2 667 ``` 668 It is complete and sound to scan the interleaved parts of `child1` and `child2` 669 and perform interleaved joins per each unique `parent` primary key. Reasoning: 670 all `child1` rows with interleave prefix `X` are interleaved under the `parent` 671 row with primary key `X` by `INTERLEAVE` constraint. Similarly all `child2` 672 rows with `X` prefix are interleaved under the same `parent` row. Therefore, 673 all `child1` and `child2` rows with equivalent `pid`s are nested under the same 674 interleave block and thus will be joined. 675 676 This concept can be arbitrarily extended to any common ancestor primary key 677 join so long as the join is exactly on the common ancestor's primary key. Since 678 the nesting depths of interleaved tables are typically no more than 1-4 (if 679 there is very deep nesting, then we'll have other issues to worry about), we 680 can simply traverse up the connected tree until we reach the root and check if 681 a common ancestor was encountered (traverse up instead of down from either 682 node since `InterleaveDescriptor` only keeps track of ancestors, not 683 descendants). 684 685 The use cases for a join on an ancestor more than one level higher in the 686 hierarchical tree is limited since it is intuitively rare for a grandchildren to 687 have a foreign key reference to its grandparent. Similarly, joining siblings on 688 their parents' key is essentially a cross product join per each unique primary 689 parent key, which is hard to imagine as a common query. A multi-table join 690 between parent-child-grandchild is conceivable (see [Multi-table 691 joins](#4-multi-table-joins)). 692 693 #### [3b] Prefix joins 694 695 Prefix joins involves joining on a proper prefix of the primary key of 696 the parent (which is a prefix of the interleave prefix). 697 698 ##### Canonical prefix join example 699 700 For example suppose we table `parent` that 701 has primary key (PK) `(pk1, pk2, pk3)` and table `child` that has primary key 702 (PK) `(pk1, pk2, pk3, c_pk1)`. 703 704 The simple case we've been dealing with is a join on (exclusively) columns 705 `pk1`, `pk2`, and `pk3`. 706 707 It was proposed that we could possibly do a join on a prefix of the 708 interleave prefix. Suppose we wanted to join on `pk1` and 709 `pk2`. For clarity's sake, let us imagine we have the parent key schema 710 ``` 711 /<pk1>/<pk2>/<pk3>/... 712 ``` 713 and the interleaved child key schema 714 ``` 715 /<pk1>/<pk2>/<pk3>/<interleaved sentinel>/<c_pk1>/... 716 ``` 717 where the `interleaved sentinel` is `#` (in practice, it is the first byte in 718 `KeyEncDescending` such that it is always sorted after the parent KV pairs). 719 Example KV pairs (each representing an entire table row) for the parent and 720 child tables will be stored as (where child rows are indented) 721 ``` 722 ... 723 /1/1/2/... 724 /1/1/2/#/2/... 725 /1/1/2/#/4/... 726 /1/1/3/... 727 /1/1/3/#/1/... 728 /1/1/3/#/3/... 729 /1/1/3/#/7/... 730 /1/2/3/... 731 /1/2/3/#/5/... 732 /1/2/9/... 733 /1/2/9/#/5/... 734 /1/2/9/#/7/... 735 ... 736 ``` 737 Joining on `pk1` and `pk2` would result in the following parent-child KV pair joins 738 ``` 739 --------------------------------- 740 | parent (key) | child (key) | 741 --------------------------------- 742 | /1/1/2/... | /1/1/2/#/2/... | 743 | /1/1/2/... | /1/1/2/#/4/... | 744 | /1/1/2/... | /1/1/3/#/1/... | 745 | /1/1/2/... | /1/1/3/#/3/... | 746 | /1/1/2/... | /1/1/3/#/7/... | 747 | /1/1/3/... | /1/1/2/#/2/... | 748 | /1/1/3/... | /1/1/2/#/4/... | 749 | /1/1/3/... | /1/1/3/#/1/... | 750 | /1/1/3/... | /1/1/3/#/3/... | 751 | /1/1/3/... | /1/1/3/#/7/... | 752 | ... | .... | 753 -------------------------------- 754 ``` 755 and similarly for the joins where `(pk1, pk2) = (1, 2)`. 756 757 Since the primary index 758 for the parent table is ordered, two 2-pass approaches using a merge-join 759 pattern e-merges here: 760 761 TODO(richardwu): From my understanding, if one does not know the exact 762 `/pk1/pk2` start and end keys (i.e. one can't tell RowFetcher/kvBatchFetcher to scan 763 the first, second, etc. /pk1/pk2/ range), then one would have to scan the 764 entire hierarchy. The bases for the following approaches revolve around the 765 fact that we can't do separate scans for each unique `(pk1, pk2)` prefix. 766 767 ##### Approach 1 (2-pass, one scan) 768 This approach uses 769 ``` 770 O(# of parent+child rows for given (pk1, pk2) prefix) 771 ``` 772 memory and 1 span scan: 773 1. Scan top to bottom through the entire interleaved hierarchy 774 2. For every new `(pk1, pk2)` prefix: 775 - `NextRow()` and store parent/child row in their respective buffers until 776 we see a new `(pk1, pk2)` prefix 777 - Perform a full (cross product) join between all parent and child rows, 778 emitting newly joined rows 779 780 ##### Approach 2 (2-pass, least memory, O(2k) scans) 781 If the memory requirements of approach 1 becomes too unwieldy, one can consider 782 storing either the child rows or parent rows in one buffer (deciding which will 783 require table statistics or a general heuristic). This requires 784 ``` 785 min{ 786 O(# of child rows for given (pk1, pk2) prefix), 787 O(# of parent rows for given (pk1, pk2) prefix), 788 } 789 ``` 790 memory. WLOG, let's store the parent rows in our buffer. **This approach 791 relies on the assumption that RocksDB can efficiently cache the requested 792 span and subsequent overlapping scans of the initial span. @petermattis notes 793 that RocksDB should be able to cache these large blocks of SSTables**: 794 1. Initialize span as the entire interleaved hierarchy 795 2. Scan through the rows for a given `(pk1, pk2)` prefix: 796 - For every parent row, store in buffer 797 - For every child row, discard 798 - Once a new `(pk1, pk2)` prefix is encountered, mark the current 799 `(pk1, pk2)` prefix for the next scan 800 3. Scan through the the current `(pk1, pk2)` span again 801 - Join each encountered child row with the cached parent rows and emit 802 4. Begin a new scan with the next `(pk1, pk2)` prefix (go to #2) 803 804 To illustrate, let's assume that there are **three** unique `(pk1, pk2)` prefixes. This 805 forms three blocks of our overall span: block A, B, and C. 806 1. The first scan is over the entire span (A + B + C). 807 2. Once the first key of B is encountered, we perform a scan on block A again 808 and perform our join/emit. 809 3. We resume scanning block B + C. 810 4. Once the first key of C is encountered, we perform a scan on block B again 811 and perform our join/emit. 812 5. We resume scanning block C. We reach the end of our span scan. 813 814 For `k` unique prefixes, this requires O(2k) scans. __Assuming every scan 815 is retrieved from the RocksDB cache, we only do one read from disk.__ 816 817 #### [3c] Subset joins 818 819 Subset joins are joins on a strict subset of the interleave prefix 820 columns that is not a prefix itself. For example, if `(pk1, pk2, pk3)` is the 821 interleave prefix, then 822 ``` 823 {(pk2), (pk3), (pk1, pk3), (pk2, pk3)} 824 ``` 825 is its corresponding set of subsets. 826 827 There is no optimized way __specific to interleaved tables__ to performing subset 828 joins (that I can think of). If you refer to the [Canonical prefix join 829 example](#canonical-prefix-join-example), a join on `(pk1, pk3)` would involve 830 some disjoint cross joins which is hard to optimize in a general way. We could 831 of course use a hybrid merge-hash join for a join on `(pk1, pk3)`: we'd perform 832 a hash-join (hashed on `pk3`) on all rows for each `pk1` (see [this 833 issue for more details](https://github.com/cockroachdb/cockroach/issues/16580)). 834 835 ### [4] Multi-table joins 836 837 For a given interleaved hierarchy, we might fold in multiple joins from 838 more than two tables in the hierarchy. Suppose we have the hierarchy 839 ``` 840 customers 841 orders 842 items 843 ``` 844 We may want retrieve all `items` that `customers` have purchased joined with 845 their customer information (stored in `customers`). The corresponding 846 query might look like 847 ```sql 848 SELECT * FROM customers 849 JOIN 850 SELECT * FROM orders 851 JOIN items ON orders.id = items.order_id 852 ON customers.id = orders.cus_id 853 ``` 854 855 Ideally, this three-table join be combined into one join with one scan of the 856 interleaved hierarchy. `InterleaveReaderJoiner` would read rows from `N = 3` tables-indexes 857 and perform a merge join on all three tables (this is essentially a 3-way cross product, 858 which can be implemented with 3 loops nested in a hierarchy). 859 860 ### [5] Theta Joins 861 862 Theta joins, or joins on columns with predicates that are not all equivalence predicates, 863 are a little more complicated with interleaved table joins. I haven't fully 864 grok-ed the implications of theta joins for interleaved tables, so I'll leave it 865 open for now. 866 867 ### [6] InterleaveReader and MergeJoiner 868 869 After some initial discussion in the RFC PR, we came to the realization that 870 the goal of this RFC is to avoid scanning the entire interleaved hierarchy twice 871 with two independent `TableReader` processors. An alternative solution proposed 872 was to have some `InterleaveReader` that emits both `parent` and `child` rows 873 in the simple case. The joining logic is identical to that of `MergeJoiner` and 874 the rows from `InterleaveReader` can be piped to `MergeJoiner`. 875 876 In fact, the [prefix](#3b-prefix-joins) and [subset joins](#3c-subset-joins) 877 are general cases for `MergeJoiner` that is current being tracked by [this 878 issue](https://github.com/cockroachdb/cockroach/issues/16580). 879 880 This alternative architecture would look something like 881 882  883 884 #### Planning 885 886 Instead of specifying a hash router for each `TableReader` instance to their 887 respective `MergeJoiner`s (hashed by the equality columns), the two output streams 888 of the `InterleaveReader` will be mapped directly to the inputs of the 889 `MergeJoiner` processor. Two `OutputRouterSpec`s for `PASSTHROUGH` routers are 890 specified for each `InterleaveReader`. They each correspond to the output streams 891 for parent and child rows. This is accomplished by setting the appropriate 892 fields (i.e. `SourceProcessor`, `DestProcessor`, `DestInput`) fields of 893 `distsqlplan.Stream` and adding it to the plan. Since `distsqlplan.Stream` 894 currently does not support multiple source outputs for a given `SourceProcessor`, 895 we will also need to introduce `SourceOutput int` (similar to how `DestInput int` 896 permits multiple input streams). 897 898 #### Reading component 899 900 `InterleaveReader` also requires reading from `RowFetcher` and piping the 901 rows to each of its streams. Since `InterleaveReader` "owns" its `N` output 902 streams (rather than letting some `Router` message-broker direct the output), a 903 similar buffer logic will be [required to prevent 904 deadlocks](https://github.com/cockroachdb/cockroach/issues/17097). The deadlock happens 905 when `MergeJoiner` (`streamMerger` to be precise) tries to [batch all rows from both sides with the 906 current join column 907 values](https://github.com/cockroachdb/cockroach/blob/70c96175bb5b60bf6d531ab3bcebd2bb723d0bc7/pkg/sql/distsqlrun/stream_merger.go#L37#L45). 908 It will try to retrieve parent rows until it sees that the values for the join 909 columns increase (in order to form a batch of parent rows with the same join 910 column values). However, this would not be possible if the next row after the 911 parent row is a child row since the `InterleavedReader` reads top to bottom. Thus 912 we would need to buffer enough child rows until `streamMerger` can get the next 913 parent row which has a different primary key. 914 915 ``` 916 parent_r1 <- first row in parent batch, NextRow until parent row with a different primary key 917 child_r1 <- need to buffer since NextRow is called 918 child_r2 <- buffered 919 child_r3 <- buffered 920 parent_r2 <- parent row with a different primary key, finish parent batch 921 and move on to child batch (child_r1 - r3 is unblocked and sent) 922 ``` 923 924 In the case of a join on the entire interleave prefix, buffering is __okay__ 925 since there is at most 1 parent row for 0+ child rows. We would thus buffer at 926 most 1 child row in order to retrieve the next parent row (that has a larger 927 primary key, finishing off the batch). In the case where we do a [prefix 928 join](3b-prefix-joins), parent rows are separated by blocks of 0+ child rows, 929 thus we will need to buffer the child rows in order to retrieve all the parent 930 rows for a given batch. 931 932 This buffering logic needs to be [abstracted from 933 `routerBase`](https://github.com/cockroachdb/cockroach/blob/de7337dc5ca5b4e5ee17e812c817e4bba5a449ca/pkg/sql/distsqlrun/routers.go#L160L164) 934 and imported into `InterleaveReader`. 935 936 #### Joining component 937 938 The actual joining logic in `MergeJoiner` should be agnostic to whether the 939 rows came from two `TableReader`s or one `InterleaveReader`. 940 941 One of the drawbacks of delegating the joining logic to a generic `MergeJoiner` 942 is, in the simple case, having to buffer batches of child rows even though we 943 know there is only one parent row to join and we can stream these joined rows. 944 945 For example, the second batch for `parent` and `child2` looks like 946 ``` 947 <parent batch> <child batch> 948 parent_r2 child2_r2 949 child2_r3 950 child2_r4 951 ``` 952 Instead of buffering the entire batch, we know for certainty that 953 we need only buffer the one parent row and stream-join the child rows. 954 955 One way we can fix this is to refactor `streamMerger` to recognize 956 that if either the `left` or `right` batch has one row, we can stream join 957 the other row. 958 959 For the scope of the first iteration, we will defer this optimization. 960 961 If we wanted to perform [multi-table joins](#4-multi-table-joins) efficiently, 962 we could consider an `InterleaveJoiner` (or `MultiMergeJoiner`) that can merge 963 `N` arbitrary batches. 964 965 #### Drawbacks of this approach 966 967 There were a number of cons to this design compared to a single `InterleaveReaderJoiner` 968 processor, namely: 969 - We need to buffer child rows in `InterleaveReader` OR refactor `streamMerger` to 970 read rows from left and right concurrently to avoid deadlocking. Note 971 refactoring `streamMerger` doesn't solve the "prefix joins" or "subset" case: 972 we still need to buffer/store rows on the `InterleaveReader` side. 973 - We need to extract buffering logic from routers. 974 - Batches are unavoidable in `streamMerger` unless we hint to `MergeJoiner` 975 during planning that the parent side contains only 1 row per batch (for joins 976 on the full interleave prefix). If it does contain 1 row on one side we can 977 skip batching and simply stream joined rows. 978 - We need to refactor `Stream`s to handle multiple inputs (from `InterleaveReader`). 979 980 ### [7] Index joins 981 982 Although we [do not publicly advertise](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html) 983 interleaving secondary indexes under primary indexes, this is possible 984 with the syntax 985 ```sql 986 CREATE INDEX secondary_idx ON foo (id, data) INTERLEAVE IN PARENT foo (id) 987 ``` 988 We can thus identify interleaved joins on `indexJoinNode`s (where `left` and `right` 989 `scanNode`s in `joinNode`s are now `index` and `table` `scanNode`s in `indexJoinNode`s) 990 and perform efficient index joins. Index joins would employ the same processor (`InterleaveReaderJoiner`) 991 and the planning implementation will be symmetric to that of `joinNode`s. 992 993 ### [8] Avoiding splits inside interleaves 994 995 We would like to refactor [`MVCCFindSplitKey`](https://github.com/cockroachdb/cockroach/blob/424a396a7681b048439c940207027d9eb7976a85/pkg/storage/engine/mvcc.go#L2269#L2297) 996 such that it avoids splitting inside interleaves (i.e. a split between two 997 children rows that are nested under the same parent row). 998 999 We have [similar logic for avoiding splits in between 1000 rows](https://github.com/cockroachdb/cockroach/blob/424a396a7681b048439c940207027d9eb7976a85/pkg/storage/engine/mvcc.go#L2296) 1001 where they may be multiple family kv pairs. In short, 1002 [`EnsureSafeSplitKey`](https://github.com/cockroachdb/cockroach/blob/424a396a7681b048439c940207027d9eb7976a85/pkg/keys/keys.go#L668) 1003 simply truncates the split key obtained from the call to `MVCCFindSplitKey` in 1004 `libroach/db.cc` to the row prefix. For example, if the given key is chosen as 1005 the split key 1006 ``` 1007 /Table/50/1/<column id>/1 1008 ^ family id 1009 ``` 1010 it simply reassigns the split key as 1011 ``` 1012 /Table/50/1/<column id> 1013 ``` 1014 which would split the range such that the first key in row `<column id>` would 1015 be the first key in the right range (ranges are specified with an inclusive 1016 start key and an exclusive end key). 1017 1018 To apply this for interleaves too, we can branch on whether the row belongs to 1019 an interleaved table or not. If it's not an interleave table the `GetRowPrefixLength` 1020 procedure will take place. If the row belongs to an interleave table, we can invoke 1021 `GetTablePrefixLength`, where instead of taking the prefix up to and including 1022 the `column id`s, it would instead take the prefix up to the last table's index 1023 ID. The resulting split key from the above example is 1024 ``` 1025 /Table/50/1 1026 ``` 1027 We do this by decoding the key from left to right until the index of the last index ID 1028 (that is the last index ID after encountering an `<interleave sentinel>`). 1029 1030 To check whether the row belongs to an interleave table, we can either check 1031 for `<interleave sentinel>`s or plumb a flag from above. 1032 1033 There is an [outstanding 1034 issue](https://github.com/cockroachdb/cockroach/issues/19296) with ranges 1035 failing to split since `EnsureSafeSplitKey` would naively default to the 1036 beginning of the row (which may be the first key in the range), where there may 1037 be a better split point at the end of the row. The fix @a-robinson will 1038 implement should work fine with interleave boundaries too. 1039 1040 ## Unresolved questions 1041 1042 - Prefix joins: approach #2 is equivalent to or strictly better than 1043 approach #1, under the assumption that RocksDB caches things properly. Is 1044 approach #1 the safer option since we won't know how RocksDB cache behaves in 1045 production (i.e. when it runs out of cache space, since the amount RocksDB 1046 will need to cache is far greater than what we need cached). The point of 1047 approach #2 is if RocksDB caches our blocks anyways, then we don't need to 1048 store as much in memory.