github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20171025_interleaved_table_joins.md

github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20171025_interleaved_table_joins.md (about)

     1  - Feature Name: Interleaved Table Joins
     2  - Status: completed
     3  - Start Date: 2017-10-03
     4  - Authors: Richard Wu
     5  - RFC PR: [#19028](https://github.com/cockroachdb/cockroach/pull/19028)
     6  - Cockroach Issue: [#18948](https://github.com/cockroachdb/cockroach/issues/18948)
     7  
     8  # Table of contents
     9  
    10  * [Table of contents](#table-of-contents)
    11  * [Summary](#summary)
    12  * [Motivation](#motivation)
    13  * [Guide-level explanation](#guide-level-explanation)
    14     * [Terminology](#terminology)
    15     * [CockroachDB User](#cockroachdb-user)
    16     * [CockroachDB Contributor](#cockroachdb-contributor)
    17     * [Examples](#examples)
    18  * [Reference-level explanation](#reference-level-explanation)
    19     * [Detailed design](#detailed-design)
    20        * [Processors implementation](#processors-implementation)
    21           * [Reading component](#reading-component)
    22           * [Joining component](#joining-component)
    23        * [Planning implementation](#planning-implementation)
    24           * [Logical planning](#logical-planning)
    25           * [Local execution engine](#local-execution-engine)
    26           * [Distributed execution engine (DistSQL)](#distributed-execution-engine-distsql)
    27     * [Drawbacks](#drawbacks)
    28     * [Rationale and Alternatives](#rationale-and-alternatives)
    29        * [[1] Scanning the interleaved hierarchy](#1-scanning-the-interleaved-hierarchy)
    30        * [[2] Logical planning &gt; physical planning](#2-logical-planning--physical-planning)
    31        * [[3] Generalizing planning](#3-generalizing-planning)
    32           * [[3a] Common ancestor joins](#3a-common-ancestor-joins)
    33           * [[3b] Prefix joins](#3b-prefix-joins)
    34              * [Canonical prefix join example](#canonical-prefix-join-example)
    35              * [Approach 1 (2-pass, one scan)](#approach-1-2-pass-one-scan)
    36              * [Approach 2 (2-pass, least memory, O(2k) scans)](#approach-2-2-pass-least-memory-o2k-scans)
    37           * [[3c] Subset joins](#3c-subset-joins)
    38        * [[4] Multi-table joins](#4-multi-table-joins)
    39        * [[5] Theta Joins](#5-theta-joins)
    40        * [[6] InterleaveReader and MergeJoiner](#6-interleavereader-and-mergejoiner)
    41           * [Planning](#planning)
    42           * [Reading component](#reading-component-1)
    43           * [Joining component](#joining-component-1)
    44           * [Drawbacks of this approach](#drawbacks-of-this-approach)
    45        * [[7] Index joins](#7-index-joins)
    46        * [[8] Avoiding splits inside interleaves](#8-avoiding-splits-inside-interleaves)
    47     * [Unresolved questions](#unresolved-questions)
    48  
    49  # Summary
    50  
    51  We [currently permit
    52  users](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html) to
    53  specify if a child table should be interleaved on-disk under its specified
    54  parent table. This feature was adopted from [Google Spanner's interleaved
    55  tables](https://cloud.google.com/spanner/docs/schema-and-data-model#creating_interleaved_tables)
    56  and was meant to be an optimization the user could opt in for tables that are
    57  often queried in a parent-child relationship (i.e. one-to-one, one-to-many).
    58  Refer to the [RFC on interleaved
    59  tables](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160624_sql_interleaved_tables.md)
    60  for more details.
    61  
    62  Beyond co-locating the child's key-value (kv) pairs with the parent's kv pairs
    63  in the underlying Key-Value (KV) storage (which provides the significant optimization
    64  of 1 phase-commit (1PC) commits for free), we do not currently take advantage
    65  of this fact to optimize queries.
    66  
    67  One type of query we can optimize for interleaved tables is joins between the
    68  parent and children tables. In the context of DistSQL, instead of spawning two
    69  `TableReader`s, one scan for the parent and one scan for the child table for a
    70  total of two scans, and a `Joiner` processor, we can do one scan over the
    71  relevant interleaved rows, then perform a join on the table rows.
    72  
    73  This optimization has significant upside on performance, potentially lowering
    74  both the number of scans - interleaved parent and child tables are
    75  scanned simultaneously instead of in two separate scans - and the volume of
    76  inter-node gRPC traffic for the join - rows routed by [the hash on their
    77  join
    78  columns](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160421_distributed_sql.md#processors)
    79  cross the network to their respective join processor.
    80  
    81  This RFC identifies two implementation phases for interleaved table joins:
    82  1. Mechanism that performs the joint scan as well as the join (Processors implementation phase)
    83  2. Planning interleaved table joins for a given query (Planning implementation phase)
    84  
    85  # Motivation
    86  
    87  We currently highlight [some of the
    88  benefits](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html#benefits)
    89  for interleaved tables such as joins, yet we do not currently execute them
    90  differently for interleaved tables compared to non-interleaved tables. In
    91  fact, since we do two separate table scans for the parent and child table in
    92  an interleaved configuration, there is a performance impact to scan disjoint
    93  KV ranges as [noted in the interleave table
    94  docs themselves](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html#tradeoffs).
    95  
    96  To illustrate the current behavior, let us interleave table `child` into
    97  table `parent`:
    98  ```sql
    99  CREATE DATABASE foo;
   100  
   101  CREATE TABLE IF NOT EXISTS parent (id INT PRIMARY KEY);
   102  
   103  CREATE TABLE IF NOT EXISTS child (id INT, parent_id INT, PRIMARY KEY (parent_id, id)) INTERLEAVE IN PARENT parent (parent_id);
   104  
   105  EXPLAIN SELECT * FROM child JOIN parent ON child.parent_id = parent.id;
   106  ```
   107  which results in the following logical plan
   108  ```
   109  +-------+------+----------------+--------------------+
   110  | Level | Type |     Field      |    Description     |
   111  +-------+------+----------------+--------------------+
   112  |     0 | join |                |                    |
   113  |     0 |      | type           | inner              |
   114  |     0 |      | equality       | (parent_id) = (id) |
   115  |     0 |      | mergeJoinOrder | +"(parent_id=id)"  |
   116  |     1 | scan |                |                    |
   117  |     1 |      | table          | child@primary      |
   118  |     1 |      | spans          | ALL                |
   119  |     1 | scan |                |                    |
   120  |     1 |      | table          | parent@primary     |
   121  |     1 |      | spans          | ALL                |
   122  +-------+------+----------------+--------------------+
   123  ```
   124  
   125  Users expecting a performance boost for parent-child joins (even simple ones
   126  like the above) are instead experiencing a performance hit.
   127  
   128  # Guide-level explanation
   129  
   130  It is worth noting how SQL rows are mapped to KV pairs, specifically how data
   131  for a given table is represented as key-value pairs on the primary index (refer to [this blog
   132  post](https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/)).
   133  
   134  ## Terminology
   135  
   136  - **logical plan**: what an SQL query is transformed into after parsing, type
   137    checking, name resolution, and query optimization. A logical plan is
   138    concretely represented as a (logical) plan tree (tree of `planNode`s) in
   139    CockroachDB.
   140  - **physical plan**: what [the distributed execution
   141    engine](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160421_distributed_sql.md)
   142    transforms the logical plan into. It is employed to generate and schedule
   143    flows which are composed of processors which are the functional units that
   144    defer computation to the data nodes.
   145  - **parent-child relationship**: when table `C` is interleaved into table `P`,
   146    then `C` is the child table of the parent table `P`. `P` and `C` have a
   147    parent-child relationship where `C` is a direct "descendant" of `P` (and `P`
   148    a direct "ancestor" of `C`).
   149  - **interleaved table**: child table that is interleaved into its parent primary index
   150    via the `CREATE TABLE ... INTERLEAVE IN PARENT` rule. This is concretely an
   151    **interleaved primary index** (see next point).
   152  - **interleaved index**: the primary or secondary index of a child table that
   153    is interleaved into the parent table's primary index. Interleaved secondary
   154    indexes are created via the `CREATE INDEX ... INTERLEAVE IN PARENT` rule.
   155  - **interleave prefix**: the primary key (consisting of 1+ columns) of the
   156    parent table and its corresponding prefix of the interleaved index of the
   157    child table. In the [Motivation](#motivation) example, the interleave prefix
   158    is `(child.parent_id)` or `(parent.id)`. Note the prefix may or may not be a
   159    proper prefix (it is possible for the child table to have no additional
   160    primary key columns).
   161  - **(grand-)\*children**: the set of all direct and indirect descendants of the
   162    specified table. That is: all tables/indexes that are interleaved together
   163    with the specified (ancestor) table except itself.
   164  - **(grand-)\*parent**: similar to **(grand-)\*children** except with ancestors
   165    instead of descendants.
   166  - **root table**: parent table that is __not__ interleaved into another
   167    table (it is the root of its interleaved hierarchy).
   168  - **interleaved hierarchy**: root table and all its (grand-)\*children. For a
   169    given database, there may exist 0 or more interleaved hierarchies.
   170  - **interleaf (pl. interleaves)**: a parent row with a primary key and all (grand-)\*children
   171  rows with the same interleave prefix.
   172  
   173  ## CockroachDB User
   174  
   175  From a CockroachDB user perspective, there is no apparent feature change. Join
   176  performance will be improved for joins between interleaved tables and their
   177  parent tables. Since we do not employ any significant query rewriting or
   178  optimization, the implementation of how we recognize interleaved table joins in
   179  the Planning phase will heavily dictate how we advertise this performance
   180  improvement.
   181  
   182  Specifically, the first iteration will permit more efficient joins between
   183  tables with a parent-child relationship. Any table that is often joined with
   184  a parent table (e.g. via a foreign key) should be interleaved in the the parent table.
   185  
   186  In general, tables that are consistently subject to hierarchical querying patterns
   187  (i.e. queried together via multi-way joins) can see improvements with interleaving.
   188  The [current docs](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html)
   189  do a good job highlighting the benefits and tradeoffs of interleaved tables
   190  in the general case. See [Drawbacks](#drawbacks) for caveats on advising users
   191  with respect to interleaved tables.
   192  
   193  ## CockroachDB Contributor
   194  
   195  For CockroachDB contributors and developers, changes to the codebase are mapped
   196  to their respective phases. Firstly, it is important to understand how the
   197  interleaved indexes are mapped in storage. Refer to the [RFC on interleaved
   198  tables](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160624_sql_interleaved_tables.md).
   199  
   200  As for how each phase of this feature affects the codebase:
   201  1. The Processors implementation phase will introduce a new processor, the
   202     `InterleaveReaderJoiner` processor that functionally combines two `TableReader`s
   203     (corresponding to each parent and child table) and a `MergeJoiner`
   204     (since interleave prefixes are defined on the parent table's primary key,
   205     there is an ordering guarantee we can take advantage of):
   206      - The `InterleaveReaderJoiner` will first scan the interleaved hierarchy
   207        (scoped to the relevant key span if there is a filter on the primary
   208        key). It can be configured with the `TableDescriptor`s and
   209        `IndexDescriptor`s of the parent and child table such that it can perform
   210        a single-pass scan of the two tables with `RowFetcher`.  See [\[1\]
   211        Scanning the interleaved
   212        hierarchy](#1-scanning-the-interleaved-hierarchy) for discussion on why
   213        this is the case.
   214      - There is the caveat that each `InterleaveReaderJoiner` must read full
   215        interleaves (and cannot partially read an interleaf), otherwise there may
   216        potentially be missing joined rows (see the [Reference-level explanation
   217        section](#Reference-level-explanation)).
   218      - After the rows are extracted, the joining logic is identical to that of
   219        `MergeJoiner`. We can abstract the joining logic from `MergeJoiner` and incorporate
   220        it into `InterleaveReaderJoiner`.
   221      - See [InterleaveReader and MergeJoiner](#6-interleavereader-and-mergejoiner) for an
   222        alternative design where the reader and joiner are separated into two
   223        processors.
   224  2. The Planning implementation phase will introduce changes to the logical plan
   225     whereby interleaved joins are acknowledged. Note choosing to introduce
   226     interleaved joins at the logical plan level as opposed to the physical plan
   227     level is further discussed in [\[2\] Logical planning > physical
   228     planning](#2-logical-planning--physical-planning).
   229      - `joinNode`s will be annotated in the logical plan if it is the root of
   230        a plan sub-tree, have two `scanNode` children/leaves, and:
   231         - Each of the `scanNode`s correspond to a scan of the parent and child
   232           tables
   233         - The `joinNode` is an **equality** join on the parent's primary key (the interleave
   234           prefix) (see [\[5\] Theta joins](#5-theta-joins) for inequality joins)
   235        In the general case, the table with the join columns could be a common
   236        ancestor (see [\[3a\] Common ancestor joins
   237        Alternatives](#3a-common-ancestor-joins)) and/or the join columns could
   238        be a prefix of the interleave prefix (see [\[3b\] Prefix
   239        joins](#3b-prefix-joins)).
   240      - Local execution engine will carry out the join as it would without interleaved
   241      joins (until we decide to push down interleaved join logic to the local nodes, see
   242      [\[2\] Logical planning > physical planning](#2-logical-planning--physical-planning).
   243      - Distributed execution engine will plan an `InterleaveReaderJoiner` each
   244        node (along with routers and streams) that has data for the relevant span
   245        when a `joinNode` with an interleaved join algorithm is encountered.
   246  
   247  Note every type of join (inner, left, right, full) can easily be supported since
   248  `InterleaveReaderJoiner` employs the same joining logic as `MergeJoiner`.
   249  
   250  ## Examples
   251  
   252  For the query described above where we do a simple join between table `parent`
   253  and table `child` where `child` is `INTERLEAVED INTO PARENT` `parent`
   254  ```sql
   255  SELECT * FROM child JOIN parent ON child.parent_id = parent.id
   256  ```
   257  
   258  we previously had the logical plan
   259  ```
   260  +-------+------+----------------+--------------------+
   261  | Level | Type |     Field      |    Description     |
   262  +-------+------+----------------+--------------------+
   263  |     0 | join |                |                    |
   264  |     0 |      | type           | inner              |
   265  |     0 |      | equality       | (parent_id) = (id) |
   266  |     0 |      | mergeJoinOrder | +"(parent_id=id)"  |
   267  |     1 | scan |                |                    |
   268  |     1 |      | table          | child@primary      |
   269  |     1 |      | spans          | ALL                |
   270  |     1 | scan |                |                    |
   271  |     1 |      | table          | parent@primary     |
   272  |     1 |      | spans          | ALL                |
   273  +-------+------+----------------+--------------------+
   274  ```
   275  which is a logical plan with a `joinNode` root and two `scanNode` leaves. This
   276  could very well be a `planNode` sub-tree in a more complex query. Regardless,
   277  one can annotate the `joinNode` to identify the interleaved join which will
   278  produce a query plan resembling
   279  ```
   280  +-------+------+------------------+--------------------+
   281  | Level | Type |     Field        |    Description     |
   282  +-------+------+------------------+--------------------+
   283  |     0 | join |                  |                    |
   284  |     0 |      | type             | inner              |
   285  |     0 |      | equality         | (parent_id) = (id) |
   286  |     0 |      | mergeJoinOrder   | +"(parent_id=id)"  |
   287  |     0 |      | interleavePrefix | parent_id, id      |
   288  |     0 |      | algorithmHint    | interleave
   289  |     1 | scan |                  |                    |
   290  |     1 |      | table            | child@primary      |
   291  |     1 |      | spans            | ALL                |
   292  |     1 | scan |                  |                    |
   293  |     1 |      | table            | parent@primary     |
   294  |     1 |      | spans            | ALL                |
   295  +-------+------+------------------+--------------------+
   296  ```
   297  The local execution engine can recognize the additional `interleavePrefix`
   298  field and `algorithmHint = interleave` once it implements interleaved join logic. The distributed
   299  execution engine can produce a physical plan with an `InterleaveReaderJoiner`
   300  processor for every such tree pattern.
   301  
   302  # Reference-level explanation
   303  
   304  This RFC will be scoped down to:
   305  1. joins between **two** tables in the interleaved hierarchy
   306  2. planning for interleaved joins between parent-child tables (see future work
   307     on [\[3a\] Common ancestor joins](#3a-common-ancestor-joins))
   308  3. the join columns must be exactly the interleave prefix (see future work on [\[3b\]
   309     Prefix joins](#3b-prefix-joins) and [\[3c\] Subset
   310     joins](#3c-subset-joins))
   311  
   312  Furthermore, we will discuss the two implementation phases (Processors and
   313  Planning) separately since they can be orthogonally implemented.
   314  
   315  ## Detailed design
   316  
   317  ### Processors implementation
   318  
   319  For the first iteration, we only care about equality joins on the entire
   320  interleave prefix. If you imagine the following interleaved hierarchy
   321  in storage (where each row represents a table row), where the primary
   322  key of of `parent` is `pk`, the primary key of `childN` is `(pk, ckN)`,
   323  and the interleaved index/interleave prefix is `(pk)`:
   324  ```
   325  <pk>/.../parent_r1
   326    <pk>/<ck1>/.../child1_r1
   327      ...
   328    <pk>/<ck1>/.../child1_r2
   329    <pk>/<ck2>/.../child2_r1
   330    <pk>/<ck3>/.../child3_r1
   331    ...
   332  <pk>/.../parent_r2
   333    <pk>/<ck1>/.../child1_r3
   334    <pk>/<ck2>/.../child2_r2
   335    <pk>/<ck2>/.../child2_r3
   336    <pk>/<ck2>/.../child2_r4
   337    ...
   338  ```
   339  Let's suppose we want to join `parent` and `child2`
   340  ```
   341  SELECT * FROM parent JOIN child2 ON parent.pk = child2.pk
   342  ```
   343  The expected output is
   344  ```
   345  parent_r1 U child2_r1
   346  parent_r2 U child2_r2
   347  parent_r2 U child2_r3
   348  parent_r2 U child2_r4
   349  ...
   350  ```
   351  
   352  We can scan top to bottom of the interleaved hierarchy and separate out
   353  `parent_rN` rows and `child2_rN` rows into two streams.
   354  
   355  #### Reading component
   356  
   357  Looking through the implementation of `RowFetcher`, specifically the call chain
   358  ```
   359  RowFetcher --> RowFetcher.NextKey --> RowFetcher.ReadIndexKey --> DecodeIndexKey (table.go)
   360  ```
   361  it doesn't seem like `RowFetcher` is very permissive when it comes to allow
   362  1-pass through on an interleaved hierarchy. Specfically, [`DecodeIndexKey` in
   363  `table.go`](https://github.com/cockroachdb/cockroach/blob/de7337dc5ca5b4e5ee17e812c817e4bba5a449ca/pkg/sql/sqlbase/table.go#L778L780)
   364   will need to be refactored so that it returns `true` for KV pairs if
   365  it matches either the target `parent` index or `child2` index. This is
   366  generalized to a set membership problem with a set of `N` `(tableID, indexID)` tuples if we
   367  emit multiple interleaved tables in one scan or to support [multi-table
   368  joins](#4-multi-table-joins)).
   369  
   370  The refactor necessary is as follows:
   371  - A new `RowFetcher` that produces rows from any of the `N` tables
   372    (identified as unique `TableDescriptor`-`IndexDescriptor` pairs)
   373  - `NextRow` will need to be able to return a `RowResponse` with the
   374    `EncDatumRow` as well as the row's `TableDescriptor` and `IndexDescriptor`.
   375  - The equivalence signature is defined as the sequence of ancestor table and
   376    index IDs encoded into a `[]byte`. For example the equivalence signature
   377    for the __primary index__ on `child2` interleaved into the primary index of
   378    `parent` as well as __any rows__ that belong to this primary index of `child2` is
   379      ```
   380      /parent/1/#/child2/1
   381      ```
   382    - We can pre-compute this equivalence signature for each table. When we
   383      proceed to decode each index key, we compute the equivalence signature
   384      and check if it corresponds to any of the tables.
   385  - Instead of trying to refactor `RowFetcher` and introducing additional
   386    overhead, it is prudent to separately implement this `RowFetcher`. We
   387    can eventually merge `RowFetcher` into `RowFetcher` after it is
   388    determined the overhead is marginal for the 1 table case.
   389  
   390  The [outstanding PR for `RowFetcher`](https://github.com/cockroachdb/cockroach/pull/19228)
   391  has the full implementation details.
   392  
   393  #### Joining component
   394  
   395  The `InterleaveReaderJoiner` can branch on the
   396  `TableDescriptor`-`IndexDescriptor` pair to distinguish between `parent` and
   397  `child2` rows.  Each join batch is defined as rows that are joined for a given
   398  interleave prefix.  If the `InterleaveReaderJoiner` observes a `parent` row, it
   399  can start a new join batch since each unique interleave prefix is a primary key
   400  on `parent` which is also unique. It memoizes this `parent` row. If a `child2`
   401  row is retrieved, it joins it with the most recent `parent` with
   402  [`joinerBase.render`](https://github.com/cockroachdb/cockroach/blob/c86f16f89c154797ed07012f66d4aa49b1947624/pkg/sql/distsqlrun/joinerbase.go#L175#L193).
   403  This implies that `InterleaveReaderJoiner` will need to [nest `joinerBase`
   404  similar to
   405  `mergeJoiner`](https://github.com/cockroachdb/cockroach/blob/c86f16f89c154797ed07012f66d4aa49b1947624/pkg/sql/distsqlrun/mergejoiner.go#L33).
   406  
   407  The joining logic becomes more complicated on joins not on the full interleave prefix
   408  as detailed for [prefix joins](#3b-prefix-joins) and [subset joins](#3c-subset-joins).
   409  
   410  ### Planning implementation
   411  
   412  There are three areas (that each vary in the amount of work required) that are
   413  relevant to implementing the planning of interleaved table joins:
   414  1. Logical planning & annotating `joinNode`s
   415  2. Local execution engine (little work required)
   416  3. Distributed execution engine & spawning appropriate flows
   417  
   418  #### Logical planning
   419  
   420  Annotating `joinNode`s can be accomplished after a logical plan has been
   421  created and during optimization (since one can consider identifying interleaved
   422  table joins an optimization; by precedent, we did this in the `optimizePlan`
   423  stage of the planner [for merge
   424  joins](https://github.com/cockroachdb/cockroach/pull/17214/files)). We
   425  introduce a general `algorithmHint` field on `joinNode`s to annotate.
   426  
   427  Annotation can be accomplished when we `expandPlan` (within `optimizePlan`) and do type
   428  switching on the nodes we encounter. Specifically, we'd have a helper function to
   429  peek into the left and right sources to see if they satisfy the conditions
   430  for an interleaved join. This is already being done for identifying merge
   431  joins and merge join order: in fact this can be a cookie cutter derivative
   432  of the [planning changes in the DistSQL merge join
   433  PR](https://github.com/cockroachdb/cockroach/pull/17214/files#diff-03ffd4efde59eae13665287cc1193d9a).
   434  
   435  The conditions for an interleaved join in this first iteration are:
   436  1. Each of the `scanNode`s correspond to a scan of the parent and child tables
   437  2. The `joinNode` is an **equality** join on the parent's primary key (the
   438     interleave prefix)
   439  
   440  We first check if the `planNode` attach to `left` and `right` are `scanNode`s.
   441  We then corroborate their `InterleaveDescriptor`s (nested in `IndexDescriptor` which
   442  is nested in `TableDescriptor`). Note that `InterleaveDescriptor` does in
   443  fact keep track of all ancestors, thus it's rather simple to do a interleaved hierarchy traversal
   444  if we want to do [common ancestor joins](#3a-common-ancestor-joins) in the future.
   445  If the first ancestor's (the parent) table ID of `left` or `right`
   446  corresponds to the table ID of the other table, then we've satisfied condition #1.
   447  
   448  Note that only a scan of the interleaved index of the child table can be
   449  incorporated in an interleaved join. For example, if only the child table's primary index
   450  is interleaved into the parent table but the `scanNode` is a scan over one of the child's
   451  secondary indexes, then an interleaved join is not appropriate. If the secondary
   452  index is interleaved then we can should also [optimize index joins](#7-index-joins).
   453  
   454  We then check if the `joinPredicate` on the join nodes have equality indices
   455  on the (exact) columns in the interleave prefix.
   456  
   457  If both conditions are met, we set `algorithmHint` to "parent-child interleaved join".
   458  The annotated plan tree is then sent off to the execution engine.
   459  
   460  #### Local execution engine
   461  
   462  Thankfully, there is not much to do here for the first iteration of this RFC!
   463  
   464  The [idea of spawning the `InterleaveReaderJoiner` in the local execution
   465  engine](#2-logical-planning--physical-planning) is out of this RFC's scope.
   466  
   467  #### Distributed execution engine (DistSQL)
   468  
   469  The bulk of the logic that will setup the interleave
   470  join will be in
   471  [`createPlanForJoin`](https://github.com/cockroachdb/cockroach/blob/10e3751071c3b80540486d4a2f11c2d322501d42/pkg/sql/distsql_physical_planner.go#L1851).
   472  If the interleave join algorithm has been selected by the logical planner, it
   473  will skip [creating the individual
   474  plans](https://github.com/cockroachdb/cockroach/blob/10e3751071c3b80540486d4a2f11c2d322501d42/pkg/sql/distsql_physical_planner.go#L1874L1878)
   475  for the `left` and `right` `scanNode`s and invoking `MergePlans`. Instead, the
   476  descriptors on the `scanNode`s  will be used to construct the
   477  `InterleaveReaderJoiner` processor.
   478  We pass down `TableDescriptor`s, `IndexDescriptor`s and any other arguments
   479  our `RowFetcher` requires to perform a single-pass read for the two tables.
   480  
   481  The final physical plan for a three-node cluster looks something like
   482  
   483  ![image](https://user-images.githubusercontent.com/10563314/31895522-610041a0-b7df-11e7-809a-73e1f17c99d4.png)
   484  
   485  First the union of the spans from the two `scanNode`s (by invoking
   486  [`MergeSpans`](https://github.com/cockroachdb/cockroach/blob/master/pkg/roachpb/merge_spans.go#L42))
   487  are passed into
   488  [`PartitionSpans`](https://github.com/cockroachdb/cockroach/blob/62b7495302a8e06b4be3780e349d10d31e378bd7/pkg/sql/distsql_physical_planner.go#L488).
   489  This will return a slice of `SpanPartition`s of length `n`, which corresponds
   490  to the number of data nodes.  We will eventually spawn `n`
   491  `InterleaveReaderJoiner`s, one for each node (`n = 3` in the diagram above).
   492  Side note: any scan over an interleaved index will always [default to
   493  the span of the root table in the interleaved
   494  hierarchy](https://github.com/cockroachdb/cockroach/blob/de7337dc5ca5b4e5ee17e812c817e4bba5a449ca/pkg/sql/sqlbase/table.go#L239L243) so
   495  it's not strictly necessary to take the union. For future-proofing's sake,
   496  we do this since it's a trivial performance impact to do a union over a set of spans during planning.
   497  
   498  These spans will need to be "fixed" in order to prevent partial interleave
   499  reads since range splits can happen in between interleaves.
   500  If an interleaf is only partially read on a given node (for example, the parent
   501  row is read but none of the children rows are read because they overflow into
   502  the next span partition on a different node), then we will miss a few joined
   503  rows. We need to first extract the parent's prefix key then call `PrefixEnd()`.
   504  To do this, we need to figure out the cumulative. That is for a given `child` key
   505  ```
   506  /<tableid>/<indexid>/<parent-interleave-prefix>/<interleave sentinel>/<rest-of-child-primary-key>/...
   507  ```
   508  where the number of segments (`/.../`) for `<parent-interleave-prefix>` (i.e.
   509  number of columns in the prefix) is `shared_prefix_len` in the child table's
   510  `InterleaveDescriptor`.
   511  For example, if we wanted to fix this span (generated from `PartitionSpans`)
   512  (`#` is the `<interleave sentinel>`)
   513  ```
   514  StartKey: /parent/1/2/#/child/2
   515  EndKey:   /parent/1/42/#/child/4
   516  ```
   517  to
   518  ```
   519  StartKey: /parent/1/2/#/child/2
   520  EndKey:   /parent/1/43
   521  ```
   522  such that we know for certain we read all `child` rows with interleave prefix
   523  `5`, then we need to first extract `/parent/1/42` from
   524  `/parent/1/42/#/child/4`. We can call `encoding.PeekLength()` `3` times (once
   525  for each segment) to find the total prefix length then take the prefix of
   526  `/parent/1/42/#/child/4` of that length.  More generally, for a descendant
   527  `EndKey`
   528  ```
   529  /<1st-tableid>/<1st-indexid>/<1st-index-columns>/#/<2nd-tableid>/<2nd-indexid>/2nd-index-columns/#/...
   530  ```
   531  the number of times we call `encoding.PeekLength()` is
   532  ```
   533  3 * count(interleave ancestors) + sum(shared_prefix_len) - 1
   534  ```
   535  where the `- 1` is to not include the last `<interleave sentinel>`.
   536  
   537  If the above "fixing" happens too often, it begs the question of avoiding
   538  splits inside interleaves as much as possible. This [was mentioned briefly in
   539  the initial RFC of interleaved
   540  tables](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160624_sql_interleaved_tables.md).
   541  See [Avoiding splits inside interleaves](#8-avoiding-splits-inside-interleaves)
   542  for more detail.
   543  
   544  ## Drawbacks
   545  
   546  Everything discussed in this PR is more efficient than the way we do joins currently.
   547  This is because we do one scan over the interleaved hierarchy instead of two
   548  separate scans for each the two table in the join.
   549  
   550  In terms of future caveats, interleaved table joins are ill-advised if the majority
   551  of queries are joins on a small subset of the interleaved hierarchy. In the scope of
   552  this RFC (where we focus on joins between two tables in the interleaved hierarchy),
   553  one should only have a parent and child table in a given interleaved hierarchy.
   554  
   555  Once multi-table interleaved joins are implemented (see [\[4\] Multi-table
   556  joins](#4-multi-table-joins)), any scan queries
   557  (whether on individual tables or joins) should be on the majority of the tables
   558  in the interleaved hierarchy (to optimize the amount of relevant data read when
   559  doing [the interleaved hierarchy
   560  scan](#1-scanning-the-interleaved-hierarchy)).
   561  
   562  All this being said, we should be __very cautious__ prescribing usage patterns
   563  for interleaved tables. We currently do not support interleave schema changes
   564  (i.e. un-interleaving a table from its parent). It would be inconsistent of
   565  us if we prescribe that interleaved tables should only be used in parent-child
   566  relationships presently, but amend our prescription to include deeply-nested,
   567  hierarchical relationships once we implement multi-table joins and common
   568  ancestor joins.
   569  
   570  ## Rationale and Alternatives
   571  
   572  ### [1] Scanning the interleaved hierarchy
   573  
   574  After discussions with @jordanlewis who has been working with indexes of interleaved
   575  parent and child tables, a couple of key points were brought up that will dictate
   576  how we do the reading component of `InterleaveReaderJoiner` (and `RowFetcher`):
   577  1. Currently, one cannot selectively scan all parent rows without scanning
   578     interleaved (grand-)\*children. Similarly, one cannot selectively scan
   579     all the child rows without scanning its (grand-)\*parents and
   580     (grand-)\*children.
   581     - For example, a query for all rows of table `parent` with primary key
   582       between `2` and `42` might look up the span `parent@primary /2-/43`. This
   583       scan at the `KVFetcher` level iterates through all `child` rows and
   584       [decodes their index
   585       keys](https://github.com/cockroachdb/cockroach/blob/62b7495302a8e06b4be3780e349d10d31e378bd7/pkg/sql/sqlbase/rowfetcher.go#L154#L155)
   586       with interleave prefix also between `2` and `42`. It then decides whether
   587       or not the `child` row is a `parent` row via a [comparison on its
   588       `TableID` and
   589       `IndexID`](https://github.com/cockroachdb/cockroach/blob/62b7495302a8e06b4be3780e349d10d31e378bd7/pkg/sql/sqlbase/table.go#L778L780).
   590     - One key observation by @jordanlewis is that a secondary index can
   591       be scanned and index joined with the primary rows of the parent and/or
   592       child table to obtain a scan of each table.
   593       - This secondary index should not be maintained automatically: a user
   594         should create this secondary index if need be.
   595       - The planner will be need to know when to use this secondary index and
   596         an index join to perform these types of scans: index selection can be
   597         augmented to impose a penalty for scans over parent interleaves.
   598       - Benchmark opportunity: secondary indexes do incur a small cost on
   599         `INSERT/UPDATE`s. They are also located on separate ranges (and possibly
   600         separate machines) which will require sending rows over the wire to the
   601         `Joiner`s. This is potentially worse than doing an entire hierarchy
   602         scan and filtering for relevant rows.
   603  2. If a filter is given for the parent primary key, then we can scope the scan.
   604     This however still picks up all the (grand-)\*children rows of the parent.
   605  
   606  ### [2] Logical planning > physical planning
   607  
   608  For planning interleaved table joins, one can either introduce the branching
   609  point (that is, the point at which an `InterleavedRaederJoiner` is introduced)
   610  in the physical plan (processor-level in the distributed execution engine) or
   611  in the logical plan (which affects both execution engines).
   612  
   613  Currently the local execution engine only scans and propagate rows in the
   614  computed spans on the data nodes to the gateway node. Note however that these
   615  scans occur at the leaves of plan tree (`scanNode`s are always leaves). More
   616  precisely, the `InterleaveReaderJoiner` logic described for the distributed
   617  execution engine can theoretically be accomplished in the local execution
   618  engine: the local implementation spawns and runs an `InterleaveReaderJoiner`
   619  processor and returns the joined results. If we never do quite fuse the local
   620  and distributed backends into one entity, annotating a `joinNode` to perform an
   621  interleaved join if possible will give us the **flexibility** to optimize both
   622  pathways.
   623  
   624  This decision to annotate logical plans to allow for more efficient joins [has
   625  precedent when we added ordering information for merge joins to
   626  `joinNode`s](https://github.com/cockroachdb/cockroach/pull/17214) for the
   627  distributed execution engine.
   628  
   629  ### [3] Generalizing planning
   630  
   631  As noted in the [Guide-level explanation section](#Guide-level-explanation),
   632  one can identify the plan tree pattern that has a `joinNode` as the root and
   633  two `scanNode` leaves that correspond to the parent and child tables. For the
   634  simple case, one can verify that the join column(s) are on the primary key of
   635  the parent table (and a prefix of the interleaved index of the child table, but
   636  this is mandated by `INTERLEAVE` anyways), but this can of course be
   637  generalized.
   638  
   639  #### [3a] Common ancestor joins
   640  
   641  When we have a parent and direct child relationship and join, we really have a
   642  join on a common ancestor's primary key (the parent) between the parent and the
   643  child.  If we imagine the interleave relationships for a given database as a
   644  forest where every top-level parent is the root of a tree (and these trees can
   645  be arbitrarily deep), then a **common ancestor** for two nodes `A` and `B` is
   646  defined as as node whose subtrees contain `A` and `B`.
   647  
   648  Imagine that we wanted to perform a join between two children tables on their
   649  parent's primary key. For example, suppose we have table `parent, child1,
   650  child2`:
   651  ```sql
   652  CREATE TABLE parent (id INT PRIMARY KEY);
   653  
   654  CREATE TABLE child1 (id1 INT, pid INT, PRIMARY KEY (pid, id1)) INTERLEAVE IN PARENT parent (pid);
   655  
   656  CREATE TABLE child2 (id2 INT, pid INT, PRIMARY KEY (pid, id2)) INTERLEAVE IN PARENT parent (pid);
   657  
   658  SELECT * FROM child1 JOIN child2 ON child1.pid = child2.pid;
   659  ```
   660  The forest representation (with one tree) looks like
   661  ```
   662        parent
   663         / \
   664       /     \
   665     /         \
   666  child1     child2
   667  ```
   668  It is complete and sound to scan the interleaved parts of `child1` and `child2`
   669  and perform interleaved joins per each unique `parent` primary key. Reasoning:
   670  all `child1` rows with interleave prefix `X` are interleaved under the `parent`
   671  row with primary key `X` by `INTERLEAVE` constraint.  Similarly all `child2`
   672  rows with `X` prefix are interleaved under the same `parent` row.  Therefore,
   673  all `child1` and `child2` rows with equivalent `pid`s are nested under the same
   674  interleave block and thus will be joined.
   675  
   676  This concept can be arbitrarily extended to any common ancestor primary key
   677  join so long as the join is exactly on the common ancestor's primary key. Since
   678  the nesting depths of interleaved tables are typically no more than 1-4 (if
   679  there is very deep nesting, then we'll have other issues to worry about), we
   680  can simply traverse up the connected tree until we reach the root and check if
   681  a common ancestor was encountered (traverse up instead of down from either
   682  node since `InterleaveDescriptor` only keeps track of ancestors, not
   683  descendants).
   684  
   685  The use cases for a join on an ancestor more than one level higher in the
   686  hierarchical tree is limited since it is intuitively rare for a grandchildren to
   687  have a foreign key reference to its grandparent. Similarly, joining siblings on
   688  their parents' key is essentially a cross product join per each unique primary
   689  parent key, which is hard to imagine as a common query. A multi-table join
   690  between parent-child-grandchild is conceivable (see [Multi-table
   691  joins](#4-multi-table-joins)).
   692  
   693  #### [3b] Prefix joins
   694  
   695  Prefix joins involves joining on a proper prefix of the primary key of
   696  the parent (which is a prefix of the interleave prefix).
   697  
   698  ##### Canonical prefix join example
   699  
   700  For example suppose we table `parent` that
   701  has primary key (PK) `(pk1, pk2, pk3)` and table `child` that has primary key
   702  (PK) `(pk1, pk2, pk3, c_pk1)`.
   703  
   704  The simple case we've been dealing with is a join on (exclusively) columns
   705  `pk1`, `pk2`, and `pk3`.
   706  
   707  It was proposed that we could possibly do a join on a prefix of the
   708  interleave prefix. Suppose we wanted to join on `pk1` and
   709  `pk2`. For clarity's sake, let us imagine we have the parent key schema
   710  ```
   711  /<pk1>/<pk2>/<pk3>/...
   712  ```
   713  and the interleaved child key schema
   714  ```
   715  /<pk1>/<pk2>/<pk3>/<interleaved sentinel>/<c_pk1>/...
   716  ```
   717  where the `interleaved sentinel` is `#` (in practice, it is the first byte in
   718  `KeyEncDescending` such that it is always sorted after the parent KV pairs).
   719  Example KV pairs (each representing an entire table row) for the parent and
   720  child tables will be stored as (where child rows are indented)
   721  ```
   722  ...
   723  /1/1/2/...
   724    /1/1/2/#/2/...
   725    /1/1/2/#/4/...
   726  /1/1/3/...
   727    /1/1/3/#/1/...
   728    /1/1/3/#/3/...
   729    /1/1/3/#/7/...
   730  /1/2/3/...
   731    /1/2/3/#/5/...
   732  /1/2/9/...
   733    /1/2/9/#/5/...
   734    /1/2/9/#/7/...
   735  ...
   736  ```
   737  Joining on `pk1` and `pk2` would result in the following parent-child KV pair joins
   738  ```
   739  ---------------------------------
   740  | parent (key) | child (key)    |
   741  ---------------------------------
   742  | /1/1/2/...   | /1/1/2/#/2/... |
   743  | /1/1/2/...   | /1/1/2/#/4/... |
   744  | /1/1/2/...   | /1/1/3/#/1/... |
   745  | /1/1/2/...   | /1/1/3/#/3/... |
   746  | /1/1/2/...   | /1/1/3/#/7/... |
   747  | /1/1/3/...   | /1/1/2/#/2/... |
   748  | /1/1/3/...   | /1/1/2/#/4/... |
   749  | /1/1/3/...   | /1/1/3/#/1/... |
   750  | /1/1/3/...   | /1/1/3/#/3/... |
   751  | /1/1/3/...   | /1/1/3/#/7/... |
   752  | ...          | ....           |
   753  --------------------------------
   754  ```
   755  and similarly for the joins where `(pk1, pk2) = (1, 2)`.
   756  
   757  Since the primary index
   758  for the parent table is ordered, two 2-pass approaches using a merge-join
   759  pattern e-merges here:
   760  
   761    TODO(richardwu): From my understanding, if one does not know the exact
   762    `/pk1/pk2` start and end keys (i.e. one can't tell RowFetcher/kvBatchFetcher to scan
   763    the first, second, etc. /pk1/pk2/ range), then one would have to scan the
   764    entire hierarchy. The bases for the following approaches revolve around the
   765    fact that we can't do separate scans for each unique `(pk1, pk2)` prefix.
   766  
   767  ##### Approach 1 (2-pass, one scan)
   768  This approach uses
   769  ```
   770  O(# of parent+child rows for given (pk1, pk2) prefix)
   771  ```
   772  memory and 1 span scan:
   773  1. Scan top to bottom through the entire interleaved hierarchy
   774  2. For every new `(pk1, pk2)` prefix:
   775     - `NextRow()` and store parent/child row in their respective buffers until
   776       we see a new `(pk1, pk2)` prefix
   777     - Perform a full (cross product) join between all parent and child rows,
   778       emitting newly joined rows
   779  
   780  ##### Approach 2 (2-pass, least memory, O(2k) scans)
   781  If the memory requirements of approach 1 becomes too unwieldy, one can consider
   782  storing either the child rows or parent rows in one buffer (deciding which will
   783  require table statistics or a general heuristic). This requires
   784  ```
   785  min{
   786    O(# of child rows for given (pk1, pk2) prefix),
   787    O(# of parent rows for given (pk1, pk2) prefix),
   788  }
   789  ```
   790  memory. WLOG, let's store the parent rows in our buffer.  **This approach
   791  relies on the assumption that RocksDB can efficiently cache the requested
   792  span and subsequent overlapping scans of the initial span. @petermattis notes
   793  that RocksDB should be able to cache these large blocks of SSTables**:
   794  1. Initialize span as the entire interleaved hierarchy
   795  2. Scan through the rows for a given `(pk1, pk2)` prefix:
   796     - For every parent row, store in buffer
   797     - For every child row, discard
   798     - Once a new `(pk1, pk2)` prefix is encountered, mark the current
   799     `(pk1, pk2)` prefix for the next scan
   800  3. Scan through the the current `(pk1, pk2)` span again
   801     - Join each encountered child row with the cached parent rows and emit
   802  4. Begin a new scan with the next `(pk1, pk2)` prefix (go to #2)
   803  
   804  To illustrate, let's assume that there are **three** unique `(pk1, pk2)` prefixes. This
   805  forms three blocks of our overall span: block A, B, and C.
   806  1. The first scan is over the entire span (A + B + C).
   807  2. Once the first key of B is encountered, we perform a scan on block A again
   808     and perform our join/emit.
   809  3. We resume scanning block B + C.
   810  4. Once the first key of C is encountered, we perform a scan on block B again
   811     and perform our join/emit.
   812  5. We resume scanning block C. We reach the end of our span scan.
   813  
   814  For `k` unique prefixes, this requires O(2k) scans. __Assuming every scan
   815  is retrieved from the RocksDB cache, we only do one read from disk.__
   816  
   817  #### [3c] Subset joins
   818  
   819  Subset joins are joins on a strict subset of the interleave prefix
   820  columns that is not a prefix itself. For example, if `(pk1, pk2, pk3)` is the
   821  interleave prefix, then
   822  ```
   823  {(pk2), (pk3), (pk1, pk3), (pk2, pk3)}
   824  ```
   825  is its corresponding set of subsets.
   826  
   827  There is no optimized way __specific to interleaved tables__ to performing subset
   828  joins (that I can think of). If you refer to the [Canonical prefix join
   829  example](#canonical-prefix-join-example), a join on `(pk1, pk3)` would involve
   830  some disjoint cross joins which is hard to optimize in a general way. We could
   831  of course use a hybrid merge-hash join for a join on `(pk1, pk3)`: we'd perform
   832  a hash-join (hashed on `pk3`) on all rows for each `pk1` (see [this
   833  issue for more details](https://github.com/cockroachdb/cockroach/issues/16580)).
   834  
   835  ### [4] Multi-table joins
   836  
   837  For a given interleaved hierarchy, we might fold in multiple joins from
   838  more than two tables in the hierarchy. Suppose we have the hierarchy
   839  ```
   840  customers
   841    orders
   842      items
   843  ```
   844  We may want retrieve all `items` that `customers` have purchased joined with
   845  their customer information (stored in `customers`). The corresponding
   846  query might look like
   847  ```sql
   848  SELECT * FROM customers
   849    JOIN
   850      SELECT * FROM orders
   851      JOIN items ON orders.id = items.order_id
   852    ON customers.id = orders.cus_id
   853  ```
   854  
   855  Ideally, this three-table join be combined into one join with one scan of the
   856  interleaved hierarchy. `InterleaveReaderJoiner` would read rows from `N = 3` tables-indexes
   857  and perform a merge join on all three tables (this is essentially a 3-way cross product,
   858  which can be implemented with 3 loops nested in a hierarchy).
   859  
   860  ### [5] Theta Joins
   861  
   862  Theta joins, or joins on columns with predicates that are not all equivalence predicates,
   863  are a little more complicated with interleaved table joins. I haven't fully
   864  grok-ed the implications of theta joins for interleaved tables, so I'll leave it
   865  open for now.
   866  
   867  ### [6] InterleaveReader and MergeJoiner
   868  
   869  After some initial discussion in the RFC PR, we came to the realization that
   870  the goal of this RFC is to avoid scanning the entire interleaved hierarchy twice
   871  with two independent `TableReader` processors. An alternative solution proposed
   872  was to have some `InterleaveReader` that emits both `parent` and `child` rows
   873  in the simple case. The joining logic is identical to that of `MergeJoiner` and
   874  the rows from `InterleaveReader` can be piped to `MergeJoiner`.
   875  
   876  In fact, the [prefix](#3b-prefix-joins) and [subset joins](#3c-subset-joins)
   877  are general cases for `MergeJoiner` that is current being tracked by [this
   878  issue](https://github.com/cockroachdb/cockroach/issues/16580).
   879  
   880  This alternative architecture would look something like
   881  
   882  ![image](https://user-images.githubusercontent.com/10563314/31399738-5bbf529a-adbb-11e7-9503-bec9a3b962f4.png)
   883  
   884  #### Planning
   885  
   886  Instead of specifying a hash router for each `TableReader` instance to their
   887  respective `MergeJoiner`s (hashed by the equality columns), the two output streams
   888  of the `InterleaveReader` will be mapped directly to the inputs of the
   889  `MergeJoiner` processor. Two `OutputRouterSpec`s for `PASSTHROUGH` routers are
   890  specified for each `InterleaveReader`. They each correspond to the output streams
   891  for parent and child rows. This is accomplished by setting the appropriate
   892  fields (i.e. `SourceProcessor`, `DestProcessor`, `DestInput`) fields of
   893  `distsqlplan.Stream` and adding it to the plan. Since `distsqlplan.Stream`
   894  currently does not support multiple source outputs for a given `SourceProcessor`,
   895  we will also need to introduce `SourceOutput int` (similar to how `DestInput int`
   896  permits multiple input streams).
   897  
   898  #### Reading component
   899  
   900  `InterleaveReader` also requires reading from `RowFetcher` and piping the
   901  rows to each of its streams. Since `InterleaveReader` "owns" its `N` output
   902  streams (rather than letting some `Router` message-broker direct the output), a
   903  similar buffer logic will be [required to prevent
   904  deadlocks](https://github.com/cockroachdb/cockroach/issues/17097). The deadlock happens
   905  when `MergeJoiner` (`streamMerger` to be precise) tries to [batch all rows from both sides with the
   906  current join column
   907  values](https://github.com/cockroachdb/cockroach/blob/70c96175bb5b60bf6d531ab3bcebd2bb723d0bc7/pkg/sql/distsqlrun/stream_merger.go#L37#L45).
   908  It will try to retrieve parent rows until it sees that the values for the join
   909  columns increase (in order to form a batch of parent rows with the same join
   910  column values). However, this would not be possible if the next row after the
   911  parent row is a child row since the `InterleavedReader` reads top to bottom. Thus
   912  we would need to buffer enough child rows until `streamMerger` can get the next
   913  parent row which has a different primary key.
   914  
   915  ```
   916  parent_r1    <- first row in parent batch, NextRow until parent row with a different primary key
   917    child_r1   <- need to buffer since NextRow is called
   918    child_r2   <- buffered
   919    child_r3   <- buffered
   920  parent_r2    <- parent row with a different primary key, finish parent batch
   921                  and move on to child batch (child_r1 - r3 is unblocked and sent)
   922  ```
   923  
   924  In the case of a join on the entire interleave prefix, buffering is __okay__
   925  since there is at most 1 parent row for 0+ child rows. We would thus buffer at
   926  most 1 child row in order to retrieve the next parent row (that has a larger
   927  primary key, finishing off the batch). In the case where we do a [prefix
   928  join](3b-prefix-joins), parent rows are separated by blocks of 0+ child rows,
   929  thus we will need to buffer the child rows in order to retrieve all the parent
   930  rows for a given batch.
   931  
   932  This buffering logic needs to be [abstracted from
   933  `routerBase`](https://github.com/cockroachdb/cockroach/blob/de7337dc5ca5b4e5ee17e812c817e4bba5a449ca/pkg/sql/distsqlrun/routers.go#L160L164)
   934  and imported into `InterleaveReader`.
   935  
   936  #### Joining component
   937  
   938  The actual joining logic in `MergeJoiner` should be agnostic to whether the
   939  rows came from two `TableReader`s or one `InterleaveReader`.
   940  
   941  One of the drawbacks of delegating the joining logic to a generic `MergeJoiner`
   942  is, in the simple case, having to buffer batches of child rows even though we
   943  know there is only one parent row to join and we can stream these joined rows.
   944  
   945  For example, the second batch for `parent` and `child2` looks like
   946  ```
   947  <parent batch>  <child batch>
   948  parent_r2       child2_r2
   949                  child2_r3
   950                  child2_r4
   951  ```
   952  Instead of buffering the entire batch, we know for certainty that
   953  we need only buffer the one parent row and stream-join the child rows.
   954  
   955  One way we can fix this is to refactor `streamMerger` to recognize
   956  that if either the `left` or `right` batch has one row, we can stream join
   957  the other row.
   958  
   959  For the scope of the first iteration, we will defer this optimization.
   960  
   961  If we wanted to perform [multi-table joins](#4-multi-table-joins) efficiently,
   962  we could consider an `InterleaveJoiner` (or `MultiMergeJoiner`) that can merge
   963  `N` arbitrary batches.
   964  
   965  #### Drawbacks of this approach
   966  
   967  There were a number of cons to this design compared to a single `InterleaveReaderJoiner`
   968  processor, namely:
   969  - We need to buffer child rows in `InterleaveReader` OR refactor `streamMerger` to
   970    read rows from left and right concurrently to avoid deadlocking. Note
   971    refactoring `streamMerger` doesn't solve the "prefix joins" or "subset" case:
   972    we still need to buffer/store rows on the `InterleaveReader` side.
   973  - We need to extract buffering logic from routers.
   974  - Batches are unavoidable in `streamMerger` unless we hint to `MergeJoiner`
   975    during planning that the parent side contains only 1 row per batch (for joins
   976    on the full interleave prefix). If it does contain 1 row on one side we can
   977    skip batching and simply stream joined rows.
   978  - We need to refactor `Stream`s to handle multiple inputs (from `InterleaveReader`).
   979  
   980  ### [7] Index joins
   981  
   982  Although we [do not publicly advertise](https://www.cockroachlabs.com/docs/stable/interleave-in-parent.html)
   983  interleaving secondary indexes under primary indexes, this is possible
   984  with the syntax
   985  ```sql
   986  CREATE INDEX secondary_idx ON foo (id, data) INTERLEAVE IN PARENT foo (id)
   987  ```
   988  We can thus identify interleaved joins on `indexJoinNode`s (where `left` and `right`
   989  `scanNode`s in `joinNode`s are now `index` and `table` `scanNode`s in `indexJoinNode`s)
   990  and perform efficient index joins. Index joins would employ the same processor (`InterleaveReaderJoiner`)
   991  and the planning implementation will be symmetric to that of `joinNode`s.
   992  
   993  ### [8] Avoiding splits inside interleaves
   994  
   995  We would like to refactor [`MVCCFindSplitKey`](https://github.com/cockroachdb/cockroach/blob/424a396a7681b048439c940207027d9eb7976a85/pkg/storage/engine/mvcc.go#L2269#L2297)
   996  such that it avoids splitting inside interleaves (i.e. a split between two
   997  children rows that are nested under the same parent row).
   998  
   999  We have [similar logic for avoiding splits in between
  1000  rows](https://github.com/cockroachdb/cockroach/blob/424a396a7681b048439c940207027d9eb7976a85/pkg/storage/engine/mvcc.go#L2296)
  1001  where they may be multiple family kv pairs. In short,
  1002  [`EnsureSafeSplitKey`](https://github.com/cockroachdb/cockroach/blob/424a396a7681b048439c940207027d9eb7976a85/pkg/keys/keys.go#L668)
  1003  simply truncates the split key obtained from the call to `MVCCFindSplitKey` in
  1004  `libroach/db.cc` to the row prefix. For example, if the given key is chosen as
  1005  the split key
  1006  ```
  1007  /Table/50/1/<column id>/1
  1008                           ^ family id
  1009  ```
  1010  it simply reassigns the split key as
  1011  ```
  1012  /Table/50/1/<column id>
  1013  ```
  1014  which would split the range such that the first key in row `<column id>` would
  1015  be the first key in the right range (ranges are specified with an inclusive
  1016  start key and an exclusive end key).
  1017  
  1018  To apply this for interleaves too, we can branch on whether the row belongs to
  1019  an interleaved table or not. If it's not an interleave table the `GetRowPrefixLength`
  1020  procedure will take place. If the row belongs to an interleave table, we can invoke
  1021  `GetTablePrefixLength`, where instead of taking the prefix up to and including
  1022  the `column id`s, it would instead take the prefix up to the last table's index
  1023  ID. The resulting split key from the above example is
  1024  ```
  1025  /Table/50/1
  1026  ```
  1027  We do this by decoding the key from left to right until the index of the last index ID
  1028  (that is the last index ID after encountering an `<interleave sentinel>`).
  1029  
  1030  To check whether the row belongs to an interleave table, we can either check
  1031  for `<interleave sentinel>`s or plumb a flag from above.
  1032  
  1033  There is an [outstanding
  1034  issue](https://github.com/cockroachdb/cockroach/issues/19296) with ranges
  1035  failing to split since `EnsureSafeSplitKey` would naively default to the
  1036  beginning of the row (which may be the first key in the range), where there may
  1037  be a better split point at the end of the row. The fix @a-robinson will
  1038  implement should work fine with interleave boundaries too.
  1039  
  1040  ## Unresolved questions
  1041  
  1042  - Prefix joins: approach #2 is equivalent to or strictly better than
  1043    approach #1, under the assumption that RocksDB caches things properly. Is
  1044    approach #1 the safer option since we won't know how RocksDB cache behaves in
  1045    production (i.e. when it runs out of cache space, since the amount RocksDB
  1046    will need to cache is far greater than what we need cached). The point of
  1047    approach #2 is if RocksDB caches our blocks anyways, then we don't need to
  1048    store as much in memory.