# Join Input

A join is a special type of pipeline input that enables you to combine
files that reside in separate Pachyderm repositories and match a
particular naming pattern. The join operator must be used in combination
with a [glob pattern](../datum/glob-pattern.md) that reflects a specific naming convention.
Note that in Pachyderm, matches are made on file paths
only, not on the files' content.

Pachyderm supports two types of joins:

- The default join behaves like a database *equi-join*
or *inner-join* operation. Unlike the [cross input](../datum/cross-union.md),
which creates datums from every combination of files in each input repository,
inner joins **only create datums where there is a *match***.
You can use inner joins to combine data from different Pachyderm repositories
and ensure that only specific files from
each repo are processed together.
If Pachyderm does not find any matching files, you get a zero-[datum](../datum/index.md) job.
- Pachyderm also supports a join similar to a database `outer-join`,
which lets you **create datums for all files in a repo, even if there is no match**.
The `outer-join` behavior can be set on any repository in your join.

When you configure a join input (inner or outer), you must specify a glob pattern that
includes a capture group. The capture group defines the specific string in
the file path that is used to match files in other joined repos.
Capture groups work analogously to the [regex capture group](https://www.regular-expressions.info/refcapture.html).
You define the capture group inside parentheses. Capture groups are numbered
from left to right and can also be nested within each other. Numbering for
nested capture groups is based on their opening parenthesis.

The following examples apply glob patterns with capture groups
to a file path. Suppose you have the following file path:

```shell
/foo/bar-123/ABC.txt
```

The following glob patterns in a join input create the
following capture groups:

| Glob pattern        | Capture groups           |
| ------------------- | ------------------------ |
| `/(*)`              | `foo`                    |
| `/*/bar-(*)`        | `123`                    |
| `/(*)/*/(??)*.txt`  | Capture group 1: `foo`, capture group 2: `AB`. |
| `/*/(bar-(123))/*`  | Capture group 1: `bar-123`, capture group 2: `123`. |

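The capture groups in the table can be reproduced with a short Python sketch.
This is an approximation for illustration only, not Pachyderm's implementation:
it assumes that `*` and `?` match within a single path component and that a glob
is tested against the file path and each of its parent directories. The helper
names `glob_to_regex` and `capture_groups` are hypothetical.

```python
import re


def glob_to_regex(glob):
    """Translate a simplified glob (only *, ?, and parentheses) into a regex."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append("[^/]*")   # any run of characters inside one component
        elif ch == "?":
            parts.append("[^/]")    # exactly one character inside one component
        elif ch in "()":
            parts.append(ch)        # parentheses become regex capture groups
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def capture_groups(glob, path):
    """Return the capture groups for the first prefix of `path` the glob matches."""
    regex = glob_to_regex(glob)
    pieces = path.strip("/").split("/")
    # Assumption: the glob may match the file itself or any parent directory.
    for depth in range(1, len(pieces) + 1):
        candidate = "/" + "/".join(pieces[:depth])
        match = regex.fullmatch(candidate)
        if match:
            return list(match.groups())
    return None


for glob in ["/(*)", "/*/bar-(*)", "/(*)/*/(??)*.txt", "/*/(bar-(123))/*"]:
    print(glob, capture_groups(glob, "/foo/bar-123/ABC.txt"))
```

Nested groups are numbered by their opening parenthesis, which is also how
Python's `re` module numbers them, so the last pattern yields `bar-123` first
and `123` second.
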
Joins also require you to specify a [replacement group](https://www.regular-expressions.info/replacebackref.html)
in the `join_on` parameter to define which capture groups Pachyderm
uses for matching.

For example, `$1` indicates that you want Pachyderm to match based on
capture group `1`. Similarly, `$2` matches on capture group `2`.
`$1$2` means that both capture groups `1` and `2` must match.

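One way to picture `join_on` is as a template that is expanded into a join key
for each file. The sketch below assumes that the key is built by substituting
each `$n` with the text of capture group `n`; the function name `join_key` is
hypothetical, and Pachyderm's actual key construction may differ.

```python
import re


def join_key(join_on, groups):
    """Expand $1, $2, ... in `join_on` with the corresponding capture groups."""
    def expand(match):
        index = int(match.group(1))
        return groups[index - 1]  # capture groups are numbered starting at 1

    return re.sub(r"\$(\d+)", expand, join_on)


# Capture groups of /(*)/*/(??)*.txt applied to /foo/bar-123/ABC.txt
groups = ["foo", "AB"]
print(join_key("$1", groups))
print(join_key("$2", groups))
print(join_key("$1$2", groups))
```

Under this assumption, two files are joined when their expanded keys are equal.
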
See the full `join` input configuration in the [pipeline specification](../../../reference/pipeline_spec.md).

You can test your glob pattern and capture groups by using the
`pachctl list datum -f <your_pipeline_spec.json>` command as described in
[List Datum](../../datum/glob-pattern/#test-your-datums).

## Inner Join

By default, a join input has `inner-join` behavior.

### Inner Join Example

Suppose you have two repositories, one with sensor readings
and the other with parameters. The repositories have the following
structures:

* `readings` repo:

   ```shell
   └── ID1234
       ├── file1.txt
       ├── file2.txt
       ├── file3.txt
       ├── file4.txt
       └── file5.txt
   ```

* `parameters` repo:

   ```shell
   ├── file1.txt
   ├── file2.txt
   ├── file3.txt
   ├── file4.txt
   ├── file5.txt
   ├── file6.txt
   ├── file7.txt
   └── file8.txt
   ```

Pachyderm runs your code only on the pairs of files that match
the glob pattern and capture groups.

The following example shows how you can use joins to group
matching IDs:

```json
{
  "pipeline": {
    "name": "joins"
  },
  "input": {
    "join": [
      {
        "pfs": {
          "repo": "readings",
          "branch": "master",
          "glob": "/*/(*).txt",
          "join_on": "$1"
        }
      },
      {
        "pfs": {
          "repo": "parameters",
          "branch": "master",
          "glob": "/(*).txt",
          "join_on": "$1"
        }
      }
    ]
  },
  "transform": {
    "cmd": ["python3", "/joins.py"],
    "image": "joins-example"
  }
}
```

The glob pattern for the `readings` repository, `/*/(*).txt`, selects all
`.txt` files in the `ID1234` subdirectory and captures each file name without
its extension. In the `parameters` repository, the glob pattern `/(*).txt`
does the same for the `.txt` files in the root directory.
The files `file1.txt` through `file5.txt` exist in both repositories, so they
match. The files `file6.txt` through `file8.txt` exist only in `parameters`
and do not match. Therefore, you get only five datums for this job.

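The datum count can be checked with a small simulation. This sketch
hand-translates the two glob patterns into regular expressions and pairs files
whose first capture group is equal; it illustrates the matching logic only and
does not call Pachyderm. The helper name `files_by_key` is hypothetical.

```python
import re

# File paths taken from the repository listings above.
readings = [f"/ID1234/file{i}.txt" for i in range(1, 6)]
parameters = [f"/file{i}.txt" for i in range(1, 9)]


def files_by_key(paths, pattern):
    """Group paths by their first capture group (the `$1` join key)."""
    grouped = {}
    for path in paths:
        match = re.fullmatch(pattern, path)
        if match:
            grouped.setdefault(match.group(1), []).append(path)
    return grouped


# Regex equivalents of the globs "/*/(*).txt" and "/(*).txt".
readings_by_key = files_by_key(readings, r"/[^/]*/([^/]*)\.txt")
parameters_by_key = files_by_key(parameters, r"/([^/]*)\.txt")

# Inner join: keep only keys that appear in both repositories.
shared_keys = sorted(readings_by_key.keys() & parameters_by_key.keys())
inner = [
    (r, p)
    for key in shared_keys
    for r in readings_by_key[key]
    for p in parameters_by_key[key]
]

for datum in inner:
    print(datum)
print(len(inner), "datums")
```
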
To experiment further, see the full [joins example](https://github.com/pachyderm/pachyderm/tree/master/examples/joins).

## Outer Join

Pachyderm also supports outer joins. An outer join includes everything a normal
(inner) join does, plus the files that did not match anything. Each input can be
set to outer semantics independently. So while there is no explicit notion of
"left" or "right" outer joins, you can still get those semantics, and even
extend them to multiway joins.

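To make the per-input semantics concrete, here is a minimal two-input sketch.
It assumes each input has already been reduced to a mapping from join key to
file, and it uses `None` to mark an input that is absent from a datum. The
function `join_datums` and its flags are hypothetical and only mirror the idea
of setting outer behavior on each input separately.

```python
def join_datums(left, right, left_outer=False, right_outer=False):
    """Join two {key: file} mappings; None marks an input with no match."""
    datums = []
    for key in sorted(left.keys() | right.keys()):
        in_left, in_right = key in left, key in right
        if in_left and in_right:
            datums.append((left[key], right[key]))   # matched pair
        elif in_left and left_outer:
            datums.append((left[key], None))         # unmatched left file kept
        elif in_right and right_outer:
            datums.append((None, right[key]))        # unmatched right file kept
    return datums


left = {"a": "/left/a.txt", "b": "/left/b.txt"}
right = {"b": "/right/b.txt", "c": "/right/c.txt"}

print(join_datums(left, right))                    # inner join
print(join_datums(left, right, left_outer=True))   # "left" outer join
print(join_datums(left, right, right_outer=True))  # "right" outer join
print(join_datums(left, right, True, True))        # outer on both inputs
```
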
### Outer Join Example

Building on the example above, notice that three files in the
`parameters` repo, `file6.txt`, `file7.txt`, and `file8.txt`, don't match
any files in the `readings` repo. In a normal join those files are simply
omitted, but if you still want to process them without a match, you can use an outer
join. The change to the pipeline spec is a single `outer_join` field:

```json
{
  "pipeline": {
    "name": "joins"
  },
  "input": {
    "join": [
      {
        "pfs": {
          "repo": "readings",
          "branch": "master",
          "glob": "/*/(*).txt",
          "join_on": "$1"
        }
      },
      {
        "pfs": {
          "repo": "parameters",
          "branch": "master",
          "glob": "/(*).txt",
          "join_on": "$1",
          "outer_join": true
        }
      }
    ]
  },
  "transform": {
    "cmd": ["python3", "/joins.py"],
    "image": "joins-example"
  }
}
```

Your code still sees the five joined pairs that it saw before. In addition to
those five datums, your code also sees three new ones, one for each of the
parameter files that didn't have a match. This means that your code must
handle datums in which only some of the inputs are present under `/pfs`.

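A defensive way to write the transform code is to check which inputs are
present before reading them. The sketch below assumes that each input appears
under `/pfs/<repo>` and that an unmatched input shows up as a missing or empty
directory; that second point is an assumption, so verify it against your
Pachyderm version. The function `read_inputs` is hypothetical and is not taken
from the `joins.py` script in the example.

```python
import os
import tempfile


def read_inputs(pfs_root="/pfs", repos=("readings", "parameters")):
    """Return {repo: [file paths]} for the inputs present in this datum."""
    present = {}
    for repo in repos:
        repo_dir = os.path.join(pfs_root, repo)
        if not os.path.isdir(repo_dir):
            continue  # input is absent from this datum
        files = []
        for dirpath, _dirnames, filenames in os.walk(repo_dir):
            files.extend(os.path.join(dirpath, name) for name in filenames)
        if files:
            present[repo] = sorted(files)
    return present


# Demonstration with a temporary directory standing in for /pfs:
# only the `parameters` input is present, as for an unmatched file6.txt.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "parameters"))
    with open(os.path.join(root, "parameters", "file6.txt"), "w") as handle:
        handle.write("example\n")
    inputs = read_inputs(pfs_root=root)
    print(sorted(inputs))
```
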
To experiment further, see the full [joins example](https://github.com/pachyderm/pachyderm/tree/master/examples/joins).