github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.10.x/concepts/pipeline-concepts/datum/join.md

# Join

A join is a special type of pipeline input that combines files that
reside in separate Pachyderm repositories and whose paths match a
particular naming pattern. You must use the join operator together
with a glob pattern that reflects a specific naming convention.

A Pachyderm join is analogous to a database *equi-join*,
or *inner join*, operation, but it matches on file paths
only, not on the contents of the files.

Unlike the [cross input](../datum/cross-union.md), which creates datums
from every combination of files in each input repository, a join creates
datums only where there is a *match*. You can use joins to combine data from
different Pachyderm repositories and ensure that only specific files from
each repo are processed together.

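To make the difference concrete, the following minimal sketch contrasts the two behaviors on made-up file listings. It assumes, purely for illustration, that two files match when their names without the extension are equal. The paths and the `stem` helper are hypothetical and are not part of Pachyderm.

```python
from itertools import product
from pathlib import PurePosixPath

# Hypothetical file listings for two input repositories.
left = ["/x/file1.txt", "/x/file2.txt", "/x/file3.txt"]
right = ["/file2.txt", "/file3.txt", "/file4.txt"]


def stem(path: str) -> str:
    """Return the file name without its extension, e.g. 'file2'."""
    return PurePosixPath(path).stem


# Cross: every combination of one file from each input.
cross_datums = list(product(left, right))

# Join: only the combinations whose keys are equal.
join_datums = [(l, r) for l, r in cross_datums if stem(l) == stem(r)]

print(len(cross_datums))  # 9
print(join_datums)        # the file2 pair and the file3 pair
```
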
When you configure a join input, you must specify a glob pattern that
includes a capture group. The capture group defines the specific string in
the file path that is used to match files in other joined repos.
Capture groups work like [regex capture groups](https://www.regular-expressions.info/refcapture.html).
You define a capture group inside parentheses. Capture groups are numbered
from left to right and can be nested within each other. Nested capture
groups are numbered by the position of their opening parenthesis.

The following examples apply glob patterns with capture groups to this
file path:

```shell
/foo/bar-123/ABC.txt
```

The following glob patterns in a join input create the
following capture groups:

| Glob pattern        | Capture groups                                       |
| ------------------- | ---------------------------------------------------- |
| `/(*)`              | `foo`                                                |
| `/*/bar-(*)`        | `123`                                                |
| `/(*)/*/(??)*.txt`  | Capture group 1: `foo`, capture group 2: `AB`.       |
| `/*/(bar-(123))/*`  | Capture group 1: `bar-123`, capture group 2: `123`.  |

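The table can be reproduced with a small sketch that translates each glob into a regular expression. This is a simplified illustration, not Pachyderm's glob implementation: it assumes that `*` and `?` never cross a `/`, that parentheses map directly to regex capture groups, and that a pattern may match a parent directory of the path (which is how `/(*)` yields `foo`). The helper names `glob_to_regex` and `capture_groups` are hypothetical.

```python
import re
from pathlib import PurePosixPath


def glob_to_regex(glob: str) -> str:
    """Translate a simplified glob with capture groups into a regex."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append("[^/]*")  # any characters within one path segment
        elif ch == "?":
            parts.append("[^/]")  # exactly one character within a segment
        elif ch in "()":
            parts.append(ch)  # capture-group parentheses pass through
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def capture_groups(glob: str, path: str):
    """Return the capture groups for the path or its nearest matching parent."""
    pattern = re.compile(glob_to_regex(glob))
    # Try the path itself first, then each parent directory in turn.
    candidates = [path] + [str(p) for p in PurePosixPath(path).parents]
    for candidate in candidates:
        match = pattern.fullmatch(candidate)
        if match:
            return match.groups()
    return None


path = "/foo/bar-123/ABC.txt"
for glob in ["/(*)", "/*/bar-(*)", "/(*)/*/(??)*.txt", "/*/(bar-(123))/*"]:
    print(glob, capture_groups(glob, path))
```
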
Joins also require you to specify a [replacement group](https://www.regular-expressions.info/replacebackref.html)
in the `join_on` parameter to define which capture groups you want to
match on.

For example, `$1` indicates that you want Pachyderm to match based on
capture group `1`. Similarly, `$2` matches on capture group `2`.
`$1$2` means that both capture groups `1` and `2` must match.

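One way to picture `join_on` is as a template that is filled in with the capture groups to produce a key, with files joined when their keys are equal. The sketch below works under that assumption and does not reflect Pachyderm's internals. The `join_key` helper is hypothetical, and the regular expression is a hand-written stand-in for the glob `/(*)/*/(??)*.txt`.

```python
import re


def join_key(match, join_on: str) -> str:
    """Expand `$N` references in join_on with the match's capture groups."""
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), join_on)


# Regex assumed equivalent to the glob `/(*)/*/(??)*.txt`.
pattern = re.compile(r"/([^/]*)/[^/]*/([^/][^/])[^/]*\.txt")
match = pattern.fullmatch("/foo/bar-123/ABC.txt")

print(join_key(match, "$1"))    # foo
print(join_key(match, "$2"))    # AB
print(join_key(match, "$1$2"))  # fooAB
```
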
If Pachyderm does not find any matching files, you get a zero-datum job.

You can test your glob pattern and capture groups by using the
`pachctl glob file` command as described in
[Glob Pattern](../../datum/glob-pattern/#test-a-glob-pattern).

## Example

For example, you have two repositories: one with sensor readings
and the other with parameters. The repositories have the following
structures:

* `readings` repo:

   ```shell
   ├── ID1234
       ├── file1.txt
       ├── file2.txt
       ├── file3.txt
       ├── file4.txt
       ├── file5.txt
   ```

* `parameters` repo:

   ```shell
   ├── file1.txt
   ├── file2.txt
   ├── file3.txt
   ├── file4.txt
   ├── file5.txt
   ├── file6.txt
   ├── file7.txt
   ├── file8.txt
   ```

Pachyderm runs your code only on the pairs of files that match
the glob pattern and capture groups.

The following pipeline specification joins files from the two
repositories by file name:

```json
{
  "pipeline": {
    "name": "joins"
  },
  "input": {
    "join": [
      {
        "pfs": {
          "repo": "readings",
          "branch": "master",
          "glob": "/*/(*).txt",
          "join_on": "$1"
        }
      },
      {
        "pfs": {
          "repo": "parameters",
          "branch": "master",
          "glob": "/(*).txt",
          "join_on": "$1"
        }
      }
    ]
  },
  "transform": {
    "cmd": ["python3", "/joins.py"],
    "image": "joins-example"
  }
}
```

The glob pattern for the `readings` repository, `/*/(*).txt`, selects every
`.txt` file one level below the root, which here means the files in the
`ID1234` directory. In the `parameters` repository, the glob pattern
`/(*).txt` selects every `.txt` file in the root directory. In both
patterns, the capture group is the file name without the `.txt` extension.
The files with indices from `1` to `5` exist in both repositories, so they
match. The files with indices from `6` to `8` exist only in `parameters`
and do not match. Therefore, you get only five datums for this job.

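The datum count can be checked with a short simulation of this example. It is a sketch under stated assumptions rather than a description of how Pachyderm computes datums: the two regular expressions are hand-written equivalents of the glob patterns, the `keyed` helper is hypothetical, and each key is assumed to identify at most one file per repository, which holds for the listings above.

```python
import re

# File listings taken from the two repository structures above.
readings = [f"/ID1234/file{i}.txt" for i in range(1, 6)]
parameters = [f"/file{i}.txt" for i in range(1, 9)]

# Regex equivalents of the globs (assumption: `*` never crosses `/`).
READINGS_RE = re.compile(r"/[^/]*/([^/]*)\.txt")  # "/*/(*).txt"
PARAMETERS_RE = re.compile(r"/([^/]*)\.txt")      # "/(*).txt"


def keyed(paths, pattern):
    """Map the `$1` capture group of each matching path to that path."""
    result = {}
    for path in paths:
        match = pattern.fullmatch(path)
        if match:
            result[match.group(1)] = path
    return result


readings_by_key = keyed(readings, READINGS_RE)
parameters_by_key = keyed(parameters, PARAMETERS_RE)

# A datum exists only for keys present in both inputs.
shared_keys = sorted(readings_by_key.keys() & parameters_by_key.keys())
datums = [(readings_by_key[k], parameters_by_key[k]) for k in shared_keys]

for datum in datums:
    print(datum)
print(len(datums))  # 5
```
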
To experiment further, see the full [joins example](https://github.com/pachyderm/pachyderm/tree/master/examples/joins).