# Join Input

A join is a special type of pipeline input that enables you to combine
files that reside in separate Pachyderm repositories and match a
particular naming pattern. The join operator must be used in combination
with a [glob pattern](../datum/glob-pattern.md) that reflects a specific naming convention.
Note that in Pachyderm, matches are made on file paths
only, not on the files' content.

Pachyderm supports two types of joins:

- The default join, similar to a database *equi-join*
  or *inner-join* operation. Unlike the [cross input](../datum/cross-union.md),
  which creates datums from every combination of files in each input repository,
  inner joins **only create datums where there is a *match***.
  You can use inner joins to combine data from different Pachyderm repositories
  and ensure that only specific files from
  each repo are processed together.
  If Pachyderm does not find any matching files, you get a zero-[datum](../datum/index.md) job.
- A join similar to a database *outer-join*, which
  **creates datums for all files in a repo, even if there is no match**.
  Outer-join behavior can be set on any repository in your join.

When you configure a join input (inner or outer), you must specify a glob pattern that
includes a capture group. The capture group defines the specific string in
the file path that is used to match files in other joined repos.
Capture groups work analogously to [regex capture groups](https://www.regular-expressions.info/refcapture.html).
You define a capture group inside parentheses. Capture groups are numbered
from left to right and can also be nested within each other. Numbering for
nested capture groups is based on the position of their opening parenthesis.
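
The numbering rule for nested groups can be checked with ordinary regular expressions, which number groups the same way. The following is a minimal sketch that uses Python's standard `re` module as a stand-in for Pachyderm's glob matcher. The path and the pattern are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical file path, used only for illustration.
path = "/sensors/run-42/data.csv"

# Regex analogue of a glob with nested capture groups.
# Group 1 opens first (outer parenthesis); group 2 opens second (inner).
pattern = re.compile(r"^/[^/]+/(run-(\d+))/[^/]+$")

match = pattern.match(path)
groups = match.groups()
print(groups)  # ('run-42', '42')
```

The outer group is numbered `1` because its opening parenthesis comes first, even though it closes last.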

Below are a few examples of applying a glob pattern with a capture
group to a file path. Suppose you have the following file path:

```shell
/foo/bar-123/ABC.txt
```

The following glob patterns in a join input create the
following capture groups:

| Glob pattern        | Capture groups           |
| ------------------- | ------------------------ |
| `/(*)`              | `foo`                    |
| `/*/bar-(*)`        | `123`                    |
| `/(*)/*/(??)*.txt`  | Capture group 1: `foo`, capture group 2: `AB`. |
| `/*/(bar-(123))/*`  | Capture group 1: `bar-123`, capture group 2: `123`. |

Joins also require you to specify a [replacement group](https://www.regular-expressions.info/replacebackref.html)
in the `join_on` parameter to define which capture groups you want to try
to match.

For example, `$1` indicates that you want Pachyderm to match based on
capture group `1`. Similarly, `$2` matches on capture group `2`.
`$1$2` means that both capture groups `1` and `2` must match.

See the full `join` input configuration in the [pipeline specification](../../../reference/pipeline_spec.md).

You can test your glob pattern and capture groups by using the
`pachctl list datum -f <your_pipeline_spec.json>` command as described in
[List Datum](../../datum/glob-pattern/#test-your-datums).

## Inner Join

By default, a join input has `inner-join` behavior.

### Inner Join Example

For example, suppose you have two repositories: one with sensor readings
and the other with parameters.
The repositories have the following
structures:

* `readings` repo:

  ```shell
  ├── ID1234
      ├── file1.txt
      ├── file2.txt
      ├── file3.txt
      ├── file4.txt
      ├── file5.txt
  ```

* `parameters` repo:

  ```shell
  ├── file1.txt
  ├── file2.txt
  ├── file3.txt
  ├── file4.txt
  ├── file5.txt
  ├── file6.txt
  ├── file7.txt
  ├── file8.txt
  ```

Pachyderm runs your code only on the pairs of files that match
the glob pattern and capture groups.

The following example shows how you can use joins to group
matching IDs:

```json
{
  "pipeline": {
    "name": "joins"
  },
  "input": {
    "join": [
      {
        "pfs": {
          "repo": "readings",
          "branch": "master",
          "glob": "/*/(*).txt",
          "join_on": "$1"
        }
      },
      {
        "pfs": {
          "repo": "parameters",
          "branch": "master",
          "glob": "/(*).txt",
          "join_on": "$1"
        }
      }
    ]
  },
  "transform": {
    "cmd": [ "python3", "/joins.py"],
    "image": "joins-example"
  }
}
```

The glob pattern for the `readings` repository, `/*/(*).txt`, matches all
`.txt` files in the `ID1234` subdirectory. In the `parameters` repository,
the glob pattern `/(*).txt` matches all `.txt` files in the root
directory. In both patterns, the capture group holds the file name without
its extension, and that string is what `join_on` compares.
All files with indices from `1` to `5` match. The `parameters` files
with indices from `6` to `8` do not match. Therefore, you get only five
datums for this job.

To experiment further, see the full [joins example](https://github.com/pachyderm/pachyderm/tree/master/examples/joins).

## Outer Join

Pachyderm also supports outer joins. An outer join includes everything a normal
(inner) join does, plus the files that did not match anything. You can set
outer semantics on each input independently. So while there is no explicit notion of
"left" or "right" outer joins, you can still get those semantics, and even
extend them to multiway joins.
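
The difference between inner and outer semantics can be simulated in a few lines of Python. This is a rough sketch, not Pachyderm's implementation: the regular expressions are hand-written stand-ins for the two glob patterns in the inner join example, and `join_keys` and `join` are hypothetical helpers introduced only for illustration.

```python
import re

# File paths from the inner join example.
readings = [f"/ID1234/file{i}.txt" for i in range(1, 6)]
parameters = [f"/file{i}.txt" for i in range(1, 9)]


def join_keys(paths, regex):
    """Group paths by their join key (capture group 1 of the regex)."""
    keys = {}
    for path in paths:
        m = re.fullmatch(regex, path)
        if m:
            keys.setdefault(m.group(1), []).append(path)
    return keys


def join(left, right, outer_left=False, outer_right=False):
    """Build datums as (left_file, right_file) pairs; None marks a missing side."""
    datums = []
    for key in sorted(set(left) | set(right)):
        l, r = left.get(key, []), right.get(key, [])
        if l and r:
            # Assumption of this sketch: pair every file that shares the key.
            datums.extend((a, b) for a in l for b in r)
        elif l and outer_left:
            datums.extend((a, None) for a in l)
        elif r and outer_right:
            datums.extend((None, b) for b in r)
    return datums


r_keys = join_keys(readings, r"/[^/]*/([^/]*)\.txt")   # stand-in for /*/(*).txt
p_keys = join_keys(parameters, r"/([^/]*)\.txt")       # stand-in for /(*).txt

inner = join(r_keys, p_keys)
outer = join(r_keys, p_keys, outer_right=True)
print(len(inner))  # 5
print(len(outer))  # 8
```

Setting `outer_left` instead of, or in addition to, `outer_right` mimics a "left" or "full" outer join in the same way.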

### Outer Join Example

Building on the example above, notice that three files in the
`parameters` repo, `file6.txt`, `file7.txt`, and `file8.txt`, do not match
any files in the `readings` repo. In a normal join, those files are simply
omitted. If you still want to process them without a match, you can use an outer
join. The change to the pipeline spec is a single added field, `outer_join`:

```json
{
  "pipeline": {
    "name": "joins"
  },
  "input": {
    "join": [
      {
        "pfs": {
          "repo": "readings",
          "branch": "master",
          "glob": "/*/(*).txt",
          "join_on": "$1"
        }
      },
      {
        "pfs": {
          "repo": "parameters",
          "branch": "master",
          "glob": "/(*).txt",
          "join_on": "$1",
          "outer_join": true
        }
      }
    ]
  },
  "transform": {
    "cmd": [ "python3", "/joins.py"],
    "image": "joins-example"
  }
}
```

Your code still sees the joined pairs that it saw before. In addition to
those five datums, your code also sees three new ones, one for each of the
parameter files that did not have a match. This means that your code must
handle datums in which only some of the inputs are present under `/pfs`.

To experiment further, see the full [join example](https://github.com/pachyderm/pachyderm/tree/master/examples/joins).
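
The requirement that your code tolerate missing inputs under `/pfs` can be sketched as follows. This is a minimal illustration, not the actual `joins.py` from the example. It assumes that each input is mounted in a directory named after its repo (`/pfs/readings` and `/pfs/parameters`), and the helpers `list_input_files` and `classify_datum` are hypothetical. The demo at the end uses a temporary directory in place of `/pfs` so that the sketch runs anywhere.

```python
import os
import tempfile


def list_input_files(pfs_root, input_names=("readings", "parameters")):
    """Return {input_name: [file paths]}; an absent input maps to []."""
    found = {}
    for name in input_names:
        input_dir = os.path.join(pfs_root, name)
        files = []
        # With an outer join, an input directory may be missing or empty.
        if os.path.isdir(input_dir):
            for dirpath, _dirnames, filenames in os.walk(input_dir):
                files.extend(os.path.join(dirpath, f) for f in sorted(filenames))
        found[name] = files
    return found


def classify_datum(pfs_root="/pfs"):
    """Decide how to process a datum based on which inputs are present."""
    inputs = list_input_files(pfs_root)
    has_readings = bool(inputs["readings"])
    has_parameters = bool(inputs["parameters"])
    if has_readings and has_parameters:
        return "matched"
    if has_parameters:
        return "parameters-only"
    if has_readings:
        return "readings-only"
    return "empty"


# Demo: simulate an unmatched datum that contains only a parameters file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "parameters"))
    with open(os.path.join(root, "parameters", "file6.txt"), "w") as fh:
        fh.write("example\n")
    result = classify_datum(root)

print(result)  # parameters-only
```

Checking for the directory before reading it keeps the transform from crashing on the three unmatched datums that the outer join adds.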