# Join

A join is a special type of pipeline input that enables you to combine
files that reside in separate Pachyderm repositories and match a
particular naming pattern. The join operator must be used in combination
with a glob pattern that reflects a specific naming convention.

By analogy, a Pachyderm join is similar to a database *equi-join*,
or *inner join* operation, but it matches on file paths
only, not the contents of the files.

Unlike the [cross input](../datum/cross-union.md), which creates datums
from every combination of files in each input repository, joins only create
datums where there is a *match*. You can use joins to combine data from
different Pachyderm repositories and ensure that only specific files from
each repo are processed together.

When you configure a join input, you must specify a glob pattern that
includes a capture group. The capture group defines the specific string in
the file path that is used to match files in other joined repos.
Capture groups work analogously to the [regex capture group](https://www.regular-expressions.info/refcapture.html).
You define the capture group inside parentheses. Capture groups are numbered
from left to right and can also be nested within each other. Numbering for
nested capture groups is based on the position of their opening parenthesis.

The following examples show how glob patterns with capture groups apply
to a file path. Suppose you have the following file path:

```shell
/foo/bar-123/ABC.txt
```

The following glob patterns in a join input create the
following capture groups:

| Glob pattern        | Capture groups           |
| ------------------- | ------------------------ |
| `/(*)`              | `foo`                    |
| `/*/bar-(*)`        | `123`                    |
| `/(*)/*/(??)*.txt`  | Capture group 1: `foo`, capture group 2: `AB`. |
| `/*/(bar-(123))/*`  | Capture group 1: `bar-123`, capture group 2: `123`. |

Joins also require you to specify a [replacement group](https://www.regular-expressions.info/replacebackref.html)
in the `join_on` parameter to define which capture groups you want to try
to match.

For example, `$1` indicates that you want Pachyderm to match based on
capture group `1`. Similarly, `$2` matches the capture group `2`.
`$1$2` means that it must match both capture groups `1` and `2`.

If Pachyderm does not find any matching files, you get a zero-datum job.

You can test your glob pattern and capture groups by using the
`pachctl glob file` command as described in
[Glob Pattern](../../datum/glob-pattern/#test-a-glob-pattern).

## Example

Suppose you have two repositories: one with sensor readings
and the other with parameters. The repositories have the following
structures:

* `readings` repo:

  ```shell
  ├── ID1234
      ├── file1.txt
      ├── file2.txt
      ├── file3.txt
      ├── file4.txt
      ├── file5.txt
  ```

* `parameters` repo:

  ```shell
  ├── file1.txt
  ├── file2.txt
  ├── file3.txt
  ├── file4.txt
  ├── file5.txt
  ├── file6.txt
  ├── file7.txt
  ├── file8.txt
  ```

Pachyderm runs your code only on the pairs of files that match
the glob pattern and capture groups.


The following example shows how you can use joins to group
matching IDs:

```json
{
  "pipeline": {
    "name": "joins"
  },
  "input": {
    "join": [
      {
        "pfs": {
          "repo": "readings",
          "branch": "master",
          "glob": "/*/(*).txt",
          "join_on": "$1"
        }
      },
      {
        "pfs": {
          "repo": "parameters",
          "branch": "master",
          "glob": "/(*).txt",
          "join_on": "$1"
        }
      }
    ]
  },
  "transform": {
    "cmd": [ "python3", "/joins.py"],
    "image": "joins-example"
  }
}
```

The glob pattern for the `readings` repository, `/*/(*).txt`, selects all
`.txt` files in the `ID1234` subdirectory and captures each file name
without its extension. In the `parameters` repository,
the glob pattern `/(*).txt` does the same for the `.txt` files in the root
directory.
All files with indices from `1` to `5` match because they exist in both
repositories. The files with indices from `6` to `8` exist only in
`parameters` and do not match. Therefore, you only get five
datums for this job.

To experiment further, see the full [joins example](https://github.com/pachyderm/pachyderm/tree/master/examples/joins).
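
To make the matching rule concrete, the following Python sketch emulates the pairing described above outside of Pachyderm. It is an illustration only, not Pachyderm's implementation: the helper names `glob_to_regex`, `join_key`, and `simulate_join` are hypothetical, the glob translation is an assumption that handles only `*`, `?`, and parentheses, and each glob is matched against the full file path.

```python
import re
from collections import defaultdict


def glob_to_regex(glob):
    """Translate a simplified glob with capture groups into a compiled regex.

    Assumption: only `*`, `?`, and parentheses are special in this sketch.
    """
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append("[^/]*")  # any run of characters within one path segment
        elif ch == "?":
            parts.append("[^/]")   # exactly one character within a path segment
        elif ch in "()":
            parts.append(ch)       # parentheses stay as capture-group delimiters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def join_key(glob, join_on, path):
    """Return the join key for `path`, or None if the glob does not match it."""
    match = glob_to_regex(glob).fullmatch(path)
    if match is None:
        return None
    # Expand each `$n` in join_on with the text of capture group n.
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), join_on)


def simulate_join(left, right):
    """Pair paths with equal join keys. Each side is (glob, join_on, paths)."""
    by_key = defaultdict(list)
    for path in right[2]:
        key = join_key(right[0], right[1], path)
        if key is not None:
            by_key[key].append(path)
    datums = []
    for path in left[2]:
        key = join_key(left[0], left[1], path)
        for other in by_key.get(key, []):
            datums.append((path, other))
    return datums


# File layout taken from the example repositories above.
readings = [f"/ID1234/file{i}.txt" for i in range(1, 6)]
parameters = [f"/file{i}.txt" for i in range(1, 9)]

datums = simulate_join(
    ("/*/(*).txt", "$1", readings),
    ("/(*).txt", "$1", parameters),
)
for pair in datums:
    print(pair)
print(len(datums))  # 5
```

In this sketch, `file6.txt` through `file8.txt` produce join keys that no path in `readings` shares, so they never appear in a pair, which mirrors the inner-join behavior described on this page.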