# Cross and Union Inputs

<!---This section needs to be made more clear. There is a lot of information
that I would say describes the things you can do with a cross or union pipeline
but does not really have a good and clear explanation of what they are -->

Pachyderm enables you to combine multiple
input repositories in a single pipeline by using the `union` and
`cross` operators in the pipeline specification.

If you are familiar with [set theory](https://en.wikipedia.org/wiki/Set_theory),
you can think of union as a *disjoint union binary operator* and cross as a
*Cartesian product binary operator*. However, if you are unfamiliar with these
concepts, it is still easy to understand how cross and union work in Pachyderm.

This section describes how to use `cross` and `union` in your pipelines and how you
can optimize your code when you work with them.

## Union Input

The union input combines the datums from all of the input repos into one
set of datums.
The number of datums that are processed is the sum of the
datums in each repo.

For example, suppose you have two input repos, `A` and `B`. Each of these
repositories contains three files with the following names.

Repository `A` has the following structure:

```shell
A
├── 1.txt
├── 2.txt
└── 3.txt
```

Repository `B` has the following structure:

```shell
B
├── 4.txt
├── 5.txt
└── 6.txt
```

If you want your pipeline to process each file independently as a
separate datum, use a glob pattern of `/*`. Each
glob is applied to each input independently.
The input section
in the pipeline spec might have the following structure:

```json
"input": {
    "union": [
        {
            "pfs": {
                "glob": "/*",
                "repo": "A"
            }
        },
        {
            "pfs": {
                "glob": "/*",
                "repo": "B"
            }
        }
    ]
}
```

In this example, each Pachyderm repository has three files in the root
directory, so each input contributes three datums. Therefore, the union of `A` and `B`
has six datums in total.
Your pipeline processes the following datums in no particular order:

```shell
/pfs/A/1.txt
/pfs/A/2.txt
/pfs/A/3.txt
/pfs/B/4.txt
/pfs/B/5.txt
/pfs/B/6.txt
```

!!! note
    Each datum in a pipeline is processed independently by a single
    execution of your code. In this example, your code runs six times, and
    each datum is available to it one at a time. For example, your code
    processes `/pfs/A/1.txt` in one of the runs and `/pfs/B/5.txt` in a
    different run, and so on. In a union, two or more datums are never
    available to your code at the same time. You can simplify
    your union code by using the `name` property as described below.

### Simplifying the Union Pipeline Code

In the example above, your code needs to read from either the `/pfs/A`
_or_ the `/pfs/B` directory because only one of them is present in any given datum.
To simplify your code, you can add the `name` field to the `pfs` object and
give the same name to each of the input repos. For example, you can add the
`name` field with the value `C` to the input repositories `A` and `B`:

```json
"input": {
    "union": [
        {
            "pfs": {
                "name": "C",
                "glob": "/*",
                "repo": "A"
            }
        },
        {
            "pfs": {
                "name": "C",
                "glob": "/*",
                "repo": "B"
            }
        }
    ]
}
```

Then, in the pipeline, all datums appear in the same directory.

```shell
/pfs/C/1.txt  # from A
/pfs/C/2.txt  # from A
/pfs/C/3.txt  # from A
/pfs/C/4.txt  # from B
/pfs/C/5.txt  # from B
/pfs/C/6.txt  # from B
```

## Cross Input

In a cross input, Pachyderm exposes every combination of datums,
or a cross-product, from each of your input repositories to your code,
one combination per run.
In other words, a cross input pairs every datum in one repository with
each datum in another, creating sets of datums. Your transformation
code is provided one of these sets at a time to process.

For example, suppose you have repositories `A` and `B`, each with three
datums and the following structure:

!!! note
    For this example, the glob pattern is set to `/*`.

Repository `A` has three files at the top level:

```shell
A
├── 1.txt
├── 2.txt
└── 3.txt
```

Repository `B` has three files at the top level:

```shell
B
├── 4.txt
├── 5.txt
└── 6.txt
```

Because you have three datums in each repo, Pachyderm exposes
a total of nine combinations of datums to your code.

!!! important
    In cross pipelines, both the `/pfs/A` and `/pfs/B`
    directories are visible during each code run.

```shell
Run 1: /pfs/A/1.txt
       /pfs/B/4.txt

Run 2: /pfs/A/1.txt
       /pfs/B/5.txt
...

Run 9: /pfs/A/3.txt
       /pfs/B/6.txt
```

!!! note
    In cross inputs, if you use the `name` field, your two
    inputs cannot have the same name because identical names
    could cause file system collisions.

!!! note "See Also:"

    - [Cross Input](../../../../reference/pipeline_spec/#cross-input)
    - [Union Input](../../../../reference/pipeline_spec/#union-input)
    - [Distributed hyperparameter tuning](https://github.com/pachyderm/pachyderm/tree/master/examples/ml/hyperparameter)

<!-- Add a link to an interactive tutorial when it's ready-->
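
As a rough illustration of the datum counts described on this page, the following Python sketch models how a union and a cross of repositories `A` and `B` would group files into datums when the glob pattern is `/*`. It is only a model of the semantics, not Pachyderm's implementation: the file lists are hard-coded from the examples above, and the helper names `union_datums` and `cross_datums` are hypothetical.

```python
from itertools import product

# File listings copied from the example repos A and B (glob pattern "/*").
repo_a = ["/pfs/A/1.txt", "/pfs/A/2.txt", "/pfs/A/3.txt"]
repo_b = ["/pfs/B/4.txt", "/pfs/B/5.txt", "/pfs/B/6.txt"]


def union_datums(*inputs):
    """Model a union: every file is its own datum, so each run sees one file."""
    return [(path,) for files in inputs for path in files]


def cross_datums(*inputs):
    """Model a cross: every combination of one file per input is a datum."""
    return list(product(*inputs))


union = union_datums(repo_a, repo_b)
cross = cross_datums(repo_a, repo_b)

print(len(union))  # 3 + 3 datums
print(len(cross))  # 3 * 3 datums
for run, datum in enumerate(cross, start=1):
    print(f"Run {run}: {' '.join(datum)}")
```

The sketch prints the combinations in a fixed order only because `itertools.product` is deterministic; as noted above, a pipeline does not process datums in any specific order.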