github.com/pachyderm/pachyderm@v1.13.4/examples/joins/README.md (about) 1 > INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches: 2 > - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples 3 > - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples 4 5 # Inner and Outer Join Inputs 6 > The outer join input is available in version **1.12 and higher**. 7 8 - In our first example, we will create a pipeline whose input datums result from a simple `inner join` between 2 repos. 9 - In our second example, we will showcase 3 variations of `outer join` pipelines between 2 repos, and outline how they differ from inner join and each other. 10 11 At the end of this page, you will understand the fundamental difference between the datums produced by an inner join and those created by an outer join. 12 13 ***Table of Contents*** 14 15 - [1. Getting ready](#1-getting-ready) 16 - [2. Data structure and naming convention](#2-data-structure-and-naming-convention) 17 - [3. Data preparation](#3-data-preparation) 18 - [4. Example 1 : An Inner Join pipeline creation](#4-example-1--an-inner-join-pipeline-creation) 19 - [5. Example 2 : Outer Join pipeline creation](#5-example-2--outer-join-pipeline-creation) 20 - [***Case 1*** Outer join on the Returns repo only](#case-1-outer-join-on-the-returns-repo-only) 21 - [***Case 2*** Outer join on the Stores repo only](#case-2-outer-join-on-the-stores-repo-only) 22 - [***Case 3*** Outer join on both the Stores and Returns repos](#case-3-outer-join-on-both-the-stores-and-returns-repos) 23 24 ***Key concepts*** 25 26 For these examples, we recommend to be familiar with the following concepts: 27 28 - [Join](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/join/) pipelines - execute your code on files that match a specific naming pattern in your joined repos. 29 - [Glob patterns](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/glob-pattern/) - for "RegEx-like" string matching on file paths and names. 30 31 Additionally, you might want to check [datum](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/relationship-between-datums/). 32 33 ## 1. Getting ready 34 ***Prerequisite*** 35 - A workspace on [Pachyderm Hub](https://docs.pachyderm.com/1.13.x/pachhub/pachhub_getting_started/) (recommended) or Pachyderm running [locally](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/). 36 - [pachctl command-line ](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/#install-pachctl) installed, and your context created (i.e. you are logged in) 37 38 ***Getting started*** 39 - Clone this repo. 40 - Make sure Pachyderm is running. You should be able to connect to your Pachyderm cluster via the `pachctl` CLI. 41 Run a quick: 42 ```shell 43 $ pachctl version 44 45 COMPONENT VERSION 46 pachctl 1.12.0 47 pachd 1.12.0 48 ``` 49 Ideally, have your pachctl and pachd versions match. At a minimum, you should always use the same major & minor versions of pachctl and pachd. 50 ## 2. Data structure and naming convention 51 52 > Remember, in Pachyderm, the join operates at the file-path level, **not** the files' content. Therefore, the structure of your directories and file naming conventions are key elements when implementing your use cases in Pachyderm. 53 54 We have derived our examples from simplified retail use cases: 55 - Purchases and Returns are made in given Stores. 56 - Those Stores have a given location (here, a zip code). 57 - There are 0 to many Stores for a given zip code. 58 59 Let's take a look at our data structure and naming convention. 60 We will create 3 repos: 61 * Repo: `stores` - Each store data are JSON files named after the storeID. 62 ```shell 63 └── STOREID1.txt 64 └── STOREID2.txt 65 └── STOREID3.txt 66 └── ... 67 ``` 68 To further these examples beyond the datum creation, we have added some simple processing at the code level. We will use the `zipcode` property to aggregate our data. 69 70 This is what the content of one of those STOREIDx.txt file looks like. 71 ```shell 72 { 73 "storeid":"4", 74 "name":"mariposa st.", 75 "address":{ 76 "zipcode":"94107", 77 "country":"US" 78 } 79 } 80 ``` 81 82 * Repo: `purchases` - Each purchase info is kept in a file named by concatenating the purchase's order number and its store ID. 83 ```shell 84 └── ORDERW080520_STOREID1.txt 85 └── ORDERW080521_STOREID1.txt 86 └── ORDERW078929_STOREID2.txt 87 └── ... 88 ``` 89 * Repo: `returns` - Same naming convention as purchases. 90 ```shell 91 └── ORDERW080528_STOREID5.txt 92 └── ODERW080520_STOREID1.txt 93 └── ... 94 ``` 95 ## 3. Data preparation 96 Let's start by creating our mock data and populate our repositories. 97 This preparatory step will allow you to experiment with the inner and outer join examples in any order. 98 99 ***Step 1*** - Prepare your data 100 101 The setup target of the `Makefile` in `pachyderm/examples/joins` will create 3 directories (stores, purchases, and returns) containing our example data. 102 In the `examples/joins` directory, run: 103 ```shell 104 $ make setup 105 ``` 106 ***Step 2*** - Create and populate Pachyderm's repositories from the directories above. 107 108 In the `examples/joins` directory, run: 109 ```shell 110 $ make create 111 ``` 112 or run: 113 ```shell 114 $ pachctl create repo stores 115 $ pachctl create repo purchases 116 $ pachctl create repo returns 117 $ pachctl put file -r stores@master:/ -f stores 118 $ pachctl put file -r purchases@master:/ -f purchases 119 $ pachctl put file -r returns@master:/ -f returns 120 ``` 121 122 Have a look at the content of your repositories: 123 ```shell 124 $ pachctl list file stores@master 125 $ pachctl list file purchases@master 126 $ pachctl list file returns@master 127 ``` 128 You should see the following files in each repo: 129 - Purchases: 130 131  132 133 - Stores: 134 135  136 137 - Returns: 138 139  140 141 You are ready to run any of the following examples. 142 ## 4. Example 1 : An Inner Join pipeline creation 143 144 ***Goal*** 145 146 We are going to list all purchases by zip code and understand what datums are created by our `inner join` input in the process. 147 Quick overview of our pipeline: 148 149 1. **Pipeline input repositories**: The `purchases` and `stores` repos are joined (`inner join`) on the STOREID of their file name. 150 151 | Purchases | Stores | 152 |-----------|--------| 153 | || 154 155 2. **Pipeline**: The [inner_join.json](./inner_join.json) pipeline will first create input datums following its `glob` pattern and `"join_on": "$1"` values, then execute some Python code reading the `zipcode` in their STOREIDx.txt file. 156 3. **Pipeline output repository**: The output repo `inner_join` will contain a list of text files named after the stores' zip codes. Each will list all purchases made in a specific zip code. 157 158 We will expect the 2 following files in the output repository of our `inner_join` pipeline: 159 ```shell 160 └── 02108.txt 161 └── 90210.txt 162 ``` 163 Each should contain 3 purchases. 164 165 ***Step 1*** - Before we create our `inner join` pipeline, we can preview what our datums will look like by running the following command in the `examples/joins` directory: 166 167 ```shell 168 $ pachctl list datum -f inner_join.json 169 ``` 170  171 172 Note that no purchase was made at the STOREID4; therefore **no datum was created**. See diagram below. This is a characteristic of an `inner join`. We only see the stores in which purchases were made. 173 174  175 176 ***Step 2*** - Let's now create our pipeline and see it in action. 177 178 Because unprocessed data are awaiting in our input repositories, the pipeline creation will automatically trigger a job. 179 In the `examples/joins` directory, run: 180 ```shell 181 $ pachctl create pipeline -f inner_join.json 182 ``` 183 184 Check your pipeline's status: 185 ```shell 186 $ pachctl list pipeline 187 ``` 188 189 ***Step 3*** - Check the output repository of the pipeline, now that our code has aggregated those datums per zip code: 190 191 ```shell 192 $ pachctl list file inner_join@master 193 ``` 194 You should see our 2 expected text files. 195 196  197 198 For a visual confirmation of their content, run: 199 200 ```shell 201 $ pachctl get file inner_join@master:/02108.txt 202 ``` 203 The following table lists the expected results for this scenario: 204 205 |/02108.txt|/90210.txt| 206 |----------|----------| 207 |Purchase at store: 1 ... ORDER W080520|Purchase at store: 3 ... ORDER W598471| 208 |Purchase at store: 1 ... ORDER W080521|Purchase at store: 5 ... ORDER W080528| 209 |Purchase at store: 2 ... ORDER W078929 |Purchase at store: 5 ... ORDER W080231| 210 211 > Want to take this example to the next level? Practice using joins AND [groups](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/group/). You can create a 2-step pipeline that will group Returns and Purchases by storeID then join the output repo with Stores to aggregate by location. 212 213 ## 5. Example 2 : Outer Join pipeline creation 214 > You specify an [outer join](#case-1-outer-join-on-the-returns-repo-only) by adding an "outer_join" boolean property to an input repo in the `join` section of your pipeline spec. 215 216 ***Goal*** 217 218 We are going to list all returns by zip code and understand the datums created by our `outer join` when setting the `"outer_join": true` on: 219 220 1. `returns` only 221 2. `stores` only 222 3. both our repos `stores` and `returns` 223 224 These 3 cases will create 3 different sets of datums that we will explain further. A quick overview of our pipeline: 225 226 1. **Pipeline input repositories**: The `stores` and `returns` repos are joined (See 3 cases of outer join above) on the STOREID of their file name. 227 | Returns | Stores | 228 |-----------|--------| 229 ||| 230 2. **Pipeline**: Our [outer_join.json](./outer_join.json) pipeline will first create input datums following its `glob` pattern, `"join_on": "$1"`, and `"outer_join"` values, then execute some Python code that processes each store and matched return (if any). 231 232 3. **Pipeline output repository**: The output repo `outer_join` will contain a list of text files named after the stores' zip code. The detail of each will depend on where the 'outer join' was put. 233 234 We have listed all the possible outcomes in the following cheat sheet. Each particular case will be further explained: 235  236 237 238 ### ***Case 1*** Outer join on the Returns repo only 239 1. In the `examples/joins` directory, edit the pipeline's specification `outer_join.json` and set `"outer_join" :true` on the repo `returns`: 240 ```shell 241 "input": { 242 "join": [ 243 { 244 "pfs": { 245 "repo": "stores", 246 "branch": "master", 247 "glob": "/STOREID(*).txt", 248 "join_on": "$1", 249 "outer_join": false 250 } 251 }, 252 { 253 "pfs": { 254 "repo": "returns", 255 "branch": "master", 256 "glob": "/*_STOREID(*).txt", 257 "join_on": "$1", 258 "outer_join": true 259 } 260 } 261 ] 262 }, 263 ``` 264 1. Before we create our `outer join` pipeline, let's preview the datums that will be created. In the `examples/joins` directory, run: 265 266 ```shell 267 $ pachctl list datum -f outer_join.json 268 ``` 269  270 271 272 What you are noticing is that all your 4 returns are showing in a datum. 273 > Note that one return was made in a store **not** listed in our repo (STOREID0). There was no match on any STOREID for this specific return file, however; **a datum was created**, containing only the return file. By setting the returns repo as `"outer_join":true`, you are requesting to see ALL of the repo's files reflected in datums, whether there is a match or not. 274 1. In the `examples/joins` directory, let's now create our pipeline by running: 275 ```shell 276 $ pachctl create pipeline -f outer_join.json 277 ``` 278 1. Check the pipeline's status: 279 ```shell 280 $ pachctl list pipeline 281 ``` 282 283 284 285 1. Have a look at our output repo, now that our code has aggregated those datums per zip code: 286 287 Notice that an UNKNOWN.txt file shows up listing the return without matching store. 288 289 ```shell 290 $ pachctl list file outer_join@master 291 ``` 292  293 294 1. Visual confirmation: run the following command and validate that your files contain the returns that you are expecting: 295 ```shell 296 $ pachctl get file outer_join@master:/02108.txt 297 ``` 298 The following table lists the expected results for this scenario: 299 300 |/02108.txt|/90210.txt|/UNKNOWN.txt| 301 |----------|----------|------------| 302 |Return at store: 1 ... ORDER W080520|Return at store: 5 ... ORDER W080528|This return store does not exist ... ORDER W261452 | 303 | | Return at store: 5 ... ORDER W080231| | 304 | | | | 305 306 ### ***Case 2*** Outer join on the Stores repo only 307 1. In the `examples/joins` directory, set `"outer_join" :true` on the `stores` repo in `outer_join.json`: 308 309 ```shell 310 "input": { 311 "join": [ 312 { 313 "pfs": { 314 "repo": "stores", 315 "branch": "master", 316 "glob": "/STOREID(*).txt", 317 "join_on": "$1", 318 "outer_join": true 319 } 320 }, 321 { 322 "pfs": { 323 "repo": "returns", 324 "branch": "master", 325 "glob": "/*_STOREID(*).txt", 326 "join_on": "$1", 327 "outer_join": false 328 } 329 } 330 ] 331 }, 332 ``` 333 1. Preview all the datums again: 334 335 In the `examples/joins` directory, run: 336 ```shell 337 $ pachctl list datum -f outer_join.json 338 ``` 339  340 341 342 >Note that **all** of the stores files are showing up in the datums: some are associated with a matched return, some are not, yet, **a datum is created**. Additionally, the return made to STOREID0 did **not** generate a datum since there is no STOREID0.txt in the Stores repo. 343 344 1. Update your pipeline. In the `examples/joins` directory, run: 345 346 ```shell 347 $ pachctl update pipeline -f outer_join.json --reprocess 348 ``` 349 350 1. Check the output repository of your pipeline and notice that a 94107.txt file now shows up (STOREID4 has a different zip code 94107). 351 352 ```shell 353 $ pachctl list file outer_join@master 354 ``` 355  356 357 1. Visualize their content: 358 359 ```shell 360 $ pachctl get file outer_join@master:/02108.txt 361 ``` 362 Expected results for this scenario: 363 364 |/02108.txt|/90210.txt|/94107.txt| 365 |----------|----------|------------| 366 |Return at store: 2 ... NONE|Return at store: 5 ... ORDER W080231|Return at store: 4 ... NONE| 367 |Return at store: 1 ... ORDER W080520| Return at store: 3 ... NONE| | 368 | |Return at store: 5 ... ORDER W080528| | 369 370 ### ***Case 3*** Outer join on both the Stores and Returns repos 371 1. Set `"outer_join" :true` on both repos `returns` and `stores` in `outer_join.json`. 372 1. Preview your datums then create/update your pipeline: 373 374  375 376 377 > Note that in this last case, the datums contain all of the returns (with or without a matching store) **and** all of the stores (with or without a matching return). 378 379 380 1. Check the output repository of your pipeline and notice that all the zipcode.txt files now show up, including the UNKNOWN.txt. 381 382  383 384 1. Expected results for this scenario: 385 386 |/02108.txt|/90210.txt|/94107.txt|UNKNOWN.txt| 387 |----------|----------|------------|-----------| 388 |Return at store: 2 ... NONE|Return at store: 5 ... ORDER W080231|Return at store: 4 ... NONE| This return store does not exist ... ORDER W261452| 389 |Return at store: 1 ... ORDER W080520| Return at store: 3 ... NONE| | | 390 | |Return at store: 5 ... ORDER W080528| | | 391 392 393 > **Side note**: We could also aggregate the files by STOREID using a [group input](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/group/). In that case, the datums produced would be bigger. Each datum would contain one STOREIDx.txt file and ALL of the matching files for this given storeID in the grouped repos. The choice between group and join is an architectural choice to optimize your pipelines' parallelization and processing time. 394 395 We encourage you to run our [group example](https://github.com/pachyderm/pachyderm/tree/1.13.x/examples/group) to understand more of the differences between join and group. Look for the `retail_group.json` pipeline. 396 397 You can also follow those quick steps: 398 ```shell 399 $ cd ../group 400 $ pachctl list datum -f retail_group.json 401 ``` 402 403 In this last pipeline, we have grouped `purchases`, `returns`, and `stores` repos on STOREID. Look at the datums created. 404 405  406 407