github.com/pachyderm/pachyderm@v1.13.4/examples/joins/README.md (about)

     1  >![pach_logo](../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
     2  > - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
     3  > - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
     4  
     5  # Inner and Outer Join Inputs
     6  >![pach_logo](./img/pach_logo.svg) The outer join input is available in version **1.12 and higher**.
     7  
     8  - In our first example, we will create a pipeline whose input datums result from a simple `inner join` between 2 repos.
     9  - In our second example, we will showcase 3 variations of `outer join` pipelines between 2 repos, and outline how they differ from inner join and each other.
    10  
    11  At the end of this page, you will understand the fundamental difference between the datums produced by an inner join and those created by an outer join.
    12  
    13  ***Table of Contents***
    14  
    15  - [1. Getting ready](#1-getting-ready)
    16  - [2. Data structure and naming convention](#2-data-structure-and-naming-convention)
    17  - [3. Data preparation](#3-data-preparation)
    18  - [4. Example 1 : An Inner Join pipeline creation](#4-example-1--an-inner-join-pipeline-creation)
    19  - [5. Example 2 : Outer Join pipeline creation](#5-example-2--outer-join-pipeline-creation) 
    20      - [***Case 1*** Outer join on the Returns repo only](#case-1-outer-join-on-the-returns-repo-only)
    21      - [***Case 2*** Outer join on the Stores repo only](#case-2-outer-join-on-the-stores-repo-only)
    22      - [***Case 3*** Outer join on both the Stores and Returns repos](#case-3-outer-join-on-both-the-stores-and-returns-repos)
    23  
    24  ***Key concepts***
    25  
    26  For these examples, we recommend to be familiar with the following concepts:
    27  
    28  - [Join](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/join/) pipelines - execute your code on files that match a specific naming pattern in your joined repos.
    29  - [Glob patterns](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/glob-pattern/) - for "RegEx-like" string matching on file paths and names.
    30  
    31  Additionally, you might want to check [datum](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/relationship-between-datums/). 
    32  
    33  ## 1. Getting ready
    34  ***Prerequisite***
    35  - A workspace on [Pachyderm Hub](https://docs.pachyderm.com/1.13.x/pachhub/pachhub_getting_started/) (recommended) or Pachyderm running [locally](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/).
    36  - [pachctl command-line ](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/#install-pachctl) installed, and your context created (i.e. you are logged in)
    37  
    38  ***Getting started***
    39  - Clone this repo.
    40  - Make sure Pachyderm is running. You should be able to connect to your Pachyderm cluster via the `pachctl` CLI. 
    41  Run a quick:
    42  ```shell
    43  $ pachctl version
    44  
    45  COMPONENT           VERSION
    46  pachctl             1.12.0
    47  pachd               1.12.0
    48  ```
    49  Ideally, have your pachctl and pachd versions match. At a minimum, you should always use the same major & minor versions of pachctl and pachd. 
    50  ## 2. Data structure and naming convention
    51  
    52  >![pach_logo](./img/pach_logo.svg) Remember, in Pachyderm, the join operates at the file-path level, **not** the files' content. Therefore, the structure of your directories and file naming conventions are key elements when implementing your use cases in Pachyderm.
    53  
    54  We have derived our examples from simplified retail use cases: 
    55  - Purchases and Returns are made in given Stores. 
    56  - Those Stores have a given location (here, a zip code). 
    57  - There are 0 to many Stores for a given zip code.
    58  
    59  Let's take a look at our data structure and naming convention. 
    60  We will create 3 repos:
    61  * Repo: `stores` - Each store data are JSON files named after the storeID.
    62  ```shell
    63      └── STOREID1.txt
    64      └── STOREID2.txt
    65      └── STOREID3.txt
    66      └──  ...
    67  ```
    68  To further these examples beyond the datum creation, we have added some simple processing at the code level. We will use the `zipcode` property to aggregate our data. 
    69  
    70  This is what the content of one of those STOREIDx.txt file looks like.
    71  ```shell
    72      {
    73          "storeid":"4",
    74          "name":"mariposa st.",
    75          "address":{
    76              "zipcode":"94107",
    77              "country":"US"
    78          }
    79      }
    80  ```
    81  
    82  * Repo: `purchases` - Each purchase info is kept in a file named by concatenating the purchase's order number and its store ID.
    83  ```shell
    84      └── ORDERW080520_STOREID1.txt
    85      └── ORDERW080521_STOREID1.txt
    86      └── ORDERW078929_STOREID2.txt
    87      └── ...
    88  ```
    89  * Repo: `returns` - Same naming convention as purchases.
    90  ```shell
    91      └── ORDERW080528_STOREID5.txt
    92      └── ODERW080520_STOREID1.txt
    93      └── ...
    94  ```
    95  ## 3. Data preparation
    96  Let's start by creating our mock data and populate our repositories.
    97  This preparatory step will allow you to experiment with the inner and outer join examples in any order.
    98  
    99  ***Step 1*** - Prepare your data
   100  
   101  The setup target of the `Makefile` in `pachyderm/examples/joins` will create 3 directories (stores, purchases, and returns) containing our example data.
   102  In the `examples/joins` directory, run:
   103  ```shell
   104  $ make setup
   105  ```
   106  ***Step 2*** - Create and populate Pachyderm's repositories from the directories above.
   107  
   108  In the `examples/joins` directory, run:
   109  ```shell
   110  $ make create
   111  ```
   112  or run:
   113  ```shell
   114  $ pachctl create repo stores
   115  $ pachctl create repo purchases
   116  $ pachctl create repo returns
   117  $ pachctl put file -r stores@master:/ -f stores
   118  $ pachctl put file -r purchases@master:/ -f purchases
   119  $ pachctl put file -r returns@master:/ -f returns
   120  ```
   121  
   122  Have a look at the content of your repositories: 
   123  ```shell
   124  $ pachctl list file stores@master
   125  $ pachctl list file purchases@master
   126  $ pachctl list file returns@master
   127  ```
   128  You should see the following files in each repo:
   129  - Purchases:
   130  
   131  ![stores_repository](./img/pachctl_list_file_stores_master.png)
   132  
   133  - Stores:
   134  
   135  ![purchases_repository](./img/pachctl_list_file_purchases_master.png)
   136  
   137  - Returns:
   138  
   139  ![returns_repository](./img/pachctl_list_file_returns_master.png)
   140   
   141   You are ready to run any of the following examples.
   142  ## 4. Example 1 : An Inner Join pipeline creation 
   143  
   144  ***Goal***
   145  
   146  We are going to list all purchases by zip code and understand what datums are created by our `inner join` input in the process. 
   147  Quick overview of our pipeline:
   148  
   149  1. **Pipeline input repositories**: The `purchases` and `stores` repos are joined (`inner join`) on the STOREID of their file name.
   150  
   151      | Purchases | Stores |
   152      |-----------|--------|
   153      | ![stores_repository](./img/pachctl_list_file_stores_master.png)|![purchases_repository](./img/pachctl_list_file_purchases_master.png)|
   154  
   155  2. **Pipeline**: The [inner_join.json](./inner_join.json) pipeline will first create input datums following its `glob` pattern and `"join_on": "$1"` values, then execute some Python code reading the `zipcode` in their STOREIDx.txt file. 
   156  3. **Pipeline output repository**: The output repo `inner_join` will contain a list of text files named after the stores' zip codes. Each will list all purchases made in a specific zip code. 
   157  
   158      We will expect the 2 following files in the output repository of our `inner_join` pipeline: 
   159      ```shell
   160          └── 02108.txt
   161          └── 90210.txt
   162      ```
   163      Each should contain 3 purchases.
   164  
   165  ***Step 1*** - Before we create our `inner join` pipeline, we can preview what our datums will look like by running the following command in the `examples/joins` directory:
   166  
   167  ```shell
   168  $ pachctl list datum -f inner_join.json
   169  ```
   170  ![list_datum_outer_inner](./img/pachctl_list_datum_inner.png)
   171  
   172  Note that no purchase was made at the STOREID4; therefore **no datum was created**. See diagram below. This is a characteristic of an `inner join`. We only see the stores in which purchases were made.
   173  
   174  ![inner_join_list_datum](./img/inner_join_list_datum.png)
   175  
   176  ***Step 2*** - Let's now create our pipeline and see it in action. 
   177  
   178  Because unprocessed data are awaiting in our input repositories, the pipeline creation will automatically trigger a job.
   179  In the `examples/joins` directory, run:
   180  ```shell
   181  $ pachctl create pipeline -f inner_join.json
   182  ```
   183  
   184  Check your pipeline's status:
   185  ```shell
   186  $ pachctl list pipeline
   187  ```
   188  
   189  ***Step 3*** - Check the output repository of the pipeline, now that our code has aggregated those datums per zip code: 
   190  
   191  ```shell
   192  $ pachctl list file inner_join@master
   193  ```
   194  You should see our 2 expected text files. 
   195  
   196  ![output_repository](./img/pachctl_list_file_inner_join_master.png)
   197  
   198  For a visual confirmation of their content, run:
   199  
   200  ```shell
   201  $ pachctl get file inner_join@master:/02108.txt
   202  ```
   203  The following table lists the expected results for this scenario:
   204  
   205  |/02108.txt|/90210.txt|
   206  |----------|----------|
   207  |Purchase at store: 1  ... ORDER W080520|Purchase at store: 3 ... ORDER W598471|
   208  |Purchase at store: 1 ... ORDER W080521|Purchase at store: 5 ... ORDER W080528| 
   209  |Purchase at store: 2 ... ORDER W078929 |Purchase at store: 5 ... ORDER W080231| 
   210  
   211  >![pach_logo](./img/pach_logo.svg) Want to take this example to the next level? Practice using joins AND [groups](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/group/). You can create a 2-step pipeline that will group Returns and Purchases by storeID then join the output repo with Stores to aggregate by location. 
   212  
   213  ## 5. Example 2 : Outer Join pipeline creation 
   214  >![pach_logo](./img/pach_logo.svg) You specify an [outer join](#case-1-outer-join-on-the-returns-repo-only) by adding an "outer_join" boolean property to an input repo in the `join` section of your pipeline spec. 
   215  
   216  ***Goal***
   217  
   218  We are going to list all returns by zip code and understand the datums created by our  `outer join` when setting the `"outer_join": true` on:
   219  
   220  1. `returns` only
   221  2. `stores` only
   222  3. both our repos `stores` and `returns`
   223  
   224  These 3 cases will create 3 different sets of datums that we will explain further. A quick overview of our pipeline: 
   225  
   226  1. **Pipeline input repositories**: The `stores` and `returns` repos are joined (See 3 cases of outer join above) on the STOREID of their file name.
   227      | Returns | Stores |
   228      |-----------|--------|
   229      |![returns_repository](./img/pachctl_list_file_returns_master.png)|![purchases_repository](./img/pachctl_list_file_purchases_master.png)|
   230  2. **Pipeline**: Our [outer_join.json](./outer_join.json) pipeline will first create input datums following its `glob` pattern, `"join_on": "$1"`, and `"outer_join"` values, then execute some Python code that processes each store and matched return (if any).
   231  
   232  3. **Pipeline output repository**: The output repo `outer_join` will contain a list of text files named after the stores' zip code. The detail of each will depend on where the 'outer join' was put.
   233  
   234  We have listed all the possible outcomes in the following cheat sheet. Each particular case will be further explained:
   235  ![outer_join_list_datum](./img/outer_joins_list_datum.png)
   236  
   237  
   238  ### ***Case 1*** Outer join on the Returns repo only
   239  1. In the `examples/joins` directory, edit the pipeline's specification `outer_join.json` and set `"outer_join" :true` on the repo `returns`:
   240      ```shell
   241      "input": {
   242          "join": [
   243          {
   244              "pfs": {
   245              "repo": "stores",
   246              "branch": "master",
   247              "glob": "/STOREID(*).txt",
   248              "join_on": "$1",
   249              "outer_join": false
   250              }
   251          },
   252          {
   253          "pfs": {
   254              "repo": "returns",
   255              "branch": "master",
   256              "glob": "/*_STOREID(*).txt",
   257              "join_on": "$1",
   258              "outer_join": true
   259          }
   260          }
   261      ]
   262      },
   263      ```
   264  1. Before we create our `outer join` pipeline, let's preview the datums that will be created. In the `examples/joins` directory, run:
   265  
   266      ```shell
   267      $ pachctl list datum -f outer_join.json
   268      ```
   269      ![list_datum_outer_returns](./img/pachctl_list_datum_outer_returns.png)
   270   
   271   
   272      What you are noticing is that all your 4 returns are showing in a datum. 
   273      >![pach_logo](./img/pach_logo.svg) Note that one return was made in a store **not** listed in our repo (STOREID0). There was no match on any STOREID for this specific return file, however; **a datum was created**, containing only the return file. By setting the returns repo as `"outer_join":true`, you are requesting to see ALL of the repo's files reflected in datums, whether there is a match or not.
   274  1. In the `examples/joins` directory, let's now create our pipeline by running:
   275      ```shell
   276      $ pachctl create pipeline -f outer_join.json
   277      ```
   278  1. Check the pipeline's status:
   279      ```shell
   280      $ pachctl list pipeline
   281      ```
   282  
   283  
   284  
   285  1. Have a look at our output repo, now that our code has aggregated those datums per zip code:
   286  
   287      Notice that an UNKNOWN.txt file shows up listing the return without matching store.
   288  
   289      ```shell
   290      $ pachctl list file outer_join@master
   291      ```
   292      ![](./img/pachctl_list_file_outer_returns.png)
   293  
   294  1. Visual confirmation: run the following command and validate that your files contain the returns that you are expecting:
   295      ```shell
   296      $ pachctl get file outer_join@master:/02108.txt
   297      ```
   298      The following table lists the expected results for this scenario:
   299  
   300      |/02108.txt|/90210.txt|/UNKNOWN.txt|
   301      |----------|----------|------------|
   302      |Return at store: 1 ... ORDER W080520|Return at store: 5 ... ORDER W080528|This return store does not exist ... ORDER W261452 |
   303      | | Return at store: 5 ... ORDER W080231| |
   304      | | | |
   305  
   306  ### ***Case 2*** Outer join on the Stores repo only
   307  1. In the `examples/joins` directory, set `"outer_join" :true` on the `stores` repo in `outer_join.json`:
   308  
   309      ```shell
   310      "input": {
   311          "join": [
   312          {
   313              "pfs": {
   314              "repo": "stores",
   315              "branch": "master",
   316              "glob": "/STOREID(*).txt",
   317              "join_on": "$1",
   318              "outer_join": true
   319              }
   320          },
   321          {
   322          "pfs": {
   323              "repo": "returns",
   324              "branch": "master",
   325              "glob": "/*_STOREID(*).txt",
   326              "join_on": "$1",
   327              "outer_join": false
   328          }
   329          }
   330      ]
   331      },
   332      ```
   333  1. Preview all the datums again: 
   334  
   335      In the `examples/joins` directory, run:
   336      ```shell
   337      $ pachctl list datum -f outer_join.json
   338      ```
   339      ![list_datum_outer_returns](./img/pachctl_list_datum_outer_stores.png)
   340   
   341   
   342      >![pach_logo](./img/pach_logo.svg)Note that **all** of the stores files are showing up in the datums: some are associated with a matched return, some are not, yet, **a datum is created**. Additionally, the return made to STOREID0 did **not** generate a datum since there is no STOREID0.txt in the Stores repo.
   343  
   344  1. Update your pipeline. In the `examples/joins` directory, run:
   345  
   346      ```shell
   347      $ pachctl update pipeline -f outer_join.json --reprocess
   348      ```
   349  
   350  1. Check the output repository of your pipeline and notice that a 94107.txt file now shows up (STOREID4 has a different zip code 94107).
   351  
   352      ```shell
   353      $ pachctl list file outer_join@master
   354      ```
   355      ![](./img/pachctl_list_file_outer_stores.png)
   356  
   357  1. Visualize their content:
   358  
   359      ```shell
   360      $ pachctl get file outer_join@master:/02108.txt
   361      ```
   362      Expected results for this scenario:
   363  
   364      |/02108.txt|/90210.txt|/94107.txt|
   365      |----------|----------|------------|
   366      |Return at store: 2 ... NONE|Return at store: 5 ... ORDER W080231|Return at store: 4 ... NONE|
   367      |Return at store: 1 ... ORDER W080520| Return at store: 3 ... NONE| |
   368      | |Return at store: 5 ... ORDER W080528| |
   369  
   370  ### ***Case 3*** Outer join on both the Stores and Returns repos
   371  1. Set `"outer_join" :true` on both repos `returns` and `stores` in `outer_join.json`.
   372  1. Preview your datums then create/update your pipeline: 
   373  
   374      ![list_datum_full_outer](./img/pachctl_list_datum_full_outer.png)
   375   
   376   
   377      >![pach_logo](./img/pach_logo.svg) Note that in this last case, the datums contain all of the returns (with or without a matching store) **and** all of the stores (with or without a matching return).
   378  
   379  
   380  1. Check the output repository of your pipeline and notice that all the zipcode.txt files now show up, including the UNKNOWN.txt.
   381  
   382      ![](./img/pachctl_list_file_full_outer.png)
   383  
   384  1. Expected results for this scenario:
   385  
   386      |/02108.txt|/90210.txt|/94107.txt|UNKNOWN.txt|
   387      |----------|----------|------------|-----------|
   388      |Return at store: 2 ... NONE|Return at store: 5 ... ORDER W080231|Return at store: 4 ... NONE| This return store does not exist ... ORDER W261452|
   389      |Return at store: 1 ... ORDER W080520| Return at store: 3 ... NONE| | |
   390      | |Return at store: 5 ... ORDER W080528| | |
   391  
   392  
   393  >![pach_logo](./img/pach_logo.svg) **Side note**: We could also aggregate the files by STOREID using a [group input](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/group/). In that case, the datums produced would be bigger. Each datum would contain one STOREIDx.txt file and ALL of the matching files for this given storeID in the grouped repos. The choice between group and join is an architectural choice to optimize your pipelines' parallelization and processing time.
   394         
   395  We encourage you to run our [group example](https://github.com/pachyderm/pachyderm/tree/1.13.x/examples/group) to understand more of the differences between join and group. Look for the `retail_group.json` pipeline.
   396  
   397  You can also follow those quick steps:
   398  ```shell
   399  $ cd ../group
   400  $ pachctl list datum -f retail_group.json
   401  ```
   402  
   403  In this last pipeline, we have grouped `purchases`, `returns`, and `stores` repos on STOREID. Look at the datums created.
   404  
   405  ![pach_logo](./img/pachctl_list_datum_retail_group.png)
   406  
   407