github.com/pachyderm/pachyderm@v1.13.4/examples/group/README.md (about) 1 > INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches: 2 > - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples 3 > - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples 4 5 # Group Pipelines 6 > The group functionality is available in version **1.12 and higher**. 7 8 9 ## Intro 10 You configure a group in the [pipeline specification](https://docs.pachyderm.com/1.13.x/reference/pipeline_spec/) file by adding a `group` input around the one or many pfs repositories you want to aggregate together. At each input repo level included in your group, you then need to specify a `group_by` that will define the capture group from your glob pattern that you want to consider to group your files. 11 12 13 - Our first examples will walk you through a simple use of group applied to the files of a single repository. 14 - Our second example will showcase a more complex group setting where information is grouped accross 3 repositories. 15 16 > Remember, in Pachyderm, the group operates at the file-path level, **not** the content of the files themselves. Therefore, the structure of your directories and file naming conventions are key elements when implementing your use cases in Pachyderm. 17 18 19 ## Getting ready 20 ***Key concepts*** 21 - [Group](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/group/) pipelines - execute your code on files that match a specific naming pattern in your group repo(s). 22 - [glob patterns](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/glob-pattern/) - for "RegEx-like" string matching on file paths and names. 23 24 You might also want to brush up your [datum](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/relationship-between-datums/) knowledge. 25 26 ***Prerequisite*** 27 - A workspace on [Pachyderm Hub](https://docs.pachyderm.com/1.13.x/pachhub/pachhub_getting_started/) (recommended) or Pachyderm running [locally](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/). 28 - [pachctl command-line ](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/#install-pachctl) installed, and your context created (i.e., you are logged in) 29 30 ***Getting started*** 31 - Clone this repo. 32 - Make sure Pachyderm is running. You should be able to connect to your Pachyderm cluster via the `pachctl` CLI. 33 Run a quick: 34 ```shell 35 $ pachctl version 36 37 COMPONENT VERSION 38 pachctl 1.12.0 39 pachd 1.12.0 40 ``` 41 Ideally, have your pachctl and pachd versions match. At a minimum, you should always use the same major & minor versions of your pachctl and pachd. 42 43 ## Example 1 - Simple group-by pipelines 44 ***Data structure and naming convention*** 45 46 Our first example comes from a simple healthcare use case: 47 48 * A patient gets test results, each of which can come from a different lab. Each of our files contains the test results from a particular lab for a given patient. 49 50 Let's have a look at the data structure and naming convention of our first example: 51 * Repo: `labresults` - Our file names follow the following "-" separated pattern: 52 53 1. **T** + {Time stamp} 54 2. Type of test (Here **LIPID** for all our files) 55 3. **PATID** + {Patient identifier} 56 4. **CLIA** + {Lab/Hospital identifier} 57 58 ```shell 59 └── T1606707557-LIPID-PATID1-CLIA24D9871327.txt 60 └── T1606331395-LIPID-PATID2-CLIA24D9871327.txt 61 └── T1606707613-LIPID-PATID1-CLIA24D9871328.txt 62 └── ... 63 ``` 64 For information, here is what the content of those txt files looks like. 65 66  67 68 ***Goal*** 69 We want to aggregate our labresults by patient or by hospital. We will create two separate pipelines out of the same repository, one for each case. 70 71 1. **Pipeline input repository**: `labresults` 72 - Group by patient: the group will be done by PATID 73 - Group by hospital: the group will be done by CLIA 74 75 1. **Pipeline**: Executes a set of command lines creating a new directory named after each capture group and copying the files that match the given group. (See our 2 pipelines:[`lab_group_by_hospital.json`](./lab_group_by_hospital.json) and [`lab_group_by_patient.json`](./lab_group_by_patient.json)). 76 77 1. **Pipeline output repository**: `group_by_hospital`or `group_by_patient` depending on which use case you run - Each output repo will contain a list or sub-directories named after each capture group and populated with a copy of their matching files. 78 In the diagram below, we have mapped out the data of our example and the expected results in each case. 79  80 81 ***Example walkthrough*** 82 83 1. Prepare your data: 84 85 Let's first create our mock dataset and create/populate our repository. 86 The setup target `setup-lab` of the `Makefile` in `pachyderm/examples/group` will create a directory (labresults) with our example data. 87 In the `examples/group` directory, run: 88 ```shell 89 $ make setup-lab 90 ``` 91 1. Create/populate Pachyderm's repository and create your pipelines: 92 93 In the `examples/joins` directory, run: 94 ```shell 95 $ make deploy-lab 96 ``` 97 or run: 98 ```shell 99 $ pachctl create repo labresults 100 $ pachctl put file -r labresults@master:/ -f labresults 101 $ pachctl create pipeline -f lab_group_by_hospital.json 102 $ pachctl create pipeline -f lab_group_by_patient.json 103 ``` 104 Have a quick look at your repositories: 105 ```shell 106 $ pachctl list file labresults@master 107 ``` 108 You should see the following files: 109 110  111 112 The commit in your entry repository has triggered the execution of your pipeline (i.e., a job). Have a quick check at your pipeline's status: 113 ```shell 114 $ pachctl list pipeline 115 ``` 116 Once it has run successfully, you should see something like this: 117 118  119 120 1. Let's have a look at our final product: 121 122 Check the output repository of your pipeline. 123 ```shell 124 $ pachctl list file group_by_patient@master 125 $ pachctl list file group_by_hospital@master 126 ``` 127 You should see your expected sub repositories. 128 129 Check one testresult for patient 1: 130 ```shell 131 $ pachctl get file group_by_patient@master:/1/T1606707613-LIPID-PATID1-CLIA24D9871328.txt 132 ``` 133 134 ## Example 2 - Group pipeline on several repositories 135 ***Data structure and naming convention*** 136 137 The second example is derived from a simplified retail use case: 138 - Purchases and returns are made in given stores. 139 - Those stores have a given location (here, a zip code). 140 - There are 0 to many stores in a given zip code. 141 142 This dataset is shared with the "Join pipelines"' examples. Read about the [structure of the data and naming conventions](https://github.com/pachyderm/pachyderm/blob/1.13.x/examples/joins/README.md#2-data-structure-and-naming-convention). 143 144 145 ***Goal*** 146 For each store, we are going to calculate the net amount of all transactions (net_amount = order_total - return_total) and save it to a text file named after the store identifier. 147 148 1. **Pipeline input repositories**: `stores` , `returns`, `purchases` 149 - Group by STOREID on all 3 repositories. 150 151 Each match (i.e., all transactions - purchases and returns - having occured at a given store along with the store information itself) will generate one datum. 152 2. **Pipeline**: Executes a python code reading the `purchases` and `returns` for each matching STOREID and writing the corresponding net_amount to a text file named after the STOREID. (See our pipeline: [`retail_group.json`](./retail_group.json)) 153 3. **Pipeline output repository**: `group_store_revenue` - list of text files named after the STOREID. 154 155 In the diagram below, we have mapped out our data. 156 157  158 159 The following table lists the expected result (the "net amount") for each store. 160 161  162 163 ***Example walkthrough*** 164 165 1. Let's create your new data: 166 167 In the `examples/group` directory, run: 168 ```shell 169 $ make setup-retail 170 ``` 171 You just created 3 directories: stores, purchases, returns. Check them out. 172 ```shell 173 $ ls ./purchases 174 ``` 175 176 1. Create/populate Pachyderm’s repository and create your pipelines: 177 178 In the `examples/group` directory, run: 179 ```shell 180 $ make deploy-retail 181 ``` 182 or run: 183 ```shell 184 $ pachctl create repo stores 185 $ pachctl create repo purchases 186 $ pachctl create repo returns 187 $ pachctl put file -r stores@master:/ -f stores 188 $ pachctl put file -r purchases@master:/ -f purchases 189 $ pachctl put file -r returns@master:/ -f returns 190 $ pachctl create pipeline -f retail_group.json 191 ``` 192 check your repositories: 193 ```shell 194 $ pachctl list file stores@master 195 $ pachctl list file purchases@master 196 $ pachctl list file returns@master 197 ``` 198 Here is the list of the files in the purchases repo: 199 200  201 202 and your pipeline: 203 ```shell 204 $ pachctl list pipeline 205 ``` 206  207 208 1. Have a look at your final product: 209 210 Once it has fully and successfully run, have a look at your output repository to confirm that it looks like what we expect. 211 ```shell 212 $ pachctl list file group_store_revenue@master 213 ``` 214 Now for a visual confirmation of the content of each specific file: 215 ```shell 216 $ pachctl get file group_store_revenue@master:/5.txt 217 ``` 218 It should look like this: 219 220  221