>![pach_logo](../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Group Pipelines
>![pach_logo](./img/pach_logo.svg) The group functionality is available in version **1.12 and higher**.


## Intro
You configure a group in the [pipeline specification](https://docs.pachyderm.com/1.13.x/reference/pipeline_spec/) file by wrapping the one or more PFS repositories you want to aggregate in a `group` input. For each input repository in the group, you then specify a `group_by` that references the capture group of that repository's glob pattern. Files whose paths capture the same value are grouped together.
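
As a rough illustration of that shape, the sketch below writes a minimal spec with a one-repository `group` input to a local file. The pipeline name, glob pattern, image, and transform command are assumptions chosen for illustration (the `labresults` repo and its `PATID` naming come from Example 1 below). They are not the contents of the spec files shipped with this example.

```shell
# Hypothetical spec: name, glob, image, and command are illustrative assumptions.
# The quoted 'EOF' keeps the shell from expanding "$1" inside the JSON.
cat > group_sketch.json <<'EOF'
{
  "pipeline": { "name": "group_sketch" },
  "input": {
    "group": [
      {
        "pfs": {
          "repo": "labresults",
          "glob": "/*-PATID(*)-*.txt",
          "group_by": "$1"
        }
      }
    ]
  },
  "transform": {
    "image": "ubuntu:20.04",
    "cmd": ["bash"],
    "stdin": ["cp /pfs/labresults/* /pfs/out/"]
  }
}
EOF
```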


- Our first example walks you through a simple use of group applied to the files of a single repository.
- Our second example showcases a more complex group setting where information is grouped across 3 repositories.

>![pach_logo](./img/pach_logo.svg) Remember, in Pachyderm, the group operates at the file-path level, **not** on the content of the files themselves. Therefore, the structure of your directories and file naming conventions are key elements when implementing your use cases in Pachyderm.


## Getting ready
***Key concepts***
- [Group](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/group/) pipelines - execute your code on files that match a specific naming pattern in your group repo(s).
- [glob patterns](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/glob-pattern/) - for "RegEx-like" string matching on file paths and names.

You might also want to brush up on your [datum](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/relationship-between-datums/) knowledge.

***Prerequisite***
- A workspace on [Pachyderm Hub](https://docs.pachyderm.com/1.13.x/pachhub/pachhub_getting_started/) (recommended) or Pachyderm running [locally](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/).
- [pachctl command-line tool](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/#install-pachctl) installed, and your context created (i.e., you are logged in)

***Getting started***
- Clone this repo.
- Make sure Pachyderm is running. You should be able to connect to your Pachyderm cluster via the `pachctl` CLI.
Run a quick check:
```shell
$ pachctl version

COMPONENT           VERSION
pachctl             1.12.0
pachd               1.12.0
```
Ideally, your pachctl and pachd versions should match. At a minimum, always use the same major and minor versions of pachctl and pachd.
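
The "same major and minor" rule can be checked with a small helper like the one below. This is only a sketch: the two function names are hypothetical, and the version strings are typed in by hand rather than read from `pachctl version`.

```shell
# Print the "major.minor" part of a version string such as 1.12.0.
major_minor() {
  echo "$1" | cut -d. -f1,2
}

# Succeed when two versions share the same major and minor numbers.
same_major_minor() {
  [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Example values only; substitute the versions reported by your own cluster.
if same_major_minor "1.12.0" "1.12.3"; then
  echo "pachctl and pachd are compatible"
else
  echo "pachctl and pachd differ in major or minor version"
fi
```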

## Example 1 - Simple group-by pipelines
***Data structure and naming convention***

Our first example comes from a simple healthcare use case:

* A patient gets test results, each of which can come from a different lab. Each of our files contains the test results from a particular lab for a given patient.

Let's have a look at the data structure and naming convention of our first example:
* Repo: `labresults` - Our file names are made of the following "-" separated fields:

1. **T** + {Time stamp}
2. Type of test (Here **LIPID** for all our files)
3. **PATID** + {Patient identifier}
4. **CLIA** + {Lab/Hospital identifier}

```shell
    └── T1606707557-LIPID-PATID1-CLIA24D9871327.txt
    └── T1606331395-LIPID-PATID2-CLIA24D9871327.txt
    └── T1606707613-LIPID-PATID1-CLIA24D9871328.txt
    └──  ...
```
For information, here is what the content of those txt files looks like.

![labresultsample.png](./img/labresultsample.png)

***Goal***

We want to aggregate our lab results by patient or by hospital. We will create two separate pipelines out of the same repository, one for each case.

1. **Pipeline input repository**: `labresults`
    - Group by patient: files are grouped by PATID
    - Group by hospital: files are grouped by CLIA

1. **Pipeline**: Runs a set of commands that create a new directory named after each capture group and copy the files that match the given group into it. (See our 2 pipelines: [`lab_group_by_hospital.json`](./lab_group_by_hospital.json) and [`lab_group_by_patient.json`](./lab_group_by_patient.json)).

1. **Pipeline output repository**: `group_by_hospital` or `group_by_patient`, depending on which use case you run - Each output repo will contain a list of sub-directories named after each capture group and populated with a copy of their matching files.

In the diagram below, we have mapped out the data of our example and the expected results in each case.

![group_example1](./img/group_example1.png)
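
To get a feel for the expected layout before involving a cluster, the grouping can be imitated locally on the three file names listed earlier. This is a simulation under assumptions: the directory names `by_patient` and `by_hospital` and the empty placeholder files are made up for the sketch, and the real matching is done by Pachyderm on file paths through the glob pattern and `group_by` of each pipeline.

```shell
# Local simulation only; nothing here talks to Pachyderm.
workdir="$(mktemp -d)"
mkdir -p "$workdir/labresults" "$workdir/by_patient" "$workdir/by_hospital"

# Empty placeholder files named after the convention described above.
for f in T1606707557-LIPID-PATID1-CLIA24D9871327.txt \
         T1606331395-LIPID-PATID2-CLIA24D9871327.txt \
         T1606707613-LIPID-PATID1-CLIA24D9871328.txt; do
  : > "$workdir/labresults/$f"
done

for path in "$workdir"/labresults/*.txt; do
  name="$(basename "$path" .txt)"
  rest="${name#*-PATID}"      # e.g. 1-CLIA24D9871327
  patient="${rest%%-*}"       # e.g. 1
  lab="${name##*-CLIA}"       # e.g. 24D9871327
  # One directory per captured value, holding a copy of each matching file.
  mkdir -p "$workdir/by_patient/$patient" "$workdir/by_hospital/$lab"
  cp "$path" "$workdir/by_patient/$patient/"
  cp "$path" "$workdir/by_hospital/$lab/"
done

ls "$workdir/by_patient" "$workdir/by_hospital"
```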

***Example walkthrough***

1. Prepare your data:

    Let's first create our mock dataset and create/populate our repository.
    The setup target `setup-lab` of the `Makefile` in `pachyderm/examples/group` will create a directory (labresults) with our example data.
    In the `examples/group` directory, run:
    ```shell
    $ make setup-lab
    ```
1. Create/populate Pachyderm's repository and create your pipelines:

    In the `examples/group` directory, run:
    ```shell
    $ make deploy-lab
    ```
    or run:
    ```shell
    $ pachctl create repo labresults
    $ pachctl put file -r labresults@master:/ -f labresults
    $ pachctl create pipeline -f lab_group_by_hospital.json
    $ pachctl create pipeline -f lab_group_by_patient.json
    ```
    Have a quick look at your repository:
    ```shell
    $ pachctl list file labresults@master
    ```
    You should see the following files:

    ![labresults repo](./img/list_file_labresults_master.png)

    The commit in your input repository has triggered the execution of your pipelines (i.e., a job for each). Check the status of your pipelines:
    ```shell
    $ pachctl list pipeline
    ```
    Once they have run successfully, you should see something like this:

    ![pipelines](./img/lab_list_pipeline.png)

1. Let's have a look at our final product:

    Check the output repositories of your pipelines.
    ```shell
    $ pachctl list file group_by_patient@master
    $ pachctl list file group_by_hospital@master
    ```
    You should see your expected sub-directories.

    Check one test result for patient 1:
    ```shell
    $ pachctl get file group_by_patient@master:/1/T1606707613-LIPID-PATID1-CLIA24D9871328.txt
    ```

## Example 2 - Group pipeline on several repositories
***Data structure and naming convention***

The second example is derived from a simplified retail use case:
- Purchases and returns are made in given stores.
- Those stores have a given location (here, a zip code).
- There are 0 to many stores in a given zip code.

This dataset is shared with the "Join pipelines" example. Read about the [structure of the data and naming conventions](https://github.com/pachyderm/pachyderm/blob/1.13.x/examples/joins/README.md#2-data-structure-and-naming-convention).


***Goal***

For each store, we are going to calculate the net amount of all transactions (net_amount = order_total - return_total) and save it to a text file named after the store identifier.

1. **Pipeline input repositories**: `stores`, `returns`, `purchases`
    - Group by STOREID on all 3 repositories.

    Each match (i.e., all transactions - purchases and returns - having occurred at a given store, along with the store information itself) will generate one datum.
2. **Pipeline**: Runs Python code that reads the `purchases` and `returns` for each matching STOREID and writes the corresponding net_amount to a text file named after the STOREID. (See our pipeline: [`retail_group.json`](./retail_group.json))
3. **Pipeline output repository**: `group_store_revenue` - list of text files named after the STOREID.
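
The arithmetic for a single store can be sketched in a few lines of shell. Everything about the data here is hypothetical: the file names, the amounts, and the assumption that each file holds a single amount are invented to keep the sketch short. The pipeline in this example uses Python code and the real dataset instead.

```shell
# Hypothetical layout: files named <anything>_STOREID<id>.txt, one amount per file.
workdir="$(mktemp -d)"
mkdir -p "$workdir/purchases" "$workdir/returns" "$workdir/out"
echo "100.50" > "$workdir/purchases/ORDER1_STOREID5.txt"
echo "20.00"  > "$workdir/purchases/ORDER2_STOREID5.txt"
echo "30.25"  > "$workdir/returns/RETURN1_STOREID5.txt"

# Sum the amounts of every file in directory $1 that belongs to store $2.
store_total() {
  cat "$1"/*_STOREID"$2".txt 2>/dev/null | awk '{ s += $1 } END { printf "%.2f", s }'
}

store=5
order_total="$(store_total "$workdir/purchases" "$store")"
return_total="$(store_total "$workdir/returns" "$store")"

# net_amount = order_total - return_total, written to a file named after the store.
awk -v o="$order_total" -v r="$return_total" 'BEGIN { printf "%.2f\n", o - r }' \
  > "$workdir/out/$store.txt"
cat "$workdir/out/$store.txt"
```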

In the diagram below, we have mapped out our data.

![group_example2](./img/group_example2.png)

The following table lists the expected result (the "net amount") for each store.

![group_example2_digest](./img/group_example2_digest.png)

***Example walkthrough***

1. Let's create your new data:

    In the `examples/group` directory, run:
    ```shell
    $ make setup-retail
    ```
    You just created 3 directories: `stores`, `purchases`, and `returns`. Check them out.
    ```shell
    $ ls ./purchases
    ```

1. Create/populate Pachyderm's repositories and create your pipeline:

    In the `examples/group` directory, run:
    ```shell
    $ make deploy-retail
    ```
    or run:
    ```shell
    $ pachctl create repo stores
    $ pachctl create repo purchases
    $ pachctl create repo returns
    $ pachctl put file -r stores@master:/ -f stores
    $ pachctl put file -r purchases@master:/ -f purchases
    $ pachctl put file -r returns@master:/ -f returns
    $ pachctl create pipeline -f retail_group.json
    ```
    Check your repositories:
    ```shell
    $ pachctl list file stores@master
    $ pachctl list file purchases@master
    $ pachctl list file returns@master
    ```
    Here is the list of the files in the purchases repo:

    ![list_file_purchase_master](./img/list_file_purchase_master.png)

    and your pipeline:
    ```shell
    $ pachctl list pipeline
    ```
    ![retail_list_pipeline](./img/retail_list_pipeline.png)

1. Have a look at your final product:

    Once it has fully and successfully run, have a look at your output repository to confirm that it looks like what we expect.
    ```shell
    $ pachctl list file group_store_revenue@master
    ```
    Now for a visual confirmation of the content of each specific file:
    ```shell
    $ pachctl get file group_store_revenue@master:/5.txt
    ```
    It should look like this:

    ![get_file_group_store_revenue_master](./img/get_file_group_store_revenue_master.png)