github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/concepts/pipeline-concepts/datum/group.md

# Group Input

A group is a special type of pipeline input that enables you to aggregate
files that reside in one or more Pachyderm repositories and match a
particular naming pattern. The group operator must be used in combination
with a glob pattern that reflects a specific naming convention.

By analogy, a Pachyderm group is similar to a database *group-by*,
but it matches on file paths only, not the content of the files.

Unlike a [join](../datum/join.md) datum, which always contains a single match (even partial) from each input repo,
**a group creates one datum for each set of matching files across its input repos**.
You can use a group to aggregate data that is not adequately captured by your directory structure
or to control the granularity of your datums through file name matching.

When you configure a group input, you must specify a glob pattern that
includes a capture group. The capture group defines the specific string in
the file path that is used to match files across the grouped repos.
Capture groups work analogously to a [regex capture group](https://www.regular-expressions.info/refcapture.html).
You define a capture group inside parentheses. Capture groups are numbered
from left to right and can also be nested within each other. Numbering for
nested capture groups is based on the position of their opening parenthesis.

Below are a few examples of applying a glob pattern with a capture
group to a file path. Suppose you have the following file path:

```shell
/foo/bar-123/ABC.txt
```

The following glob patterns in a group input create the
following capture groups:

| Glob pattern        | Capture groups           |
| ------------------- | ------------------------ |
| `/(*)`              | `foo`                    |
| `/*/bar-(*)`        | `123`                    |
| `/(*)/*/(??)*.txt`  | Capture group 1: `foo`, capture group 2: `AB`. |
| `/*/(bar-(123))/*`  | Capture group 1: `bar-123`, capture group 2: `123`. |

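
The capture groups in the table can be reproduced with a small Python model. This is only an illustrative sketch, not Pachyderm's implementation: the helper names `glob_to_regex` and `capture_groups` are hypothetical, the translation handles only `*`, `?`, and parentheses, and it assumes that a glob may match either the file itself or one of its parent directories.

```python
import re


def glob_to_regex(glob):
    """Translate a simplified glob (only *, ? and parentheses) into a regex."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append("[^/]*")      # any run of characters within one path segment
        elif ch == "?":
            parts.append("[^/]")       # exactly one character within a path segment
        elif ch in "()":
            parts.append(ch)           # capture groups carry over unchanged
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def capture_groups(glob, path):
    """Return the captures for the shallowest prefix of `path` that the glob matches."""
    regex = glob_to_regex(glob)
    components = path.strip("/").split("/")
    for depth in range(1, len(components) + 1):
        candidate = "/" + "/".join(components[:depth])
        match = regex.fullmatch(candidate)
        if match:
            return match.groups()
    return None


PATH = "/foo/bar-123/ABC.txt"
for glob in ["/(*)", "/*/bar-(*)", "/(*)/*/(??)*.txt", "/*/(bar-(123))/*"]:
    print(glob, capture_groups(glob, PATH))
```

Nested groups are numbered by their opening parenthesis because Python's `re` module follows the same convention, so the last pattern yields `bar-123` first and `123` second.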

Groups also require you to specify a [replacement group](https://www.regular-expressions.info/replacebackref.html)
in the `group_by` parameter to define which capture groups you want to
match on.

For example, `$1` indicates that you want Pachyderm to match based on
capture group `1`. Similarly, `$2` matches on capture group `2`.
`$1$2` means that files must match on both capture groups `1` and `2`.
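
One way to picture the `group_by` replacement is as a string template that is filled in with the captured values, with files that produce the same result landing in the same datum. The sketch below models that idea in Python. The function name `expand_group_by` is hypothetical, and treating the expanded string as the grouping key is an assumption made for illustration.

```python
import re


def expand_group_by(group_by, captures):
    """Replace each $N in `group_by` with the Nth capture (1-based)."""
    def substitute(match):
        index = int(match.group(1))
        return captures[index - 1]
    return re.sub(r"\$(\d+)", substitute, group_by)


# Captures taken from `/(*)/*/(??)*.txt` applied to `/foo/bar-123/ABC.txt`.
captures = ("foo", "AB")
for template in ["$1", "$2", "$1$2"]:
    print(template, "->", expand_group_by(template, captures))
```

With `$1$2`, two files share a key only when both captures agree, which is why the combined form is stricter than either `$1` or `$2` alone.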

If Pachyderm does not find any matching files, you get a zero-datum job.

You can test your glob pattern and capture groups by using the
`pachctl list datum -f <your_pipeline_spec.json>` command as described in
[List Datum](../../datum/glob-pattern/#test-your-datums).

## Example

Suppose a repository `labresults` contains the lab results of patients.
The files at the root of the repository follow the naming convention shown below, and you want to group the lab results by patient ID.

* `labresults` repo:

   ```shell
   ├── LIPID-patientID1-labID1.txt (1)
   ├── LIPID-patientID2-labID1.txt (2)
   ├── LIPID-patientID1-labID2.txt (3)
   ├── LIPID-patientID3-labID3.txt (4)
   ├── LIPID-patientID1-labID3.txt (5)
   ├── LIPID-patientID2-labID3.txt (6)
   ```

Pachyderm runs your code on each set of files that matches
the glob pattern and capture groups.

The following example shows how you can use a group to aggregate all the lab results of each patient.

```json
{
  "pipeline": {
    "name": "group"
  },
  "input": {
    "group": [
      {
        "pfs": {
          "repo": "labresults",
          "branch": "master",
          "glob": "/*-(*)-lab*.txt",
          "group_by": "$1"
        }
      }
    ]
  },
  "transform": {
    "cmd": [ "bash" ],
    "stdin": [ "wc -l /pfs/labresults/*" ]
  }
}
```

The glob pattern for the `labresults` repository, `/*-(*)-lab*.txt`, matches every file in the root directory and captures its patient ID.

The pipeline processes three datums for this job:

- All files containing `patientID1` (1, 3, 5) are grouped in one datum.
- A second datum contains the files for `patientID2` (2, 6).
- A third datum contains the file for `patientID3` (4).

The `pachctl list datum -f <your_pipeline_spec.json>` command is a useful tool to check your datums:

```shell
ID FILES                                                                                                                                                                                                                        STATUS TIME
-  labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID1-labID1.txt, labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID1-labID3.txt, labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID1-labID2.txt -      -
-  labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID2-labID1.txt, labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID2-labID3.txt                                                                           -      -
-  labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID3-labID3.txt
```
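The same grouping can be checked offline with a short Python sketch. It is a simplified model, not Pachyderm's implementation: `PATTERN` is a hand-written regex assumed to be equivalent to the glob `/*-(*)-lab*.txt`, and `group_files` is a hypothetical helper that keys each file on capture group 1, as `"group_by": "$1"` does.

```python
import re
from collections import defaultdict

# Hand-written regex assumed equivalent to the glob "/*-(*)-lab*.txt":
# each * matches any run of characters within a single path segment.
PATTERN = re.compile(r"/[^/]*-([^/]*)-lab[^/]*\.txt")

FILES = [
    "/LIPID-patientID1-labID1.txt",
    "/LIPID-patientID2-labID1.txt",
    "/LIPID-patientID1-labID2.txt",
    "/LIPID-patientID3-labID3.txt",
    "/LIPID-patientID1-labID3.txt",
    "/LIPID-patientID2-labID3.txt",
]


def group_files(paths):
    """Group paths by capture group 1; each resulting list models one datum."""
    datums = defaultdict(list)
    for path in paths:
        match = PATTERN.fullmatch(path)
        if match:  # files that do not match the glob are left out of every datum
            datums[match.group(1)].append(path)
    return dict(datums)


datums = group_files(FILES)
for patient, files in datums.items():
    print(patient, files)
```

The model produces one entry per patient ID, mirroring the three datums listed by `pachctl list datum`. The order of files within a datum may differ from the real output.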

To experiment further, see the full [group example](https://github.com/pachyderm/pachyderm/tree/master/examples/group).