# Group Input

A group is a special type of pipeline input that enables you to aggregate
files that reside in one or more Pachyderm repositories and match a
particular naming pattern. The group operator must be used in combination
with a glob pattern that reflects a specific naming convention.

By analogy, a Pachyderm group is similar to a database *group-by*,
but it matches on file paths only, not the content of the files.

Unlike the [join](../datum/join.md) datum, which always contains a single match (even partial) from each input repo,
**a group creates one datum for each set of matching files across its input repos**.
You can use group to aggregate data that is not adequately captured by your directory structure
or to control the granularity of your datums through file name-matching.

When you configure a group input, you must specify a glob pattern that
includes a capture group. The capture group defines the specific string in
the file path that is used to match files across the grouped repos.
Capture groups work analogously to the [regex capture group](https://www.regular-expressions.info/refcapture.html).
You define the capture group inside parentheses. Capture groups are numbered
from left to right and can also be nested within each other. Numbering for
nested capture groups is based on their opening parenthesis.

Below you can find a few examples of applying a glob pattern with a capture
group to a file path.
For example, if you have the following file path:

```shell
/foo/bar-123/ABC.txt
```

The following glob patterns in a group input create the
following capture groups:

| Glob pattern        | Capture groups                                      |
| ------------------- | --------------------------------------------------- |
| `/(*)`              | `foo`                                               |
| `/*/bar-(*)`        | `123`                                               |
| `/(*)/*/(??)*.txt`  | Capture group 1: `foo`, capture group 2: `AB`.      |
| `/*/(bar-(123))/*`  | Capture group 1: `bar-123`, capture group 2: `123`. |

Groups also require you to specify a [replacement group](https://www.regular-expressions.info/replacebackref.html)
in the `group_by` parameter to define which capture groups you want to
match on.

For example, `$1` indicates that you want Pachyderm to match based on
capture group `1`. Similarly, `$2` matches on capture group `2`.
`$1$2` means that files must match on both capture groups `1` and `2`.

If Pachyderm does not find any matching files, you get a zero-datum job.

You can test your glob pattern and capture groups by using the
`pachctl list datum -f <your_pipeline_spec.json>` command as described in
[List Datum](../../datum/glob-pattern/#test-your-datums).

## Example

For example, a repository `labresults` contains the lab results of patients.
The files at the root of the repository follow the naming convention shown below.
You want to group the lab results by patientID.

* `labresults` repo:

    ```shell
    ├── LIPID-patientID1-labID1.txt (1)
    ├── LIPID-patientID2-labID1.txt (2)
    ├── LIPID-patientID1-labID2.txt (3)
    ├── LIPID-patientID3-labID3.txt (4)
    ├── LIPID-patientID1-labID3.txt (5)
    ├── LIPID-patientID2-labID3.txt (6)
    ```

Pachyderm runs your code on the set of files that match
the glob pattern and capture groups.

The following example shows how you can use group to aggregate all the lab results of each patient.

```json
{
  "pipeline": {
    "name": "group"
  },
  "input": {
    "group": [
      {
        "pfs": {
          "repo": "labresults",
          "branch": "master",
          "glob": "/*-(*)-lab*.txt",
          "group_by": "$1"
        }
      }
    ]
  },
  "transform": {
    "cmd": [ "bash" ],
    "stdin": [ "wc -l /pfs/labresults/*" ]
  }
}
```

The glob pattern for the `labresults` repository, `/*-(*)-lab*.txt`, selects all files in the root directory whose names contain a patientID, and captures that patientID in capture group `1`.

The pipeline processes three datums for this job:

- All files containing `patientID1` (1, 3, 5) are grouped in one datum.
- A second datum contains files (2, 6) for `patientID2`.
- A third datum contains file (4) for `patientID3`.

The `pachctl list datum -f <your_pipeline_spec.json>` command is a useful tool to check your datums:

```code
ID FILES                                                                                                                                                                                                                                    STATUS TIME
-  labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID1-labID1.txt, labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID1-labID3.txt, labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID1-labID2.txt -      -
-  labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID2-labID1.txt, labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID2-labID3.txt                                                                              -      -
-  labresults@722665ed49474db0aab5cbe4d8a20ff8:/LIPID-patientID3-labID3.txt
```

To experiment further, see the full [group example](https://github.com/pachyderm/pachyderm/tree/master/examples/group).
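
To build intuition for how the glob pattern and `group_by` in the example above partition files into datums, the grouping can be approximated locally in Python. This is a minimal sketch, not Pachyderm's implementation: the helpers `glob_to_regex` and `group_files` are hypothetical names introduced here, and the sketch assumes a simplified glob syntax in which `*` matches any run of characters except `/`, `?` matches a single such character, and parentheses mark capture groups. Use `pachctl list datum` to check the datums that Pachyderm itself would create.

```python
import re
from collections import defaultdict

# File paths from the `labresults` example, in the order (1) to (6).
LAB_FILES = [
    "/LIPID-patientID1-labID1.txt",
    "/LIPID-patientID2-labID1.txt",
    "/LIPID-patientID1-labID2.txt",
    "/LIPID-patientID3-labID3.txt",
    "/LIPID-patientID1-labID3.txt",
    "/LIPID-patientID2-labID3.txt",
]


def glob_to_regex(glob):
    """Translate a simplified glob with capture groups into a regex.

    Assumption: only `*`, `?`, and parentheses are special; everything
    else is matched literally. Real Pachyderm globs support more syntax.
    """
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append("[^/]*")   # any run of characters within one path segment
        elif ch == "?":
            parts.append("[^/]")    # exactly one character within a path segment
        elif ch in "()":
            parts.append(ch)        # keep capture-group parentheses as-is
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def group_files(paths, glob, group_by):
    """Group paths by the value of `group_by` (for example "$1" or "$1$2")."""
    pattern = glob_to_regex(glob)
    datums = defaultdict(list)
    for path in paths:
        match = pattern.fullmatch(path)
        if match is None:
            continue  # files that do not match the glob are not part of any datum
        # Replace each $N in group_by with the text of capture group N.
        key = re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), group_by)
        datums[key].append(path)
    return dict(datums)


for key, files in group_files(LAB_FILES, "/*-(*)-lab*.txt", "$1").items():
    print(key, files)
```

Running the sketch on the six example files yields three groups, one per patientID, which mirrors the three datums listed above. A glob that matches no files returns an empty dictionary, the local counterpart of a zero-datum job.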