# Glob Pattern

Defining how your data is spread among workers is one of
the most important aspects of distributed computation and is
the fundamental idea behind concepts such as Map and Reduce.

Instead of confining users to fixed data-distribution patterns,
such as Map, which splits everything as much as possible, and
Reduce, which groups all the data, Pachyderm
uses glob patterns to let you define how the input data
is distributed.

You can configure a glob pattern for each PFS input in
the input field of a pipeline specification. Pachyderm detects
this parameter and divides the input data into
individual *datums*.

You can think of each input repository as a filesystem where
the glob pattern is applied to the root of the
filesystem. The files and directories that match the
glob pattern are considered datums. Pachyderm's
concept of glob patterns is similar to Unix glob patterns.
For example, the `ls *.md` command matches all files with the
`.md` file extension.

In Pachyderm, `/` and `*` are the most
commonly used glob characters.

The following are examples of glob patterns that you can define:

* `/`: Pachyderm denotes the whole repository as a
  single datum and sends all of the input data to a
  single worker node to be processed together.
* `/*`: Pachyderm defines each top-level filesystem
  object, that is, a file or a directory, in the input
  repo as a separate datum. For example,
  if you have a repository with ten files in it and no
  directory structure, Pachyderm identifies each file as a
  single datum and processes them independently.
* `/*/*`: Pachyderm processes each filesystem object
  in each subdirectory as a separate datum.
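
Because these patterns behave much like ordinary Unix globs applied at the repository root, you can get a feel for them locally. The following is a minimal sketch that uses plain shell globbing on a scratch directory, not Pachyderm itself. The directory layout and the `count_matches` helper are made up for illustration.

```shell
#!/bin/sh
# Illustration only: ordinary shell globbing on a local scratch directory.
# The layout (a/, b/, *.txt) is hypothetical and stands in for an input repo.
set -eu

repo="$(mktemp -d)"
mkdir -p "$repo/a" "$repo/b"
touch "$repo/a/1.txt" "$repo/a/2.txt" "$repo/b/3.txt"

# Print how many paths the shell expanded a pattern into.
count_matches() {
  echo "$#"
}

whole_repo=$(count_matches "$repo"/)       # like `/`: the root as one object
top_level=$(count_matches "$repo"/*)       # like `/*`: a and b
second_level=$(count_matches "$repo"/*/*)  # like `/*/*`: the three files

echo "/    -> $whole_repo object"
echo "/*   -> $top_level objects"
echo "/*/* -> $second_level objects"

rm -rf "$repo"
```

Each matched object corresponds to what Pachyderm would treat as one datum for the equivalent glob pattern: one for `/`, two for `/*`, and three for `/*/*` in this layout.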

<!-- Add the ohmyglob examples here-->

If you have more than one input repo in your pipeline,
you can define a different glob pattern for each input
repo. You can combine the datums from each input repo
by using either the `cross` or `union` operator to
create the final datums that your code processes.
For more information, see [Cross and Union](cross-union.md).

## Example of Defining Datums

For example, suppose that your input repository has the
following directory structure:

!!! example
    ```shell
    /California
       /San-Francisco.json
       /Los-Angeles.json
       ...
    /Colorado
       /Denver.json
       /Boulder.json
       ...
    ...
    ```

Each top-level directory represents a US
state with a `json` file for each city in that state.

If you set the glob pattern to `/`, every time
you change anything in any of the
files and directories or add a new file to the
repository, Pachyderm processes the contents
of the whole repository from scratch as a single datum.
For example, if you add `Sacramento.json` to the
`California/` directory, Pachyderm processes all files
and folders in the repo as a single datum.

If you set `/*` as the glob pattern, Pachyderm processes
the data for each state individually. It
defines one datum per state, which means that all the cities for
a given state are processed together by a single worker, but each
state is processed independently. For example, if you add a new file
`Sacramento.json` to the `California/` directory, Pachyderm
processes the `California/` datum only.

If you set `/*/*`, Pachyderm treats each city as a separate
datum that is processed independently of the others. For example, if you add
the `Sacramento.json` file, Pachyderm processes the
`Sacramento.json` file only.

Glob patterns also let you take only a particular directory or subset of
directories as an input instead of the whole repo.
For example,
you can set `/California/*` to process only the data for the state of
California. Therefore, if you add a new city in the `Colorado/` directory,
Pachyderm ignores this change and does not start the pipeline.
However, if you add `Sacramento.json` to the `California/` directory,
Pachyderm processes the new `/California/Sacramento.json` datum.

## Test a Glob Pattern

You can use the `pachctl glob file` command to preview which filesystem
objects a pipeline defines as datums. This command helps
you test various glob patterns before you use them in a pipeline.

* If you set the `glob` property to `/`, Pachyderm detects all
  top-level filesystem objects in the `train` repository as one
  datum:

    !!! example
        ```shell
        pachctl glob file train@master:/
        ```

        **System Response:**

        ```shell
        NAME TYPE SIZE
        /    dir  15.11KiB
        ```

* If you set the `glob` property to `/*`, Pachyderm detects each
  top-level filesystem object in the `train` repository as a separate
  datum:

    !!! example
        ```shell
        pachctl glob file "train@master:/*"
        ```

        **System Response:**

        ```shell
        NAME                   TYPE SIZE
        /IssueSummarization.py file 1.224KiB
        /requirements.txt      file 74B
        /seq2seq_utils.py      file 13.81KiB
        ```
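
To compare several candidate patterns in one pass, you can wrap the `pachctl glob file` command shown above in a small loop. The following is a sketch under assumptions: it presumes that `pachctl` is installed and connected to a cluster that contains the `train` repository, and the `file_arg` helper is a hypothetical convenience, not part of Pachyderm. If `pachctl` is not on the `PATH`, the script only prints the commands that it would run.

```shell
#!/bin/sh
# Preview several glob patterns against one repo and branch.
# Assumption: the `train` repo and `master` branch from the examples exist.
set -u

repo="train"
branch="master"

# Hypothetical helper: build the repo@branch:pattern argument.
file_arg() {
  printf '%s@%s:%s' "$repo" "$branch" "$1"
}

# Patterns are quoted so that the local shell does not expand `*` itself.
for pattern in "/" "/*" "/*/*"; do
  echo "== glob: $pattern"
  if command -v pachctl >/dev/null 2>&1; then
    pachctl glob file "$(file_arg "$pattern")" || echo "(pachctl returned an error)"
  else
    echo "pachctl not found; would run: pachctl glob file '$(file_arg "$pattern")'"
  fi
done
```

Each row in the output for a given pattern corresponds to one datum, so the row count is a quick way to check how finely a pattern splits the input.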