# Glob Pattern

Defining how your data is spread among workers is one of
the most important aspects of distributed computation and is
the fundamental idea behind concepts such as Map and Reduce.

Instead of confining users to fixed data-distribution patterns,
such as Map, which splits everything as much as possible, and
Reduce, which groups all the data, Pachyderm
uses glob patterns to let you define data distribution
at whatever granularity suits your workload.

You can configure a glob pattern for each PFS input in
the `input` field of a pipeline specification. Pachyderm reads
this parameter and divides the input data into
individual *datums*.
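
As a minimal sketch of where the glob pattern lives, the following writes a pipeline specification to a local file. The pipeline name `count-cities`, the repository `cities`, the image, and the command are hypothetical placeholders; the only point is that `glob` sits under `input.pfs`.

```shell
# Write a pipeline specification to a temporary directory.
# All names below (count-cities, cities, ubuntu:20.04) are illustrative assumptions.
spec_dir="$(mktemp -d)"
cat > "${spec_dir}/pipeline.json" <<'EOF'
{
  "pipeline": { "name": "count-cities" },
  "transform": {
    "image": "ubuntu:20.04",
    "cmd": ["sh", "-c", "ls /pfs/cities"]
  },
  "input": {
    "pfs": {
      "repo": "cities",
      "glob": "/*"
    }
  }
}
EOF

# Show the glob pattern that the spec declares for its PFS input.
grep '"glob"' "${spec_dir}/pipeline.json"

# To create the pipeline on a running cluster, you would then run:
#   pachctl create pipeline -f "${spec_dir}/pipeline.json"
```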

You can think of each input repository as a filesystem where
the glob pattern is applied to the root of the
filesystem. The files and directories that match the
glob pattern are considered datums. Pachyderm's
concept of glob patterns is similar to Unix glob patterns.
For example, the `ls *.md` command matches all files in the
current directory that have the `.md` file extension.

In Pachyderm, `/` and `*` are the most
commonly used glob characters.

The following are examples of glob patterns that you can define:

* `/`: Pachyderm treats the whole repository as a
  single datum and sends all of the input data to a
  single worker node to be processed together.
* `/*`: Pachyderm defines each top-level filesystem
  object, that is, a file or a directory, in the input
  repo as a separate datum. For example,
  if you have a repository with ten files in it and no
  directory structure, Pachyderm identifies each file as a
  single datum and processes them independently.
* `/*/*`: Pachyderm processes each filesystem object
  in each subdirectory as a separate datum.
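
Because Pachyderm glob patterns behave much like Unix globs, you can get a feel for the three patterns above on a local filesystem. The sketch below builds a throwaway directory tree (the `logs` and `data` directories and all file names are made up) and counts what the shell analogues of `/*` and `/*/*` match relative to that root. It is only an analogy and does not talk to Pachyderm.

```shell
# Build a small throwaway tree that stands in for an input repository.
root="$(mktemp -d)"
mkdir -p "${root}/logs" "${root}/data"
touch "${root}/README.md" "${root}/data/x.csv" \
      "${root}/logs/a.log" "${root}/logs/b.log" "${root}/logs/c.log"

cd "${root}" || exit 1

# Analogue of "/": the root itself, always a single object.
echo "/ matches 1 object"

# Analogue of "/*": every top-level file or directory.
set -- *
top_level=$#
echo "/* matches ${top_level} objects: $*"

# Analogue of "/*/*": every object one level below a top-level directory.
set -- */*
second_level=$#
echo "/*/* matches ${second_level} objects: $*"
```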

<!-- Add the ohmyglob examples here-->

If you have more than one input repo in your pipeline,
you can define a different glob pattern for each input
repo. You can combine the datums from each input repo
by using either the `cross` or `union` operator to
create the final datums that your code processes.
For more information, see [Cross and Union](cross-union.md).
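
To make the per-input globs concrete, here is a hedged sketch of an `input` that crosses two repositories with different glob patterns. The repository names `cities` and `parameters`, the pipeline name, and the transform are hypothetical. With these patterns, each top-level object in `cities` would be paired with the single whole-repository datum from `parameters`.

```shell
# Write a pipeline specification whose input crosses two PFS inputs.
# Repository, pipeline, and image names are illustrative assumptions.
cross_dir="$(mktemp -d)"
cat > "${cross_dir}/cross-pipeline.json" <<'EOF'
{
  "pipeline": { "name": "cities-by-parameters" },
  "transform": {
    "image": "ubuntu:20.04",
    "cmd": ["sh", "-c", "ls /pfs/cities /pfs/parameters"]
  },
  "input": {
    "cross": [
      { "pfs": { "repo": "cities", "glob": "/*" } },
      { "pfs": { "repo": "parameters", "glob": "/" } }
    ]
  }
}
EOF

# Each PFS input carries its own glob pattern; count the lines that declare one.
grep -c '"glob"' "${cross_dir}/cross-pipeline.json"
```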

## Example of Defining Datums

For example, suppose that your input repository has the following
directory structure:

!!! example
    ```shell
    /California
       /San-Francisco.json
       /Los-Angeles.json
       ...
    /Colorado
       /Denver.json
       /Boulder.json
       ...
    ...
    ```

Each top-level directory represents a US
state and contains a JSON file for each city in that state.

If you set the glob pattern to `/`, every time
you change anything in any of the
files and directories or add a new file to the
repository, Pachyderm processes the contents
of the whole repository from scratch as a single datum.
For example, if you add `Sacramento.json` to the
`California/` directory, Pachyderm processes all files
and folders in the repo as a single datum.

If you set the glob pattern to `/*`, Pachyderm processes
the data for each state individually. It
defines one datum per state, which means that all the cities for
a given state are processed together by a single worker, but each
state is processed independently. For example, if you add a new file
`Sacramento.json` to the `California/` directory, Pachyderm
processes the `California/` datum only.

If you set the glob pattern to `/*/*`, Pachyderm processes each city as a
separate datum, independently of the other cities. For example, if you add
the `Sacramento.json` file, Pachyderm processes the
`Sacramento.json` file only.

Glob patterns also let you take only a particular directory or subset of
directories as an input instead of the whole repo. For example,
you can set `/California/*` to process only the data for the state of
California. Therefore, if you add a new city in the `Colorado/` directory,
Pachyderm ignores this change and does not start the pipeline.
However, if you add `Sacramento.json` to the `California/` directory,
Pachyderm processes the new `Sacramento.json` file as a datum.
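
The effect of each pattern on the `Sacramento.json` example can be sketched with a small helper that maps a file path to the datum that contains it. The function name `datum_for` is hypothetical, and the logic is a simplification that covers only patterns made of `*` segments (`/`, `/*`, `/*/*`); it is not how Pachyderm itself resolves globs.

```shell
# datum_for PATH DEPTH
# Prints the datum that contains PATH when the glob has DEPTH "*" segments:
#   DEPTH=0 stands for "/", DEPTH=1 for "/*", DEPTH=2 for "/*/*".
datum_for() {
  path="$1"
  depth="$2"
  if [ "${depth}" -eq 0 ]; then
    echo "/"
  else
    # Keep the leading empty field plus the first DEPTH path components.
    echo "${path}" | cut -d/ -f1-"$((depth + 1))"
  fi
}

changed="/California/Sacramento.json"
datum_for "${changed}" 0   # glob "/"    prints /
datum_for "${changed}" 1   # glob "/*"   prints /California
datum_for "${changed}" 2   # glob "/*/*" prints /California/Sacramento.json
```

The coarser the pattern, the more existing data is reprocessed when one file changes.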

## Test a Glob Pattern

You can use the `pachctl glob file` command to preview which filesystem
objects a pipeline defines as datums. This command helps
you to test various glob patterns before you use them in a pipeline.

* If you set the `glob` property to `/`, Pachyderm treats all
top-level filesystem objects in the `train` repository as one
datum:

!!! example
    ```shell
    pachctl glob file train@master:/
    ```

    **System Response:**

    ```shell
    NAME TYPE SIZE
    /    dir  15.11KiB
    ```

* If you set the `glob` property to `/*`, Pachyderm detects each
top-level filesystem object in the `train` repository as a separate
datum:

!!! example
    ```shell
    # Quote the argument so that your local shell does not expand the asterisk.
    pachctl glob file "train@master:/*"
    ```

    **System Response:**

    ```shell
    NAME                   TYPE SIZE
    /IssueSummarization.py file 1.224KiB
    /requirements.txt      file 74B
    /seq2seq_utils.py      file 13.81KiB
    ```
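
When you compare several candidate patterns, a small wrapper around the command shown in this section can save typing. The function name `preview_globs` is an assumption for illustration; it only calls `pachctl glob file` once per pattern and labels each result.

```shell
# preview_globs REPO@BRANCH PATTERN...
# Runs `pachctl glob file` once for each pattern and labels the output.
preview_globs() {
  target="$1"
  shift
  for pattern in "$@"; do
    echo "== ${pattern}"
    # Quoting keeps the local shell from expanding the asterisks.
    pachctl glob file "${target}:${pattern}"
  done
}

# Example call against a running cluster (not executed here):
#   preview_globs train@master "/" "/*" "/*/*"
```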