# Glob Pattern

Defining how your data is spread among workers is one of
the most important aspects of distributed computation and is
the fundamental idea behind concepts such as Map and Reduce.

Instead of confining users to fixed data-distribution patterns,
such as Map, which splits everything as much as possible, and
Reduce, which groups all the data, Pachyderm
uses glob patterns to let you define how
data is distributed.

You can configure a glob pattern for each PFS input in
the input field of a pipeline specification. Pachyderm detects
this parameter and divides the input data into
individual *datums*.
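
As a minimal sketch of where the glob pattern lives, the Python snippet below builds a pipeline specification as a dictionary and prints it as JSON. The pipeline name `edges`, the repo `images`, the image `pachyderm/opencv`, and the command are illustrative assumptions, not values Pachyderm requires. The point is only the placement of `glob` next to `repo` under `input.pfs`.

```python
import json

# Hypothetical pipeline specification; names, image, and command are placeholders.
pipeline_spec = {
    "pipeline": {"name": "edges"},
    "input": {
        "pfs": {
            "repo": "images",  # input repository (assumed name)
            "glob": "/*",      # one datum per top-level file or directory
        }
    },
    "transform": {
        "image": "pachyderm/opencv",      # assumed container image
        "cmd": ["python3", "/edges.py"],  # assumed entry point of the user code
    },
}

spec_json = json.dumps(pipeline_spec, indent=2)
print(spec_json)
```

Changing only the `glob` value changes how the same repository is split into datums; the rest of the specification stays the same.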

You can think of each input repository as a filesystem where
the glob pattern is applied to the root of the
filesystem. The files and directories that match the
glob pattern are considered datums. Pachyderm's
concept of glob patterns is similar to Unix glob patterns.
For example, the `ls *.md` command matches all files with the
`.md` file extension.

In Pachyderm, `/` and `*` are the most
commonly used glob characters.

The following are examples of glob patterns that you can define:

* `/`: Pachyderm treats the whole repository as a
  single datum and sends all of the input data to a
  single worker to be processed together.
* `/*`: Pachyderm defines each top-level filesystem
  object, that is, a file or a directory, in the input
  repo as a separate datum. For example,
  if you have a repository with ten files in it and no
  directory structure, Pachyderm identifies each file as a
  single datum and processes them independently.
* `/*/*`: Pachyderm processes each filesystem object
  in each subdirectory as a separate datum.

<!-- Add the ohmyglob examples here-->

If you have more than one input repo in your pipeline,
you can define a different glob pattern for each input
repo. You can combine the datums from each input repo
by using the `cross`, `union`, `join`, or `group` operator to
create the final datums that your code processes.
For more information, see [Cross and Union](./cross-union.md), [Join](./join.md), and [Group](./group.md).
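
The sketch below models, in plain Python, how `union` and `cross` combine the datums of two inputs. A union is treated here as the concatenation of both datum lists, and a cross as every pairing of one datum from each input. The repo names and file paths are made up for illustration, and the functions are a conceptual model rather than Pachyderm's implementation. `join` and `group` are left out because they depend on matching rules covered on their own pages.

```python
from itertools import product

# Hypothetical datums produced by the glob patterns of two input repos.
repo_a = ["A:/1.txt", "A:/2.txt"]
repo_b = ["B:/x.txt", "B:/y.txt", "B:/z.txt"]


def union(*inputs):
    """Model a union: every datum from every input is processed on its own."""
    return [(datum,) for datums in inputs for datum in datums]


def cross(*inputs):
    """Model a cross: one final datum per combination of input datums."""
    return list(product(*inputs))


union_datums = union(repo_a, repo_b)
cross_datums = cross(repo_a, repo_b)

print(len(union_datums))  # sum of the input sizes: 2 + 3
print(len(cross_datums))  # product of the input sizes: 2 * 3
```

In this model the union grows additively and the cross multiplicatively, which is worth keeping in mind when choosing glob patterns for large repositories.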

## Example of Defining Datums

Suppose you have the following directory structure:

!!! example
    ```shell
    /California
       /San-Francisco.json
       /Los-Angeles.json
       ...
    /Colorado
       /Denver.json
       /Boulder.json
       ...
    ...
    ```

Each top-level directory represents a US
state with a `json` file for each city in that state.

If you set the glob pattern to `/`, every time
you change anything in any of the
files and directories or add a new file to the
repository, Pachyderm processes the contents
of the whole repository from scratch as a single datum.
For example, if you add `Sacramento.json` to the
`California/` directory, Pachyderm processes all files
and folders in the repo as a single datum.

If you set `/*` as the glob pattern, Pachyderm processes
the data for each state individually. It
defines one datum per state, which means that all the cities for
a given state are processed together by a single worker, but each
state is processed independently. For example, if you add a new file
`Sacramento.json` to the `California/` directory, Pachyderm
processes the `California/` datum only.

If you set `/*/*`, Pachyderm processes each city as a separate
datum that can be handled independently of the others. For example, if you add
the `Sacramento.json` file, Pachyderm processes the
`Sacramento.json` file only.
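
To make the three cases concrete, the following sketch maps a file to the datum that contains it for the patterns `/`, `/*`, and `/*/*`, and reports which datum a newly added file falls into. It is a simplified model that handles only these three patterns by counting path levels. The helper name `datum_for` is hypothetical and is not a Pachyderm API.

```python
def datum_for(path: str, glob: str) -> str:
    """Return the datum that contains `path` under `glob`.

    Simplified model: supports only "/", "/*", and "/*/*", where the
    number of "*" segments is the directory depth that defines a datum.
    """
    depth = glob.count("*")
    if depth == 0:
        return "/"  # the whole repository is one datum
    parts = path.strip("/").split("/")
    return "/" + "/".join(parts[:depth])


new_file = "/California/Sacramento.json"

for pattern in ("/", "/*", "/*/*"):
    # Show which datum would be reprocessed after adding the new file.
    print(pattern, "->", datum_for(new_file, pattern))
```

The deeper the pattern, the smaller the datum that has to be reprocessed when a single file changes.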

Glob patterns also let you take only a particular directory or subset of
directories as an input instead of the whole repo. For example,
you can set `/California/*` to process only the data for the state of
California. Therefore, if you add a new city in the `Colorado/` directory,
Pachyderm ignores this change and does not start the pipeline.
However, if you add `Sacramento.json` to the `California/` directory,
Pachyderm processes the new `Sacramento.json` datum.
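
A rough way to reason about such subset patterns is to match each path against the pattern one segment at a time, so that `*` never crosses a `/`. The sketch below does this with Python's `fnmatch` module on the example tree. The function `matches` and the file `Aspen.json` are hypothetical, and the model approximates the behavior described here without covering every glob feature Pachyderm supports.

```python
from fnmatch import fnmatchcase


def matches(path: str, glob: str) -> bool:
    """Match `path` against `glob` segment by segment ("*" stays within one level)."""
    path_parts = path.strip("/").split("/")
    glob_parts = glob.strip("/").split("/")
    if len(path_parts) != len(glob_parts):
        return False
    return all(fnmatchcase(p, g) for p, g in zip(path_parts, glob_parts))


files = [
    "/California/San-Francisco.json",
    "/California/Los-Angeles.json",
    "/Colorado/Denver.json",
    "/Colorado/Boulder.json",
]

# Only the California files are selected as datums.
datums = [f for f in files if matches(f, "/California/*")]
print(datums)

# A new Colorado file does not match, so it would not create a datum.
print(matches("/Colorado/Aspen.json", "/California/*"))
# A new California file matches the pattern.
print(matches("/California/Sacramento.json", "/California/*"))
```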

## Test a Glob Pattern

You can use the `pachctl glob file` command to preview which filesystem
objects a pipeline defines as datums. This command helps
you test various glob patterns before you use them in a pipeline.

* If you set the `glob` property to `/`, Pachyderm detects all
top-level filesystem objects in the `train` repository as one
datum:

!!! example
    ```shell
    pachctl glob file train@master:/
    ```

    **System Response:**

    ```shell
    NAME TYPE SIZE
    /    dir  15.11KiB
    ```

* If you set the `glob` property to `/*`, Pachyderm detects each
top-level filesystem object in the `train` repository as a separate
datum:

!!! example
    ```shell
    # Quote the argument so that the shell does not try to expand the "*".
    pachctl glob file "train@master:/*"
    ```

    **System Response:**

    ```shell
    NAME                   TYPE SIZE
    /IssueSummarization.py file 1.224KiB
    /requirements.txt      file 74B
    /seq2seq_utils.py      file 13.81KiB
    ```
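
If you want to work with this listing in a script, a small parser is enough. The sketch below splits the sample response shown above into records. It assumes whitespace-separated `NAME`, `TYPE`, and `SIZE` columns exactly as in the sample and file names without spaces, which may not hold for every `pachctl` version or repository. The function name `parse_glob_listing` is hypothetical.

```python
sample_output = """\
NAME                   TYPE SIZE
/IssueSummarization.py file 1.224KiB
/requirements.txt      file 74B
/seq2seq_utils.py      file 13.81KiB
"""


def parse_glob_listing(text: str) -> list:
    """Turn the tabular listing into one dict per matched filesystem object."""
    lines = text.strip().splitlines()
    # Use the lowercased header names as dictionary keys.
    header = [column.lower() for column in lines[0].split()]
    return [dict(zip(header, line.split())) for line in lines[1:]]


datums = parse_glob_listing(sample_output)
for datum in datums:
    print(datum["name"], datum["type"], datum["size"])
```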

## Test your Datums

The granularity of your datums defines how your data will be distributed across the available workers allocated to a job.
Pachyderm allows you to check those datums:

  - for a pipeline currently being developed
  - for a past job

### Testing your glob pattern before creating a pipeline
You can use the `pachctl list datum -f <my_pipeline_spec.json>` command to preview the datums defined by a pipeline given its specification file.

!!! note "Note"
    The pipeline does not need to have been created for the command to return the list of datums. This "dry run" helps you adjust your glob pattern when creating your pipeline.

!!! example
    ```shell
    pachctl list datum -f edges.json
    ```
    **System Response:**

    ```shell
    ID FILES                                                STATUS TIME
    -  images@8c958d1523f3428a98ac97fbfc367bae:/g2QnNqa.jpg -      -
    -  images@8c958d1523f3428a98ac97fbfc367bae:/8MN9Kg0.jpg -      -
    -  images@8c958d1523f3428a98ac97fbfc367bae:/46Q8nDz.jpg -      -
    ```

### Running list datum on a past job
You can use the `pachctl list datum <job_id>` command to check the datums processed by a given job.

!!! example
    ```shell
    pachctl list datum d10979d9f9894610bb287fa5e0d734b5
    ```
    **System Response:**

    ```shell
    ID                                                                   FILES                                                STATUS TIME
    ebd35bb33c5f772f02d7dfc4735ad1dde8cc923474a1ee28a19b16b2990d29592e30 images@8c958d1523f3428a98ac97fbfc367bae:/g2QnNqa.jpg -      -
    ebd3ce3cdbab9b78cc58f40aa2019a5a6bce82d1f70441bd5d41a625b7769cce9bc4 images@8c958d1523f3428a98ac97fbfc367bae:/8MN9Kg0.jpg -      -
    ebd32cf84c73cfcc4237ac4afdfe6f27beee3cb039d38613421149122e1f9faff349 images@8c958d1523f3428a98ac97fbfc367bae:/46Q8nDz.jpg -      -
    ```

!!! note "Note"
    Now that the three datums have been processed, the ID column shows their IDs.