github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.9.x/concepts/pipeline-concepts/datum/relationship-between-datums.md

     1  # Datum Processing
     2  
     3  This section helps you to understand the following
     4  concepts:
     5  
     6  * Pachyderm job stages
     7  * Processing of multiple datums
     8  * Incremental processing
     9  
    10  A datum is a Pachyderm abstraction that helps optimize
    11  pipeline processing. Because datums exist only as a pipeline
    12  processing property and are not filesystem objects, you cannot
    13  list or copy a datum as you would a file. Instead, a datum, as a unit
    14  of work, helps your pipelines run much faster by avoiding
    15  repeated processing of unchanged datums. For example, if you have
    16  multiple datums and only one of them was modified, Pachyderm processes
    17  only that datum and skips the others. This incremental
    18  behavior ensures efficient resource utilization.
    19  
    20  Each Pachyderm job can process multiple datums, each of which can
    21  consist of one or more files. While each input datum results in one output
    22  datum, the number of files in the output datum might differ from
    23  the number of files in the input datum.
    24  
    25  When you create a pipeline specification, one of the most important
    26  fields that you need to configure is `pfs/`, or PFS input.
    27  The PFS input field is where you define a data source from which
    28  the pipeline pulls data for further processing. The `glob`
    29  parameter defines the number of datums in the `pfs/` source
    30  repository. Thus, you can define everything in the source repository
    31  to be processed as a single datum or break it down to multiple
    32  datums. The way you break your source repository into datums
    33  directly affects incremental processing and your pipeline
    34  processing speed. You know your data best and can decide
    35  how to optimize your pipeline based on the repository structure
    36  and data generation workflows.
    37  For more information about glob patterns, see
    38  [Glob Pattern](glob-pattern.md).
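
As a rough illustration of how the glob pattern determines the number of datums, the following Python sketch models only patterns built from `*` components (`/`, `/*`, `/*/*`). The function name `split_into_datums` and the sample paths are hypothetical, and the grouping rule is a simplified assumption, not Pachyderm's actual glob matching.

```python
from collections import defaultdict


def split_into_datums(paths, glob):
    """Group file paths into datums for globs made only of '*' components.

    Simplified model (an assumption for illustration): "/" yields one datum
    with every file, "/*" one datum per top-level entry, "/*/*" one datum
    per second-level entry, and so on.
    """
    components = [part for part in glob.split("/") if part]
    if any(part != "*" for part in components):
        raise ValueError("this sketch only handles '/', '/*', '/*/*', ...")
    depth = len(components)
    datums = defaultdict(list)
    for path in sorted(paths):
        parts = path.strip("/").split("/")
        if len(parts) < depth:
            continue  # the path is shallower than the glob, so it matches nothing
        datums["/" + "/".join(parts[:depth])].append(path)
    return dict(datums)


# Hypothetical repository layout.
files = ["/2019/jan.csv", "/2019/feb.csv", "/2020/jan.csv"]
one_datum = split_into_datums(files, "/")    # whole repository as one datum
per_year = split_into_datums(files, "/*")    # one datum per top-level directory
per_file = split_into_datums(files, "/*/*")  # one datum per file
```

With this layout, a change to `/2020/jan.csv` would touch the only datum under `/`, one of two datums under `/*`, and one of three datums under `/*/*`, which is why the choice of glob affects how much work an incremental run can skip.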
    39  
    40  Regardless of how many datums you define and how many
    41  filesystem objects a datum has, Pachyderm always matches the
    42  number of input datums with the number of output datums. For
    43  example, if you have three input datums in `pfs/`, you will
    44  have three output datums in `pfs/out`. `pfs/out` is the
    45  output repository that Pachyderm creates automatically for
    46  each pipeline. You can add your changes in any order and
    47  submit them in one or multiple commits; the result of your
    48  pipeline processing remains the same.
    49  
    50  Another aspect of Pachyderm data processing is
    51  appending and overwriting files. By default, Pachyderm
    52  appends new data to the existing data. For example, if you
    53  have a file `foo` that is 100 KB in size in the repository `A`
    54  and add the same file `foo` to that repository again by
    55  using the `pachctl put file` command, Pachyderm does not
    56  overwrite that file but appends the new data to the file `foo` in the
    57  repo. Therefore, the size of the file `foo` doubles and
    58  becomes 200 KB. You can also overwrite files
    59  by using the `--overwrite` flag. The order of processing
    60  is not guaranteed: datums are processed in no particular order.
    61  For more information, see [File](../../data-concepts/file.md).
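
The append-versus-overwrite behavior described above can be modeled with a small Python sketch. The `put_file` function and the dictionary standing in for repository `A` are hypothetical; this is not the `pachctl` implementation, only an illustration of the resulting file sizes.

```python
def put_file(repo, path, data, overwrite=False):
    """Model of the default append behavior and of the --overwrite flag."""
    if overwrite or path not in repo:
        repo[path] = data               # new file, or explicit overwrite
    else:
        repo[path] = repo[path] + data  # default: append to existing data
    return repo


repo_a = {}                   # stands in for repository `A`
content = b"x" * 100 * 1024   # 100 KB of sample data

put_file(repo_a, "foo", content)
put_file(repo_a, "foo", content)                  # appended by default
size_after_append = len(repo_a["foo"])            # 200 KB

put_file(repo_a, "foo", content, overwrite=True)  # replaces the file
size_after_overwrite = len(repo_a["foo"])         # back to 100 KB
```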
    62  
    63  When new data comes in, a Pachyderm pipeline automatically
    64  starts a new job. Each Pachyderm job consists of the
    65  following stages:
    66  
    67  1. Creation of input datums. In this stage, Pachyderm breaks
    68  input files into datums according to the glob pattern setting
    69  in the pipeline specification.
    70  1. Transformation. The pipeline uses your code to process the
    71  datums.
    72  1. Creation of output datums. Pachyderm creates one or more files from the
    73  processed data and combines them into output datums.
    74  1. Merge. Pachyderm combines all files with the same file path
    75  by appending, overwriting, or deleting them to create the final commit.
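
The four stages above can be sketched in Python as follows. Everything here is a toy model under stated assumptions: `run_job`, `one_datum_per_file`, and `transform` are hypothetical names, files are plain strings, and the merge stage only appends. The sketch also visits datums in sorted order to keep the result reproducible, whereas Pachyderm does not guarantee any order.

```python
from collections import defaultdict


def run_job(input_files, split, transform):
    # Stage 1: break the input files into datums.
    input_datums = split(input_files)
    # Stages 2 and 3: run the user code on each datum; each input datum
    # yields exactly one output datum (a mapping of path -> content).
    output_datums = {name: transform(files) for name, files in input_datums.items()}
    # Stage 4: merge files that share a path (appending only in this sketch).
    commit = defaultdict(str)
    for name in sorted(output_datums):
        for path, content in output_datums[name].items():
            commit[path] += content
    return dict(commit)


def one_datum_per_file(files):
    """Hypothetical splitter: every file becomes its own datum."""
    return {path: {path: content} for path, content in files.items()}


def transform(datum_files):
    """Hypothetical user code: uppercase each file and log its size."""
    out = {}
    for path, content in datum_files.items():
        out[path] = content.upper()
        out["/summary"] = out.get("/summary", "") + f"{path}: {len(content)}\n"
    return out


commit = run_job({"/a.txt": "hello", "/b.txt": "hi"}, one_datum_per_file, transform)
```

Both output datums write to `/summary`, so the merge stage combines their parts into one file in the final commit.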
    76  
    77  If you think about this process in terms of filesystem objects and
    78  processing abstractions, the following transformation happens:
    79  
    80  !!! note ""
    81      **input files => input datums => output datums => output files**
    82  
    83  This section provides examples that help you understand such fundamental
    84  Pachyderm concepts as the datum, incremental processing, and phases of
    85  data processing.
    86  
    87  ## Example 1: One file in the input datum, one file in the output datum
    88  
    89  The simplest example of datum processing is when you have one file in
    90  the input datum that results in one file in the output datum.
    91  In the diagram below, you can see three input datums, each of which
    92  includes one file, that result in three output datums. Whether you
    93  submit all these datums in a single commit or in multiple commits, the final
    94  result remains the same: three datums, each of which has one file.
    95  
    96  In the diagram below, you can see the following datums:
    97  
    98   - `datum 1` has one file and results in one file in one output datum.
    99   - `datum 2` has one file and results in one file in one output datum.
   100   - `datum 3` has one file and results in one file in one output datum.
   101  
   102  ![One to one](../../../assets/images/d_datum_processing_one_to_one.svg)
   103  
   104  If you decide to overwrite a single line in the file in `datum 3` and
   105  add `datum 4`, Pachyderm sees the four datums and checks them for changes
   106  one by one. Pachyderm verifies that there are no changes in `datum 1` and
   107  `datum 2` and skips these datums. Pachyderm detects the changes in
   108  `datum 3` and the `--overwrite` flag and replaces `datum 3` with the
   109  new `datum 3'`. Because `datum 4` is a completely new datum, Pachyderm
   110  processes the whole datum as new. Although only two datums were
   111  processed, the output commit of this change contains all four files.
   112  
   113  ![One to one overwrite](../../../assets/images/d_datum_processing_one_to_one_overwrite.svg)
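
The skip-or-process decision in this example can be sketched with content hashes. This is a minimal model, assuming that a datum counts as unchanged when the hash of its file paths and contents is unchanged; the names `datum_hash` and `run_incremental` and the cache layout are hypothetical, and Pachyderm's real change detection may take more into account.

```python
import hashlib


def datum_hash(files):
    """Hash the paths and contents of the files that make up one datum."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode())
        digest.update(b"\0")
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def run_incremental(datums, transform, cache):
    """Process only datums whose hash is not in the cache yet."""
    outputs, processed = {}, []
    for name, files in datums.items():
        key = datum_hash(files)
        if key not in cache:
            cache[key] = transform(files)  # new or changed datum
            processed.append(name)
        outputs[name] = cache[key]         # skipped datums reuse old output
    return outputs, processed


def transform(files):
    """Hypothetical user code: one output file per input file."""
    return {path: content.upper() for path, content in files.items()}


cache = {}
first_commit = {
    "datum 1": {"/1.txt": b"a"},
    "datum 2": {"/2.txt": b"b"},
    "datum 3": {"/3.txt": b"c"},
}
_, first_run = run_incremental(first_commit, transform, cache)

# Overwrite the file in datum 3 and add datum 4.
second_commit = {**first_commit, "datum 3": {"/3.txt": b"c2"}, "datum 4": {"/4.txt": b"d"}}
outputs, second_run = run_incremental(second_commit, transform, cache)
```

In the second run only `datum 3` and `datum 4` are processed, yet `outputs` still holds results for all four datums.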
   114  
   115  ## Example 2: One file in the input datum, multiple files in the output datum
   116  
   117  Some pipelines ingest one file in one input datum and create multiple
   118  files in the output datum. The files in the output datums might need to
   119  be appended or overwritten with other files to create the final commit.
   120  
   121  If you apply changes to that datum, Pachyderm does not detect which
   122  particular part of the datum has changed and processes the whole datum.
   123  In the diagram below, you have the following datums:
   124  
   125  - `datum 1` has one file and results in files `1` and `3`.
   126  - `datum 2` has one file and results in files `2` and `3`.
   127  - `datum 3` has one file and results in files `2` and `1`.
   128  
   129  ![One to many](../../../assets/images/d_datum_processing_one_to_many.svg)
   130  
   131  Pachyderm processes all these datums independently, and in the end, it needs
   132  to create a commit by combining the results of processing these datums.
   133  A commit is a filesystem that has specific constraints: for example, it cannot
   134  contain duplicate files with the same file path. Pachyderm therefore merges results from
   135  different output datums with the same file path into single files. For
   136  example, `datum 1` produces `pfs/out/1` and `datum 3` produces `pfs/out/1`.
   137  Pachyderm merges these two files by appending one to the other
   138  in no particular order. Therefore, the file `1` in the final
   139  commit has parts from `datum 1` and `datum 3`.
   140  
   141  If you decide to create a new commit and overwrite the file in `datum 2`,
   142  Pachyderm detects three datums. Because `datum 1` and `datum 3` are
   143  unchanged, it skips processing these datums. Then, Pachyderm detects
   144  that something has changed in `datum 2`. Pachyderm is unaware of any
   145  details of the change. Therefore, it processes the whole `datum 2`
   146  and outputs the files `1`, `3`, and `4`. Then, Pachyderm merges
   147  these datums to create the following final result:
   148  
   149  ![One to many overwrite](../../../assets/images/d_datum_processing_one_to_many_overwrite.svg)
   150  
   151  
   152  In the diagram above, Pachyderm appends the file `1` from `datum 2`
   153  to the file `1` in the final commit, removes the part of the file `2` that
   154  came from `datum 2`, overwrites the old part from `datum 2` in the file `3`
   155  with a new version, and creates a new output file `4`.
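
The merge result for this example can be traced with a short Python sketch. It records only which datums contribute a part to each output file, using the datum-to-file mapping listed earlier; the `merge` function is a hypothetical illustration, and the sorted order is chosen for reproducibility, not because Pachyderm appends parts in that order.

```python
from collections import defaultdict


def merge(output_datums):
    """Map each output file to the datums that contribute a part to it."""
    commit = defaultdict(list)
    for datum in sorted(output_datums):
        for path in output_datums[datum]:
            commit[path].append(datum)
    return dict(commit)


# Output files per datum before the change.
before = {
    "datum 1": ["1", "3"],
    "datum 2": ["2", "3"],
    "datum 3": ["2", "1"],
}
first_commit = merge(before)

# After overwriting the file in datum 2, only datum 2 is reprocessed
# and now outputs the files 1, 3, and 4.
after = {**before, "datum 2": ["1", "3", "4"]}
second_commit = merge(after)
```

Comparing the two results shows the four effects described above: the file `1` gains a part from `datum 2`, the file `2` keeps only the part from `datum 3`, the file `3` still has a part from `datum 2` (now the new version), and the file `4` is new.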
   156  
   157  Similarly, if you have multiple files in your input datum, Pachyderm might
   158  write them into multiple files in output datums that are later merged into
   159  files with the same file path.