# Datum Processing

This section helps you understand the following
concepts:

* Pachyderm job stages
* Processing of multiple datums
* Incremental processing
* Data persistence between datums

A datum is a Pachyderm abstraction that helps optimize
pipeline processing. Because datums exist only as a pipeline
processing property and are not filesystem objects, you cannot
list or copy a datum the way you list or copy a file. Instead, a datum
represents a unit of work and helps your pipelines run much faster by avoiding
repeated processing of unchanged datums. For example, if you have
multiple datums and only one datum was modified, Pachyderm processes
only that datum and skips the others. This incremental
behavior ensures efficient resource utilization.

Each Pachyderm job can process multiple datums, each of which can consist
of one or multiple files. While each input datum results in one output
datum, the number of files in the output datum might differ from
the number of files in the input datum.

When you create a pipeline specification, one of the most important
fields that you need to configure is `pfs/`, or PFS input.
The PFS input field is where you define the data source from which
the pipeline pulls data for further processing. The `glob`
parameter defines how Pachyderm breaks the data in the `pfs/` source
repository into datums. You can define everything in the source repository
to be processed as a single datum or break it down into multiple
datums. The way you break your source repository into datums
directly affects incremental processing and your pipeline
processing speed. You know your data best and can decide
how to optimize your pipeline based on the repository structure
and data generation workflows.
For more information about glob patterns, see
[Glob Pattern](glob-pattern.md).
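As a rough illustration of how a glob pattern determines the number of datums, the following Python sketch groups file paths by the leading path segments that a pattern matches. It is a simplified model and not Pachyderm's implementation: the `split_into_datums` function and the sample paths are hypothetical, and only simple per-segment patterns such as `/`, `/*`, and `/*/*` are assumed.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def split_into_datums(paths, glob):
    """Group file paths into datums keyed by the path prefix the glob matches.

    Simplified model: each pattern segment is matched against one path segment.
    """
    pattern = [segment for segment in glob.split("/") if segment]
    datums = defaultdict(list)
    for path in sorted(paths):
        parts = [segment for segment in path.split("/") if segment]
        if len(parts) < len(pattern):
            continue  # the path is shallower than the pattern, so it cannot match
        head = parts[:len(pattern)]
        if all(fnmatchcase(seg, pat) for seg, pat in zip(head, pattern)):
            datums["/" + "/".join(head)].append(path)
    return dict(datums)


# Hypothetical repository contents.
files = ["/a/1.csv", "/a/2.csv", "/b/1.csv"]
for glob in ("/", "/*", "/*/*"):
    # "/" yields 1 datum, "/*" yields 2 datums, "/*/*" yields 3 datums.
    print(glob, len(split_into_datums(files, glob)))
```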

Regardless of how many datums you define and how many
filesystem objects a datum has, Pachyderm always matches the
number of input datums with the number of output datums. For
example, if you have three input datums in `pfs/`, you will
have three output datums in `pfs/out`. `pfs/out` is the
output repository that Pachyderm creates automatically for
each pipeline. You can add your changes in any order and
submit them in one or multiple commits; the result of your
pipeline processing remains the same.

Another aspect of Pachyderm data processing is
appending and overwriting files. By default, Pachyderm
appends new data to the existing data. For example, if you
have a file `foo` that is 100 KB in size in the repository `A`
and add the same file `foo` to that repository again by
using the `pachctl put file` command, Pachyderm does not
overwrite that file but appends the new data to the file `foo` in the
repo. Therefore, the size of the file `foo` doubles and
becomes 200 KB. To overwrite the file instead, use
the `--overwrite` flag. The order of processing
is not guaranteed; Pachyderm can process datums in any order.
For more information, see [File](../../data-concepts/file.md).
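The difference between the default append behavior and an overwrite can be modeled in a few lines of Python. This is only an illustration of the semantics described above: the `put_file` function and the dictionary that stands in for repository `A` are hypothetical and do not call `pachctl` or any Pachyderm API.

```python
def put_file(repo, path, data, overwrite=False):
    """Add data to a repo modeled as a dict of path -> bytes; return the new size."""
    if overwrite or path not in repo:
        repo[path] = data   # create the file, or replace it when overwriting
    else:
        repo[path] += data  # default behavior: append to the existing file
    return len(repo[path])


repo_a = {}
foo = b"x" * 100 * 1024  # 100 KB of sample data

print(put_file(repo_a, "foo", foo))                  # 102400
print(put_file(repo_a, "foo", foo))                  # 204800, the data was appended
print(put_file(repo_a, "foo", foo, overwrite=True))  # 102400, the file was replaced
```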

When new data comes in, a Pachyderm pipeline automatically
starts a new job. Each Pachyderm job consists of the
following stages:

1. Creation of input datums. In this stage, Pachyderm breaks
input files into datums according to the glob pattern setting
in the pipeline specification.
1. Transformation. The pipeline uses your code to process the
datums.
1. Creation of output datums. Pachyderm creates one or more files from the
processed data and combines them into output datums.
1. Merge. Pachyderm combines all files with the same file path
by appending, overwriting, or deleting them to create the final commit.
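The four stages can be sketched as a small Python function. This is a conceptual model under stated assumptions, not Pachyderm code: `run_job`, `one_file_per_datum`, and `count_bytes` are hypothetical names, datums are modeled as dictionaries of path to bytes, and the merge stage models appending only, in list order, whereas Pachyderm does not guarantee any order.

```python
def run_job(input_files, split, transform):
    """Model of the four job stages; all names here are illustrative."""
    # Stage 1: break the input files into datums.
    input_datums = split(input_files)
    # Stages 2 and 3: run the user code on each datum. Every input datum
    # yields exactly one output datum (a dict of output path -> bytes).
    output_datums = [transform(datum) for datum in input_datums]
    # Stage 4: merge files that share a path. Only appending is modeled here.
    commit = {}
    for output in output_datums:
        for path, data in output.items():
            commit[path] = commit.get(path, b"") + data
    return commit


def one_file_per_datum(files):
    """Hypothetical split: every input file becomes its own datum."""
    return [{path: data} for path, data in sorted(files.items())]


def count_bytes(datum):
    """Hypothetical user code: write the datum's size to a shared output path."""
    total = sum(len(data) for data in datum.values())
    return {"/total": f"{total}\n".encode()}


commit = run_job({"/a": b"xx", "/b": b"yyy"}, one_file_per_datum, count_bytes)
print(commit)  # {'/total': b'2\n3\n'}
```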

If you think about this process in terms of filesystem objects and
processing abstractions, the following transformation happens:

!!! note ""
    **input files => input datums => output datums => output files**

This section provides examples that help you understand such fundamental
Pachyderm concepts as the datum, incremental processing, and phases of
data processing.

## Example 1: One file in the input datum, one file in the output datum

The simplest example of datum processing is when you have one file in
the input datum that results in one file in the output datum.
In the diagram below, you can see three input datums, each of which
includes one file, that result in three output datums. Whether you
submit these datums in a single commit or in multiple commits, the final
result remains the same: three datums, each of which has one file.

In the diagram below, you can see the following datums:

- `datum 1` has one file and results in one file in one output datum.
- `datum 2` has one file and results in one file in one output datum.
- `datum 3` has one file and results in one file in one output datum.

![One to one](../../../assets/images/d_datum_processing_one_to_one.svg)

If you overwrite a single line in the file in `datum 3` and
add `datum 4`, Pachyderm sees four datums and checks them for changes
one by one. Pachyderm verifies that there are no changes in `datum 1` and
`datum 2` and skips these datums. Pachyderm detects the change in
`datum 3` and, because of the `--overwrite` flag, replaces `datum 3` with the
new `datum 3'`. Because `datum 4` is a completely new datum,
Pachyderm processes the whole datum. Although only two datums were
processed, the output commit of this change contains all four files.

![One to one overwrite](../../../assets/images/d_datum_processing_one_to_one_overwrite.svg)
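The skip-or-process decision in this example can be approximated with content hashing. The sketch below is an assumption-laden model rather than Pachyderm's actual mechanism: `datum_hash`, `run_job`, and `upper` are hypothetical names, datums are identified by a label for readability, and results from the previous job are kept in a plain dictionary.

```python
import hashlib


def datum_hash(files):
    """Hash the paths and contents of the files in one datum."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode())
        digest.update(files[path])
    return digest.hexdigest()


def run_job(datums, previous, transform):
    """Process only new or changed datums and reuse earlier results for the rest.

    datums:   datum name -> {file path: bytes}
    previous: datum name -> (hash, output) from the last job
    """
    results, processed = {}, []
    for name, files in datums.items():
        key = datum_hash(files)
        cached = previous.get(name)
        if cached is not None and cached[0] == key:
            results[name] = cached  # unchanged datum: skip processing
        else:
            results[name] = (key, transform(files))
            processed.append(name)
    return results, processed


def upper(files):
    """Hypothetical user code: upper-case every file in the datum."""
    return {path: data.upper() for path, data in files.items()}


datums = {f"datum {i}": {f"/file{i}": f"line {i}\n".encode()} for i in (1, 2, 3)}
first, processed_first = run_job(datums, {}, upper)

datums["datum 3"] = {"/file3": b"changed line\n"}  # overwrite the file in datum 3
datums["datum 4"] = {"/file4": b"line 4\n"}        # add a new datum
second, processed_second = run_job(datums, first, upper)

print(processed_first)   # ['datum 1', 'datum 2', 'datum 3']
print(processed_second)  # ['datum 3', 'datum 4']
```

Only two datums are processed in the second job, but `second` still holds results for all four datums, which mirrors the four files in the output commit.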

## Example 2: One file in the input datum, multiple files in the output datum

Some pipelines ingest one file in one input datum and create multiple
files in the output datum. The files in the output datums might need to
be appended or overwritten with other files to create the final commit.

If you apply changes to that datum, Pachyderm does not detect which
particular part of the datum has changed and processes the whole datum.
In the diagram below, you have the following datums:

- `datum 1` has one file and results in files `1` and `3`.
- `datum 2` has one file and results in files `2` and `3`.
- `datum 3` has one file and results in files `2` and `1`.

![One to many](../../../assets/images/d_datum_processing_one_to_many.svg)

Pachyderm processes all these datums independently, and in the end, it needs
to create a commit by combining the results of processing these datums.
A commit is a filesystem that has specific constraints, such as not allowing
duplicate files with the same file path. Therefore, Pachyderm merges results from
different output datums with the same file path into single files. For
example, `datum 1` produces `pfs/out/1` and `datum 3` produces `pfs/out/1`.
Pachyderm merges these two files by appending one to the other
in no particular order. Therefore, the file `1` in the final
commit has parts from `datum 1` and `datum 3`.

If you decide to create a new commit and overwrite the file in `datum 2`,
Pachyderm detects three datums. Because `datum 1` and `datum 3` are
unchanged, it skips processing these datums. Then, Pachyderm detects
that something has changed in `datum 2`. Pachyderm is unaware of any
details of the change. Therefore, it processes the whole `datum 2`
and outputs the files `1`, `3`, and `4`. Then, Pachyderm merges
these datums to create the following final result:

![One to many overwrite](../../../assets/images/d_datum_processing_one_to_many_overwrite.svg)

In the diagram above, Pachyderm appends the file `1` from `datum 2`
to the file `1` in the final commit, removes the part of the file `2` that
`datum 2` previously produced, overwrites the old part from `datum 2` in
the file `3` with a new version, and creates a new output file `4`.

Similarly, if you have multiple files in your input datum, Pachyderm might
write them into multiple files in output datums that are later merged into
files with the same file path.
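The one-to-many merge in this example can be traced with a short Python sketch that records which datums contribute to each output file. The `merge` function is a hypothetical helper for illustration only. It tracks provenance instead of file contents, and it sorts the contributing datums purely to make the output deterministic, because Pachyderm does not guarantee the append order.

```python
def merge(output_datums):
    """Map each output path to the sorted names of the datums that wrote it."""
    commit = {}
    for name, paths in output_datums.items():
        for path in paths:
            commit.setdefault(path, []).append(name)
    return {path: sorted(names) for path, names in sorted(commit.items())}


# Output files produced by each datum in the first job.
outputs = {
    "datum 1": ["1", "3"],
    "datum 2": ["2", "3"],
    "datum 3": ["2", "1"],
}
before = merge(outputs)

# The file in datum 2 is overwritten, so the whole datum is reprocessed
# and now produces the files 1, 3, and 4.
outputs["datum 2"] = ["1", "3", "4"]
after = merge(outputs)

for path in after:
    print(path, before.get(path, []), "->", after[path])
```

In this model, file `1` gains a part from `datum 2`, file `2` keeps only the part from `datum 3`, file `3` still has parts from `datum 1` and `datum 2`, and file `4` is new.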

## Note: Data persistence between datums

Pachyderm only controls and wipes the `/pfs` directories between datums. If your code uses scratch or temporary space during execution, you must clean it up yourself. Leftover temporary directories can cause unexpected bugs in which one datum accesses temporary files that were previously used by another datum.
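One way to avoid leaking scratch data between datums is to create the temporary space inside a context manager that removes it when processing finishes. The following sketch uses Python's standard `tempfile.TemporaryDirectory`. The `process_datum` function, the file name `intermediate.txt`, and the upper-casing step are hypothetical placeholders, and in a real pipeline the input and output paths would be under `/pfs`.

```python
import os
import tempfile


def process_datum(input_path, output_path):
    """Transform one datum, keeping intermediate files in a private scratch dir."""
    # TemporaryDirectory deletes the directory and its contents on exit,
    # so nothing is left behind for the next datum handled by the same worker.
    with tempfile.TemporaryDirectory() as scratch:
        intermediate = os.path.join(scratch, "intermediate.txt")
        with open(input_path) as src, open(intermediate, "w") as tmp:
            tmp.write(src.read().upper())  # placeholder transformation
        with open(intermediate) as tmp, open(output_path, "w") as dst:
            dst.write(tmp.read())
    return scratch  # returned only so that callers can verify the cleanup
```

Because the scratch directory is created and removed within a single call, each datum starts with a clean temporary space even when the same worker processes many datums in a row.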