.. _datum:

Datum
=====

A datum is the smallest indivisible unit of computation within a job.
A job can have one, many, or no datums. Each datum is processed
independently with a single execution of the user code, and
then the results of all the datums are merged together to
create the final output commit.

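The process-then-merge flow can be pictured with a minimal Python sketch.
It is an illustration only, not Pachyderm's implementation: ``run_job``,
``count_lines``, and the sample data are hypothetical, and the sketch
assumes that each datum writes to distinct output paths.

```python
def run_job(datums, user_code):
    """Run the (hypothetical) user code once per datum, then merge the outputs."""
    output_commit = {}
    for datum_id, files in datums.items():
        # Each datum gets its own execution and sees only its own files.
        result = user_code(files)
        # Merge this datum's output files into the final result.
        # Assumption: datums write to distinct output paths.
        output_commit.update(result)
    return output_commit


def count_lines(files):
    """Stand-in for user code: write one line count per input file."""
    return {
        f"/counts{path}": len(content.splitlines())
        for path, content in files.items()
    }


# Two datums, each holding the files that belong to it.
datums = {
    "/CA": {"/CA/a.csv": "1\n2\n"},
    "/NY": {"/NY/a.csv": "3\n"},
}
commit = run_job(datums, count_lines)
empty_commit = run_job({}, count_lines)  # a job with no datums
```

A job with no datums runs the user code zero times, so the merged result
in this sketch is empty.
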
The number of datums for a job is defined by the `glob pattern
<glob-pattern.html>`__ that you specify for each input. Think of
datums as your way of telling Pachyderm how to divide your
input data so that it can distribute computation efficiently and
process only the *new* data. You can configure a whole
input repository to be one datum, each top-level filesystem object
to be a separate datum, specific paths to be datums,
and so on. Datums affect how Pachyderm distributes processing workloads
and are instrumental in optimizing your configuration for best performance.

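As a rough illustration of how a glob pattern divides a repository into
datums, the following Python sketch groups file paths by the part of the
path that the pattern matches. The ``split_into_datums`` helper and the file
names are hypothetical, and matching each path component with ``fnmatch`` is
a simplification rather than Pachyderm's actual matcher.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def split_into_datums(paths, glob):
    """Group file paths into datums (hypothetical helper, not Pachyderm code)."""
    pattern_parts = [p for p in glob.split("/") if p]
    datums = defaultdict(list)
    for path in sorted(paths):
        parts = [p for p in path.split("/") if p]
        if len(parts) < len(pattern_parts):
            continue  # the path is shallower than the pattern, so no match
        prefix = parts[:len(pattern_parts)]
        # Compare component by component so that "*" never crosses a "/".
        if all(fnmatchcase(c, pat) for c, pat in zip(prefix, pattern_parts)):
            datums["/" + "/".join(prefix)].append(path)
    return dict(datums)


files = ["/CA/a.csv", "/CA/b.csv", "/NY/a.csv", "/README.md"]

whole_repo = split_into_datums(files, "/")      # the whole repository is one datum
top_level = split_into_datums(files, "/*")      # one datum per top-level object
second_level = split_into_datums(files, "/*/*") # one datum per second-level object
```

In this sketch, ``/`` produces a single datum that holds all four files,
``/*`` produces three datums (``/CA``, ``/NY``, and ``/README.md``), and
``/*/*`` produces one datum for each file inside a directory.
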
Pachyderm takes each datum and processes it in isolation on one of
the pipeline worker nodes. You can define datums, workers, and other
performance parameters through the
corresponding fields in the `pipeline specification <../../../reference/pipeline-spec.html>`__.

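For orientation, the sketch below builds a small pipeline specification as a
Python dictionary and prints it as JSON. The pipeline name, repository,
image, and command are hypothetical placeholders. The ``input.pfs.glob`` and
``parallelism_spec`` fields are shown as they are commonly written, so check
the pipeline specification reference for the exact fields that your version
supports.

```python
import json

# Hypothetical pipeline: every value below is a placeholder for illustration.
spec = {
    "pipeline": {"name": "count-lines"},
    "transform": {
        "image": "example/count-lines:latest",  # hypothetical image
        "cmd": ["python3", "/count_lines.py"],  # hypothetical user code
    },
    "input": {
        "pfs": {
            "repo": "sales",  # hypothetical input repository
            "glob": "/*",     # each top-level object becomes a datum
        }
    },
    # Number of workers that process datums in parallel.
    "parallelism_spec": {"constant": 2},
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Changing ``glob`` to ``/`` in this sketch would treat the whole repository
as a single datum, which leaves nothing to distribute across the two workers.
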
To understand how datums affect data processing in Pachyderm, you need to
understand the following subconcepts:

.. toctree::
   :maxdepth: 1

   glob-pattern.md
   relationship-between-datums.md
   cross-union.md