# Datum

!!! note "TL;DR"
    Datums define what input data is seen by your code. A datum can be
    all data at once, each directory independently, individual
    files one by one, or combined data from multiple inputs.

A datum is the smallest indivisible unit of computation within a job.
A job can have one, many, or no datums. Each datum is processed
independently with a single execution of the user code, and
the results of all the datums are then merged to
create the final output commit.

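As a rough mental model of the paragraph above, the following Python sketch runs a piece of user code once per datum and merges the per-datum results into one output. Everything in it is hypothetical and for illustration only: `run_job`, `word_count`, and the file paths are not Pachyderm APIs, and the merge is a plain dictionary update that does not model how Pachyderm reconciles datums that write to the same output path.

```python
from typing import Callable, Dict, List

# A datum is modeled here as a mapping of input file paths to file contents.
Datum = Dict[str, str]


def run_job(datums: List[Datum],
            user_code: Callable[[Datum], Dict[str, str]]) -> Dict[str, str]:
    """Run user_code once per datum and merge the per-datum outputs."""
    output_commit: Dict[str, str] = {}
    for datum in datums:
        # Each execution sees only its own datum, never the others.
        result = user_code(datum)
        # Assumption: a simple dict update stands in for the real merge step.
        output_commit.update(result)
    return output_commit


def word_count(datum: Datum) -> Dict[str, str]:
    """Hypothetical user code: write one word-count file per input file."""
    return {f"/counts{path}": str(len(text.split()))
            for path, text in datum.items()}


datums = [{"/a.txt": "one two three"}, {"/b.txt": "four five"}]
merged = run_job(datums, word_count)
print(merged)  # {'/counts/a.txt': '3', '/counts/b.txt': '2'}
```

A job with no datums simply produces an empty result in this model: `run_job([], word_count)` returns `{}`.
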
Datums define what input data is seen by your code. An input can draw on one or more
repositories. Pachyderm has the following types of inputs that
combine multiple repositories:

**Cross**
:    A cross input creates a cross-product of multiple repositories:
     each datum from one repository is combined with each
     datum from the other repository.

**Join**
:    A join input enables you to join files that are stored
     in different Pachyderm repositories and match a particular
     file path pattern. A join is similar to a `cross`, except that
     instead of matching every pair of datums from each input,
     it matches only specific ones based on file paths.
     Conceptually, joins are similar to a
     database inner join, although they match only
     on file paths, not on the actual file content.

**Union**
:    A union input takes multiple repositories and processes
     all the data in each input independently. The pipeline
     processes the datums in no defined order, and the output
     repository includes results from all input sources.

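The three input types above differ only in how datums from the individual repositories are combined. The Python sketch below imitates that combination step on two small lists of file paths. It is a conceptual model, not Pachyderm's implementation: the repository names, paths, helper functions, and the regular expression used as the join key are all assumptions made for illustration.

```python
import itertools
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical datums (one per file) from two made-up repositories.
readings = ["/readings/1.txt", "/readings/2.txt"]
parameters = ["/parameters/1.txt", "/parameters/3.txt"]


def cross(a: List[str], b: List[str]) -> List[Tuple[str, ...]]:
    """Cross: every datum of `a` is paired with every datum of `b`."""
    return list(itertools.product(a, b))


def join(a: List[str], b: List[str], pattern: str) -> List[Tuple[str, ...]]:
    """Join: pair only the datums whose captured path fragment is equal."""
    def key(path: str) -> Optional[str]:
        match = re.search(pattern, path)
        return match.group(1) if match else None

    # Index the second input by its join key.
    by_key: Dict[str, List[str]] = {}
    for path in b:
        k = key(path)
        if k is not None:
            by_key.setdefault(k, []).append(path)
    # Unmatched datums are dropped, as in an inner join.
    return [(x, y) for x in a for y in by_key.get(key(x), [])]


def union(a: List[str], b: List[str]) -> List[Tuple[str, ...]]:
    """Union: each datum comes from exactly one input; nothing is paired."""
    return [(p,) for p in a] + [(p,) for p in b]


cross_datums = cross(readings, parameters)
join_datums = join(readings, parameters, r"/(\d+)\.txt$")
union_datums = union(readings, parameters)

print(len(cross_datums))  # 4
print(join_datums)        # [('/readings/1.txt', '/parameters/1.txt')]
print(len(union_datums))  # 4
```

In this model the cross produces 2 × 2 = 4 datums, the join keeps only the single pair whose file names share the key `1`, and the union yields 2 + 2 = 4 single-input datums. The list order in `union` is an artifact of the sketch; as noted above, the pipeline processes datums in no defined order.
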
The number of datums for a job is defined by the
[glob pattern](glob-pattern.md) that you specify for each input. Think of
datums as a way of telling Pachyderm how to divide your
input data to efficiently distribute computation and
process only the *new* data. You can configure a whole
input repository to be one datum, each top-level filesystem object
to be a separate datum, specific paths to be datums,
and so on. Datums affect how Pachyderm distributes processing workloads
and are instrumental in optimizing your configuration for best performance.

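To make the effect of the glob pattern more concrete, the Python sketch below groups a hypothetical list of files into datums for a few patterns. It is a deliberately simplified model: it assumes that each path component of the pattern is matched against the corresponding component of the file path, and that all files under a matched path belong to the same datum. The function `split_into_datums` and the file names are invented for illustration, and the real glob syntax is richer than what is modeled here.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Hypothetical contents of an input repository.
files = [
    "/users/alice.json",
    "/users/bob.json",
    "/logs/2020/jan.log",
    "/README.md",
]


def split_into_datums(paths: List[str], glob: str) -> Dict[str, List[str]]:
    """Group files into datums keyed by the path that the glob matched."""
    glob_parts = [part for part in glob.split("/") if part]
    depth = len(glob_parts)
    datums: Dict[str, List[str]] = {}
    for path in paths:
        parts = [part for part in path.split("/") if part]
        if len(parts) < depth:
            continue  # The path is too shallow to match the pattern.
        if all(fnmatchcase(p, g) for p, g in zip(parts, glob_parts)):
            # Files under the same matched path share one datum.
            key = "/" + "/".join(parts[:depth])
            datums.setdefault(key, []).append(path)
    return datums


print(len(split_into_datums(files, "/")))          # 1
print(sorted(split_into_datums(files, "/*")))      # ['/README.md', '/logs', '/users']
print(sorted(split_into_datums(files, "/users/*")))
# ['/users/alice.json', '/users/bob.json']
```

In this model, `/` turns the whole repository into a single datum, `/*` makes each top-level file or directory its own datum, and `/users/*` selects only the files under `/users`, one datum each. Files that the pattern does not match are left out of the input altogether.
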
Pachyderm takes each datum and processes it in isolation on one of
the pipeline worker nodes. You can configure datums, workers, and other
performance parameters through the
corresponding fields in the [pipeline specification](../../../reference/pipeline_spec.md).

To understand how datums affect data processing in Pachyderm, you need to
understand the following subconcepts:

* [Glob Pattern](glob-pattern.md)
* [Datum Processing](relationship-between-datums.md)