# Datum

!!! note "TL;DR"
    Datums define what input data is seen by your code. A datum can be
    all data at once, each directory independently, individual
    files one by one, or combined data from multiple inputs together.

A datum is the smallest indivisible unit of computation within a
job.
A job can have one, many, or no datums. Each datum is processed
independently with a single execution of the user code, and
then the results of all the datums are merged together to
create the final output commit.

A datum defines the input data. An input can take one or more
repositories. Pachyderm has the following types of inputs that
combine multiple repositories:

**Cross**
:   A cross input creates a cross-product of multiple repositories.
    Therefore, each datum from one repository is combined with each
    datum from the other repository.

**Join**
:   A join input enables you to join files that are stored
    in different Pachyderm repositories and match a particular
    file path pattern. A join is similar to a `cross`, except that
    instead of pairing every datum from each input,
    it pairs only the datums whose file paths match.
    Conceptually, a join is similar to a
    database inner join, although it matches only
    on file paths, not on the actual file content.

**Union**
:   A union input takes multiple repositories and processes
    all the data in each input independently. The pipeline
    processes the datums in no defined order, and the output
    repository includes results from all input sources.

The number of datums for a job is defined by the
[glob pattern](glob-pattern.md) that you specify for each input. Think of
datums as if you were telling Pachyderm how to divide your
input data to efficiently distribute computation and
only process the *new* data.
You can configure a whole
input repository to be one datum, each top-level filesystem object
to be a separate datum, specific paths to be datums,
and so on. Datums affect how Pachyderm distributes processing workloads
and are instrumental in optimizing your configuration for best performance.

Pachyderm takes each datum and processes it in isolation on one of
the pipeline worker nodes. You can configure datums, workers, and other
performance parameters through the
corresponding fields in the [pipeline specification](../../../reference/pipeline_spec.md).

To understand how datums affect data processing in Pachyderm, you need to
understand the following subconcepts:

* [Glob Pattern](glob-pattern.md)
* [Datum Processing](relationship-between-datums.md)
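
As a rough illustration of how the cross, join, and union inputs described on this page combine datums, the following plain-Python sketch models each input type on two small file listings. It does not call Pachyderm. The repository contents, the function names, and the `stem` join key are all hypothetical, and the sketch assumes that each top-level file is one datum.

```python
from itertools import product

# Hypothetical file listings for two input repositories. Each path is
# treated as one datum, as if every top-level file were its own datum.
readings = ["/1001.csv", "/1002.csv"]
params = ["/1001.json", "/1003.json"]


def cross(a, b):
    """Pair every datum in `a` with every datum in `b`."""
    return [(x, y) for x, y in product(a, b)]


def union(a, b):
    """Treat every datum from either input as its own independent datum."""
    return [(x,) for x in a] + [(y,) for y in b]


def join(a, b, key):
    """Pair only the datums whose paths produce the same key."""
    return [(x, y) for x, y in product(a, b) if key(x) == key(y)]


def stem(path):
    """Hypothetical join key: the file name without directory or extension."""
    return path.rsplit("/", 1)[-1].split(".", 1)[0]


cross_datums = cross(readings, params)
union_datums = union(readings, params)
join_datums = join(readings, params, stem)

print(len(cross_datums))  # 4
print(len(union_datums))  # 4
print(join_datums)        # [('/1001.csv', '/1001.json')]
```

In this model, the cross produces every pairing (2 × 2 = 4 datums), the union produces one datum per file across both inputs (2 + 2 = 4), and the join keeps only the single pair whose paths share the key `1001`. As in the description of joins above, matching depends only on file paths, never on file content.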