# Datum Processing

This section helps you to understand the following
concepts:

* Pachyderm job stages
* Processing of multiple datums
* Incremental processing

A datum is a Pachyderm abstraction that helps in optimizing
pipeline processing. Because datums exist only as a pipeline
processing property and are not filesystem objects, you can never
list or copy a datum. Instead, a datum, as a representation of a unit
of work, helps you to run your pipelines much faster by avoiding
repeated processing of unchanged datums. For example, if you have
multiple datums, and only one datum was modified, Pachyderm processes
only that datum and skips processing other datums. This incremental
behavior ensures efficient resource utilization.

Each Pachyderm job can process multiple datums, which can consist
of one or multiple files. While each input datum results in one output
datum, the number of files in the output datum might differ from
the number of files in the input datum.

When you create a pipeline specification, one of the most important
fields that you need to configure is `pfs/`, or PFS input.
The PFS input field is where you define a data source from which
the pipeline pulls data for further processing. The `glob`
parameter determines how the data in the `pfs/` source
repository is split into datums. Thus, you can define everything in
the source repository to be processed as a single datum or break it
down into multiple datums. The way you break your source repository
into datums directly affects incremental processing and your pipeline
processing speed. You know your data best and can decide
how to optimize your pipeline based on the repository structure
and data generation workflows.
For more information about glob patterns, see
[Glob Pattern](glob-pattern.md).
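
To make the effect of the `glob` parameter more concrete, the following
Python sketch groups file paths into datums under a simplified model.
It is not Pachyderm code: the `split_into_datums` function, the sample
file names, and the rule that a datum is the set of files sharing the
path prefix matched by the glob are all assumptions made for
illustration.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def split_into_datums(paths, glob):
    """Group file paths into datums under a simplified glob model.

    Assumption: a datum is the set of files that share the path prefix
    matched by the glob; "/" puts the whole repository into one datum.
    """
    pattern = [part for part in glob.split("/") if part]
    datums = defaultdict(list)
    for path in sorted(paths):
        parts = [part for part in path.split("/") if part]
        if len(parts) < len(pattern):
            continue  # the path is shallower than the glob, so no match
        prefix = parts[:len(pattern)]
        if all(fnmatchcase(p, g) for p, g in zip(prefix, pattern)):
            datums["/" + "/".join(prefix)].append(path)
    return dict(datums)


# Hypothetical repository contents.
files = ["/a/1.csv", "/a/2.csv", "/b/3.csv"]

whole_repo = split_into_datums(files, "/")    # everything in one datum
per_dir = split_into_datums(files, "/*")      # one datum per directory
per_file = split_into_datums(files, "/*/*")   # one datum per file
```

In this model, a change to `/a/1.csv` invalidates the single datum
under `/`, the `/a` datum under `/*`, and only the `/a/1.csv` datum
under `/*/*`, which is why the choice of glob affects how much work
can be skipped.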

Regardless of how many datums you define and how many
filesystem objects a datum has, Pachyderm always matches the
number of input datums with the number of output datums. For
example, if you have three input datums in `pfs/`, you will
have three output datums in `pfs/out`. `pfs/out` is the
output repository that Pachyderm creates automatically for
each pipeline. You can add your changes in any order and
submit them in one or multiple commits; the result of your
pipeline processing remains the same.

Another aspect of Pachyderm data processing is
appending and overwriting files. By default, Pachyderm
appends new data to the existing data. For example, if you
have a file `foo` that is 100 KB in size in the repository `A`
and add the same file `foo` to that repository again by
using the `pachctl put file` command, Pachyderm does not
overwrite that file but appends the new content to the file
`foo` in the repo. Therefore, the size of the file `foo` doubles
and becomes 200 KB. Pachyderm enables you to overwrite files as
well by using the `--overwrite` flag. The order in which datums
are processed is not guaranteed.
For more information, see [File](../../data-concepts/file.md).

When new data comes in, a Pachyderm pipeline automatically
starts a new job. Each Pachyderm job consists of the
following stages:

1. Creation of input datums. In this stage, Pachyderm breaks
   input files into datums according to the glob pattern setting
   in the pipeline specification.
1. Transformation. The pipeline uses your code to process the
   datums.
1. Creation of output datums. Pachyderm creates one or more files
   from the processed data and combines them into output datums.
1. Merge. Pachyderm combines all files with the same file path
   by appending, overwriting, or deleting them to create the final commit.
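
The stages above can be sketched in a few lines of Python. This is a
toy model rather than Pachyderm's implementation: `run_job`,
`count_lines`, and the sample data are hypothetical names, the input is
assumed to be already split into datums (stage 1), and the merge only
appends, in sorted datum order so that the result is reproducible.

```python
def run_job(input_datums, transform):
    """Run stages 2 to 4 on datums that are already split (stage 1)."""
    # Stages 2 and 3: run the user code once per input datum. Each call
    # returns that datum's output files as a {path: content} mapping.
    output_datums = {
        name: transform(files) for name, files in input_datums.items()
    }

    # Stage 4: merge files that share a path by appending their contents.
    # Sorting is only for reproducibility of this sketch.
    commit = {}
    for name in sorted(output_datums):
        for path, content in output_datums[name].items():
            commit[path] = commit.get(path, "") + content
    return output_datums, commit


def count_lines(files):
    """Hypothetical user code: write a line count to a shared path."""
    total = sum(content.count("\n") for content in files.values())
    return {"/counts.txt": f"{total}\n"}


input_datums = {
    "datum 1": {"/a.txt": "x\n"},
    "datum 2": {"/b.txt": "x\ny\n"},
    "datum 3": {"/c.txt": "x\ny\nz\n"},
}
output_datums, commit = run_job(input_datums, count_lines)
```

Three input datums yield three output datums, but because every output
datum writes to the same path, the final commit holds a single file
that contains one part per datum.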

If you think about this process in terms of filesystem objects and
processing abstractions, the following transformation happens:

!!! note ""
    **input files => input datums => output datums => output files**

This section provides examples that help you understand such fundamental
Pachyderm concepts as the datum, incremental processing, and phases of
data processing.

## Example 1: One file in the input datum, one file in the output datum

The simplest example of datum processing is when you have one file in
the input datum that results in one file in the output datum.
In the diagram below, you can see three input datums, each of which
includes one file, that result in three output datums. Whether you have
submitted all these datums in a single commit or in multiple commits,
the final result remains the same: three datums, each of which has one
file.

In the diagram below, you can see the following datums:

- `datum 1` has one file and results in one file in one output datum.
- `datum 2` has one file and results in one file in one output datum.
- `datum 3` has one file and results in one file in one output datum.

If you decide to overwrite a single line in the file in `datum 3` and
add `datum 4`, Pachyderm sees the four datums and checks them for changes
one by one. Pachyderm verifies that there are no changes in `datum 1` and
`datum 2` and skips these datums. Pachyderm detects the changes in
`datum 3` and the `--overwrite` flag and replaces `datum 3` with the
new `datum 3'`. When it detects `datum 4` as a completely new datum,
it processes the whole datum as new. Although only two datums were
processed, the output commit of this change contains all four files.
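
The skip-or-process decision in this example can be illustrated with a
small Python sketch. It assumes that change detection works by
comparing a content fingerprint of each datum, which is an assumption
for illustration and not a description of Pachyderm internals; the
`datum_hash` and `datums_to_process` helpers and the file contents are
hypothetical.

```python
import hashlib


def datum_hash(files):
    """Fingerprint a datum from its file paths and contents."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode())
        digest.update(b"\0")
        digest.update(files[path].encode())
        digest.update(b"\0")
    return digest.hexdigest()


def datums_to_process(datums, seen_hashes):
    """Return the names of datums whose fingerprint was not seen before."""
    return [
        name for name in sorted(datums)
        if datum_hash(datums[name]) not in seen_hashes
    ]


first_commit = {
    "datum 1": {"/1.txt": "a\n"},
    "datum 2": {"/2.txt": "b\n"},
    "datum 3": {"/3.txt": "c\n"},
}
seen = {datum_hash(files) for files in first_commit.values()}

# Second commit: the file in datum 3 is overwritten and datum 4 is added.
second_commit = dict(first_commit)
second_commit["datum 3"] = {"/3.txt": "c, edited\n"}
second_commit["datum 4"] = {"/4.txt": "d\n"}

to_run = datums_to_process(second_commit, seen)
```

Here `to_run` contains only `datum 3` and `datum 4`; the results for
`datum 1` and `datum 2` are reused, yet all four datums still
contribute to the output commit.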

## Example 2: One file in the input datum, multiple files in the output datum

Some pipelines ingest one file in one input datum and create multiple
files in the output datum. The files in the output datums might need to
be appended or overwritten with other files to create the final commit.

If you apply changes to an input datum, Pachyderm does not detect which
particular part of the datum has changed and processes the whole datum.
In the diagram below, you have the following datums:

- `datum 1` has one file and results in files `1` and `3`.
- `datum 2` has one file and results in files `2` and `3`.
- `datum 3` has one file and results in files `2` and `1`.

Pachyderm processes all these datums independently, and in the end, it needs
to create a commit by combining the results of processing these datums.
A commit is a filesystem that has specific constraints, such as not
allowing duplicate files with the same file path. Pachyderm therefore
merges results from different output datums with the same file path
into single files. For example, `datum 1` produces `pfs/out/1` and
`datum 3` produces `pfs/out/1`. Pachyderm merges these two files by
appending one to the other in no particular order. Therefore, the
file `1` in the final commit has parts from `datum 1` and `datum 3`.

If you decide to create a new commit and overwrite the file in `datum 2`,
Pachyderm detects three datums. Because `datum 1` and `datum 3` are
unchanged, it skips processing these datums. Then, Pachyderm detects
that something has changed in `datum 2`. Pachyderm is unaware of any
details of the change. Therefore, it processes the whole `datum 2`
and outputs the files `1`, `3`, and `4`.
Then, Pachyderm merges
these datums to create the following final result:

In the diagram above, Pachyderm appends the file `1` from `datum 2`
to the file `1` in the final commit, deletes the file `2` that came from
`datum 2`, overwrites the old part from `datum 2` in file `3` with a new
version, and creates a new output file `4`.

Similarly, if you have multiple files in your input datum, Pachyderm might
write them into multiple files in output datums that are later merged into
files with the same file path.
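
The merge in Example 2 can be traced with a short Python sketch that
records which datums contribute a part to each output file. This is
only a model of the bookkeeping described above: the `merge` helper is
hypothetical, and it rebuilds the whole mapping from the per-datum
outputs for simplicity, which is not necessarily how Pachyderm computes
the final commit.

```python
def merge(output_datums):
    """Map each output path to the datums that contribute a part to it."""
    commit = {}
    for name in sorted(output_datums):
        for path in output_datums[name]:
            commit.setdefault(path, []).append(name)
    return commit


# Output file paths per datum, as listed in Example 2.
outputs = {
    "datum 1": ["1", "3"],
    "datum 2": ["2", "3"],
    "datum 3": ["2", "1"],
}
before = merge(outputs)

# The file in datum 2 is overwritten; reprocessing it yields 1, 3, and 4.
# Only the entry for datum 2 is replaced; datums 1 and 3 are reused as-is.
outputs["datum 2"] = ["1", "3", "4"]
after = merge(outputs)
```

Comparing `before` and `after` shows the four effects described above:
file `1` gains a part from `datum 2`, file `2` keeps only the part from
`datum 3`, the `datum 2` part of file `3` is replaced, and file `4`
appears with a single part.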