github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.10.x/concepts/pipeline-concepts/datum/relationship-between-datums.md

# Datum Processing

This section helps you understand the following
concepts:

* Pachyderm job stages
* Processing of multiple datums
* Incremental processing
* Data persistence between datums

A datum is a Pachyderm abstraction that helps optimize
pipeline processing. Because datums exist only as a pipeline
processing property and are not filesystem objects, you cannot
list or copy a datum the way you list or copy files. Instead, a datum,
as a representation of a unit of work, helps you run your pipelines
much faster by avoiding repeated processing of unchanged datums.
For example, if you have multiple datums and only one of them was
modified, Pachyderm processes only that datum and skips the others.
This incremental behavior ensures efficient resource utilization.

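As a rough illustration of the skipping logic, the following sketch fingerprints each datum by hashing its file paths and contents, and selects only the datums whose fingerprint has not been seen before. The function names and the use of SHA-256 are assumptions made for this example; they do not describe how Pachyderm identifies datums internally.

```python
import hashlib


def datum_hash(files):
    """Fingerprint a datum given as {path: bytes}. Hypothetical helper."""
    digest = hashlib.sha256()
    for path in sorted(files):
        encoded_path = path.encode()
        content = files[path]
        # Length-prefix each field so that field boundaries are unambiguous.
        digest.update(len(encoded_path).to_bytes(8, "big") + encoded_path)
        digest.update(len(content).to_bytes(8, "big") + content)
    return digest.hexdigest()


def datums_to_process(datums, seen_hashes):
    """Return the names of datums whose fingerprint is not in seen_hashes."""
    return [name for name, files in datums.items()
            if datum_hash(files) not in seen_hashes]


previous = {
    "datum 1": {"/a.txt": b"alpha"},
    "datum 2": {"/b.txt": b"beta"},
    "datum 3": {"/c.txt": b"gamma"},
}
seen = {datum_hash(files) for files in previous.values()}

current = dict(previous)
current["datum 3"] = {"/c.txt": b"gamma, edited"}  # modified datum
current["datum 4"] = {"/d.txt": b"delta"}          # new datum

todo = datums_to_process(current, seen)
print(todo)  # ['datum 3', 'datum 4']
```

In this model, `datum 1` and `datum 2` are skipped because their fingerprints are unchanged.
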
Each Pachyderm job can process multiple datums, each of which can consist
of one or more files. While each input datum results in one output
datum, the number of files in the output datum might differ from
the number of files in the input datum.

When you create a pipeline specification, one of the most important
fields that you need to configure is `pfs`, or PFS input.
The PFS input field is where you define the data source from which
the pipeline pulls data for further processing. The `glob`
parameter defines how Pachyderm breaks the data in the source
repository into datums. Thus, you can define everything in the source
repository to be processed as a single datum or break it down into
multiple datums. The way you break your source repository into datums
directly affects incremental processing and your pipeline
processing speed. You know your data best and can decide
how to optimize your pipeline based on the repository structure
and data generation workflows.
For more information about glob patterns, see
[Glob Pattern](glob-pattern.md).

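To make the effect of the glob pattern concrete, here is a simplified model of how a list of file paths might be grouped into datums. It assumes that each path component of the pattern is matched independently with shell-style wildcards and that every distinct match becomes one datum. The function `split_into_datums` is hypothetical and handles only simple patterns such as `/`, `/*`, and `/*/*`; Pachyderm's real matcher supports more syntax.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def split_into_datums(paths, glob):
    """Group file paths by the filesystem object that `glob` matches."""
    pattern = [part for part in glob.split("/") if part]
    if not pattern:
        # The glob "/" treats the whole repository as a single datum.
        return {"/": sorted(paths)}
    datums = defaultdict(list)
    for path in paths:
        components = [c for c in path.split("/") if c]
        if len(components) < len(pattern):
            continue  # path is too shallow to match the pattern
        if all(fnmatchcase(c, p) for c, p in zip(components, pattern)):
            matched = "/" + "/".join(components[:len(pattern)])
            datums[matched].append(path)
    return {name: sorted(files) for name, files in sorted(datums.items())}


files = ["/2020/jan.csv", "/2020/feb.csv", "/2021/jan.csv"]

whole_repo = split_into_datums(files, "/")     # 1 datum
per_year = split_into_datums(files, "/*")      # 2 datums: /2020 and /2021
per_file = split_into_datums(files, "/*/*")    # 3 datums, one per file
```

With `/*`, a change to `/2021/jan.csv` would affect only the `/2021` datum, whereas with `/` any change affects the single datum that holds everything.
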
Regardless of how many datums you define and how many
filesystem objects a datum has, Pachyderm always matches the
number of input datums with the number of output datums. For
example, if you have three input datums in `pfs/`, you will
have three output datums in `pfs/out`. `pfs/out` is the
output repository that Pachyderm creates automatically for
each pipeline. You can add your changes in any order and
submit them in one or multiple commits; the result of your
pipeline processing remains the same.

Another aspect of Pachyderm data processing is
appending and overwriting files. By default, Pachyderm
appends new data to the existing data. For example, if you
have a file `foo` that is 100 KB in size in the repository `A`
and add the same file `foo` to that repository again by
using the `pachctl put file` command, Pachyderm does not
overwrite that file but appends the new data to the file `foo`
in the repo. Therefore, the size of the file `foo` doubles and
becomes 200 KB. To overwrite the file instead, use the
`--overwrite` flag. Pachyderm does not guarantee the order
in which datums are processed.
For more information, see [File](../../data-concepts/file.md).

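The difference between the two modes can be modeled in a few lines. The `put_file` function below is a hypothetical in-memory stand-in written for this example; it is not the `pachctl` command or a Pachyderm client API.

```python
def put_file(repo, path, data, overwrite=False):
    """Model append versus overwrite on an in-memory repo {path: bytes}."""
    if overwrite:
        repo[path] = data
    else:
        # Default behavior in this model: append to any existing content.
        repo[path] = repo.get(path, b"") + data


repo_a = {}
foo = b"x" * (100 * 1024)  # 100 KB of data

put_file(repo_a, "foo", foo)
put_file(repo_a, "foo", foo)
size_after_append = len(repo_a["foo"])  # 200 KB: the second put was appended

put_file(repo_a, "foo", foo, overwrite=True)
size_after_overwrite = len(repo_a["foo"])  # 100 KB: the file was replaced
```
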
When new data comes in, a Pachyderm pipeline automatically
starts a new job. Each Pachyderm job consists of the
following stages:

1. Creation of input datums. In this stage, Pachyderm breaks
   input files into datums according to the glob pattern setting
   in the pipeline specification.
1. Transformation. The pipeline uses your code to process the
   datums.
1. Creation of output datums. Pachyderm creates one or more files from
   the processed data and combines them into output datums.
1. Merge. Pachyderm combines all files with the same file path
   by appending, overwriting, or deleting them to create the final commit.

If you think about this process in terms of filesystem objects and
processing abstractions, the following transformation happens:

!!! note ""
    **input files => input datums => output datums => output files**

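The flow above can be sketched as a small in-memory model. Everything here is an assumption made for illustration: `run_job`, `datum_key` (which stands in for the glob pattern), and `transform` (which stands in for your code) are hypothetical names, and the merge stage models appending only.

```python
from collections import defaultdict


def run_job(input_files, datum_key, transform):
    """Model the four job stages on in-memory data.

    input_files: {path: str}
    datum_key:   maps a path to the name of its datum
    transform:   maps one input datum {path: str} to one output datum {path: str}
    """
    # Stage 1: creation of input datums.
    input_datums = defaultdict(dict)
    for path, content in input_files.items():
        input_datums[datum_key(path)][path] = content

    # Stages 2 and 3: transformation; each input datum yields one output datum.
    output_datums = {name: transform(files) for name, files in input_datums.items()}

    # Stage 4: merge files that share a path by appending them.
    # Datums are visited in sorted order only to keep this sketch reproducible.
    output_files = defaultdict(str)
    for name in sorted(output_datums):
        for path, content in output_datums[name].items():
            output_files[path] += content
    return dict(output_files)


def top_level_dir(path):
    # "/a/1.txt" -> "a"
    return path.split("/")[1]


def concat_lines(files):
    # Hypothetical user code: write all input content to one output file.
    return {"/all.txt": "".join(files[p] for p in sorted(files))}


result = run_job(
    {"/a/1.txt": "x\n", "/a/2.txt": "y\n", "/b/1.txt": "z\n"},
    top_level_dir,
    concat_lines,
)
```

Here three input files form two input datums (`a` and `b`), which produce two output datums that both write `/all.txt`; the merge stage combines them into a single output file.
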
This section provides examples that help you understand such fundamental
Pachyderm concepts as the datum, incremental processing, and phases of
data processing.

## Example 1: One file in the input datum, one file in the output datum

The simplest example of datum processing is when you have one file in
the input datum that results in one file in the output datum.
In the diagram below, you can see three input datums, each of which
includes one file, that result in three output datums. Whether you
submit all these datums in a single commit or in multiple commits, the
final result remains the same: three datums, each of which has one file.

The diagram shows the following datums:

- `datum 1` has one file and results in one file in one output datum.
- `datum 2` has one file and results in one file in one output datum.
- `datum 3` has one file and results in one file in one output datum.

![One to one](../../../assets/images/d_datum_processing_one_to_one.svg)

If you decide to overwrite a single line in the file in `datum 3` and
add `datum 4`, Pachyderm sees the four datums and checks them for changes
one by one. Pachyderm verifies that there are no changes in `datum 1` and
`datum 2` and skips these datums. Pachyderm detects the change in
`datum 3` and, because of the `--overwrite` flag, replaces `datum 3` with
the new `datum 3'`. Because `datum 4` is a completely new datum,
Pachyderm processes the whole datum as new. Although only two datums were
processed, the output commit of this change contains all four files.

![One to one overwrite](../../../assets/images/d_datum_processing_one_to_one_overwrite.svg)

## Example 2: One file in the input datum, multiple files in the output datum

Some pipelines ingest one file in one input datum and create multiple
files in the output datum. The files in the output datums might need to
be appended to or overwritten by other files to create the final commit.

If you apply changes to such a datum, Pachyderm does not detect which
particular part of the datum has changed and processes the whole datum.
In the diagram below, you have the following datums:

- `datum 1` has one file and results in files `1` and `3`.
- `datum 2` has one file and results in files `2` and `3`.
- `datum 3` has one file and results in files `2` and `1`.

![One to many](../../../assets/images/d_datum_processing_one_to_many.svg)

Pachyderm processes all these datums independently, and in the end, it needs
to create a commit by combining the results of processing these datums.
A commit is a filesystem and therefore has specific constraints; for example,
it cannot contain duplicate files with the same file path. Pachyderm merges
results from different output datums with the same file path into single
files. For example, `datum 1` produces `pfs/out/1` and `datum 3` also
produces `pfs/out/1`. Pachyderm merges these two files by appending one to
the other in no particular order. Therefore, the file `1` in the final
commit has parts from `datum 1` and `datum 3`.

If you decide to create a new commit and overwrite the file in `datum 2`,
Pachyderm detects three datums. Because `datum 1` and `datum 3` are
unchanged, it skips processing these datums. Then, Pachyderm detects
that something has changed in `datum 2`. Pachyderm is unaware of any
details of the change. Therefore, it processes the whole `datum 2`
and outputs the files `1`, `3`, and `4`. Then, Pachyderm merges
the output datums to create the following final result:

![One to many overwrite](../../../assets/images/d_datum_processing_one_to_many_overwrite.svg)

In the diagram above, Pachyderm appends the file `1` from `datum 2`
to the file `1` in the final commit, deletes the part of the file `2`
that came from `datum 2`, overwrites the old part from `datum 2` in
file `3` with a new version, and creates a new output file `4`.

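The merge before and after the change to `datum 2` can be modeled as follows. This sketch assumes that the output of every datum is kept per datum, so that unchanged datums contribute their earlier output without being reprocessed. The `merge` function and the placeholder contents (`d1`, `d2`, and so on) are hypothetical, and the sorted iteration is only there to make the result reproducible, because the real append order is not guaranteed.

```python
def merge(outputs_by_datum):
    """Combine per-datum outputs {datum: {path: part}} into {path: [parts]}."""
    merged = {}
    for datum in sorted(outputs_by_datum):
        for path, part in outputs_by_datum[datum].items():
            merged.setdefault(path, []).append(part)
    return merged


# Output of the first job: each datum writes two files.
outputs = {
    "datum 1": {"1": "d1", "3": "d1"},
    "datum 2": {"2": "d2", "3": "d2"},
    "datum 3": {"2": "d3", "1": "d3"},
}
first_commit = merge(outputs)

# Second job: only datum 2 is reprocessed; it now writes files 1, 3, and 4.
outputs["datum 2"] = {"1": "d2'", "3": "d2'", "4": "d2'"}
second_commit = merge(outputs)
```

In `second_commit`, file `1` gains a part from `datum 2`, file `2` keeps only the part from `datum 3`, file `3` carries the new `datum 2` part in place of the old one, and file `4` is new.
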
Similarly, if you have multiple files in your input datum, Pachyderm might
write them into multiple files in output datums that are later merged into
files with the same file path.

## Note: Data persistence between datums

Pachyderm only controls and wipes the `/pfs` directories between datums. If your code uses scratch or temporary space during execution, you must clean it up yourself. Leftover temporary directories can cause unexpected bugs in which one datum accesses temporary files that were previously used by another datum.
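
One way to avoid leftover scratch files, assuming your pipeline code is written in Python, is to create the scratch space with `tempfile.TemporaryDirectory`, which removes the directory when the `with` block exits. The `process_datum` function and the uppercase transformation are placeholders for this example.

```python
import os
import tempfile


def process_datum(data: bytes):
    """Process one datum using scratch space that is removed afterwards."""
    with tempfile.TemporaryDirectory() as scratch:
        tmp_path = os.path.join(scratch, "intermediate.bin")
        # Write an intermediate result to scratch space.
        with open(tmp_path, "wb") as f:
            f.write(data.upper())
        with open(tmp_path, "rb") as f:
            result = f.read()
    # The scratch directory and its contents are deleted at this point,
    # so the next datum cannot see them.
    return result, scratch


result, scratch_dir = process_datum(b"abc")
```

After the call, `scratch_dir` no longer exists on disk, even though the result computed from the intermediate file is still available.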