github.com/pachyderm/pachyderm@v1.13.4/doc/docs/archived/incrementality.md

# Incremental Processing

Pachyderm performs computations incrementally. Rather than computing a result
all at once, it computes the result in small pieces and then stitches those
pieces together. This lets Pachyderm reuse earlier results and compute much
more efficiently than traditional systems, which must recompute everything
from scratch in every job.

If you are new to the idea of Pachyderm "datums," see the [datum documentation](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/).

## Inter-datum Incrementality

Each input datum in a Pachyderm pipeline is processed in isolation, and the
results of these isolated computations are combined to create the final
result. Pachyderm never processes the same datum twice (unless you update a
pipeline with the `--reprocess` flag). If you commit new data that leaves some
of the existing datums intact, the results computed for those datums in a
previous job also remain intact and are not recalculated.

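One way to picture the skip decision is as a lookup keyed by each datum's identity. The Python sketch below is only a mental model, not Pachyderm's implementation: `datum_key`, `results`, and `needs_processing` are hypothetical names, and hashing the path together with the content is an assumption made for illustration.

```python
import hashlib


def datum_key(path: str, content: bytes) -> str:
    """Hypothetical datum identity: a hash over the file path and its content."""
    h = hashlib.sha256()
    h.update(path.encode("utf-8"))
    h.update(b"\0")  # separator so path and content cannot run together
    h.update(content)
    return h.hexdigest()


# Results of earlier jobs, keyed by datum identity (illustrative in-memory store).
results = {datum_key("/F1", b"v1"): "output for F1 v1"}


def needs_processing(path: str, content: bytes) -> bool:
    """A datum is processed only if no stored result exists for its key."""
    return datum_key(path, content) not in results


print(needs_processing("/F1", b"v1"))  # False: unchanged datum, result is reused
print(needs_processing("/F1", b"v2"))  # True: same path, modified content
print(needs_processing("/F2", b"v1"))  # True: datum not seen before
```
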
This inter-datum incrementality is best illustrated with an example. Suppose
you have a pipeline with a single input that looks like this:

```json
{
  "pfs": {
    "repo": "R",
    "glob": "/*"
  }
}
```

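With the glob pattern `/*`, each top-level file or directory in the repo is treated as its own datum. The sketch below models that grouping in Python. `datums_for_top_level_glob` is a hypothetical helper written for this page, it handles only the `/*` pattern, and the `/dir` paths are made-up sample data.

```python
from collections import defaultdict


def datums_for_top_level_glob(paths):
    """Group file paths by their top-level entry, mimicking a "/*" glob.

    Illustrative only: real glob matching supports many more patterns.
    """
    datums = defaultdict(list)
    for path in sorted(paths):
        top_level = "/" + path.strip("/").split("/")[0]
        datums[top_level].append(path)
    return dict(datums)


repo_files = ["/F1", "/F2", "/dir/a", "/dir/b"]  # hypothetical repo contents
for datum, files in datums_for_top_level_glob(repo_files).items():
    print(datum, files)
```
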
Now, suppose you make a commit to `R` that adds a single file, `F1`. Your
pipeline will run a job, and that job will find a single datum to process
(`F1`). This datum will be processed because it's the first time the pipeline
has seen `F1`.

![alt tag](incrementality1.png)

If you then make a second commit to `R` that adds another file, `F2`, the
pipeline will run a second job. This job will find two datums to process
(`F1` and `F2`). `F2` will be processed because it hasn't been seen before.
However, `F1` will NOT be processed, because an output from processing it
already exists in Pachyderm.

Instead, the output from the previous job for `F1` will be combined with the
new result from processing `F2` to create the output of this second job.
Reusing the result for `F1` effectively halves the amount of work needed to
process the second commit.

![alt tag](incrementality2.png)

Finally, suppose you make a third commit to `R` that modifies `F1`. Again,
you'll have a job that sees two datums (the new `F1` and the already processed
`F2`). This time `F2` won't be processed, but the new `F1` will be, because
its content differs from that of the old `F1`.

![alt tag](incrementality3.png)

Note that you don't need to do anything to enable this inter-datum
incrementality. It happens automatically, and it should be transparent from
your perspective. In the above example, you get the same result you would
have gotten if you had committed the same data in a single commit.

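The three-commit walkthrough above can be simulated in a few lines of Python. This is a sketch under stated assumptions rather than a description of Pachyderm internals: `datum_key`, `process`, and `run_job` are hypothetical names, the result store is a plain in-memory dict, and the file contents are made up. It prints the processed and skipped counts for each job, then checks that the final output matches what a single commit of the same data would produce.

```python
import hashlib


def datum_key(path, content):
    """Hypothetical datum identity: hash of path plus content."""
    return hashlib.sha256(path.encode("utf-8") + b"\0" + content).hexdigest()


def process(path, content):
    """Stand-in for user code; here it just upper-cases the file content."""
    return content.upper()


def run_job(repo, cache):
    """Run one job over a repo state; return (output, processed, skipped)."""
    output, processed, skipped = {}, 0, 0
    for path, content in sorted(repo.items()):
        key = datum_key(path, content)
        if key in cache:
            skipped += 1  # reuse the stored result
        else:
            cache[key] = process(path, content)
            processed += 1
        output[path] = cache[key]
    return output, processed, skipped


commits = [
    {"/F1": b"a"},                 # commit 1 adds F1
    {"/F1": b"a", "/F2": b"b"},    # commit 2 adds F2
    {"/F1": b"a2", "/F2": b"b"},   # commit 3 modifies F1
]

cache = {}
history = []
for repo in commits:
    output, processed, skipped = run_job(repo, cache)
    history.append((processed, skipped))
    print(f"{processed} + {skipped} / {processed + skipped}")

# Committing the final data in one go (empty cache) gives the same output.
single_output, _, _ = run_job(commits[-1], {})
print(output == single_output)
```
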
As of Pachyderm v1.5.1, `list job` and `inspect job` report how many datums a
job processed and how many it skipped. Below is an example of a job that had
5 datums, 3 of which were processed and 2 of which were skipped (shown in the
`PROGRESS` column as `3 + 2 / 5`).

```
ID                                   OUTPUT COMMIT                             STARTED        DURATION   RESTART PROGRESS  DL UL STATE
54fbc366-3f11-41f6-9000-60fc8860fa55 pipeline/9c348deb64304d118101e5771e18c2af 13 seconds ago 10 seconds 0       3 + 2 / 5 0B 0B success
```
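
If you want to read these counts from a script, the `PROGRESS` field in the example output can be split into its three numbers. The sketch assumes the field always has the `processed + skipped / total` shape shown above, which this page does not guarantee for every Pachyderm version. `parse_progress` is a hypothetical helper, not part of `pachctl`.

```python
import re

# Matches the "processed + skipped / total" shape from the example output.
PROGRESS_RE = re.compile(r"^\s*(\d+)\s*\+\s*(\d+)\s*/\s*(\d+)\s*$")


def parse_progress(field: str):
    """Return (processed, skipped, total) parsed from a PROGRESS field."""
    match = PROGRESS_RE.match(field)
    if match is None:
        raise ValueError(f"unrecognized PROGRESS field: {field!r}")
    processed, skipped, total = (int(group) for group in match.groups())
    return processed, skipped, total


processed, skipped, total = parse_progress("3 + 2 / 5")
print(processed, skipped, total)  # 3 2 5
```
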