# Incremental Processing

Pachyderm performs computations incrementally. That is, rather
than computing a result all at once, it computes it in small pieces and
then stitches those pieces together to form the result. This allows Pachyderm to reuse earlier results and compute
things much more efficiently than traditional systems, which are forced to compute everything from
scratch during every job.

If you are new to the idea of Pachyderm "datums," you can learn more [here](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/).

## Inter-datum Incrementality

Each of the input datums in a Pachyderm pipeline is processed in isolation, and the results of these isolated
computations are combined to create the final result. Pachyderm will never
process the same datum twice (unless you update a pipeline with the
`--reprocess` flag). If you commit new data in Pachyderm that leaves some of the previously existing datums
intact, the results of processing those pre-existing datums in a previous job will
also remain intact. That is, the previous results for those pre-existing datums won't
be recalculated.

This inter-datum incrementality is best illustrated with
an example. Suppose we have a pipeline with a single input that looks like this:

```json
{
  "pfs": {
    "repo": "R",
    "glob": "/*"
  }
}
```

Now, suppose you make a commit to `R` which adds a single file `F1`. Your
pipeline will run a job, and that job will find a single datum to process (`F1`).
This datum will be processed, because it's the first time the pipeline has
seen `F1`.

If you then make a second commit to `R` adding another file `F2`,
the pipeline will run a second job. This job will find two datums to
process (`F1` and `F2`). `F2` will be processed, because it hasn't been seen before.
However, `F1` will NOT be
processed, because an output from processing it already exists in Pachyderm.

Instead, the output from the previous job for `F1` will be combined with the
new result from processing `F2` to create the
output of this second job. This reuse of the result for `F1` effectively halves the amount of work necessary
to process the second commit: one datum is processed instead of two.

Finally, suppose you make a third commit to `R`, which modifies `F1`. Again
you'll have a job that sees two datums (the new `F1` and the already processed `F2`). This time
`F2` won't get processed, but the new `F1` will be processed because its
content differs from that of the old `F1`.

Note that you don't need to do anything to enable this
inter-datum incrementality. It happens automatically, and it should be transparent from
your perspective. In the above example, you get the
same result you would have gotten if you had committed the same data in a single
commit.

As of Pachyderm v1.5.1, `list job` and `inspect job` will tell you how many
datums the job processed and how many it skipped. Below is an example of
a job that had 5 datums, 3 that were processed and 2 that were skipped
(shown as `3 + 2 / 5` in the `PROGRESS` column).

```
ID                                   OUTPUT COMMIT                             STARTED        DURATION   RESTART PROGRESS  DL UL STATE
54fbc366-3f11-41f6-9000-60fc8860fa55 pipeline/9c348deb64304d118101e5771e18c2af 13 seconds ago 10 seconds 0       3 + 2 / 5 0B 0B success
```
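
The three-commit walkthrough above can be modeled in a few lines of Python. This is only an illustrative sketch, not Pachyderm's implementation: the `datum_key`, `run_job`, and `word_count` names are hypothetical, and identifying a datum by a hash of its path and content is an assumption made for the example. Each simulated job reuses cached results for datums it has already seen and prints its progress in the same `processed + skipped / total` form that the `PROGRESS` column uses.

```python
import hashlib


def datum_key(path, content):
    """Identify a datum by its path and content (an assumption of this sketch)."""
    return hashlib.sha256(path.encode() + b"\0" + content).hexdigest()


def run_job(commit, cache, process):
    """Process every datum in `commit`, reusing cached results where possible.

    `commit` maps file paths to file contents; each file is one datum,
    mirroring the "/*" glob in the example pipeline.
    Returns (output, processed_count, skipped_count).
    """
    output = {}
    processed = skipped = 0
    for path, content in sorted(commit.items()):
        key = datum_key(path, content)
        if key in cache:
            skipped += 1          # an identical datum was processed before
        else:
            cache[key] = process(content)
            processed += 1        # new or modified datum
        output[path] = cache[key]
    return output, processed, skipped


def word_count(content):
    """Stand-in for the user's pipeline code (hypothetical)."""
    return str(len(content.split()))


cache = {}
commit1 = {"/F1": b"a b c"}                    # adds F1
commit2 = {"/F1": b"a b c", "/F2": b"d e"}     # adds F2
commit3 = {"/F1": b"a b c d", "/F2": b"d e"}   # modifies F1

history = [run_job(c, cache, word_count) for c in (commit1, commit2, commit3)]
for output, processed, skipped in history:
    print(f"{processed} + {skipped} / {processed + skipped}", output)
```

In this model, running the third commit against an empty cache produces the same output as the incremental run, which mirrors the point that incrementality changes how much work is done, not the result.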