# Incremental Processing

Pachyderm performs computations incrementally. That is, rather than
computing a result all at once, it computes the result in small pieces and
then stitches those pieces together. This allows Pachyderm to reuse earlier
results and to avoid the work repeated by systems that recompute everything
from scratch during every job.

If you are new to the idea of Pachyderm "datums," see the
[datum documentation](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/).

## Inter-datum Incrementality

Each input datum in a Pachyderm pipeline is processed in isolation, and the
results of these isolated computations are combined to create the final
result. Pachyderm will never process the same datum twice (unless you update
a pipeline with the `--reprocess` flag). If you commit new data that leaves
some of the existing datums intact, the results that a previous job computed
for those datums also remain intact and won't be recalculated.

This inter-datum incrementality is best illustrated with an example. Suppose
we have a pipeline with a single input that looks like this:

```json
{
  "pfs": {
    "repo": "R",
    "glob": "/*"
  }
}
```

Now, suppose you make a commit to `R` that adds a single file, `F1`. Your
pipeline will run a job, and that job will find a single datum to process
(`F1`). This datum will be processed because it is the first time the
pipeline has seen `F1`.

![alt tag](incrementality1.png)

If you then make a second commit to `R` that adds another file, `F2`, the
pipeline will run a second job. This job will find two datums (`F1` and
`F2`). `F2` will be processed because it hasn't been seen before. However,
`F1` will NOT be processed, because the output from processing it already
exists in Pachyderm.

Instead, the output from the previous job for `F1` will be combined with the
new result from processing `F2` to create the output of this second job.
Reusing the result for `F1` effectively halves the amount of work needed to
process the second commit.

![alt tag](incrementality2.png)

Finally, suppose you make a third commit to `R` that modifies `F1`. Again,
you'll have a job that sees two datums (the new `F1` and the already
processed `F2`). This time `F2` won't be processed, but the new `F1` will be,
because its content differs from that of the old `F1`.

![alt tag](incrementality3.png)

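The three commits above can be replayed with a small simulation. This is only an illustrative sketch, not Pachyderm's implementation: the `datum_key` and `run_job` helpers, the word-count "pipeline," and the choice to identify a datum by its path plus a SHA-256 hash of its content are all assumptions made for the example.

```python
import hashlib


def datum_key(path: str, content: bytes) -> str:
    """Identify a datum by path and content hash (an assumption of this sketch)."""
    return hashlib.sha256(path.encode() + b"\0" + content).hexdigest()


def run_job(commit: dict, cache: dict, process) -> dict:
    """Process each datum in `commit`, reusing any result already in `cache`."""
    output, processed, skipped = {}, [], []
    for path, content in sorted(commit.items()):
        key = datum_key(path, content)
        if key in cache:
            skipped.append(path)            # result exists from an earlier job
        else:
            cache[key] = process(content)   # first time this datum is seen
            processed.append(path)
        output[path] = cache[key]
    return {"output": output, "processed": processed, "skipped": skipped}


def word_count(content: bytes) -> int:
    """Hypothetical per-datum computation."""
    return len(content.split())


cache = {}
job1 = run_job({"F1": b"a b"}, cache, word_count)                           # adds F1
job2 = run_job({"F1": b"a b", "F2": b"c d e"}, cache, word_count)           # adds F2
job3 = run_job({"F1": b"a b changed", "F2": b"c d e"}, cache, word_count)   # modifies F1

for n, job in enumerate((job1, job2, job3), start=1):
    print(f"job {n}: processed={job['processed']} skipped={job['skipped']}")
```

In this model, the second job skips `F1` and the third job skips `F2`, mirroring the behavior described above.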
Note that you don't need to do anything to enable this inter-datum
incrementality. It happens automatically, and it should be transparent from
your perspective. In the above example, you get the same result you would
have gotten if you had committed the same data in a single commit.

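The transparency property can be checked in a toy model: running three incremental jobs against a shared cache should yield the same final output as one job over the final data with an empty cache. The `process` and `run_job` functions below are hypothetical stand-ins, and keying the cache on `(path, content)` is an assumption of this sketch rather than a description of Pachyderm internals.

```python
def process(content: str) -> str:
    """Hypothetical per-datum computation."""
    return content.upper()


def run_job(files: dict, cache: dict) -> dict:
    """Compute {path: result}, processing only datums missing from `cache`."""
    output = {}
    for path, content in files.items():
        key = (path, content)
        if key not in cache:
            cache[key] = process(content)
        output[path] = cache[key]
    return output


final_data = {"F1": "new", "F2": "two"}

# Three incremental commits sharing one cache.
shared_cache = {}
run_job({"F1": "old"}, shared_cache)
run_job({"F1": "old", "F2": "two"}, shared_cache)
incremental = run_job(final_data, shared_cache)

# The same final data committed at once, starting from an empty cache.
single = run_job(final_data, {})

print(incremental == single)
```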
As of Pachyderm v1.5.1, `list job` and `inspect job` report how many datums
a job processed and how many it skipped. The example below shows a job with
5 datums, of which 3 were processed and 2 were skipped (the `PROGRESS` column
reads `3 + 2 / 5`).

```
ID                                   OUTPUT COMMIT                             STARTED            DURATION           RESTART PROGRESS      DL       UL       STATE
54fbc366-3f11-41f6-9000-60fc8860fa55 pipeline/9c348deb64304d118101e5771e18c2af 13 seconds ago     10 seconds         0       3 + 2 / 5     0B       0B       success
```
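If you want to read these counts in a script, the `PROGRESS` value can be split into its parts. The sketch below assumes the column always has the form `processed + skipped / total`, as in the example above; `parse_progress` is a hypothetical helper, not part of any Pachyderm tooling.

```python
import re


def parse_progress(progress: str) -> dict:
    """Split a PROGRESS value such as '3 + 2 / 5' into its three counts."""
    match = re.fullmatch(r"\s*(\d+)\s*\+\s*(\d+)\s*/\s*(\d+)\s*", progress)
    if match is None:
        raise ValueError(f"unrecognized PROGRESS value: {progress!r}")
    processed, skipped, total = (int(group) for group in match.groups())
    return {"processed": processed, "skipped": skipped, "total": total}


print(parse_progress("3 + 2 / 5"))
```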