github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.11.x/how-tos/splitting-data/index.md

# Split Data

Pachyderm enables you to split your data while it is being
loaded into a Pachyderm input repository, which helps to
optimize pipeline processing time and resource utilization.

Splitting data helps you to address the following use cases:

- **Optimize data processing**. If you have large files,
  it might take Pachyderm a significant amount of time to process
  them. In addition, Pachyderm considers such a file as a single
  datum, and every time you apply even a minor change to that
  datum, Pachyderm processes the whole file from scratch.

- **Increase diff granularity**. Pachyderm does not create
  per-line diffs that display line-by-line changes. Instead,
  Pachyderm provides per-file diffing. Therefore, if all of your
  data is in one huge file, you might not be able to see what has
  changed in that file. Breaking up the file into smaller chunks
  addresses this issue.

This section provides examples of how you can use the `--split`
flag of `pachctl put file` to break up your data into smaller
chunks, called *split-files*, and what happens when you update
your data.
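
To make the two use cases above more concrete, the following Python
sketch imitates the idea outside of Pachyderm. It is not Pachyderm's
implementation: the helper names `split_lines` and `changed_chunks`, the
four-digit chunk names, and the sample data are all hypothetical. It
splits a small file into two-line chunks and then checks which chunks
differ after one record is updated.

```python
import hashlib


def split_lines(data: str, lines_per_chunk: int) -> dict[str, str]:
    """Split text into chunks of at most `lines_per_chunk` lines.

    Chunk names are zero-padded indexes (a hypothetical naming scheme).
    """
    lines = data.splitlines(keepends=True)
    chunks = {}
    for start in range(0, len(lines), lines_per_chunk):
        name = f"{start // lines_per_chunk:04d}"
        chunks[name] = "".join(lines[start:start + lines_per_chunk])
    return chunks


def changed_chunks(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return the names of chunks that were added, removed, or modified."""
    def digest(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()

    names = sorted(set(old) | set(new))
    return [
        n for n in names
        if n not in old or n not in new or digest(old[n]) != digest(new[n])
    ]


# Six sample records; one of them is modified in the second version.
original = "".join(f"user{i},active\n" for i in range(6))
updated = original.replace("user4,active", "user4,inactive")

before = split_lines(original, lines_per_chunk=2)
after = split_lines(updated, lines_per_chunk=2)

# Only the chunk that holds the modified record is reported.
print(changed_chunks(before, after))
```

With the data stored as one file, any edit would mark the whole file as
changed. With chunks, only the affected chunk differs, so it is the only
part that needs to be reprocessed and inspected. In this sketch, an
in-place edit touches a single chunk, while inserting or deleting a line
would shift every later chunk.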