# Split Data

Pachyderm can split your data while it is being loaded into a
Pachyderm input repository, which helps to optimize pipeline
processing time and resource utilization.

Splitting data helps you to address the following use cases:

- **Optimize data processing**. If you have large files,
  it might take Pachyderm a significant amount of time to process
  them. In addition, Pachyderm considers such a file a single
  datum, and every time you apply even a minor change to that
  datum, Pachyderm processes the whole file from scratch.

- **Increase diff granularity**. Pachyderm does not create
  per-line diffs that display line-by-line changes. Instead,
  Pachyderm provides per-file diffing. Therefore, if all of your
  data is in one huge file, you might not be able to see what has
  changed in that file. Breaking up the file into smaller chunks
  addresses this issue.

This section provides examples of how you can use the `--split`
flag to break up your data into smaller chunks, called
*split-files*, and describes what happens when you update your data.
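
To make the granularity argument concrete, the following is a minimal Python sketch of the idea, not Pachyderm's implementation. The helper names `split_lines` and `changed_chunks` are hypothetical and exist only for this illustration. The sketch compares two versions of a small CSV file, first as one unsplit file and then as per-line chunks, and reports which units changed.

```python
import hashlib


def split_lines(text, lines_per_chunk=1):
    """Break text into chunks of at most `lines_per_chunk` lines each.

    Hypothetical helper: mimics the idea of line-based splitting only.
    """
    lines = text.splitlines(keepends=True)
    return [
        "".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def changed_chunks(old_chunks, new_chunks):
    """Return the indices of chunks whose content differs between versions."""

    def digest(chunks, i):
        # A missing chunk (file grew or shrank) counts as a change.
        if i >= len(chunks):
            return None
        return hashlib.sha256(chunks[i].encode("utf-8")).hexdigest()

    count = max(len(old_chunks), len(new_chunks))
    return [
        i for i in range(count)
        if digest(old_chunks, i) != digest(new_chunks, i)
    ]


# Two versions of the same data; only the second row differs.
old = "alice,1\nbob,2\ncarol,3\ndave,4\n"
new = "alice,1\nbob,20\ncarol,3\ndave,4\n"

# Unsplit: the whole file is one unit, so the whole file counts as changed.
print(changed_chunks([old], [new]))  # [0]

# Split per line: only the chunk holding the edited row counts as changed.
print(changed_chunks(split_lines(old), split_lines(new)))  # [1]
```

With the unsplit file, a one-row edit marks the only unit as changed, so everything would be reprocessed. With per-line chunks, only one of the four units is marked, which is the effect that splitting aims for with both processing and diffing.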