github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/deploy-manage/manage/data_management.md (about) 1 # Storage Use Optimization 2 3 This section discusses best practices for minimizing the 4 space needed to store your Pachyderm data, increasing 5 the performance of your data processing as related to 6 data organization, and general good ideas when you 7 are using Pachyderm to version/process your data. 8 9 ## Garbage collection 10 11 When a file, commit, repo is deleted, the data is not immediately removed 12 from the underlying storage system, such as S3, for performance and 13 architectural reasons. This is similar to how when you delete a file 14 on your computer, the file is not necessarily wiped from disk immediately. 15 16 To actually remove the data, you may need to manually invoke garbage 17 collection. The easiest way to do it is through `pachctl garbage-collect`. 18 You can start `pachctl garbage-collect` only when no active jobs are 19 running. Also, you need to ensure that all `pachctl put file` operations 20 have been completed. Garbage collection puts the cluster into a read-only 21 mode where no new jobs can be created and no data can be added. 22 23 ## Setting a root volume size 24 25 When planning and configuring your Pachyderm deployment, you need to 26 make sure that each node's root volume is big enough to accommodate 27 your total processing bandwidth. Specifically, you should calculate 28 the bandwidth for your expected running jobs as follows: 29 30 ```shell 31 (storage needed per datum) x (number of datums being processed simultaneously) / (number of nodes) 32 ``` 33 34 Here, the storage needed per datum must be the storage needed for 35 the largest datum you expect to process anywhere on your DAG plus 36 the size of the output files that will be written for that datum. 37 If your root volume size is not large enough, pipelines might fail 38 when downloading the input. The pod would get evicted and 39 rescheduled to a different node, where the same thing might happen 40 (assuming that node had a similar volume). 41 42 !!! note "See Also:" 43 44 [Troubleshoot a pipeline](../../../troubleshooting/pipeline_troubleshooting#all-your-pods-or-jobs-get-evicted)