# Storage Use Optimization

This section describes best practices for minimizing the
space needed to store your Pachyderm data, improving the
performance of your data processing through better data
organization, and general recommendations for versioning
and processing your data with Pachyderm.

## Garbage collection

When a file, commit, or repo is deleted, the data is not immediately
removed from the underlying storage system, such as S3, for performance
and architectural reasons. This is similar to how, when you delete a
file on your computer, the file is not necessarily wiped from disk
immediately.

To actually remove the data, you may need to manually invoke garbage
collection. The easiest way to do this is through `pachctl garbage-collect`.
You can start `pachctl garbage-collect` only when no active jobs are
running. You also need to ensure that all `pachctl put file` operations
have completed. Garbage collection puts the cluster into a read-only
mode in which no new jobs can be created and no data can be added.
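
For example, a minimal sketch of this workflow might look like the
following, assuming that `pachctl list job` shows no jobs in a running
state before you proceed:

```shell
# Confirm that no active jobs are running and that all
# `pachctl put file` operations have completed.
pachctl list job

# Reclaim the space that deleted files, commits, and repos
# still occupy in the underlying storage system.
pachctl garbage-collect
```
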
## Setting a root volume size

When planning and configuring your Pachyderm deployment, you need to
make sure that each node's root volume is big enough to accommodate
your total processing bandwidth. Specifically, you should calculate
the bandwidth for your expected running jobs as follows:

```shell
(storage needed per datum) x (number of datums being processed simultaneously) / (number of nodes)
```

Here, the storage needed per datum must be the storage needed for
the largest datum you expect to process anywhere in your DAG plus
the size of the output files that will be written for that datum.
If your root volume size is not large enough, pipelines might fail
when downloading the input. The pod gets evicted and rescheduled to
a different node, where the same thing might happen again (assuming
that node has a similarly sized volume).
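
For example, with hypothetical numbers: if the largest datum in your
DAG is 2 GB, that datum writes up to 1 GB of output, 16 datums are
processed simultaneously, and your cluster has 4 nodes, each node's
root volume needs roughly:

```shell
(2 GB + 1 GB) x 16 / 4 = 12 GB per node
```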

!!! note "See Also:"

    [Troubleshoot a pipeline](../../../troubleshooting/pipeline_troubleshooting#all-your-pods-or-jobs-get-evicted)