# Storage Use Optimization

This section describes best practices for minimizing the
space needed to store your Pachyderm data, improving the
performance of your data processing through better data
organization, and general recommendations for versioning
and processing your data with Pachyderm.

## Garbage collection

When a file, commit, or repo is deleted, the data is not immediately
removed from the underlying storage system, such as S3, for performance
and architectural reasons. This is similar to how, when you delete a
file on your computer, the file is not necessarily wiped from disk
immediately.

To actually remove the data, you may need to manually invoke garbage
collection. The easiest way to do this is through `pachctl garbage-collect`.
You can start `pachctl garbage-collect` only when no active jobs are
running. You also need to ensure that all `pachctl put file` operations
have completed. Garbage collection puts the cluster into a read-only
mode in which no new jobs can be created and no data can be added.
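
For example, a minimal sketch of this workflow might look like the
following, assuming that `pachctl list job` shows no jobs in a running
state before you proceed:

```shell
# Confirm that no active jobs are running and that all
# `pachctl put file` operations have completed.
pachctl list job

# Reclaim the space that deleted files, commits, and repos
# still occupy in the underlying storage system.
pachctl garbage-collect
```
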
## Setting a root volume size

When planning and configuring your Pachyderm deployment, you need to
make sure that each node's root volume is big enough to accommodate
your total processing bandwidth. Specifically, you should calculate
the bandwidth for your expected running jobs as follows:

```shell
(storage needed per datum) x (number of datums being processed simultaneously) / (number of nodes)
```

Here, the storage needed per datum must be the storage needed for
the largest datum you expect to process anywhere in your DAG plus
the size of the output files that will be written for that datum.
If your root volume size is not large enough, pipelines might fail
when downloading the input. The pod gets evicted and rescheduled to
a different node, where the same thing might happen again (assuming
that node has a similarly sized volume).
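
For example, with hypothetical numbers: if the largest datum in your
DAG is 2 GB, that datum writes up to 1 GB of output, 16 datums are
processed simultaneously, and your cluster has 4 nodes, each node's
root volume needs roughly:

```shell
(2 GB + 1 GB) x 16 / 4 = 12 GB per node
```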

!!! note "See Also:"

    [Troubleshoot a pipeline](../../../troubleshooting/pipeline_troubleshooting#all-your-pods-or-jobs-get-evicted)