# Delete Data

If *bad* data was committed into a Pachyderm input repository, your
pipeline might result in an error. In this case, you might need to
delete this data to resolve the issue. Depending on the nature of
the bad data and whether or not the bad data is in the HEAD of
the branch, you can perform one of the following actions:

- [Delete the HEAD of a Branch](#delete-the-head-of-a-branch).
If the incorrect data was added in the latest commit and no additional
data was committed since then, follow the steps in this section to fix
the HEAD of the corrupted branch.
- [Delete Old Commits](#delete-old-commits). If you added more data
to the same branch after committing the incorrect data, follow the
steps in this section to delete corrupted files.
- [Delete Sensitive Data](#delete-sensitive-data). If the bad
commit included sensitive data that you need to immediately and completely
erase from Pachyderm, follow the steps in this section to purge data.

## Delete the HEAD of a Branch

If you have just committed incorrect, corrupt, or otherwise bad
data to a branch in a Pachyderm repository, the HEAD of your branch,
that is, its latest commit, is bad. Users who read from that commit
might be misled, and pipelines subscribed to it might fail or
produce bad downstream output. You can solve this issue by running
the `pachctl delete commit` command.

To fix a broken HEAD, run the following command:

```shell
pachctl delete commit <repo>@<branch-or-commit-id>
```

When you delete a bad commit, Pachyderm performs the following actions:

- Deletes the commit metadata.
- Changes HEADs of all the branches that had the bad commit as their
  HEAD to the bad commit's parent. If the bad commit does not have
  a parent, Pachyderm sets the branch's HEAD to `nil`.
- If the bad commit has children, sets their parent to the deleted
  commit's parent. If the deleted commit does not have a parent, the
  children's parents are set to `nil`.
- Deletes all the jobs that were triggered by the bad commit. Pachyderm
  also interrupts all running jobs, including not only the
  jobs that use the bad commit as a direct input but also the ones farther
  downstream in your DAG.
- Deletes the output commits from the deleted jobs. All the actions
  listed above are applied to those commits as well.
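
Because these effects cascade to downstream jobs and output commits, it
can help to look at the commit that the branch HEAD points to before
deleting it. The following is a minimal sketch, not part of Pachyderm:
`delete_head` is a hypothetical helper name, and the sketch assumes that
`pachctl inspect commit` is available in your version and that `pachctl`
is already connected to your cluster.

```shell
# Hypothetical helper (not a pachctl command): show the HEAD commit of a
# branch and delete it only after an explicit confirmation.
delete_head() {
  repo="$1"
  branch="$2"

  # Print the metadata of the commit that the branch currently points to.
  pachctl inspect commit "${repo}@${branch}" || return 1

  printf 'Delete the HEAD of %s@%s? [y/N] ' "$repo" "$branch"
  read -r answer
  case "$answer" in
    y|Y) pachctl delete commit "${repo}@${branch}" ;;
    *)   echo "Aborted; nothing was deleted." ;;
  esac
}
```

For example, `delete_head images master` would inspect and then, on
confirmation, delete the HEAD of the `master` branch in a repository
named `images`. Both names are placeholders.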

## Delete Old Commits

If you have committed more data to the branch after the bad data
was added, you can try to delete the commit as described in
[Delete the HEAD of a Branch](#delete-the-head-of-a-branch).
However, unless the subsequent commits overwrote or deleted the
bad data, the bad data might still be present in the
child commits. Deleting a commit does not modify its children.
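
One way to see whether later commits still carry the bad file is to probe
each of them for the path. The sketch below is an illustration under
assumptions: `commits_with_file` is a hypothetical helper, the commit IDs
are supplied by you (for example, copied from the output of
`pachctl list commit <repo>@<branch>`), and the sketch relies on
`pachctl inspect file` exiting with a non-zero status when the file is
absent from a commit.

```shell
# Hypothetical helper: print every given commit that still contains a path.
# Usage: commits_with_file <repo> <path> <commit-id>...
commits_with_file() {
  repo="$1"
  file_path="$2"
  shift 2
  for commit in "$@"; do
    # Assumption: inspect file fails when the path is not in the commit.
    if pachctl inspect file "${repo}@${commit}:${file_path}" >/dev/null 2>&1; then
      echo "$commit"
    fi
  done
}
```

Any commit ID that the helper prints still exposes the file and is a
candidate for the cleanup procedure in this section.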

In Git terms, `pachctl delete commit` is equivalent to squashing a
commit out of existence, such as with the `git reset --hard` command.
The `delete commit` command is not equivalent to reverting a
commit in Git. The reason for this behavior is that the semantics
of revert can become ambiguous when the files that are being reverted
have been otherwise modified. Because Pachyderm is a centralized system
and the volume of data that you typically store in Pachyderm is
large, merge conflicts can quickly become untenable. Therefore,
Pachyderm prevents merge conflicts entirely.

To resolve issues with commits that are not at the tip of the
branch, you can try to delete the child commits. However,
those commits might also contain data that you want to keep.

To delete a file in an older commit, complete the following steps:

1. Start a new commit:

   ```shell
   pachctl start commit <repo>@<branch>
   ```

1. Delete all corrupted files from the newly opened commit:

   ```shell
   pachctl delete file <repo>@<branch-or-commit-id>:/path/to/files
   ```

1. Finish the commit:

   ```shell
   pachctl finish commit <repo>@<branch>
   ```

1. Delete the initial bad commit and all its children up to
   the newly finished commit.

   Depending on how you use Pachyderm, the final step might be
   optional. After you finish the commit, the HEADs of all your
   branches converge to correct results as downstream jobs finish.
   However, deleting those commits cleans up your
   commit history and ensures that the errant data is not
   available when non-HEAD versions of the data are read.
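
The steps above can be sketched as a small script. This is an
illustration rather than an official tool: `remove_files` and
`delete_commits` are hypothetical helper names, and the repository,
branch, paths, and commit IDs are values that you supply.

```shell
# Hypothetical helper: remove one or more paths from a branch in a new
# commit (steps 1 to 3).
# Usage: remove_files <repo> <branch> <path>...
remove_files() {
  repo="$1"
  branch="$2"
  shift 2
  rc=0

  pachctl start commit "${repo}@${branch}" || return 1
  for file_path in "$@"; do
    # Record a failure but keep going so that the commit still gets finished.
    pachctl delete file "${repo}@${branch}:${file_path}" || rc=1
  done
  # Always finish, so that the branch is not left with an open commit.
  pachctl finish commit "${repo}@${branch}" || rc=1
  return "$rc"
}

# Hypothetical helper: delete the listed commits in the order given (step 4).
# Usage: delete_commits <repo> <commit-id>...
delete_commits() {
  repo="$1"
  shift
  for commit in "$@"; do
    pachctl delete commit "${repo}@${commit}" || return 1
  done
}
```

For example, you might run `remove_files images master /bad.csv` and then
`delete_commits images <bad-commit-id> <child-commit-id>`, where `images`,
`master`, and `/bad.csv` are placeholders.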

## Delete Sensitive Data

When you delete data as described in [Delete Old Commits](#delete-old-commits),
Pachyderm does not immediately delete it from the physical disk. Instead,
Pachyderm deletes references to the underlying data and later
performs garbage collection. That is when the data is truly erased from the
disk.

If you have accidentally committed sensitive data and you need to
ensure that it is immediately erased and inaccessible, complete the
following steps:

1. Delete all the references to data as described in
   [Delete Old Commits](#delete-old-commits).

1. Run `garbage-collect`:

   ```shell
   pachctl garbage-collect
   ```

   To make garbage collection more comprehensive, increase the
   amount of memory that is used during the garbage collection
   operation by specifying the `--memory` flag. The default value
   is 10 MB.
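
As an illustration of passing the flag, the sketch below wraps the command
in a hypothetical `gc_with_memory` helper. The size format is an
assumption: the sketch presumes that `--memory` accepts a human-readable
size string such as `100MB`. Check `pachctl garbage-collect --help` for
the format that your version accepts.

```shell
# Hypothetical helper: run garbage collection with a larger memory budget.
# Assumption: --memory takes a size string such as "100MB".
gc_with_memory() {
  size="${1:-100MB}"
  pachctl garbage-collect --memory "$size"
}
```

Calling `gc_with_memory` with no argument uses the assumed `100MB` value,
and `gc_with_memory 1GB` passes a different size.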