# Delete Data

If *bad* data was committed into a Pachyderm input repository, your
pipeline might fail. In this case, you might need to
delete this data to resolve the issue. Depending on the nature of
the bad data and whether or not the bad data is in the HEAD of
the branch, you can perform one of the following actions:

- [Delete the HEAD of a Branch](#delete-the-head-of-a-branch).
  If the incorrect data was added in the latest commit and no additional
  data was committed since then, follow the steps in this section to fix
  the HEAD of the corrupted branch.
- [Delete Old Commits](#delete-old-commits). If, after
  committing the incorrect data, you have added more data to the same
  branch, follow the steps in this section to delete corrupted files.
- [Delete Sensitive Data](#delete-sensitive-data). If the bad
  commit included sensitive data that you need to immediately and completely
  erase from Pachyderm, follow the steps in this section to purge data.

## Delete the HEAD of a Branch

If you have just committed incorrect, corrupt, or otherwise bad
data to a branch in a Pachyderm repository, the HEAD of your branch,
that is, the latest commit, is bad. Users who read from that commit
might be misled, and pipelines subscribed to it might fail or
produce bad downstream output. You can solve this issue by running
the `pachctl delete commit` command.

To fix a broken HEAD, run the following command:

```shell
pachctl delete commit <repo>@<branch-or-commit-id>
```

When you delete a bad commit, Pachyderm performs the following actions:

- Deletes the commit metadata.
- Changes the HEADs of all the branches that had the bad commit as their
  HEAD to the bad commit's parent. If the bad commit does not have
  a parent, Pachyderm sets the branch's HEAD to `nil`.
- If the bad commit has children, sets their parent to the deleted commit's
  parent. If the deleted commit does not have a parent, the
  children's parents are set to `nil`.
- Deletes all the jobs that were triggered by the bad commit. Also,
  Pachyderm interrupts all running jobs, including not only the
  jobs that use the bad commit as a direct input but also the ones farther
  downstream in your DAG.
- Deletes the output commits from the deleted jobs. All the actions
  listed above are applied to those commits as well.

## Delete Old Commits

If you have committed more data to the branch after the bad data
was added, you can try to delete the commit as described in
[Delete the HEAD of a Branch](#delete-the-head-of-a-branch).
However, unless the subsequent commits overwrote or deleted the
bad data, the bad data might still be present in the
children commits. Deleting a commit does not modify its children.

In Git terms, `pachctl delete commit` is equivalent to squashing a
commit out of existence, such as with the `git reset --hard` command.
The `delete commit` command is not equivalent to reverting a
commit in Git. The reason for this
behavior is that the semantics of revert can become ambiguous
when the files that are being reverted have been
otherwise modified. Because Pachyderm is a centralized system
and the volume of data that you typically store in Pachyderm is
large, merge conflicts can quickly become untenable. Therefore,
Pachyderm prevents merge conflicts entirely.

To resolve issues with commits that are not at the tip of the
branch, you can try to delete the children commits. However,
those commits might also contain data that you want to
keep.

To delete a file in an older commit, complete the following steps:

1. Start a new commit:

   ```shell
   pachctl start commit <repo>@<branch>
   ```

1. Delete all corrupted files from the newly opened commit:

   ```shell
   pachctl delete file <repo>@<branch-or-commit-id>:/path/to/files
   ```

1. Finish the commit:

   ```shell
   pachctl finish commit <repo>@<branch>
   ```

1. Delete the initial bad commit and all its children up to
   the newly finished commit.

Depending on how you use Pachyderm, the final step might be
optional. After you finish the commit, the HEADs of all your
branches converge to correct results as downstream jobs finish.
However, deleting those commits cleans up your
commit history and ensures that the errant data is not
available when non-HEAD versions of the data are read.

## Delete Sensitive Data

When you delete data as described in [Delete Old Commits](#delete-old-commits),
Pachyderm does not immediately delete it from the physical disk. Instead,
Pachyderm deletes references to the underlying data and later
performs garbage collection. That is when the data is truly erased from the
disk.

If you have accidentally committed sensitive data and you need to
ensure that it is immediately erased and inaccessible, complete the
following steps:

1. Delete all the references to data as described in
   [Delete Old Commits](#delete-old-commits).

1. Run `garbage-collect`:

   ```shell
   pachctl garbage-collect
   ```

   To make garbage collection more comprehensive, increase the
   amount of memory that is used during the garbage collection
   operation by specifying the `--memory` flag. The default value
   is 10 MB.
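
The first three steps in [Delete Old Commits](#delete-old-commits) can be
wrapped in a small shell function. The following is a minimal sketch, not an
official Pachyderm script: the `run` and `remove_bad_files` helpers and the
`DRY_RUN` variable are hypothetical names introduced here for illustration.
The sketch uses only the `pachctl start commit`, `pachctl delete file`, and
`pachctl finish commit` commands shown above.

```shell
#!/bin/sh
# Sketch under stated assumptions: run, remove_bad_files, and DRY_RUN are
# hypothetical helpers, not part of pachctl.

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: remove_bad_files <repo> <branch> <path>...
# Opens a new commit on the branch, deletes each listed file from it,
# and finishes the commit. If a delete fails, the function returns early
# and the commit stays open so that you can inspect it.
remove_bad_files() {
    repo="$1"
    branch="$2"
    shift 2

    run pachctl start commit "${repo}@${branch}" || return 1
    for path in "$@"; do
        run pachctl delete file "${repo}@${branch}:${path}" || return 1
    done
    run pachctl finish commit "${repo}@${branch}"
}
```

For example, with a hypothetical `images` repository, running
`DRY_RUN=1 remove_bad_files images master /bad.png` prints the three
`pachctl` commands without touching the cluster, which makes it easy to
review them first. Deleting the old commits and, for sensitive data, running
`pachctl garbage-collect` remain separate manual steps.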