---
title: Garbage Collection
description: Clean up expired objects using the garbage collection feature in lakeFS.
parent: Garbage Collection
grand_parent: How-To
nav_order: 1
redirect_from:
  - /howto/garbage-collection/index.html
  - /howto/garbage-collection/committed.html
  - /howto/garbage-collection/uncommitted.html
  - /howto/garbage-collection/internals.html
  - /reference/garbage-collection.html
  - /howto/garbage-collection-index.html
  - /howto/garbage-collection.html
  - /reference/retention.html
---

# Garbage Collection

[lakeFS Cloud](https://lakefs.cloud) users enjoy a [managed garbage collection]({% link howto/garbage-collection/managed-gc.md %}) service, and do not need to run this Spark program.
{: .tip }

By default, lakeFS keeps all your objects forever. This allows you to travel back in time to previous versions of your data.
However, sometimes you may want to remove the objects from the underlying storage completely.
Reasons for this include cost reduction and privacy policies.

The garbage collection (GC) job is a Spark program that removes the following from the underlying storage:
1. _Committed objects_ that have been deleted (or replaced) in lakeFS, and are considered expired according to [rules you define](#garbage-collection-rules).
2. _Uncommitted objects_ that are no longer accessible
   * For example, objects deleted before ever being committed.

{% include toc.html %}

## Garbage collection rules

{: .note }
These rules only apply to objects that have been _committed_ at some point.
Without retention rules, only inaccessible _uncommitted_ objects will be removed by the job.

Garbage collection rules determine how long an object is kept in the storage after it is _deleted_ (or replaced) in lakeFS.
For every branch, the GC job retains deleted objects for the number of days defined for the branch.
In the absence of a branch-specific rule, the default rule for the repository is used.
If an object is present in more than one branch ancestry, it is removed only after the retention period has ended for
all relevant branches.

Example GC rules for a repository:
```json
{
  "default_retention_days": 14,
  "branches": [
    {"branch_id": "main", "retention_days": 21},
    {"branch_id": "dev", "retention_days": 7}
  ]
}
```

In the above example, objects will be retained for 14 days after deletion by default.
However, if present in the branch `main`, objects will be retained for 21 days.
Objects present _only_ in the `dev` branch will be retained for 7 days after they are deleted.
### How to configure garbage collection rules

To define retention rules, use the `lakectl` command, the lakeFS web UI, or the [API](/reference/api.html#/retention/set%20garbage%20collection%20rules):

<div class="tabs">
<ul>
  <li><a href="#lakectl-option">CLI</a></li>
  <li><a href="#ui-option">Web UI</a></li>
</ul>
<div markdown="1" id="lakectl-option">

Create a JSON file with your GC rules:

```bash
cat <<EOT > example_repo_gc_rules.json
{
  "default_retention_days": 14,
  "branches": [
    {"branch_id": "main", "retention_days": 21},
    {"branch_id": "dev", "retention_days": 7}
  ]
}
EOT
```

Set the GC rules using `lakectl`:
```bash
lakectl gc set-config lakefs://example-repo -f example_repo_gc_rules.json
```

</div>
<div markdown="1" id="ui-option">
From the lakeFS web UI:

1. Navigate to the main page of your repository.
2. Go to _Settings_ -> _Garbage Collection_.
3. Click _Edit policy_ and paste your GC rules into the text box as JSON.
4. Save your changes.

</div>
</div>
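After setting the rules, it is worth confirming what the repository is actually configured with. A minimal check, assuming your `lakectl` version includes the `gc get-config` subcommand and that `example-repo` is your repository name:

```bash
# Print the garbage collection rules currently set on the repository
lakectl gc get-config lakefs://example-repo
```

The output should reflect the policy you set above; if it does not, re-run `lakectl gc set-config` before scheduling a GC run.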
## How to run the garbage collection job

To run the job, use the following `spark-submit` command, or use your preferred method of running Spark programs.

<div class="tabs">
<ul>
  <li><a href="#aws-option">On AWS</a></li>
  <li><a href="#azure-option">On Azure</a></li>
  <li><a href="#gcp-option">On GCP</a></li>
</ul>
<div markdown="1" id="aws-option">
```bash
spark-submit --class io.treeverse.gc.GarbageCollection \
--packages org.apache.hadoop:hadoop-aws:2.7.7 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
-c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/0.11.0/lakefs-spark-client-assembly-0.11.0.jar \
example-repo us-east-1
```
</div>
<div markdown="1" id="azure-option">

If you want to access your storage using the account key:

```bash
spark-submit --class io.treeverse.gc.GarbageCollection \
--packages org.apache.hadoop:hadoop-aws:3.2.1 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.azure.account.key.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=<AZURE_STORAGE_ACCESS_KEY> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/0.11.0/lakefs-spark-client-assembly-0.11.0.jar \
example-repo
```

Or, if you want to access your storage using an Azure service principal:

```bash
spark-submit --class io.treeverse.gc.GarbageCollection \
--packages org.apache.hadoop:hadoop-aws:3.2.1 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.azure.account.auth.type.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=OAuth \
-c spark.hadoop.fs.azure.account.oauth.provider.type.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
-c spark.hadoop.fs.azure.account.oauth2.client.id.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=<application-id> \
-c spark.hadoop.fs.azure.account.oauth2.client.secret.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=<service-credential-key> \
-c spark.hadoop.fs.azure.account.oauth2.client.endpoint.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=https://login.microsoftonline.com/<directory-id>/oauth2/token \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/0.11.0/lakefs-spark-client-assembly-0.11.0.jar \
example-repo
```

**Notes:**
* On Azure, GC was tested only on Spark 3.3.0, but may work with other Spark and Hadoop versions.
* If the `hadoop-azure` package is not part of your environment, add it to your spark-submit command with `--packages org.apache.hadoop:hadoop-azure:3.2.1`.
* For GC to work on Azure Blob Storage, [soft delete](https://docs.microsoft.com/en-us/azure/storage/blobs/soft-delete-blob-overview) should be disabled.
</div>

<div markdown="1" id="gcp-option">
⚠️ At the moment, only the "mark" phase of garbage collection is supported on GCP.
That is, this program will output a list of expired objects, and you will have to delete them manually.
We have [concrete plans](https://github.com/treeverse/lakeFS/issues/3626) to extend this support to actually delete the objects.
{: .note .note-warning }

```bash
spark-submit --class io.treeverse.gc.GarbageCollection \
--jars https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.google.cloud.auth.service.account.enable=true \
-c spark.hadoop.google.cloud.auth.service.account.json.keyfile=<PATH_TO_JSON_KEYFILE> \
-c spark.hadoop.fs.gs.project.id=<GCP_PROJECT_ID> \
-c spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
-c spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
-c spark.hadoop.lakefs.gc.do_sweep=false \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/0.11.0/lakefs-spark-client-assembly-0.11.0.jar \
example-repo
```

This program will not delete anything.
Instead, it will find all the objects that are safe to delete and save a list of their keys in Parquet format.
The list can then be found under the path:
```
gs://<STORAGE_NAMESPACE>/_lakefs/retention/gc/unified/<RUN_ID>/deleted/
```

Note that this is a path in your Google Cloud Storage bucket, and not in your lakeFS repository.
It is now safe to remove the objects that appear in this list directly from the storage.

</div>
</div>

You will find the list of objects removed by the job in the storage
namespace of the repository. It is saved in Parquet format under `_lakefs/retention/gc/unified/<RUN_ID>/deleted/`.
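You can inspect this report with any tool that reads Parquet. As a quick sketch on AWS, you could list the report files for a given run with the AWS CLI; the storage namespace and run ID below are placeholders you need to fill in:

```bash
# List the Parquet files that make up the report for a specific GC run
aws s3 ls "s3://<STORAGE_NAMESPACE>/_lakefs/retention/gc/unified/<RUN_ID>/deleted/"
```

Loading these files in Spark (or another Parquet reader) lets you review exactly which keys were removed in a given run.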
### Mark and Sweep stages

You can break the job into two stages:
* _Mark_: find objects to remove, without actually removing them.
* _Sweep_: remove the objects.

#### Mark-only mode

To make GC run the mark stage only, add the following to your spark-submit command:
```properties
spark.hadoop.lakefs.gc.do_sweep=false
```

In mark-only mode, GC will write the keys of the expired objects under: `<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/unified/<MARK_ID>/`.
_MARK_ID_ is generated by the job. You can find it in the driver's output:

```
Report for mark_id=gmc6523jatlleurvdm30 path=s3a://example-bucket/_lakefs/retention/gc/unified/gmc6523jatlleurvdm30
```

#### Sweep-only mode

To make GC run the sweep stage only, add the following properties to your spark-submit command:
```properties
spark.hadoop.lakefs.gc.do_mark=false
spark.hadoop.lakefs.gc.mark_id=<MARK_ID> # Replace <MARK_ID> with the identifier you obtained from a previous mark-only run
```

## Garbage collection notes

1. In order for an object to be removed, it must not exist on the HEAD of any branch.
   You should remove stale branches to prevent them from retaining old objects (see the example after this list).
   For example, consider a branch that has been merged to `main` and has become stale.
   An object which is later deleted from `main` will always be present in the stale branch, preventing it from being removed.

1. lakeFS will never delete objects outside your repository's storage namespace.
   In particular, objects that were imported using `lakectl import` or the UI import wizard will not be affected by GC jobs.

1. In cases where deleted objects are brought back to life while a GC job is running (for example, by reverting a commit),
   the objects may or may not be deleted.

1. Garbage collection does not remove any commits: you will still be able to use commits containing removed objects,
   but trying to read these objects from lakeFS will result in a `410 Gone` HTTP status.
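Regarding the first note above, a stale branch can be removed with `lakectl`. A minimal sketch; the repository and branch names are illustrative:

```bash
# Delete a stale branch so objects reachable only through it can expire
lakectl branch delete lakefs://example-repo/old-feature-branch
```

Deleting a branch only removes the branch reference; the underlying objects are removed later by a GC run, once they are no longer reachable and their retention period has passed.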