---
title: Garbage Collection
description: Clean up expired objects using the garbage collection feature in lakeFS.
parent: Garbage Collection
grand_parent: How-To
nav_order: 1
redirect_from:
  - /howto/garbage-collection/index.html
  - /howto/garbage-collection/committed.html
  - /howto/garbage-collection/uncommitted.html
  - /howto/garbage-collection/internals.html
  - /reference/garbage-collection.html
  - /howto/garbage-collection-index.html
  - /howto/garbage-collection.html
  - /reference/retention.html
---

# Garbage Collection

[lakeFS Cloud](https://lakefs.cloud) users enjoy a [managed garbage collection]({% link howto/garbage-collection/managed-gc.md %}) service, and do not need to run this Spark program.
{: .tip }

By default, lakeFS keeps all your objects forever. This allows you to travel back in time to previous versions of your data.
However, sometimes you may want to remove the objects from the underlying storage completely.
Reasons for this include cost reduction and privacy policies.

The garbage collection (GC) job is a Spark program that removes the following from the underlying storage:
1. _Committed objects_ that have been deleted (or replaced) in lakeFS, and are considered expired according to [rules you define](#garbage-collection-rules).
2. _Uncommitted objects_ that are no longer accessible
   * For example, objects deleted before ever being committed.

{% include toc.html %}

## Garbage collection rules

{: .note }
These rules only apply to objects that have been _committed_ at some point.
Without retention rules, only inaccessible _uncommitted_ objects will be removed by the job.

Garbage collection rules determine how long an object is kept in the storage after it is _deleted_ (or replaced) in lakeFS.
For every branch, the GC job retains deleted objects for the number of days defined for the branch.
In the absence of a branch-specific rule, the default rule for the repository is used.
If an object is present in more than one branch ancestry, it is removed only after the retention period has ended for
all relevant branches.

Example GC rules for a repository:
```json
{
  "default_retention_days": 14,
  "branches": [
    {"branch_id": "main", "retention_days": 21},
    {"branch_id": "dev", "retention_days": 7}
  ]
}
```

In the above example, objects will be retained for 14 days after deletion by default.
However, if present in the branch `main`, objects will be retained for 21 days.
Objects present _only_ in the `dev` branch will be retained for 7 days after they are deleted.
### How to configure garbage collection rules

To define retention rules, use the `lakectl` command, the lakeFS web UI, or the [API](/reference/api.html#/retention/set%20garbage%20collection%20rules):

<div class="tabs">
<ul>
  <li><a href="#lakectl-option">CLI</a></li>
  <li><a href="#ui-option">Web UI</a></li>
</ul>
<div markdown="1" id="lakectl-option">

Create a JSON file with your GC rules:

```bash
cat <<EOT > example_repo_gc_rules.json
{
  "default_retention_days": 14,
  "branches": [
    {"branch_id": "main", "retention_days": 21},
    {"branch_id": "dev", "retention_days": 7}
  ]
}
EOT
```

Set the GC rules using `lakectl`:
```bash
lakectl gc set-config lakefs://example-repo -f example_repo_gc_rules.json
```

</div>
<div markdown="1" id="ui-option">
From the lakeFS web UI:

1. Navigate to the main page of your repository.
2. Go to _Settings_ -> _Garbage Collection_.
3. Click _Edit policy_ and paste your GC rules into the text box as JSON.
4. Save your changes.

</div>
</div>
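After setting the rules, it is worth confirming what the repository is actually configured with. A minimal check, assuming your `lakectl` version includes the `gc get-config` subcommand and that `example-repo` is your repository name:

```bash
# Print the garbage collection rules currently set on the repository
lakectl gc get-config lakefs://example-repo
```

The output should reflect the policy you set above; if it does not, re-run `lakectl gc set-config` before scheduling a GC run.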
## How to run the garbage collection job

To run the job, use the following `spark-submit` command, or use your preferred method of running Spark programs.

<div class="tabs">
<ul>
  <li><a href="#aws-option">On AWS</a></li>
  <li><a href="#azure-option">On Azure</a></li>
  <li><a href="#gcp-option">On GCP</a></li>
</ul>
<div markdown="1" id="aws-option">
```bash
spark-submit --class io.treeverse.gc.GarbageCollection \
--packages org.apache.hadoop:hadoop-aws:2.7.7 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
-c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/0.11.0/lakefs-spark-client-assembly-0.11.0.jar \
example-repo us-east-1
```
</div>
<div markdown="1" id="azure-option">

If you want to access your storage using the account key:

```bash
spark-submit --class io.treeverse.gc.GarbageCollection \
--packages org.apache.hadoop:hadoop-aws:3.2.1 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.azure.account.key.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=<AZURE_STORAGE_ACCESS_KEY> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/0.11.0/lakefs-spark-client-assembly-0.11.0.jar \
example-repo
```

Or, if you want to access your storage using an Azure service principal:

```bash
spark-submit --class io.treeverse.gc.GarbageCollection \
--packages org.apache.hadoop:hadoop-aws:3.2.1 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.azure.account.auth.type.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=OAuth \
-c spark.hadoop.fs.azure.account.oauth.provider.type.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
-c spark.hadoop.fs.azure.account.oauth2.client.id.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=<application-id> \
-c spark.hadoop.fs.azure.account.oauth2.client.secret.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=<service-credential-key> \
-c spark.hadoop.fs.azure.account.oauth2.client.endpoint.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=https://login.microsoftonline.com/<directory-id>/oauth2/token \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/0.11.0/lakefs-spark-client-assembly-0.11.0.jar \
example-repo
```

**Notes:**
* On Azure, GC was tested only on Spark 3.3.0, but may work with other Spark and Hadoop versions.
* If the `hadoop-azure` package is not part of your environment, add it to your spark-submit command with `--packages org.apache.hadoop:hadoop-azure:3.2.1`.
* For GC to work on Azure Blob Storage, [soft delete](https://docs.microsoft.com/en-us/azure/storage/blobs/soft-delete-blob-overview) should be disabled.
</div>

<div markdown="1" id="gcp-option">
⚠️ At the moment, only the "mark" phase of garbage collection is supported on GCP.
That is, this program will output a list of expired objects, and you will have to delete them manually.
We have [concrete plans](https://github.com/treeverse/lakeFS/issues/3626) to extend this support to actually delete the objects.
{: .note .note-warning }

```bash
spark-submit --class io.treeverse.gc.GarbageCollection \
--jars https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.google.cloud.auth.service.account.enable=true \
-c spark.hadoop.google.cloud.auth.service.account.json.keyfile=<PATH_TO_JSON_KEYFILE> \
-c spark.hadoop.fs.gs.project.id=<GCP_PROJECT_ID> \
-c spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
-c spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
-c spark.hadoop.lakefs.gc.do_sweep=false \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/0.11.0/lakefs-spark-client-assembly-0.11.0.jar \
example-repo
```

This program will not delete anything.
Instead, it will find all the objects that are safe to delete and save a list of their keys in Parquet format.
The list can then be found under the path:
```
gs://<STORAGE_NAMESPACE>/_lakefs/retention/gc/unified/<RUN_ID>/deleted/
```

Note that this is a path in your Google Cloud Storage bucket, and not in your lakeFS repository.
It is now safe to remove the objects that appear in this list directly from the storage.

</div>
</div>

You will find the list of objects removed by the job in the storage
namespace of the repository. It is saved in Parquet format under `_lakefs/retention/gc/unified/<RUN_ID>/deleted/`.
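You can inspect this report with any tool that reads Parquet. As a quick sketch on AWS, you could list the report files for a given run with the AWS CLI; the storage namespace and run ID below are placeholders you need to fill in:

```bash
# List the Parquet files that make up the report for a specific GC run
aws s3 ls "s3://<STORAGE_NAMESPACE>/_lakefs/retention/gc/unified/<RUN_ID>/deleted/"
```

Loading these files in Spark (or another Parquet reader) lets you review exactly which keys were removed in a given run.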
### Mark and Sweep stages

You can break the job into two stages:
* _Mark_: find objects to remove, without actually removing them.
* _Sweep_: remove the objects.

#### Mark-only mode

To make GC run the mark stage only, add the following to your spark-submit command:
```properties
spark.hadoop.lakefs.gc.do_sweep=false
```

In mark-only mode, GC will write the keys of the expired objects under: `<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/unified/<MARK_ID>/`.
_MARK_ID_ is generated by the job. You can find it in the driver's output:

```
Report for mark_id=gmc6523jatlleurvdm30 path=s3a://example-bucket/_lakefs/retention/gc/unified/gmc6523jatlleurvdm30
```

#### Sweep-only mode

To make GC run the sweep stage only, add the following properties to your spark-submit command:
```properties
spark.hadoop.lakefs.gc.do_mark=false
spark.hadoop.lakefs.gc.mark_id=<MARK_ID> # Replace <MARK_ID> with the identifier you obtained from a previous mark-only run
```

## Garbage collection notes

1. In order for an object to be removed, it must not exist on the HEAD of any branch.
   You should remove stale branches to prevent them from retaining old objects (see the example after this list).
   For example, consider a branch that has been merged to `main` and has become stale.
   An object which is later deleted from `main` will always be present in the stale branch, preventing it from being removed.

1. lakeFS will never delete objects outside your repository's storage namespace.
   In particular, objects that were imported using `lakectl import` or the UI import wizard will not be affected by GC jobs.

1. In cases where deleted objects are brought back to life while a GC job is running (for example, by reverting a commit),
   the objects may or may not be deleted.

1. Garbage collection does not remove any commits: you will still be able to use commits containing removed objects,
   but trying to read these objects from lakeFS will result in a `410 Gone` HTTP status.
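Regarding the first note above, a stale branch can be removed with `lakectl`. A minimal sketch; the repository and branch names are illustrative:

```bash
# Delete a stale branch so objects reachable only through it can expire
lakectl branch delete lakefs://example-repo/old-feature-branch
```

Deleting a branch only removes the branch reference; the underlying objects are removed later by a GC run, once they are no longer reachable and their retention period has passed.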