github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/design/open/gc-on-parquet.md

## Garbage collection using Parquet format

### Motivation

Garbage collection (GC) in lakeFS struggles at large scale.

### Why Parquet?

Parquet is more than a file format. It's an actively maintained open-source project that aims to provide exactly what we're looking for here: efficient data retrieval. GC is essentially a huge anti-join, and it may be very difficult to achieve similar optimizations in our dedicated format.

### Steps

#### 1. Sync
Ranges of the repository are translated into Parquet files. These files are stored in a new metadata location under the repository's storage namespace.

#### 2. Sanity
Verify that all the ranges required for this run are present as Parquet files, and that the number of entries in each one matches the number of entries in the original range.
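
The check above can be sketched as follows. This is a minimal illustration under stated assumptions, not lakeFS code: `expected_counts` and `parquet_counts` are hypothetical mappings from range ID to entry count. The first stands in for the counts recorded in the repository's metadata, the second for the row counts read from the translated Parquet files (a Parquet file's footer metadata records its row count).

```python
from dataclasses import dataclass, field


@dataclass
class SanityReport:
    """Outcome of comparing the required ranges with their Parquet copies."""
    missing: list = field(default_factory=list)     # range IDs with no Parquet file
    mismatched: dict = field(default_factory=dict)  # range ID -> (expected, actual)

    @property
    def ok(self) -> bool:
        # The run may proceed only if nothing is missing or mismatched.
        return not self.missing and not self.mismatched


def check_ranges(expected_counts: dict, parquet_counts: dict) -> SanityReport:
    """Check every required range; extra Parquet files are ignored."""
    report = SanityReport()
    for range_id, expected in sorted(expected_counts.items()):
        actual = parquet_counts.get(range_id)
        if actual is None:
            report.missing.append(range_id)
        elif actual != expected:
            report.mismatched[range_id] = (expected, actual)
    return report
```

If `check_ranges(...).ok` is false, the run should stop before any deletion is attempted, since an incomplete copy could make live objects look unreferenced.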

#### 3. Start
Run an anti-join that extracts all the addresses that need to be deleted.
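
To illustrate what the anti-join computes, here is a minimal in-memory sketch. It assumes the join is between a set of candidate addresses and the set of addresses that must be kept; this proposal does not spell out the exact inputs, so `candidates` and `active` are hypothetical names. A real run would execute the join over the Parquet files in a query engine rather than in memory.

```python
from typing import Iterable, Iterator


def anti_join(candidates: Iterable[str], active: Iterable[str]) -> Iterator[str]:
    """Left anti-join: yield each candidate address with no match in `active`."""
    active_set = set(active)  # build side: addresses that must be kept
    for address in candidates:  # probe side: addresses considered for deletion
        if address not in active_set:
            yield address


if __name__ == "__main__":
    to_delete = list(anti_join(["a", "b", "c", "d"], ["b", "d", "e"]))
    print(to_delete)
```

The sketch preserves the order of `candidates` and does not deduplicate them; an address that appears in several ranges would be emitted once per occurrence.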

### Considerations
1. Storage: this approach uses more of the user's storage. I don't think that's an issue, since metadata is usually very small compared to the data. In the future we can also optimize the set of ranges we actually copy there.
1. Risk: we are using a copy of the metadata instead of the actual metadata. We need to be careful not to miss anything, since any entry missing from the copy may cause innocent objects to be deleted. This is a risk that any GC solution needs to take into account.
1. Ops burden: we will have to maintain and monitor the copying of the metadata. If it fails, we need to recover it.