# Separate GC into two separately runnable steps

## Motivation

Improve the resilience of GC (separation of concerns, pure functions, and traceability) and allow examination
of intermediate results.

## How?

The `lakefs.debug.gc.no_delete` flag lets users run GC and generate intermediate results,
yet GC cannot continue from those results, and retrieving them manually is quite tedious.
Both problems can be solved by introducing two flags and changing `lakefs.debug.gc.no_delete`:

1. `lakefs.debug.gc.no_delete` -> `lakefs.gc.do_mark` (Boolean: default = true): This flag instructs the GC operation to mark, i.e. collect, the addresses that are intended to be deleted.
2. `lakefs.gc.do_sweep` (Boolean: default = true): This flag instructs the GC operation to sweep, i.e. delete, the marked addresses.
3. `lakefs.gc.mark_id` (String): This flag specifies which ID the GC will use to generate intermediate results and write outputs to.

## Flag permutations

#### Only Mark

If `lakefs.gc.do_sweep=false`, the GC will run, collect the addresses to be deleted, and write them to
`STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=RANDOM_VALUE/`.

#### Only Mark and Mark ID

If `lakefs.gc.do_sweep=false` and a `lakefs.gc.mark_id` is provided,
the GC will run, collect the addresses to be deleted, and use the `lakefs.gc.mark_id` as the `MARK_ID` in
`STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=MARK_ID/`.

#### Only Sweep and Mark ID

If `lakefs.gc.do_mark=false`, the GC will use the provided `lakefs.gc.mark_id` and delete the objects mapped
from the addresses found under `STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=MARK_ID/`.
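
The three cases above all read from or write to the same kind of location, keyed by the mark ID. A minimal sketch of how that location could be derived is shown here. The function name `addresses_path` and the use of a UUID for `RANDOM_VALUE` are assumptions made for illustration; the design does not specify how the random value is generated.

```python
import uuid

# Path segment taken from the design text.
ADDRESSES_PREFIX = "_lakefs/retention/gc/addresses"


def addresses_path(storage_namespace: str, mark_id: str = "") -> str:
    """Return the location of the marked addresses for a mark ID.

    An empty mark_id stands for "not provided". In that case a random
    value is generated (a UUID here, which is an assumption).
    """
    effective_id = mark_id or uuid.uuid4().hex
    namespace = storage_namespace.rstrip("/")
    return f"{namespace}/{ADDRESSES_PREFIX}/mark_id={effective_id}/"
```

For example, `addresses_path("s3://bucket/repo", "my-mark")` yields `s3://bucket/repo/_lakefs/retention/gc/addresses/mark_id=my-mark/`, which is the path a later sweep-only run with the same mark ID would read from.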
#### Only Sweep

If `lakefs.gc.do_mark=false` and `lakefs.gc.mark_id` is not provided (or is empty), the operation will fail.

#### Mark ID

If only `lakefs.gc.mark_id` is provided, GC will use it to write intermediate results to `STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=MARK_ID/`,
and will do a complete run (marking and sweeping).

#### No Mark and No Sweep

If both `lakefs.gc.do_mark=false` and `lakefs.gc.do_sweep=false`, the operation will fail.

#### No flags are provided

GC will do a complete run, including marking and sweeping (while writing intermediate results to `STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=RANDOM_VALUE/`).

### Note

Managing the `lakefs.gc.mark_id` flag is the users' responsibility.
It should be made clear to users that running GC repeatedly with the same mark ID may overwrite earlier results.

## Current behavior (and why we don't need the `run_id=...` with addresses)

Currently, GC writes intermediate results to `STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/run_id=RUN_ID/`.
There is no need to use run IDs here. It will be possible to limit the work of GC by using run IDs, but this will be done with **commits** rather than addresses.
For addresses, a mark ID therefore seems to make more sense than a run ID.
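
Taken together, the flag permutations described in this document amount to a small decision table. The sketch below shows one way to express it, assuming the three flags have already been parsed into Python values. `GCPlan` and `plan_gc` are hypothetical names used only for illustration and are not part of the GC client; the UUID used for the random mark ID is likewise an assumption.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class GCPlan:
    mark: bool     # collect the addresses to delete
    sweep: bool    # delete the objects behind the marked addresses
    mark_id: str   # ID under which intermediate results are stored


def plan_gc(do_mark: bool = True, do_sweep: bool = True, mark_id: str = "") -> GCPlan:
    """Validate a flag permutation and return the steps to run."""
    if not do_mark and not do_sweep:
        # "No Mark and No Sweep": nothing to do, so fail.
        raise ValueError("do_mark and do_sweep cannot both be false")
    if not do_mark and not mark_id:
        # "Only Sweep": sweeping needs existing marked addresses.
        raise ValueError("a sweep-only run requires a mark_id")
    # When marking without a provided mark_id, generate a random one
    # (a UUID here, which is an assumption).
    return GCPlan(mark=do_mark, sweep=do_sweep, mark_id=mark_id or uuid.uuid4().hex)
```

With this shape, a two-step run is `plan_gc(do_sweep=False, mark_id="m1")` followed later by `plan_gc(do_mark=False, mark_id="m1")`, while `plan_gc()` with no arguments corresponds to the complete run with a random mark ID.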