# Separate GC into two separately runnable steps

## Motivation

Improve the resilience of GC (separation of concerns, pure functions, and traceability) and allow examination
of intermediate results.

## How?

The `lakefs.debug.gc.no_delete` flag lets users run GC and generate intermediate results,
but GC cannot later continue from those results, and retrieving them manually is tedious.
Both problems can be solved by introducing two new flags and replacing `lakefs.debug.gc.no_delete`:

1. `lakefs.debug.gc.no_delete` -> `lakefs.gc.do_mark` (Boolean: default = true): This flag instructs the GC operation to mark, i.e. collect, the addresses that are intended to be deleted.
2. `lakefs.gc.do_sweep` (Boolean: default = true): This flag instructs the GC operation to sweep, i.e. delete, the marked addresses.
3. `lakefs.gc.mark_id` (String): This flag specifies the ID that GC uses when generating intermediate results and writing its outputs.

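To make the defaults concrete, here is a minimal Python sketch of how the three flags could be read from a plain key/value configuration. The `GCFlags` class and the `flags_from_conf` helper are hypothetical names used only for illustration; they are not part of lakeFS, and the actual GC job may read its configuration differently.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GCFlags:
    """Hypothetical container for the three proposed flags."""
    do_mark: bool = True            # lakefs.gc.do_mark, default = true
    do_sweep: bool = True           # lakefs.gc.do_sweep, default = true
    mark_id: Optional[str] = None   # lakefs.gc.mark_id, no default in the proposal


def _parse_bool(raw: str) -> bool:
    """Accept only 'true' or 'false' (case-insensitive)."""
    value = raw.strip().lower()
    if value in ("true", "false"):
        return value == "true"
    raise ValueError(f"expected 'true' or 'false', got {raw!r}")


def flags_from_conf(conf: Mapping[str, str]) -> GCFlags:
    """Read the flags from a string-to-string mapping, applying the defaults above."""
    mark_id = conf.get("lakefs.gc.mark_id", "").strip()
    return GCFlags(
        do_mark=_parse_bool(conf.get("lakefs.gc.do_mark", "true")),
        do_sweep=_parse_bool(conf.get("lakefs.gc.do_sweep", "true")),
        mark_id=mark_id or None,  # an empty mark ID is treated as "not provided"
    )
```

With an empty mapping, `flags_from_conf({})` yields the defaults, which corresponds to a complete mark-and-sweep run.
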
## Flag permutations

#### Only Mark

If `lakefs.gc.do_sweep=false`, GC will run, collect the addresses to be deleted, and write them to
`STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=RANDOM_VALUE/`.

#### Only Mark and Mark ID

If `lakefs.gc.do_sweep=false` and a `lakefs.gc.mark_id` is provided,
GC will run, collect the addresses to be deleted, and use the provided `lakefs.gc.mark_id` as the `MARK_ID` in the output path,
i.e. `STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=MARK_ID/`.

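The two mark-only cases differ only in where the mark ID comes from. A minimal sketch of the path resolution follows; the `marked_addresses_path` helper is hypothetical, and using `uuid4` for `RANDOM_VALUE` is an assumption, since the proposal does not say how the random value is generated.

```python
import uuid
from typing import Optional

# Path layout taken from the proposal text.
ADDRESSES_PREFIX = "_lakefs/retention/gc/addresses"


def marked_addresses_path(storage_namespace: str, mark_id: Optional[str] = None) -> str:
    """Return the location that holds the marked addresses for one mark ID."""
    # Assumption: a random hex string stands in for RANDOM_VALUE when no ID is given.
    resolved = mark_id or uuid.uuid4().hex
    return f"{storage_namespace.rstrip('/')}/{ADDRESSES_PREFIX}/mark_id={resolved}/"
```

For example, `marked_addresses_path("s3://bucket/repo", "m1")` returns `s3://bucket/repo/_lakefs/retention/gc/addresses/mark_id=m1/`, while omitting the second argument produces a fresh random suffix on each call.
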
#### Only Sweep and Mark ID

If `lakefs.gc.do_mark=false` and a `lakefs.gc.mark_id` is provided, GC will skip marking and delete the objects mapped
from the addresses found under `STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=MARK_ID/`.

#### Only Sweep

If `lakefs.gc.do_mark=false`, and `lakefs.gc.mark_id` is not provided (or empty), the operation will fail.

#### Mark ID

If only `lakefs.gc.mark_id` is provided, GC will use it to write intermediate results to `STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=MARK_ID/`,
and will do a complete run (marking and sweeping).

#### No Mark and No Sweep

If both `lakefs.gc.do_mark=false` and `lakefs.gc.do_sweep=false`, the operation will fail.

#### No flags are provided

GC will do a complete run, marking and sweeping, while writing intermediate results to `STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=RANDOM_VALUE/`.

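The permutations in this section can be summarized as a small decision function. This is only a sketch of the rules as stated here: the `Mode` enum and `resolve_mode` are hypothetical names, not lakeFS code.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    """Hypothetical labels for the three valid outcomes."""
    MARK_ONLY = "mark"
    SWEEP_ONLY = "sweep"
    MARK_AND_SWEEP = "mark+sweep"


def resolve_mode(do_mark: bool = True, do_sweep: bool = True,
                 mark_id: Optional[str] = None) -> Mode:
    """Map a flag combination to a run mode, or fail as the proposal requires."""
    if not do_mark and not do_sweep:
        # "No Mark and No Sweep": there is nothing to do.
        raise ValueError("both do_mark and do_sweep are false")
    if not do_mark and not mark_id:
        # "Only Sweep": sweeping needs the ID of an earlier mark step.
        raise ValueError("sweep-only requires a non-empty mark_id")
    if do_mark and do_sweep:
        return Mode.MARK_AND_SWEEP
    return Mode.MARK_ONLY if do_mark else Mode.SWEEP_ONLY
```

Note that `mark_id` affects validity only in the sweep-only case; in the other modes it merely selects the output location.
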
### Note

Users are responsible for managing the values they pass to `lakefs.gc.mark_id`.
It should be made clear to users that running GC repeatedly with the same mark ID may overwrite earlier results.

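One way a user could avoid reusing a mark ID by accident is to derive each ID from a label, a UTC timestamp, and a short random suffix. The sketch below is an illustration only: `new_mark_id` and its ID format are assumptions, and the proposal does not prescribe any naming scheme.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def new_mark_id(label: str = "gc", now: Optional[datetime] = None) -> str:
    """Build a mark ID that is very unlikely to repeat across runs (hypothetical format)."""
    # Normalize to UTC so the trailing "Z" in the timestamp is accurate.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    # The random suffix separates runs that start within the same second.
    return f"{label}-{stamp}-{uuid.uuid4().hex[:8]}"
```

The resulting string would be passed as `lakefs.gc.mark_id` to the mark step and then reused, unchanged, for the matching sweep step.
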
## Current behavior (and why we don't need the `run_id=...` with addresses)

Currently, GC writes intermediate results to `STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/run_id=RUN_ID/`.
There is no need to use run IDs here. It will be possible to limit the work of GC by using run IDs, but this will be done with **commits** rather than addresses.
For addresses, a mark ID seems to make more sense than a run ID.