# Uncommitted Garbage Collection - Execution Plan

Uncommitted Garbage Collection [Proposal](https://github.com/treeverse/lakeFS/blob/master/design/accepted/gc_plus/uncommitted-gc.md)

## Overview

Implementing uncommitted GC can be divided into 3 main changes:
1. lakeFS additions and modifications
2. Implementing uncommitted GC logic on the client side
3. Modifying existing GC

All changes in lakeFS are independent, so work on them can start immediately.
Some GC changes depend on lakeFS changes; these are prioritized as part of the plan.

All tasks are tracked under the [GC+ label](https://github.com/treeverse/lakeFS/labels/GC%2B) in GitHub.

## Changes on lakeFS

The following is a list of changes in lakeFS in order of priority, according to the dependency constraints:

1. Implement PrepareUncommittedForGC API
    - Create files listing the uncommitted objects and save them in a designated GC path on the object store
    - After this is done, we can implement the new uncommitted logic in GC

2. Modify [Get/Link]PhysicalAddress
    - Issue a validation token with an expiry time on GetPhysicalAddress and add it to the valid token list
    - LinkPhysicalAddress validates the token and removes it from the valid token list

3. Add tracking of copied objects in ref-store
    - On each copy operation, add an entry to the copied object table in ref-store
    - Delete entries periodically or during defined operations (commit/delete/reset branch)

4. Implement CopyObject API
    - Read staging token from branch
    - Get entry from branch
    - If the entry is staged
      - Update copy table
      - Create a shallow copy
    - If the entry is committed or resides on a different branch, perform a full copy using the underlying storage adapter copy method
    - Stage the new entry using the old entry data and the new path

5. Modify Gateway CopyObject API
    - See details above

6. Modify lakeFSFS renameObject method
    - Use CopyObject + DeleteObject when working via Gateway or OpenAPI

7. lakeFS clients
    - Is any action required on clients?
    - Do we need to make them version aware?
    - Should we issue a notification to upgrade clients?

8. Modify StageObject API
    - Return an error if the given address is inside the repository namespace

9. Implement new repository structure
    - Root prefix for new lakeFS data - `data/`
    - Use the slice naming conventions defined in the proposal to create the object path
    - Track object upload count
    - Create a new slice by time or count
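
The token handling in step 2 above can be illustrated with a minimal in-memory sketch. It is not lakeFS code: the `TokenStore` class, its `issue` and `consume` methods (standing in for GetPhysicalAddress and LinkPhysicalAddress respectively), and the 60-second default expiry are all hypothetical, and the caller supplies the address only to keep the example short.

```python
import secrets
import time


class TokenStore:
    """Hypothetical in-memory stand-in for the valid token list."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be exercised in tests
        self._valid = {}      # token -> (physical address, expiry time)

    def issue(self, address):
        """GetPhysicalAddress side: issue a token and record it as valid."""
        token = secrets.token_hex(16)
        self._valid[token] = (address, self._clock() + self._ttl)
        return token

    def consume(self, token, address):
        """LinkPhysicalAddress side: validate the token and remove it."""
        entry = self._valid.pop(token, None)  # removed whether or not it is valid
        if entry is None:
            return False
        issued_for, expires_at = entry
        return issued_for == address and self._clock() <= expires_at
```

In this sketch a token is single use: a second `consume` call with the same token fails, as does a call made after the expiry time or for a different address.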
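
The CopyObject decision in step 4 can be sketched as follows. This is an illustration under stated assumptions, not the lakeFS implementation: `Entry`, `copy_object`, the `copy_table` set, and the `full_copy` callable are hypothetical names, and reading the staging token and fetching the source entry are assumed to have happened before the call.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    path: str
    physical_address: str
    staged: bool   # True if the entry is uncommitted (staged) on its branch
    branch: str


def copy_object(src, dest_branch, dest_path, copy_table, full_copy):
    """Return the new staged entry describing a copy of `src`.

    copy_table: set standing in for the copied object table in ref-store.
    full_copy:  callable(address) -> new address, standing in for the
                storage adapter copy method.
    """
    if src.staged and src.branch == dest_branch:
        # Shallow copy: reuse the physical address and record the copy.
        copy_table.add(src.physical_address)
        address = src.physical_address
    else:
        # Committed entry or a different branch: copy the underlying object.
        address = full_copy(src.physical_address)
    # Stage the new entry using the old entry data and the new path.
    return replace(src, path=dest_path, branch=dest_branch,
                   physical_address=address, staged=True)
```

Recording the shared address in the copy table is what would let a later garbage collection run know that more than one staged entry may point at the same object.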
## Changes on GC

- New logic in GC to support uncommitted garbage collection
  - Prepare uncommitted for GC
  - Read uncommitted data from lakeFS
  - Implement the optimized run deltas
- Incorporate uncommitted changes into the current GC flow
  - Changes required for committed data to support uncommitted flow
  - Support mark and sweep for uncommitted
  - Integrate uncommitted and committed GC flows

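
One way to picture mark and sweep for uncommitted data is as a set difference. The sketch below assumes the sweep candidates are the objects listed in the repository namespace that neither committed nor uncommitted metadata references. The function name and its inputs are hypothetical, and the real flow described in the proposal may apply further conditions before anything is deleted.

```python
def sweep_candidates(listed, committed, uncommitted):
    """Return addresses that no committed or uncommitted entry references."""
    marked = set(committed) | set(uncommitted)   # mark: everything still referenced
    return sorted(set(listed) - marked)          # sweep: what is left over


# Example with made-up addresses.
print(sweep_candidates(
    listed=["data/a", "data/b", "data/c"],
    committed=["data/a"],
    uncommitted=["data/c"],
))  # ['data/b']
```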
## Testing

- Functionality - Basic GC functionality: check that the committed GC logic is not broken and that uncommitted GC works as expected
- Data Integrity - Ensure only objects that are eligible for deletion are hard deleted from the repository
- Performance - Verify that the uncommitted GC performance requirements are met
- Existing repository - Test the first run on an old repository, with and without objects in the new repository structure. Ensure consecutive runs work as expected (in Optimized Run)

## Milestone

- Implement Clean Run (including all lakeFS changes) with minimal performance requirements met
- Implement GC changes - Optimized Run + mark and sweep
- Deployment to lakeFS Cloud