# Uncommitted Garbage Collection - Execution Plan

Uncommitted Garbage Collection [Proposal](https://github.com/treeverse/lakeFS/blob/master/design/accepted/gc_plus/uncommitted-gc.md)

## Overview

Implementing uncommitted GC can be divided into 3 main changes:
1. lakeFS additions and modifications
2. Implementing uncommitted GC logic on the client side
3. Modifying the existing GC

All changes in lakeFS are independent, so work on them can start immediately.
Some GC changes depend on lakeFS changes; these are prioritized as part of the plan.
All tasks are tracked under the [GC+ label](https://github.com/treeverse/lakeFS/labels/GC%2B) in GitHub.

## Changes on lakeFS

The following is a list of changes in lakeFS in order of priority, according to the dependency constraints:

1. Implement PrepareUncommittedForGC API
   - Create uncommitted object list files and save them in a designated GC path on the object store
   - After this is done, we can implement the new uncommitted logic in GC

2. Modify [Get/Link]PhysicalAddress
   - Issue a validation token on GetPhysicalAddress with an expiry time and add it to the valid token list
   - LinkPhysicalAddress will validate the token and remove it from the valid token list

3. Add tracking of copied objects in ref-store
   - On each copy operation, add an entry to the copied object table in ref-store
   - Delete entries periodically or during defined operations (commit / delete / reset branch)

4. Implement CopyObject API
   - Read staging token from branch
   - Get entry from branch
   - If entry is staged
     - Update copy table
     - Create a shallow copy
   - If entry is committed or on a different branch, perform a full copy using the underlying storage adapter copy method
   - Stage new entry using old entry data and new path

5. Modify Gateway CopyObject API
   - See details above

6. Modify lakeFSFS renameObject method
   - Use CopyObject + DeleteObject when working via Gateway or OpenAPI

7. lakeFS clients
   - Any action required on clients?
   - Do we need to make them version aware?
   - Should we issue a notification to upgrade clients?

8. Modify StageObject API
   - Return an error if the given address is inside the repository namespace

9. Implement new repository structure
   - Root prefix for new lakeFS data - `data/`
   - Use slice naming conventions as defined in the proposal to create the object path
   - Track object upload count
   - Create a new slice by time or count

## Changes on GC

- New logic in GC to support uncommitted garbage collection
  - Prepare uncommitted for GC
  - Read uncommitted data from lakeFS
  - Implement the optimized run deltas
- Incorporate uncommitted changes into the current GC flow
  - Changes required for committed data to support the uncommitted flow
  - Support mark and sweep for uncommitted
  - Integrate uncommitted and committed GC flows

## Testing

- Functionality - Basic GC functionality: check that committed GC logic is not broken and that uncommitted GC works as expected
- Data Integrity - Ensure that only objects eligible for deletion are hard deleted from the repository
- Performance - Verify that uncommitted GC performance requirements are met
- Existing repository - Test the first run on an old repository, with and without objects in the new repository structure. Ensure consecutive runs work as expected (in Optimized Run)

## Milestone

- Implement Clean Run (including all lakeFS changes) with minimal performance requirements met
- Implement GC changes - Optimized Run + mark and sweep
- Deployment to lakeFS Cloud
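
For reference, the CopyObject flow listed under "Changes on lakeFS" (item 4) could look roughly like the sketch below. This is a minimal in-memory illustration in Go, not lakeFS code: `Entry`, `Store`, `CopyTable`, `NewStore`, and the `FullCopy` callback are hypothetical names introduced here, the copy table is modeled as a simple counter per physical address, and staging tokens are left out for brevity.

```go
package main

import "fmt"

// Entry is a simplified, hypothetical view of an object entry on a branch.
type Entry struct {
	Path            string
	PhysicalAddress string
	Staged          bool // true while the entry is uncommitted
}

// Store is a hypothetical in-memory stand-in for a single repository.
type Store struct {
	Branches  map[string]map[string]Entry    // branch -> path -> entry
	CopyTable map[string]int                 // physical address -> shallow copy count (assumed shape)
	FullCopy  func(srcAddress string) string // stands in for the storage adapter copy method
}

// NewStore returns an empty store that uses fullCopy for full copies.
func NewStore(fullCopy func(string) string) *Store {
	return &Store{
		Branches:  make(map[string]map[string]Entry),
		CopyTable: make(map[string]int),
		FullCopy:  fullCopy,
	}
}

// CopyObject stages a copy of srcPath (on srcBranch) at dstPath (on dstBranch)
// and reports whether the copy was shallow.
func (s *Store) CopyObject(srcBranch, srcPath, dstBranch, dstPath string) (bool, error) {
	src, ok := s.Branches[srcBranch][srcPath]
	if !ok {
		return false, fmt.Errorf("entry %q not found on branch %q", srcPath, srcBranch)
	}

	// The new entry reuses the old entry's data under the new path.
	dst := src
	dst.Path = dstPath
	dst.Staged = true

	// Shallow copy only for a staged entry on the same branch.
	shallow := src.Staged && srcBranch == dstBranch
	if shallow {
		s.CopyTable[src.PhysicalAddress]++ // track the extra reference
	} else {
		// Committed entry or different branch: copy the underlying object.
		dst.PhysicalAddress = s.FullCopy(src.PhysicalAddress)
	}

	if s.Branches[dstBranch] == nil {
		s.Branches[dstBranch] = make(map[string]Entry)
	}
	s.Branches[dstBranch][dstPath] = dst
	return shallow, nil
}

func main() {
	s := NewStore(func(addr string) string { return addr + "-copy" })
	s.Branches["main"] = map[string]Entry{
		"a.txt": {Path: "a.txt", PhysicalAddress: "data/obj1", Staged: true},
		"b.txt": {Path: "b.txt", PhysicalAddress: "data/obj2", Staged: false},
	}

	shallow, _ := s.CopyObject("main", "a.txt", "main", "a2.txt")
	fmt.Println("staged, same branch -> shallow:", shallow)

	shallow, _ = s.CopyObject("main", "b.txt", "main", "b2.txt")
	fmt.Println("committed -> shallow:", shallow)

	shallow, _ = s.CopyObject("main", "a.txt", "dev", "a.txt")
	fmt.Println("different branch -> shallow:", shallow)
}
```

In this sketch a shallow copy leaves two staged entries pointing at the same physical address, which is why the copy table is updated before staging: uncommitted GC would need that record to avoid deleting an object that is still referenced.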