# Hard Delete Uncommitted Data

## Decision

No solution was found that covers all scenarios and edge cases adequately. We decided to shelve the effort to find an
online solution for now, and to concentrate on the offline solution.

## Motivation

Uncommitted data which is no longer referenced (due to branch deletion, branch reset, etc.) is not deleted by lakeFS.
This may result in excessive storage usage and possible compliance issues.
To solve this, two approaches were suggested:
1. A batch operation performed as part of an external process (GC)
2. An online solution inside lakeFS

**This document details the latter.**

Several approaches were considered and are described in this document. All proposals suffer from unresolved design issues, and
so no specific approach was selected as a solution.

### Base assumptions

The following assumptions are required in all the suggested proposals:
1. The copy operation needs to be changed and implemented as a full copy (irrelevant for suggestion no. 3).
2. lakeFS is not responsible for retention of data outside the repository path (data ingest).
3. An online solution is not bulletproof, and will require a complementary external solution to handle edge cases as well as
   existing uncommitted garbage.

## 1. Write staged data directly to object store on a defined path

This proposal suggests moving the staging area management from the ref store to the object store and defining a structured path for
the object's physical address on the object store.

### Performance degradation when using the object store

Because this proposal handles the staging data in the object store, it suffers from a significant performance
degradation in one of lakeFS's principal flows - **listing**. Listing is used in flows such as listing branch objects, getting objects and committing.
We tested listing of ~240,000 **staged objects** using `lakectl fs ls` and compared it with `aws s3 ls`, with the following results:

> ./lakectl fs ls lakefs://niro-bench/main/guy-repo/itai-export/medium/ > 6.42s user 0.78s system 44% cpu 16.089 total

> aws s3 ls s3://guy-repo/itai-export/medium/ > /dev/null 56.82s user 2.34s system 74% cpu 1:19.79 total

### Design

Objects will be stored in a path relative to the branch and staging token. For example, file `x/y.z` will be uploaded to path
`<bucket-name>/<repo-name>/<branch-name>/<staging-token>/x/y.z`.
As a result, the uncommitted objects are all the objects under paths of staging tokens which were not yet committed.
When deleting a branch, every object under the branch's current staging token can be deleted (a sketch of this layout appears at the end of this proposal).

### Blockers

1. Commit flow is not atomic:
   1. Start write on a staged object
   2. Start commit
   3. Commit flow writes the current object to the committed tables
   4. Write to the object completes - changing the data after it was already committed

### Opens

1. Solve the blocker - how to prevent data modification after commit?
2. How to manage ingested data?
3. How to handle the SymLink and StageObject operations?
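For illustration, here is a minimal Go sketch (not lakeFS code) of the path layout described in the Design section above; the helper names and the staging-token format are assumptions:

```go
package main

import (
	"fmt"
	"path"
)

// stagingObjectKey builds the object-store key for a staged object under this
// proposal: <repo-name>/<branch-name>/<staging-token>/<logical-path>
// (the bucket name is the storage-namespace prefix and is omitted here).
func stagingObjectKey(repo, branch, stagingToken, logicalPath string) string {
	return path.Join(repo, branch, stagingToken, logicalPath)
}

// stagingTokenPrefix is the prefix holding every object staged under one
// staging token; deleting a branch translates into deleting this prefix for
// the branch's current (uncommitted) staging token.
func stagingTokenPrefix(repo, branch, stagingToken string) string {
	return path.Join(repo, branch, stagingToken) + "/"
}

func main() {
	fmt.Println(stagingObjectKey("my-repo", "main", "st-123abc", "x/y.z"))
	// my-repo/main/st-123abc/x/y.z
	fmt.Println(stagingTokenPrefix("my-repo", "main", "st-123abc"))
	// my-repo/main/st-123abc/
}
```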
## 2. Reference Counter on Object Metadata

This proposal suggests maintaining a reference counter in the underlying object's metadata and using the different block adapters'
capabilities to perform atomic, concurrency-safe increments of this counter.
The way we handle staged data in lakeFS will not change in any other way.
**The proposal is not viable: although Google Cloud Storage and Azure provide this capability, AWS S3 does not.**

In selected operations, this solution incurs an additional metadata read (and possibly a write) when the object is in the repo namespace.

### Blockers

1. AWS does not support a conditional set of object metadata

### Flows

#### Upload Object

1. Add a reference counter to the object's metadata and set it to one.
2. Write the object to the object store the same way we do today.

#### Stage Object

1. Object path is in the repo namespace:
   1. Read its metadata.
   2. If counter > 0, increment the counter and update the metadata on the object. Otherwise, treat as deleted.
   3. If the object's write to the object store fails - roll back the metadata update.
2. Object path is outside the repo namespace:
   1. lakeFS will not handle its retention - therefore it will have to be deleted by the
      user or by a dedicated GC process.

#### LinkPhysicalAddress

Assume this API uses a physical address in the repo namespace:
1. Read the object's metadata
2. Increment the counter and write the metadata
3. Add a staging entry

#### Delete Object

1. Read the object's metadata
2. If the reference counter exists:
   1. Decrement the counter
   2. If counter == 0
      1. Hard-delete the object
3. If the reference counter doesn't exist:
   1. Retention is not handled by lakeFS

#### Reset / Delete Branch

When resetting a branch we discard all the uncommitted branch data.
We can leverage the new KV async drop functionality to also check and hard-delete objects as needed.

### Atomic reference counter

For each storage adapter (AWS, Google Cloud, Azure), we will use built-in capabilities to provide this functionality.

#### AWS

AWS doesn't allow updating an object or its metadata after creation. In order to update the metadata, a copy-object operation must be performed with the new metadata values. The only
conditionals currently available are on the object's timestamp and ETag. Unfortunately, the ETag changes only when the object's content changes and does not take metadata changes into account.

#### Azure

Store the reference counter in the blob's metadata and use the `Set Blob Metadata` API with the supported conditional `If-Match` header on the ETag to update
the reference counter.

#### Google Cloud Storage

Store the reference counter in the blob's metadata and use the `patch object` API with the `ifMetagenerationMatch` precondition.
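For illustration, a minimal sketch of the Google Cloud Storage variant using the Go client's metageneration precondition; the `ref_count` metadata key, the function name and the retry policy (left to the caller) are assumptions, not lakeFS code:

```go
package refcount

import (
	"context"
	"fmt"
	"strconv"

	"cloud.google.com/go/storage"
)

// incrementRefCount bumps a "ref_count" entry in the object's metadata using
// the ifMetagenerationMatch precondition, so a concurrent update makes the
// call fail instead of silently overwriting the counter.
func incrementRefCount(ctx context.Context, obj *storage.ObjectHandle) (int64, error) {
	attrs, err := obj.Attrs(ctx)
	if err != nil {
		return 0, err
	}
	count, _ := strconv.ParseInt(attrs.Metadata["ref_count"], 10, 64)
	count++

	md := make(map[string]string, len(attrs.Metadata)+1)
	for k, v := range attrs.Metadata {
		md[k] = v
	}
	md["ref_count"] = strconv.FormatInt(count, 10)

	// Succeeds only if the metadata generation is unchanged since we read it;
	// on a precondition failure the caller is expected to re-read and retry.
	_, err = obj.If(storage.Conditions{MetagenerationMatch: attrs.Metageneration}).
		Update(ctx, storage.ObjectAttrsToUpdate{Metadata: md})
	if err != nil {
		return 0, fmt.Errorf("conditional metadata update: %w", err)
	}
	return count, nil
}
```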
## 3. Tracking references in staging manager

The staging manager will track the physical addresses of staged objects and the staging tokens referencing them.
We will introduce a new object to the database:

    key   = <repo>/references/<physical_address>/<staging_token>
    value = state          enum (staged|deleted|committed)
            last_modified  timestamp

On upload of an object to a branch, we will add a key under the references path with the physical address and the current staging token, and
mark it as `staged`.
On delete of an object from a staging area, we will update the reference key with the value `deleted`.
On commit, we will update the reference key with the value `committed`.

A background job will be responsible for scanning the references prefix and handling the references according to their state.
The `last_modified` parameter is used to prevent race conditions between reference candidates for deletion and ongoing operations
which might add references for the same physical address. The assumption is that when all references of a physical address
are in the `deleted` state, after a certain amount of time (TBD) this physical address cannot be referenced anymore (all staging tokens were either dropped or committed).

### Background delete job (pseudo-code)

1. Scan the `references` prefix (can scan starting after a given prefix for load balancing)
2. For each `physical_address`, read all entries
   1. If state == `committed` is found in any entry
      1. Delete all keys for `physical_address`, by order of state: deleted -> staged -> committed
   2. If state == `deleted` in all entries and min(`last_modified`) < `TBD`
      1. Hard-delete the object
      2. Delete all keys for `physical_address`

A sketch of this decision logic is given at the end of this proposal.

### Key Update Order of Precedence

1. The `committed` state takes precedence over all (terminal state - overrides any other value) - uses **Set**
2. The `deleted` state can only be set on entries which are not `committed` and uses **SetIf**
3. The `staged` state can only be set on entries which are not `committed` and uses **SetIf**

### Flows

#### Upload Object

1. Write blob
2. Write a reference entry to the database as `staged`
3. Add a staged entry to the database
4. If an entry exists in the current staging token - mark the old reference as `deleted`

**Open:** Efficiently deal with overrides

#### Stage Object

1. Object path is in the repo namespace:
   1. Add a reference entry to the database
2. Object path is outside the repo namespace:
   1. lakeFS will not handle its retention
3. Add a staged entry

#### GetPhysicalAddress

GetPhysicalAddress will add a reference to the generated unique physical address with state `deleted`.
We will then return a validation token / expiry timestamp to the user in addition to the physical address.
The user will need to pass the token to LinkPhysicalAddress.

#### LinkPhysicalAddress

Add a validation check that the provided token is valid and its timestamp has not expired.
Set the reference state to `staged`.

Assume this API uses a physical address in the repo namespace:
1. Add a reference entry to the database
2. Add a staging entry

#### Delete Object

1. If the object is staged:
   1. Read the reference key
   2. If not `committed`
      1. Change the reference state to `deleted` and update `last_modified`

#### Commit

1. Mark all entries in the staging area as `committed`
2. Swap the staging token
3. Create the commit
4. Update the branch commit ID

#### Reset / Delete Branch

When resetting a branch we discard all the uncommitted branch data; this operation happens asynchronously via `AsyncDrop`.
We can leverage this asynchronous job to also check and perform the hard delete:
1. Scan the deleted staging token's entries
2. If an entry is a `tombstone`, remove the entry's reference key
3. Otherwise, modify the reference state to `deleted` and update `last_modified`
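To make the background delete job above concrete, here is a minimal Go sketch of its per-address decision logic. The types, the `gracePeriod` value and the `Action` names are assumptions for illustration, not existing lakeFS code:

```go
package main

import (
	"fmt"
	"time"
)

// ReferenceState mirrors the state enum stored in each reference entry.
type ReferenceState string

const (
	StateStaged    ReferenceState = "staged"
	StateDeleted   ReferenceState = "deleted"
	StateCommitted ReferenceState = "committed"
)

// Reference is the value of one <physical_address>/<staging_token> key.
type Reference struct {
	State        ReferenceState
	LastModified time.Time
}

// Action is what the job decides to do for a single physical address.
type Action string

const (
	ActionNone       Action = "none"
	ActionDropKeys   Action = "drop reference keys only" // address was committed
	ActionHardDelete Action = "hard-delete object and keys"
)

// gracePeriod stands in for the TBD threshold after which an all-deleted
// address is assumed to have no live staging tokens.
const gracePeriod = 24 * time.Hour

// decide applies the pseudo-code above to all entries of one physical address.
func decide(refs []Reference, now time.Time) Action {
	if len(refs) == 0 {
		return ActionNone
	}
	allDeleted := true
	oldest := now
	for _, r := range refs {
		if r.State == StateCommitted {
			return ActionDropKeys
		}
		if r.State != StateDeleted {
			allDeleted = false
		}
		if r.LastModified.Before(oldest) {
			oldest = r.LastModified
		}
	}
	if allDeleted && now.Sub(oldest) > gracePeriod {
		return ActionHardDelete
	}
	return ActionNone
}

func main() {
	now := time.Now()
	refs := []Reference{
		{State: StateDeleted, LastModified: now.Add(-48 * time.Hour)},
		{State: StateDeleted, LastModified: now.Add(-30 * time.Hour)},
	}
	fmt.Println(decide(refs, now)) // hard-delete object and keys
}
```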
## 4. Staging Token States

This is an improvement over the third proposal: it suggests tracking the state on the staging token instead of on the objects,
while still keeping track of references (without state).

This will be done using the following objects:

References will be tracked in a list instead of an explicit entry per reference:

    key   = <repo>/references/<physical_address>
    value = list(staging tokens)

For each staging token of a deleted / reset branch, a deleted entry will be added:

    <repo>/staging/deleted/<staging_token>

For each commit we will add the staging tokens to a committed entry.
Adding a reference is done using `set if` and a retry mechanism to ensure consistency.

    key   = <repo>/staging/committed/<commit_id>
    value = list(staging tokens)

### Flows

#### Upload Object

1. Write blob
2. Read the reference entry
3. Perform `set if` on the reference with the additional staging token
4. Add a staged entry to the database

* Deleting an object will not remove the reference.
* References to objects are kept as long as the staging token is not deleted or committed.

#### Delete Object

The delete object flow remains the same; we do not delete references in this flow.

#### Stage Object

Allow stage object operations only on addresses outside the repository namespace.
lakeFS will not manage the retention of these objects.

#### GetPhysicalAddress

Provide a validation token with the issued physical address. The issued address can be used only once, thus
we can assume the given address is not committed or referenced anywhere else.
Once an issued address was used by LinkPhysicalAddress, we will delete the token ID from the list of valid tokens.

1. Issue a JWT token with the physical address
2. Save the token ID in the database
3. Return the token with the physical address to the user

#### LinkPhysicalAddress

1. Given a physical address and a token:
   1. Check token validity
2. If the token is valid:
   1. Remove the token ID
   2. Add a reference
   3. Add an entry in the staging area

#### CopyObject

The new flow ensures that we do not add a reference to a committed object after we removed all its references as part of
the background delete job, preventing the accidental deletion of committed objects. A sketch of this retry loop follows the list below.

1. While retries are not exhausted:
   1. Read the reference key; if the key is found:
      1. Read the reference list for the address; if the list is empty:
         1. Assume the address is deleted - return an error
            (the only way it is empty is if the background job is currently working on this reference during the delete flow)
      2. Else - add the branch staging token to the reference list and perform `set if`
         1. If the predicate failed - retry
   2. If the key is not found - assume the address was committed - perform the copy without adding a new reference
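A minimal Go sketch of the CopyObject reference update above, assuming a hypothetical key-value `Store` interface whose `SetIf` succeeds only when the stored list is unchanged (none of these names are existing lakeFS APIs):

```go
package copyobject

import (
	"context"
	"errors"
)

// Errors used by this sketch; ErrPredicateFailed is what the hypothetical
// Store returns when the set-if comparison fails.
var (
	ErrPredicateFailed = errors.New("predicate failed")
	ErrAddressDeleted  = errors.New("physical address is being deleted")
	ErrRetriesExceeded = errors.New("retries exhausted")
)

// Store is a stand-in for the KV interface assumed by this sketch.
type Store interface {
	// Get returns the staging-token list for a reference key, and whether the key exists.
	Get(ctx context.Context, key string) (tokens []string, found bool, err error)
	// SetIf writes tokens only if the currently stored list still equals expected.
	SetIf(ctx context.Context, key string, tokens, expected []string) error
}

const maxRetries = 3

// addReference appends the branch's staging token to the reference list with
// set-if plus retry. It returns (false, nil) when the key is missing, i.e. the
// address is committed and the copy should proceed without a new reference.
func addReference(ctx context.Context, s Store, refKey, stagingToken string) (bool, error) {
	for i := 0; i < maxRetries; i++ {
		tokens, found, err := s.Get(ctx, refKey)
		if err != nil {
			return false, err
		}
		if !found {
			return false, nil // committed address - copy without adding a reference
		}
		if len(tokens) == 0 {
			// The background delete job is removing this address right now.
			return false, ErrAddressDeleted
		}
		updated := append(append([]string{}, tokens...), stagingToken)
		switch err := s.SetIf(ctx, refKey, updated, tokens); {
		case err == nil:
			return true, nil
		case errors.Is(err, ErrPredicateFailed):
			continue // lost the race - re-read and retry
		default:
			return false, err
		}
	}
	return false, ErrRetriesExceeded
}
```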
#### Commit

1. Perform the commit flow in the same manner as today
2. On successful commit - create an entry using `commit_id` and the ordered list of tokens under the `committed` prefix

#### Reset / Delete Branch

1. Perform the reset / delete flow in the same manner as today
2. On successful execution - create entries for all staging tokens under the `deleted` prefix

### Background delete job

* Scan committed tokens
  * For each committed object - remove the reference key
  * For each staging token in the committed entry - "move the staging token to deleted":
    all objects in committed staging tokens that were not actually committed are candidates for deletion, therefore
    we can either execute the deleted-tokens logic, or create entries for these tokens under the deleted prefix (implementation detail)
  * Remove the `staging/committed` key
  * Delete the staging area
* Scan deleted tokens
  * For each token - remove the references for all objects on the staging token
    * If it is the last reference for that object (the list is empty after the update) - perform a hard delete
  * Delete the staging area

### Handling key override on same staging token - improvement

TBD

## 5. No Copies

The idea is to make sure we do not allow copies: when no two staging entries point to the same physical object, we can delete the physical address in the following cases:

- Upload - get the previous entry and delete the previous physical address, unless the staging token was updated
- Revert - branch / prefix / object: when the entry is dropped we can delete the physical address
- Delete - repo / branch / object: delete the physical address after the entry no longer exists

To enable the physical delete we should do the following:

- Make sure we don't reuse the same physical address, by signing the links we return in our API. The logical path will be part of the signature that is embedded in the physical address we generate. This will enable us to block any use of a physical address outside the repo/branch/entry.
- The S3 gateway `put` operation with copy support will use the underlying adapter to copy the data. This will require each adapter to support/emulate a 'copy' operation.
- (optional) Enable a 'move' API as an alternative to 'copy'. We have seen that moving objects from one location to another by copy+delete of metadata will enable easy support for lakeFSFS move. This can be implemented by marking the entry on staging as 'locked'. During the move operation - lock+copy+delete+unlock. In case of any failure, as we don't have transactions, we may keep two entries, or keep a locked object without ever deleting its physical address.

TODO(Barak): list more cases and how they will be addressed in this solution.

### Object Locking

Add a new field to the staging area struct to indicate whether the entry is currently locked:
```
type Value struct {
	Identity []byte
	Data     []byte
	Locked   bool
}
```
We can then use `SetIf` to ensure a concurrency-safe locking mechanism.
When deleting a staged entry (either by DropKey, ResetBranch or DeleteBranch) we will perform a hard delete of the object only if
the entry is not locked.
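A minimal sketch of that `SetIf`-based lock, assuming a hypothetical `Store` interface with compare-and-set semantics (not the actual lakeFS KV API):

```go
package staging

import (
	"context"
	"errors"
)

// ErrPredicateFailed is what the hypothetical store returns when the
// compare-and-set predicate does not hold (someone changed the entry).
var ErrPredicateFailed = errors.New("predicate failed")

// Value mirrors the staging entry above, including the added Locked flag.
type Value struct {
	Identity []byte
	Data     []byte
	Locked   bool
}

// Store is a stand-in for the KV interface used by this sketch.
type Store interface {
	Get(ctx context.Context, key string) (Value, error)
	// SetIf writes val only if the currently stored value still equals expected.
	SetIf(ctx context.Context, key string, val, expected Value) error
}

// tryLock flips the Locked flag with a compare-and-set, so concurrent
// Upload / Move / Delete flows cannot both acquire the same entry.
func tryLock(ctx context.Context, s Store, key string) (Value, error) {
	cur, err := s.Get(ctx, key)
	if err != nil {
		return Value{}, err
	}
	if cur.Locked {
		return Value{}, errors.New("entry already locked")
	}
	locked := cur
	locked.Locked = true
	if err := s.SetIf(ctx, key, locked, cur); err != nil {
		return Value{}, err // ErrPredicateFailed means we lost the race
	}
	return locked, nil
}
```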
### Flows

#### Move operation

1. Get entry
2. "Lock" entry (`SetIf`)
3. Copy entry to a new path on the current staging token
4. Delete the old entry
5. "Unlock" entry (the new entry)

#### Upload Object

1. Get entry
2. "Lock" entry if it exists (override scenario) (`SetIf`)
3. Write blob
4. Add staged entry (or update)
5. Delete the physical address if one previously existed

#### Delete Object

1. Get entry
2. "Lock" entry (`SetIf`) to protect from a delete - move race
3. Delete the staging entry
4. Hard-delete the object

### Races and issues

#### 1. Commit - Move Race

1. Start Move:
   1. Get entry
   2. "Lock" entry
2. Start Commit:
   1. Change staging token
   2. Write commit data (the old entry path was written)
3. Move:
   1. Create a new entry on the new staging token
   2. Delete the old entry
   3. "Unlock" the physical address

Now we have an uncommitted entry pointing to a committed physical address. If we reset the branch we will delete
committed data!

#### 2. Delete - Move Race

1. Start Delete:
   1. Get entry
   2. Check that it is not "locked"
2. Start Move:
   1. Get entry
   2. "Lock" entry
   3. Copy entry to a new path on the current staging token
3. Delete:
   1. Hard-delete the object from the store

In this situation we have an uncommitted entry which points to an invalid physical address.
**Resolved by locking the entry in the delete flow as well.**

#### 3. Concurrent Move

Concurrent moves should be blocked as they may create multiple pointers to the same physical address.
The most straightforward way to do so is to rely on the absence of the lock.
The problem with this approach is that permanently locked addresses (the lock is a protection mechanism in case of catastrophic failures)
can never be moved - which leads us to the stale lock problem.

#### 4. Stale lock problem

We use the lock mechanism to protect against race scenarios on Upload (override), Move and Delete. As such, an entry with a stale
lock will prevent us from performing any of these operations.
We can overcome this with a time-based lock - but this might present additional challenges to the proposed solution.
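For illustration only, one possible shape of such a time-based lock, replacing the boolean flag with an expiry timestamp; the field name and TTL are assumptions, and the stale-lock trade-offs above still apply:

```go
package staging

import "time"

// TimedValue replaces the boolean Locked flag with an expiry timestamp,
// so a crashed writer cannot hold the entry forever.
type TimedValue struct {
	Identity    []byte
	Data        []byte
	LockedUntil time.Time // zero value means unlocked
}

// locked reports whether the entry is still held at the given time.
func (v TimedValue) locked(now time.Time) bool {
	return now.Before(v.LockedUntil)
}

// withLock returns a copy of the entry locked for ttl; the caller would still
// write it back with the same SetIf compare-and-set as in the sketch above.
func (v TimedValue) withLock(now time.Time, ttl time.Duration) TimedValue {
	v.LockedUntil = now.Add(ttl)
	return v
}
```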