# Hard Delete Uncommitted Data

## Decision

No solution was found that covers all scenarios and edge cases adequately. We decided to shelve the effort to find an
online solution for now, and to concentrate on the offline solution.

## Motivation

Uncommitted data which is no longer referenced (due to branch deletion, branch reset, etc.) is not deleted by lakeFS.
This may result in excessive storage usage and possible compliance issues.
To solve this, two approaches were suggested:
1. A batch operation performed as part of an external process (GC)
2. An online solution inside lakeFS

**This document details the latter.**

Several approaches were considered and are described in this document. All proposals suffer from unresolved design issues, and
so no specific approach was selected as a solution.

### Base assumptions

The following assumptions are required in all the suggested proposals:
1. The copy operation needs to be changed and implemented as a full copy (irrelevant for suggestion no. 3).
2. lakeFS is not responsible for retention of data outside the repository path (data ingest).
3. An online solution is not bulletproof, and will require a complementary external solution to handle edge cases as well as
   existing uncommitted garbage.

## 1. Write staged data directly to object store on a defined path

This proposal suggests moving the staging area management from the ref store to the object store and defining a structured path for
the object's physical address on the object store.

### Performance degradation when using the object store

Because this proposal handles the staging data in the object store, it suffers from a significant performance
degradation in one of lakeFS's principal flows - **listing**. Listing is used in flows such as listing branch objects, getting objects and committing.
We tested listing of ~240,000 **staged objects** using `lakectl fs ls` and compared it with `aws s3 ls`, with the following results:

> ./lakectl fs ls lakefs://niro-bench/main/guy-repo/itai-export/medium/ > 6.42s user 0.78s system 44% cpu 16.089 total

> aws s3 ls s3://guy-repo/itai-export/medium/ > /dev/null 56.82s user 2.34s system 74% cpu 1:19.79 total

### Design

Objects will be stored in a path relative to the branch and staging token. For example, file `x/y.z` will be uploaded to path
`<bucket-name>/<repo-name>/<branch-name>/<staging-token>/x/y.z`.
As a result, the uncommitted objects are all the objects under paths of staging tokens which were not yet committed.
When deleting a branch, every object under the branch's current staging token can be deleted (a sketch of this layout appears at the end of this proposal).

### Blockers

1. Commit flow is not atomic:
   1. Start write on a staged object
   2. Start commit
   3. Commit flow writes the current object to the committed tables
   4. Write to the object completes - changing the data after it was already committed

### Opens

1. Solve the blocker - how to prevent data modification after commit?
2. How to manage ingested data?
3. How to handle the SymLink and StageObject operations?
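For illustration, here is a minimal Go sketch (not lakeFS code) of the path layout described in the Design section above; the helper names and the staging-token format are assumptions:

```go
package main

import (
	"fmt"
	"path"
)

// stagingObjectKey builds the object-store key for a staged object under this
// proposal: <repo-name>/<branch-name>/<staging-token>/<logical-path>
// (the bucket name is the storage-namespace prefix and is omitted here).
func stagingObjectKey(repo, branch, stagingToken, logicalPath string) string {
	return path.Join(repo, branch, stagingToken, logicalPath)
}

// stagingTokenPrefix is the prefix holding every object staged under one
// staging token; deleting a branch translates into deleting this prefix for
// the branch's current (uncommitted) staging token.
func stagingTokenPrefix(repo, branch, stagingToken string) string {
	return path.Join(repo, branch, stagingToken) + "/"
}

func main() {
	fmt.Println(stagingObjectKey("my-repo", "main", "st-123abc", "x/y.z"))
	// my-repo/main/st-123abc/x/y.z
	fmt.Println(stagingTokenPrefix("my-repo", "main", "st-123abc"))
	// my-repo/main/st-123abc/
}
```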
## 2. Reference Counter on Object Metadata

This proposal suggests maintaining a reference counter in the underlying object's metadata and using the different block adapters'
capabilities to perform atomic, concurrency-safe increments of this counter.
The way we handle staged data in lakeFS will not change in any other way.
**The proposal is not viable: although Google Cloud Storage and Azure provide this capability, AWS S3 does not.**

In selected operations, this solution incurs an additional metadata read (and possibly a write) when the object is in the repo namespace.

### Blockers

1. AWS does not support a conditional set of object metadata

### Flows

#### Upload Object

1. Add a reference counter to the object's metadata and set it to one.
2. Write the object to the object store the same way we do today.

#### Stage Object

1. Object path is in the repo namespace:
   1. Read its metadata.
   2. If counter > 0, increment the counter and update the metadata on the object. Otherwise, treat as deleted.
   3. If the object's write to the object store fails - roll back the metadata update.
2. Object path is outside the repo namespace:
   1. lakeFS will not handle its retention - therefore it will have to be deleted by the
      user or by a dedicated GC process.

#### LinkPhysicalAddress

Assume this API uses a physical address in the repo namespace:
1. Read the object's metadata
2. Increment the counter and write the metadata
3. Add a staging entry

#### Delete Object

1. Read the object's metadata
2. If the reference counter exists:
   1. Decrement the counter
   2. If counter == 0
      1. Hard-delete the object
3. If the reference counter doesn't exist:
   1. Retention is not handled by lakeFS

#### Reset / Delete Branch

When resetting a branch we discard all the uncommitted branch data.
We can leverage the new KV async drop functionality to also check and hard-delete objects as needed.

### Atomic reference counter

For each storage adapter (AWS, Google Cloud, Azure), we will use built-in capabilities to provide this functionality.

#### AWS

AWS doesn't allow updating an object or its metadata after creation. In order to update the metadata, a copy-object operation must be performed with the new metadata values. The only
conditionals currently available are on the object's timestamp and ETag. Unfortunately, the ETag changes only when the object's content changes and does not take metadata changes into account.

#### Azure

Store the reference counter in the blob's metadata and use the `Set Blob Metadata` API with the supported conditional `If-Match` header on the ETag to update
the reference counter.

#### Google Cloud Storage

Store the reference counter in the blob's metadata and use the `patch object` API with the `ifMetagenerationMatch` precondition.
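For illustration, a minimal sketch of the Google Cloud Storage variant using the Go client's metageneration precondition; the `ref_count` metadata key, the function name and the retry policy (left to the caller) are assumptions, not lakeFS code:

```go
package refcount

import (
	"context"
	"fmt"
	"strconv"

	"cloud.google.com/go/storage"
)

// incrementRefCount bumps a "ref_count" entry in the object's metadata using
// the ifMetagenerationMatch precondition, so a concurrent update makes the
// call fail instead of silently overwriting the counter.
func incrementRefCount(ctx context.Context, obj *storage.ObjectHandle) (int64, error) {
	attrs, err := obj.Attrs(ctx)
	if err != nil {
		return 0, err
	}
	count, _ := strconv.ParseInt(attrs.Metadata["ref_count"], 10, 64)
	count++

	md := make(map[string]string, len(attrs.Metadata)+1)
	for k, v := range attrs.Metadata {
		md[k] = v
	}
	md["ref_count"] = strconv.FormatInt(count, 10)

	// Succeeds only if the metadata generation is unchanged since we read it;
	// on a precondition failure the caller is expected to re-read and retry.
	_, err = obj.If(storage.Conditions{MetagenerationMatch: attrs.Metageneration}).
		Update(ctx, storage.ObjectAttrsToUpdate{Metadata: md})
	if err != nil {
		return 0, fmt.Errorf("conditional metadata update: %w", err)
	}
	return count, nil
}
```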
## 3. Tracking references in staging manager

The staging manager will track the physical addresses of staged objects and the staging tokens referencing them.
We will introduce a new object to the database:

    key   = <repo>/references/<physical_address>/<staging_token>
    value = state          enum (staged|deleted|committed)
            last_modified  timestamp

On upload of an object to a branch, we will add a key under the references path with the physical address and the current staging token, and
mark it as `staged`.
On delete of an object from a staging area, we will update the reference key with the value `deleted`.
On commit, we will update the reference key with the value `committed`.

A background job will be responsible for scanning the references prefix and handling the references according to their state.
The `last_modified` parameter is used to prevent race conditions between reference candidates for deletion and ongoing operations
which might add references for the same physical address. The assumption is that when all references of a physical address
are in the `deleted` state, after a certain amount of time (TBD) this physical address cannot be referenced anymore (all staging tokens were either dropped or committed).

### Background delete job (pseudo-code)

1. Scan the `references` prefix (can scan starting after a given prefix for load balancing)
2. For each `physical_address`, read all entries
   1. If state == `committed` is found in any entry
      1. Delete all keys for `physical_address`, by order of state: deleted -> staged -> committed
   2. If state == `deleted` in all entries and min(`last_modified`) < `TBD`
      1. Hard-delete the object
      2. Delete all keys for `physical_address`

A sketch of this decision logic is given at the end of this proposal.

### Key Update Order of Precedence

1. The `committed` state takes precedence over all (terminal state - overrides any other value) - uses **Set**
2. The `deleted` state can only be set on entries which are not `committed` and uses **SetIf**
3. The `staged` state can only be set on entries which are not `committed` and uses **SetIf**

### Flows

#### Upload Object

1. Write blob
2. Write a reference entry to the database as `staged`
3. Add a staged entry to the database
4. If an entry exists in the current staging token - mark the old reference as `deleted`

**Open:** Efficiently deal with overrides

#### Stage Object

1. Object path is in the repo namespace:
   1. Add a reference entry to the database
2. Object path is outside the repo namespace:
   1. lakeFS will not handle its retention
3. Add a staged entry

#### GetPhysicalAddress

GetPhysicalAddress will add a reference to the generated unique physical address with state `deleted`.
We will then return a validation token / expiry timestamp to the user in addition to the physical address.
The user will need to pass the token to LinkPhysicalAddress.

#### LinkPhysicalAddress

Add a validation check that the provided token is valid and its timestamp has not expired.
Set the reference state to `staged`.

Assume this API uses a physical address in the repo namespace:
1. Add a reference entry to the database
2. Add a staging entry

#### Delete Object

1. If the object is staged:
   1. Read the reference key
   2. If not `committed`
      1. Change the reference state to `deleted` and update `last_modified`

#### Commit

1. Mark all entries in the staging area as `committed`
2. Swap the staging token
3. Create the commit
4. Update the branch commit ID

#### Reset / Delete Branch

When resetting a branch we discard all the uncommitted branch data; this operation happens asynchronously via `AsyncDrop`.
We can leverage this asynchronous job to also check and perform the hard delete:
1. Scan the deleted staging token's entries
2. If an entry is a `tombstone`, remove the entry's reference key
3. Otherwise, modify the reference state to `deleted` and update `last_modified`
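To make the background delete job above concrete, here is a minimal Go sketch of its per-address decision logic. The types, the `gracePeriod` value and the `Action` names are assumptions for illustration, not existing lakeFS code:

```go
package main

import (
	"fmt"
	"time"
)

// ReferenceState mirrors the state enum stored in each reference entry.
type ReferenceState string

const (
	StateStaged    ReferenceState = "staged"
	StateDeleted   ReferenceState = "deleted"
	StateCommitted ReferenceState = "committed"
)

// Reference is the value of one <physical_address>/<staging_token> key.
type Reference struct {
	State        ReferenceState
	LastModified time.Time
}

// Action is what the job decides to do for a single physical address.
type Action string

const (
	ActionNone       Action = "none"
	ActionDropKeys   Action = "drop reference keys only" // address was committed
	ActionHardDelete Action = "hard-delete object and keys"
)

// gracePeriod stands in for the TBD threshold after which an all-deleted
// address is assumed to have no live staging tokens.
const gracePeriod = 24 * time.Hour

// decide applies the pseudo-code above to all entries of one physical address.
func decide(refs []Reference, now time.Time) Action {
	if len(refs) == 0 {
		return ActionNone
	}
	allDeleted := true
	oldest := now
	for _, r := range refs {
		if r.State == StateCommitted {
			return ActionDropKeys
		}
		if r.State != StateDeleted {
			allDeleted = false
		}
		if r.LastModified.Before(oldest) {
			oldest = r.LastModified
		}
	}
	if allDeleted && now.Sub(oldest) > gracePeriod {
		return ActionHardDelete
	}
	return ActionNone
}

func main() {
	now := time.Now()
	refs := []Reference{
		{State: StateDeleted, LastModified: now.Add(-48 * time.Hour)},
		{State: StateDeleted, LastModified: now.Add(-30 * time.Hour)},
	}
	fmt.Println(decide(refs, now)) // hard-delete object and keys
}
```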
## 4. Staging Token States

This is an improvement over the third proposal: it suggests tracking the state on the staging token instead of on the objects,
while still keeping track of references (without state).

This will be done using the following objects:

References will be tracked in a list instead of an explicit entry per reference:

    key   = <repo>/references/<physical_address>
    value = list(staging tokens)

For each staging token of a deleted / reset branch, a deleted entry will be added:

    <repo>/staging/deleted/<staging_token>

For each commit we will add the staging tokens to a committed entry.
Adding a reference is done using `set if` and a retry mechanism to ensure consistency.

    key   = <repo>/staging/committed/<commit_id>
    value = list(staging tokens)

### Flows

#### Upload Object

1. Write blob
2. Read the reference entry
3. Perform `set if` on the reference with the additional staging token
4. Add a staged entry to the database

* Deleting an object will not remove the reference.
* References to objects are kept as long as the staging token is not deleted or committed.

#### Delete Object

The delete object flow remains the same; we do not delete references in this flow.

#### Stage Object

Allow stage object operations only on addresses outside the repository namespace.
lakeFS will not manage the retention of these objects.

#### GetPhysicalAddress

Provide a validation token with the issued physical address. The issued address can be used only once, thus
we can assume the given address is not committed or referenced anywhere else.
Once an issued address was used by LinkPhysicalAddress, we will delete the token ID from the list of valid tokens.

1. Issue a JWT token with the physical address
2. Save the token ID in the database
3. Return the token with the physical address to the user

#### LinkPhysicalAddress

1. Given a physical address and a token:
   1. Check token validity
2. If the token is valid:
   1. Remove the token ID
   2. Add a reference
   3. Add an entry in the staging area

#### CopyObject

The new flow ensures that we do not add a reference to a committed object after we removed all its references as part of
the background delete job, preventing the accidental deletion of committed objects. A sketch of this retry loop follows the list below.

1. While retries are not exhausted:
   1. Read the reference key; if the key is found:
      1. Read the reference list for the address; if the list is empty:
         1. Assume the address is deleted - return an error
            (the only way it is empty is if the background job is currently working on this reference during the delete flow)
      2. Else - add the branch staging token to the reference list and perform `set if`
         1. If the predicate failed - retry
   2. If the key is not found - assume the address was committed - perform the copy without adding a new reference
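A minimal Go sketch of the CopyObject reference update above, assuming a hypothetical key-value `Store` interface whose `SetIf` succeeds only when the stored list is unchanged (none of these names are existing lakeFS APIs):

```go
package copyobject

import (
	"context"
	"errors"
)

// Errors used by this sketch; ErrPredicateFailed is what the hypothetical
// Store returns when the set-if comparison fails.
var (
	ErrPredicateFailed = errors.New("predicate failed")
	ErrAddressDeleted  = errors.New("physical address is being deleted")
	ErrRetriesExceeded = errors.New("retries exhausted")
)

// Store is a stand-in for the KV interface assumed by this sketch.
type Store interface {
	// Get returns the staging-token list for a reference key, and whether the key exists.
	Get(ctx context.Context, key string) (tokens []string, found bool, err error)
	// SetIf writes tokens only if the currently stored list still equals expected.
	SetIf(ctx context.Context, key string, tokens, expected []string) error
}

const maxRetries = 3

// addReference appends the branch's staging token to the reference list with
// set-if plus retry. It returns (false, nil) when the key is missing, i.e. the
// address is committed and the copy should proceed without a new reference.
func addReference(ctx context.Context, s Store, refKey, stagingToken string) (bool, error) {
	for i := 0; i < maxRetries; i++ {
		tokens, found, err := s.Get(ctx, refKey)
		if err != nil {
			return false, err
		}
		if !found {
			return false, nil // committed address - copy without adding a reference
		}
		if len(tokens) == 0 {
			// The background delete job is removing this address right now.
			return false, ErrAddressDeleted
		}
		updated := append(append([]string{}, tokens...), stagingToken)
		switch err := s.SetIf(ctx, refKey, updated, tokens); {
		case err == nil:
			return true, nil
		case errors.Is(err, ErrPredicateFailed):
			continue // lost the race - re-read and retry
		default:
			return false, err
		}
	}
	return false, ErrRetriesExceeded
}
```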
#### Commit

1. Perform the commit flow in the same manner as today
2. On successful commit - create an entry using `commit_id` and the ordered list of tokens under the `committed` prefix

#### Reset / Delete Branch

1. Perform the reset / delete flow in the same manner as today
2. On successful execution - create entries for all staging tokens under the `deleted` prefix

### Background delete job

* Scan committed tokens
  * For each committed object - remove the reference key
  * For each staging token in the committed entry - "move the staging token to deleted":
    all objects in committed staging tokens that were not actually committed are candidates for deletion, therefore
    we can either execute the deleted-tokens logic, or create entries for these tokens under the deleted prefix (implementation detail)
  * Remove the `staging/committed` key
  * Delete the staging area
* Scan deleted tokens
  * For each token - remove the references for all objects on the staging token
    * If it is the last reference for that object (the list is empty after the update) - perform a hard delete
  * Delete the staging area

### Handling key override on same staging token - improvement

TBD

## 5. No Copies

The idea is to make sure we do not allow copies: when no two staging entries point to the same physical object, we can delete the physical address in the following cases:

- Upload - get the previous entry and delete the previous physical address, unless the staging token was updated
- Revert - branch / prefix / object: when the entry is dropped we can delete the physical address
- Delete - repo / branch / object: delete the physical address after the entry no longer exists

To enable the physical delete we should do the following:

- Make sure we don't reuse the same physical address, by signing the links we return in our API. The logical path will be part of the signature that is embedded in the physical address we generate. This will enable us to block any use of a physical address outside the repo/branch/entry.
- The S3 gateway `put` operation with copy support will use the underlying adapter to copy the data. This will require each adapter to support/emulate a 'copy' operation.
- (optional) Enable a 'move' API as an alternative to 'copy'. We have seen that moving objects from one location to another by copy+delete of metadata will enable easy support for lakeFSFS move. This can be implemented by marking the entry on staging as 'locked'. During the move operation - lock+copy+delete+unlock. In case of any failure, as we don't have transactions, we may keep two entries, or keep a locked object without ever deleting its physical address.

TODO(Barak): list more cases and how they will be addressed in this solution.

### Object Locking

Add a new field to the staging area struct to indicate whether the entry is currently locked:
```
type Value struct {
	Identity []byte
	Data     []byte
	Locked   bool
}
```
We can then use `SetIf` to ensure a concurrency-safe locking mechanism.
When deleting a staged entry (either by DropKey, ResetBranch or DeleteBranch) we will perform a hard delete of the object only if
the entry is not locked.
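A minimal sketch of that `SetIf`-based lock, assuming a hypothetical `Store` interface with compare-and-set semantics (not the actual lakeFS KV API):

```go
package staging

import (
	"context"
	"errors"
)

// ErrPredicateFailed is what the hypothetical store returns when the
// compare-and-set predicate does not hold (someone changed the entry).
var ErrPredicateFailed = errors.New("predicate failed")

// Value mirrors the staging entry above, including the added Locked flag.
type Value struct {
	Identity []byte
	Data     []byte
	Locked   bool
}

// Store is a stand-in for the KV interface used by this sketch.
type Store interface {
	Get(ctx context.Context, key string) (Value, error)
	// SetIf writes val only if the currently stored value still equals expected.
	SetIf(ctx context.Context, key string, val, expected Value) error
}

// tryLock flips the Locked flag with a compare-and-set, so concurrent
// Upload / Move / Delete flows cannot both acquire the same entry.
func tryLock(ctx context.Context, s Store, key string) (Value, error) {
	cur, err := s.Get(ctx, key)
	if err != nil {
		return Value{}, err
	}
	if cur.Locked {
		return Value{}, errors.New("entry already locked")
	}
	locked := cur
	locked.Locked = true
	if err := s.SetIf(ctx, key, locked, cur); err != nil {
		return Value{}, err // ErrPredicateFailed means we lost the race
	}
	return locked, nil
}
```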
### Flows

#### Move operation

1. Get entry
2. "Lock" entry (`SetIf`)
3. Copy entry to a new path on the current staging token
4. Delete the old entry
5. "Unlock" entry (the new entry)

#### Upload Object

1. Get entry
2. "Lock" entry if it exists (override scenario) (`SetIf`)
3. Write blob
4. Add staged entry (or update)
5. Delete the physical address if one previously existed

#### Delete Object

1. Get entry
2. "Lock" entry (`SetIf`) to protect from a delete - move race
3. Delete the staging entry
4. Hard-delete the object

### Races and issues

#### 1. Commit - Move Race

1. Start Move:
   1. Get entry
   2. "Lock" entry
2. Start Commit:
   1. Change staging token
   2. Write commit data (the old entry path was written)
3. Move:
   1. Create a new entry on the new staging token
   2. Delete the old entry
   3. "Unlock" the physical address

Now we have an uncommitted entry pointing to a committed physical address. If we reset the branch we will delete
committed data!

#### 2. Delete - Move Race

1. Start Delete:
   1. Get entry
   2. Check that it is not "locked"
2. Start Move:
   1. Get entry
   2. "Lock" entry
   3. Copy entry to a new path on the current staging token
3. Delete:
   1. Hard-delete the object from the store

In this situation we have an uncommitted entry which points to an invalid physical address.
**Resolved by locking the entry in the delete flow as well.**

#### 3. Concurrent Move

Concurrent moves should be blocked as they may create multiple pointers to the same physical address.
The most straightforward way to do so is to rely on the absence of the lock.
The problem with this approach is that permanently locked addresses (the lock is a protection mechanism in case of catastrophic failures)
can never be moved - which leads us to the stale lock problem.

#### 4. Stale lock problem

We use the lock mechanism to protect against race scenarios on Upload (override), Move and Delete. As such, an entry with a stale
lock will prevent us from performing any of these operations.
We can overcome this with a time-based lock - but this might present additional challenges to the proposed solution.
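For illustration only, one possible shape of such a time-based lock, replacing the boolean flag with an expiry timestamp; the field name and TTL are assumptions, and the stale-lock trade-offs above still apply:

```go
package staging

import "time"

// TimedValue replaces the boolean Locked flag with an expiry timestamp,
// so a crashed writer cannot hold the entry forever.
type TimedValue struct {
	Identity    []byte
	Data        []byte
	LockedUntil time.Time // zero value means unlocked
}

// locked reports whether the entry is still held at the given time.
func (v TimedValue) locked(now time.Time) bool {
	return now.Before(v.LockedUntil)
}

// withLock returns a copy of the entry locked for ttl; the caller would still
// write it back with the same SetIf compare-and-set as in the sketch above.
func (v TimedValue) withLock(now time.Time, ttl time.Duration) TimedValue {
	v.LockedUntil = now.Add(ttl)
	return v
}
```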