# Hard Delete Uncommitted Data

## Decision

No solution was found that adequately covers all scenarios and edge cases. We decided to shelve the effort to find an
online solution for now and concentrate on the offline solution.

## Motivation

Uncommitted data which is no longer referenced (due to branch deletion, branch reset, etc.) is not deleted by lakeFS.
This may result in excessive storage usage and possible compliance issues.
To solve this, two approaches were suggested:
1. A batch operation performed as part of an external process (GC)
2. An online solution inside lakeFS

**This document details the latter.**

Several approaches were considered and are described in this document. All proposals suffer from unresolved design issues,
so no specific approach was selected as a solution.

### Base assumptions

The following assumptions hold for all the suggested proposals:
1. The copy operation needs to change and be implemented as a full copy (*irrelevant for proposal no. 3)
2. lakeFS is not responsible for the retention of data outside the repository path (data ingest)
3. An online solution is not bulletproof, and will require a complementary external solution to handle edge cases as well as
existing uncommitted garbage.

## 1. Write staged data directly to object store on a defined path

This proposal suggests moving the staging area management from the ref store to the object store and defining a structured path for
the object's physical address on the object store.

### Performance degradation when using the object store

Because this proposal handles the staging data in the object store, it suffers significant performance
degradation in one of lakeFS's principal flows - **listing**. Listing is used in flows such as listing branch objects, getting objects and committing.
We tested listing ~240,000 **staged objects** using `lakectl fs ls` and compared it with `aws s3 ls`, with the following results:
> ./lakectl fs ls lakefs://niro-bench/main/guy-repo/itai-export/medium/ >   6.42s user 0.78s system 44% cpu 16.089 total

> aws s3 ls s3://guy-repo/itai-export/medium/ > /dev/null  56.82s user 2.34s system 74% cpu 1:19.79 total

### Design

Objects will be stored in a path relative to the branch and staging token. For example: file `x/y.z` will be uploaded to path
`<bucket-name>/<repo-name>/<branch-name>/<staging-token>/x/y.z`.  
As a result, the uncommitted objects are exactly the objects under the paths of staging tokens which were not yet committed.  
When deleting a branch, each object under the branch's current staging token can be deleted.
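
For illustration, a minimal sketch of deriving such a physical address (the helper name and exact layout are ours, not an existing lakeFS API):

```go
package staging

import "fmt"

// stagedObjectPath builds the physical address for a staged object under the
// structured layout described above. Because every write for a staging token
// lands under one prefix, dropping an uncommitted staging token becomes a
// simple prefix delete on the object store.
func stagedObjectPath(bucket, repo, branch, stagingToken, logicalPath string) string {
	return fmt.Sprintf("%s/%s/%s/%s/%s", bucket, repo, branch, stagingToken, logicalPath)
}

// Example: stagedObjectPath("bucket-name", "repo-name", "branch-name", "st-123", "x/y.z")
// returns "bucket-name/repo-name/branch-name/st-123/x/y.z".
```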

### Blockers

1. Commit flow is not atomic:
   1. Start write on staged object
   2. Start commit
   3. Commit flow writes the current object to the committed tables
   4. Write to the object completes - changing the data after it was already committed

### Opens
1. Solve the blocker - how to prevent data modification after commit?
2. How to manage ingested data?
3. How to handle SymLink and stageObject operations?

## 2. Reference Counter on Object Metadata

This proposal suggests maintaining a reference counter in the underlying object's metadata and using the different block adapters'
capabilities to perform an atomic, concurrency-safe increment of this counter.
The way we handle the staged data in lakeFS will not change in any other way.
**The proposal is not viable: although Google Storage and Azure provide this capability, AWS S3 does not.**

This solution incurs an additional metadata read (and possibly a write) in selected operations when the object is in the repo namespace.

### Blockers

1. AWS does not support conditional set of metadata

### Flows

#### Upload Object

1. Add a reference counter to the object's metadata and set it to one.
2. Write the object to the object store the same way we do today.

#### Stage Object

1. If the object path is in the repo namespace:
   1. Read its metadata:
      1. If counter > 0, increment the counter and update the metadata on the object. Otherwise, treat the object as deleted.
   2. If the object's write to the object store fails - roll back the metadata update.
2. If the object path is outside the repo namespace:
   1. lakeFS will not handle its retention - therefore it will have to be deleted by the
   user or by a dedicated GC process

#### LinkPhysicalAddress

Assume this API uses a physical address in the repo namespace:
1. Read the object's metadata
2. Increment the counter and write the metadata
3. Add a staging entry

#### Delete Object

1. Read the object's metadata
2. If a reference counter exists:
   1. Decrement the counter
   2. If counter == 0
      1. Hard-delete the object
3. If a reference counter doesn't exist:
   1. Retention is not handled by lakeFS

#### Reset / Delete Branch

When resetting a branch we throw away all the uncommitted branch data.
We can leverage the new KV async-drop functionality to also check and hard-delete objects as needed.

### Atomic reference counter

For each storage adapter (AWS, GC, AZ), we will use built-in capabilities to provide this functionality.

#### AWS

AWS doesn't allow updating an object or its metadata after creation. In order to update the metadata, an object copy must be performed with the new metadata values. The only
conditionals currently available are on the object's timestamp and ETag. Unfortunately, ETags change only when the object's content changes and do not take metadata changes into account.

#### Azure

Store the reference counter in the blob's metadata and use the `set blob metadata` API with the supported conditional header on the ETag to perform an If-Match operation and update
the reference counter.

#### Google Store

Store the reference counter in the blob's metadata and use the `patch object` API with the `ifMetagenerationMatch` conditional.
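
As an illustration of the viable adapters, here is a sketch of the Google variant using the `cloud.google.com/go/storage` client; the `ref_count` metadata key and the caller-side retry on a failed precondition are assumptions of this sketch:

```go
package refcount

import (
	"context"
	"strconv"

	"cloud.google.com/go/storage"
)

// incrementRefCount reads the blob's metadata and writes back an incremented
// reference counter, guarded by MetagenerationMatch: if another writer updated
// the metadata in between, the precondition fails and the caller should retry.
func incrementRefCount(ctx context.Context, client *storage.Client, bucket, key string) error {
	obj := client.Bucket(bucket).Object(key)
	attrs, err := obj.Attrs(ctx)
	if err != nil {
		return err
	}
	count, _ := strconv.Atoi(attrs.Metadata["ref_count"]) // assumed metadata key
	_, err = obj.If(storage.Conditions{MetagenerationMatch: attrs.Metageneration}).
		Update(ctx, storage.ObjectAttrsToUpdate{
			Metadata: map[string]string{"ref_count": strconv.Itoa(count + 1)},
		})
	return err
}
```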

## 3. Tracking references in staging manager

The staging manager will track physical addresses of staged objects and the staging tokens referencing them.
We will introduce a new object to the database:

      key    = <repo>/references/<physical_address>/<staging_token>  
      value  = state enum (staged|deleted|committed)
               last_modified timestamp

On upload of an object to a branch, we will add a key under the references path with the physical address and the current staging token and
mark it as `staged`.  
On delete of an object from a staging area, we will update the reference key with the value `deleted`.  
On commit, we will update the reference key with the value `committed`.

A background job will be responsible for scanning the references prefix and handling the references according to their state.
The `last_modified` parameter is used to prevent race conditions between reference candidates for deletion and ongoing operations
which might add references for the physical address. The assumption is that once all references of a physical address
have been in the `deleted` state for a certain amount of time (TBD), the physical address cannot be referenced anymore (all staging tokens were either dropped or committed).

### Background delete job (pseudo-code)

1. Scan the `references` prefix (can use an `after` prefix for load balancing)
2. For each `physical_address`, read all entries
      1. If state == `committed` is found in any entry
         1. Delete all keys for `physical_address`, in order of state: deleted -> staged -> committed
      2. If state == `deleted` in all entries and min(`last_modified`) < `TBD`
         1. Hard-delete the object
         2. Delete all keys for `physical_address`
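
A sketch of this job's per-address decision in Go, with stand-in types (the KV access, the deletion order and the TBD grace period are placeholders, not the actual lakeFS implementation):

```go
package refcleanup

import "time"

type State string

const (
	StateStaged    State = "staged"
	StateDeleted   State = "deleted"
	StateCommitted State = "committed"
)

// Reference is one <physical_address>/<staging_token> entry under the references prefix.
type Reference struct {
	State        State
	LastModified time.Time
}

// gracePeriod stands in for the TBD window that must pass before all-deleted
// references are considered unreachable.
const gracePeriod = 24 * time.Hour

// sweepAddress applies the pseudo-code above to all references of one physical
// address. deleteKeys and hardDelete abstract the KV delete (in state order:
// deleted -> staged -> committed) and the object-store delete, respectively.
func sweepAddress(addr string, refs []Reference, deleteKeys, hardDelete func(addr string) error) error {
	if len(refs) == 0 {
		return nil
	}
	for _, r := range refs {
		if r.State == StateCommitted {
			// Committed wins: drop the reference keys, keep the object.
			return deleteKeys(addr)
		}
	}
	for _, r := range refs {
		if r.State != StateDeleted || time.Since(r.LastModified) < gracePeriod {
			return nil // still (potentially) referenced - revisit on the next scan
		}
	}
	// All references are deleted and old enough: the object is unreachable.
	if err := hardDelete(addr); err != nil {
		return err
	}
	return deleteKeys(addr)
}
```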

### Key Update Order of Precedence

1. The `committed` state takes precedence over all (terminal state - overrides any other value) - uses **Set**
2. The `deleted` state can only be set on entries which are not `committed` and uses **SetIf**
3. The `staged` state can only be set on entries which are not `committed` and uses **SetIf**
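
A sketch of these rules over a stand-in KV interface (modeled on a `Set`/`SetIf` compare-and-swap API, not the actual lakeFS `pkg/kv` signatures; the `last_modified` field is omitted):

```go
package refstate

import "context"

// Store is a stand-in KV interface: SetIf writes value only if the key's
// current value equals pred, i.e. compare-and-swap.
type Store interface {
	Set(ctx context.Context, key, value []byte) error
	SetIf(ctx context.Context, key, value, pred []byte) error
}

// markCommitted is the terminal transition, so a blind Set is safe.
func markCommitted(ctx context.Context, s Store, key []byte) error {
	return s.Set(ctx, key, []byte("committed"))
}

// markDeleted must not clobber a committed entry: it swaps staged -> deleted
// and fails if the entry changed underneath us, leaving the caller to decide.
func markDeleted(ctx context.Context, s Store, key []byte) error {
	return s.SetIf(ctx, key, []byte("deleted"), []byte("staged"))
}
```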

### Flows

#### Upload Object

1. Write the blob
2. Write a reference entry to the database as `staged`
3. Add a staged entry to the database
4. If an entry exists in the current staging token - mark the old reference as deleted

**Open:** Efficiently deal with overrides

#### Stage Object

1. If the object path is in the repo namespace:
   1. Add a reference entry to the database
2. If the object path is outside the repo namespace:
   1. lakeFS will not handle its retention
3. Add a staged entry

#### GetPhysicalAddress

GetPhysicalAddress will add a reference to the generated unique physical address with state `deleted`.
We will then return a valid token / expiry timestamp to the user in addition to the physical address.
The user will need to pass the token to LinkPhysicalAddress.

#### LinkPhysicalAddress

Add a validation that checks whether the provided token / timestamp has expired.
Set the reference state to `staged`.

Assume this API uses a physical address in the repo namespace:
1. Add a reference entry to the database
2. Add a staging entry

#### Delete Object

1. If the object is staged
   1. Read the reference key
   2. If not `committed`
      1. Change the reference state to `deleted` and update `last_modified`

#### Commit

1. Mark all entries in staging area as `committed`
2. Swap staging token
3. Create commit
4. Update branch commit ID

#### Reset / Delete Branch

When resetting a branch we throw away all the uncommitted branch data; this operation happens asynchronously via `AsyncDrop`.
We can leverage this asynchronous job to also check and perform the hard delete:
1. Scan the deleted staging token's entries
2. If the entry is a `tombstone`, remove the entry's reference key
3. Otherwise, modify the reference state to `deleted` and update `last_modified`

## 4. Staging Token States

This is an improvement over the 3rd proposal: it suggests tracking the state on the staging token instead of on objects,
while still keeping track of references (without state).

This will be done using the following objects:

References will be tracked in a list instead of an explicit entry per reference:

    key    = <repo>/references/<physical_address>  
    value  = list(staging tokens)

For each staging token of a deleted / reset branch, a deleted entry will be added:

    <repo>/staging/deleted/<staging_token>

For each commit we will add the staging tokens to a committed entry.
Adding a reference is done using `set if` and a retry mechanism to ensure consistency (see the sketch below).

    key    = <repo>/staging/committed/<physical_address>  
    value  = list(staging tokens)
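
A sketch of the `set if` retry loop for adding a reference, over stand-in KV types (the version-predicate semantics and the retry limit are assumptions of this sketch):

```go
package reflist

import (
	"context"
	"errors"
)

var ErrPredicateFailed = errors.New("predicate failed")

// Store is a stand-in KV interface: SetIf fails with ErrPredicateFailed when
// the key was modified after it was read at the given version.
type Store interface {
	Get(ctx context.Context, key string) (tokens []string, version int64, err error)
	SetIf(ctx context.Context, key string, tokens []string, version int64) error
}

const maxRetries = 3

// addReference appends stagingToken to the reference list of physicalAddress,
// retrying on concurrent updates so no token is lost.
func addReference(ctx context.Context, s Store, physicalAddress, stagingToken string) error {
	key := "references/" + physicalAddress // <repo>/ partition omitted for brevity
	for i := 0; i < maxRetries; i++ {
		tokens, version, err := s.Get(ctx, key)
		if err != nil {
			return err
		}
		err = s.SetIf(ctx, key, append(tokens, stagingToken), version)
		if !errors.Is(err, ErrPredicateFailed) {
			return err // nil on success, or a non-retryable error
		}
	}
	return errors.New("add reference: retries exhausted")
}
```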

### Flows

#### Upload Object

1. Write the blob
2. Read the reference entry
3. Perform `set if` on the reference with the additional staging token
4. Add a staged entry to the database

* Delete object will not remove the reference
* References to objects are kept as long as the staging token is not deleted or committed.

#### Delete Object

The delete object flow will remain the same. We do not delete references in this flow.

#### Stage Object

Allow stage object operations only on addresses outside the repository namespace.
lakeFS will not manage the retention of these objects.

#### GetPhysicalAddress

Provide a validation token with the issued physical address. The issued address can be used only once, thus
we can assume the given address is not committed or referenced anywhere else.
Once an issued address has been used by LinkPhysicalAddress, we will delete the token ID from the list of valid tokens.

1. Issue a JWT token with the physical address
2. Save the token ID in the DB
3. Return the token with the physical address to the user
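
A sketch of the token issuance using `github.com/golang-jwt/jwt/v5` and `github.com/google/uuid`; the claim names, TTL and HMAC signing are assumptions of this sketch, not the lakeFS implementation:

```go
package linktoken

import (
	"time"

	"github.com/golang-jwt/jwt/v5"
	"github.com/google/uuid"
)

// issueAddressToken signs a JWT that binds a one-time token ID to the issued
// physical address. The token ID is what gets saved in the DB and removed by
// LinkPhysicalAddress, making the address single-use.
func issueAddressToken(secret []byte, physicalAddress string) (tokenID, signed string, err error) {
	tokenID = uuid.NewString()
	claims := jwt.MapClaims{
		"jti":              tokenID,                          // token ID saved in the DB
		"physical_address": physicalAddress,                  // address this token is valid for
		"exp":              time.Now().Add(time.Hour).Unix(), // illustrative expiry
	}
	signed, err = jwt.NewWithClaims(jwt.SigningMethodHS256, claims).SignedString(secret)
	return tokenID, signed, err
}
```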

#### LinkPhysicalAddress

1. Given a physical address and token:
   1. Check token validity
   2. If the token is valid
      1. Remove the token ID
      2. Add a reference
      3. Add an entry in the staging area

#### CopyObject

The new flow ensures that we do not add a reference to a committed object after we have removed all its references as part of
the background delete job, preventing the accidental deletion of committed objects. A sketch follows the flow below.

1. While retries are not exhausted:
   1. Read the reference key; if the key is found
      1. Read the reference list for the address; if the list is empty
         1. Assume the address is deleted - return an error  
         (the only way it is empty is if the background job is currently working on this reference during the delete flow)
      2. Else - add the branch staging token to the reference list and perform `set if`
         1. If the predicate failed - retry
   2. If the key is not found - assume the address was committed - perform the copy without adding a new reference
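
A sketch of this branching, using the same stand-in KV types as the earlier reference sketch (`ErrNotFound` and `ErrAddressDeleted` are illustrative):

```go
package copyflow

import (
	"context"
	"errors"
)

var (
	ErrNotFound        = errors.New("key not found")
	ErrPredicateFailed = errors.New("predicate failed")
	ErrAddressDeleted  = errors.New("address is being deleted")
)

// Store is a stand-in KV interface with versioned reads and compare-and-swap.
type Store interface {
	Get(ctx context.Context, key string) (tokens []string, version int64, err error)
	SetIf(ctx context.Context, key string, tokens []string, version int64) error
}

const maxRetries = 3

// copyReference adds the destination branch's staging token to the reference
// list of physicalAddress, or detects the committed / mid-delete cases.
func copyReference(ctx context.Context, s Store, physicalAddress, stagingToken string) error {
	key := "references/" + physicalAddress
	for i := 0; i < maxRetries; i++ {
		tokens, version, err := s.Get(ctx, key)
		if errors.Is(err, ErrNotFound) {
			return nil // address was committed: copy without adding a reference
		}
		if err != nil {
			return err
		}
		if len(tokens) == 0 {
			// Only the background delete job empties the list mid-delete.
			return ErrAddressDeleted
		}
		err = s.SetIf(ctx, key, append(tokens, stagingToken), version)
		if !errors.Is(err, ErrPredicateFailed) {
			return err
		}
	}
	return errors.New("copy reference: retries exhausted")
}
```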

#### Commit

1. Perform the commit flow in the same manner as today
2. On successful commit - create an entry using `commit_id` and an ordered list of tokens under the `committed` prefix

#### Reset / Delete Branch

1. Perform the reset / delete flow in the same manner as today
2. On successful execution - create entries for all staging tokens under the `deleted` prefix

### Background delete job

* Scan committed tokens
    * For each committed object - remove the reference key
    * For each staging token in the committed entry - "move staging token to deleted":  
      all objects in committed staging tokens that were not actually committed are candidates for deletion, therefore
      we can either execute the deleted tokens logic, or create entries for these tokens under the deleted prefix (implementation detail)
    * Remove the `staging/committed` key
    * Delete the staging area
* Scan deleted tokens
    * For each token - remove the references for all objects on the staging token
    * If it is the last reference for that object (the list is empty after the update) - perform a hard delete
    * Delete the staging area

### Handling key override on same staging token - improvement

TBD

## 5. No Copies

The idea is to make sure we do not enable copies: when no two staging entries point to the same physical object, we can delete the physical address in the following cases:

- Upload - get the previous entry and delete the previous physical address, unless the staging token was updated
- Revert - branch / prefix / object; when the entry is dropped we can delete the physical address
- Delete - repo/branch/object; delete the physical address after the entry no longer exists

To enable the physical delete, we should do the following:

- Make sure the same physical address cannot be reused, by signing the links we return in our API. The logical path will be part of the signature, which will be part of the physical address we generate. This will enable us to block any use of a physical address outside the repo/branch/entry.
- The S3 gateway `put` operation with copy support will use the underlying adapter to copy the data. This will require each adapter to support/emulate a 'copy' operation.
- (optional) Enable a 'move' API as an alternative to 'copy'. We have seen that moving objects from one location to another by copy+delete of metadata will enable easy support for the lakeFSFS move. This can be implemented by marking the entry on staging as 'locked'. The move operation is then lock+copy+delete+unlock. In case of any failure, as we don't have transactions, we may keep two entries, or keep a locked object without ever deleting its physical address.

TODO(Barak): list more cases and how they will be addressed in this solution.

### Object Locking

Add a new field to the staging area struct to indicate whether the entry is currently locked:
```
type Value struct {
	Identity []byte // identity of the staged entry
	Data     []byte // serialized entry data
	Locked   bool   // true while a move/delete/upload flow holds the entry
}
```
We can then use `SetIf` to ensure a concurrency-safe locking mechanism.
When deleting a staged entry (whether by DropKey, ResetBranch or DeleteBranch) we will perform a hard delete of the object only if
the entry is not locked.
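
A sketch of taking the lock with `SetIf` (stand-in KV interface; how `Value` is serialized and compared is elided):

```go
package staginglock

import (
	"context"
	"errors"
)

// Value mirrors the staging entry struct above.
type Value struct {
	Identity []byte
	Data     []byte
	Locked   bool
}

// Store is a stand-in KV interface: SetIf succeeds only if the stored value
// still equals pred, so two flows cannot both take the lock.
type Store interface {
	Get(ctx context.Context, key string) (Value, error)
	SetIf(ctx context.Context, key string, value, pred Value) error
}

// lockEntry atomically flips Locked from false to true. A concurrent locker
// (or any concurrent update) makes SetIf fail, which the caller handles as
// "entry is busy".
func lockEntry(ctx context.Context, s Store, key string) error {
	cur, err := s.Get(ctx, key)
	if err != nil {
		return err
	}
	if cur.Locked {
		return errors.New("entry already locked")
	}
	locked := cur
	locked.Locked = true
	return s.SetIf(ctx, key, locked, cur)
}
```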

### Flows

#### Move operation
1. Get entry
2. "Lock" entry (`SetIf`)
3. Copy entry to new path on current staging token
4. Delete old entry
5. "Unlock" entry (new entry)

#### Upload Object

1. Get entry
2. "Lock" entry if exists (override scenario) (`SetIf`)
3. Write blob
4. Add staged entry (or update)
5. Delete physical address if previously existed

#### Delete Object
1. Get entry
2. "Lock" entry (`SetIf`) to protect from delete - move race
3. Delete staging entry
4. Hard-delete object

### Races and issues

#### 1. Commit - Move Race
1. Start Move:
   1. Get entry
   2. "Lock" entry
2. Start Commit:
   1. Change staging token
   2. Write commit data (the old entry path was written)
3. Move:
   1. Create new entry on new staging token
   2. Delete old entry
   3. "Unlock" physical address

Now we have an uncommitted entry pointing to a committed physical address. If we reset the branch, we will delete
committed data!

#### 2. Delete - Move Race
1. Start Delete:
   1. Get entry
   2. Check that it is not "locked"
2. Start Move:
   1. Get entry
   2. "Lock" entry
   3. Copy entry to new path on current staging token
3. Delete:
   1. Hard-delete object from store

In this situation we have an uncommitted entry which points to an invalid physical address.
**Resolved by locking the entry in the delete flow as well**

#### 3. Concurrent Move
Concurrent moves should be blocked, as they may create multiple pointers to the same physical address.
The most straightforward way to do so is to rely on the absence of the lock.
The problem with this approach is that permanently locked addresses (the lock is a protection mechanism in case of catastrophic failures)
can never be moved - which leads us to the stale lock problem.

#### 4. Stale lock problem
We use the lock mechanism to protect against race scenarios on Upload (override), Move and Delete. As such, an entry with a stale
lock will prevent us from performing any of these operations.
We can overcome this with a time-based lock - but this might present additional challenges to the proposed solution.
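
One possible shape for such a time-based lock is to replace the boolean with a deadline, so a stale lock heals itself once the deadline passes (illustrative only; clock skew across lakeFS instances is one of those additional challenges):

```go
package staginglock

import "time"

// Value with a lock deadline instead of a boolean: the entry counts as locked
// only while LockedUntil is in the future, so a crashed flow's lock expires
// on its own instead of blocking the entry forever.
type Value struct {
	Identity    []byte
	Data        []byte
	LockedUntil time.Time
}

func (v Value) isLocked(now time.Time) bool {
	return now.Before(v.LockedUntil)
}
```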