# Proposal: Staging Compaction

## Problem Description

The lakeFS staging area is built on top of the [KV store](../accepted/metadata_kv/index.md).
In short, the structure of the `Branch` entity in the kv-store is as follows:
```go
type Branch struct {
	CommitID     CommitID
	StagingToken StagingToken
	// SealedTokens - Staging tokens are appended to the front, this allows building the diff iterator easily
	SealedTokens []StagingToken
}
```

Uncommitted entries are read from the staging token first, then from the
sealed tokens in order. Writes are performed on the staging token.
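
For illustration, a minimal sketch of this lookup order; the `Value`, `getFromToken` and `ErrNotFound` names are placeholders, not the actual Graveler API:

```go
// getFromToken is a placeholder for looking up a key under a single staging token.
type getFromToken func(token StagingToken, key []byte) (*Value, error)

// getUncommitted checks the staging token first, then the sealed tokens in order.
// A returned tombstone value means the key was deleted on the branch.
func getUncommitted(get getFromToken, b Branch, key []byte) (*Value, error) {
	tokens := append([]StagingToken{b.StagingToken}, b.SealedTokens...)
	for _, token := range tokens {
		value, err := get(token, key)
		if err == nil {
			return value, nil
		}
		if !errors.Is(err, ErrNotFound) {
			return nil, err
		}
	}
	return nil, ErrNotFound
}
```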

The KV design has proven to meet lakeFS requirements for its consistency
guarantees. For most use-cases of uncommitted areas in lakeFS, it does so
efficiently. However, there are some cases where it falls short, for example
this [issue](https://github.com/treeverse/lakeFS/issues/2092).  
There might be other cases where the structure of the `Branch` entity impacts
the performance of reading from the staging area, for example when the
number of sealed tokens is large (N)[^1] and reading a missing entry requires
reading from all N sealed tokens.

## Goals

- Preserve lakeFS consistency guarantees for all scenarios.
- A more efficient way to read from branches with a large number of
  tombstones (i.e., fix the issue mentioned above).
- Preserve lakeFS performance for other scenarios.[^2]


## Non Goals

- Improve the performance of the KV store in general.
- Simplify the uncommitted area model in lakeFS. While we'll try to keep the
  changes minimal, we might need to add some complexity to the model to
  achieve the goals.

## Proposed Design

We propose to add a compaction mechanism to the staging area. We'll explain
how to decide when to compact (Sensor), how to compact (Compactor), the
data read/write flows, and the commit flow after compaction.

```go
type Branch struct {
	CommitID     CommitID
	StagingToken StagingToken
	// SealedTokens - Staging tokens are appended to the front, this allows building the diff iterator easily
	SealedTokens []StagingToken

	CompactedMetaRangeID MetaRangeID
	CompactionStartTime  time.Time
}
```

### Sensor

The Sensor is responsible for deciding when to compact the staging area.
The Sensor will be linked to the Graveler and will collect
information on writes to the staging area. It will decide when to compact a
certain branch based on the number of deleted entries in its staging area,
taking into account operations like commits/resets etc. It can also decide
based on the number of sealed tokens, although that is not necessarily a
priority for the first version. We can probably avoid the need to query the
kv-store to retrieve that information (except at service startup) by caching
the information in the Sensor.
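
A minimal sketch of such a Sensor, keeping per-branch counters in memory; the threshold, key format and trigger callback are illustrative assumptions, not part of the proposal:

```go
type Sensor struct {
	mu         sync.Mutex
	tombstones map[string]int // "repo/branch" -> tombstones staged since the last successful commit/reset
	threshold  int
	trigger    func(repository, branch string) // asks the Compactor to compact the branch
}

// RecordDelete is called by Graveler for every tombstone written to a staging token.
func (s *Sensor) RecordDelete(repository, branch string) {
	key := repository + "/" + branch
	s.mu.Lock()
	s.tombstones[key]++
	shouldCompact := s.tombstones[key] >= s.threshold
	s.mu.Unlock()
	if shouldCompact {
		s.trigger(repository, branch)
	}
}

// RecordHeadChange resets the counter when a commit/reset succeeds, since those
// operations drop the staged tombstones anyway.
func (s *Sensor) RecordHeadChange(repository, branch string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.tombstones, repository+"/"+branch)
}
```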

Notice there's a single Sensor for each lakeFS instance. While the following
sections describe why concurrent compactions don't harm consistency, they may
be very inefficient. Therefore, a Sensor deciding to compact will only do so
if the branch's `CompactionStartTime` is not within the last x<TBD> minutes,
using `SetIf` to minimize collisions (although not avoiding them completely
due to clock skews).
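
A sketch of that guard; `compactionMinInterval` is a hypothetical configuration value, and persisting the returned branch with `SetIf` is left to the existing branch-update machinery:

```go
// tryMarkCompactionStart returns an updated copy of the branch if a compaction
// may start now, or nil if another compaction started within compactionMinInterval.
// The caller persists the copy with SetIf against the branch version it read,
// so concurrent Sensors lose the race cleanly.
func tryMarkCompactionStart(b *Branch, now time.Time, compactionMinInterval time.Duration) *Branch {
	if now.Sub(b.CompactionStartTime) < compactionMinInterval {
		return nil // someone else (probably) started compacting recently
	}
	updated := *b
	updated.CompactionStartTime = now
	return &updated
}
```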

### Compactor

Upon deciding to compact, the Sensor will trigger the Compactor. The
Compactor will perform an operation very similar to Commit. Before starting
to compact, it will atomically:

1. Push a new `StagingToken` to the branch.
1. Add the old `StagingToken` to the `SealedTokens` list.
1. Set the `CompactionStartTime` to the current time.

The Compactor will now read the branch's `SealedTokens` in order and apply
them onto either the compacted metarange (if `CompactedMetaRangeID` exists) or
the branch's HEAD CommitID MetaRange to create a new MetaRange. The Compactor
will then atomically update the branch entity:

1. Set the `CompactedMetaRangeID` to the new MetaRangeID.
1. Remove the compacted `SealedTokens` from the branch.

The Compactor should fail the operation if the sealed tokens have changed
since the compaction started (`SetIf`). Although the algorithm can tolerate
additional sealed tokens being added during the compaction, it's better to
avoid competing with concurrent commits. Commits have the same benefits as
compactions, but they are proactively triggered by the user (and might fail
if compaction succeeds).
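
A rough sketch of the flow; `applyToken`, `updateBranchIf` and the `Commit` type stand in for the existing commit/apply machinery and are not actual lakeFS function names:

```go
// compact applies the branch's sealed tokens onto the base metarange and swaps
// the result into the branch atomically, failing if the branch changed meanwhile.
func compact(b *Branch, head *Commit) (MetaRangeID, error) {
	// Base: the previous compaction result if there is one, otherwise the
	// metarange of the branch's HEAD commit.
	base := b.CompactedMetaRangeID
	if base == "" {
		base = head.MetaRangeID
	}

	// Apply sealed tokens oldest-to-newest (they are kept newest-first),
	// exactly like Commit applies staged changes.
	metaRange := base
	for i := len(b.SealedTokens) - 1; i >= 0; i-- {
		var err error
		metaRange, err = applyToken(metaRange, b.SealedTokens[i])
		if err != nil {
			return "", err
		}
	}

	// SetIf-style update: fails if the sealed tokens changed since compaction
	// started, e.g. because a concurrent commit completed first.
	updated := *b
	updated.CompactedMetaRangeID = metaRange
	updated.SealedTokens = nil
	if err := updateBranchIf(&updated, b); err != nil {
		return "", err
	}
	return metaRange, nil
}
```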

Consistency guarantees for the compaction process are derived from the
Commit operation's consistency guarantees. In short, the compaction process
follows the same principles as the Commit operation: a write to a branch's
staging token succeeds only if that token is still the branch's staging token
after the write occurred. Therefore, replacing the sealed tokens with an
equivalent MetaRange guarantees that no successful writes go missing.

### Branch Write Flow

No change here. Writes are performed on the staging token.

### Branch Read Flow

Reading from a branch includes both a specific entry lookup and a listing
request on entries of a branch. The read flow is more complicated than what
we have today, and is different for compacted and uncompacted branches.

#### Combined (Committed + Uncommitted)

There are 3 layers of reading from a branch:
1. Read from the staging token.
2. Read from the sealed tokens (in order).
3. Depending on compaction (see the sketch below):
   1. If a `CompactedMetaRangeID` exists, read from the compacted MetaRange.
   1. If a `CompactedMetaRangeID` doesn't exist, read from the CommitID.
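
For layer 3, a small sketch of how the committed base for a combined read would be chosen; the `Commit` type is assumed to carry the commit's `MetaRangeID`:

```go
// combinedReadBase returns the metarange that layer 3 of a combined read uses:
// the compacted metarange if one exists, otherwise the HEAD commit's metarange.
func combinedReadBase(b *Branch, head *Commit) MetaRangeID {
	if b.CompactedMetaRangeID != "" {
		return b.CompactedMetaRangeID
	}
	return head.MetaRangeID
}
```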

#### Committed

Just like today, reads are performed on the Branch's CommitID.

#### Uncommitted

For operations such as `DiffUncommitted` or checking if the staging
area is empty, the read flow will be as follows (see the sketch below):

1. Read from the staging token.
2. Read from the sealed tokens (in order).
3. If a `CompactedMetaRangeID` exists, read the 2-way diff between the compacted
   metarange and the CommitID's metarange.
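
A sketch of that last step, assuming a `diffMetaRanges` helper equivalent to the existing committed 2-way diff and a `DiffIterator` result type:

```go
// uncommittedCompactedDiff covers step 3: on a compacted branch, staged changes
// that were folded into the compacted metarange show up as the 2-way diff from
// the HEAD commit's metarange to the compacted one.
func uncommittedCompactedDiff(b *Branch, head *Commit) (DiffIterator, error) {
	if b.CompactedMetaRangeID == "" {
		return nil, nil // nothing beyond the staging and sealed tokens
	}
	return diffMetaRanges(head.MetaRangeID, b.CompactedMetaRangeID)
}
```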

* There's an inefficiency here, as we might need to read two whole
  metaranges to compute the diff, for example when there's a single change in
  every range. However, the nature of changes to a lakeFS branch is such that
  changes are expected in a small number of ranges, and the diff operation is
  expected to skip most ranges. Moreover, Pebble caching the ranges should
  remove the most costly part of comparing ranges: fetching them from S3. If
  this is still inefficient, we can use the immutability trait[^3] of the
  diff: we can calculate the diff result once and cache it.

### Commit Flow

The commit flow is slightly affected by the compaction process. If
compaction never happened, the commit flow is the same as today. If a
compaction happened, the commit applies the staged changes to the compacted
metarange instead of the HEAD commit's metarange. A successful commit resets
the `CompactedMetaRangeID` field.
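
A sketch of the branch-entity update on a successful commit (staging-token rotation omitted); clearing `CompactedMetaRangeID` is the new part:

```go
// onCommitSuccess sketches the branch update after a successful commit: the
// branch points at the new commit, the sealed tokens it covered are dropped,
// and the compaction state is reset so reads go through the new HEAD.
func onCommitSuccess(b *Branch, newCommit CommitID) Branch {
	updated := *b
	updated.CommitID = newCommit
	updated.SealedTokens = nil
	updated.CompactedMetaRangeID = "" // reset: the compacted changes are now committed
	return updated
}
```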

## Metrics

We can collect the following Prometheus metrics:
- Compaction time.
- Number of sealed tokens.
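
For example, with the Prometheus client library (metric names and labels here are placeholders):

```go
// import "github.com/prometheus/client_golang/prometheus"
var (
	compactionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "graveler_staging_compaction_duration_seconds",
		Help: "Time it took to compact a branch's staging area.",
	})
	sealedTokensCount = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "graveler_branch_sealed_tokens",
		Help: "Number of sealed tokens currently held by a branch.",
	}, []string{"repository", "branch"})
)

func init() {
	prometheus.MustRegister(compactionDuration, sealedTokensCount)
}
```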

[^1]: The number of sealed tokens increases by one for every HEAD-changing
operation (commit, merge, etc.) on the branch. The number of sealed tokens
resets to 0 when one of these operations succeeds. Therefore, a large number
of sealed tokens is unlikely and is a sign of a bigger problem, like the
inability to commit.
[^2]: Any additional call to the kv-store might have an indirect performance
impact, as in the case of DynamoDB throttling. We'll treat a performance
impact as a change of flow with a direct impact on a user operation.
[^3]: A diff between two immutable metaranges is also immutable.