# Proposal: Staging Compaction

## Problem Description

The lakeFS staging area is built on top of the [KV store](../accepted/metadata_kv/index.md).
In short, the structure of the `Branch` entity in the kv-store is as follows:

```go
type Branch struct {
	CommitID     CommitID
	StagingToken StagingToken
	// SealedTokens - Staging tokens are appended to the front, this allows building the diff iterator easily
	SealedTokens []StagingToken
}
```

Uncommitted entries are read from the staging token first, then from the
sealed tokens in order. Writes are performed on the staging token.

The KV design has proven to meet lakeFS requirements for its consistency
guarantees. For most use-cases of uncommitted areas in lakeFS, it does so
efficiently. However, there are some cases where it falls short, for example
this [issue](https://github.com/treeverse/lakeFS/issues/2092).
There might be other cases where the structure of the `Branch` entity impacts
the performance of reading from the staging area, for example when the
number of sealed tokens is large (N)[^1] and reading a missing entry requires
reading from all N sealed tokens.

## Goals

- Preserve lakeFS consistency guarantees for all scenarios.
- Provide a more efficient way to read from branches with a large number of
  tombstones (i.e., fix the issue mentioned above).
- Preserve lakeFS performance for all other scenarios.[^2]


## Non Goals

- Improve the performance of the KV store in general.
- Simplify the uncommitted area model in lakeFS. While we'll try to keep the
  changes minimal, we might need to add some complexity to the model to
  achieve the goals.


## Proposed Design

We propose adding a compaction mechanism to the staging area. We'll explain
how we decide when to compact (Sensor), how we compact (Compactor), and what
the data read/write flows and the commit flow look like after compaction.

```go
type Branch struct {
	CommitID     CommitID
	StagingToken StagingToken
	// SealedTokens - Staging tokens are appended to the front, this allows building the diff iterator easily
	SealedTokens []StagingToken

	CompactedMetaRangeID MetaRangeID
	CompactionStartTime  time.Time
}
```

### Sensor

The Sensor is responsible for deciding when to compact the staging area.
It will be linked to the Graveler and will collect information on writes to
the staging area. It will decide when to compact a given branch based on the
number of entries deleted from its staging area, taking into account
operations like commits, resets, etc. It can also decide based on the number
of sealed tokens, although that is not necessarily a priority for the first
version. We can probably avoid querying the kv-store for this information
(except at service startup) by caching it in the Sensor.

Note that there is a single Sensor per lakeFS instance. While the following
sections explain why concurrent compactions don't harm consistency, they can
be very inefficient. Therefore, a Sensor that decides to compact will only
do so if the branch's `CompactionStartTime` is not within the last x<TBD>
minutes, using `SetIf` to minimize collisions (although not avoiding them
completely, due to clock skews). A sketch of such a Sensor follows.
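The following is a minimal sketch of a Sensor along these lines. It assumes a
simple in-memory tombstone counter per branch; all names here (`Sensor`,
`RecordWrite`, `tombstoneThreshold`, etc.) are illustrative and not part of
the existing Graveler code:

```go
package staging

import (
	"sync"
	"time"
)

// BranchID and the thresholds below are assumptions made for this sketch;
// they are not part of the existing graveler package.
type BranchID string

const (
	tombstoneThreshold = 10000           // assumed trigger point, to be tuned empirically
	compactionCooldown = 5 * time.Minute // placeholder for the x<TBD> minutes above
)

// Sensor caches per-branch tombstone counts in memory, so deciding whether
// to compact does not require a kv-store read on every write; the cache
// only needs to be rebuilt from the kv-store at service startup.
type Sensor struct {
	mu         sync.Mutex
	tombstones map[BranchID]int
}

func NewSensor() *Sensor {
	return &Sensor{tombstones: make(map[BranchID]int)}
}

// RecordWrite is invoked by Graveler for every write to a staging area and
// reports whether the branch became a compaction candidate. A tombstone is
// a delete entry written to the staging token.
func (s *Sensor) RecordWrite(branch BranchID, isTombstone bool) (shouldCompact bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if isTombstone {
		s.tombstones[branch]++
	}
	return s.tombstones[branch] >= tombstoneThreshold
}

// Reset is invoked when a commit, reset or compaction succeeds, since these
// operations replace the staging data the counter was tracking.
func (s *Sensor) Reset(branch BranchID) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.tombstones, branch)
}
```

Before actually triggering a compaction, the Sensor would still compare the
branch's `CompactionStartTime` against the cooldown and advance it with
`SetIf`, so that multiple lakeFS instances rarely compact the same branch at
the same time.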
### Compactor

Upon deciding to compact, the Sensor will trigger the Compactor. The
Compactor will perform an operation very similar to Commit. Before starting
to compact, it will atomically:

1. Push a new `StagingToken` to the branch.
1. Add the old `StagingToken` to the `SealedTokens` list.
1. Set the `CompactionStartTime` to the current time.

The Compactor will then read the branch's `SealedTokens` in order and apply
them either to the compacted metarange, if a `CompactedMetaRangeID` exists,
or to the metarange of the branch's HEAD CommitID, creating a new metarange.
The Compactor will then atomically update the branch entity:

1. Set the `CompactedMetaRangeID` to the new MetaRangeID.
1. Remove the compacted `SealedTokens` from the branch.

The Compactor should fail the operation if the sealed tokens have changed
since the compaction started (`SetIf`). Although the algorithm can tolerate
additional sealed tokens being added during the compaction, it's better to
avoid competing with concurrent commits. Commits have the same benefits as
compactions, but they are proactively triggered by the user (and might fail
if the compaction succeeds).

Consistency guarantees for the compaction process are derived from the
consistency guarantees of the Commit operation. In short, the compaction
process follows the same principles as the Commit operation: a write to a
branch's staging token succeeds only if that token is still the branch's
staging token after the write occurred. Therefore, replacing the sealed
tokens with an equivalent metarange guarantees that no successful writes go
missing.

### Branch Write Flow

No change here. Writes are performed on the staging token.

### Branch Read Flow

Reading from a branch includes both looking up a specific entry and listing
the entries of a branch. The read flow is more complicated than what we have
today, and differs between compacted and uncompacted branches.

#### Combined (Committed + Uncommitted)

There are 3 layers of reading from a branch:
1. Read from the staging token.
2. Read from the sealed tokens (in order).
3. Depending on compaction:
   1. If a `CompactedMetaRangeID` exists, read from the compacted metarange.
   1. If a `CompactedMetaRangeID` doesn't exist, read from the CommitID.

#### Committed

Just like today, reads are performed on the branch's CommitID.

#### Uncommitted

For operations such as `DiffUncommitted` or checking whether the staging
area is empty, the read flow will be as follows:

1. Read from the staging token.
2. Read from the sealed tokens (in order).
3. If a `CompactedMetaRangeID` exists, read the 2-way diff between the
   compacted metarange and the CommitID's metarange.

* There's an inefficiency here: in some cases we'll need to read two whole
  metaranges to compute the diff, e.g. when there's a single change in every
  range. However, the nature of changes to a lakeFS branch is such that
  changes are expected in a small number of ranges, and the diff operation is
  expected to skip most ranges. Moreover, Pebble caching the ranges should
  remove the most costly part of comparing ranges: fetching them from S3. If
  this is still inefficient, we can use the immutability trait[^3] of the
  diff: we can calculate the diff result once and cache it.
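To make the layered read flow concrete, here is a minimal sketch of a per-key
lookup. The interfaces and types (`TokenReader`, `MetaRangeReader`, the
string-typed IDs) are illustrative stand-ins rather than the actual graveler
API, and the sketch assumes tombstones are stored as nil values:

```go
package staging

import "errors"

// ErrNotFound is returned when a key is absent from every layer.
var ErrNotFound = errors.New("not found")

type Value []byte

// TokenReader looks a key up under a single staging token; by this sketch's
// convention it returns a nil Value for a tombstone.
type TokenReader interface {
	Get(token, key string) (Value, error)
}

// MetaRangeReader looks a key up in a committed (or compacted) metarange.
type MetaRangeReader interface {
	Get(metaRangeID, key string) (Value, error)
}

type Branch struct {
	CommitMetaRangeID    string
	StagingToken         string
	SealedTokens         []string
	CompactedMetaRangeID string // empty if the branch was never compacted
}

// Get resolves a key on a branch: staging token first, then sealed tokens in
// order, then the compacted metarange if one exists, else the HEAD commit.
func Get(tokens TokenReader, ranges MetaRangeReader, b Branch, key string) (Value, error) {
	for _, token := range append([]string{b.StagingToken}, b.SealedTokens...) {
		v, err := tokens.Get(token, key)
		if errors.Is(err, ErrNotFound) {
			continue // not staged under this token, fall through to the next layer
		}
		if err != nil {
			return nil, err
		}
		if v == nil {
			return nil, ErrNotFound // tombstone: the key was deleted on this branch
		}
		return v, nil
	}
	base := b.CompactedMetaRangeID
	if base == "" {
		base = b.CommitMetaRangeID
	}
	return ranges.Get(base, key)
}
```

Listing would follow the same layering, with merged iterators over the
staging and sealed tokens placed on top of an iterator over the compacted
metarange (or the HEAD commit's metarange).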
### Commit Flow

The commit flow is only slightly affected by the compaction process. If a
compaction never happened, the commit flow is the same as today. If a
compaction happened, the changes are applied to the compacted metarange
instead of the metarange of the HEAD commit. A successful commit resets the
`CompactedMetaRangeID` field.

## Metrics

We can collect the following Prometheus metrics:
- Compaction time.
- Number of sealed tokens.

[^1]: The number of sealed tokens increases by one for every HEAD-changing
operation (commit, merge, etc.) on the branch, and resets to 0 when one of
these operations succeeds. Therefore, a large number of sealed tokens is
unlikely and is a sign of a bigger problem, like the inability to commit.
[^2]: Any additional call to the kv-store might have an indirect performance
impact, like in the case of DynamoDB throttling. For this goal, we treat a
performance impact as a change of flow that directly affects a user
operation.
[^3]: A diff between two immutable metaranges is also immutable.