# Proposal: Staging Compaction

## Problem Description

The lakeFS staging area is built on top of the [KV store](../accepted/metadata_kv/index.md).
In short, the structure of the `Branch` entity in the kv-store is as follows:
```go
type Branch struct {
	CommitID     CommitID
	StagingToken StagingToken
	// SealedTokens - Staging tokens are appended to the front, this allows building the diff iterator easily
	SealedTokens []StagingToken
}
```

Uncommitted entries are read from the staging token first, then from the
sealed tokens in order. Writes are performed on the staging token.
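
For illustration, a minimal sketch of this lookup order; the `Value`, `getFromToken` and `ErrNotFound` names are placeholders, not the actual Graveler API:

```go
// getFromToken is a placeholder for looking up a key under a single staging token.
type getFromToken func(token StagingToken, key []byte) (*Value, error)

// getUncommitted checks the staging token first, then the sealed tokens in order.
// A returned tombstone value means the key was deleted on the branch.
func getUncommitted(get getFromToken, b Branch, key []byte) (*Value, error) {
	tokens := append([]StagingToken{b.StagingToken}, b.SealedTokens...)
	for _, token := range tokens {
		value, err := get(token, key)
		if err == nil {
			return value, nil
		}
		if !errors.Is(err, ErrNotFound) {
			return nil, err
		}
	}
	return nil, ErrNotFound
}
```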

The KV design has proven to meet lakeFS requirements for its consistency
guarantees. For most use-cases of uncommitted areas in lakeFS, it does so
efficiently. However, there are some cases where it falls short, for example
this [issue](https://github.com/treeverse/lakeFS/issues/2092).  
There might be other cases where the structure of the `Branch` entity impacts
the performance of reading from the staging area, for example when the
number of sealed tokens is large (N)[^1] and reading a missing entry requires
reading from all N sealed tokens.

## Goals

- Preserve lakeFS consistency guarantees for all scenarios.
- A more efficient way to read from branches with a large number of
  tombstones (i.e., fix the issue mentioned above).
- Preserve lakeFS performance for other scenarios.[^2]


## Non Goals

- Improve the performance of the KV store in general.
- Simplify the uncommitted area model in lakeFS. While we'll try to keep the
  changes minimal, we might need to add some complexity to the model to
  achieve the goals.

## Proposed Design

We propose to add a compaction mechanism to the staging area. We'll explain
how to decide when to compact (Sensor), how to compact (Compactor), the
data read/write flows, and the commit flow after compaction.

```go
type Branch struct {
	CommitID     CommitID
	StagingToken StagingToken
	// SealedTokens - Staging tokens are appended to the front, this allows building the diff iterator easily
	SealedTokens []StagingToken

	CompactedMetaRangeID MetaRangeID
	CompactionStartTime  time.Time
}
```

### Sensor

The Sensor is responsible for deciding when to compact the staging area.
The Sensor will be linked to the Graveler and will collect
information on writes to the staging area. It will decide when to compact a
certain branch based on the number of deleted entries in its staging area,
taking into account operations like commits/resets etc. It can also decide
based on the number of sealed tokens, although that is not necessarily a
priority for the first version. We can probably avoid the need to query the
kv-store to retrieve that information (except at service startup) by caching
the information in the Sensor.
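
A minimal sketch of such a Sensor, keeping per-branch counters in memory; the threshold, key format and trigger callback are illustrative assumptions, not part of the proposal:

```go
type Sensor struct {
	mu         sync.Mutex
	tombstones map[string]int // "repo/branch" -> tombstones staged since the last successful commit/reset
	threshold  int
	trigger    func(repository, branch string) // asks the Compactor to compact the branch
}

// RecordDelete is called by Graveler for every tombstone written to a staging token.
func (s *Sensor) RecordDelete(repository, branch string) {
	key := repository + "/" + branch
	s.mu.Lock()
	s.tombstones[key]++
	shouldCompact := s.tombstones[key] >= s.threshold
	s.mu.Unlock()
	if shouldCompact {
		s.trigger(repository, branch)
	}
}

// RecordHeadChange resets the counter when a commit/reset succeeds, since those
// operations drop the staged tombstones anyway.
func (s *Sensor) RecordHeadChange(repository, branch string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.tombstones, repository+"/"+branch)
}
```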

Notice there's a single Sensor for each lakeFS instance. While the following
sections describe why concurrent compactions don't harm consistency, they may
be very inefficient. Therefore, a Sensor deciding to compact will only do so
if the branch's `CompactionStartTime` is not within the last x<TBD> minutes,
using `SetIf` to minimize collisions (although not avoiding them completely
due to clock skews).
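
A sketch of that guard; `compactionMinInterval` is a hypothetical configuration value, and persisting the returned branch with `SetIf` is left to the existing branch-update machinery:

```go
// tryMarkCompactionStart returns an updated copy of the branch if a compaction
// may start now, or nil if another compaction started within compactionMinInterval.
// The caller persists the copy with SetIf against the branch version it read,
// so concurrent Sensors lose the race cleanly.
func tryMarkCompactionStart(b *Branch, now time.Time, compactionMinInterval time.Duration) *Branch {
	if now.Sub(b.CompactionStartTime) < compactionMinInterval {
		return nil // someone else (probably) started compacting recently
	}
	updated := *b
	updated.CompactionStartTime = now
	return &updated
}
```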

### Compactor

Upon deciding to compact, the Sensor will trigger the Compactor. The
Compactor will perform an operation very similar to Commit. Before starting
to compact, it will atomically:

1. Push a new `StagingToken` to the branch.
1. Add the old `StagingToken` to the `SealedTokens` list.
1. Set the `CompactionStartTime` to the current time.

The Compactor will now read the branch's `SealedTokens` in order and apply
them onto either the compacted metarange (if `CompactedMetaRangeID` exists) or
the branch's HEAD CommitID MetaRange to create a new MetaRange. The Compactor
will then atomically update the branch entity:

1. Set the `CompactedMetaRangeID` to the new MetaRangeID.
1. Remove the compacted `SealedTokens` from the branch.

The Compactor should fail the operation if the sealed tokens have changed
since the compaction started (`SetIf`). Although the algorithm can tolerate
additional sealed tokens being added during the compaction, it's better to
avoid competing with concurrent commits. Commits have the same benefits as
compactions, but they are proactively triggered by the user (and might fail
if compaction succeeds).
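
A rough sketch of the flow; `applyToken`, `updateBranchIf` and the `Commit` type stand in for the existing commit/apply machinery and are not actual lakeFS function names:

```go
// compact applies the branch's sealed tokens onto the base metarange and swaps
// the result into the branch atomically, failing if the branch changed meanwhile.
func compact(b *Branch, head *Commit) (MetaRangeID, error) {
	// Base: the previous compaction result if there is one, otherwise the
	// metarange of the branch's HEAD commit.
	base := b.CompactedMetaRangeID
	if base == "" {
		base = head.MetaRangeID
	}

	// Apply sealed tokens oldest-to-newest (they are kept newest-first),
	// exactly like Commit applies staged changes.
	metaRange := base
	for i := len(b.SealedTokens) - 1; i >= 0; i-- {
		var err error
		metaRange, err = applyToken(metaRange, b.SealedTokens[i])
		if err != nil {
			return "", err
		}
	}

	// SetIf-style update: fails if the sealed tokens changed since compaction
	// started, e.g. because a concurrent commit completed first.
	updated := *b
	updated.CompactedMetaRangeID = metaRange
	updated.SealedTokens = nil
	if err := updateBranchIf(&updated, b); err != nil {
		return "", err
	}
	return metaRange, nil
}
```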

Consistency guarantees for the compaction process are derived from the
Commit operation's consistency guarantees. In short, the compaction process
follows the same principles as the Commit operation: a write to a branch's
staging token succeeds only if that token is still the branch's staging token
after the write occurred. Therefore, replacing the sealed tokens with an
equivalent MetaRange guarantees that no successful writes go missing.

### Branch Write Flow

No change here. Writes are performed on the staging token.

### Branch Read Flow

Reading from a branch includes both a specific entry lookup and a listing
request on entries of a branch. The read flow is more complicated than what
we have today, and is different for compacted and uncompacted branches.

#### Combined (Committed + Uncommitted)

There are 3 layers of reading from a branch:
1. Read from the staging token.
2. Read from the sealed tokens (in order).
3. Depending on compaction (see the sketch below):
   1. If a `CompactedMetaRangeID` exists, read from the compacted MetaRange.
   1. If a `CompactedMetaRangeID` doesn't exist, read from the CommitID.
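
For layer 3, a small sketch of how the committed base for a combined read would be chosen; the `Commit` type is assumed to carry the commit's `MetaRangeID`:

```go
// combinedReadBase returns the metarange that layer 3 of a combined read uses:
// the compacted metarange if one exists, otherwise the HEAD commit's metarange.
func combinedReadBase(b *Branch, head *Commit) MetaRangeID {
	if b.CompactedMetaRangeID != "" {
		return b.CompactedMetaRangeID
	}
	return head.MetaRangeID
}
```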

#### Committed

Just like today, reads are performed on the Branch's CommitID.

#### Uncommitted

For operations such as `DiffUncommitted` or checking if the staging
area is empty, the read flow will be as follows (see the sketch below):

1. Read from the staging token.
2. Read from the sealed tokens (in order).
3. If a `CompactedMetaRangeID` exists, read the 2-way diff between the compacted
   metarange and the CommitID's metarange.
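
A sketch of that last step, assuming a `diffMetaRanges` helper equivalent to the existing committed 2-way diff and a `DiffIterator` result type:

```go
// uncommittedCompactedDiff covers step 3: on a compacted branch, staged changes
// that were folded into the compacted metarange show up as the 2-way diff from
// the HEAD commit's metarange to the compacted one.
func uncommittedCompactedDiff(b *Branch, head *Commit) (DiffIterator, error) {
	if b.CompactedMetaRangeID == "" {
		return nil, nil // nothing beyond the staging and sealed tokens
	}
	return diffMetaRanges(head.MetaRangeID, b.CompactedMetaRangeID)
}
```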

* There's an inefficiency here, as we might need to read two whole
  metaranges to compute the diff, for example when there's a single change in
  every range. However, the nature of changes to a lakeFS branch is such that
  changes are expected in a small number of ranges, and the diff operation is
  expected to skip most ranges. Moreover, Pebble caching the ranges should
  remove the most costly part of comparing ranges: fetching them from S3. If
  this is still inefficient, we can use the immutability trait[^3] of the
  diff: we can calculate the diff result once and cache it.

### Commit Flow

The commit flow is slightly affected by the compaction process. If
compaction never happened, the commit flow is the same as today. If a
compaction happened, the commit applies the staged changes to the compacted
metarange instead of the HEAD commit's metarange. A successful commit resets
the `CompactedMetaRangeID` field.
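
A sketch of the branch-entity update on a successful commit (staging-token rotation omitted); clearing `CompactedMetaRangeID` is the new part:

```go
// onCommitSuccess sketches the branch update after a successful commit: the
// branch points at the new commit, the sealed tokens it covered are dropped,
// and the compaction state is reset so reads go through the new HEAD.
func onCommitSuccess(b *Branch, newCommit CommitID) Branch {
	updated := *b
	updated.CommitID = newCommit
	updated.SealedTokens = nil
	updated.CompactedMetaRangeID = "" // reset: the compacted changes are now committed
	return updated
}
```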

## Metrics

We can collect the following Prometheus metrics:
- Compaction time.
- Number of sealed tokens.
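
For example, with the Prometheus client library (metric names and labels here are placeholders):

```go
// import "github.com/prometheus/client_golang/prometheus"
var (
	compactionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "graveler_staging_compaction_duration_seconds",
		Help: "Time it took to compact a branch's staging area.",
	})
	sealedTokensCount = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "graveler_branch_sealed_tokens",
		Help: "Number of sealed tokens currently held by a branch.",
	}, []string{"repository", "branch"})
)

func init() {
	prometheus.MustRegister(compactionDuration, sealedTokensCount)
}
```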

[^1]: The number of sealed tokens increases by one for every HEAD-changing
operation (commit, merge, etc.) on the branch. The number of sealed tokens
resets to 0 when one of these operations succeeds. Therefore, a large number
of sealed tokens is unlikely and is a sign of a bigger problem, like the
inability to commit.
[^2]: Any additional call to the kv-store might have an indirect performance
impact, as in the case of DynamoDB throttling. We'll treat a performance
impact as a change of flow with a direct impact on a user operation.
[^3]: A diff between two immutable metaranges is also immutable.