# Design proposal - lakeFS on KV

lakeFS stores 2 types of metadata:

1. Immutable metadata, namely committed entries (stored as ranges and metaranges in the underlying object store)
2. Mutable metadata, namely IAM entities, branch pointers and uncommitted entries

This proposal describes an alternative implementation for the **mutable** metadata.
Currently, this type of metadata is stored on (and relies on the guarantees provided by) a PostgreSQL database.

In this proposal, we detail a relatively narrow database abstraction - one that could be satisfied by a wide variety of database systems.
We also detail how lakeFS operates on top of this abstraction, instead of relying on PostgreSQL-specific behavior.

## Goals

1. Make it easy to run lakeFS as a managed service in a multitenant environment
1. Allow running production lakeFS systems on top of a database that operations teams are capable of managing, as not all ops teams are versed in scaling PostgreSQL
1. Increase users' trust in lakeFS, in terms of positioning: PostgreSQL is apparently not the first DB that comes to mind in relation to scalability
1. Make lakeFS easier to experiment with: allow running a local lakeFS server without any external dependencies (e.g. no more `docker-compose` in the quickstart)

## Non-Goals

1. Improve performance, latency or throughput of the system
1. Provide new features, capabilities or guarantees that weren't previously possible in the lakeFS API/UI/CLI

## Design

At the heart of the design is a simple Key/Value interface, used for all mutable metadata management.

### Semantics

Order operations by a "happens before" relation: operation A happened before operation B if A
finished before B started. If operation A happened before B then we also say that B happened
after A. (As usual, this is not a total ordering!)

#### Consistency guarantees

* If a write succeeds, successive reads and listings will return the contents of either that
  write or of some other write that did not happen before it.

* A successful read returns the contents of some write that did not happen after it.

* A listing returns the contents of some writes that did not happen after it.

* A successful commit holds the contents of some listing of the entire keyspace.

* Mutating operations (commits and writes) may succeed or fail. When they fail, their contents
  might still be visible.

#### Consistency **non**-guarantees

These guarantees do *not* give linearizability of any kind. In particular, these are some
possible behaviours:

1. **Impossible ordering by application logic (the "1-3-2 problem"):** I write an application
   that reads a number from a file, increments it, and writes back the same file (this
   application performs an unsafe increment). I start with the file containing "1", and run the
   application twice concurrently. An observer (some process that repeatedly reads the file)
   may observe the value sequence "1", "3", "2". If the observer commits each version, it can
   create a **history** of these values in this order.
2. **Non-monotonicity (the "B-A-N-A-N-A-N-A-... problem :banana:"):** A file has contents "B".
   I start a continuous committer (some process that repeatedly commits). Now I run two
   concurrent updates: one updates the file contents to "N", the other updates the file
   contents to "A". Different orderings can cause histories that look like "B", "A", "N", "A",
   "N", "A", "N", ... to any length.
### Key/Value Store interface

This is roughly the API:

```go
type Store interface {
	// Get returns a value for the given key, or ErrNotFound if the key doesn't exist
	Get(partitionKey, key []byte) (value []byte, err error)
	// Scan returns an iterator that scans keys in byte order, starting at or after the `start` position
	Scan(partitionKey, start []byte) (iter KeyValueIterator, err error)
	// Set stores the given value, overwriting an existing value if one exists
	Set(partitionKey, key, value []byte) error
	// Delete will delete the key/value at key, if any
	Delete(partitionKey, key []byte) error
	// SetIf returns an ErrPredicateFailed error if the valuePredicate passed
	// doesn't match the currently stored value. SetIf is a simple compare-and-swap operator:
	// valuePredicate is either the existing value, or an opaque value representing it (hash, index, etc.).
	// This is intentionally simplistic: we can model a better abstraction on top, keeping this interface simple for implementors.
	SetIf(partitionKey, key, value, valuePredicate []byte) error
}
```

Note: This API is roughly the one needed and is subject to change/tweaking.
It is meant to illustrate the required capabilities in order to build a functioning lakeFS system on top.

#### KV requirements

- Read-after-write consistency: a read that follows a successful write should return the written value or newer
- Keys can be enumerated lexicographically, in ascending byte order
- Supports a key-level conditional operation based on a current value - or essentially, allows modeling a CAS operator

#### Databases that meet these requirements (examples)

- PostgreSQL
- MySQL
- Embedded Pebble/RocksDB (great option for a "quickstart" environment?)
- MongoDB
- AWS DynamoDB
- FoundationDB
- Azure Cosmos
- Azure Blob Store
- Google BigTable
- Google Spanner
- Google Cloud Storage
- HBase
- Cassandra (or compatible systems such as ScyllaDB)
- Raft (embedded, or server implementations such as Consul/etcd)
- Persistent Redis (Redis Enterprise, AWS MemoryDB)
- Simple in-memory tree (for simplifying and speeding up tests? see the sketch below)
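To illustrate how small this surface is, here is a minimal sketch of a map-backed, in-memory store along the lines of the last option above. It is a test/illustration aid only, not a proposed implementation: the `memStore` and `Pair` names are made up for this sketch, `Scan` returns a materialized slice rather than an iterator to keep it short, and partitions are naively modeled by prefixing every key with its partition key.

```go
package kvmem

import (
	"bytes"
	"errors"
	"sort"
	"strings"
	"sync"
)

var (
	ErrNotFound        = errors.New("not found")
	ErrPredicateFailed = errors.New("predicate failed")
)

// Pair is a key/value pair returned by Scan.
type Pair struct {
	Key, Value []byte
}

// memStore is a toy, map-backed store with the capabilities listed above.
type memStore struct {
	mu sync.Mutex
	m  map[string][]byte
}

func New() *memStore { return &memStore{m: map[string][]byte{}} }

func fullKey(partitionKey, key []byte) string {
	return string(partitionKey) + "/" + string(key)
}

func (s *memStore) Get(partitionKey, key []byte) ([]byte, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.m[fullKey(partitionKey, key)]
	if !ok {
		return nil, ErrNotFound
	}
	return v, nil
}

func (s *memStore) Set(partitionKey, key, value []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[fullKey(partitionKey, key)] = value
	return nil
}

func (s *memStore) Delete(partitionKey, key []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.m, fullKey(partitionKey, key))
	return nil
}

// SetIf is a compare-and-swap: the write succeeds only if the currently stored
// value equals valuePredicate (a nil predicate means "key must not exist yet").
func (s *memStore) SetIf(partitionKey, key, value, valuePredicate []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.m[fullKey(partitionKey, key)]
	if (valuePredicate == nil && ok) || (valuePredicate != nil && !bytes.Equal(cur, valuePredicate)) {
		return ErrPredicateFailed
	}
	s.m[fullKey(partitionKey, key)] = value
	return nil
}

// Scan returns the partition's pairs at or after start, in ascending byte order of the key.
func (s *memStore) Scan(partitionKey, start []byte) ([]Pair, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	prefix := string(partitionKey) + "/"
	var pairs []Pair
	for k, v := range s.m {
		if strings.HasPrefix(k, prefix) && bytes.Compare([]byte(k[len(prefix):]), start) >= 0 {
			pairs = append(pairs, Pair{Key: []byte(k[len(prefix):]), Value: v})
		}
	}
	sort.Slice(pairs, func(i, j int) bool { return bytes.Compare(pairs[i].Key, pairs[j].Key) < 0 })
	return pairs, nil
}
```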
### Data Modeling: IAM

The API that exposes and manipulates IAM entities is already modeled as a lexicographically ordered key value store (pagination is based on sorted entity IDs, which are strings).

Relationships are modeled as auxiliary keys. For example, modeling a user, a group and a membership would look something like this (pseudo code):

```go
// All IAM entities live in a single, small partition (see the Partitioning section below).
var authPartition = []byte("auth")

func WriteUser(kv Store, user User) error {
	data := proto.MustMarshal(user)
	return kv.Set(authPartition, []byte(fmt.Sprintf("iam/users/%s", user.ID)), data)
}

func WriteGroup(kv Store, group Group) error {
	data := proto.MustMarshal(group)
	return kv.Set(authPartition, []byte(fmt.Sprintf("iam/groups/%s", group.ID)), data)
}

func AddUserToGroup(kv Store, user User, group Group) error {
	data := proto.MustMarshal(&Membership{GroupID: group.ID, UserID: user.ID})
	return kv.Set(authPartition, []byte(fmt.Sprintf("iam/user_groups/%s/%s", user.ID, group.ID)), data)
}

func ListUserGroups(kv Store, userID string) []string {
	groupIDs := make([]string, 0)
	prefix := []byte(fmt.Sprintf("iam/user_groups/%s/", userID))
	iter, _ := kv.Scan(authPartition, prefix) // error handling elided
	for iter.Next() {
		pair := iter.Pair()
		if !bytes.HasPrefix(pair.Key(), prefix) {
			break
		}
		membership := Membership{}
		proto.MustUnmarshal(pair.Value(), &membership)
		groupIDs = append(groupIDs, membership.GroupID)
	}
	// ...
	return groupIDs
}
```

It is possible to create a 2-way index for many-to-many relationships, but this is not generally required - it is usually simpler to do a scan when needed and filter only the relevant values.
This is because the IAM keyspace is relatively small - it would be surprising if the biggest lakeFS installation to ever exist contained more than, say, 50k users.

Some care does need to be applied when managing these secondary indices - for example, when deleting an entity, secondary indices need to be pruned first, to avoid inconsistencies.
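To make that ordering concern concrete, here is a sketch of user deletion that prunes the membership (secondary-index) keys before removing the user entity itself, so that a failure part-way through never leaves index entries pointing at a user that no longer exists. `DeleteUser` follows the key layout of the pseudocode above and is illustrative only; it is not an existing lakeFS function.

```go
// DeleteUser removes a user and its group memberships. The membership
// (secondary-index) keys are pruned first, so an interrupted deletion can at
// worst leave a user with fewer memberships - never memberships that point at
// a deleted user. Illustrative sketch only.
func DeleteUser(kv Store, userID string) error {
	prefix := []byte(fmt.Sprintf("iam/user_groups/%s/", userID))
	iter, err := kv.Scan(authPartition, prefix)
	if err != nil {
		return err
	}

	// Collect the membership keys first, so we don't delete under the live iterator.
	var membershipKeys [][]byte
	for iter.Next() {
		pair := iter.Pair()
		if !bytes.HasPrefix(pair.Key(), prefix) {
			break
		}
		membershipKeys = append(membershipKeys, pair.Key())
	}

	// Prune the secondary index entries...
	for _, k := range membershipKeys {
		if err := kv.Delete(authPartition, k); err != nil {
			return err
		}
	}

	// ...and only then delete the user entity itself.
	return kv.Delete(authPartition, []byte(fmt.Sprintf("iam/users/%s", userID)))
}
```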
### Graveler Metadata - Commits, Tags, Repositories

These are simpler entities - commits and tags are immutable and could potentially be stored on the object store and not in the key/value store (they are cheap either way, so it doesn't make much of a difference).
Repositories and tags are also returned in lexicographical order, which maps well to the suggested abstraction. Commits are usually returned using parent traversal, so no scanning takes place anyway.

There are no special concurrency requirements for these entities, apart from last-write-wins, which is already the case for all modern stores.

### Graveler Metadata - Branches and Staged Writes

This is where concurrency control gets interesting, and where lakeFS is expected to provide a **correct** system whose semantics are well understood (lakeFS currently [falls short](https://github.com/treeverse/lakeFS/issues/2405) in that regard).

Concurrency is more of an issue here because of how a commit works: when a commit starts, it scans the currently staged changes, applies them to the current commit pointed to by the branch, updates the branch reference and removes the staged changes it applied.

Getting this right means we have to take care of the following:

1. Ensure all staged changes that finished successfully before the commit started are applied as part of the commit (causality)
1. Ensure acknowledged writes end up either in a resulting commit, or staged to be committed (no lost writes)

To do this, we will employ 3 mechanisms:

1. [Optimistic Concurrency Control](https://en.wikipedia.org/wiki/Optimistic_concurrency_control) on the branch pointer using `SetIf()`
1. Reliance on write [idempotency](https://en.wikipedia.org/wiki/Idempotence) provided by Graveler (i.e., writing the same exact entry, with the same identity, will not appear as a change)
1. Lock freedom (see [_non-blocking algorithms_](https://en.wikipedia.org/wiki/Non-blocking_algorithm))

#### Committer flow

We add an additional field to each `Branch` object: in addition to the existing `staging_token`, we add an array of strings named `sealed_tokens`.

1. Get the branch, find the current `staging_token`
1. Use `SetIf()` to update the branch (if not modified by another process): push the existing `staging_token` into `sealed_tokens`, set a new UUID as `staging_token`. The branch is assumed to be represented by a single key/value pair that contains the `staging_token`, `sealed_tokens` and `commit_id` fields (a sketch of this update appears at the end of this section).
1. Take the list of sealed tokens and, using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit
1. [_optional_] Once the commit has been persisted (metaranges and ranges stored in the object store, the commit itself stored to KV using `Set()`), attempt a "fast-path commit": perform another `SetIf()` that updates the branch key/value pair again, replacing its commit ID with the new value and clearing `sealed_tokens`, as these have materialized into the new commit.
1. If the fast-path commit isn't used or its `SetIf()` fails, repeatedly attempt a "full commit": regardless of concurrent commits racing against this one, the 2 `sealed_tokens` lists will overlap: a _suffix_ of the to-be-committed `sealed_tokens` will be a _prefix_ of the current `sealed_tokens` on the branch.

   The 2 edge cases are:

   * _identical_ `sealed_tokens`: no concurrent commits won a race against this commit.
   * _nonoverlapping_ `sealed_tokens`: a concurrent **later** commit won a race against this commit and committed all data that it intended to commit. This is an "empty commit" state, and can be handled as an error or not depending on business logic.

   In other cases a concurrent **earlier** commit won a race against this commit and committed _some_ of its data. But it is still safe to commit: trim the prefix of the current `sealed_tokens` that is a suffix of the to-be-committed `sealed_tokens`, and `SetIf` the branch to this new record. If this update fails, everything is correct and safe and we can apply business logic: retry a new full commit multiple times or immediately abort with an error.

The "overlapping `sealed_tokens`" commit method gives lock freedom (if we count per-thread time as the number of KV operations performed by the thread): at least one thread makes progress after it makes 3 KV operations. Furthermore, the number of retries that a thread can make is at most the number of concurrent preceding commits.

![]()
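Below is a sketch of steps 1-2 and the optional fast-path commit (step 4), in the same pseudocode style as the IAM examples above. It assumes the single branch record described in step 2, that the `SetIf()` predicate is simply the raw value previously read (the interface allows either the existing value or an opaque representation of it), and that new staging tokens are random UUIDs. `rotateStagingToken`, `fastPathCommit` and the `Branch` field names are illustrative, not the actual lakeFS implementation.

```go
// Branch mirrors the single key/value record described in step 2.
type Branch struct {
	CommitID     string
	StagingToken string
	SealedTokens []string
}

// rotateStagingToken implements steps 1-2 of the committer flow: seal the
// current staging token and install a fresh one, using SetIf on the single
// branch key so that a concurrent update to the branch causes a re-read and retry.
func rotateStagingToken(kv Store, partition, branchKey []byte) (*Branch, error) {
	for {
		raw, err := kv.Get(partition, branchKey)
		if err != nil {
			return nil, err
		}
		branch := Branch{}
		proto.MustUnmarshal(raw, &branch)

		updated := branch
		updated.SealedTokens = append(append([]string{}, branch.SealedTokens...), branch.StagingToken)
		updated.StagingToken = uuid.NewString()

		// Compare-and-swap against the exact value we read; on a race, retry from the top.
		err = kv.SetIf(partition, branchKey, proto.MustMarshal(&updated), raw)
		if errors.Is(err, ErrPredicateFailed) {
			continue
		}
		if err != nil {
			return nil, err
		}
		return &updated, nil
	}
}

// fastPathCommit implements the optional step 4: after the commit record and its
// ranges are persisted, atomically point the branch at the new commit and clear
// sealed_tokens - but only if no other committer has touched the branch since
// `expected` (the raw branch value written by rotateStagingToken) was stored.
// On ErrPredicateFailed, fall back to the "full commit" path described in step 5.
func fastPathCommit(kv Store, partition, branchKey []byte, newCommitID string, expected []byte) error {
	branch := Branch{}
	proto.MustUnmarshal(expected, &branch)
	branch.CommitID = newCommitID
	branch.SealedTokens = nil
	return kv.SetIf(partition, branchKey, proto.MustMarshal(&branch), expected)
}
```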
#### Caching branch pointers and amortized reads

In the current design, for each read/write operation we add a single amortized read of the branch record as well.
Let's define an "amortized read" as the act of batching requests for the same branch for a short duration, thus amortizing the DB lookup cost across those requests.

For this design, we don't want to change this, at least for most requests: add 1 additional wait for a KV lookup that can be amortized across requests for the same branch.

To do this, we introduce a small in-memory cache (we can utilize the same caching mechanism that already exists for IAM).
Please note: this *does not violate consistency* - see the [Writer flow](#writer-flow) and [Reading/Listing flow](#readinglisting-flow) below to understand how.

#### Writer flow

1. Read the branch's existing staging token: if the branch exists in the cache, use it! Otherwise, do an amortized read (see [above](#caching-branch-pointers-and-amortized-reads)) and cache the result for a very short duration.
1. Write to the staging token received - this is another key/value record (e.g. `"graveler/staging/${repoId}/${stagingToken}/${path}"`)
1. Read the branch's existing staging token **again**. This is always an amortized read, not a cache read. If we get the same `staging_token` - great, no commit has *started while writing* the record; return success to the user. For a system with low contention between writes and commits, this will be the usual case.
1. If the `staging_token` *has* changed - **retry the operation**. If the previous write made it in time to be included in the commit, we'll end up writing a record with the same identity - an idempotent operation (see the sketch below).

![]()
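A sketch of this writer flow, again in pseudocode style. The `BranchCache` helper with its `GetOrRead` (cache-first) and `ReadAmortized` (cache-bypassing, batched) methods, the `stagingPartition` helper and the `Entry` type are illustrative assumptions, not existing lakeFS APIs.

```go
// writeEntry implements the writer flow: write the entry under the staging
// token we believe is current, then re-check the branch and retry if a commit
// sealed that token while we were writing. Illustrative sketch only.
func writeEntry(kv Store, cache BranchCache, repoID, branch, path string, entry Entry) error {
	for {
		// 1. Cheap read of the branch record: cache first, amortized KV read otherwise.
		before, err := cache.GetOrRead(kv, repoID, branch)
		if err != nil {
			return err
		}
		token := before.StagingToken

		// 2. Write the entry under the current staging token.
		key := []byte(fmt.Sprintf("graveler/staging/%s/%s/%s", repoID, token, path))
		if err := kv.Set(stagingPartition(repoID, token), key, proto.MustMarshal(&entry)); err != nil {
			return err
		}

		// 3. Re-read the branch; always an amortized KV read, never the cache.
		after, err := cache.ReadAmortized(kv, repoID, branch)
		if err != nil {
			return err
		}
		if after.StagingToken == token {
			// No commit started while we were writing - the write is acknowledged.
			return nil
		}

		// 4. A commit sealed our token mid-write: retry. If the first write was
		// included in that commit, rewriting the identical entry is idempotent.
	}
}
```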
#### Reading/Listing flow

1. Read the branch's existing staging token(s): if the branch exists in the cache, use it! Otherwise, do an amortized read (see [above](#caching-branch-pointers-and-amortized-reads)) and cache the result for a very short duration.
1. The `sealed_tokens` list will typically be *empty* or very small (see the [Committer flow](#committer-flow) above)
1. We now use the existing `CombinedIterator` to read through all staging tokens and the underlying commit.
1. Read the branch's existing staging token(s) **again**. This is always an amortized read, not a cache read. If it hasn't changed - great, no commit has *started while reading* the record; return success to the user. For a system with low contention between writes and commits, this will be the usual case.
1. If it has changed, we're reading from a stale set of staging tokens - a committer might have already deleted records from it. Retry the process.

![]()

#### Important Note - exclusion duration

It is important to understand that the current pessimistic approach locks the branch for the entire duration of the commit.
This takes time proportional to the amount of changes to be committed and is unbounded. All writes to a branch are blocked for that period of time.

With the optimistic version, readers and writers end up retrying during exactly 2 "constant time" operations during a commit: during the initial update with a new `staging_token`, and again when clearing `sealed_tokens`.
The duration of these operations is, of course, not constant - but it is (well, should be) very short, and not proportional to the size of the commit in any way.

### Partitioning

Partitioning the KV key space is intended to increase performance; the KV interface remains correct even without partitioning.
We rely on 2 assumptions for adding the partitioning:

1. lakeFS keys can be partitioned in a manner that doesn't require access to 2 partitions in a single operation.
   The only operation of the API that handles more than a single key is `Scan`, and there's no use-case for scanning more than a single partition.
1. Performance decreases if the keys are not partitioned. For example, a transaction on a KV Postgres table will lock the entire table
   during the execution. Working on a partitioned key space will only block the single partition.

The KV store implementation is in charge of managing the partitioned storage. The user of the KV store
is responsible for choosing the appropriate partitions. For example, the number of keys used for authentication
& authorization is relatively small, so using the same 'auth' partition for all auth entities is appropriate.
However, the number of entries (keys) under a single staging token is possibly the size of the entire repository,
so creating a partition for each staging token is the right strategy.

Guidelines:
- Partition keys are also a namespace for the key, i.e. the combination of (partitionKey, key) is unique in the KV database,
  but (key) alone is not guaranteed to be unique.
- Although possible in some implementations, the interface will not support `dropPartition`.
  A later addition of `dropPartition` could speed up commits by iterating over keys in a goroutine after the commit is done, without having to worry about iterator invalidation.
- New partition creation is implicit, to keep the API free of additional `NewPartition` functionality.

### Open Questions

1. Complexity cost - How complex is the system after implementing this? What would a real API look like? Serialization?
1. Performance - How does this affect critical path request latency? How does it affect overall system throughput?
1. Flexibility - Where could the narrow `Store` API become a constraint? What *won't* we be able to implement going forward due to lack of guarantees (e.g. no transactions)?
1. Alternatives - As this is also a solution to milestone #3, how does it fare against [known](https://github.com/treeverse/lakeFS/pull/1688) [proposals](https://github.com/treeverse/lakeFS/pull/1685)?