github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/design/open/reuse-contented-merge-metaranges/reuse-contented-merge-metaranges.md

github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/design/open/reuse-contented-merge-metaranges/reuse-contented-merge-metaranges.md (about)

1 # Proposal: Re-use metaranges to retry contended merges
2
3 ## Why
4
5 Large merges can be slow, and we are performing a lot of work to speed them
6 up. This design is to speed up the case when multiple branches merge
7 concurrently into the same destination branch. The goal is to speed up
8 concurrent merges on both non-KV and KV systems.
9
10 On **Postgres-based (non-KV) systems**, merges wait for each other on the
11 branch lock, slowing successive merges down. What makes it _worse_ is that
12 the destination branch advances but the source does not. So in each
13 successive concurrent merge the destination is even more divergent from the
14 base.
15
16 On **KV-based systems** (the near future!), there is no branch locker.
17 Concurrent merges race each other for completion. The winning merge updates
18 the destination branch, and _all other merges lose and have to restart_. So
19 on each concurrent merge, every successive merge attempt takes even longer.
20
21 This proposal cannot prevent concurrent merges from racing. Instead it
22 makes retries considerably more efficient. Any range that is affected by
23 only one concurrent merge will be written just _once_, no matter how many
24 times it is retried. So additionally retries of the merge that is writing
25 that range will be able to skip _reading_ its two sources on every retry.
26 This should speed everything up!
27
28 ### KV or not KV?
29
30 Importantly, this proposal is independent of whether or not we use KV.
31 _Before_ KV you can use it by removing the branch locker and making the
32 branch update operation conditional. And then it **improves** concurrent
33 merge performance by allowing parallel execution of almost all of the
34 time-consuming tasks. So it increases the amount of work performed but it
35 performs that work in parallel and reduces the latency of the slow
36 concurrent merges.
37
38 _After_ KV, not only does it restore performance to this improved case -- it
39 also removes much of the duplicate work that concurrent merges perform.
40
41 ## How
42
43 The basic idea is to allow retries of a concurrent merge to re-use previous
44 work. This is similar to the "overlapping `sealed_tokens`" full-commit
45 method of the [commit flow][commit_flow]. However here there are no staging
46 tokens; the previous work of a merge attempt is the metarange that it
47 generated.
48
49 ![Branches A and B diverge from branch main. Merge A and B into main concurrently. A succeeds in merging and bumps HEAD of main. B succeeds in merging and generates a metarange, but now it cannot bump HEAD of main. Now try again to merge, but _use the newly-generated merged metarange instead of the HEAD of B.](./diagram.jpg)
50
51 There are two ways for a merge to fail:
52 * **Conflict:** The objects at some path on the merge base, source and
53 destination are all different and no strategy was specified to resolve it.
54
55 This failure is inherent and occurs regardless of additional concurrent
56 merges. It occurs while generating the merged metarange. It cannot be
57 fixed; correct behaviour is simply to report it to the user.
58 * **Losing a race:** No merge conflict occurred, but when trying to add the
59 resulting merge commit to the destination we discover that the HEAD of the
60 destination branch has changed.
61
62 This failure is due to losing a race against a concurrent merge or commit.
63 It must either be retried or abandoned (and, presumably, retried later).
64
65 Naturally only race failures are of interest. When a merge fails because of
66 a merge, it must be retried. However we can restart it using _the generated
67 merged metarange_ as source and the original destination as the base! (See
68 "Correctness" below for why this yields the correct result.)
69
70 The performance gain occurs during the retry: We expect the new source to
71 share many ranges with the desired result and with the destination:
72
73 * A range that was the same in the base and the destination, but changed
74 only between base and original source will still generally be the same in
75 the base and destination. And rewriting it from the new source yields the
76 same range -- it will be reused in the result.
77
78 **Result:** The range created in the first attempt will be recreated in
79 the second attempt, and will not need to be uploaded to the backing object
80 store. (We might later decide to cache the merged range result in memory,
81 which will let us skip even recreating that range!)
82 * A range that was the same in the base and the source, but changed only
83 between base and original destination, will still be unchanged to the new
84 source. No range needs to be read or generated -- we did not slow this
85 case down.
86 * A range that changed in all of base, old source, and old destination, will
87 still need to be read and analyzed. But the resulting range will be
88 unchanged from the new source, so the merge is trivial: only changes from
89 the old to new destination need be applied, and we expect many ranges not
90 to have such changes.
91
92 One example where this assumption holds is when merging a branch that
93 branched out a while back. The first merge will bring in everything that
94 has changed in other areas of the repository, so the second and following
95 merges should be much smaller.
96
97 Another example where we _expect_ this to work well is repeated multi-object
98 writes that are powered by merges: These will have frequent merges that
99 consist solely of a "directory", which is just consecutive or
100 nearly-consecutive files. Regular writes use a new "directory" and will
101 quickly start using separate ranges. Overwrites write to the same
102 "directory" each time, and will continue to race other overwites to that
103 directory -- but _not_ with concurrent writes to _other_ directories. This
104 use-case is _currently in progress_ for Spark with the lakeFS
105 OutputCommitter, and we will likely perforn it for Iceberg and probably
106 Delta.
107
108 ### Alternative
109
110 Rather than immediately retry a merge that loses a race, lakeFS could return
111 a failure with an additional "hints" field containing the resulting
112 metarange ID. The client can then try again, supplying lakeFS with the same
113 hints. This allows lakeFS to use the correct metarange in the retry.[^1]
114
115 This improves client control of retries. In some cases it may cause retries
116 to be load-balanced onto a less busy lakeFS, which might improve fairness.
117
118 ### Correctness
119
120 Why is it correct to use the new metarange? It is sufficient to show that
121 merging from the new metarange as source will yield the same sequence of
122 objects as merging from the old metarange. How this sequence is split into
123 ranges is important for efficiency, but yields an indistinguishable object
124 store.
125
126 In the sequel it helps to think of "deleted" objects that as having
127 particular contents. The behaviour of deleted objects in a merge regarding
128 results and conflicts is exactly the same as of objects with some particular
129 unique contents.
130
131 | base | src | dst/base' | result/src' | dst' | final result | comment |
132 |------|-----|-----------|-------------|------|--------------|--------------------------------|
133 | A | B | A | B | A | B | |
134 | A | A | B | B | B | B | |
135 | A | A | A | A | B | B | (Only) dst changed during race |
136 | A | B | C | conflict | -- | conflict | Conflicts never proceed |
137 | A | B | A | B | C | conflict | Conflict with updated dst |
138 | A | A | B | B | C | C | |
139
140
141 [commit_flow]: ../../accepted/metadata_kv/index.md#committer-flow
142
143 [^1]: lakeFS could sign the metarange in some way, to prevent clients
144 attempting to cheat.