# Technical conclusions from the LakeFSOutputCommitter attempt

## Planning

### Poor performance planning

#### Merges

The original design was based on a single branch per task.  This allows for
very fast writing (all writes are independent), but merging 4000 times turns
out to be a slow operation.

Earlier communication with the VE team would have let us avoid this.

**Conclusion:** Merge operations can be slow.  We saw 90th-percentile times
of multiple seconds on *independent* branch merges.  FindMergeBase was very
slow for moderately far merge bases: seconds to search a few thousand
commits.

**Possible mitigations:**

All except the first might need issues opened; **communicate with VE**.

* In the future, avoid using multiple merges for atomicity :-)
* The commit cache is too small and expires too quickly.  Increase its size
  and expiry.  (#4637)
* FindMergeBase performs batched operations in a loop.  Every BFS iteration
  could accumulate all its commit gets into a single batch and dispatch it
  (see the sketch after this list).  At the very least, don't batch these
  operations.  (#2981 is old but still relevant, perhaps even more so!)
* Consider allowing multi-way merges.
* Increase Pyramid cache sizes.
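
A minimal sketch of the batching idea from the FindMergeBase bullet above:
each BFS iteration collects all the commit IDs it needs and fetches them in
a single batched call.  The names here (`Commit`, `CommitGetter`,
`GetCommitBatch`) are hypothetical and do not reflect the actual graveler
API.

```go
// Sketch only: a batched frontier expansion for a merge-base BFS.
package mergebase

import "context"

type Commit struct {
	ID      string
	Parents []string
}

// CommitGetter fetches many commits in one round trip, e.g. a single
// batched read against the KV store.
type CommitGetter interface {
	GetCommitBatch(ctx context.Context, ids []string) (map[string]Commit, error)
}

// ExpandFrontier replaces every commit in the BFS frontier with its unseen
// parents, fetching all of them in one batched call instead of issuing one
// get per commit per iteration.
func ExpandFrontier(ctx context.Context, g CommitGetter, frontier []Commit, seen map[string]bool) ([]Commit, error) {
	var want []string
	for _, c := range frontier {
		for _, p := range c.Parents {
			if !seen[p] {
				seen[p] = true
				want = append(want, p)
			}
		}
	}
	if len(want) == 0 {
		return nil, nil
	}
	commits, err := g.GetCommitBatch(ctx, want)
	if err != nil {
		return nil, err
	}
	next := make([]Commit, 0, len(want))
	for _, id := range want {
		if c, ok := commits[id]; ok {
			next = append(next, c)
		}
	}
	return next, nil
}
```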

#### Limited support for relinking objects

We switched to _moving_ objects from per-task branches to a job branch.  It
avoids multiple merges, and is safe because per-task objects have different
names.  In practice the API does not allow moving objects, so this was
really relinking unstaged objects across branches.
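
For reference, a minimal sketch of what this relinking looked like, assuming
a client that can stage an entry pointing at an existing physical address.
The `Stager` interface and its methods are hypothetical; the real API
surface for this was removed, as described below.

```go
// Sketch only: "moving" an uncommitted object between branches by staging a
// new entry on the destination branch that points at the same physical
// address.  Only metadata moves; the data stays in place.
package relink

import "context"

type ObjectStat struct {
	Path            string
	PhysicalAddress string
	Checksum        string
	SizeBytes       int64
}

type Stager interface {
	// StatObject returns metadata for an uncommitted object on a branch.
	StatObject(ctx context.Context, branch, path string) (ObjectStat, error)
	// StageObject records an entry on a branch that points at an existing
	// physical address, without copying any data.
	StageObject(ctx context.Context, branch string, stat ObjectStat) error
	// DeleteObject removes the uncommitted entry from a branch.
	DeleteObject(ctx context.Context, branch, path string) error
}

// Relink makes every listed object appear on the job branch and removes it
// from the task branch.
func Relink(ctx context.Context, s Stager, taskBranch, jobBranch string, paths []string) error {
	for _, p := range paths {
		stat, err := s.StatObject(ctx, taskBranch, p)
		if err != nil {
			return err
		}
		if err := s.StageObject(ctx, jobBranch, stat); err != nil {
			return err
		}
		if err := s.DeleteObject(ctx, taskBranch, p); err != nil {
			return err
		}
	}
	return nil
}
```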

Support for this was going away as part of incremental GC, concurrently with
our implementation.  Luckily we immediately caught it.

**Conclusion:** We cannot reduce contention on a single branch by spreading
work across multiple branches like this. :shrug:

#### Single-branch write performance on lakeFSFS

Switching to writing everything to a _single_ branch exposed the poor
performance of lakeFS as exercised by lakeFSFS.  CloudWatch Insights shows
that one key takes a huge amount of load -- the branch key.  DynamoDB cannot
partition a single key, so it is necessarily throttled regardless of
purchased quota.

**Conclusion:** lakeFSFS performance is poor on lakeFS with DynamoDB as its
KV store.  Work on improving lakeFS performance with DynamoDB (some of this
work is proposed in #4769, #4734, #4712), and also on improving lakeFSFS on
lakeFS (some of this work is proposed in #4722, #4721, #4676).
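
A simplified model of why the branch key gets hot, assuming (as described
above) that every uncommitted write must consult the branch record before
staging its entry.  The `KV` interface and the key layout are illustrative
only, not the real lakeFS KV schema.

```go
// Sketch only: every staged write first reads the same branch record, so a
// single KV key absorbs the load of all 4000 concurrent tasks.  DynamoDB
// limits throughput per partition key, so purchased table capacity does not
// help.
package hotkey

import "context"

type KV interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error
}

// StageWrite models one uncommitted object write.
func StageWrite(ctx context.Context, kv KV, repo, branch, path string, entry []byte) error {
	// Hot key: every writer hits the same branch record.
	stagingToken, err := kv.Get(ctx, "branch/"+repo+"/"+branch)
	if err != nil {
		return err
	}
	return kv.Set(ctx, "staging/"+string(stagingToken)+"/"+path, entry)
}
```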

### Poor communication with other teams

We assumed multiple non-functional requirements and one functional
requirement that did not actually exist.  We should have specified these
requirements as part of the design and communicated them to the VE team.

### Failed to test whether the project can increase Spark write performance on lakeFS

As proposed, the main goal of the project was to increase write performance
on lakeFS.  However, our plan did not include an early enough test for this,
and in fact we did not measure it early enough.

We did have some reasons for this:

* We estimated system performance by counting API calls.  However, our
  estimates of API latencies proved incorrect.
* Whole-system performance is hard to measure before the system is
  finished, and we wanted to advance to that stage.
* We considered functional requirements (actually being able to implement a
  working OutputCommitter) to be at higher risk than non-functional
  requirements (performance and compatibility).

In retrospect these reasons are clearly wrong.  The requirements should
have set clear performance targets.

## Shared vision

### No expected performance of lakeFSFS

We spec'ed lakeFSFS to be a few percent slower than S3A on S3 with "magic",
but we have no typical load to use for measuring.

As a result we cannot design efficient flows that use lakeFSFS, or determine
whether the performance we get in some setup is adequate.

We lack an understanding of what performance to expect from lakeFSFS.  This
should include expected performance and the affecting variables for each of
these (a measurement sketch follows the list):

* List "directory", recursively and non-recursively.
* Create file.
* Open file.
* Check file existence.
* Delete file or "directory", recursively and non-recursively.
* Write a 1000-partition Parquet file (for instance).
* Read from all objects of a 1000-partition Parquet file (for instance).  A
  good refinement is to read _all_ fields and also to read _just one_ field
  from the file.
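
One way to start answering this, sketched below: a small, client-agnostic
harness that runs any operation under test repeatedly and reports latency
percentiles.  Everything here is illustrative; the operations themselves
would be closures wrapping whichever lakeFSFS or lakeFS client calls we want
to characterize.

```go
// Sketch only: a tiny latency harness.  The operation under test is passed
// in as a closure, so it can wrap any lakeFSFS or lakeFS call.
package bench

import (
	"context"
	"fmt"
	"sort"
	"time"
)

// Measure runs op n times and prints p50/p90/p99 latencies.
func Measure(ctx context.Context, name string, n int, op func(context.Context) error) error {
	if n <= 0 {
		return nil
	}
	durations := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		if err := op(ctx); err != nil {
			return fmt.Errorf("%s run %d: %w", name, i, err)
		}
		durations = append(durations, time.Since(start))
	}
	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
	pct := func(p float64) time.Duration { return durations[int(p*float64(len(durations)-1))] }
	fmt.Printf("%s: p50=%v p90=%v p99=%v\n", name, pct(0.50), pct(0.90), pct(0.99))
	return nil
}
```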

### No expected performance of lakeFS

We spec'ed lakeFS to be a few percent slower than S3, but we have no typical
load that we intend to use for measuring.

As a result we cannot design efficient flows that use lakeFS, or state that
the performance we get in some setup is adequate.

We lack an understanding of what performance to expect from lakeFS.  This
should include expected performance and the affecting variables for each of
these (a contention-measurement sketch follows the list):

* Commits.  (Possible affecting variables: # of modified objects, # of
  ranges)
* Merges.  (Possible affecting variables: # of commits to scan, # of ranges
  modified, # of contending and non-contending concurrent merges)
* Reads and lists.  (Possible affecting variables: # of concurrent reads on
  the same branch, # of concurrent reads, expected size of returned listing)
* Writes.  (Possible affecting variables: # of concurrent mutations on the
  same branch, # of concurrent mutations)
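
For the merge variables in particular, a sketch of how contention could be
measured: N workers merge N source branches into the same target
concurrently and report per-merge latency.  The `Merger` interface is
hypothetical and stands in for whichever client is used.

```go
// Sketch only: measuring merge latency under contention.  All merges target
// the same branch, so they race to update it.
package contention

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type Merger interface {
	Merge(ctx context.Context, source, target string) error
}

// MergeAll merges every source branch into target concurrently and prints
// per-merge latency, which includes time lost to retries on contended
// updates of the target branch.
func MergeAll(ctx context.Context, m Merger, sources []string, target string) {
	var wg sync.WaitGroup
	for _, src := range sources {
		wg.Add(1)
		go func(src string) {
			defer wg.Done()
			start := time.Now()
			err := m.Merge(ctx, src, target)
			fmt.Printf("merge %s -> %s: %v err=%v\n", src, target, time.Since(start), err)
		}(src)
	}
	wg.Wait()
}
```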

### No best practices

In order to support single-writer and multi-writer configurations we should
decide on best practices for how to perform several basic tasks.  These may
well depend on the results of the above performance work.  Currently nothing
is blocked on this, giving us some time to do it correctly.  However some
things don't currently work, and one reasonable outcome is to remove support
for them.

* How to cause multiple files to appear atomically (one candidate pattern is
  sketched after this list)
  * Commits?
  * Merges?
  * Some new mechanism, perhaps multiple staging areas?
* How to abort work on failure after emitting some objects.
* Recommended synchronization mechanisms in lakeFS
  * Merges?
  * Return status of delete object?  (Currently broken, #4798)
  * Return status of Set IfAbsent object?  (Currently broken, #4796)
  * Some new mechanism, perhaps a KV API?
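
As one candidate answer for the first item, here is a sketch of the
temporary-branch pattern: write everything to a temporary branch, commit,
and merge, so readers of the target branch see all of the files or none of
them.  The `Client` interface is hypothetical.

```go
// Sketch only: making a set of files appear atomically by writing them to a
// temporary branch, committing, and merging.
package publish

import "context"

type Client interface {
	CreateBranch(ctx context.Context, name, source string) error
	Upload(ctx context.Context, branch, path string, data []byte) error
	Commit(ctx context.Context, branch, message string) error
	Merge(ctx context.Context, source, target string) error
	DeleteBranch(ctx context.Context, name string) error
}

// PublishAtomically uploads all files to a temporary branch and merges them
// into target in one merge.  The temporary branch also gives a natural abort
// path: on failure, simply delete it.
func PublishAtomically(ctx context.Context, c Client, target, tmp string, files map[string][]byte) error {
	if err := c.CreateBranch(ctx, tmp, target); err != nil {
		return err
	}
	defer func() { _ = c.DeleteBranch(ctx, tmp) }() // best-effort cleanup / abort
	for path, data := range files {
		if err := c.Upload(ctx, tmp, path, data); err != nil {
			return err
		}
	}
	if err := c.Commit(ctx, tmp, "publish to "+target); err != nil {
		return err
	}
	return c.Merge(ctx, tmp, target)
}
```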

## Background: stages and experiments performed

In all measurements we wrote a small CSV across 4000 partitions, measuring
OutputCommitter performance only.

### 1. Per-task branches, merged during taskCommit

The initial implementation had one branch for working on the job and one
branch per task.  Each successful task was merged into the job branch.  At
the end, the job branch was merged into the output branch.
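
A minimal sketch of how this first design mapped the committer hooks to
branch operations; the `Client` interface, branch naming, and error handling
are simplified assumptions, not the actual implementation.

```go
// Sketch only: the first design's mapping of OutputCommitter hooks to
// branch operations.
package committer

import (
	"context"
	"fmt"
)

type Client interface {
	CreateBranch(ctx context.Context, name, source string) error
	Merge(ctx context.Context, source, target string) error
}

type Committer struct {
	Client       Client
	OutputBranch string
	JobID        string
}

func (c *Committer) jobBranch() string { return fmt.Sprintf("job-%s", c.JobID) }

func (c *Committer) taskBranch(task string) string {
	return fmt.Sprintf("job-%s-task-%s", c.JobID, task)
}

func (c *Committer) SetupJob(ctx context.Context) error {
	return c.Client.CreateBranch(ctx, c.jobBranch(), c.OutputBranch)
}

func (c *Committer) SetupTask(ctx context.Context, task string) error {
	return c.Client.CreateBranch(ctx, c.taskBranch(task), c.jobBranch())
}

// CommitTask is where the contention came from: every task merges into the
// same job branch, and concurrent merges race to update it.
func (c *Committer) CommitTask(ctx context.Context, task string) error {
	return c.Client.Merge(ctx, c.taskBranch(task), c.jobBranch())
}

func (c *Committer) CommitJob(ctx context.Context) error {
	return c.Client.Merge(ctx, c.jobBranch(), c.OutputBranch)
}
```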

This ran slowly, with concurrent merges racing to update the job branch.  A
client API timeout increase was needed, after which writing still took >30m.

The proposal to [re-use metaranges in merge retries][meta-retry] could help
here.

(We also had to boost the Pyramid cache size to avoid spills.  But a cache
fill takes very little time, and cannot account for 30m.)

At this stage we switched the lakeFS cloud staging cluster to use DynamoDB.
This slowed things down further, to >50m.  Increasing DynamoDB quota helped,
but times did not go below 40m.

[meta-retry]: https://github.com/treeverse/lakeFS/blob/1dcefc2eecdc9b8693439c1afc6beccc134a11a8/design/open/reuse-contented-merge-metaranges/reuse-contented-merge-metaranges.md

### 2. Independent concurrent merges during jobCommit

We worked around concurrent merges by merging only in jobCommit.  This
allows independent concurrent merges of all the branches[^1] and eliminates
racing merges.  There was no significant speedup (~30-40m, but on DynamoDB).

DynamoDB was throttled continually at this point.  One issue: FindMergeBase
has to be called once per merge, but it is slow to fetch commits.  The
commit cache is much too small and not configurable, so cache spills happen.
And fetches are batched, adding even more latency to each iteration.  See
issues #2981, #4637.

One way to make this work well would be to support multiway merges like Git
does.  That would remove the need for a fast FindMergeBase, and simplify the
user experience when looking at commit logs.

[^1]: Performed optimally: keep a fixed number of workers.  Each repeatedly
    merges two branches that are not participating in any merge.  Being so
    dynamic means that the result of a particularly slow merge is used late,
    minimizing delay.  (A sketch of this scheme follows.)
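
A sketch of the scheme described in the footnote: a fixed pool of workers
repeatedly takes two branches that are not in any ongoing merge, merges one
into the other, and returns the result to the pool, so a single slow merge
only delays the final steps.  The `Merger` interface is hypothetical.

```go
// Sketch only: dynamic pairwise merging with a fixed worker pool.  Branches
// sitting in the channel are, by construction, not part of any ongoing
// merge.
package pairwise

import (
	"context"
	"sync"
)

type Merger interface {
	Merge(ctx context.Context, source, target string) error
}

// Reduce merges all branches down to one branch and returns its name.
func Reduce(ctx context.Context, m Merger, branches []string, workers int) (string, error) {
	if len(branches) == 0 {
		return "", nil
	}
	if workers < 1 {
		workers = 1
	}
	pending := make(chan string, len(branches))
	for _, b := range branches {
		pending <- b
	}

	var (
		mu        sync.Mutex
		remaining = len(branches) // branches left once all claimed merges finish
		firstErr  error
		wg        sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				// Claim one merge; each merge reduces the branch count by one.
				mu.Lock()
				if remaining <= 1 {
					mu.Unlock()
					return
				}
				remaining--
				mu.Unlock()

				// Take any two idle branches and merge one into the other.
				src, dst := <-pending, <-pending
				if err := m.Merge(ctx, src, dst); err != nil {
					mu.Lock()
					if firstErr == nil {
						firstErr = err
					}
					mu.Unlock()
				}
				// dst now holds src's changes and is immediately reusable,
				// even while slower merges are still running.
				pending <- dst
			}
		}()
	}
	wg.Wait()
	return <-pending, firstErr
}
```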

### 3. taskCommit without merging

We next avoided all task branch merges.  Instead, we re-linked objects from
the task branch into the job branch.  This worked nicely: 2.75m.  Still not
the desired 2m, but seemingly within reach.

But [support for this re-linking had just been removed][no-relink].

[no-relink]: https://github.com/treeverse/lakeFS/blob/master/design/accepted/gc_plus/uncommitted-gc.md#getlinkphysicaladdress

### 4. Tasks on job branch

In the final attempt, we wrote all objects to a temporary prefix on the job
branch and re-linked them to their final location during taskCommit.  This
is allowed by the new API.  However, it still overloads lakeFS, regardless
of configured DynamoDB capacity.  lakeFS repeatedly accesses the branch
entry, and DynamoDB cannot partition it, so it is subject to a hard rate
limit.  As a result we went back to >30m.

### Complication: two job IDs

This applies to all work done.

One complication appears to be a bug in Spark writers (FileFormatWriter).
The writers generate two distinct job contexts: one to call
startJob/commitJob, the other to call startTask/commitTask on all tasks.  So
it is not possible to use the job ID to identify the job.

Instead we simply used the output path.  Doing this prevents implementation
of concurrent writers; obviously we never got that far.
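
A small sketch of the workaround, assuming the committer keys its state by a
hash of the output path rather than by Spark's job ID; the function name is
hypothetical.

```go
// Sketch only: derive a stable job identifier from the output path, since
// the two Spark job contexts do not share a job ID.  This also means two
// concurrent writers to the same output path cannot be told apart.
package jobkey

import (
	"crypto/sha256"
	"encoding/hex"
)

// JobKey returns a short, branch-name-safe identifier for a write job.
func JobKey(outputPath string) string {
	sum := sha256.Sum256([]byte(outputPath))
	return hex.EncodeToString(sum[:8])
}
```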