# Technical conclusions from the LakeFSOutputCommitter attempt

## Planning

### Poor performance planning

#### Merges

The original design was based on a single branch per task. This allows for very fast writing (all writes are independent), but merging 4000 times turns out to be a slow operation.

Earlier communication with the VE team would have allowed us to avoid this.

**Conclusion:** Merge operations can be slow. We saw a 90th percentile of multiple seconds on *independent* branch merges. FindMergeBase was very slow for moderately distant merge bases: seconds to search a few thousand commits.

**Possible mitigations:**

All except the first might need issues to be opened; **communicate with VE**.

* In the future, avoid using multiple merges for atomicity :-)
* The commit cache is too small and expires too quickly. Increase its size and expiry. (#4637)
* FindMergeBase performs batched operations in a loop. Every BFS iteration could accumulate all its commit gets into a single batch and dispatch it. At the very least, don't batch these operations. (#2981 is old but still relevant, maybe even more so, it seems!)
* Consider allowing multi-way merges (a sketch follows this list).
* Increase Pyramid cache sizes.
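To make the multi-way merge suggestion concrete, here is a minimal sketch of the difference it would make to a committer. The `LakeFSClient` interface and the `mergeAll` call are hypothetical, for illustration only; neither is part of the current lakeFS API.

```java
import java.util.List;

// Hypothetical client interface, for illustration only; this is not the
// generated lakeFS API client, and lakeFS has no multi-way merge today.
interface LakeFSClient {
    /** Two-way merge, as the current API offers. */
    void merge(String repo, String sourceBranch, String destinationBranch);

    /** Hypothetical multi-way merge: fold many sources into one destination in one operation. */
    void mergeAll(String repo, List<String> sourceBranches, String destinationBranch);
}

class MergeStrategies {
    /** Today: N separate merges, each searching for a merge base and contending on dest. */
    static void mergeOneByOne(LakeFSClient lakefs, String repo, List<String> taskBranches, String dest) {
        for (String branch : taskBranches) {
            lakefs.merge(repo, branch, dest);
        }
    }

    /** With a multi-way merge: a single call and a single resulting merge commit on dest. */
    static void mergeMultiWay(LakeFSClient lakefs, String repo, List<String> taskBranches, String dest) {
        lakefs.mergeAll(repo, taskBranches, dest);
    }
}
```

Besides removing the need for a fast FindMergeBase on every merge, a single n-way merge would leave one merge commit on the destination branch instead of thousands.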
#### Limited support for relinking objects

We switched to _moving_ objects from per-task branches to a job branch. This avoids multiple merges, and is safe because per-task objects have different names. In practice the API does not allow moving objects, so this was really relinking unstaged objects across branches.

Support for this was being removed as part of incremental GC work, concurrently with our implementation. Luckily we caught it immediately.

**Conclusion:** We cannot reduce branch concurrency using multiple branches like this. :shrug:

#### Single-branch write performance on lakeFSFS

Switching to writing everything to a _single_ branch showed poor performance of lakeFS as exercised by lakeFSFS. CloudWatch Insights shows that one key takes a huge amount of load -- the branch key. DynamoDB cannot partition a single key, so it is necessarily throttled regardless of purchased quota.

**Conclusion:** lakeFSFS performance is poor on lakeFS with DynamoDB as its KV store. We need to work on improving lakeFS performance with DynamoDB (some of this work is proposed in #4769, #4734, #4712), and also on improving lakeFSFS on lakeFS (some of this work is proposed in #4722, #4721, #4676).

### Poor communication with other teams

We assumed multiple non-functional requirements and one functional requirement that did not actually exist. We should have specified these requirements as part of the design and communicated them to the VE team.

### Failed to test that the project can increase Spark write performance on lakeFS

As proposed, the main goal of the project was to increase write performance on lakeFS. However, our plan did not include an early enough test for this, and in fact we did not measure it early enough.

We did have some reasons for this:

* We estimated system performance by counting API calls. However, our estimates of API latencies proved incorrect.
* Whole-system performance is hard to measure before the system is finished, and we wanted to advance to that stage.
* We considered functional requirements (actually being able to implement a working OutputCommitter) to be a higher risk than non-functional requirements (performance and compatibility).

In retrospect these reasons are clearly wrong. The design should have set clear performance requirements.

## Shared vision

### No expected performance of lakeFSFS

We spec'ed lakeFSFS to be a few percent slower than S3A on S3 with "magic", but we have no typical load to use to measure this.

As a result we cannot design efficient flows that use lakeFSFS, or state that the performance we get in some setup is adequate.

We lack an understanding of what performance to expect from lakeFSFS. This should include expected performance and the affecting variables for these operations:

* List "directory", recursively and non-recursively.
* Create file.
* Open file.
* Check file existence.
* Delete file or "directory", recursively and non-recursively.
* Write a 1000-partition Parquet file (for instance).
* Read from all objects of a 1000-partition Parquet file (for instance). A good refinement is to read _all_ fields and also to read _just one_ field from the file.

### No expected performance of lakeFS

We spec'ed lakeFS to be a few percent slower than S3, but we have no typical load that we intend to use to measure this.

As a result we cannot design efficient flows that use lakeFS, or state that the performance we get in some setup is adequate.

We lack an understanding of what performance to expect from lakeFS. This should include expected performance and the affecting variables for these operations:

* Commits. (Possible affecting variables: # of modified objects, ranges)
* Merges. (Possible affecting variables: # of commits to scan, # of ranges modified, # of contending and uncontending concurrent merges)
* Reads and lists. (Possible affecting variables: # of concurrent reads on the same branch, # of concurrent reads, expected size of returned listing)
* Writes. (Possible affecting variables: # of concurrent mutations on the same branch, # of concurrent mutations)

### No best practices

In order to support single-writer and multi-writer configurations we should decide on best practices for how to perform several basic tasks. These may well depend on the results of the above performance work. Currently nothing is blocked on this, giving us some time to do this correctly. However, some things don't currently work, and one reasonable outcome is to remove support for them.

* How to cause multiple files to appear atomically
  * Commits?
  * Merges?
  * Some new mechanism, perhaps multiple staging areas?
* How to abort work on failure after emitting some objects.
* Recommended synchronization mechanisms in lakeFS
  * Merges?
  * Return status of delete object? (Currently broken, #4798)
  * Return status of Set IfAbsent object? (Currently broken, #4796)
  * Some new mechanism, perhaps a KV API?

## Background: stages and experiments performed

In all measurements we wrote a small CSV onto 4000 partitions, measuring OutputCommitter performance only.

### 1. Per-task branches, merged during taskCommit

The initial implementation had one branch for working on the job and one branch per task. Each successful task was merged into the job branch. At the end, the job branch was merged into the output branch.
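As a rough sketch of that flow (using a hypothetical `LakeFSClient` stand-in for the lakeFS API client and an assumed branch-naming scheme, not the actual implementation):

```java
// Sketch of the stage-1 commit protocol; names and signatures are illustrative.
interface LakeFSClient {
    void createBranch(String repo, String name, String sourceRef);
    void merge(String repo, String sourceBranch, String destinationBranch);
    void deleteBranch(String repo, String name);
}

class PerTaskBranchCommitter {
    private final LakeFSClient lakefs;
    private final String repo;
    private final String outputBranch;
    private final String jobBranch;

    PerTaskBranchCommitter(LakeFSClient lakefs, String repo, String outputBranch, String jobId) {
        this.lakefs = lakefs;
        this.repo = repo;
        this.outputBranch = outputBranch;
        this.jobBranch = "commit-job-" + jobId;          // assumed naming scheme
    }

    void setupJob() {
        lakefs.createBranch(repo, jobBranch, outputBranch);
    }

    void setupTask(String taskAttemptId) {
        // One branch per task attempt; all of the task's writes go to this branch.
        lakefs.createBranch(repo, taskBranch(taskAttemptId), jobBranch);
    }

    void commitTask(String taskAttemptId) {
        // 4000 partitions => 4000 of these merges, all racing to update jobBranch.
        lakefs.merge(repo, taskBranch(taskAttemptId), jobBranch);
        lakefs.deleteBranch(repo, taskBranch(taskAttemptId));
    }

    void commitJob() {
        // A single final merge makes the whole output appear atomically.
        lakefs.merge(repo, jobBranch, outputBranch);
        lakefs.deleteBranch(repo, jobBranch);
    }

    private String taskBranch(String taskAttemptId) {
        return jobBranch + "-task-" + taskAttemptId;
    }
}
```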
This design ran slowly, with concurrent merges racing to update the job branch. We had to increase the client API timeout, after which a write took >30m.

The proposal to [re-use metaranges in merge retries][meta-retry] could help here.

(We also had to boost the Pyramid cache size to avoid spills. But a cache fill takes very little time, and cannot account for 30m.)

At this stage we switched the lakeFS cloud staging cluster to use DynamoDB. This slowed things down further, to >50m. Increasing the DynamoDB quota helped, but did not get below 40m.

[meta-retry]: https://github.com/treeverse/lakeFS/blob/1dcefc2eecdc9b8693439c1afc6beccc134a11a8/design/open/reuse-contented-merge-metaranges/reuse-contented-merge-metaranges.md

### 2. Independent concurrent merges during jobCommit

We worked around concurrent merges by merging only in jobCommit. This allows independent concurrent merges of all the branches[^1] and eliminates racing merges. There was no significant speedup (~30-40m, but on DynamoDB).

DynamoDB was throttled continually at this point. One issue: FindMergeBase has to be called once per merge, but it is slow to fetch commits. The commit cache is much too small and is not configurable, so cache spills happen. And fetches are batched, adding even more latency to each iteration. See issues #2981, #4637.

One way to make this work well would be to support multi-way merges like Git does. That would remove the need for a fast FindMergeBase, and simplify the user experience when looking at commit logs.

[^1]: Performed optimally: keep a fixed number of workers, each of which repeatedly merges two branches that are not participating in any other merge. Being this dynamic means that the result of a particularly slow merge is only needed later, minimizing delay.

### 3. taskCommit without merging

We next avoided all task branch merges. Instead, we re-linked objects from the task branch into the job branch. This worked nicely: 2.75m. Still not the desired 2m, but seemingly within reach.

But [support for this re-linking had just been removed][no-relink].

[no-relink]: https://github.com/treeverse/lakeFS/blob/master/design/accepted/gc_plus/uncommitted-gc.md#getlinkphysicaladdress

### 4. Tasks on job branch

In the final attempt, we wrote all objects to a temporary prefix on the job branch and re-linked them to their final location during taskCommit. This is allowed by the new API. However, it still overloads lakeFS, regardless of configured DynamoDB capacity: lakeFS repeatedly accesses the branch entry, and DynamoDB cannot partition a single key, so it is subject to a hard rate limit. As a result we went back to >30m.

### Complication: two job IDs

This applies to all the work done.

One complication seems to be a bug in Spark writers (FileFormatWriter). The writers generate two distinct job contexts: one used to call setupJob/commitJob, the other used to call setupTask/commitTask on all tasks. So it is not possible to use the job ID to identify the job.

Instead we simply used the output path. Doing this prevents implementing concurrent writers; obviously we never got that far.
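For illustration, a minimal sketch of that workaround, with an assumed helper name and normalization (not the code we actually shipped): derive the job key from the output path, which both contexts agree on, and use it wherever a job ID would otherwise be needed, for example in branch names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: derive a stable job key from the output path, because the
// Spark-provided job ID differs between the job-side and task-side contexts.
final class JobKeys {
    private JobKeys() {}

    /** Hash the normalized output path into an identifier that is safe to embed in branch names. */
    static String forOutputPath(String outputPath) {
        String normalized = outputPath.replaceAll("/+$", ""); // ignore trailing slashes
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("job-");
            for (int i = 0; i < 8; i++) {                      // 16 hex characters are plenty here
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);                // SHA-256 is always available
        }
    }
}
```

Both the job-side and the task-side code paths then derive the same key from the output path instead of reading the job ID from their mismatched contexts. Keying everything on the output path is also exactly why two concurrent writers to the same path cannot be told apart, which is the limitation noted above.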