# Proposal

Add a Hadoop OutputCommitter that uses _existing_ lakeFS operations for
**atomic** commits that are efficient and safely concurrent.

## Terminology

lakeFS and Hadoop both use the term "commit", but with different meanings.
In this document:
* A [_lakeFS commit_][lakefs-commit] is a revision entry in the lakeFS
  repository log.
* A [_Hadoop OutputCommitter commit_][hadoop-commit] (_HOC commit_ for
  short) is the action taken by a Hadoop OutputCommitter to finish writing
  all objects of a job to their final locations.

# Why

## Summary

The LakeFSOutputCommitter will be:

* **Faster.** Because it uses native lakeFS features, we can avoid the long,
  troubled history of Spark writes on object stores.
  * 3x fewer API calls than when using the FileOutputCommitter v1 algorithm
    (which seems to be the one most commonly used by our users). Many of
    these are entirely useless statObject calls.
  * 2x fewer API calls than when using the FileOutputCommitter v2 algorithm
    (which seems less used, and is often not as safe as the v1 algorithm).
  * Around the same number of API calls as we would achieve if we supported
    the new magic committer. Many of these are entirely useless statObject
    calls.
* **Atomic.** All objects and partitions of a file appear at once as a
  committed object on the target branch. `_SUCCESS` files are still
  generated, but only as a courtesy to Airflow and similar triggers. A
  process that examines the target branch will never see a partial state.
* **Integrated.** A full Spark write is a native lakeFS commit, a
  first-class object in the lakeFS ecosystem.
  It has a commit log (useful
  for lineage and structured metadata tagging), it can be reverted and
  merged, it has clear semantics for garbage collection, etc.

## Issues with existing OutputCommitters

OutputCommitters are charged with writing a directory tree (a multi-part
Hadoop file, for example a partitioned Parquet file) atomically.

Hadoop comes with several OutputCommitter algorithms. For S3 these try to
bridge the gap between the POSIX-ish filesystem semantics that Hadoop
assumes (HDFS) and the actual semantics of an object store (S3).

Of course, any Hadoop OutputCommitter that is unrelated to lakeFS can only
ever perform HOC commits. Any lakeFS commits on the written data must be
managed separately, by user code or manually.

The best correct committer is probably the [magic committer][magic]. It
works using only S3 (with its current consistency guarantees!), without
auxiliary systems such as a small HDFS or S3Guard. The magic committer
uploads a directory tree by starting a multipart upload to its intended
location, creating a completion record for that multipart upload on S3, and
uploading the object without completing the upload. To complete the upload,
all completion records are processed and the objects magically appear.

The [staging committer][staging] (aka the "Netflix committer") is an older
committer for S3A. It requires a "staging" filesystem on HDFS and offers
manual modes for overwriting outputs ("conflict resolution").

These issues remain with the magic and staging committers:

* Not atomic. Multipart objects appear sequentially. Finding a `_SUCCESS`
  object indicator is still required before a directory tree can be
  processed.
* The magic committer requires "magic committer" support in the FileSystem,
  in order to write `__magic` objects to a _different_ path. (Current
  lakeFSFS does not contain this support.)
* The magic committer is somewhat new:
  > It’s also not been field tested to the extent of Netflix’s committer;
  > consider it the least mature of the committers.
* Documentation is still somewhat lacking.

Versions 1 and 2 of the FileOutputCommitter are not as good. They
_rename_ files, which on S3A works by _copying and deleting_. This is both
slow and highly non-atomic, and it makes recovering from failed attempts
difficult. The current recommendations are to use the partitioned staging
committer for overwriting or updating partitioned data trees, and the magic
committer in most other cases.

## LakeFSOutputCommitter

We propose to leverage the atomic capabilities of lakeFS to write a specific
OutputCommitter for lakeFSFS. In the initial version, it will branch out,
prepare the desired output, and merge back in as part of the HOC commit.[^1]
Aborting will be done by dropping the branch (or repurposing it if the same
job ID is requested again). Regular lakeFS retention can handle dropping
data objects; we might want to add file patterns to retention to allow
temporary objects to be dropped rapidly.

[^1]: A merge is a type of commit, of course, so in this model HOC commits
  *are* lakeFS commits!

# How

## Conflict modes

Spark supports multiple "save modes": Append, Overwrite, ErrorIfExists, and
Ignore. These impact conflict resolution. We will initially support just
overwrite modes: an entire previous table will be deleted on write.

## Sample flow

* User configures lakeFSFS and configures LakeFSOutputCommitter as the
  OutputCommitter for `lakefs` protocol Paths. **Possibly** lakeFSFS will
  set up this OutputCommitter by default.

Successive steps are driven by Spark/Hadoop during output, and correspond
to the Hadoop OutputCommitter protocol.

* **Setup (job/task TBD)**: Create a new branch for this job/task.
  Its name
  is predictable from the job ID and/or task ID, so it can easily be found
  again. In "overwrite" mode, immediately delete the entire subtree of the
  intended output path, and possibly delete branches of previous tasks.
  This handles cases where the names of the objects written by the output
  format change, for instance because of repartitioning to a smaller (or
  different) number of partitions or because of nondeterminism in the names
  of the objects.
* **Write objects**: Everything is written to its correct path on the branch.
* **[HOC] Commit**: Merge back to the original branch.

## Properties

* The merge is performed by lakeFS, so it is **atomic**.
* In the **single-writer case** the merge succeeds: no other operations
  occur on the subtree.
* **In-place updates** work: the old objects are deleted and replaced by new
  objects. This is true regardless of partitioning etc.
* **Multiple writers** are detected, and the first to HOC-commit succeeds.
  All the others fail: they deleted the same previous files, or created
  a conflicting file (at least their `_SUCCESS` indicator). So their lakeFS
  merge fails due to a conflict with the first (successful) merge.
* (Conflicting) **non-OutputCommitter writes are detected** and clearly
  handled. As long as other writers create _one_ object with an overlapping
  name, the merge will fail. So LakeFSOutputCommitter can achieve its
  correct semantics regardless of which other writers are used.
* **Clearly correct by construction**: Rather than relying on single atomic
  operations and carefully tailoring operations to Spark retry mechanisms,
  we use lakeFS capabilities and guarantees. Analyzing correctness becomes
  simpler.
* **Fast**: No data copies, only (required) metadata operations. The cost
  of the lakeFS commit is linear in the number of objects it touches (and
  adding many thousands of objects is a fast operation).
  Total time to write is
  close to 3x faster than the existing FileOutputCommitter in v1 mode, close
  to 2x faster than the existing FileOutputCommitter in v2 mode (which is
  unsafe in various cases), and about as fast as the magic OutputCommitter
  would be _if_ lakeFSFS supported it.
* **Good semantics**: HOC commits will be lakeFS commits. The history of a
  Spark job appears right in lakeFS history. Metadata even includes some
  data lineage -- and in future we can easily add more, for instance as
  merge (lakeFS) commit user metadata.

### Implementation details

#### Hadoop >=3.1

Hadoop 3.1 offered a fairly complete overhaul of committer architecture,
configuration, and S3A support. Supporting new committers on older Hadoop
versions will be challenging. Hadoop 3.1 also seems to be the version from
which the magic output committer is recommended for use, so potentially our
users will be there or will agree to upgrade.

#### ParquetOutputCommitters

Parquet requires its committers to be
[`ParquetOutputCommitter`](https://github.com/apache/parquet-mr/blob/5608695f5777de1eb0899d9075ec9411cfdf31d3/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputCommitter.java#L37)s
(of course it does), see e.g. [Cloudera's
explanation](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/bk_cloud-data-access/content/enabling-directory-committer-spark.html).
It is not intended to be derived from, and cannot easily use another
OutputCommitter. However, there is a
[`BindingParquetOutputCommitter`](https://github.com/apache/spark/blob/08e6f633b5bc3a7d8d008db2a264b1607d269f25/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/BindingParquetOutputCommitter.scala#L37)
in Spark (not in Hadoop, not in Parquet-MR, in *Spark*) that claims to
transform the selected output committer into a ParquetOutputCommitter. This
is resolvable per the documentation but will require some work.
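To make the wiring concrete, here is a hedged sketch of the Spark
configuration such a committer might need on Hadoop >=3.1. The Spark and
Hadoop property keys shown are real; the lakeFS factory class name is
hypothetical and would be part of this proposal's implementation:

```properties
# Route writes through Spark's PathOutputCommitProtocol, which can wrap the
# configured committer for Parquet via BindingParquetOutputCommitter.
spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

# Select a committer factory per filesystem scheme (Hadoop >=3.1 mechanism).
# The factory class below is hypothetical -- it would ship with lakeFSFS.
spark.hadoop.mapreduce.outputcommitter.factory.scheme.lakefs=io.lakefs.committer.LakeFSOutputCommitterFactory
```

This mirrors how the S3A committers are selected for the `s3a` scheme, but
keyed on `lakefs` instead.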
## Future work

### More conflict resolution and save modes

Support all 4 Spark "save modes": Append, Overwrite, ErrorIfExists, and
Ignore.

### Multiple writers

We can support multiple concurrent writers and allow all writers to succeed
(keeping the results of the last writer).

We add a "merge-if" operation to lakeFS: atomically merge branch B into
branch A if branch A is at a given lakeFS commit.

Now change LakeFSOutputCommitter to loop during HOC commit:

* Attempt to merge the task branch into the source branch _if the source
  branch has not moved_.
* On failure:
  - Merge the source branch into the task branch using the "destination
    wins"[^2] strategy and delete all added files under the prefix (or add
    and use a new "merge but *never* copy from source branch" strategy).
  - Go back and attempt another merge.

This is essentially (noncooperative!) locking of output paths on top of
lakeFS, with no additional DB. We can even add cooperation by means of
various locking hints, _informing_ multiple jobs about attempts to update
the same paths while keeping things safe regardless.

[^2]: Whenever there is a conflict, we want the task branch (which will become
  the "latest writer" after a successful HOC commit) to win.

# Potential wins

* Explicit requests _not_ to write 3 times and to give better behaviour than
  the default FileOutputCommitter from Spark have appeared on our Slack
  [#data-architecture-discussion][slack-dont-write-thrice].
* Multiple users have requested "overwrite" save mode.
* Multiple users have requested multi-writer support.
* A developer notes that Spark performs many lakeFS API operations when
  writing.
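The "merge-if" retry loop described under "Multiple writers" can be sketched
against a toy in-memory model. Everything here is a hypothetical
illustration -- `Branch`, `merge_if`, and `hoc_commit` are not lakeFS or
Hadoop APIs, and lakeFS does not (yet) have a merge-if primitive:

```python
import threading

class Branch:
    """Toy in-memory stand-in for a lakeFS branch (illustration only)."""
    def __init__(self, head, objects):
        self.head = head              # commit id the branch points at
        self.objects = dict(objects)  # path -> contents

_lock = threading.Lock()
_commit_counter = [0]

def merge_if(source, dest, expected_head):
    """Hypothetical "merge-if": atomically merge `source` into `dest`
    only if `dest` is still at `expected_head`."""
    with _lock:
        if dest.head != expected_head:
            return False
        dest.objects.update(source.objects)
        _commit_counter[0] += 1
        dest.head = _commit_counter[0]
        return True

def hoc_commit(task, source, prefix):
    """The HOC-commit loop from the proposal: retry the conditional merge
    until it succeeds, rebasing in between."""
    while True:
        expected = source.head
        if merge_if(task, source, expected):
            return
        # Source branch moved: take its objects ("destination wins") except
        # under `prefix`, where this task's own writes must survive.
        for path, data in source.objects.items():
            if not path.startswith(prefix):
                task.objects[path] = data
```

A lock plus a head check stands in for the atomic merge precondition; the
point is only that the loop terminates with last-writer-wins semantics while
leaving objects outside the output prefix untouched.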
[magic]: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#The_Magic_Committer
[staging]: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#The_Staging_Committer
[lakefs-commit]: https://docs.lakefs.io/understand/object-model.html#commits
[hadoop-commit]: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/OutputCommitter.html
[slack-dont-write-thrice]: https://app.slack.com/client/T013V60QY06/C020N7X2Y0H/thread/C020N7X2Y0H-1660298516.202499