# Proposal

Add a Hadoop OutputCommitter that uses _existing_ lakeFS operations for
**atomic** commits that are efficient and safely concurrent.

## Terminology

lakeFS and Hadoop both use the term "commit", but with different meanings.
In this document:
* A [_lakeFS commit_][lakefs-commit] is a revision entry in the lakeFS
  repository log.
* A [_Hadoop OutputCommitter commit_][hadoop-commit] (_HOC commit_ for
  short) is the action taken by a Hadoop OutputCommitter to finish writing
  all objects of a job to their final locations.

# Why

## Summary

The LakeFSOutputCommitter will be:

* **Faster.** Because it uses native lakeFS features, we can avoid the long,
  troubled history of Spark writes on object stores.
  * 3x fewer API calls than when using the FileOutputCommitter v1 algorithm
    (which seems to be the one most commonly used by our users). Many of
    these are entirely useless statObject calls.
  * 2x fewer API calls than when using the FileOutputCommitter v2 algorithm
    (which seems less used, and is often not as safe as the v1 algorithm).
  * Around the same number of API calls as we would achieve if we supported
    the new magic committer. Many of these are entirely useless statObject
    calls.
* **Atomic.** All objects and partitions of a file appear at once as a
  committed object on the target branch. `_SUCCESS` files are still
  generated, but only as a courtesy to Airflow and similar triggers. A
  process that examines the target branch will never see a partial state.
* **Integrated.** A full Spark write is a native lakeFS commit, a
  first-class object in the lakeFS ecosystem.
  It has a commit log (useful
  for lineage and structured metadata tagging), it can be reverted and
  merged, it has clear semantics for garbage collection, etc.

## Issues with existing OutputCommitters

OutputCommitters are charged with writing a directory tree (a multi-part
Hadoop file, for example a partitioned Parquet file) atomically.

Hadoop comes with several OutputCommitter algorithms. For S3 these try to
bridge the gap between the POSIX-ish filesystem semantics that Hadoop
assumes (HDFS) and the actual semantics of an object store (S3).

Of course, any Hadoop OutputCommitter that is unrelated to lakeFS can only
ever perform HOC commits. Any lakeFS commits on the written data must be
managed separately, by user code or manually.

The best correct committer is probably the [magic committer][magic]. It
works using only S3 (with its current consistency guarantees!), without
auxiliary systems such as a small HDFS or S3Guard. The magic committer
uploads a directory tree by starting a multipart upload to its intended
location, creating a completion record for that multipart upload on S3, and
uploading the object without completing the upload. To complete the upload,
all completion records are processed and the objects magically appear.

The [staging committer][staging] (aka the "Netflix committer") is an older
committer for S3A. It requires a "staging" filesystem on HDFS and offers
manual modes for overwriting outputs ("conflict resolution").

These issues remain with the magic and staging committers:

* Not atomic. Multipart objects appear sequentially. Finding a `_SUCCESS`
  object indicator is still required before a directory tree can be
  processed.
* The magic committer requires "magic committer" support in the FileSystem,
  in order to write `__magic` objects to a _different_ path. (Current
  lakeFSFS does not contain this support.)
* The magic committer is somewhat new:
  > It’s also not been field tested to the extent of Netflix’s committer;
  > consider it the least mature of the committers.
* Documentation is still somewhat lacking.

Versions 1 and 2 of the FileOutputCommitter are not as good. They
_rename_ files, which on S3A works by _copying and deleting_. This is both
slow and highly non-atomic, and it makes recovering from failed attempts
difficult. The current recommendations are to use the partitioned staging
committer for overwriting or updating partitioned data trees, and the magic
committer in most other cases.

## LakeFSOutputCommitter

We propose to leverage the atomic capabilities of lakeFS to write a specific
OutputCommitter for lakeFSFS. In the initial version, it will branch out,
prepare the desired output, and merge back in as part of the HOC commit.[^1]
Aborting will be done by dropping the branch (or repurposing it if the same
job ID is requested again). Regular lakeFS retention can handle dropping
data objects; we might want to add file patterns to retention to allow
temporary objects to be dropped rapidly.

[^1]: A merge is a type of commit, of course, so in this model HOC commits
  *are* lakeFS commits!

# How

## Conflict modes

Spark supports multiple "save modes": Append, Overwrite, ErrorIfExists, and
Ignore. These impact conflict resolution. We will initially support just
overwrite modes: an entire previous table will be deleted on write.

## Sample flow

* User configures lakeFSFS and configures LakeFSOutputCommitter as the
  OutputCommitter for `lakefs` protocol Paths. **Possibly** lakeFSFS will
  set up this OutputCommitter by default.

Successive steps are driven by Spark/Hadoop during output, and correspond
to the Hadoop OutputCommitter protocol.

* **Setup (job/task TBD)**: Create a new branch for this job/task.
  Its name
  is predictable from the job ID and/or task ID, so it can easily be found
  again. In "overwrite" mode, immediately delete the entire subtree of the
  intended output path, and possibly delete branches of previous tasks.
  This handles cases where the names of the objects written by the output
  format change, for instance because of repartitioning to a smaller (or
  different) number of partitions or because of nondeterminism in the names
  of the objects.
* **Write objects**: Everything is written to its correct path on the branch.
* **[HOC] Commit**: Merge back to the original branch.

## Properties

* The merge is performed by lakeFS, so it is **atomic**.
* In the **single-writer case** the merge succeeds: no other operations
  occur on the subtree.
* **In-place updates** work: the old objects are deleted and replaced by new
  objects. This is true regardless of partitioning etc.
* **Multiple writers** are detected, and the first to HOC-commit succeeds.
  All the others fail: they deleted the same previous files, or created
  a conflicting file (at least their `_SUCCESS` indicator). So their lakeFS
  merge fails due to a conflict with the first (successful) merge.
* (Conflicting) **non-OutputCommitter writes are detected** and clearly
  handled. As long as other writers create _one_ object with an overlapping
  name, the merge will fail. So LakeFSOutputCommitter can achieve its
  correct semantics regardless of which other writers are used.
* **Clearly correct by construction**: Rather than relying on single atomic
  operations and carefully tailoring operations to Spark retry mechanisms,
  we use lakeFS capabilities and guarantees. Analyzing correctness becomes
  simpler.
* **Fast**: No data copies, only (required) metadata operations. The cost
  of the lakeFS commit is linear in the number of objects it touches (and
  adding many thousands of objects is a fast operation).
  Total time to write is
  close to 3x faster than the existing FileOutputCommitter in v1 mode, close
  to 2x faster than the existing FileOutputCommitter in v2 mode (which is
  unsafe in various cases), and about as fast as the magic OutputCommitter
  would be _if_ lakeFSFS supported it.
* **Good semantics**: HOC commits will be lakeFS commits. The history of a
  Spark job appears right in lakeFS history. Metadata even includes some
  data lineage -- and in future we can easily add more, for instance as
  merge (lakeFS) commit user metadata.

### Implementation details

#### Hadoop >=3.1

Hadoop 3.1 offered a fairly complete overhaul of committer architecture,
configuration, and S3A support. Supporting new committers on older Hadoop
versions will be challenging. Hadoop 3.1 also seems to be the version from
which the magic output committer is recommended for use, so potentially our
users will be there or will agree to upgrade.

#### ParquetOutputCommitters

Parquet requires its committers to be
[`ParquetOutputCommitter`](https://github.com/apache/parquet-mr/blob/5608695f5777de1eb0899d9075ec9411cfdf31d3/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputCommitter.java#L37)s
(of course it does), see e.g. [Cloudera's
explanation](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/bk_cloud-data-access/content/enabling-directory-committer-spark.html).
It is not intended to be derived from, and cannot easily use another
OutputCommitter. However, there is a
[`BindingParquetOutputCommitter`](https://github.com/apache/spark/blob/08e6f633b5bc3a7d8d008db2a264b1607d269f25/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/BindingParquetOutputCommitter.scala#L37)
in Spark (not in Hadoop, not in Parquet-MR, in *Spark*) that claims to
transform the selected output committer into a ParquetOutputCommitter. This
is resolvable per the documentation but will require some work.
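To make the wiring concrete, here is a hedged sketch of the Spark
configuration such a committer might need on Hadoop >=3.1. The Spark and
Hadoop property keys shown are real; the lakeFS factory class name is
hypothetical and would be part of this proposal's implementation:

```properties
# Route writes through Spark's PathOutputCommitProtocol, which can wrap the
# configured committer for Parquet via BindingParquetOutputCommitter.
spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

# Select a committer factory per filesystem scheme (Hadoop >=3.1 mechanism).
# The factory class below is hypothetical -- it would ship with lakeFSFS.
spark.hadoop.mapreduce.outputcommitter.factory.scheme.lakefs=io.lakefs.committer.LakeFSOutputCommitterFactory
```

This mirrors how the S3A committers are selected for the `s3a` scheme, but
keyed on `lakefs` instead.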
## Future work

### More conflict resolution and save modes

Support all 4 Spark "save modes": Append, Overwrite, ErrorIfExists, and
Ignore.

### Multiple writers

We can support multiple concurrent writers and allow all writers to succeed
(keeping the results of the last writer).

We add a "merge-if" operation to lakeFS: atomically merge branch B into
branch A if branch A is at a given lakeFS commit.

Now change LakeFSOutputCommitter to loop during HOC commit:

* Attempt to merge the task branch into the source branch _if the source
  branch has not moved_.
* On failure:
  - Merge the source branch into the task branch using the "destination
    wins"[^2] strategy and delete all added files under the prefix (or add
    and use a new "merge but *never* copy from source branch" strategy).
  - Go back and attempt another merge.

This is essentially (noncooperative!) locking of output paths on top of
lakeFS, with no additional DB. We can even add cooperation by means of
various locking hints, _informing_ multiple jobs about attempts to update
the same paths while keeping things safe regardless.

[^2]: Whenever there is a conflict, we want the task branch (which will become
  the "latest writer" after a successful HOC commit) to win.

# Potential wins

* Explicit requests _not_ to write 3 times and to give better behaviour than
  the default FileOutputCommitter from Spark have appeared on our Slack
  [#data-architecture-discussion][slack-dont-write-thrice].
* Multiple users have requested "overwrite" save mode.
* Multiple users have requested multi-writer support.
* A developer notes that Spark performs many lakeFS API operations when
  writing.
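The "merge-if" retry loop described under "Multiple writers" can be sketched
against a toy in-memory model. Everything here is a hypothetical
illustration -- `Branch`, `merge_if`, and `hoc_commit` are not lakeFS or
Hadoop APIs, and lakeFS does not (yet) have a merge-if primitive:

```python
import threading

class Branch:
    """Toy in-memory stand-in for a lakeFS branch (illustration only)."""
    def __init__(self, head, objects):
        self.head = head              # commit id the branch points at
        self.objects = dict(objects)  # path -> contents

_lock = threading.Lock()
_commit_counter = [0]

def merge_if(source, dest, expected_head):
    """Hypothetical "merge-if": atomically merge `source` into `dest`
    only if `dest` is still at `expected_head`."""
    with _lock:
        if dest.head != expected_head:
            return False
        dest.objects.update(source.objects)
        _commit_counter[0] += 1
        dest.head = _commit_counter[0]
        return True

def hoc_commit(task, source, prefix):
    """The HOC-commit loop from the proposal: retry the conditional merge
    until it succeeds, rebasing in between."""
    while True:
        expected = source.head
        if merge_if(task, source, expected):
            return
        # Source branch moved: take its objects ("destination wins") except
        # under `prefix`, where this task's own writes must survive.
        for path, data in source.objects.items():
            if not path.startswith(prefix):
                task.objects[path] = data
```

A lock plus a head check stands in for the atomic merge precondition; the
point is only that the loop terminates with last-writer-wins semantics while
leaving objects outside the output prefix untouched.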
[magic]: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#The_Magic_Committer
[staging]: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#The_Staging_Committer
[lakefs-commit]: https://docs.lakefs.io/understand/object-model.html#commits
[hadoop-commit]: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/OutputCommitter.html
[slack-dont-write-thrice]: https://app.slack.com/client/T013V60QY06/C020N7X2Y0H/thread/C020N7X2Y0H-1660298516.202499