# LakeFS Output Committer - Execution Plan

[LakeFS Output Committer Proposal](https://github.com/treeverse/lakeFS/blob/master/design/open/spark-outputcommitter/committer.md)

## Overview

Implementing the lakeFS Output Committer means implementing the class [FileOutputCommitter](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.html) and using the lakeFS client to create and merge branches as part of job and task setup and commit (this is the [sample flow](https://github.com/treeverse/lakeFS/blob/master/design/open/spark-outputcommitter/committer.md#sample-flow)).

### Milestone 1

The first milestone focuses on writing a file in a single file format (probably text) and one save mode (probably append) ([#4311](https://github.com/treeverse/lakeFS/issues/4311)).
At this stage we'll implement the `commit`, `setup`, and `abort` class methods.
As part of this milestone we'll also add tests (the approach is still to be determined) for both the file format and the save mode, according to our Hadoop support matrix.

### Milestone 2

This milestone is focused on:
- writing a file in overwrite mode + testing ([#4517](https://github.com/treeverse/lakeFS/issues/4517), [#4528](https://github.com/treeverse/lakeFS/issues/4528))
- writing a file in parquet file format + testing ([#4518](https://github.com/treeverse/lakeFS/issues/4518), [#4527](https://github.com/treeverse/lakeFS/issues/4527))

These should be two separate tasks.

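To make the `setup`, `commit`, and `abort` methods from Milestone 1 more concrete, here is a minimal Python sketch of a branch-based commit flow. It is a toy model only: `InMemoryLakeFS` and `SketchCommitter` are hypothetical names invented for illustration, not the real lakeFS client or the Hadoop `FileOutputCommitter` API, and the branch-per-job, branch-per-task layout is an assumption; the linked proposal defines the actual flow.

```python
class InMemoryLakeFS:
    """Hypothetical stand-in for a lakeFS client (NOT the real lakeFS API)."""

    def __init__(self):
        # Each branch maps object paths to their contents.
        self.branches = {"main": {}}

    def create_branch(self, name, source):
        # A new branch starts as a copy of its source branch.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch, path, data):
        self.branches[branch][path] = data

    def merge(self, source, destination):
        # Simplified merge: source objects win, no conflict or delete handling.
        self.branches[destination].update(self.branches[source])

    def delete_branch(self, name):
        self.branches.pop(name, None)


class SketchCommitter:
    """Assumed flow: one branch per job, one branch per task."""

    def __init__(self, client, output_branch, job_id):
        self.client = client
        self.output_branch = output_branch
        self.job_branch = f"job-{job_id}"

    def task_branch(self, task_id):
        return f"{self.job_branch}-task-{task_id}"

    def setup_job(self):
        self.client.create_branch(self.job_branch, self.output_branch)

    def setup_task(self, task_id):
        self.client.create_branch(self.task_branch(task_id), self.job_branch)

    def commit_task(self, task_id):
        # A committed task becomes visible on the job branch only.
        self.client.merge(self.task_branch(task_id), self.job_branch)
        self.client.delete_branch(self.task_branch(task_id))

    def abort_task(self, task_id):
        # Aborting discards the task branch and everything written to it.
        self.client.delete_branch(self.task_branch(task_id))

    def commit_job(self):
        # Output reaches the target branch only when the whole job commits.
        self.client.merge(self.job_branch, self.output_branch)
        self.client.delete_branch(self.job_branch)

    def abort_job(self):
        self.client.delete_branch(self.job_branch)
```

In this model, nothing written by a task is visible on the output branch until `commit_job` runs, and an aborted task leaves no trace. That isolation is the property the real committer would get from lakeFS branches.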

### Milestone 3

This milestone is focused on:
- writing all file formats, in all modes ([#4508](https://github.com/treeverse/lakeFS/issues/4508), [#4509](https://github.com/treeverse/lakeFS/issues/4509))
- implementing the remaining relevant class methods ([#4507](https://github.com/treeverse/lakeFS/issues/4507))

The relevant file formats to add are:
* CSV
* ORC
* JSON

The relevant modes to add are:
* ErrorIfExists (default)
* Ignore

### Milestone 4

This milestone is focused on multiwriter support, first for overwrite save mode ([#4510](https://github.com/treeverse/lakeFS/issues/4510)) and then for other save modes ([#4511](https://github.com/treeverse/lakeFS/issues/4511)).

### Other

Several tasks still need to be assigned to a milestone:
* Easier configuration: use the lakeFS Output Committer by default in lakeFSFS
* Improve configuration for Spark 3: better configuration options exist in Spark 3 (default OC for FS).
* Enhance metadata
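
The milestones above are split largely by save mode (append, overwrite, ErrorIfExists, Ignore), so it may help to see how the modes differ in their effect on existing output. The following is a rough pure-Python model, not Spark or committer code: `apply_save_mode` is a hypothetical helper, and treating the output as a dictionary of paths (with `None` meaning the output location does not exist yet) is a simplifying assumption.

```python
from enum import Enum


class SaveMode(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "errorifexists"
    IGNORE = "ignore"


def apply_save_mode(existing, new_files, mode):
    """Return the output listing after a write under `mode`.

    `existing` is a dict of path -> data, or None when the output
    location does not exist yet (an assumption of this toy model).
    """
    if mode is SaveMode.ERROR_IF_EXISTS:
        # Fail rather than touch output that is already there.
        if existing is not None:
            raise FileExistsError("output location already exists")
        return dict(new_files)
    if mode is SaveMode.IGNORE:
        # Silently skip the write when output is already there.
        return dict(new_files) if existing is None else dict(existing)
    if mode is SaveMode.OVERWRITE:
        # Replace whatever was there with the new files.
        return dict(new_files)
    if mode is SaveMode.APPEND:
        # Keep the old files and add the new ones alongside them.
        merged = dict(existing or {})
        merged.update(new_files)
        return merged
    raise ValueError(f"unknown save mode: {mode}")
```

Seen this way, append only ever adds objects, while overwrite must also remove the previous ones, which is one plausible reason the plan treats the two modes as separate tasks.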