# LakeFS Output Committer - Execution Plan

[LakeFS Output Committer Proposal](https://github.com/treeverse/lakeFS/blob/master/design/open/spark-outputcommitter/committer.md)

## Overview

Implementing the lakeFS Output Committer means extending the class [FileOutputCommitter](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.html) and using the lakeFS client to create and merge branches as part of job and task setup and commit (see the [sample flow](https://github.com/treeverse/lakeFS/blob/master/design/open/spark-outputcommitter/committer.md#sample-flow)).

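The branch lifecycle described in the overview can be sketched with an in-memory model. This is a minimal illustration, not the proposed implementation: `FakeLakeFS`, `SketchCommitter`, the branch naming scheme, and the simplified merge are all assumptions made for the example, and the real committer would extend Hadoop's `FileOutputCommitter` and call the actual lakeFS client.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLakeFS:
    """In-memory stand-in for a lakeFS client (hypothetical, not the real API)."""
    branches: dict = field(default_factory=lambda: {"main": {}})

    def create_branch(self, name, source):
        # A new branch starts as a snapshot of its source branch.
        self.branches[name] = dict(self.branches[source])

    def put_object(self, branch, path, data):
        self.branches[branch][path] = data

    def merge(self, source, destination):
        # Simplified merge: source objects win, no conflict detection.
        self.branches[destination].update(self.branches[source])

    def delete_branch(self, name):
        self.branches.pop(name, None)


class SketchCommitter:
    """Mirrors the setup/commit/abort hooks of an output committer."""

    def __init__(self, client, output_branch, job_id):
        self.client = client
        self.output_branch = output_branch
        self.job_branch = f"job-{job_id}"  # hypothetical naming scheme

    def task_branch(self, task_id):
        return f"{self.job_branch}-task-{task_id}"

    def setup_job(self):
        self.client.create_branch(self.job_branch, self.output_branch)

    def setup_task(self, task_id):
        self.client.create_branch(self.task_branch(task_id), self.job_branch)

    def commit_task(self, task_id):
        # A successful task publishes its files to the job branch.
        self.client.merge(self.task_branch(task_id), self.job_branch)
        self.client.delete_branch(self.task_branch(task_id))

    def abort_task(self, task_id):
        # A failed task is discarded without touching the job branch.
        self.client.delete_branch(self.task_branch(task_id))

    def commit_job(self):
        # The job's output becomes visible on the output branch in one merge.
        self.client.merge(self.job_branch, self.output_branch)
        self.client.delete_branch(self.job_branch)

    def abort_job(self):
        prefix = f"{self.job_branch}-task-"
        doomed = [b for b in self.client.branches
                  if b == self.job_branch or b.startswith(prefix)]
        for name in doomed:
            self.client.delete_branch(name)


def run_demo():
    client = FakeLakeFS()
    committer = SketchCommitter(client, "main", "42")
    committer.setup_job()
    committer.setup_task(0)
    client.put_object(committer.task_branch(0), "out/part-0", b"a")
    committer.commit_task(0)
    committer.setup_task(1)
    client.put_object(committer.task_branch(1), "out/part-1", b"b")
    committer.abort_task(1)  # this task's output never reaches the job branch
    committer.commit_job()
    return client
```

In this model, `run_demo()` leaves only the `main` branch, holding the file written by the committed task. The aborted task's file is dropped together with its branch.
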
### Milestone 1

The first milestone focuses on writing a file with a single write mode (probably append) and one file format (probably text) ([#4311](https://github.com/treeverse/lakeFS/issues/4311)).
At this stage we'll implement the `commit`, `setup`, and `abort` class methods.
As part of this milestone we'll also add tests (the testing approach is still to be determined) for both the file format and the write mode, according to our Hadoop support matrix.

### Milestone 2

This milestone is focused on:
- writing a file in overwrite mode + testing ([#4517](https://github.com/treeverse/lakeFS/issues/4517), [#4528](https://github.com/treeverse/lakeFS/issues/4528))
- writing a file in parquet file format + testing ([#4518](https://github.com/treeverse/lakeFS/issues/4518), [#4527](https://github.com/treeverse/lakeFS/issues/4527))

These should be two separate tasks.

### Milestone 3

This milestone is focused on:
- writing all file formats, in all modes ([#4508](https://github.com/treeverse/lakeFS/issues/4508), [#4509](https://github.com/treeverse/lakeFS/issues/4509))
- implementing the relevant remaining class methods ([#4507](https://github.com/treeverse/lakeFS/issues/4507))

The relevant file formats to add are:
* CSV
* ORC
* JSON

The relevant modes to add are:
* ErrorIfExists (default)
* Ignore

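Together with append and overwrite from the earlier milestones, these make up the four Spark save modes. As a reminder of what each one does when the output already exists, here is a small sketch. The `plan_write` helper and its action names are hypothetical and only model the decision; they are not part of Spark or of the committer.

```python
from enum import Enum


class SaveMode(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "errorifexists"
    IGNORE = "ignore"


def plan_write(mode, output_exists):
    """Return the action a writer would take: 'write', 'replace', or 'skip'."""
    if not output_exists:
        return "write"        # every mode writes when nothing is there yet
    if mode is SaveMode.APPEND:
        return "write"        # add new files next to the existing ones
    if mode is SaveMode.OVERWRITE:
        return "replace"      # existing output is removed first
    if mode is SaveMode.IGNORE:
        return "skip"         # leave existing output untouched, write nothing
    # ErrorIfExists: refuse to write over existing output.
    raise FileExistsError("output path already exists")
```

For example, `plan_write(SaveMode.IGNORE, True)` returns `"skip"`, while `plan_write(SaveMode.ERROR_IF_EXISTS, True)` raises.
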
### Milestone 4

This milestone is focused on multiwriter support, first for overwrite save mode ([#4510](https://github.com/treeverse/lakeFS/issues/4510)) and then for other save modes ([#4511](https://github.com/treeverse/lakeFS/issues/4511)).

### Other

The milestone for each of the following tasks is still to be determined:
* Easier configuration: use the lakeFS Output Committer by default in lakeFSFS
* Improve configuration for Spark 3: better configuration options exist in Spark 3 (default output committer for a file system).
* Enhance metadata

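To make the configuration tasks more concrete, the sketch below builds the settings a user might pass to Spark. It assumes the Spark 3 option mentioned above is Hadoop's per-scheme committer factory (`mapreduce.outputcommitter.factory.scheme.<scheme>`), which is a guess at what the bullet refers to. The `committer_conf` helper and the class names in the usage comment are hypothetical placeholders. Each resulting key/value pair could be passed to `SparkSession.builder.config(key, value)`.

```python
def committer_conf(committer_class, factory_class=None, scheme="lakefs"):
    """Build Spark settings that select an output committer.

    committer_class: committer class name, set explicitly per session.
    factory_class: optional committer factory bound to a file system scheme
                   (assumed to be the "default OC for FS" option).
    """
    conf = {"spark.sql.sources.outputCommitterClass": committer_class}
    if factory_class is not None:
        key = f"spark.hadoop.mapreduce.outputcommitter.factory.scheme.{scheme}"
        conf[key] = factory_class
    return conf


# Hypothetical class names, for illustration only:
# committer_conf("io.example.LakeFSOutputCommitter",
#                factory_class="io.example.LakeFSOutputCommitterFactory")
```
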
![Execution Plan](diagrams/lakeFS-OC-execution-plan.png)