github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/design/rejected/spark-outputcommitter/committer.md

# Proposal

Add a Hadoop OutputCommitter that uses _existing_ lakeFS operations for
**atomic** commits that are efficient and safely concurrent.

## Terminology

lakeFS and Hadoop both use the term "commit", but with different meanings.
In this document:
  * A [_lakeFS commit_][lakefs-commit] is a revision entry in the lakeFS
    repository log.
  * A [_Hadoop OutputCommitter commit_][hadoop-commit] (_HOC commit_ for
    short) is the action taken by a Hadoop OutputCommitter to finish writing
    all objects of a job to their final locations.

# Why <a href="#user-content-why" id="user-content-why">#</a>

## Summary

The LakeFSOutputCommitter will be:

* **Faster.** Because it uses native lakeFS features we can avoid the long
  troubled history of Spark writes on object stores.

  * 3x fewer API calls than when using the FileOutputCommitter v1 algorithm
    (which seems the most commonly used by our users).  Many of these are
    entirely useless statObject calls.
  * 2x fewer API calls than when using the FileOutputCommitter v2 algorithm
    (which seems less used, and is often not as safe as the v1 algorithm).
  * Around the same number of API calls as we would achieve if we supported
    the new magic committer.  Many of these are entirely useless statObject
    calls.
* **Atomic.** All objects and partitions of a file appear at once, in a
  single committed state, on the target branch.  `_SUCCESS` files are still
  generated, but only as a courtesy to Airflow and similar triggers.  A
  process that examines the target branch will never see a partial state.
* **Integrated.** A full Spark write is a native lakeFS commit, a
  first-class object in the lakeFS ecosystem.  It has a commit log (useful
  for lineage and structured metadata tagging), it can be reverted and
  merged, it has clear semantics for garbage collection, etc.

## Issues with existing OutputCommitters

OutputCommitters are charged with writing a directory tree (a multi-part
Hadoop file, for example a partitioned Parquet file) atomically.

Hadoop comes with several OutputCommitter algorithms.  For S3 these try to
bridge the gap between the POSIX-ish filesystem semantics Hadoop assumes
(HDFS) and the actual semantics of an object store (S3).

Of course, any Hadoop OutputCommitter that is unrelated to lakeFS can only
ever perform HOC commits.  Any lakeFS commits on the written data will need
to be separately managed by user code or manually.
The best correct committer is probably the [magic committer][magic].  It
works using only S3 (with its current consistency guarantees!), without
auxiliary systems such as a small HDFS or S3Guard.  The magic committer
writes a directory tree by starting a multipart upload to each object's
intended location, uploading the data without completing the upload, and
recording a completion record for that multipart upload on S3.  To finish
the HOC commit, all completion records are processed and the objects
magically appear.

The [staging committer][staging] (aka the "Netflix committer") is an older
committer for S3A.  It requires a "staging" filesystem on HDFS and offers
manual modes for overwriting outputs ("conflict resolution").

These issues remain with the magic and staging committers:

* Not atomic.  Multipart objects appear sequentially.  Finding a `_SUCCESS`
  indicator object is still required before a directory tree can be
  processed.
* The magic committer requires "magic committer" support in the FileSystem,
  in order to write `__magic` objects to a _different_ path.  (Current
  lakeFSFS does not contain this support.)
* The magic committer is somewhat new:
  > It’s also not been field tested to the extent of Netflix’s committer;
  > consider it the least mature of the committers.
* Documentation is still somewhat lacking.

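For intuition, the `__magic` path relocation mentioned above follows a simple convention in S3A: everything between `__magic` and `__base` is dropped when computing an object's final destination.  A sketch of that mapping (assuming a `__base` marker is present; the real logic lives inside S3A and handles more cases):

```python
# Sketch of the S3A magic-committer path convention.  Writes go under a
# `__magic` subtree; the final destination is obtained by dropping the
# path segments between `__magic` and `__base`.  Illustrative only.
def final_destination(path: str) -> str:
    parts = path.split("/")
    magic = parts.index("__magic")
    base = parts.index("__base")
    # Keep everything before `__magic`, plus the layout under `__base`.
    return "/".join(parts[:magic] + parts[base + 1:])
```

A committer-aware FileSystem must route writes under such paths without materializing them at the literal `__magic` location, which is exactly the support lakeFSFS currently lacks.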
Versions 1 and 2 of the FileOutputCommitter are not as good.  They
_rename_ files, which on S3A works by _copying and deleting_.  This is both
slow and highly non-atomic, as well as making it difficult to recover from
failed attempts.  The current recommendations are to use the partitioned
staging committer for overwriting or updating partitioned data trees, and
the magic committer in most other cases.

## LakeFSOutputCommitter

We propose to leverage the atomic capabilities of lakeFS to write a specific
OutputCommitter for lakeFSFS.  In the initial version, it will branch out,
prepare the desired output, and merge back in as part of the HOC commit.[^1]
Aborting will be done by dropping the branch (or repurposing it if the same
job ID is requested again).  Regular lakeFS retention can handle dropping
data objects; we might want to add file patterns to retention to allow
temporary objects to be dropped rapidly.

[^1]: A merge is a type of commit, of course, so in this model HOC commits
    *are* lakeFS commits!

# How

## Conflict modes

Spark supports multiple "save modes": Append, Overwrite, ErrorIfExists, and
Ignore.  These impact conflict resolution.  We will initially support just
the Overwrite mode: the entire previous table is deleted on write.

## Sample flow

* User configures lakeFSFS and configures LakeFSOutputCommitter as the
  OutputCommitter for `lakefs` protocol Paths.  **Possibly** lakeFSFS will
  set up this OutputCommitter by default.

  The following steps are driven by Spark/Hadoop during output, and
  correspond to the Hadoop OutputCommitter protocol.
* **Setup (job/task TBD)**: Create a new branch for this job/task.  Its name
  is predictable from the job ID and/or task ID, so it can easily be found
  again.  In "overwrite" mode, immediately delete the entire subtree of the
  intended output path, and possibly delete branches of previous tasks.
  This handles cases where the names of the objects written by the output
  format change, for instance because of repartitioning to a smaller (or
  different) number of partitions or because of nondeterminism in the names
  of the objects.
* **Write objects**: Everything is written to its correct path on the branch.
* **[HOC] Commit**: Merge back to the original branch.

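This flow can be sketched end to end against a toy in-memory stand-in for lakeFS.  All names here are hypothetical; a real implementation would subclass Hadoop's `OutputCommitter` and drive lakeFS through its API and lakeFSFS:

```python
# Toy model: a "branch" is just a dict of {path: data}.  Illustrative only.
branches = {"main": {"tables/out/part-0": "old"}}

def setup_job(job_id, output_prefix):
    # Branch out; the branch name is predictable from the job ID.
    task_branch = "job-%s" % job_id
    branches[task_branch] = dict(branches["main"])
    # "Overwrite" mode: immediately drop the entire previous output subtree.
    for path in [p for p in branches[task_branch] if p.startswith(output_prefix)]:
        del branches[task_branch][path]
    return task_branch

def write_object(branch, path, data):
    # Objects are written directly to their final paths -- on the task branch.
    branches[branch][path] = data

def commit_job(task_branch):
    # HOC commit == lakeFS merge back: all objects appear on main at once.
    # (A real merge detects conflicts; see "Properties" below.)
    branches["main"] = dict(branches[task_branch])
    del branches[task_branch]

def abort_job(task_branch):
    # Abort == drop the branch; retention cleans up the data objects.
    branches.pop(task_branch, None)
```

Running `setup_job`, a few `write_object` calls, and `commit_job` leaves `main` holding only the new objects; until `commit_job` runs, readers of `main` never observe a partial state.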
## Properties

* The merge is performed by lakeFS so it is **atomic**.
* In the **single-writer case** the merge succeeds: no other operations
  occur on the subtree.
* **In-place updates** work: the old objects are deleted and replaced by new
  objects.  This is true regardless of partitioning etc.
* **Multiple writers** are detected and the first to HOC-commit succeeds.
  But all the others fail: they deleted the same previous files, or created
  a conflicting file (at least their `_SUCCESS` indicator).  So their lakeFS
  merge fails due to a conflict with the first (successful) merge.
* (Conflicting) **non-OutputCommitter writes are detected** and clearly
  handled.  As long as other writes create _one_ object with an overlapping
  name the merge will fail.  So LakeFSOutputCommitter can achieve its
  correct semantics regardless of the other writers used.
* **Clearly correct by construction**: Rather than rely on single atomic
  operations and carefully tailoring operations to Spark retry mechanisms,
  we use lakeFS capabilities and guarantees.  Analyzing correctness becomes
  simpler.
* **Fast**: No data copies, only the required metadata operations.  Cost of
  the lakeFS commit is linear in the number of objects it touches (and it is
  a fast operation even when adding many thousands of objects).  Total time
  to write is close to 3× faster than the existing FileOutputCommitter in v1
  mode, close to 2× faster than the existing FileOutputCommitter in v2 mode
  (which is unsafe in various cases), and about as fast as the magic
  OutputCommitter _if_ lakeFSFS supported it.
* **Good semantics**: HOC commits will be lakeFS commits.  The history of a
  Spark job appears right in lakeFS history.  Metadata even includes some
  data lineage -- and in future we can easily add more, for instance as
  merge (lakeFS) commit user metadata.

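The multiple-writer and non-OutputCommitter bullets hinge on three-way merge conflict detection.  A minimal sketch of that rule (not lakeFS's actual merge implementation) shows why the second HOC commit must fail:

```python
# Toy three-way merge: a path conflicts when both sides changed it, relative
# to the common base, to different values.  Both writers re-create the same
# parts and `_SUCCESS`, so whoever merges second hits a conflict.
def merge(base, source, dest):
    """Merge `source` into `dest`, where both were derived from `base`."""
    merged = dict(dest)
    for path in set(base) | set(source) | set(dest):
        src_changed = source.get(path) != base.get(path)
        dst_changed = dest.get(path) != base.get(path)
        if src_changed and dst_changed and source.get(path) != dest.get(path):
            raise RuntimeError("conflict: " + path)
        if src_changed:
            if path in source:
                merged[path] = source[path]
            else:
                merged.pop(path, None)   # deletion on the source branch
    return merged
```

The same rule covers foreign writers: any single overlapping object written outside the committer is enough to make the merge fail rather than silently interleave outputs.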
### Implementation details

#### Hadoop >=3.1

Hadoop 3.1 offered a fairly complete overhaul of committer architecture,
configuration, and S3A support.  Supporting new committers on older Hadoops
will be challenging.  It also seems to be the version where the magic output
committer is recommended for use, so potentially our users will be there or
will agree to upgrade.

#### ParquetOutputCommitters

Parquet requires its committers to be
[`ParquetOutputCommitter`](https://github.com/apache/parquet-mr/blob/5608695f5777de1eb0899d9075ec9411cfdf31d3/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputCommitter.java#L37)s
(of course it does), see e.g. [Cloudera's
explanation](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/bk_cloud-data-access/content/enabling-directory-committer-spark.html).
It's not intended to be derived from, and cannot easily use another
OutputCommitter.  However there's a
[`BindingParquetOutputCommitter`](https://github.com/apache/spark/blob/08e6f633b5bc3a7d8d008db2a264b1607d269f25/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/BindingParquetOutputCommitter.scala#L37)
in Spark (not in Hadoop, not in Parquet-MR, in *Spark*) that claims to
transform the selected output committer into a ParquetOutputCommitter.  This
is resolvable per the documentation but will require some work.

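For reference, wiring a custom committer through Spark's Parquet binding typically looks like the following Spark configuration.  The first two keys are Spark's real hadoop-cloud settings; the factory class and its binding to the `lakefs://` scheme are hypothetical, mirroring how S3A registers its committer factory:

```
spark.sql.sources.commitProtocolClass      org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class   org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
# Hypothetical: bind a committer factory to the lakefs:// scheme
spark.hadoop.mapreduce.outputcommitter.factory.scheme.lakefs   io.lakefs.committer.LakeFSOutputCommitterFactory
```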
## Future work

### More conflict resolution and save modes

Support all 4 Spark "save modes": Append, Overwrite, ErrorIfExists, and
Ignore.

### Multiple writers

We can support multiple concurrent writers and allow all writers to succeed
(keeping the results of the last writer).

We add a "merge-if" operation to lakeFS: atomically merge branch B into
branch A if branch A is at a given lakeFS commit.

Now change LakeFSOutputCommitter to loop during HOC commit:

* Attempt to merge the task branch into the source branch _if the source
  branch has not moved_.
* On failure:
  - Merge the source branch into the task branch using the "destination
    wins"[^2] strategy and delete all added files under the prefix (or add
    and use a new "merge but *never* copy from source branch" strategy).
  - Go back and attempt another merge.

This is essentially (noncooperative!) locking of output paths on top of
lakeFS, with no additional DB.  We can even add cooperation by means of
various locking hints, _informing_ multiple jobs about attempting to update
the same paths but keeping things safe regardless.

[^2]: Whenever there is a conflict, we want the task branch (which will
    become the "latest writer" after a successful HOC commit) to win.

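The commit loop can be sketched over a toy lakeFS whose branch heads are plain integers.  `merge_if` here is the hypothetical compare-and-merge primitive proposed above; the "destination wins" rebase step is elided:

```python
import threading

class PreconditionFailed(Exception):
    pass

class ToyLakeFS:
    # Branch heads are integers standing in for lakeFS commit IDs.
    def __init__(self):
        self.head_of = {"main": 0}
        self._lock = threading.Lock()

    def head(self, branch):
        return self.head_of[branch]

    def merge_if(self, source, dest, if_commit):
        # Hypothetical "merge-if": atomically merge only if `dest` is still
        # at `if_commit`; the merge itself is a new commit on `dest`.
        with self._lock:
            if self.head_of[dest] != if_commit:
                raise PreconditionFailed(dest)
            self.head_of[dest] += 1

def hoc_commit(fs, task_branch, source_branch):
    while True:
        expected = fs.head(source_branch)
        try:
            fs.merge_if(task_branch, source_branch, if_commit=expected)
            return
        except PreconditionFailed:
            # The source branch moved: rebase the task branch onto it with
            # "destination wins" (elided in this toy model), then retry.
            continue
```

Because `merge_if` is atomic on the server side, the loop needs no external lock or database: each writer either lands its merge or observes that the head moved and retries.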
# Potential wins

* Explicit requests have appeared on our Slack
  [#data-architecture-discussion][slack-dont-write-thrice] _not_ to write
  data 3 times and to give better behaviour than Spark's default
  FileOutputCommitter.
* Multiple users have requested "overwrite" save mode.
* Multiple users have requested multi-writer support.
* A developer noted that Spark performs many lakeFS API operations when
  writing.

[magic]:  https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#The_Magic_Committer
[staging]:  https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#The_Staging_Committer
[lakefs-commit]:  https://docs.lakefs.io/understand/object-model.html#commits
[hadoop-commit]:  https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/OutputCommitter.html
[slack-dont-write-thrice]:  https://app.slack.com/client/T013V60QY06/C020N7X2Y0H/thread/C020N7X2Y0H-1660298516.202499