# Technical conclusions from the LakeFSOutputCommitter attempt

## Planning

### Poor performance planning

#### Merges

The original design was based on a single branch per task.  This allows for
very fast writing (all writes are independent), but merging 4000 times turns
out to be a slow operation.

Earlier communication with the VE team would have let us avoid this.

**Conclusion:** Merge operations can be slow.  We saw 90th-percentile times
of multiple seconds on *independent* branch merges.  FindMergeBase was very
slow for moderately far merge bases: seconds to search a few thousand
commits.

**Possible mitigations:**

All except the first might need issues opened; **communicate with VE**.

* In the future, avoid using multiple merges for atomicity :-)
* The commit cache is too small and expires too quickly.  Increase its size
  and expiry.  (#4637)
* FindMergeBase performs batched operations in a loop.  Every BFS iteration
  could accumulate all its commit gets into a single batch and dispatch it
  (see the sketch after this list).  At the very least, don't batch these
  operations.  (#2981 is old but still relevant, perhaps even more so!)
* Consider allowing multi-way merges.
* Increase Pyramid cache sizes.
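
A minimal sketch of the batching idea from the FindMergeBase bullet above:
each BFS iteration collects all the commit IDs it needs and fetches them in
a single batched call.  The names here (`Commit`, `CommitGetter`,
`GetCommitBatch`) are hypothetical and do not reflect the actual graveler
API.

```go
// Sketch only: a batched frontier expansion for a merge-base BFS.
package mergebase

import "context"

type Commit struct {
	ID      string
	Parents []string
}

// CommitGetter fetches many commits in one round trip, e.g. a single
// batched read against the KV store.
type CommitGetter interface {
	GetCommitBatch(ctx context.Context, ids []string) (map[string]Commit, error)
}

// ExpandFrontier replaces every commit in the BFS frontier with its unseen
// parents, fetching all of them in one batched call instead of issuing one
// get per commit per iteration.
func ExpandFrontier(ctx context.Context, g CommitGetter, frontier []Commit, seen map[string]bool) ([]Commit, error) {
	var want []string
	for _, c := range frontier {
		for _, p := range c.Parents {
			if !seen[p] {
				seen[p] = true
				want = append(want, p)
			}
		}
	}
	if len(want) == 0 {
		return nil, nil
	}
	commits, err := g.GetCommitBatch(ctx, want)
	if err != nil {
		return nil, err
	}
	next := make([]Commit, 0, len(want))
	for _, id := range want {
		if c, ok := commits[id]; ok {
			next = append(next, c)
		}
	}
	return next, nil
}
```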

#### Limited support for relinking objects

We switched to _moving_ objects from per-task branches to a job branch.  It
avoids multiple merges, and is safe because per-task objects have different
names.  In practice the API does not allow moving objects, so this was
really relinking unstaged objects across branches.
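
For reference, a minimal sketch of what this relinking looked like, assuming
a client that can stage an entry pointing at an existing physical address.
The `Stager` interface and its methods are hypothetical; the real API
surface for this was removed, as described below.

```go
// Sketch only: "moving" an uncommitted object between branches by staging a
// new entry on the destination branch that points at the same physical
// address.  Only metadata moves; the data stays in place.
package relink

import "context"

type ObjectStat struct {
	Path            string
	PhysicalAddress string
	Checksum        string
	SizeBytes       int64
}

type Stager interface {
	// StatObject returns metadata for an uncommitted object on a branch.
	StatObject(ctx context.Context, branch, path string) (ObjectStat, error)
	// StageObject records an entry on a branch that points at an existing
	// physical address, without copying any data.
	StageObject(ctx context.Context, branch string, stat ObjectStat) error
	// DeleteObject removes the uncommitted entry from a branch.
	DeleteObject(ctx context.Context, branch, path string) error
}

// Relink makes every listed object appear on the job branch and removes it
// from the task branch.
func Relink(ctx context.Context, s Stager, taskBranch, jobBranch string, paths []string) error {
	for _, p := range paths {
		stat, err := s.StatObject(ctx, taskBranch, p)
		if err != nil {
			return err
		}
		if err := s.StageObject(ctx, jobBranch, stat); err != nil {
			return err
		}
		if err := s.DeleteObject(ctx, taskBranch, p); err != nil {
			return err
		}
	}
	return nil
}
```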

Support for this was going away as part of incremental GC, concurrently with
our implementation.  Luckily we immediately caught it.

**Conclusion:** We cannot reduce contention on a single branch by spreading
work across multiple branches like this. :shrug:

#### Single-branch write performance on lakeFSFS

Switching to writing everything to a _single_ branch exposed the poor
performance of lakeFS as exercised by lakeFSFS.  CloudWatch Insights shows
that one key takes a huge amount of load -- the branch key.  DynamoDB cannot
partition a single key, so it is necessarily throttled regardless of
purchased quota.

**Conclusion:** lakeFSFS performance is poor on lakeFS with DynamoDB as its
KV store.  Work on improving lakeFS performance with DynamoDB (some of this
work is proposed in #4769, #4734, #4712), and also on improving lakeFSFS on
lakeFS (some of this work is proposed in #4722, #4721, #4676).
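
A simplified model of why the branch key gets hot, assuming (as described
above) that every uncommitted write must consult the branch record before
staging its entry.  The `KV` interface and the key layout are illustrative
only, not the real lakeFS KV schema.

```go
// Sketch only: every staged write first reads the same branch record, so a
// single KV key absorbs the load of all 4000 concurrent tasks.  DynamoDB
// limits throughput per partition key, so purchased table capacity does not
// help.
package hotkey

import "context"

type KV interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error
}

// StageWrite models one uncommitted object write.
func StageWrite(ctx context.Context, kv KV, repo, branch, path string, entry []byte) error {
	// Hot key: every writer hits the same branch record.
	stagingToken, err := kv.Get(ctx, "branch/"+repo+"/"+branch)
	if err != nil {
		return err
	}
	return kv.Set(ctx, "staging/"+string(stagingToken)+"/"+path, entry)
}
```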

### Poor communication with other teams

We assumed multiple non-functional requirements and one functional
requirement that did not actually exist.  We should have specified these
requirements as part of the design and communicated them to the VE team.

### Failed to test whether the project can increase Spark write performance on lakeFS

As proposed, the main goal of the project was to increase write performance
on lakeFS.  However, our plan did not include an early enough test for this,
and in fact we did not measure it early enough.

We did have some reasons for this:

* We estimated system performance by counting API calls.  However, our
  estimates of API latencies proved incorrect.
* Whole-system performance is hard to measure before the system is
  finished, and we wanted to advance to that stage.
* We considered functional requirements (actually being able to implement a
  working OutputCommitter) to be at higher risk than non-functional
  requirements (performance and compatibility).

In retrospect these reasons are clearly wrong.  The requirements should
have set clear performance targets.

## Shared vision

### No expected performance of lakeFSFS

We spec'ed lakeFSFS to be a few percent slower than S3A on S3 with "magic",
but we have no typical load to use for measuring.

As a result we cannot design efficient flows that use lakeFSFS, or determine
whether the performance we get in some setup is adequate.

We lack an understanding of what performance to expect from lakeFSFS.  This
should include expected performance and the affecting variables for each of
these (a measurement sketch follows the list):

* List "directory", recursively and non-recursively.
* Create file.
* Open file.
* Check file existence.
* Delete file or "directory", recursively and non-recursively.
* Write a 1000-partition Parquet file (for instance).
* Read from all objects of a 1000-partition Parquet file (for instance).  A
  good refinement is to read _all_ fields and also to read _just one_ field
  from the file.
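
One way to start answering this, sketched below: a small, client-agnostic
harness that runs any operation under test repeatedly and reports latency
percentiles.  Everything here is illustrative; the operations themselves
would be closures wrapping whichever lakeFSFS or lakeFS client calls we want
to characterize.

```go
// Sketch only: a tiny latency harness.  The operation under test is passed
// in as a closure, so it can wrap any lakeFSFS or lakeFS call.
package bench

import (
	"context"
	"fmt"
	"sort"
	"time"
)

// Measure runs op n times and prints p50/p90/p99 latencies.
func Measure(ctx context.Context, name string, n int, op func(context.Context) error) error {
	if n <= 0 {
		return nil
	}
	durations := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		if err := op(ctx); err != nil {
			return fmt.Errorf("%s run %d: %w", name, i, err)
		}
		durations = append(durations, time.Since(start))
	}
	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
	pct := func(p float64) time.Duration { return durations[int(p*float64(len(durations)-1))] }
	fmt.Printf("%s: p50=%v p90=%v p99=%v\n", name, pct(0.50), pct(0.90), pct(0.99))
	return nil
}
```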

### No expected performance of lakeFS

We spec'ed lakeFS to be a few percent slower than S3, but we have no typical
load that we intend to use for measuring.

As a result we cannot design efficient flows that use lakeFS, or state that
the performance we get in some setup is adequate.

We lack an understanding of what performance to expect from lakeFS.  This
should include expected performance and the affecting variables for each of
these (a contention-measurement sketch follows the list):

* Commits.  (Possible affecting variables: # of modified objects, # of
  ranges)
* Merges.  (Possible affecting variables: # of commits to scan, # of ranges
  modified, # of contending and non-contending concurrent merges)
* Reads and lists.  (Possible affecting variables: # of concurrent reads on
  the same branch, # of concurrent reads, expected size of returned listing)
* Writes.  (Possible affecting variables: # of concurrent mutations on the
  same branch, # of concurrent mutations)
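
For the merge variables in particular, a sketch of how contention could be
measured: N workers merge N source branches into the same target
concurrently and report per-merge latency.  The `Merger` interface is
hypothetical and stands in for whichever client is used.

```go
// Sketch only: measuring merge latency under contention.  All merges target
// the same branch, so they race to update it.
package contention

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type Merger interface {
	Merge(ctx context.Context, source, target string) error
}

// MergeAll merges every source branch into target concurrently and prints
// per-merge latency, which includes time lost to retries on contended
// updates of the target branch.
func MergeAll(ctx context.Context, m Merger, sources []string, target string) {
	var wg sync.WaitGroup
	for _, src := range sources {
		wg.Add(1)
		go func(src string) {
			defer wg.Done()
			start := time.Now()
			err := m.Merge(ctx, src, target)
			fmt.Printf("merge %s -> %s: %v err=%v\n", src, target, time.Since(start), err)
		}(src)
	}
	wg.Wait()
}
```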

### No best practices

In order to support single-writer and multi-writer configurations we should
decide on best practices for how to perform several basic tasks.  These may
well depend on the results of the above performance work.  Currently nothing
is blocked on this, giving us some time to do it correctly.  However some
things don't currently work, and one reasonable outcome is to remove support
for them.

* How to cause multiple files to appear atomically (one candidate pattern is
  sketched after this list)
  * Commits?
  * Merges?
  * Some new mechanism, perhaps multiple staging areas?
* How to abort work on failure after emitting some objects.
* Recommended synchronization mechanisms in lakeFS
  * Merges?
  * Return status of delete object?  (Currently broken, #4798)
  * Return status of Set IfAbsent object?  (Currently broken, #4796)
  * Some new mechanism, perhaps a KV API?
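
As one candidate answer for the first item, here is a sketch of the
temporary-branch pattern: write everything to a temporary branch, commit,
and merge, so readers of the target branch see all of the files or none of
them.  The `Client` interface is hypothetical.

```go
// Sketch only: making a set of files appear atomically by writing them to a
// temporary branch, committing, and merging.
package publish

import "context"

type Client interface {
	CreateBranch(ctx context.Context, name, source string) error
	Upload(ctx context.Context, branch, path string, data []byte) error
	Commit(ctx context.Context, branch, message string) error
	Merge(ctx context.Context, source, target string) error
	DeleteBranch(ctx context.Context, name string) error
}

// PublishAtomically uploads all files to a temporary branch and merges them
// into target in one merge.  The temporary branch also gives a natural abort
// path: on failure, simply delete it.
func PublishAtomically(ctx context.Context, c Client, target, tmp string, files map[string][]byte) error {
	if err := c.CreateBranch(ctx, tmp, target); err != nil {
		return err
	}
	defer func() { _ = c.DeleteBranch(ctx, tmp) }() // best-effort cleanup / abort
	for path, data := range files {
		if err := c.Upload(ctx, tmp, path, data); err != nil {
			return err
		}
	}
	if err := c.Commit(ctx, tmp, "publish to "+target); err != nil {
		return err
	}
	return c.Merge(ctx, tmp, target)
}
```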

## Background: stages and experiments performed

In all measurements we wrote a small CSV across 4000 partitions, measuring
OutputCommitter performance only.

### 1. Per-task branches, merged during taskCommit

The initial implementation had one branch for working on the job and one
branch per task.  Each successful task was merged into the job branch.  At
the end, the job branch was merged into the output branch.
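
A minimal sketch of how this first design mapped the committer hooks to
branch operations; the `Client` interface, branch naming, and error handling
are simplified assumptions, not the actual implementation.

```go
// Sketch only: the first design's mapping of OutputCommitter hooks to
// branch operations.
package committer

import (
	"context"
	"fmt"
)

type Client interface {
	CreateBranch(ctx context.Context, name, source string) error
	Merge(ctx context.Context, source, target string) error
}

type Committer struct {
	Client       Client
	OutputBranch string
	JobID        string
}

func (c *Committer) jobBranch() string { return fmt.Sprintf("job-%s", c.JobID) }

func (c *Committer) taskBranch(task string) string {
	return fmt.Sprintf("job-%s-task-%s", c.JobID, task)
}

func (c *Committer) SetupJob(ctx context.Context) error {
	return c.Client.CreateBranch(ctx, c.jobBranch(), c.OutputBranch)
}

func (c *Committer) SetupTask(ctx context.Context, task string) error {
	return c.Client.CreateBranch(ctx, c.taskBranch(task), c.jobBranch())
}

// CommitTask is where the contention came from: every task merges into the
// same job branch, and concurrent merges race to update it.
func (c *Committer) CommitTask(ctx context.Context, task string) error {
	return c.Client.Merge(ctx, c.taskBranch(task), c.jobBranch())
}

func (c *Committer) CommitJob(ctx context.Context) error {
	return c.Client.Merge(ctx, c.jobBranch(), c.OutputBranch)
}
```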

This ran slowly, with concurrent merges racing to update the job branch.  A
client API timeout increase was needed, after which writing still took >30m.

The proposal to [re-use metaranges in merge retries][meta-retry] could help
here.

(We also had to boost the Pyramid cache size to avoid spills.  But a cache
fill takes very little time, and cannot account for 30m.)

At this stage we switched the lakeFS cloud staging cluster to use DynamoDB.
This slowed things down further, to >50m.  Increasing DynamoDB quota helped,
but times did not go below 40m.

[meta-retry]: https://github.com/treeverse/lakeFS/blob/1dcefc2eecdc9b8693439c1afc6beccc134a11a8/design/open/reuse-contented-merge-metaranges/reuse-contented-merge-metaranges.md

### 2. Independent concurrent merges during jobCommit

We worked around concurrent merges by merging only in jobCommit.  This
allows independent concurrent merges of all the branches[^1] and eliminates
racing merges.  There was no significant speedup (~30-40m, but on DynamoDB).

DynamoDB was throttled continually at this point.  One issue: FindMergeBase
has to be called once per merge, but it is slow to fetch commits.  The
commit cache is much too small and not configurable, so cache spills happen.
And fetches are batched, adding even more latency to each iteration.  See
issues #2981, #4637.

One way to make this work well would be to support multiway merges like Git
does.  That would remove the need for a fast FindMergeBase, and simplify the
user experience when looking at commit logs.

[^1]: Performed optimally: keep a fixed number of workers.  Each repeatedly
    merges two branches that are not participating in any merge.  Being so
    dynamic means that the result of a particularly slow merge is used late,
    minimizing delay.  (A sketch of this scheme follows.)
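
A sketch of the scheme described in the footnote: a fixed pool of workers
repeatedly takes two branches that are not in any ongoing merge, merges one
into the other, and returns the result to the pool, so a single slow merge
only delays the final steps.  The `Merger` interface is hypothetical.

```go
// Sketch only: dynamic pairwise merging with a fixed worker pool.  Branches
// sitting in the channel are, by construction, not part of any ongoing
// merge.
package pairwise

import (
	"context"
	"sync"
)

type Merger interface {
	Merge(ctx context.Context, source, target string) error
}

// Reduce merges all branches down to one branch and returns its name.
func Reduce(ctx context.Context, m Merger, branches []string, workers int) (string, error) {
	if len(branches) == 0 {
		return "", nil
	}
	if workers < 1 {
		workers = 1
	}
	pending := make(chan string, len(branches))
	for _, b := range branches {
		pending <- b
	}

	var (
		mu        sync.Mutex
		remaining = len(branches) // branches left once all claimed merges finish
		firstErr  error
		wg        sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				// Claim one merge; each merge reduces the branch count by one.
				mu.Lock()
				if remaining <= 1 {
					mu.Unlock()
					return
				}
				remaining--
				mu.Unlock()

				// Take any two idle branches and merge one into the other.
				src, dst := <-pending, <-pending
				if err := m.Merge(ctx, src, dst); err != nil {
					mu.Lock()
					if firstErr == nil {
						firstErr = err
					}
					mu.Unlock()
				}
				// dst now holds src's changes and is immediately reusable,
				// even while slower merges are still running.
				pending <- dst
			}
		}()
	}
	wg.Wait()
	return <-pending, firstErr
}
```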

### 3. taskCommit without merging

We next avoided all task branch merges.  Instead, we re-linked objects from
the task branch into the job branch.  This worked nicely: 2.75m.  Still not
the desired 2m, but seemingly within reach.

But [support for this re-linking had just been removed][no-relink].

[no-relink]: https://github.com/treeverse/lakeFS/blob/master/design/accepted/gc_plus/uncommitted-gc.md#getlinkphysicaladdress

### 4. Tasks on job branch

In the final attempt, we wrote all objects to a temporary prefix on the job
branch and re-linked them to their final location during taskCommit.  This
is allowed by the new API.  However, it still overloads lakeFS, regardless
of configured DynamoDB capacity.  lakeFS repeatedly accesses the branch
entry, and DynamoDB cannot partition it, so it is subject to a hard rate
limit.  As a result we went back to >30m.

### Complication: two job IDs

This applies to all work done.

One complication appears to be a bug in Spark writers (FileFormatWriter).
The writers generate two distinct job contexts: one to call
startJob/commitJob, the other to call startTask/commitTask on all tasks.  So
it is not possible to use the job ID to identify the job.

Instead we simply used the output path.  Doing this prevents implementation
of concurrent writers; obviously we never got that far.
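
A small sketch of the workaround, assuming the committer keys its state by a
hash of the output path rather than by Spark's job ID; the function name is
hypothetical.

```go
// Sketch only: derive a stable job identifier from the output path, since
// the two Spark job contexts do not share a job ID.  This also means two
// concurrent writers to the same output path cannot be told apart.
package jobkey

import (
	"crypto/sha256"
	"encoding/hex"
)

// JobKey returns a short, branch-name-safe identifier for a write job.
func JobKey(outputPath string) string {
	sum := sha256.Sum256([]byte(outputPath))
	return hex.EncodeToString(sum[:8])
}
```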