---
title: Performance Best Practices
parent: Understanding lakeFS
description: This section suggests performance best practices to work with lakeFS.
---
# Performance Best Practices

{% include toc.html %}

## Overview
Use this guide to achieve the best performance with lakeFS.

## Avoid concurrent commits/merges
Just like in Git, branch history is composed of commits and is linear by nature.
Concurrent commits/merges on the same branch result in a race: the first operation finishes successfully, while the rest retry.
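If multiple workers produce data for the same branch, consider funneling their commits through a single point rather than issuing them concurrently. A minimal sketch, assuming a hypothetical `commit_to_branch` wrapper around your lakeFS client's commit call:

```python
import threading

# Hypothetical wrapper around your lakeFS client's commit call.
def commit_to_branch(branch: str, message: str):
    ...

# Process-wide lock that serializes commits, so workers targeting the
# same branch never race each other and never need to retry.
_commit_lock = threading.Lock()

def safe_commit(branch: str, message: str):
    with _commit_lock:
        return commit_to_branch(branch, message)
```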

## Perform meaningful commits
It's a good idea to perform commits that are meaningful in the sense that they represent a logical point in your data's lifecycle. While lakeFS supports arbitrarily large commits, avoiding commits with a huge number of objects will result in a more comprehensible commit history.

## Use zero-copy import
To import objects into lakeFS, either once or regularly, lakeFS offers a [zero-copy import][zero-copy-import] feature.
Use this feature to import a large number of objects into lakeFS, instead of copying them into your repository.
Zero-copy import creates references to the existing objects in your bucket, avoiding the copy entirely.
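For example, an import can be triggered programmatically through the lakeFS import API. A minimal sketch using Python's `requests`; the endpoint, credentials, and bucket paths are placeholders, so adapt them to your setup:

```python
import requests

LAKEFS_API = "https://lakefs.example.com/api/v1"   # placeholder endpoint
AUTH = ("<access-key-id>", "<secret-access-key>")  # placeholder credentials

# Start a zero-copy import: lakeFS records references to the existing
# objects under the source prefix instead of copying them.
resp = requests.post(
    f"{LAKEFS_API}/repositories/example-repo/branches/main/import",
    auth=AUTH,
    json={
        "paths": [{
            "path": "s3://source-bucket/raw/",  # existing objects to reference
            "destination": "raw/",              # logical prefix in the repository
            "type": "prefix",
        }],
        "commit": {"message": "Zero-copy import of raw data"},
    },
)
resp.raise_for_status()
import_id = resp.json()["id"]  # can be used to poll the import's status
```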

## Read data using the commit ID
In cases where you are only interested in reading committed data:
* Use a commit ID (or a tag ID) in your path (e.g., `lakefs://repo/a1b2c3`).
* Add `@` after the branch name and before the path (e.g., `lakefs://repo/main@/path`).

When accessing data using the branch name (e.g. `lakefs://repo/main/path`), lakeFS will also try to fetch uncommitted data, which may result in reduced performance.
For more information, see [how uncommitted data is managed in lakeFS][representing-refs-and-uncommitted-metadata].
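As an illustration, when reading through the lakeFS S3 gateway, the repository maps to the bucket name and the ref is the first path component, so pinning a commit ID is just a matter of the key you request. A minimal sketch with `boto3`; the endpoint, credentials, and object path are placeholders:

```python
import boto3

# lakeFS exposes an S3-compatible endpoint; the values below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<access-key-id>",
    aws_secret_access_key="<secret-access-key>",
)

# Reading with a commit ID as the ref: only committed data is considered.
obj = s3.get_object(Bucket="example-repo", Key="a1b2c3/datasets/users.parquet")
data = obj["Body"].read()

# Reading with a branch name would also consult uncommitted staging data,
# which is what this section recommends avoiding for committed-only reads:
# s3.get_object(Bucket="example-repo", Key="main/datasets/users.parquet")
```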

## Operate directly on the storage
Sometimes, storage operations can become a bottleneck, for example when your data pipelines upload many large objects.
In such cases, it can be beneficial to perform only versioning operations on lakeFS, while performing storage reads/writes directly on the object store.
lakeFS offers multiple ways to do that:
* The [`lakectl fs upload --pre-sign`][lakectl-upload] command (or [download][lakectl-download]).
* The lakeFS [Hadoop Filesystem][hadoopfs].
* The [staging API][api-staging], which can be used to add lakeFS references to objects after writing them to the storage (see the sketch below).

Accessing the object store directly is a faster way to interact with your data.
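The staging API flow is roughly: ask lakeFS for a physical address, write the object straight to that address on the object store, then link it back into lakeFS. A hedged sketch with Python's `requests`; the endpoint, credentials, checksum, and size are placeholders:

```python
import requests

LAKEFS_API = "https://lakefs.example.com/api/v1"   # placeholder endpoint
AUTH = ("<access-key-id>", "<secret-access-key>")  # placeholder credentials
repo, branch, path = "example-repo", "main", "datasets/users.parquet"

# 1. Get a physical address on the underlying bucket to write to.
resp = requests.get(
    f"{LAKEFS_API}/repositories/{repo}/branches/{branch}/staging/backing",
    params={"path": path}, auth=AUTH)
resp.raise_for_status()
staging = resp.json()  # includes "physical_address"

# 2. Write the object directly to staging["physical_address"] on the
#    object store (e.g. with your cloud SDK) -- not shown here.

# 3. Link the written object into lakeFS so it appears at the logical path.
resp = requests.put(
    f"{LAKEFS_API}/repositories/{repo}/branches/{branch}/staging/backing",
    params={"path": path}, auth=AUTH,
    json={"staging": staging,
          "checksum": "<object-checksum>",   # placeholder
          "size_bytes": 0})                  # placeholder size
resp.raise_for_status()
```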

## Zero-copy
lakeFS provides a zero-copy mechanism for your data. Instead of copying the data, you can create a new branch.
Creating a new branch takes constant time, as the new branch points to the same data as its parent.
It also lowers storage costs, since no objects are duplicated.
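For instance, branch creation is a metadata-only operation, so it completes in constant time regardless of how much data the parent branch holds. A minimal sketch against the lakeFS REST API; the endpoint and credentials are placeholders:

```python
import requests

LAKEFS_API = "https://lakefs.example.com/api/v1"   # placeholder endpoint
AUTH = ("<access-key-id>", "<secret-access-key>")  # placeholder credentials

# Create a branch from main: only a pointer is written, no objects are copied.
resp = requests.post(
    f"{LAKEFS_API}/repositories/example-repo/branches",
    auth=AUTH,
    json={"name": "experiment-1", "source": "main"},
)
resp.raise_for_status()
```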


[hadoopfs]:  {% link integrations/spark.md %}#lakefs-hadoop-filesystem
[zero-copy-import]:  {% link howto/import.md %}#zero-copy-import
[lakectl-upload]:  {% link reference/cli.md %}#lakectl-fs-upload
[lakectl-download]:  {% link reference/cli.md %}#lakectl-fs-download
[api-staging]:  {% link reference/api.md %}#operations-objects-stageObject
[representing-refs-and-uncommitted-metadata]:  {% link understand/how/versioning-internals.md %}#representing-references-and-uncommitted-metadata