github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/understand/performance-best-practices.md (about) 1 --- 2 title: Performance Best Practices 3 parent: Understanding lakeFS 4 description: This section suggests performance best practices to work with lakeFS. 5 --- 6 # Performance Best Practices 7 8 {% include toc.html %} 9 10 ## Overview 11 Use this guide to achieve the best performance with lakeFS. 12 13 ## Avoid concurrent commits/merges 14 Just like in Git, branch history is composed by commits and is linear by nature. 15 Concurrent commits/merges on the same branch result in a race. The first operation will finish successfully while the rest will retry. 16 17 ## Perform meaningful commits 18 It's a good idea to perform commits that are meaningful in the senese that they represent a logical point in your data's lifecycle. While lakeFS supports arbirartily large commits, avoiding commits with a huge number of objects will result in a more comprehensible commit history. 19 20 ## Use zero-copy import 21 To import object into lakeFS, either a single time or regularly, lakeFS offers a [zero-copy import][zero-copy-import] feature. 22 Use this feature to import a large number of objects to lakeFS, instead of simply copying them into your repository. 23 This feature will create a reference to the existing objects on your bucket and avoids the copy. 24 25 ## Read data using the commit ID 26 In cases where you are only interested in reading committed data: 27 * Use a commit ID (or a tag ID) in your path (e.g: `lakefs://repo/a1b2c3`). 28 * Add `@` before the path `lakefs://repo/main@/path`. 29 30 When accessing data using the branch name (e.g. `lakefs://repo/main/path`) lakeFS will also try to fetch uncommitted data, which may result in reduced performance. 31 For more information, see [how uncommitted data is managed in lakeFS][representing-refs-and-uncommitted-metadata]. 32 33 ## Operate directly on the storage 34 Sometimes, storage operations can become a bottleneck. For example, when your data pipelines upload many big objects. 35 In such cases, it can be beneficial to perform only versioning operations on lakeFS, while performing storage reads/writes directly on the object store. 36 lakeFS offers multiple ways to do that: 37 * The [`lakectl fs upload --pre-sign`][lakectl-upload] command (or [download][lakectl-download]). 38 * The lakeFS [Hadoop Filesystem][hadoopfs]. 39 * The [staging API][api-staging] which can be used to add lakeFS references to objects after having written them to the storage. 40 41 Accessing the object store directly is a faster way to interact with your data. 42 43 ## Zero-copy 44 lakeFS provides a zero-copy mechanism to data. Instead of copying the data, we can check out to a new branch. 45 Creating a new branch will take constant time as the new branch points to the same data as its parent. 46 It will also lower the storage cost. 47 48 49 [hadoopfs]: {% link integrations/spark.md %}#lakefs-hadoop-filesystem 50 [zero-copy-import]: {% link howto/import.md %}#zero-copy-import 51 [lakectl-upload]: {% link reference/cli.md %}#lakectl-fs-upload 52 [lakectl-download]: {% link reference/cli.md %}#lakectl-fs-download 53 [api-staging]: {% link reference/api.md %}#operations-objects-stageObject 54 [representing-refs-and-uncommitted-metadata]: {% link understand/how/versioning-internals.md %}#representing-references-and-uncommitted-metadata