---
title: Delta Lake
description: This section explains how to use Delta Lake with lakeFS.
parent: Integrations
---

# Using lakeFS with Delta Lake

[Delta Lake](https://delta.io/) is an open-source storage framework designed to improve performance and provide transactional guarantees to data lake tables.

Because lakeFS is format-agnostic, you can save data in Delta format within a lakeFS repository and benefit from the advantages of both technologies. Specifically:

1. ACID operations can span across multiple Delta tables.
2. [CI/CD hooks][data-quality-gates] can validate Delta table contents, schema, or even referential integrity.
3. lakeFS supports zero-copy branching for quick experimentation with full isolation.

{% include toc.html %}

## Delta Lake Tables from the lakeFS Perspective

lakeFS is a data versioning tool that functions at the **object** level. This means that, by default, lakeFS remains agnostic 
to whether the objects within a Delta table location represent a table, table metadata, or data. As per the Delta Lake [protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md), 
any modification to a table—whether it involves adding data or altering table metadata—results in the creation of a new object 
in the table's [transaction log](https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html), 
which typically resides under the `_delta_log` path, relative to the root of the table's directory. Each new log object has a version incremented from that of its predecessor.

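For illustration, these log objects are plain JSON files with zero-padded, incrementing version numbers. Below is a minimal sketch of listing them through the lakeFS S3 Gateway with `boto3`; the repository (`my-repo`), branch (`main`), table path, endpoint, and credentials are all placeholders:

```python
import boto3

# Placeholder endpoint and credentials; replace with your lakeFS endpoint and keys.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakeFS access key>",
    aws_secret_access_key="<lakeFS secret key>",
)

# Through the lakeFS S3 Gateway, the bucket is the repository and the key starts with the branch name.
resp = s3.list_objects_v2(Bucket="my-repo", Prefix="main/tables/my-table/_delta_log/")
for obj in resp.get("Contents", []):
    print(obj["Key"])  # e.g. main/tables/my-table/_delta_log/00000000000000000000.json
```
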
Consequently, when making changes to a Delta table within the lakeFS environment, these changes are reflected as changes 
to objects within the table location. For instance, inserting a record into a table named `my-table`, which is partitioned 
by `category` and `country`, is represented in lakeFS as objects added under the table prefix (i.e., the table data) and in the table's transaction log.
![Record addition](../assets/img/delta-record-addition.png)

Similarly, when performing a metadata operation such as renaming a table column, new objects are appended to the table transaction log, 
indicating the schema change.
![Schema change](../assets/img/delta-schema-change.png)

## Using Delta Lake with lakeFS from Apache Spark

_Given the native integration between Delta Lake and Spark, it's most common that you'll interact with Delta tables in a Spark environment._

To configure a Spark environment to read from and write to a Delta table within a lakeFS repository, you need to set the proper credentials and endpoint in the S3A Hadoop configuration, just as you would for any [Spark](./spark.md) environment.

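For reference, here is a minimal PySpark sketch of such a configuration. The endpoint and keys are placeholders, it assumes the Delta Lake Spark package is already installed on the cluster, and the exact way you set Hadoop options may differ depending on how your cluster is provisioned; see the [Spark integration guide](./spark.md) for the authoritative settings.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-on-lakefs")
    # Point the S3A filesystem at the lakeFS S3 Gateway instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<lakeFS access key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakeFS secret key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```
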
Once set, you can interact with Delta tables using regular Spark path URIs. Make sure that you include the lakeFS repository and branch name:

```scala
df.write.format("delta").save("s3a://<repo-name>/<branch-name>/path/to/delta-table")
```

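Reading uses the same URI scheme. A minimal PySpark sketch, with the same placeholder path as above:

```python
# Read the Delta table back from the branch it was written to (placeholder path).
df = spark.read.format("delta").load("s3a://<repo-name>/<branch-name>/path/to/delta-table")
df.show()
```
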
Note: If using the Databricks Analytics Platform, see the [integration guide](./spark.md#installation) for configuring a Databricks cluster to use lakeFS.

To see the integration in action, see [this notebook](https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/delta-lake.ipynb) in the [lakeFS Samples Repository](https://github.com/treeverse/lakeFS-samples/).

## Using Delta Lake with lakeFS from Python

The [delta-rs](https://github.com/delta-io/delta-rs) library provides bindings for Python. This means that you can use Delta Lake and lakeFS directly from Python without needing Spark. Integration is done through the [lakeFS S3 Gateway]({% link understand/architecture.md %}#s3-gateway).

The documentation for the `deltalake` Python module details how to [read](https://delta-io.github.io/delta-rs/python/usage.html#loading-a-delta-table), [write](https://delta-io.github.io/delta-rs/python/usage.html#writing-delta-tables), and [query](https://delta-io.github.io/delta-rs/python/usage.html#querying-delta-tables) Delta Lake tables. To use it with lakeFS, use an `s3a` path for the table based on your repository and branch (for example, `s3a://delta-lake-demo/main/my_table/`) and specify the following `storage_options`:

```python
storage_options = {"AWS_ENDPOINT": "<your lakeFS endpoint>",
                   "AWS_ACCESS_KEY_ID": "<your lakeFS access key>",
                   "AWS_SECRET_ACCESS_KEY": "<your lakeFS secret key>",
                   "AWS_REGION": "us-east-1",
                   "AWS_S3_ALLOW_UNSAFE_RENAME": "true"
                  }
```

If your lakeFS installation is not using HTTPS (for example, you're just running it locally), add the option

```python
                   "AWS_STORAGE_ALLOW_HTTP": "true"
```

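Putting it together, here is a minimal sketch of writing and reading a table with the `deltalake` package, using the `storage_options` defined above; the repository, branch, and table path are placeholders:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_uri = "s3a://delta-lake-demo/main/my_table/"  # <repo>/<branch>/<table path>

# Write a small DataFrame as a Delta table on the 'main' branch
# (storage_options as defined above).
df = pd.DataFrame({"id": [1, 2, 3], "country": ["US", "DE", "IL"]})
write_deltalake(table_uri, df, storage_options=storage_options)

# Read it back.
dt = DeltaTable(table_uri, storage_options=storage_options)
print(dt.to_pandas())
```
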
To see the integration in action, see [this notebook](https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/delta-lake-python.ipynb) in the [lakeFS Samples Repository](https://github.com/treeverse/lakeFS-samples/).

## Exporting Delta Lake tables from lakeFS into Unity Catalog

This option is for users who manage Delta Lake tables with lakeFS and access them through Databricks [Unity Catalog](https://www.databricks.com/product/unity-catalog). lakeFS offers 
a [Data Catalog Export](../howto/catalog_exports.md) functionality that provides read-only access to your Delta tables from within Unity Catalog. Using the data catalog exporters, 
you can work on Delta tables in isolation and easily explore them within Unity Catalog.

Once exported, you can query the versioned table data with:
```sql
SELECT * FROM my_catalog.main.my_delta_table
```
Here, `main` is the name of the lakeFS branch from which the Delta table was exported.

To enable Delta table exports to Unity Catalog, follow the [Unity Catalog integration guide](unity-catalog.md).

## Limitations

### Multi-Writer Support in lakeFS for Delta Lake Tables
{: .no_toc}

lakeFS currently supports a single writer for Delta Lake tables. Attempting to write to a Delta table from multiple writers may result in two types of issues:
1. Merge Conflicts: These conflicts arise when multiple writers modify a Delta table on different branches, and an attempt is made to merge these branches.
![Merge conflicts](../assets/img/delta-lake/merge-conflict.png)
2. Concurrent File Overwrite: This issue occurs when multiple writers concurrently modify a Delta table on the same branch.
![Concurrent file overwrite](../assets/img/delta-lake/concurrent-file-overwrite.png)

Note: lakeFS currently lacks its own LogStore implementation, and the default LogStore does not control concurrency.
{: .note }
To address these limitations, consider following the [best practices for implementing multi-writer support](#implementing-multi-writer-support-through-lakefs-branches-and-merges).

## Best Practices

### Implementing Multi-Writer Support through lakeFS Branches and Merges

To achieve safe multi-writes to a Delta Lake table on lakeFS, we recommend following these best practices:
1. **Isolate Changes:** Make modifications to your table in isolation. Each set of changes should be associated with a dedicated lakeFS branch, branching off from the main branch.
2. **Merge Atomically:** After making changes in isolation, attempt to merge them back into the main branch. This approach guarantees that changes are integrated as a single, cohesive unit.

The workflow involves:
* Creating a new lakeFS branch from the main branch for any table change.
* Making modifications in isolation.
* Attempting to merge the changes back into the main branch.
* Iterating the process in case of a merge failure due to conflicts (see the sketch below).

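A minimal sketch of this loop using the high-level `lakefs` Python SDK (`pip install lakefs`). The repository, branch, and table names are placeholders, and the exact SDK calls and exception types may differ between SDK versions, so treat this as an outline rather than a drop-in implementation:

```python
import lakefs

repo = lakefs.repository("delta-lake-demo")  # placeholder repository name
main = repo.branch("main")

for attempt in range(3):
    # 1. Isolate: create a dedicated branch for this change.
    work = repo.branch(f"update-my-table-{attempt}").create(source_reference="main")

    # 2. Modify the Delta table on the work branch, e.g. with write_deltalake()
    #    against s3a://delta-lake-demo/<work branch>/my_table/, then commit.
    work.commit(message="Update my_table")

    # 3. Merge atomically; on a merge conflict, retry from a fresh branch.
    try:
        work.merge_into(main)
        break
    except lakefs.exceptions.ConflictException:
        continue
```
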
The diagram below provides a visual representation of how branches and merges can be utilized to manage concurrency effectively:
![Multi writers workaround](../assets/img/delta-lake/multi-writers-with-lakefs-branches-merges.png)

### Follow Vacuum with Garbage Collection

To delete unused files from a table directory while working with Delta Lake over lakeFS, you first need to use Delta Lake
[Vacuum](https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html) to soft-delete the files, and then use
[lakeFS Garbage Collection](../howto/garbage-collection) to hard-delete them from the storage.

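For example, the soft-delete step can be run with the `deltalake` package against a branch path. This is a sketch that reuses the `storage_options` from the Python section above; the path and retention period are placeholders. The hard-delete step is configured separately on the lakeFS server, as described in the Garbage Collection guide.

```python
from deltalake import DeltaTable

dt = DeltaTable("s3a://delta-lake-demo/main/my_table/", storage_options=storage_options)

# Vacuum files no longer referenced by the table and older than 7 days.
# With dry_run=False the files are deleted from the branch; they remain in the
# underlying storage until lakeFS Garbage Collection removes them.
removed = dt.vacuum(retention_hours=168, dry_run=False)
print(removed)
```
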
**Note:** lakeFS enables you to recover from an undesired vacuum run by reverting its changes, as long as you do so before running Garbage Collection.
{: .note }

### When running lakeFS inside your VPC (on AWS)

When lakeFS runs inside your private network, your Databricks cluster needs to be able to access it. 
This can be done by setting up VPC peering between the two VPCs 
(the one where lakeFS runs and the one where Databricks runs). For this to work on Delta Lake tables, you would also have to disable multi-cluster writes with:

```
spark.databricks.delta.multiClusterWrites.enabled false
```

### Using multi-cluster writes (on AWS)

When using multi-cluster writes, Databricks overrides Delta's S3-commit action. 
The new action tries to contact lakeFS from servers on Databricks' own AWS account, which of course won't be able to access your private network. 
So, if you must use multi-cluster writes, you'll have to allow access from Databricks' AWS account to lakeFS. 
If you are trying to achieve that, please reach out on Slack and the community will try to assist.

## Further Reading

See the [Guaranteeing Consistency in Your Delta Lake Tables With lakeFS](https://lakefs.io/blog/guarantee-consistency-in-your-delta-lake-tables-with-lakefs/) post on the lakeFS blog to learn how to 
guarantee data quality in a Delta table by utilizing lakeFS branches.


[data-quality-gates]:  {% link understand/use_cases/cicd_for_data.md %}#using-hooks-as-data-quality-gates
[deploy-docker]:  {% link howto/deploy/onprem.md %}#docker