---
title: Unity Catalog
description: Accessing lakeFS-exported Delta Lake tables from Unity Catalog.
parent: Integrations
redirect_from:
  - /using/unity_catalog
  - /integrations/unity_catalog
---

# Using lakeFS with the Unity Catalog

{% include toc_2-3.html %}

## Overview

Databricks Unity Catalog serves as a centralized data governance platform for your data lakes.
Through the Unity Catalog, you can search for and locate data assets across workspaces via a unified catalog.
Leveraging the external tables feature within Unity Catalog, you can register a Delta Lake table exported from lakeFS and
access it through the unified catalog.
The following step-by-step guide walks you through configuring a [Lua hook]({% link howto/hooks/lua.md %})
that exports Delta Lake tables from lakeFS and registers them in Unity Catalog.

{: .note}
> Currently, the Unity Catalog export feature supports only AWS S3 as the underlying storage. It's planned to [support other cloud providers soon](https://github.com/treeverse/lakeFS/issues/7199).

## Prerequisites

Before starting, ensure you have the following:

1. Access to Unity Catalog.
2. An active lakeFS installation with S3 as the backing storage, and a repository in this installation (see the sketch after this list).
3. A Databricks SQL warehouse.
4. AWS credentials with S3 access.
5. lakeFS credentials with access to your Delta tables.
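
If you don't have a repository yet, here's a minimal sketch of creating one with `lakectl`, assuming an S3 bucket you control and the example repository name `repo` used throughout this guide:

```bash
# Create a lakeFS repository named "repo", backed by an S3 bucket/prefix you control
lakectl repo create lakefs://repo s3://<BUCKET_NAME>/<PREFIX>
```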

{: .note}
> Supported from lakeFS v1.4.0.

### Databricks authentication

Since the hook ultimately registers a table in Unity Catalog, it must authenticate with Databricks.
Make sure that:

1. You have a Databricks [Service Principal](https://docs.databricks.com/en/dev-tools/service-principals.html).
2. The service principal has [token usage permissions](https://docs.databricks.com/en/dev-tools/service-principals.html#step-3-assign-workspace-level-permissions-to-the-databricks-service-principal),
   and an associated [token](https://docs.databricks.com/en/dev-tools/service-principals.html#step-4-generate-a-databricks-personal-access-token-for-the-databricks-service-principal)
   configured.
3. The service principal has the `Service principal: Manager` privilege over itself (Workspace: Admin console -> Service principals -> `<service principal>` -> Permissions -> Grant access -> `<service principal>`:
   `Service principal: Manager`), and has `Workspace access` and `Databricks SQL access` checked (Admin console -> Service principals -> `<service principal>` -> Configurations).
4. Your SQL warehouse allows the service principal to use it (SQL Warehouses -> `<SQL warehouse>` -> Permissions -> `<service principal>`: `Can use`).
5. The catalog grants the `USE CATALOG`, `USE SCHEMA` and `CREATE SCHEMA` permissions to the service principal (Catalog -> `<catalog name>` -> Permissions -> Grant -> `<service principal>`: `USE CATALOG`, `USE SCHEMA`, `CREATE SCHEMA`).
6. You have an _External Location_ configured, and the service principal has the `CREATE EXTERNAL TABLE` permission over it (Catalog -> External Data -> External Locations -> Create location).
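
Optionally, you can sanity-check the service principal's token and look up the SQL warehouse ID you'll need in the hook configuration below. A hedged sketch with the Databricks CLI, assuming it is configured to authenticate as the service principal:

```bash
# Confirm the CLI is authenticated as the service principal
databricks current-user me

# List SQL warehouses to find the warehouse ID used later in the action configuration
databricks warehouses list
```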

## Guide

### Table descriptor definition

To guide the Unity Catalog exporter in configuring the table in the catalog, define its properties in the Delta Lake table descriptor.
The table descriptor should include (at minimum) the following fields:

1. `name`: The table name.
2. `type`: Should be `delta`.
3. `catalog`: The name of the catalog in which the table will be created.
4. `path`: The path in lakeFS (starting from the root of the branch) in which the Delta Lake table's data is found.

Let's define the table descriptor and upload it to lakeFS.

Save the following as `famous-people-td.yaml`:

```yaml
---
name: famous_people
type: delta
catalog: my-catalog-name
path: tables/famous-people
```

{: .note}
> It's recommended to create a Unity catalog with the same name as your repository.

Upload the table descriptor to `_lakefs_tables/famous-people-td.yaml` and commit:

```bash
lakectl fs upload lakefs://repo/main/_lakefs_tables/famous-people-td.yaml -s ./famous-people-td.yaml && \
lakectl commit lakefs://repo/main -m "add famous people table descriptor"
```

### Write some data

Insert data into the table path using your preferred method (e.g. [Spark]({% link integrations/spark.md %})), and commit upon completion (a commit example follows the snippet below).

We'll use Spark and lakeFS's S3 gateway to write some data as a Delta table:

```bash
pyspark --packages "io.delta:delta-spark_2.12:3.0.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262" \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider='org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider' \
  --conf spark.hadoop.fs.s3a.endpoint='<LAKEFS_SERVER_URL>' \
  --conf spark.hadoop.fs.s3a.access.key='<LAKEFS_ACCESS_KEY>' \
  --conf spark.hadoop.fs.s3a.secret.key='<LAKEFS_SECRET_ACCESS_KEY>' \
  --conf spark.hadoop.fs.s3a.path.style.access=true
```

```python
data = [
   ('James','Bond','England','intelligence'),
   ('Robbie','Williams','England','music'),
   ('Hulk','Hogan','USA','entertainment'),
   ('Mister','T','USA','entertainment'),
   ('Rafael','Nadal','Spain','professional athlete'),
   ('Paul','Haver','Belgium','music'),
]
columns = ["firstname","lastname","country","category"]
df = spark.createDataFrame(data=data, schema=columns)
df.write.format("delta").mode("overwrite").partitionBy("category", "country").save("s3a://repo/main/tables/famous-people")
```

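Once the write completes, commit the newly written data (the commit message below is only an example):

```bash
lakectl commit lakefs://repo/main -m "add famous people data"
```
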
### The Unity Catalog exporter script

{: .note}
> For code references, check the [delta_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportdelta_exporter) and
[unity_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportunity_exporter) docs.

Create `unity_exporter.lua`:

```lua
local aws = require("aws")
local formats = require("formats")
local databricks = require("databricks")
local delta_export = require("lakefs/catalogexport/delta_exporter")
local unity_export = require("lakefs/catalogexport/unity_exporter")

local sc = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)

-- Export the Delta Lake tables listed in the hook's table_defs:
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.region)
local delta_table_details = delta_export.export_delta_log(action, args.table_defs, sc.put_object, delta_client, "_lakefs_tables")

-- Register the exported tables in Unity Catalog:
local databricks_client = databricks.client(args.databricks_host, args.databricks_token)
local registration_statuses = unity_export.register_tables(action, "_lakefs_tables", delta_table_details, databricks_client, args.warehouse_id)

for t, status in pairs(registration_statuses) do
    print("Unity catalog registration for table \"" .. t .. "\" completed with commit schema status : " .. status .. "\n")
end
```

Upload the Lua script to the `main` branch under `scripts/unity_exporter.lua` and commit:

```bash
lakectl fs upload lakefs://repo/main/scripts/unity_exporter.lua -s ./unity_exporter.lua && \
lakectl commit lakefs://repo/main -m "upload unity exporter script"
```

### Action configuration

Define an action configuration that runs the above script after a commit is completed (`post-commit`) on the `main` branch.

Create `unity_exports_action.yaml`:

```yaml
---
name: unity_exports
on:
  post-commit:
    branches: ["main"]
hooks:
  - id: unity_export
    type: lua
    properties:
      script_path: scripts/unity_exporter.lua
      args:
        aws:
          access_key_id: <AWS_ACCESS_KEY_ID>
          secret_access_key: <AWS_SECRET_ACCESS_KEY>
          region: <AWS_REGION>
        lakefs: # provide credentials of a user that has access to the script and the Delta table
          access_key_id: <LAKEFS_ACCESS_KEY_ID>
          secret_access_key: <LAKEFS_SECRET_ACCESS_KEY>
        table_defs: # the table descriptors (under _lakefs_tables/) to register in Unity Catalog
          - famous-people-td
        databricks_host: <DATABRICKS_HOST_URL>
        databricks_token: <DATABRICKS_SERVICE_PRINCIPAL_TOKEN>
        warehouse_id: <WAREHOUSE_ID>
```

Upload the action configuration to `_lakefs_actions/unity_exports_action.yaml` and commit:

{: .note}
> Once the commit finishes, the action will start running, since we've configured it to run on `post-commit`
events on the `main` branch.

```bash
lakectl fs upload lakefs://repo/main/_lakefs_actions/unity_exports_action.yaml -s ./unity_exports_action.yaml && \
lakectl commit lakefs://repo/main -m "upload action and run it"
```

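In addition to the hook log in the lakeFS UI (shown below), you can check the run from the CLI; a quick sketch with `lakectl`, assuming the same example repository:

```bash
# List recent action runs for the repository
lakectl actions runs list lakefs://repo
```
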
The action has run and exported the `famous_people` Delta Lake table to the repo's storage namespace, and has registered
the table as an external table in Unity Catalog under the catalog `my-catalog-name`, schema `main` (named after the branch) and
table name `famous_people`: `my-catalog-name.main.famous_people`.

![Hooks log result in lakeFS UI]({{ site.baseurl }}/assets/img/unity_export_hook_result_log.png)

### Databricks Integration

After registering the table in Unity Catalog, you can use your preferred method to [query the data](https://docs.databricks.com/en/query/index.html)
from the exported table under `my-catalog-name.main.famous_people`, view it in the Databricks Catalog Explorer, or
retrieve it using the Databricks CLI with the following command:

```bash
databricks tables get my-catalog-name.main.famous_people
```

![Unity Catalog Explorer view]({{ site.baseurl }}/assets/img/unity_exported_table_columns.png)
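
If you'd rather query from the terminal as well, here is a hedged sketch using the Databricks SQL Statement Execution API through the Databricks CLI; it assumes the CLI is authenticated and that you fill in your own warehouse ID:

```bash
# Run a SQL query against the registered table via the SQL Statement Execution API
databricks api post /api/2.0/sql/statements --json '{
  "warehouse_id": "<WAREHOUSE_ID>",
  "statement": "SELECT * FROM `my-catalog-name`.main.famous_people LIMIT 10"
}'
```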