# Unity catalog exporter

## Introduction

Currently, the Databricks Unity catalog supports only direct cloud provider storage endpoints and authentication,
so it cannot be configured to work directly with lakeFS.
We wish to overcome this limitation and enable Unity catalog-backed services to read Delta Lake tables, the default
table format used by Databricks, from lakeFS.

---

## Proposed Solution

Following the [catalog exports issue](https://github.com/treeverse/lakeFS/issues/6461), the Unity catalog exporter will
use the [Delta Lake catalog exporter](./delta-catalog-exporter.md) to export an existing Delta Lake table to
`${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}`. It will then create an external table
in an existing `catalog.schema` within the Unity catalog, using the Databricks API, the user-provided
`_lakefs_tables/<table>.yaml` definitions, and the location the Delta log was exported to.
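
For concreteness, a minimal sketch of the export location described above (the helper name is hypothetical; the path
template is the one defined in this section):

```go
package unityexport

import "fmt"

// exportedTablePath renders the location the Delta Lake catalog exporter writes a table's
// Delta log to: ${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}.
// The external table created in Unity catalog will point at this location.
func exportedTablePath(storageNamespace, ref, commitID, tableName string) string {
	return fmt.Sprintf("%s/_lakefs/exported/%s/%s/%s", storageNamespace, ref, commitID, tableName)
}
```

For example, `exportedTablePath("s3://example-bucket/repo", "my-branch", "abc123", "my-table")` (all values
illustrative) yields `s3://example-bucket/repo/_lakefs/exported/my-branch/abc123/my-table`.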

### Flow

1. Execute the Delta Lake catalog exporter procedure and retrieve the path to the exported data.
2. For each table name configured for this hook, e.g. `['my-table', 'my-other-table']`, create or replace an external
table within the Unity catalog (provided in the hook's configuration) and schema (the branch name), using the field
names and data types specified in `_lakefs_tables/my-table.yaml` and `_lakefs_tables/my-other-table.yaml` (see the
sketch below).

Once the above hook run completes successfully, the tables can be read from the Databricks Unity catalog-backed service.
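
Step 2 above could look roughly like the following, assuming the Databricks Go SDK's schema and statement-execution
APIs (the function, parameter names, and exact DDL are assumptions, not the final implementation):

```go
package unityexport

import (
	"context"
	"fmt"
	"log"

	"github.com/databricks/databricks-sdk-go"
	"github.com/databricks/databricks-sdk-go/service/catalog"
	"github.com/databricks/databricks-sdk-go/service/sql"
)

// registerExportedTable creates (or replaces) an external table in Unity catalog over the
// location the Delta Lake catalog exporter wrote to. catalogName and warehouseID come from
// the hook's configuration; schemaName is the lakeFS branch name.
func registerExportedTable(ctx context.Context, w *databricks.WorkspaceClient,
	warehouseID, catalogName, schemaName, tableName, exportedLocation string) error {
	// Ensure a schema named after the branch exists under the user-provided catalog.
	if _, err := w.Schemas.Create(ctx, catalog.CreateSchema{
		CatalogName: catalogName,
		Name:        schemaName,
	}); err != nil {
		// The schema may already exist from a previous export of the same branch; a real
		// implementation would distinguish "already exists" from other failures.
		log.Printf("create schema %s.%s: %v", catalogName, schemaName, err)
	}

	// Register the exported Delta table as an external table, via the SQL warehouse the
	// service principal is allowed to use.
	stmt := fmt.Sprintf("CREATE OR REPLACE TABLE %s.%s.%s USING DELTA LOCATION '%s'",
		catalogName, schemaName, tableName, exportedLocation)
	res, err := w.StatementExecution.ExecuteStatement(ctx, sql.ExecuteStatementRequest{
		WarehouseId: warehouseID,
		Statement:   stmt,
		WaitTimeout: "30s",
	})
	if err != nil {
		return err
	}
	if res.Status != nil && res.Status.State == sql.StatementStateFailed {
		return fmt.Errorf("register table %s: %v", tableName, res.Status.Error)
	}
	return nil
}
```

Using the branch name as the schema keeps exports from different branches isolated under the same user-provided
catalog.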

## Misc.

- Authentication with Databricks will require a [service principal](https://docs.databricks.com/en/dev-tools/service-principals.html)
and an associated [token](https://docs.databricks.com/en/dev-tools/service-principals.html#step-4-generate-a-databricks-personal-access-token-for-the-databricks-service-principal)
to be provided in the hook's configuration (see the sketch after this list). The service principal requires the following setup:
  - `Service principal: Manager` permission over itself: Admin console -> Service principals -> `<service principal>` ->
  Permissions -> Grant access (`<service principal>`: `Service principal: Manager`).
  - `Workspace access` and `Databricks SQL access` checked: Admin console -> Service principals -> `<service principal>` -> Configurations.
  - A SQL warehouse that will run the `CREATE EXTERNAL TABLE` queries and allows the service principal to use it:
  SQL Warehouses -> `<SQL warehouse>` -> Permissions -> `<service principal>`: `Can use`.
  - A catalog that grants the service principal permission to use it and to create and use schemas within it:
  Catalog -> `<catalog name>` -> Permissions -> Grant -> `<service principal>`: `USE CATALOG`, `USE SCHEMA`, `CREATE SCHEMA`.
  - A schema with `CREATE TABLE` permission for the service principal: Catalog -> `<catalog name>` -> `<schema name>` ->
  Permissions -> Grant -> `<service principal>`: `CREATE TABLE`.
  - An **External Location** (Catalog -> External Data -> External Locations -> Create location) for the table's bucket
  (lakeFS's bucket), with `CREATE EXTERNAL TABLE` permission for the service principal.
- Users will supply an existing catalog, under which the schema and tables will be created using the [Databricks Go SDK](https://docs.databricks.com/en/dev-tools/sdk-go.html).
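
As a rough sketch of how the hook could authenticate with the service principal's token through the Go SDK (host and
token values are placeholders, and the SDK supports other credential flows as well):

```go
package main

import (
	"context"
	"log"

	"github.com/databricks/databricks-sdk-go"
)

func main() {
	// Host and token would come from the hook's configuration; the token is the personal
	// access token generated for the service principal.
	w, err := databricks.NewWorkspaceClient(&databricks.Config{
		Host:  "https://dbc-example.cloud.databricks.com", // placeholder workspace URL
		Token: "<service-principal-token>",                // placeholder token
	})
	if err != nil {
		log.Fatal(err)
	}

	// Sanity-check the credentials by asking the workspace who we are.
	me, err := w.CurrentUser.Me(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("authenticated as %s", me.UserName)
}
```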