---
title: Unity Catalog
description: Accessing lakeFS-exported Delta Lake tables from Unity Catalog.
parent: Integrations
redirect_from:
  - /using/unity_catalog
  - /integrations/unity_catalog
---

# Using lakeFS with the Unity Catalog

{% include toc_2-3.html %}

## Overview

Databricks Unity Catalog serves as a centralized data governance platform for your data lakes.
Through the Unity Catalog, you can search for and locate data assets across workspaces via a unified catalog.
Leveraging the external tables feature within Unity Catalog, you can register a Delta Lake table exported from lakeFS and
access it through the unified catalog.
The following step-by-step guide walks you through configuring a [Lua hook]({% link howto/hooks/lua.md %})
that exports Delta Lake tables from lakeFS and then registers them in Unity Catalog.

{: .note}
> Currently, the Unity Catalog export feature exclusively supports AWS S3 as the underlying storage solution. It's planned to [support other cloud providers soon](https://github.com/treeverse/lakeFS/issues/7199).

## Prerequisites

Before starting, ensure you have the following:

1. Access to Unity Catalog.
2. An active lakeFS installation with S3 as the backing storage, and a repository in this installation.
3. A Databricks SQL warehouse.
4. AWS credentials with S3 access.
5. lakeFS credentials with access to your Delta tables.

{: .note}
> Supported from lakeFS v1.4.0

### Databricks authentication

Since the hook will ultimately register a table in Unity Catalog, it must authenticate with Databricks.
Make sure that:

1. You have a Databricks [Service Principal](https://docs.databricks.com/en/dev-tools/service-principals.html).
2. The service principal has [token usage permissions](https://docs.databricks.com/en/dev-tools/service-principals.html#step-3-assign-workspace-level-permissions-to-the-databricks-service-principal),
   and an associated [token](https://docs.databricks.com/en/dev-tools/service-principals.html#step-4-generate-a-databricks-personal-access-token-for-the-databricks-service-principal)
   configured.
3. The service principal has the `Service principal: Manager` privilege over itself (Workspace: Admin console -> Service principals -> `<service principal>` -> Permissions -> Grant access (`<service principal>`:
   `Service principal: Manager`)), with `Workspace access` and `Databricks SQL access` checked (Admin console -> Service principals -> `<service principal>` -> Configurations).
4. Your SQL warehouse allows the service principal to use it (SQL Warehouses -> `<SQL warehouse>` -> Permissions -> `<service principal>`: `Can use`).
5. The catalog grants the `USE CATALOG`, `USE SCHEMA`, and `CREATE SCHEMA` permissions to the service principal (Catalog -> `<catalog name>` -> Permissions -> Grant -> `<service principal>`: `USE CATALOG`, `USE SCHEMA`, `CREATE SCHEMA`). These grants can also be applied with SQL, as sketched after this list.
6. You have an _External Location_ configured, and the service principal has the `CREATE EXTERNAL TABLE` permission over it (Catalog -> External Data -> External Locations -> Create location).
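If you prefer SQL to the UI for the grants in steps 5 and 6, the following is a minimal sketch using the `databricks-sql-connector` Python package (an assumption, it is not part of this guide). It must run as a principal that is allowed to grant these privileges, and every `<...>` value is a hypothetical placeholder:

```python
# Hedged sketch: apply the grants from steps 5-6 with SQL via databricks-sql-connector.
# Run as a workspace/catalog admin; all <...> values are placeholders to replace.
from databricks import sql

GRANTS = [
    # Step 5: catalog-level privileges for the service principal (identified by its application ID).
    "GRANT USE CATALOG, USE SCHEMA, CREATE SCHEMA ON CATALOG `my-catalog-name` "
    "TO `<service-principal-application-id>`",
    # Step 6: allow the service principal to create external tables under the external location.
    "GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION `<external-location-name>` "
    "TO `<service-principal-application-id>`",
]

with sql.connect(
    server_hostname="<databricks-workspace-hostname>",  # e.g. dbc-xxxxxxxx-xxxx.cloud.databricks.com
    http_path="/sql/1.0/warehouses/<warehouse-id>",      # HTTP path of your SQL warehouse
    access_token="<admin-personal-access-token>",        # token of a principal that can grant these privileges
) as connection:
    with connection.cursor() as cursor:
        for statement in GRANTS:
            cursor.execute(statement)
```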
## Guide

### Table descriptor definition

To guide the Unity Catalog exporter in configuring the table in the catalog, define its properties in the Delta Lake table descriptor.
The table descriptor should include (at minimum) the following fields:
1. `name`: The table name.
2. `type`: Should be `delta`.
3. `catalog`: The name of the catalog in which the table will be created.
4. `path`: The path in lakeFS (starting from the root of the branch) in which the Delta Lake table's data is found.

Let's define the table descriptor and upload it to lakeFS:

Save the following as `famous-people-td.yaml`:

```yaml
---
name: famous_people
type: delta
catalog: my-catalog-name
path: tables/famous-people
```

{: .note}
> It's recommended to create a Unity catalog with the same name as your repository.

Upload the table descriptor to `_lakefs_tables/famous-people-td.yaml` and commit:

```bash
lakectl fs upload lakefs://repo/main/_lakefs_tables/famous-people-td.yaml -s ./famous-people-td.yaml && \
lakectl commit lakefs://repo/main -m "add famous people table descriptor"
```

### Write some data

Insert data into the table path, using your preferred method (e.g. [Spark]({% link integrations/spark.md %})), and commit upon completion.

We shall use Spark and lakeFS's S3 gateway to write some data as a Delta table:

```bash
pyspark --packages "io.delta:delta-spark_2.12:3.0.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262" \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider='org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider' \
  --conf spark.hadoop.fs.s3a.endpoint='<LAKEFS_SERVER_URL>' \
  --conf spark.hadoop.fs.s3a.access.key='<LAKEFS_ACCESS_KEY>' \
  --conf spark.hadoop.fs.s3a.secret.key='<LAKEFS_SECRET_ACCESS_KEY>' \
  --conf spark.hadoop.fs.s3a.path.style.access=true
```

```python
data = [
    ('James','Bond','England','intelligence'),
    ('Robbie','Williams','England','music'),
    ('Hulk','Hogan','USA','entertainment'),
    ('Mister','T','USA','entertainment'),
    ('Rafael','Nadal','Spain','professional athlete'),
    ('Paul','Haver','Belgium','music'),
]
columns = ["firstname","lastname","country","category"]
df = spark.createDataFrame(data=data, schema=columns)
df.write.format("delta").mode("overwrite").partitionBy("category", "country").save("s3a://repo/main/tables/famous-people")
```
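Before committing, it can help to read the table back through the same pyspark session to confirm the Delta write landed where the table descriptor expects it. This check is optional and not part of the original steps:

```python
# Optional sanity check in the same pyspark session: read the Delta table back
# through the lakeFS S3 gateway and inspect a few rows before committing.
check_df = spark.read.format("delta").load("s3a://repo/main/tables/famous-people")
check_df.show(truncate=False)
print("row count:", check_df.count())
```

Once the data looks right, commit it (for example with `lakectl commit lakefs://repo/main -m "add famous people data"`), as instructed above ("commit upon completion").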
### The Unity Catalog exporter script

{: .note}
> For code references, check the [delta_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportdelta_exporter) and
[unity_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportunity_exporter) docs.

Create `unity_exporter.lua`:

```lua
local aws = require("aws")
local formats = require("formats")
local databricks = require("databricks")
local delta_export = require("lakefs/catalogexport/delta_exporter")
local unity_export = require("lakefs/catalogexport/unity_exporter")

local sc = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)

-- Export the Delta Lake tables:
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.region)
local delta_table_details = delta_export.export_delta_log(action, args.table_defs, sc.put_object, delta_client, "_lakefs_tables")

-- Register the exported tables in Unity Catalog:
local databricks_client = databricks.client(args.databricks_host, args.databricks_token)
local registration_statuses = unity_export.register_tables(action, "_lakefs_tables", delta_table_details, databricks_client, args.warehouse_id)

for t, status in pairs(registration_statuses) do
    print("Unity catalog registration for table \"" .. t .. "\" completed with commit schema status : " .. status .. "\n")
end
```

Upload the Lua script to the `main` branch under `scripts/unity_exporter.lua` and commit:

```bash
lakectl fs upload lakefs://repo/main/scripts/unity_exporter.lua -s ./unity_exporter.lua && \
lakectl commit lakefs://repo/main -m "upload unity exporter script"
```

### Action configuration

Define an action configuration that will run the above script after a commit is completed (`post-commit`) over the `main` branch.

Create `unity_exports_action.yaml`:

```yaml
---
name: unity_exports
on:
  post-commit:
    branches: ["main"]
hooks:
  - id: unity_export
    type: lua
    properties:
      script_path: scripts/unity_exporter.lua
      args:
        aws:
          access_key_id: <AWS_ACCESS_KEY_ID>
          secret_access_key: <AWS_SECRET_ACCESS_KEY>
          region: <AWS_REGION>
        lakefs: # provide credentials of a user that has access to the script and the Delta table
          access_key_id: <LAKEFS_ACCESS_KEY_ID>
          secret_access_key: <LAKEFS_SECRET_ACCESS_KEY>
        table_defs: # an array of table descriptor names whose tables will be registered in Unity Catalog
          - famous-people-td
        databricks_host: <DATABRICKS_HOST_URL>
        databricks_token: <DATABRICKS_SERVICE_PRINCIPAL_TOKEN>
        warehouse_id: <WAREHOUSE_ID>
```

Upload the action configuration to `_lakefs_actions/unity_exports_action.yaml` and commit:

{: .note}
> Once the commit finishes, the action will start running, since we've configured it to run on `post-commit`
events on the `main` branch.

```bash
lakectl fs upload lakefs://repo/main/_lakefs_actions/unity_exports_action.yaml -s ./unity_exports_action.yaml && \
lakectl commit lakefs://repo/main -m "upload action and run it"
```

The action has run and exported the `famous_people` Delta Lake table to the repo's storage namespace, and has registered
the table as an external table in Unity Catalog under the catalog `my-catalog-name`, schema `main` (named after the branch), and
table name `famous_people`: `my-catalog-name.main.famous_people`.
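To double-check the registration described above, one option is a short metadata lookup with the Databricks SDK for Python. This is a hedged sketch, assuming the `databricks-sdk` package (not used elsewhere in this guide) and the same host and token values passed to the hook:

```python
# Hedged sketch: confirm the exported table was registered, using the Databricks SDK for Python.
# <DATABRICKS_HOST_URL> and the token are the same values used in the action args above.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(host="<DATABRICKS_HOST_URL>", token="<DATABRICKS_SERVICE_PRINCIPAL_TOKEN>")

table = w.tables.get(full_name="my-catalog-name.main.famous_people")
print(table.table_type)        # expected: EXTERNAL
print(table.storage_location)  # S3 location of the exported Delta table in the repo's storage namespace
```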
### Databricks integration

After registering the table in Unity Catalog, you can use your preferred method to [query the data](https://docs.databricks.com/en/query/index.html)
from the exported table under `my-catalog-name.main.famous_people`, view it in the Databricks Catalog Explorer, or
retrieve it using the Databricks CLI with the following command:

```bash
databricks tables get my-catalog-name.main.famous_people
```
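For a programmatic query outside a notebook, here is a minimal sketch using the `databricks-sql-connector` Python package (an assumption, not required by the guide). The `<...>` values are placeholders, and the token must belong to a principal that can read the table and use the warehouse:

```python
# Hedged sketch: query the registered table through a SQL warehouse with databricks-sql-connector.
from databricks import sql

with sql.connect(
    server_hostname="<databricks-workspace-hostname>",   # placeholder, e.g. dbc-xxxxxxxx-xxxx.cloud.databricks.com
    http_path="/sql/1.0/warehouses/<WAREHOUSE_ID>",       # the same warehouse used by the hook
    access_token="<DATABRICKS_TOKEN>",                    # a token allowed to read the table and use the warehouse
) as connection:
    with connection.cursor() as cursor:
        # Backticks quote the hyphenated catalog name.
        cursor.execute(
            "SELECT firstname, lastname, category "
            "FROM `my-catalog-name`.main.famous_people ORDER BY lastname"
        )
        for row in cursor.fetchall():
            print(row)
```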