---
title: Data Catalogs Export
description: This section explains how lakeFS can integrate with external Data Catalogs via metastore update operations.
parent: How-To
---

# Data Catalogs Export

{% include toc_2-3.html %}

## About Data Catalogs Export

Data Catalog Export is all about integrating query engines (like Spark, AWS Athena, Presto, etc.) with lakeFS.

Data Catalogs (such as Hive Metastore or AWS Glue) store metadata for services (such as Spark, Trino and Athena). They contain metadata such as the location of the table, information about columns, partitions and much more.

With Data Catalog Exports, one can leverage the versioning capabilities of lakeFS in external data warehouses and query engines to access tables with branches and commits.

At the end of this guide, you will be able to query lakeFS data from Athena, Trino and other catalog-dependent tools:

```sql
USE main;
USE my_branch; -- any branch
USE v101;      -- or tag

SELECT * FROM users
INNER JOIN events
ON users.id = events.user_id; -- the SQL stays the same; branches and tags exist as schemas
```

## How it works

Several well-known formats exist today that let you export existing tables in lakeFS into a "native" object store representation
which does not require copying the data outside of lakeFS.

These are metadata representations and can be applied automatically through hooks.

### Table Declaration

After creating a lakeFS repository, configure tables as table descriptor objects on the repository under the path `_lakefs_tables/TABLE.yaml`.
Note: the Glue exporter can currently only export tables of `type: hive`. We expect to add more types in the future.

#### Hive tables

Hive metastore tables are essentially just a set of objects that share a prefix, with no table metadata stored on the object store. You need to configure the prefix, partitions, and schema.

```yaml
name: animals
type: hive
path: path/to/animals/
partition_columns: ['year']
schema:
  type: struct
  fields:
    - name: year
      type: integer
      nullable: false
      metadata: {}
    - name: page
      type: string
      nullable: false
      metadata: {}
    - name: site
      type: string
      nullable: true
      metadata:
        comment: a comment about this column
```

Useful types recognized by Hive include `integer`, `long`, `short`, `string`, `double`, `float`, `date`, and `timestamp`.
{: .note }

### Catalog Exporters

Exporters are code packages accessible through [Lua integration]({% link howto/hooks/lua.md %}#lua-library-reference). Each exporter is exposed as a Lua function under the package namespace `lakefs/catalogexport`. Call them from hooks to connect lakeFS tables to various catalogs.
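For example, a minimal hook script invoking the symlink exporter might look like the sketch below. It assumes AWS credentials are passed to the hook via `args` (the `args.aws.*` and `args.table_source` key names are illustrative; consult the Lua library reference for the exact function signatures):

```lua
-- scripts/symlink_export.lua (illustrative name)
local aws = require("aws")
local exporter = require("lakefs/catalogexport/symlink_exporter")

-- an S3 client built from credentials supplied as hook args (assumed key names)
local s3 = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)

-- write symlink files for the table described by the descriptor in _lakefs_tables/
exporter.export_s3(s3, args.table_source, action, {debug=true})
```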
#### Currently supported exporters

| Exporter | Description | Notes |
|:---------|:------------|:------|
| **Symlink exporter** | Writes metadata for the table using Hive's [SymlinkTextInputFormat](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html) | |
| **AWS Glue Catalog (+ Athena) exporter** | Creates a table in Glue using Hive's format and updates the location to symlink files (reuses the Symlink exporter) | See the step-by-step guide on integrating with the [Glue Exporter]({% link integrations/glue_metastore.md %}) |
| **Delta Lake table exporter** | Exports Delta Lake tables from lakeFS to an external storage | |
| **Unity Catalog exporter** | Registers a Delta Lake table in Unity Catalog. It works in conjunction with the Delta Lake exporter: the Delta Lake exporter exports the table from lakeFS, and its result is passed to the Unity Catalog exporter, which registers it in Unity Catalog | See the step-by-step guide on integrating with the [Unity Catalog Exporter]({% link integrations/unity-catalog.md %})<br/>Currently, only AWS S3 storage is supported |

#### Running an Exporter

Exporters are meant to run as [Lua hooks]({% link howto/hooks/lua.md %}).

Configure the action trigger using [events and branches]({% link howto/hooks/index.md %}#action-file-schema). Of course, you can add additional custom filtering logic to the Lua script if needed.
The default name of an exported table is `${repository_id}_${table_name}_${ref_name}_${short_commit}`, where `table_name` is the `name` field from the table's descriptor under `_lakefs_tables/`.

Example of an action that will be triggered when a `post-commit` event happens on the `export_table` branch (a sketch of the referenced `scripts/my_export_script.lua` appears after the tip below):

```yaml
name: Glue Table Exporter
description: export my table to glue
on:
  post-commit:
    branches: ["export_table"]
hooks:
  - id: my_exporter
    type: lua
    properties:
      # exporter script location
      script_path: "scripts/my_export_script.lua"
      args:
        # table descriptor
        table_source: '_lakefs_tables/my_table.yaml'
```

Tip: Actions can be extended to customize any desired behavior, for example validating branch names since they are part of the table name:

```yaml
# _lakefs_actions/validate_branch_name.yaml
name: validate-lower-case-branches
on:
  pre-create-branch:
hooks:
  - id: check_branch_id
    type: lua
    properties:
      script: |
        regexp = require("regexp")
        if not regexp.match("^[a-z0-9\\_\\-]+$", action.branch_id) then
          error("branches must be lower case, invalid branch ID: " .. action.branch_id)
        end
```
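Continuing the example above, `scripts/my_export_script.lua` could call the Glue exporter. The sketch below assumes the hook's `args` also carry AWS credentials, a Glue database name, and a Glue `table_input` payload (all key names under `args` are illustrative; verify the `export_glue` signature against the Lua library reference):

```lua
-- scripts/my_export_script.lua (sketch)
local aws = require("aws")
local exporter = require("lakefs/catalogexport/glue_exporter")

-- a Glue client built from credentials supplied as hook args (assumed key names)
local glue = aws.glue_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)

-- create or update the Glue table for the descriptor named in table_source;
-- args.catalog.table_input is assumed to hold a Glue TableInput-shaped map
exporter.export_glue(glue, args.catalog.db_name, args.table_source, args.catalog.table_input, action, {debug=true})
```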
### Flow

The following diagram shows what happens when a lakeFS Action triggers a Lua hook that calls an exporter.

```mermaid
sequenceDiagram
    note over Lua Hook: lakeFS Action trigger. <br> Pass Context for the export.
    Lua Hook->>Exporter: export request
    note over Table Registry: _lakefs_tables/TABLE.yaml
    Exporter->>Table Registry: Get table descriptor
    Table Registry->>Exporter: Parse table structure
    Exporter->>Object Store: materialize an exported table
    Exporter->>Catalog: register object store location
    Query Engine-->Catalog: Query
    Query Engine-->Object Store: Query
```
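As a concrete instance of this flow, the sketch below uses the Delta Lake table exporter: it fetches the table descriptors, materializes the Delta log in external storage, and returns the exported locations. Every key under `args` is an illustrative assumption, and the `export_delta_log` call is based on the Lua library reference; verify the exact signature against your lakeFS version:

```lua
-- scripts/delta_export.lua (sketch)
local aws = require("aws")
local formats = require("formats")
local delta_exporter = require("lakefs/catalogexport/delta_exporter")

-- where the table descriptors live in the repository
local table_descriptors_path = "_lakefs_tables"

-- S3 client for writing the exported Delta log (assumed arg keys)
local sc = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)
-- Delta client that reads the table's log through lakeFS (assumed arg keys)
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.region)

-- export the Delta log for each table listed in args.table_defs
local delta_table_details = delta_exporter.export_delta_log(action, args.table_defs, sc.put_object, delta_client, table_descriptors_path)

for t, details in pairs(delta_table_details) do
  print("Exported Delta table " .. t .. " to: " .. details["path"])
end
```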