
---
title: Data Catalogs Export
description: This section explains how lakeFS can integrate with external Data Catalogs via metastore update operations.
parent: How-To
---

# Data Catalogs Export

{% include toc_2-3.html %}

## About Data Catalogs Export

Data Catalog Export integrates query engines (such as Spark, AWS Athena, and Presto) with lakeFS.

Data Catalogs (such as Hive Metastore or AWS Glue) store metadata for services (such as Spark, Trino and Athena). They contain metadata such as the location of the table, information about columns, partitions and much more.

With Data Catalog Exports, you can leverage the versioning capabilities of lakeFS in external data warehouses and query engines to access tables with branches and commits.

At the end of this guide, you will be able to query lakeFS data from Athena, Trino and other catalog-dependent tools:

```sql
USE main;
USE my_branch; -- any branch
USE v101; -- or tag

SELECT * FROM users
INNER JOIN events
ON users.id = events.user_id; -- SQL stays the same, branches and tags are exposed as schemas
```

## How it works

Several well-known formats exist today that let you export existing tables in lakeFS into a "native" object store representation
which does not require copying the data outside of lakeFS.

These are metadata representations and can be applied automatically through hooks.

### Table Declaration

After creating a lakeFS repository, configure tables as table descriptor objects on the repository under the path `_lakefs_tables/TABLE.yaml`.
Note: the Glue exporter can currently only export tables of `type: hive`. We expect to add more.

#### Hive tables

Hive metastore tables are essentially just a set of objects that share a prefix, with no table metadata stored on the object store. You need to configure the prefix, partitions, and schema.

```yaml
name: animals
type: hive
path: path/to/animals/
partition_columns: ['year']
schema:
  type: struct
  fields:
    - name: year
      type: integer
      nullable: false
      metadata: {}
    - name: page
      type: string
      nullable: false
      metadata: {}
    - name: site
      type: string
      nullable: true
      metadata:
        comment: a comment about this column
```

Useful types recognized by Hive include `integer`, `long`, `short`, `string`, `double`, `float`, `date`, and `timestamp`.
{: .note }

### Catalog Exporters

Exporters are code packages accessible through [Lua integration]({% link howto/hooks/lua.md %}#lua-library-reference). Each exporter is exposed as a Lua function under the package namespace `lakefs/catalogexport`. Call them from hooks to connect lakeFS tables to various catalogs.
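
For example, a hook script loads an exporter with `require`. The module paths below follow the `lakefs/catalogexport` namespace and correspond to the exporters listed below, but treat them as illustrative and check the [Lua library reference]({% link howto/hooks/lua.md %}#lua-library-reference) for the exact names:

```lua
-- Illustrative: exporter packages live under the lakefs/catalogexport namespace.
local symlink_exporter = require("lakefs/catalogexport/symlink_exporter")
local glue_exporter = require("lakefs/catalogexport/glue_exporter")
local delta_exporter = require("lakefs/catalogexport/delta_exporter")
```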

#### Currently supported exporters

| Exporter | Description | Notes |
|:---------|:------------|:------|
| **Symlink exporter** | Writes metadata for the table using Hive's [SymlinkTextInputFormat](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html) | |
| **AWS Glue Catalog (+ Athena) exporter** | Creates a table in Glue using Hive's format and updates the location to symlink files (reuses the Symlink exporter). | See a step-by-step guide on how to integrate with the [Glue Exporter]({% link integrations/glue_metastore.md %}) |
| **Delta Lake table exporter** | Exports Delta Lake tables from lakeFS to external storage. | |
| **Unity Catalog exporter** | Registers a Delta Lake table in Unity Catalog. It works in conjunction with the Delta Lake exporter: the Delta Lake exporter exports a Delta Lake table from lakeFS, and its output is then passed to the Unity Catalog exporter, which registers it in Unity Catalog. | See a step-by-step guide on how to integrate with the [Unity Catalog Exporter]({% link integrations/unity-catalog.md %})<br/>Currently, only AWS S3 storage is supported |

#### Running an Exporter

Exporters are meant to run as [Lua hooks]({% link howto/hooks/lua.md %}).

Configure the action trigger using [events and branches]({% link howto/hooks/index.md %}#action-file-schema). Of course, you can add additional custom filtering logic to the Lua script if needed.
The default table name when exported is `${repository_id}_${name}_${ref_name}_${short_commit}`, where `name` is the `name` field from the table descriptor in `_lakefs_tables/TABLE.yaml`.
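
For example, a table declared with `name: animals` in a repository named `example-repo`, exported from branch `main` at short commit `abc123de` (illustrative values), would be named `example-repo_animals_main_abc123de`.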

Example of an action that will be triggered when a `post-commit` event happens in the `export_table` branch:

```yaml
name: Glue Table Exporter
description: export my table to glue
on:
  post-commit:
    branches: ["export_table"]
hooks:
  - id: my_exporter
    type: lua
    properties:
      # exporter script location
      script_path: "scripts/my_export_script.lua"
      args:
        # table descriptor
        table_source: '_lakefs_tables/my_table.yaml'
```
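
The action above references `scripts/my_export_script.lua` without showing its contents. A minimal sketch of such a script, using the Symlink exporter, is shown below; it assumes AWS credentials are passed as additional hook `args` (not present in the action above), and the exact module path and function signatures should be verified against the [Lua library reference]({% link howto/hooks/lua.md %}#lua-library-reference):

```lua
-- scripts/my_export_script.lua (sketch)
-- `args` holds the hook properties from the action file; `action` holds the run context
-- (repository, ref, commit) that lakeFS passes to every Lua hook.
local aws = require("aws")
local symlink_exporter = require("lakefs/catalogexport/symlink_exporter")

-- Assumes the action's args also include an `aws` section with credentials (not shown above).
local s3 = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)

-- Write SymlinkTextInputFormat metadata for the table described by the table descriptor.
symlink_exporter.export_s3(s3, args.table_source, action, {debug=true})
```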

Tip: Actions can be extended to customize any desired behavior, for example validating branch names since they are part of the table name:

```yaml
# _lakefs_actions/validate_branch_name.yaml
name: validate-lower-case-branches
on:
  pre-create-branch:
hooks:
  - id: check_branch_id
    type: lua
    properties:
      script: |
        regexp = require("regexp")
        if not regexp.match("^[a-z0-9\\_\\-]+$", action.branch_id) then
          error("branches must be lower case, invalid branch ID: " .. action.branch_id)
        end
```

### Flow

The following diagram demonstrates what happens when a lakeFS Action triggers a Lua hook that calls an exporter.

```mermaid
sequenceDiagram
    note over Lua Hook: lakeFS Action trigger. <br> Pass Context for the export.
    Lua Hook->>Exporter: export request
    note over Table Registry: _lakefs_tables/TABLE.yaml
    Exporter->>Table Registry: Get table descriptor
    Table Registry->>Exporter: Parse table structure
    Exporter->>Object Store: materialize an exported table
    Exporter->>Catalog: register object store location
    Query Engine-->Catalog: Query
    Query Engine-->Object Store: Query
```