---
title: Glue Data Catalog
description: Query data from lakeFS branches in AWS Athena or other services backed by Glue Data Catalog.
parent: Integrations
redirect_from: /using/glue_metastore.html
---


# Using lakeFS with the Glue Catalog

{% include toc_2-3.html %}

## Overview

The integration between Glue and lakeFS is based on [Data Catalog Exports]({% link howto/catalog_exports.md %}).

This guide describes how to use lakeFS with the Glue Data Catalog.
You'll be able to query your lakeFS data by specifying the repository, branch and commit in your SQL query.
Currently, only read operations are supported on the tables.
You will set up the automation required to work with lakeFS on top of the Glue Data Catalog, including:
1. Create a table descriptor under `_lakefs_tables/<your-table>.yaml`. This will represent your table schema.
2. Write an exporter script that will:
   * Mirror your branch's state into [Hive Symlink](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html) files readable by Athena.
   * Export the table descriptors from your branch to the Glue Catalog.
3. Set up lakeFS [hooks]({% link howto/catalog_exports.md %}#running-an-exporter) to trigger the above script when specific events occur.

## Example: Using Athena to query lakeFS data

### Prerequisites

Before starting, make sure you have:
1. An active lakeFS installation with S3 as the backing storage, and a repository in this installation.
2. A database in the Glue Data Catalog (lakeFS does not create one; see the example commands after this list).
3. AWS credentials with permissions to manage Glue, run Athena queries, and access S3.

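If you still need to create the repository or the Glue database, the commands below are one way to do it. This is a minimal sketch, assuming `lakectl` and the AWS CLI are configured; the repository name, bucket, and database name are just the example values used throughout this guide:

```bash
# create a lakeFS repository backed by S3 (replace the bucket/prefix placeholder)
lakectl repo create lakefs://catalogs s3://<your-bucket>/catalogs

# create the Glue database that the exporter will register tables in
aws glue create-database --database-input '{"Name": "my-glue-db"}'
```
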
### Add table descriptor

Let's define a table and commit it to lakeFS.
Save the YAML below as `animals.yaml`, then upload and commit it:

```bash
lakectl fs upload lakefs://catalogs/main/_lakefs_tables/animals.yaml -s ./animals.yaml && \
lakectl commit lakefs://catalogs/main -m "added table"
```

```yaml
name: animals
type: hive
# data location root in lakeFS
path: tables/animals
# partitions order
partition_columns: ['type', 'weight']
schema:
  type: struct
  # all the columns spec
  fields:
    - name: type
      type: string
      nullable: true
      metadata:
        comment: axolotl, cat, dog, fish etc
    - name: weight
      type: integer
      nullable: false
      metadata: {}
    - name: name
      type: string
      nullable: false
      metadata: {}
```

### Write some table data

Insert data into the table path, using your preferred method (e.g. [Spark]({% link integrations/spark.md %})), and commit upon completion.
This example uses CSV files, and the files added to lakeFS should look like this:

![lakeFS Uploaded CSV Files]({{ site.baseurl }}/assets/img/csv_export_hooks_data.png)

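For instance, you could upload a few partitioned CSV files with `lakectl`. This is an illustrative sketch: the file names are hypothetical, and the `type=<...>/weight=<...>` layout matches the `partition_columns` declared in `animals.yaml`:

```bash
# upload CSV files under the table path, partitioned by type and weight
lakectl fs upload lakefs://catalogs/main/tables/animals/type=cat/weight=4/cats.csv -s ./cats.csv
lakectl fs upload lakefs://catalogs/main/tables/animals/type=dog/weight=12/dogs.csv -s ./dogs.csv
lakectl commit lakefs://catalogs/main -m "added animal data"
```
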
### The exporter script

Upload the following script to your main branch under `scripts/animals_exporter.lua` (or a path of your choice).

{: .note}
> For code references, check the [symlink_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportsymlink_exporter) and [glue_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportglue_exporter) docs.


```lua
local aws = require("aws")
local symlink_exporter = require("lakefs/catalogexport/symlink_exporter")
local glue_exporter = require("lakefs/catalogexport/glue_exporter")
-- settings
local access_key = args.aws.aws_access_key_id
local secret_key = args.aws.aws_secret_access_key
local region = args.aws.aws_region
local table_path = args.table_source -- table descriptor
local db = args.catalog.db_name -- glue db
local table_input = args.catalog.table_input -- table input (AWS input spec) for Glue
-- export symlinks
local s3 = aws.s3_client(access_key, secret_key, region)
local result = symlink_exporter.export_s3(s3, table_path, action, {debug=true})
-- register glue table
local glue = aws.glue_client(access_key, secret_key, region)
local res = glue_exporter.export_glue(glue, db, table_path, table_input, action, {debug=true})
```

### Configure Action Hooks

Hooks are the mechanism that triggers the execution of the exporter.
For more details on configuring exporter hooks, refer to [Running an Exporter]({% link howto/catalog_exports.md %}#running-an-exporter).

{: .note}
> The `args.catalog.table_input` argument in the Lua script is assumed to be passed from the action arguments; that way, the same script can be reused for different tables. See the [example]({% link howto/hooks/lua.md %}#lakefscatalogexportglue_exporterexport_glueglue-db-table_src_path-create_table_input-action_info-options) for constructing the table input in Lua code.


<div class="tabs">
  <ul>
    <li><a href="#single-hook-csv">Hook CSV Glue Table</a></li>
    <li><a href="#single-hook">Hook Parquet Glue Table</a></li>
    <li><a href="#multiple-hooks">Multiple Hooks / Inline script</a></li>
  </ul>
  <div markdown="1" id="single-hook-csv">

#### Single hook with CSV Table

Upload to `_lakefs_actions/animals_glue.yaml`:

```yaml
name: Glue Exporter
on:
  post-commit:
    branches: ["main"]
hooks:
  - id: animals_table_glue_exporter
    type: lua
    properties:
      script_path: "scripts/animals_exporter.lua"
      args:
        aws:
          aws_access_key_id: "<AWS_ACCESS_KEY_ID>"
          aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>"
          aws_region: "<AWS_REGION>"
        table_source: '_lakefs_tables/animals.yaml'
        catalog:
          db_name: "my-glue-db"
          table_input:
            StorageDescriptor:
              InputFormat: "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"
              OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
              SerdeInfo:
                SerializationLibrary: "org.apache.hadoop.hive.serde2.OpenCSVSerde"
                Parameters:
                  separatorChar: ","
            Parameters:
              classification: "csv"
              "skip.header.line.count": "1"
```

  </div>
  <div markdown="1" id="single-hook">

#### Spark Parquet Example

When working with Parquet files, upload the following to `_lakefs_actions/animals_glue.yaml`:

```yaml
name: Glue Exporter
on:
  post-commit:
    branches: ["main"]
hooks:
  - id: animals_table_glue_exporter
    type: lua
    properties:
      script_path: "scripts/animals_exporter.lua"
      args:
        aws:
          aws_access_key_id: "<AWS_ACCESS_KEY_ID>"
          aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>"
          aws_region: "<AWS_REGION>"
        table_source: '_lakefs_tables/animals.yaml'
        catalog:
          db_name: "my-glue-db"
          table_input:
            StorageDescriptor:
              InputFormat: "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"
              OutputFormat: "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"
              SerdeInfo:
                SerializationLibrary: "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            Parameters:
              classification: "parquet"
              EXTERNAL: "TRUE"
              "parquet.compression": "SNAPPY"
```

  </div>
  <div markdown="1" id="multiple-hooks">

#### Multiple Hooks / Inline script

The following example demonstrates how to split the symlink export and the Glue export into separate hooks.
It also shows how to run the Lua script inline instead of from a file, depending on your preference.

```yaml
name: Animal Table Exporter
on:
  post-commit:
    branches: ["main"]
hooks:
  - id: symlink_exporter
    type: lua
    properties:
      args:
        aws:
          aws_access_key_id: "<AWS_ACCESS_KEY_ID>"
          aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>"
          aws_region: "<AWS_REGION>"
        table_source: '_lakefs_tables/animals.yaml'
      script: |
        local exporter = require("lakefs/catalogexport/symlink_exporter")
        local aws = require("aws")
        local table_path = args.table_source
        local s3 = aws.s3_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_key, args.aws.aws_region)
        exporter.export_s3(s3, table_path, action, {debug=true})
  - id: glue_exporter
    type: lua
    properties:
      args:
        aws:
          aws_access_key_id: "<AWS_ACCESS_KEY_ID>"
          aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>"
          aws_region: "<AWS_REGION>"
        table_source: '_lakefs_tables/animals.yaml'
        catalog:
          db_name: "my-glue-db"
          table_input: # add glue table input here
      script: |
        local aws = require("aws")
        local exporter = require("lakefs/catalogexport/glue_exporter")
        local glue = aws.glue_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_key, args.aws.aws_region)
        exporter.export_glue(glue, args.catalog.db_name, args.table_source, args.catalog.table_input, action, {debug=true})
```

  </div>
</div>

Add the script and the action files to the repository and commit them. Since this is a post-commit action, it will run after the commit operation takes place.

```bash
lakectl fs upload lakefs://catalogs/main/scripts/animals_exporter.lua -s ./animals_exporter.lua
lakectl fs upload lakefs://catalogs/main/_lakefs_actions/animals_glue.yaml -s ./animals_glue.yaml
lakectl commit lakefs://catalogs/main -m "trigger first export hook"
```

Once the action has completed, you can review the results in the action logs in the lakeFS UI.

![Hooks log result in lakeFS UI]({{ site.baseurl }}/assets/img/glue_export_hook_result_log.png)
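
You can also inspect hook runs and verify the export from the command line. This is a sketch; the exact `lakectl actions` flags may vary between lakectl versions, so check `lakectl actions runs list --help`:

```bash
# list recent hook runs for the repository's main branch
lakectl actions runs list lakefs://catalogs --branch main

# confirm the exported table was registered in the Glue database
aws glue get-tables --database-name my-glue-db
```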

### Use Athena

We can use the exported Glue table with any tool that supports the Glue Catalog (or is Hive-compatible), such as Athena, Trino, Spark, and others.
To use Athena, run `MSCK REPAIR TABLE` once to load the partitions, and then query the table.
The exported table name combines the table, repository, branch, and a short commit ID (`animals_catalogs_main_9255e5` in this example).

In Athena, make sure that the correct database (`my-glue-db` in the example above) is selected, then run:

```sql
MSCK REPAIR TABLE `animals_catalogs_main_9255e5`; -- load partitions for the first time
SELECT * FROM `animals_catalogs_main_9255e5` LIMIT 50;
```

![Athena SQL Result]({{ site.baseurl }}/assets/img/catalog_export_athena_aws_ui_sql.png)
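
If you prefer to run the same statements outside the Athena console, the AWS CLI can submit them. This is a sketch, assuming you have an S3 location for Athena query results (the bucket below is a placeholder):

```bash
# run MSCK REPAIR TABLE via the Athena API instead of the console
aws athena start-query-execution \
  --query-string 'MSCK REPAIR TABLE `animals_catalogs_main_9255e5`' \
  --query-execution-context Database=my-glue-db \
  --result-configuration OutputLocation=s3://<your-athena-results-bucket>/
```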

### Cleanup

You can use additional hooks or actions to implement custom cleanup logic that deletes the symlink files in S3 and the exported Glue tables.

```lua
-- `glue` and `s3` are clients created the same way as in the exporter script above;
-- the table name and symlink prefix are placeholders for your own values
glue.delete_table(db, '<glue table name>')
s3.delete_recursive('bucket', 'path/to/symlinks/of/a/commit/')
```