github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/integrations/glue_metastore.md (about) 1 --- 2 title: Glue Data Catalog 3 description: Query data from lakeFS branches in AWS Athena or other services backed by Glue Data Catalog. 4 parent: Integrations 5 redirect_from: /using/glue_metastore.html 6 --- 7 8 9 # Using lakeFS with the Glue Catalog 10 11 {% include toc_2-3.html %} 12 13 ## Overview 14 15 The integration between Glue and lakeFS is based on [Data Catalog Exports]({% link howto/catalog_exports.md %}). 16 17 This guide describes how to use lakeFS with the Glue Data Catalog. 18 You'll be able to query your lakeFS data by specifying the repository, branch and commit in your SQL query. 19 Currently, only read operations are supported on the tables. 20 You will set up the automation required to work with lakeFS on top of the Glue Data Catalog, including: 21 1. Create a table descriptor under `_lakefs_tables/<your-table>.yaml`. This will represent your table schema. 22 2. Write an exporter script that will: 23 * Mirror your branch's state into [Hive Symlink](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html) files readable by Athena. 24 * Export the table descriptors from your branch to the Glue Catalog. 25 3. Set up lakeFS [hooks]({% link howto/catalog_exports.md %}#running-an-exporter) to trigger the above script when specific events occur. 26 27 ## Example: Using Athena to query lakeFS data 28 29 ### Prerequisites 30 31 Before starting, make sure you have: 32 1. An active lakeFS installation with S3 as the backing storage, and a repository in this installation. 33 2. A database in Glue Data Catalog (lakeFS does not create one). 34 3. AWS Credentials with permission to manage Glue, Athena Query and S3 access. 35 36 ### Add table descriptor 37 38 Let's define a table, and commit it to lakeFS. 39 Save the YAML below as `animals.yaml` and upload it to lakeFS. 40 41 ```bash 42 lakectl fs upload lakefs://catalogs/main/_lakefs_tables/animals.yaml -s ./animals.yaml && \ 43 lakectl commit lakefs://catalogs/main -m "added table" 44 ``` 45 46 ```yaml 47 name: animals 48 type: hive 49 # data location root in lakeFS 50 path: tables/animals 51 # partitions order 52 partition_columns: ['type', 'weight'] 53 schema: 54 type: struct 55 # all the columns spec 56 fields: 57 - name: type 58 type: string 59 nullable: true 60 metadata: 61 comment: axolotl, cat, dog, fish etc 62 - name: weight 63 type: integer 64 nullable: false 65 metadata: {} 66 - name: name 67 type: string 68 nullable: false 69 metadata: {} 70 ``` 71 72 ### Write some table data 73 74 Insert data into the table path, using your preferred method (e.g. [Spark]({% link integrations/spark.md %})), and commit upon completion. 75 This example uses CSV files, and the files added to lakeFS should look like this: 76 77  78 79 ### The exporter script 80 81 Upload the following script to your main branch under `scripts/animals_exporter.lua` (or a path of your choice). 82 83 {: .note} 84 > For code references check [symlink_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportsymlink_exporter) and [glue_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportglue_exporter) docs. 85 86 87 ```lua 88 local aws = require("aws") 89 local symlink_exporter = require("lakefs/catalogexport/symlink_exporter") 90 local glue_exporter = require("lakefs/catalogexport/glue_exporter") 91 -- settings 92 local access_key = args.aws.aws_access_key_id 93 local secret_key = args.aws.aws_secret_access_key 94 local region = args.aws.aws_region 95 local table_path = args.table_source -- table descriptor 96 local db = args.catalog.db_name -- glue db 97 local table_input = args.catalog.table_input -- table input (AWS input spec) for Glue 98 -- export symlinks 99 local s3 = aws.s3_client(access_key, secret_key, region) 100 local result = symlink_exporter.export_s3(s3, table_path, action, {debug=true}) 101 -- register glue table 102 local glue = aws.glue_client(access_key, secret_key, region) 103 local res = glue_exporter.export_glue(glue, db, table_path, table_input, action, {debug=true}) 104 ``` 105 106 ### Configure Action Hooks 107 108 Hooks serve as the mechanism that triggers the execution of the exporter. 109 For more detailed information on how to configure exporter hooks, you can refer to [Running an Exporter]({% link howto/catalog_exports.md %}#running-an-exporter). 110 111 {: .note} 112 > The `args.catalog.table_input` argument in the Lua script is assumed to be passed from the action arguments, that way the same script can be reused for different tables. Check the [example]({% link howto/hooks/lua.md %}#lakefscatalogexportglue_exporterexport_glueglue-db-table_src_path-create_table_input-action_info-options) to construct the table input in the lua code. 113 114 115 <div class="tabs"> 116 <ul> 117 <li><a href="#single-hook-csv">Hook CSV Glue Table</a></li> 118 <li><a href="#single-hook">Hook Parquet Glue Table</a></li> 119 <li><a href="#multiple-hooks">Multiple Hooks / Inline script</a></li> 120 </ul> 121 <div markdown="1" id="single-hook-csv"> 122 123 #### Single hook with CSV Table 124 125 Upload to `_lakefs_actions/animals_glue.yaml`: 126 127 ```yaml 128 name: Glue Exporter 129 on: 130 post-commit: 131 branches: ["main"] 132 hooks: 133 - id: animals_table_glue_exporter 134 type: lua 135 properties: 136 script_path: "scripts/animals_exporter.lua" 137 args: 138 aws: 139 aws_access_key_id: "<AWS_ACCESS_KEY_ID>" 140 aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>" 141 aws_region: "<AWS_REGION>" 142 table_source: '_lakefs_tables/animals.yaml' 143 catalog: 144 db_name: "my-glue-db" 145 table_input: 146 StorageDescriptor: 147 InputFormat: "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat" 148 OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" 149 SerdeInfo: 150 SerializationLibrary: "org.apache.hadoop.hive.serde2.OpenCSVSerde" 151 Parameters: 152 separatorChar: "," 153 Parameters: 154 classification: "csv" 155 "skip.header.line.count": "1" 156 ``` 157 158 </div> 159 <div markdown="1" id="single-hook"> 160 #### Spark Parquet Example 161 162 When working with Parquet files, upload the following to `_lakefs_actions/animals_glue.yaml`: 163 164 ```yaml 165 name: Glue Exporter 166 on: 167 post-commit: 168 branches: ["main"] 169 hooks: 170 - id: animals_table_glue_exporter 171 type: lua 172 properties: 173 script_path: "scripts/animals_exporter.lua" 174 args: 175 aws: 176 aws_access_key_id: "<AWS_ACCESS_KEY_ID>" 177 aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>" 178 aws_region: "<AWS_REGION>" 179 table_source: '_lakefs_tables/animals.yaml' 180 catalog: 181 db_name: "my-glue-db" 182 table_input: 183 StorageDescriptor: 184 InputFormat: "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat" 185 OutputFormat: "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat" 186 SerdeInfo: 187 SerializationLibrary: "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe" 188 Parameters: 189 classification: "parquet" 190 EXTERNAL: "TRUE" 191 "parquet.compression": "SNAPPY" 192 ``` 193 194 </div> 195 <div markdown="1" id="multiple-hooks"> 196 #### Multiple Hooks / Inline script 197 198 The following example demonstrates how to separate the symlink and glue exporter into building blocks running in separate hooks. 199 It also shows how to run the lua script inline instead of a file, depending on user preference. 200 201 ```yaml 202 name: Animal Table Exporter 203 on: 204 post-commit: 205 branches: ["main"] 206 hooks: 207 - id: symlink_exporter 208 type: lua 209 properties: 210 args: 211 aws: 212 aws_access_key_id: "<AWS_ACCESS_KEY_ID>" 213 aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>" 214 aws_region: "<AWS_REGION>" 215 table_source: '_lakefs_tables/animals.yaml' 216 script: | 217 local exporter = require("lakefs/catalogexport/symlink_exporter") 218 local aws = require("aws") 219 local table_path = args.table_source 220 local s3 = aws.s3_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_key, args.aws.aws_region) 221 exporter.export_s3(s3, table_path, action, {debug=true}) 222 - id: glue_exporter 223 type: lua 224 properties: 225 args: 226 aws: 227 aws_access_key_id: "<AWS_ACCESS_KEY_ID>" 228 aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>" 229 aws_region: "<AWS_REGION>" 230 table_source: '_lakefs_tables/animals.yaml' 231 catalog: 232 db_name: "my-glue-db" 233 table_input: # add glue table input here 234 script: | 235 local aws = require("aws") 236 local exporter = require("lakefs/catalogexport/glue_exporter") 237 local glue = aws.glue_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_key, args.aws.aws_region) 238 exporter.export_glue(glue, args.catalog.db_name, args.table_source, args.catalog.table_input, action, {debug=true}) 239 ``` 240 241 </div> 242 </div> 243 244 Adding the script and the action files to the repository and commit it. This is a post-commit action, meaning it will be executed after the commit operation has taken place. 245 246 ```bash 247 lakectl fs upload lakefs://catalogs/main/scripts/animals_exporter.lua -s ./animals_exporter.lua 248 lakectl fs upload lakefs://catalogs/main/_lakefs_actions/animals_glue.yaml -s ./animals_glue.yaml 249 lakectl commit lakefs://catalogs/main -m "trigger first export hook" 250 ``` 251 252 Once the action has completed its execution, you can review the results in the action logs. 253 254  255 256 ### Use Athena 257 258 We can use the exported Glue table with any tool that supports Glue Catalog (or Hive compatible) such as Athena, Trino, Spark and others. 259 To use Athena we can simply run `MSCK REPAIR TABLE` and then query the tables. 260 261 In Athena, make sure that the correct database (`my-glue-db` in the example above) is configured, then run: 262 263 ```sql 264 MSCK REPAIR TABLE `animals_catalogs_main_9255e5`; -- load partitions for the first time 265 SELECT * FROM `animals_catalogs_main_9255e5` limit 50; 266 ``` 267 268  269 270 ### Cleanup 271 272 Users can use additional hooks / actions to implement a custom cleanup logic to delete the symlink in S3 and Glue Tables. 273 274 ```lua 275 glue.delete_table(db, '<glue table name>') 276 s3.delete_recursive('bucket', 'path/to/symlinks/of/a/commit/') 277 ```