---
title: Lua Hooks
parent: Actions and Hooks
grand_parent: How-To
description: Lua Hooks reference
redirect_from:
  - /hooks/lua.html
---

# Lua Hooks

lakeFS supports running hooks without relying on external components, using an [embedded Lua VM](https://github.com/Shopify/go-lua).

Using Lua hooks, it is possible to pass a Lua script to be executed directly by the lakeFS server when an action occurs.

The Lua runtime embedded in lakeFS is limited for security reasons. It provides a narrow set of APIs and functions that by default do not allow:

1. Accessing any of the running lakeFS server's environment
2. Accessing the local filesystem available to the lakeFS process

{% include toc.html %}

## Action File Lua Hook Properties

_See the [Action configuration](./index.md#action-file) for overall configuration schema and details._

| Property      | Description                                | Data Type  | Required                                        | Default Value |
|---------------|--------------------------------------------|------------|-------------------------------------------------|---------------|
| `args`        | One or more arguments to pass to the hook  | Dictionary | false                                           |               |
| `script`      | An inline Lua script                       | String     | either this or `script_path` must be specified  |               |
| `script_path` | The path in lakeFS to a Lua script         | String     | either this or `script` must be specified       |               |

## Example Lua Hooks

For more examples and configuration samples, check out the [examples/hooks/](https://github.com/treeverse/lakeFS/tree/master/examples/hooks) directory in the lakeFS repository. You'll also find step-by-step examples of hooks in action in the [lakeFS samples repository](https://github.com/treeverse/lakeFS-samples/).

### Display information about an event

This example prints out a JSON representation of the event that occurred:

```yaml
name: dump_all
on:
  post-commit:
  post-merge:
  post-create-tag:
  post-create-branch:
hooks:
  - id: dump_event
    type: lua
    properties:
      script: |
        json = require("encoding/json")
        print(json.marshal(action))
```

### Ensure that a commit includes a mandatory metadata field

A more useful example: ensure every commit contains a required metadata field:

```yaml
name: pre commit metadata field check
on:
  pre-commit:
    branches:
      - main
      - dev
hooks:
  - id: ensure_commit_metadata
    type: lua
    properties:
      args:
        notebook_url: {"pattern": "my-jupyter.example.com/.*"}
        spark_version: {}
      script_path: lua_hooks/ensure_metadata_field.lua
```

Lua code at `lakefs://repo/main/lua_hooks/ensure_metadata_field.lua`:

```lua
regexp = require("regexp")
for k, props in pairs(args) do
  current_value = action.commit.metadata[k]
  if current_value == nil then
    error("missing mandatory metadata field: " .. k)
  end
  if props.pattern and not regexp.match(props.pattern, current_value) then
    error("current value for commit metadata field " .. k .. " does not match pattern: " .. props.pattern .. " - got: " .. current_value)
  end
end
```

For more examples and configuration samples, check out the [examples/hooks/](https://github.com/treeverse/lakeFS/tree/master/examples/hooks) directory in the lakeFS repository.

## Lua Library reference

The Lua runtime embedded in lakeFS is limited for security reasons.
The provided APIs are shown below.

### `array(table)`

Helper function to mark a table object as an array for the runtime by setting the `_is_array: true` metatable field.

### `aws`

### `aws/s3_client`

S3 client library.

```lua
local aws = require("aws")
-- pass valid AWS credentials
local client = aws.s3_client("ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "REGION")
```

### `aws/s3_client.get_object(bucket, key)`

Returns the body (as a Lua string) of the requested object and a boolean value that is true if the requested object exists.

### `aws/s3_client.put_object(bucket, key, value)`

Sets the object at the given bucket and key to the value of the supplied value string.

### `aws/s3_client.delete_object(bucket [, key])`

Deletes the object at the given key.

### `aws/s3_client.list_objects(bucket [, prefix, continuation_token, delimiter])`

Returns a table of results containing the following structure:

* `is_truncated`: (boolean) whether there are more results to paginate through using the continuation token
* `next_continuation_token`: (string) to pass in the next request to get the next page of results
* `results`: (table of tables) information about the objects (and prefixes if a delimiter is used)

A result can have one of the following structures:

```lua
{
   ["key"] = "a/common/prefix/",
   ["type"] = "prefix"
}
```

or:

```lua
{
   ["key"] = "path/to/object",
   ["type"] = "object",
   ["etag"] = "etagString",
   ["size"] = 1024,
   ["last_modified"] = "2023-12-31T23:10:00Z"
}
```

### `aws/s3_client.delete_recursive(bucket, prefix)`

Deletes all objects under the given prefix.

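Putting the calls above together, here is a minimal sketch that pages through all objects under a prefix using `next_continuation_token` (the bucket name and prefix are placeholders):

```lua
local aws = require("aws")
local client = aws.s3_client("ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "REGION")

-- page through every object under the prefix, printing keys and sizes
local resp = client.list_objects("my-bucket", "some/prefix/")
while true do
  for _, entry in ipairs(resp.results) do
    if entry.type == "object" then
      print(entry.key .. " (" .. entry.size .. " bytes)")
    end
  end
  if not resp.is_truncated then
    break
  end
  resp = client.list_objects("my-bucket", "some/prefix/", resp.next_continuation_token)
end
```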
### `aws/glue`

Glue client library.

```lua
local aws = require("aws")
-- pass valid AWS credentials
local glue = aws.glue_client("ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "REGION")
```

### `aws/glue.get_table(database, table [, catalog_id])`

Describe a table from the Glue catalog.

Example:

```lua
local json = require("encoding/json")
local table, exists = glue.get_table(db, table_name)
if exists then
  print(json.marshal(table))
end
```

### `aws/glue.create_table(database, table_input [, catalog_id])`

Create a new table in the Glue Catalog.
The `table_input` argument is a JSON string that is passed "as is" to AWS and is parallel to the AWS SDK [TableInput](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html#API_CreateTable_RequestSyntax).

Example:

```lua
local json = require("encoding/json")
local input = {
    Name = "my-table",
    PartitionKeys = array(partitions),
    -- etc...
}
local json_input = json.marshal(input)
glue.create_table("my-db", json_input)
```

### `aws/glue.update_table(database, table_input [, catalog_id, version_id, skip_archive])`

Update an existing table in the Glue Catalog.
The `table_input` is the same as the argument in the `glue.create_table` function.

### `aws/glue.delete_table(database, table_input [, catalog_id])`

Delete an existing table in the Glue Catalog.

### `azure`

### `azure/blob_client`

Azure Blob Storage client library.

```lua
local azure = require("azure")
-- pass valid Azure credentials
local client = azure.blob_client("AZURE_STORAGE_ACCOUNT", "AZURE_ACCESS_KEY")
```

### `azure/blob_client.get_object(path_uri)`

Returns the body (as a Lua string) of the requested object and a boolean value that is true if the requested object exists.
`path_uri` - A valid Azure blob storage URI in the form of `https://myaccount.blob.core.windows.net/mycontainer/myblob`

### `azure/blob_client.put_object(path_uri, value)`

Sets the object at the given path to the value of the supplied value string.
`path_uri` - A valid Azure blob storage URI in the form of `https://myaccount.blob.core.windows.net/mycontainer/myblob`

### `azure/blob_client.delete_object(path_uri)`

Deletes the object at the given path.
`path_uri` - A valid Azure blob storage URI in the form of `https://myaccount.blob.core.windows.net/mycontainer/myblob`

### `azure/abfss_transform_path(path)`

Transforms an HTTPS Azure URL to an ABFSS scheme. Used by the `delta_exporter` function to support Azure Unity Catalog use cases.
`path` - A valid Azure blob storage URL in the form of `https://myaccount.blob.core.windows.net/mycontainer/myblob`

### `crypto`

### `crypto/aes/encryptCBC(key, plaintext)`

Returns a ciphertext for the AES-encrypted text.

### `crypto/aes/decryptCBC(key, ciphertext)`

Returns the decrypted (plaintext) string for the encrypted ciphertext.

### `crypto/hmac/sign_sha256(message, key)`

Returns an HMAC signature for the given message with the supplied key, using the SHA256 hashing algorithm.

### `crypto/hmac/sign_sha1(message, key)`

Returns an HMAC signature for the given message with the supplied key, using the SHA1 hashing algorithm.

### `crypto/md5/digest(data)`

Returns the MD5 digest (string) of the given data.

### `crypto/sha256/digest(data)`

Returns the SHA256 digest (string) of the given data.

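As a small illustration of the helpers above, the following sketch computes a digest and a keyed signature for a payload. It is a minimal sketch: the module names follow the headings above, and the payload and key are placeholders.

```lua
local sha256 = require("crypto/sha256")
local hmac = require("crypto/hmac")

local payload = "some payload worth signing"

-- content digest, returned as a string (see crypto/sha256/digest above)
local digest = sha256.digest(payload)
print("sha256 digest: " .. digest)

-- keyed HMAC-SHA256 signature over the same payload, using a placeholder secret
local signature = hmac.sign_sha256(payload, "my-secret-key")
print("hmac-sha256 signature: " .. signature)
```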
### `databricks/client(databricks_host, databricks_service_principal_token)`

Returns a table representing a Databricks client with the `register_external_table` and `create_or_get_schema` methods.

### `databricks/client.create_schema(schema_name, catalog_name, get_if_exists)`

Creates a schema, or retrieves it if it exists, in the configured Databricks host's Unity Catalog.
If the schema doesn't exist, a new schema with the given `schema_name` will be created under the given `catalog_name`.
Returns the created/fetched schema name.

Parameters:

- `schema_name(string)`: The required schema name
- `catalog_name(string)`: The catalog name under which the schema will be created (or from which it will be fetched)
- `get_if_exists(boolean)`: In case of failure due to an existing schema with the given `schema_name` in the given
  `catalog_name`, return the schema.

Example:

```lua
local databricks = require("databricks")
local client = databricks.client("https://my-host.cloud.databricks.com", "my-service-principal-token")
local schema_name = client.create_schema("main", "mycatalog", true)
```

### `databricks/client.register_external_table(table_name, physical_path, warehouse_id, catalog_name, schema_name, metadata)`

Registers an external table under the provided warehouse ID, catalog name, and schema name.
In order for this method call to succeed, an external location should be configured in the catalog, with the
`physical_path`'s root storage URI (for example: `s3://mybucket`).
Returns the table's creation status.

Parameters:

- `table_name(string)`: Table name.
- `physical_path(string)`: A location to which the external table will refer, e.g. `s3://mybucket/the/path/to/mytable`.
- `warehouse_id(string)`: The SQL warehouse ID used in Databricks to run the `CREATE TABLE` query (fetched from the SQL warehouse
  `Connection Details`, or by running `databricks warehouses get`, choosing your SQL warehouse and fetching its ID).
- `catalog_name(string)`: The name of the catalog under which a schema will be created (or fetched from).
- `schema_name(string)`: The name of the schema under which the table will be created.
- `metadata(table)`: A table of metadata to be added to the table's registration. The metadata table should be of the form:
  `{key1 = "value1", key2 = "value2", ...}`.

Example:

```lua
local databricks = require("databricks")
local client = databricks.client("https://my-host.cloud.databricks.com", "my-service-principal-token")
local status = client.register_external_table("mytable", "s3://mybucket/the/path/to/mytable", "my-warehouse-id", "my-catalog-name", "myschema")
```

For the Databricks permissions needed to run this method, check out the [Unity Catalog Exporter]({% link integrations/unity-catalog.md %}) docs.

### `encoding/base64/encode(data)`

Encodes the given data to a base64 string.

### `encoding/base64/decode(data)`

Decodes the given base64-encoded data and returns it as a string.

### `encoding/base64/url_encode(data)`

Encodes the given data to the unpadded alternate base64 encoding defined in RFC 4648.

### `encoding/base64/url_decode(data)`

Decodes the given unpadded alternate base64 encoding (defined in RFC 4648) and returns it as a string.

### `encoding/hex/encode(value)`

Encodes the given value string to a hexadecimal string.

### `encoding/hex/decode(value)`

Decodes the given hexadecimal string back to the string it represents (UTF-8).

### `encoding/json/marshal(table)`

Encodes the given table into a JSON string.

### `encoding/json/unmarshal(string)`

Decodes the given string into the equivalent Lua structure.

### `encoding/yaml/marshal(table)`

Encodes the given table into a YAML string.

### `encoding/yaml/unmarshal(string)`

Decodes the given YAML-encoded string into the equivalent Lua structure.

### `encoding/parquet/get_schema(payload)`

Reads the payload (string) as the contents of a Parquet file and returns its schema in the following table structure:

```lua
{
  { ["name"] = "column_a", ["type"] = "INT32" },
  { ["name"] = "column_b", ["type"] = "BYTE_ARRAY" }
}
```
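As an illustration, a hook can combine `get_schema` with the `lakefs` package documented below to validate the schema of a Parquet object. This is a minimal sketch assuming a merge event (so `action.source_ref` points at the merge source); the object path and the mandatory column are hypothetical:

```lua
local parquet = require("encoding/parquet")
local lakefs = require("lakefs")
local hook = require("hook")

-- read a Parquet object from the reference that triggered the action
local code, content = lakefs.get_object(action.repository_id, action.source_ref, "tables/users/part-0000.parquet")
if code ~= 200 then
    error("could not fetch object, status: " .. tostring(code))
end

-- fail the hook if a column we rely on went missing
local schema = parquet.get_schema(content)
local found = false
for _, column in ipairs(schema) do
    if column.name == "user_id" then
        found = true
    end
end
if not found then
    hook.fail("mandatory column user_id is missing")
end
```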
### `formats`

### `formats/delta_client(key, secret, region)`

Creates a new Delta Lake client used to interact with the lakeFS server.

- `key`: lakeFS access key id
- `secret`: lakeFS secret access key
- `region`: The region in which your lakeFS server is configured.

### `formats/delta_client.get_table(repository_id, reference_id, prefix)`

Returns a representation of a Delta Lake table under the given repository, reference, and prefix.
The format of the response is two tables:

1. The first is a table of the format `{number, {string}}`, where `number` is a version in the Delta Log, and the mapped `{string}`
   array contains JSON strings of the different Delta Lake log operations listed in the mapped version entry, e.g.:

   ```lua
   {
     [0] = {
       "{\"commitInfo\":...}",
       "{\"add\": ...}",
       "{\"remove\": ...}"
     },
     [1] = {
       "{\"commitInfo\":...}",
       "{\"add\": ...}",
       "{\"remove\": ...}"
     }
   }
   ```

2. The second is a table of the metadata of the current table snapshot. The metadata table can be used to initialize the Delta Lake table in an external catalog.
   It consists of the following fields:
   - `id`: The table's ID
   - `name`: The table's name
   - `description`: The table's description
   - `schema_string`: The table's schema string
   - `partition_columns`: The table's partition columns
   - `configuration`: The table's configuration
   - `created_time`: The table's creation time

### `gcloud`

### `gcloud/gs_client(gcs_credentials_json_string)`

Create a new Google Cloud Storage client using a string that contains a valid [`credentials.json`](https://developers.google.com/workspace/guides/create-credentials) file content.

### `gcloud/gs.write_fuse_symlink(source, destination, mount_info)`

Will create a [gcsfuse symlink](https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md#symlink-inodes)
from the source (typically a lakeFS physical address for an object) to a given destination.

`mount_info` is a Lua table with `"from"` and `"to"` keys - since symlinks don't work for `gs://...` URIs, they need to point
to the mounted location instead. `from` will be removed from the beginning of `source`, and `destination` will be added instead.

Example:

```lua
source = "gs://bucket/lakefs/data/abc/def"
destination = "gs://bucket/exported/path/to/object"
mount_info = {
    ["from"] = "gs://bucket",
    ["to"] = "/home/user/gcs-mount"
}
gs.write_fuse_symlink(source, destination, mount_info)
-- Symlink: "/home/user/gcs-mount/exported/path/to/object" -> "/home/user/gcs-mount/lakefs/data/abc/def"
```

### `hook`

A set of utilities to aid in writing user-friendly hooks.

### `hook/fail(message)`

Will abort the current hook's execution with the given message. This is similar to using `error()`, but is typically used to separate
generic runtime errors (an API call that returned an unexpected response) from an explicit failure of the calling hook.

When called, errors will appear without a stacktrace, and the error message will be directly the one given as `message`.

```lua
> hook = require("hook")
> hook.fail("this hook shall not pass because of: " .. reason)
```

### `lakefs`

The Lua Hook library allows calling back to the lakeFS API using the identity of the user that triggered the action.
For example, if user A tries to commit and triggers a `pre-commit` hook, any call made inside that hook to the lakeFS
API will automatically use user A's identity for authorization and auditing purposes.

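For example, here is a minimal sketch of a `pre-commit` hook that uses `lakefs/get_object` (documented below) to enforce a hypothetical policy that every branch carries a `CODEOWNERS` file at its root:

```lua
local lakefs = require("lakefs")
local hook = require("hook")

-- read the file from the branch being committed; the call is made with the committing user's identity
local code, _ = lakefs.get_object(action.repository_id, action.branch_id, "CODEOWNERS")
if code == 404 then
    hook.fail("CODEOWNERS file is missing from this branch")
elseif code ~= 200 then
    error("unexpected status while reading CODEOWNERS: " .. tostring(code))
end
```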
### `lakefs/create_tag(repository_id, reference_id, tag_id)`

Create a new tag for the given reference.

### `lakefs/diff_refs(repository_id, left_reference_id, right_reference_id [, after, prefix, delimiter, amount])`

Returns an object-wise diff between `left_reference_id` and `right_reference_id`.

### `lakefs/list_objects(repository_id, reference_id [, after, prefix, delimiter, amount])`

List objects in the specified repository and reference (branch, tag, commit ID, etc.).
If the delimiter is empty, this defaults to a recursive listing. Otherwise, common prefixes up to `delimiter` will be shown as a single entry.

### `lakefs/get_object(repository_id, reference_id, path)`

Returns 2 values:

1. The HTTP status code returned by the lakeFS API
1. The content of the specified object as a Lua string

### `lakefs/diff_branch(repository_id, branch_id [, after, amount, prefix, delimiter])`

Returns an object-wise diff of uncommitted changes on `branch_id`.

### `lakefs/stat_object(repository_id, ref_id, path)`

Returns a stat object for the given path under the given reference and repository.

### `lakefs/catalogexport/delta_exporter`

A package used to export Delta Lake tables from lakeFS to an external cloud storage.

### `lakefs/catalogexport/delta_exporter.export_delta_log(action, table_def_names, write_object, delta_client, table_descriptors_path, path_transformer)`

The function used to export Delta Lake tables.
The return value is a table mapping each table name to its external table location (from which it is possible to query the data) and the latest Delta table version's metadata.
The response is of the form:
`{<table_name> = {path = "s3://mybucket/mypath/mytable", metadata = {id = "table_id", name = "table_name", ...}}}`.

Parameters:

- `action`: The global action object
- `table_def_names`: Delta tables name list (e.g. `{"table1", "table2"}`)
- `write_object`: A writer function with a `function(bucket, key, data)` signature, used to write the exported Delta Log (e.g. `aws/s3_client.put_object` or `azure/blob_client.put_object`)
- `delta_client`: A Delta Lake client that implements `get_table: function(repo, ref, prefix)`
- `table_descriptors_path`: The path under which the table descriptors of the provided `table_def_names` reside
- `path_transformer`: (Optional) A `function(path)` used for transforming the paths written into the exported Delta Log entries, as well as the exported table's physical path (used to support Azure Unity Catalog use cases)

Delta export example for AWS S3:

```yaml
---
name: delta_exporter
on:
  post-commit: null
hooks:
  - id: delta_export
    type: lua
    properties:
      script: |
        local aws = require("aws")
        local formats = require("formats")
        local delta_exporter = require("lakefs/catalogexport/delta_exporter")
        local json = require("encoding/json")

        local table_descriptors_path = "_lakefs_tables"
        local sc = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)
        local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.region)
        local delta_table_details = delta_exporter.export_delta_log(action, args.table_defs, sc.put_object, delta_client, table_descriptors_path)

        for t, details in pairs(delta_table_details) do
          print("Delta Lake exported table \"" .. t .. "\"'s location: " .. details["path"] .. "\n")
          print("Delta Lake exported table \"" .. t .. "\"'s metadata:\n")
          for k, v in pairs(details["metadata"]) do
            if type(v) == "table" then
              print("\t" .. k .. " = " .. json.marshal(v) .. "\n")
            else
              print("\t" .. k .. " = " .. v .. "\n")
            end
          end
        end
      args:
        aws:
          access_key_id: <AWS_ACCESS_KEY_ID>
          secret_access_key: <AWS_SECRET_ACCESS_KEY>
          region: us-east-1
        lakefs:
          access_key_id: <LAKEFS_ACCESS_KEY_ID>
          secret_access_key: <LAKEFS_SECRET_ACCESS_KEY>
        table_defs:
          - mytable
```

For the table descriptor under `_lakefs_tables/mytable.yaml`:

```yaml
---
name: myTableActualName
type: delta
path: a/path/to/my/delta/table
```

Delta export example for Azure Blob Storage:

```yaml
name: Delta Exporter
on:
  post-commit:
    branches: ["{{ .Branch }}*"]
hooks:
  - id: delta_exporter
    type: lua
    properties:
      script: |
        local azure = require("azure")
        local formats = require("formats")
        local delta_exporter = require("lakefs/catalogexport/delta_exporter")
        local json = require("encoding/json")

        local table_descriptors_path = "_lakefs_tables"
        local sc = azure.blob_client(args.azure.storage_account, args.azure.access_key)
        local function write_object(_, key, buf)
          return sc.put_object(key, buf)
        end
        local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key)
        local delta_table_details = delta_exporter.export_delta_log(action, args.table_defs, write_object, delta_client, table_descriptors_path)

        for t, details in pairs(delta_table_details) do
          print("Delta Lake exported table \"" .. t .. "\"'s location: " .. details["path"] .. "\n")
          print("Delta Lake exported table \"" .. t .. "\"'s metadata:\n")
          for k, v in pairs(details["metadata"]) do
            if type(v) == "table" then
              print("\t" .. k .. " = " .. json.marshal(v) .. "\n")
            else
              print("\t" .. k .. " = " .. v .. "\n")
            end
          end
        end
      args:
        azure:
          storage_account: "{{ .AzureStorageAccount }}"
          access_key: "{{ .AzureAccessKey }}"
        lakefs: # provide credentials of a user that has access to the script and Delta Table
          access_key_id: "{{ .LakeFSAccessKeyID }}"
          secret_access_key: "{{ .LakeFSSecretAccessKey }}"
        table_defs:
          - mytable
```

### `lakefs/catalogexport/table_extractor`

Utility package to parse `_lakefs_tables/` descriptors.

### `lakefs/catalogexport/table_extractor.list_table_descriptor_entries(client, repo_id, commit_id)`

Lists all YAML files under `_lakefs_tables/*` and returns a list of entries of type `[{physical_address, path}]`, ignoring hidden files.
The `client` is the `lakefs` client.

### `lakefs/catalogexport/table_extractor.get_table_descriptor(client, repo_id, commit_id, logical_path)`

Reads a table descriptor and parses it as a YAML object. Will set `partition_columns` to `{}` if no partitions are defined.
The `client` is the `lakefs` client.

### `lakefs/catalogexport/hive.extract_partition_pager(client, repo_id, commit_id, base_path, partition_cols, page_size)`

Hive-format partition iterator; each result set is a collection of files under the same partition in lakeFS.

Example:

```lua
local lakefs = require("lakefs")
local hive = require("lakefs/catalogexport/hive")
local pager = hive.extract_partition_pager(lakefs, repo_id, commit_id, prefix, partitions, 10)
for part_key, entries in pager do
  print("partition: " .. part_key)
  for _, entry in ipairs(entries) do
    print("path: " .. entry.path .. " physical: " .. entry.physical_address)
  end
end
```

### `lakefs/catalogexport/symlink_exporter`

Writes metadata for a table using Hive's [SymlinkTextInputFormat](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html).
Currently only `S3` is supported.

The default export paths per commit:

```
${storageNamespace}
  _lakefs/
    exported/
      ${ref}/
        ${commitId}/
          ${tableName}/
            p1=v1/symlink.txt
            p1=v2/symlink.txt
            p1=v3/symlink.txt
            ...
```

### `lakefs/catalogexport/symlink_exporter.export_s3(s3_client, table_src_path, action_info [, options])`

Export Symlink files that represent a table to an S3 location.

Parameters:

- `s3_client`: Configured client.
- `table_src_path(string)`: Path to the table spec YAML file in `_lakefs_tables` (e.g. `_lakefs_tables/my_table.yaml`).
- `action_info(table)`: The global action object.
- `options(table)`:
  - `debug(boolean)`: Print extra info.
  - `export_base_uri(string)`: Override the prefix in S3, e.g. `s3://other-bucket/path/`.
  - `writer(function(bucket, key, data))`: If passed, the S3 client will not be used; helpful for debugging.

Example:

```lua
local exporter = require("lakefs/catalogexport/symlink_exporter")
local aws = require("aws")
-- args are user inputs from a lakeFS action.
local s3 = aws.s3_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_key, args.aws.aws_region)
exporter.export_s3(s3, args.table_descriptor_path, action, {debug=true})
```

### `lakefs/catalogexport/glue_exporter`

A package for automating the export of tables stored in lakeFS into the Glue Catalog.
### `lakefs/catalogexport/glue_exporter.export_glue(glue, db, table_src_path, create_table_input, action_info, options)`

Represents a lakeFS table in the Glue Catalog.
This function will create a table in Glue based on the given configuration.
It assumes that a symlink location has already been created, and by default configures the table for the same commit.

Parameters:

- `glue`: AWS Glue client
- `db(string)`: Glue database name
- `table_src_path(string)`: Path to the table spec (e.g. `_lakefs_tables/my_table.yaml`)
- `create_table_input(Table)`: Input mapping to [table_input](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html#API_CreateTable_RequestSyntax) in AWS, the same as used by `glue.create_table`.
  It should contain inputs describing the data format (e.g. `InputFormat`, `OutputFormat`, `SerdeInfo`), since the exporter is agnostic to these.
  By default this function will configure the table location and schema.
- `action_info(Table)`: The global action object.
- `options(Table)`:
  - `table_name(string)`: Override the default Glue table name
  - `debug(boolean)`: Print extra info.
  - `export_base_uri(string)`: Override the default prefix in S3 for the symlink location, e.g. `s3://other-bucket/path/`

When creating a Glue table, the final table input will consist of the `create_table_input` parameter and lakeFS-computed defaults that will override it:

- `Name`: Glue table name, as generated by `get_full_table_name(descriptor, action_info)`.
- `PartitionKeys`: Partition columns, usually deduced from `_lakefs_tables/${table_src_path}`.
- `TableType` = "EXTERNAL_TABLE"
- `StorageDescriptor`: Columns usually deduced from `_lakefs_tables/${table_src_path}`.
- `StorageDescriptor.Location` = symlink_location

Example:

```lua
local aws = require("aws")
local exporter = require("lakefs/catalogexport/glue_exporter")
local glue = aws.glue_client(args.aws_access_key_id, args.aws_secret_access_key, args.aws_region)
-- table_input can also be passed as a simple key-value object in YAML as an argument from an action; this is an inline example:
local table_input = {
  StorageDescriptor = {
    InputFormat = "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
    OutputFormat = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat",
    SerdeInfo = {
      SerializationLibrary = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    },
    Parameters = {
      classification = "parquet",
      EXTERNAL = "TRUE",
      ["parquet.compression"] = "SNAPPY"
    }
  }
}
exporter.export_glue(glue, "my-db", "_lakefs_tables/animals.yaml", table_input, action, {debug=true})
```

### `lakefs/catalogexport/glue_exporter.get_full_table_name(descriptor, action_info)`

Generate the Glue table name.

Parameters:

- `descriptor(Table)`: Object from the table descriptor (e.g. `_lakefs_tables/my_table.yaml`).
- `action_info(Table)`: The global action object.

### `lakefs/catalogexport/unity_exporter`

A package used to register exported Delta Lake tables to Databricks' Unity Catalog.

### `lakefs/catalogexport/unity_exporter.register_tables(action, table_descriptors_path, delta_table_details, databricks_client, warehouse_id)`

The function used to register exported Delta Lake tables in Databricks' Unity Catalog.
The registration will use the following path to register each table:
`<catalog>.<branch name>.<table_name>`, where the branch name will be used as the schema name.
The return value is a table mapping each table name to the status of its registration request.

**Note for Azure users:** Databricks catalog external locations are supported only for ADLS Gen2 storage accounts.
When exporting Delta tables using the `lakefs/catalogexport/delta_exporter.export_delta_log` function, the `path_transformer` must be
used to convert the path scheme to `abfss`. The built-in `azure` Lua library provides this functionality with `abfss_transform_path`.

Parameters:

- `action(table)`: The global action table
- `table_descriptors_path(string)`: The path under which the table descriptors of the provided tables reside.
- `delta_table_details(table)`: Table names to physical paths mapping and table metadata (e.g. `{table1 = {path = "s3://mybucket/mytable1", metadata = {id = "table_1_id", name = "table1", ...}}, table2 = {path = "s3://mybucket/mytable2", metadata = {id = "table_2_id", name = "table2", ...}}}`.)
- `databricks_client(table)`: A Databricks client that implements `create_or_get_schema: function(id, catalog_name)` and `register_external_table: function(table_name, physical_path, warehouse_id, catalog_name, schema_name)`
- `warehouse_id(string)`: Databricks warehouse ID.

Example:
The following registers an exported Delta Lake table to Unity Catalog.

```lua
local databricks = require("databricks")
local unity_export = require("lakefs/catalogexport/unity_exporter")

local delta_table_details = {
  ["table1"] = {
    path = "s3://mybucket/mytable1",
    metadata = { id = "table_1_id", name = "table1" },
  },
}
-- Register the exported table in Unity Catalog:
local action_details = {
  repository_id = "my-repo",
  commit_id = "commit_id",
  branch_id = "main",
}
local databricks_client = databricks.client("<DATABRICKS_HOST>", "<DATABRICKS_TOKEN>")
local registration_statuses = unity_export.register_tables(action_details, "_lakefs_tables", delta_table_details, databricks_client, "<WAREHOUSE_ID>")

for t, status in pairs(registration_statuses) do
  print("Unity catalog registration for table \"" .. t .. "\" completed with status: " .. status .. "\n")
end
```

For the table descriptor under `_lakefs_tables/delta-table-descriptor.yaml`:

```yaml
---
name: my_table_name
type: delta
path: path/to/delta/table/data
catalog: my-catalog
```

For a detailed step-by-step guide on how to use `unity_exporter.register_tables` as part of a lakeFS action, refer to
the [Unity Catalog docs]({% link integrations/unity-catalog.md %}).

### `path/parse(path_string)`

Returns a table for the given path string with the following structure:

```lua
> path = require("path")
> path.parse("a/b/c.csv")
{
    ["parent"] = "a/b/",
    ["base_name"] = "c.csv"
}
```

### `path/join(*path_parts)`

Receives a variable number of strings and returns a joined string that represents a path:

```lua
> path = require("path")
> path.join("/", "path/", "to", "a", "file.data")
path/to/a/file.data
```

### `path/is_hidden(path_string [, separator, prefix])`

Returns a boolean - `true` if the given path string is hidden (meaning it starts with `prefix`), or if any of its parents start with `prefix`.
```lua
> path = require("path")
> path.is_hidden("a/b/c")   -- false
> path.is_hidden("a/b/_c")  -- true
> path.is_hidden("a/_b/c")  -- true
> path.is_hidden("a/b/_c/") -- true
```

### `path/default_separator()`

Returns a constant string (`/`)

### `regexp/match(pattern, s)`

Returns true if the string `s` matches `pattern`.
This is a thin wrapper over Go's [regexp.MatchString](https://pkg.go.dev/regexp#MatchString){: target="_blank" }.

### `regexp/quote_meta(s)`

Escapes any meta-characters in string `s` and returns a new string.

### `regexp/compile(pattern)`

Returns a regexp match object for the given pattern.

### `regexp/compiled_pattern.find_all(s, n)`

Returns a table list of all matches for the pattern (up to `n` matches, unless `n == -1`, in which case all possible matches will be returned).

### `regexp/compiled_pattern.find_all_submatch(s, n)`

Returns a table list of all sub-matches for the pattern (up to `n` matches, unless `n == -1`, in which case all possible matches will be returned).
Submatches are matches of parenthesized subexpressions (also known as capturing groups) within the regular expression,
numbered from left to right in order of opening parenthesis.
Submatch 0 is the match of the entire expression, submatch 1 is the match of the first parenthesized subexpression, and so on.

### `regexp/compiled_pattern.find(s)`

Returns a string representing the left-most match for the given pattern in string `s`.

### `regexp/compiled_pattern.find_submatch(s)`

Returns a table of strings holding the text of the leftmost match of the regular expression in `s` and the matches, if any, of its submatches.

### `strings/split(s, sep)`

Returns a table of strings: the result of splitting `s` with `sep`.

### `strings/trim(s)`

Returns a string with all leading and trailing white space removed, as defined by Unicode.

### `strings/replace(s, old, new, n)`

Returns a copy of the string `s` with the first `n` non-overlapping instances of `old` replaced by `new`.
If `old` is empty, it matches at the beginning of the string and after each UTF-8 sequence, yielding up to k+1 replacements for a k-rune string.

If `n < 0`, there is no limit on the number of replacements.

### `strings/has_prefix(s, prefix)`

Returns `true` if `s` begins with `prefix`.

### `strings/has_suffix(s, suffix)`

Returns `true` if `s` ends with `suffix`.

### `strings/contains(s, substr)`

Returns `true` if `substr` is contained anywhere in `s`.

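As a small illustration, the following sketch combines the string and regular-expression helpers in a `pre-create-tag` hook to enforce a hypothetical tag naming rule (it assumes the event exposes `action.tag_id`):

```lua
local regexp = require("regexp")
local strings = require("strings")
local hook = require("hook")

-- hypothetical rule: tags prefixed with "release-" must carry a semantic version
local tag = action.tag_id
if strings.has_prefix(tag, "release-") and not regexp.match("^release-\\d+\\.\\d+\\.\\d+$", tag) then
    hook.fail("release tags must be of the form release-X.Y.Z, got: " .. tag)
end
```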
### `time/now()`

Returns a `float64` representing the number of nanoseconds since the Unix epoch (01/01/1970 00:00:00).

### `time/format(epoch_nano, layout, zone)`

Returns a string representation of the given `epoch_nano` timestamp for the given timezone (e.g. `"UTC"`, `"America/Los_Angeles"`, ...).
The `layout` parameter should follow [Go's time layout format](https://pkg.go.dev/time#pkg-constants){: target="_blank" }.

### `time/format_iso(epoch_nano, zone)`

Returns a string representation of the given `epoch_nano` timestamp for the given timezone (e.g. `"UTC"`, `"America/Los_Angeles"`, ...).
The returned string will be in [ISO8601](https://en.wikipedia.org/wiki/ISO_8601){: target="_blank" } format.

### `time/sleep(duration_ns)`

Sleep for `duration_ns` nanoseconds.

### `time/since(epoch_nano)`

Returns the number of nanoseconds elapsed since `epoch_nano`.

### `time/add(epoch_time, duration_table)`

Returns a new timestamp (in nanoseconds since 01/01/1970 00:00:00): the result of adding `duration_table` to `epoch_time`.
The duration should be a table with the following structure:

```lua
> time = require("time")
> time.add(time.now(), {
   ["hour"] = 1,
   ["minute"] = 20,
   ["second"] = 50
})
```

You may omit any of the fields from the table, resulting in a default value of `0` for omitted fields.

### `time/parse(layout, value)`

Returns a `float64` representing the number of nanoseconds since the Unix epoch (01/01/1970 00:00:00).
This timestamp will represent the date `value` parsed using the `layout` format.

The `layout` parameter should follow [Go's time layout format](https://pkg.go.dev/time#pkg-constants){: target="_blank" }.

### `time/parse_iso(value)`

Returns a `float64` representing the number of nanoseconds since the Unix epoch (01/01/1970 00:00:00) for `value`.
The `value` string should be in [ISO8601](https://en.wikipedia.org/wiki/ISO_8601){: target="_blank" } format.

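A minimal sketch combining the time helpers above to compute and print an expiry timestamp (the offset is arbitrary):

```lua
local time = require("time")

local now = time.now()
print("now (UTC): " .. time.format_iso(now, "UTC"))

-- an expiry timestamp one hour and twenty minutes from now
local expiry = time.add(now, { ["hour"] = 1, ["minute"] = 20 })
print("expires at: " .. time.format_iso(expiry, "UTC"))
```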
### `uuid/new()`

Returns a new 128-bit [RFC 4122 UUID](https://www.rfc-editor.org/rfc/rfc4122){: target="_blank" } in string representation.

### `net/url`

Provides a `parse` function that parses a URL string into its parts, returning a table with the URL's host, path, scheme, query and fragment.

```lua
> local url = require("net/url")
> url.parse("https://example.com/path?p1=a#section")
{
    ["host"] = "example.com",
    ["path"] = "/path",
    ["scheme"] = "https",
    ["query"] = "p1=a",
    ["fragment"] = "section"
}
```

### `net/http` (optional)

Provides a `request` function that performs an HTTP request.
For security reasons, this package is not available by default as it enables HTTP requests to be sent out from the lakeFS instance network. The feature should be enabled under `actions.lua.net_http_enabled` [configuration]({% link reference/configuration.md %}).
Requests will time out after 30 seconds.

```lua
http.request(url [, body])
http.request{
  url = string,
  [method = string,]
  [headers = header-table,]
  [body = string,]
}
```

Returns a code (number), body (string), headers (table) and status (string).

- code - status code number
- body - string with the response body
- headers - table with the response headers (key/value or table of values)
- status - status code text

The first form of the call performs a GET request, or a POST request if the `body` parameter is passed.

The second form accepts a table and allows you to customize the request method and headers.

Example of a GET request:

```lua
local http = require("net/http")
local code, body = http.request("https://example.com")
if code == 200 then
    print(body)
else
    print("Failed to get example.com - status code: " .. code)
end
```

Example of a POST request:

```lua
local http = require("net/http")
local code, body = http.request{
    url="https://httpbin.org/post",
    method="POST",
    body="custname=tester",
    headers={["Content-Type"]="application/x-www-form-urlencoded"},
}
if code == 200 then
    print(body)
else
    print("Failed to post data - status code: " .. code)
end
```