github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/design/open/zero-deployment-hooks.md (about)

# 0-deployment hooks

## User Story

As a data engineer, I would like to perform metadata and schema validation automatically before committing and merging.
I would also like to easily integrate with external components like Athena and Hive Metastore.
I have neither the required permissions nor the expertise to use webhooks.

## Goals

The overarching goal is to substantially reduce the friction of using hooks, driving more lakeFS users to production use cases.

1. Provide an easy, deployment-less way to use hooks: let users supply the logic they want with minimal configuration.
1. Support both `pre` and `post` hooks:
    1. `pre` hooks, used to validate and test metadata and schema
    1. `post` hooks, used to automatically expose data to downstream systems
1. Increase hook discoverability by allowing one-click, UI-driven setup of pre-configured hooks.

## Non-Goals

1. Unlike some past proposals, avoid becoming a scheduler or deployment tool for Kubernetes, Docker, and similar platforms.

## High Level Design

Executing an external component (such as a container, sub-process, API or HTTP endpoint) has both operational and security implications.

What we really want is the ability to execute user-supplied logic when a certain event occurs in lakeFS. Databases have been doing this for decades
with Stored Procedures and Triggers: the database itself is the runtime, instead of calling out to an external component.

This proposal suggests adding an embedded, programmable interface to lakeFS so that hooks can run natively within lakeFS itself.

Embedding gives very tight control over the capabilities provided: we expose a narrow API and programming interface, consisting only of a select set of functions and methods that can't "escape" the runtime environment.

A very common embedded language for such purposes is [Lua](https://www.lua.org/).

It is used as an embedded scripting language in many notable projects, from databases such as [Redis](https://redis.io/docs/manual/programmability/eval-intro/) and [Aerospike](https://docs.aerospike.com/server/architecture/udf) to load balancers such as [HAProxy](https://www.haproxy.com/blog/5-ways-to-extend-haproxy-with-lua/) and [NGINX](https://www.nginx.com/resources/wiki/modules/lua/), all the way to games ([Roblox](https://create.roblox.com/docs/tutorials/scripting/basic-scripting/intro-to-scripting), [Garry's Mod](https://wiki.facepunch.com/gmod/Beginner_Tutorial_Intro) and others).

### Embedding Lua in lakeFS

Fortunately, most of the work of making hooks pluggable has already been done. lakeFS currently supports Webhooks and Airflow `post-*` hooks, so a general interface for plugging in new hook types is already in place.

A feature-complete Lua VM in pure Go is available at [github.com/Shopify/go-lua](https://github.com/Shopify/go-lua). It gives very fine-grained control over what user-defined code is able to execute, and has a few notable properties that make it a good fit for hooks:

1. Control over available modules, including builtin ones. For example, we can simply not load the `io` or `os` modules at all, providing no access to the host environment.
1. Binding of Go functions and structs to Lua functions and tables. We can create our own set of exposed libraries that will be made available to user-defined code.
1. An accompanying project, [Shopify/goluago](https://github.com/Shopify/goluago), provides a great set of common libraries (`strings`, `regexp`, `encoding/json` and more), so for the most part we won't have to reinvent the wheel to provide something usable.
1. Injection of variables into the Lua VM's global table (`_G`). This makes exposing metadata about the action easy and intuitive.

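To illustrate the first and last points from the script author's side, here is a minimal sketch written in plain Lua rather than Go. It assumes Lua 5.2 or later (for the `env` argument of `load`). The `run_sandboxed` helper, the contents of the allowlist and the `event_type` field are hypothetical names chosen for this example; in the proposal itself the restriction would be enforced by the Go host not loading the modules in the first place.

```lua
-- Sketch: run a chunk inside an allowlisted environment (Lua 5.2+).
-- run_sandboxed and the `action` fields used below are hypothetical.
local function run_sandboxed(source, globals)
  local log = {}
  -- Only these names are visible to the chunk; io, os, loadfile and
  -- dofile are deliberately absent.
  local env = {
    pairs = pairs, ipairs = ipairs, tostring = tostring, type = type,
    error = error,
    -- print appends to a buffer instead of writing to stdout
    print = function(...)
      local parts = {}
      for i = 1, select("#", ...) do
        parts[i] = tostring((select(i, ...)))
      end
      log[#log + 1] = table.concat(parts, "\t")
    end,
  }
  -- Injected values (such as `action`) become globals of the chunk.
  for name, value in pairs(globals) do env[name] = value end

  local chunk, err = load(source, "hook", "t", env)
  if not chunk then return false, err, log end
  local ok, result = pcall(chunk)
  return ok, result, log
end

local ok, result, log = run_sandboxed(
  'print("event:", action.event_type) return io == nil and os == nil',
  { action = { event_type = "pre-commit" } })
```

A chunk that tries to reach the host, for example `os.getenv("HOME")`, fails inside `pcall` because `os` is simply not defined in its environment.
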
### Defining Lua Hooks

Lua code can be supplied either in-line inside the `.yaml` hook definition file or as a separate object. An in-line definition looks as follows:

```yaml
name: dump_all
on:
  post-commit:
  post-merge:
  pre-commit:
  pre-merge:
  pre-create-tag:
  post-create-tag:
  pre-create-branch:
  post-create-branch:
hooks:
  - id: dump_event
    type: lua
    properties:
      script: |
        local json = require("encoding/json")
        print(json.marshal(action))
```

Here we're specifying two important properties:

- `hooks[].type = "lua"` - causes the hook to be executed by the `LuaHook` type
- `hooks[].properties.script (string)` - an in-line Lua script, written directly in the YAML file.

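As a slightly larger illustration of what an in-line script could contain, the sketch below validates commit metadata in a `pre-commit` hook. The shape of `action` (an assumed `commit.metadata` table), the required keys, and the convention that raising an error fails the hook are all assumptions made for this example; the proposal does not specify them. A stand-in `action` table is defined so the snippet runs on a stock Lua interpreter.

```lua
-- Stand-in for the global lakeFS would inject; its fields are assumed.
local action = {
  event_type = "pre-commit",
  commit = { metadata = { owner = "data-team" } },
}

-- Return the list of required keys that are absent or empty.
local function missing_keys(metadata, required)
  local missing = {}
  for _, key in ipairs(required) do
    local value = metadata[key]
    if value == nil or value == "" then
      missing[#missing + 1] = key
    end
  end
  return missing
end

local missing = missing_keys(action.commit.metadata, { "owner", "ticket" })

-- pcall is used here only so the demonstration does not abort;
-- a real hook would let the error propagate.
local ok, err = pcall(function()
  if #missing > 0 then
    -- Assumed convention: an error aborts the pre-* action.
    error("missing commit metadata: " .. table.concat(missing, ", "))
  end
end)
print(ok, err)
```
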
For more complex logic, it's probably a good idea to store the code as its own dedicated object and supply a lakeFS path instead of an in-line script.
Example:

```yaml
name: auto symlink
on:
  post-create-branch:
    branches:
      - view-*
  post-commit:
    branches:
      - view-*
hooks:
  - id: symlink_creator
    type: lua
    properties:
      args:
        # Export configuration
        aws_access_key_id: "{{ENV.AWS_ACCESS_KEY_ID}}"
        aws_secret_access_key: "{{ENV.AWS_SECRET_ACCESS_KEY}}"
        aws_region: us-east-1
        # Export location
        export_bucket: athena-views
        export_path: lakefs/exposed-tables/
        # Tables to export:
        sources:
          - tables/users/
          - tables/events/
      script_path: scripts/s3_hive_manifest_exporter.lua
```

This adds the following settings:

- `hooks[].properties.script_path (string)` - the path, within the same lakeFS repository, of a Lua script to execute
- `hooks[].properties.args (map[string]interface{})` - a map that is passed to the Lua script as a global variable called `args`.

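Inside the script, `args` would then be read like any other Lua table. The sketch below derives one export prefix per configured source. The `export_prefixes` helper and the `s3://` layout it produces are hypothetical, and `args` is defined locally only so the snippet runs outside lakeFS.

```lua
-- Stand-in for the `args` global, mirroring the YAML example.
local args = {
  export_bucket = "athena-views",
  export_path = "lakefs/exposed-tables/",
  sources = { "tables/users/", "tables/events/" },
}

-- Build the destination prefix for each source table (hypothetical layout).
local function export_prefixes(a)
  local out = {}
  for i, source in ipairs(a.sources) do
    out[i] = "s3://" .. a.export_bucket .. "/" .. a.export_path .. source
  end
  return out
end

for _, prefix in ipairs(export_prefixes(args)) do
  print(prefix)
end
```
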
### Lua API exposed to hooks

The implementation will provide the following modules:

- `_G` - a modified set of the builtin functions that a Lua runtime provides. A few functions will be removed or modified:
    - `loadfile` - removed, since we don't want free access to the local filesystem
    - `dofile` - removed, since we don't want free access to the local filesystem
    - `print` - modified to write to an externally supplied `bytes.Buffer` instead of `os.Stdout`, so that output reaches the action log
- `aws/s3`:
    - `get_object`
    - `put_object`
    - `list_objects`
    - `delete_object`
    - `delete_recursive`
- `crypto/aes` (taken from *goluago*):
    - `encryptCBC`
    - `decryptCBC`
- `crypto/hmac` (taken from *goluago*):
    - `signsha256`
    - `signsha1`
- `crypto/sha256` (taken from *goluago*):
    - `digest`
- `encoding/base64` (taken from *goluago*):
    - `encode`
    - `decode`
    - `urlEncode`
    - `urlDecode`
- `encoding/hex` (taken from *goluago*):
    - `encode`
    - `decode`
- `encoding/json` (taken from *goluago*):
    - `marshal`
    - `unmarshal`
- `encoding/parquet`:
    - `get_schema`
- `regexp` (taken from *goluago*):
    - `match`
    - `quotemeta`
    - `compile`
- `path`:
    - `parse`
    - `join`
    - `is_hidden`
- `strings` (taken from *goluago*):
    - `split`
    - `trim`
    - `replace`
    - `has_prefix`
    - `has_suffix`
    - `contains`
- `time` (taken from *goluago*):
    - `now`
    - `format`
    - `formatISO`
    - `sleep`
    - `since`
    - `add`
    - `parse`
    - `parseISO`
- `uuid` (taken from *goluago*):
    - `new`

Additionally, a lakeFS client, pre-authenticated as the user that performed the action, will be loaded.
This client doesn't go over the network: it generates in-process `http.Request` objects that are passed directly to the current process' `http.Server`. It exposes:

- `lakefs`
    - `create_tag`
    - `diff_refs`
    - `list_objects`
    - `get_object`

Other methods or calls could be added over time.

## Hooks Discoverability

This is nice to have and not necessarily part of a first release, but once we have embedded hooks, we can do the following:

1. Add a `scripts/` directory to new repositories, containing reusable Lua scripts for common use cases.
1. Suggest auto-creating a hook at certain touch points:
    - in the new commit modal: when adding a k/v pair, suggest adding a hook that validates the existence of this key in subsequent commits
    - when committing a change or viewing an object tree that includes Parquet/ORC files: automatically add a hook that validates against breaking schema changes
    - when looking at a Hive-partitioned table: automatically add a hook to export symlinks for Athena/Trino/...
    - when creating a branch: automatically add a hook to register tables on this branch in a metastore
    - when doing any versioning action: add a simple hook that prints out information about commits/merges/branches/tags when they happen
    - when tagging: add a hook to enforce a naming convention for tags
    - when merging: add a hook to enforce allowed formats, a schema blacklist, etc.
1. On the "Actions" tab, offer a marketplace-like view of available hooks that can be added with a click (essentially the list above, but centralized).

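As one example of what such a pre-packaged script might do, the sketch below flags breaking schema changes between two versions of a table. It assumes a schema is represented as a list of `{name, type}` records; the actual return shape of `get_schema` in `encoding/parquet` is not defined in this proposal, and `breaking_changes` is a hypothetical helper.

```lua
-- A change is treated as breaking if a column is removed or retyped.
-- Added columns are considered compatible.
local function breaking_changes(old_schema, new_schema)
  local new_types = {}
  for _, col in ipairs(new_schema) do
    new_types[col.name] = col.type
  end
  local problems = {}
  for _, col in ipairs(old_schema) do
    local new_type = new_types[col.name]
    if new_type == nil then
      problems[#problems + 1] = "column removed: " .. col.name
    elseif new_type ~= col.type then
      problems[#problems + 1] = string.format(
        "column %s changed type: %s -> %s", col.name, col.type, new_type)
    end
  end
  return problems
end

-- Sample schemas, invented for this example.
local old = { { name = "id", type = "INT64" }, { name = "email", type = "BYTE_ARRAY" } }
local new = { { name = "id", type = "INT32" }, { name = "signup", type = "INT64" } }
for _, problem in ipairs(breaking_changes(old, new)) do
  print(problem)
end
```

A `pre-merge` hook built on this idea could raise an error whenever the returned list is non-empty, assuming errors are how a script signals failure.
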
## Limitations/Downsides

1. While Lua is popular as an embedded language for server components and databases, it is not a common language overall. It is very simple, but we can assume the vast majority of users will not have any prior experience with it.
1. While this removes some of the friction of having to run something external, it doesn't solve another underlying problem with hooks: long-running `pre-` actions are still tied to the lifecycle of the HTTP request. This has to be solved by another mechanism.
1. Users still need to configure the YAML files that tie the Lua logic to the events in the system. The usability and discoverability of that are still open questions.

## Alternatives

Another option is to introduce WASM support. This theoretically allows writing hooks in many languages while getting the same guarantees.
This, however, has several limitations:

1. Users will either have to compile their code to WASM themselves, or lakeFS would have to compile code for them on the fly.
2. The [WASM Specification](https://webassembly.github.io/spec/core/) has a much larger surface area: supporting it while also making sure user functions cannot "escape" the hook runtime would be harder, making this less secure.
3. Support for dynamic languages in WASM is still in its early stages. Most importantly, Python, which is the most common language for data engineering and ML, has only limited support. The most notable option is [pyodide](https://github.com/pyodide/pyodide), which currently can only expose a REPL over a web interface and doesn't allow headless compilation to WASM.

Supporting Lua doesn't close the door on supporting WASM in the future, and it is much faster and simpler to implement.