# 0-deployment hooks

## User Story

As a data engineer, I would like to perform metadata and schema validation automatically before committing and merging.
I would also like to easily integrate with external components like Athena and Hive Metastore.
I have neither the required permissions nor the expertise to utilize Webhooks.

## Goals

The overarching goal is to substantially decrease the friction of using hooks, driving more lakeFS users to production use-cases.

1. Provide an easy, deployment-less way to use hooks - let users provide the logic they want with minimum configuration.
1. Support both `pre` and `post` hooks:
    1. `pre` hooks used to validate and test metadata and schema
    1. `post` hooks used to automatically expose data to downstream systems
1. Increase hooks discoverability by allowing 1-click UI driven setup for pre-configured hooks

## Non-Goals

1. Unlike other proposals in the past - avoid becoming a scheduler or deployment tool for Kubernetes/Docker/...

## High Level Design

Executing an external component (such as a container, sub-process, API or HTTP endpoint) has both operational and security implications.

What we really want is the ability to execute user-supplied logic when a certain event occurs in lakeFS. Databases have been doing so for decades now,
with things like Stored Procedures and Triggers: make the database itself the runtime instead of calling out to an external component.

This proposal suggests adding an embedded, programmable interface to lakeFS so that hooks could natively be run within lakeFS itself.

Embedding provides very tight control over the capabilities provided: we expose a narrow API and programming interface, consisting only of a select set of functions and methods that can't "escape" the runtime environment.


A very common embedded language for such purposes is [Lua](https://www.lua.org/).

It is used as an embedded scripting language in many notable projects - from databases such as [Redis](https://redis.io/docs/manual/programmability/eval-intro/) and [Aerospike](https://docs.aerospike.com/server/architecture/udf) to load balancers such as [HAProxy](https://www.haproxy.com/blog/5-ways-to-extend-haproxy-with-lua/) and [NGINX](https://www.nginx.com/resources/wiki/modules/lua/) - all the way to games ([Roblox](https://create.roblox.com/docs/tutorials/scripting/basic-scripting/intro-to-scripting), [Garry's Mod](https://wiki.facepunch.com/gmod/Beginner_Tutorial_Intro) and others).

### Embedding Lua in lakeFS

Fortunately, most of the work of making Hooks pluggable has already been done. lakeFS currently supports WebHooks and Airflow `post-*` hooks so a general interface making this pluggable is already in place.

A feature-complete Lua VM in pure Go is available at [github.com/Shopify/go-lua](https://github.com/Shopify/go-lua). It allows very fine-grained control over what user-defined code is able to execute. It allows doing a few notable things that make it a good fit for hooks:

1. Provide control over available modules, including builtin ones - for example, we can simply not load the `io` or `os` modules at all, providing no access to the host environment.
1. Allow binding Go functions and structs to Lua functions and tables - we can create our own set of exposed libraries that will be made available to user-defined code.
1. An accompanying project [Shopify/goluago](https://github.com/Shopify/goluago) provides a great set of common libraries (`strings`, `regexp`, `encoding/json` and more) so that for the most part, we won't have to reinvent the wheel to provide something usable.
1. Allow injecting variables into the Lua VM's global table (`_G`) - this makes exposing metadata about the action easy and intuitive

### Defining Lua Hooks

Lua code can be supplied in-line inside the `.yaml` hook definition file, as follows:

```yaml
name: dump_all
on:
  post-commit:
  post-merge:
  pre-commit:
  pre-merge:
  pre-create-tag:
  post-create-tag:
  pre-create-branch:
  post-create-branch:
hooks:
  - id: dump_event
    type: lua
    properties:
      script: |
        json = require("encoding/json")
        print(json.marshal(action))
```

Here we're specifying 2 important properties:

- `hooks[].type = "lua"` - will cause the execution of the LuaHook type
- `hooks[].properties.script (string)` - an in-line Lua script, directly in the yaml file.

Additionally, for more complex logic, it's probably a good idea to write the code as its own dedicated object, supplying a lakeFS path instead of an inline script.
Example:

```yaml
name: auto symlink
on:
  post-create-branch:
    branches:
      - view-*
  post-commit:
    branches:
      - view-*
hooks:
  - id: symlink_creator
    type: lua
    properties:
      args:
        # Export configuration
        aws_access_key_id: "{{ENV.AWS_ACCESS_KEY_ID}}"
        aws_secret_access_key: "{{ENV.AWS_SECRET_ACCESS_KEY}}"
        aws_region: us-east-1
        # Export location
        export_bucket: athena-views
        export_path: lakefs/exposed-tables/
        # Tables to export:
        sources:
          - tables/users/
          - tables/events/
      script_path: scripts/s3_hive_manifest_exporter.lua
```

This adds the following settings:

- `hooks[].properties.script_path (string)` - a path in the same lakeFS repo for a Lua script to be executed
- `hooks[].properties.args (map[string]interface{})` - a map that will be passed down to the Lua script as a global variable called `args`.

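
To make the three `properties` keys concrete, here is a minimal Go sketch of how they might be modeled and checked before a hook runs. The `LuaHookProperties` type, its `Validate` method, and the rule that exactly one of `script` and `script_path` must be set are assumptions made for illustration; the proposal itself only names the keys and their types.

```go
package main

import (
	"errors"
	"fmt"
)

// LuaHookProperties is a hypothetical model of hooks[].properties
// for a hook whose type is "lua".
type LuaHookProperties struct {
	Script     string                 // in-line Lua source (properties.script)
	ScriptPath string                 // lakeFS path of a Lua script (properties.script_path)
	Args       map[string]interface{} // handed to the script as the global `args`
}

// Validate enforces an assumed rule: exactly one script source is given.
func (p LuaHookProperties) Validate() error {
	switch {
	case p.Script == "" && p.ScriptPath == "":
		return errors.New("lua hook: one of script or script_path is required")
	case p.Script != "" && p.ScriptPath != "":
		return errors.New("lua hook: script and script_path are mutually exclusive")
	}
	return nil
}

func main() {
	inline := LuaHookProperties{Script: `print("hello")`}
	byPath := LuaHookProperties{
		ScriptPath: "scripts/s3_hive_manifest_exporter.lua",
		Args:       map[string]interface{}{"aws_region": "us-east-1"},
	}
	both := LuaHookProperties{Script: "x = 1", ScriptPath: "scripts/a.lua"}

	fmt.Println("inline:", inline.Validate())
	fmt.Println("by path:", byPath.Validate())
	fmt.Println("both:", both.Validate())
}
```

Rejecting a definition that sets both keys is one possible choice; an implementation could equally let one key take precedence over the other.
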

### Lua API exposed to hooks

The implementation will provide the following modules:

- `_G` - a modified set of the builtin functions that a Lua runtime provides. A few functions will be removed/modified:
  - `loadfile` - removed, since we don't want free access to the local filesystem
  - `dofile` - removed, since we don't want free access to the local filesystem
  - `print` - modified to not write to `os.Stdout` but instead accept an external `bytes.Buffer` so that output reaches the action log
- `aws/s3`:
  - `get_object`
  - `put_object`
  - `list_objects`
  - `delete_object`
  - `delete_recursive`
- `crypto/aes` (taken from *goluago*):
  - `encryptCBC`
  - `decryptCBC`
- `crypto/hmac` (taken from *goluago*):
  - `signsha256`
  - `signsha1`
- `crypto/sha256` (taken from *goluago*):
  - `digest`
- `encoding/base64` (taken from *goluago*):
  - `encode`
  - `decode`
  - `urlEncode`
  - `urlDecode`
- `encoding/hex` (taken from *goluago*):
  - `encode`
  - `decode`
- `encoding/json` (taken from *goluago*):
  - `marshal`
  - `unmarshal`
- `encoding/parquet`:
  - `get_schema`
- `regexp` (taken from *goluago*):
  - `match`
  - `quotemeta`
  - `compile`
- `path`:
  - `parse`
  - `join`
  - `is_hidden`
- `strings` (taken from *goluago*):
  - `split`
  - `trim`
  - `replace`
  - `has_prefix`
  - `has_suffix`
  - `contains`
- `time` (taken from *goluago*):
  - `now`
  - `format`
  - `formatISO`
  - `sleep`
  - `since`
  - `add`
  - `parse`
  - `parseISO`
- `uuid` (taken from *goluago*):
  - `new`

Additionally, a lakeFS client, pre-authenticated as the user that performed the action, will be loaded.
This client doesn't go over the network - it generates in-process `http.Request` objects that are then passed directly into the current process's `http.Server`.

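
The in-process client idea can be sketched in Go with the standard library alone. This is only an illustration under stated assumptions: `inProcessGet`, `demoHandler`, the `/demo/objects` route, and the `X-Demo-User` header are hypothetical names, not lakeFS APIs. The request is handed to an `http.Handler` rather than a full `http.Server`, and `net/http/httptest` is used purely as a convenient in-memory `ResponseWriter`.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// demoHandler stands in for the server's real API handler (hypothetical route).
func demoHandler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/demo/objects", func(w http.ResponseWriter, r *http.Request) {
		// The identity is read from a made-up header; a real server would
		// use its own authentication mechanism.
		fmt.Fprintf(w, "objects listed for %s", r.Header.Get("X-Demo-User"))
	})
	return mux
}

// inProcessGet builds an http.Request and calls the handler directly,
// so no socket is opened and nothing crosses the network.
func inProcessGet(h http.Handler, path, user string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	req.Header.Set("X-Demo-User", user) // pre-authenticated as the acting user
	rec := httptest.NewRecorder()       // in-memory ResponseWriter
	h.ServeHTTP(rec, req)

	res := rec.Result()
	defer res.Body.Close()
	body, _ := io.ReadAll(res.Body)
	return res.StatusCode, string(body)
}

func main() {
	status, body := inProcessGet(demoHandler(), "/demo/objects", "alice")
	fmt.Println(status, body)
}
```

Because the call never leaves the process, the hook reuses the server's existing request handling, including whatever authorization checks it applies, without needing credentials or a reachable endpoint. The list that follows names the functions this client exposes to Lua scripts.
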

- `lakefs`
  - `create_tag`
  - `diff_refs`
  - `list_objects`
  - `get_object`

Other methods or calls could be added over time.

## Hooks Discoverability

This is nice to have and not necessarily part of a first release, but once we have embedded hooks, we can do the following:

1. Add a `scripts/` directory to new repositories that contains reusable Lua scripts for common use cases
1. Suggest auto-creating a hook at certain touch points:
    - on new commit modal: when adding a k/v pair, suggest adding a hook to validate the existence of this key in subsequent commits
    - when committing a change/viewing an object tree that includes parquet/orc files - automatically add a hook that validates breaking schema changes
    - when looking at a hive-partitioned table - automatically add a hook to export symlinks for Athena/Trino/...
    - when creating a branch, automatically add a hook to register tables on this branch in a metastore
    - when doing any versioning action - add a simple hook to print out information about commits/merges/branches/tags when they happen
    - when tagging, add a hook to enforce a naming convention for tags
    - when merging, add a hook to ensure formats used, schema blacklist, etc.
1. On the "Actions" tab, do a marketplace-like view of available hooks that could be added with a click (essentially the list above but centralized)

## Limitations/Downsides

1. While Lua is popular as an embedded language for server components and databases, it is not a common language overall. It is very simple but we can assume the vast majority of users will not have any prior experience with it.
1. While this solves some of the friction of having to run something external, it doesn't solve another underlying problem with hooks: long-running `pre-` actions are still tied to the lifecycle of the HTTP request. This has to be solved by another mechanism.
1. Users still need to configure the YAML files that tie together the Lua logic with the events in the system. The usability and discoverability of that are still open questions.

## Alternatives

Another option is to introduce WASM support. This theoretically allows writing hooks in many languages while getting those same guarantees.
This, however, has several limitations:

1. Users will either have to compile their code to WASM themselves, or lakeFS would have to compile code for them on the fly
2. The [WASM Specification](https://webassembly.github.io/spec/core/) has a much larger surface area: supporting it while also making sure user functions cannot "escape" the hook runtime would be harder, making this less secure
3. Support for dynamic languages in WASM is still in early stages. Most importantly, Python, which is the most common language for data engineering and ML, has only limited support. The most notable WASM compiler is [pyodide](https://github.com/pyodide/pyodide), which currently can only expose a REPL over a web interface and doesn't allow headless compilation to WASM.

Supporting Lua doesn't close the door on supporting WASM in the future as well and is much faster and simpler to implement.