# Kubernetes Job Hook Proposal

## Goals

- Provide a short-term solution that enables lakeFS to trigger job execution on Kubernetes
- Enable non-blocking execution for post-commit and post-merge
- Support successful job completion as a condition for the commit and merge operations

## Non Goals

- GitHub Actions - fully maintained, VM with pre-built images to execute pre/post hooks
- Manage or control any aspect of the Kubernetes Job

## Proposition

### Overview

The current hooks mechanism is described [here](https://docs.lakefs.io/setup/hooks.html).

Adding a side-car to the lakeFS deployment will enable the current web-hook type to trigger a new job on a Kubernetes cluster.

The user will provide a web-hook definition as part of the uploaded action yaml, which will include query parameters with the image and command to execute. The side-car uses the posted information to execute a job based on the image and command line parameters supplied in the hook.
During commit/merge, lakeFS will trigger the web-hook, which posts the request to the side-car, and the side-car will create a job to run in the cluster.

### Kubernetes Job Sidecar

In order to encapsulate the new functionality, we will use a sidecar in the lakeFS pod that will accept the web-hook requests from lakeFS and create a Kubernetes job.
A request to the side-car will include the image, the arguments, and whether to wait for the job to complete.
The side-car will issue the request to create the job in the Kubernetes cluster.
When waiting for completion, the side-car will respond to the web-hook once the job completes.
Without waiting, the side-car will respond as soon as the job-creation request to the cluster completes.
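
The request handling described above could be sketched roughly as follows. This is only an illustration, not the sidecar's implementation: the `JobRequest` type and the `parse_job_request` function are hypothetical, the parameter names (`image`, `command`, `args`, `wait_for_complete`) are the ones this proposal uses in its example action and in the Execution section, and the sketch assumes that list-valued parameters arrive as repeated query parameters.

```python
from dataclasses import dataclass, field
from urllib.parse import parse_qs, urlsplit


@dataclass
class JobRequest:
    """Hypothetical in-memory form of a single web-hook request to the sidecar."""
    image: str
    command: list = field(default_factory=list)
    args: list = field(default_factory=list)
    wait_for_complete: bool = False


def parse_job_request(url: str) -> JobRequest:
    """Extract the job parameters from the query string of the web-hook URL."""
    query = parse_qs(urlsplit(url).query)

    # Exactly one image is required; without it there is nothing to run.
    images = query.get("image", [])
    if len(images) != 1:
        raise ValueError("exactly one 'image' query parameter is required")

    # Assumption: anything other than "true" (case-insensitive) means "do not wait".
    wait = query.get("wait_for_complete", ["false"])[-1].lower() == "true"

    return JobRequest(
        image=images[0],
        command=query.get("command", []),  # repeated parameters become a list
        args=query.get("args", []),
        wait_for_complete=wait,
    )


if __name__ == "__main__":
    print(parse_job_request(
        "http://localhost:8008/job?image=myregistry/myhook:4"
        "&command=python&args=bump-version.py"
    ))
```

Defaulting `wait_for_complete` to false keeps post-commit and post-merge hooks non-blocking unless the action explicitly asks to wait.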

Using the current actions mechanism, the user will create a yaml file and upload it to the repository's `_lakefs_actions` folder.

Example of an action using the new hook definition:

```yaml
name: Branch version tagger
description: set version tag on each merge to main
on:
  post-merge:
    branches:
      - main
hooks:
  - id: update_tag
    type: webhook
    description: Create a tag based on last merge
    properties:
      url: "http://localhost:8008/job"
      query_params:
        image: "myregistry/myhook:4"
        command: ["python"]
        args: ["bump-version.py"]
      timeout: 30s
```

In this example we specified a post-merge hook that triggers job creation in the cluster using our sidecar.
The job will use the image `myregistry/myhook:4` with the command `python` and the argument `bump-version.py`.
The following environment variables will be populated by the sidecar based on the event information passed by the web hook:

```
LAKEFS_HOOK_EVENTTYPE - Type of the event that triggered the action
LAKEFS_HOOK_EVENTTIME - Time of the event that triggered the action
LAKEFS_HOOK_ACTIONNAME - Containing Hook Action's name
LAKEFS_HOOK_HOOKID - ID of the hook
LAKEFS_HOOK_REPOSITORYID - ID of the repository
LAKEFS_HOOK_BRANCHID - ID of the branch
LAKEFS_HOOK_SOURCEREF - Reference to the source that triggered the event
LAKEFS_HOOK_COMMITMESSAGE - The message for the commit
LAKEFS_HOOK_COMMITTER - Name of the committer
LAKEFS_HOOK_COMMIT_METADATA - Commit metadata (json serialized string)
```

By default, a job created by the sidecar will use the following definition as its base:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ""
  namespace: lakefs-hooks
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: hook
          image: ""
          command: []
          args: []
```

Note that the _metadata.name_, _image_, _command_, _args_ and the environment variables will be set by the sidecar.
The name will include a unique identifier specific to the job execution.

*Limit the end-user image use*

Using a configuration file read by the sidecar, we can specify a list of allowed images that the end-user can use. When `allowed_images` is populated, the sidecar will validate each request and reject any job whose image is not on the list.

```yaml
allowed_images:
  - rclone/rclone:1.57
  - alpine
```

Each item in the list matches on image name and tag. When the tag is omitted, the item matches any tag.

### Execution

Using the lakeFS web-hook we can trigger job creation on our Kubernetes cluster.
The information about the created job will be captured and returned with a successful response.
If we specify `wait_for_complete: true` as an additional query parameter, the sidecar will wait until the job status turns to complete or until the request times out based on the web-hook parameters.

Note that using `wait_for_complete` will block the web hook, which blocks the commit/merge operation, which blocks writes to the branch. During the call to commit/merge, the client request can also be timed out, usually by the load-balancer. Job execution time for blocking events should therefore be shorter than any network timeout along the request route.

### Authorizations

Based on the above, the lakeFS deployment will require the following permissions:

- `jobs` get, create and watch
- `pods` get
- `pods/log` get, list, watch

The following describes a possible `Role` that enables the above.
Note that these rules need to be added to the current set used by the lakeFS deployment; this document describes only the requirements for this feature.

```
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: default
  name: lakefs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: lakefs
  namespace: lakefs-hooks
rules:
  # Jobs belong to the "batch" API group; resource names are plural
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "create", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get"]
  # Logs are exposed as a subresource of pods
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: lakefs
  namespace: lakefs-hooks
subjects:
  - kind: ServiceAccount
    name: lakefs
    namespace: default
    apiGroup: ""
roleRef:
  # References the Role defined above; roleRef takes no namespace field
  kind: Role
  name: lakefs
  apiGroup: rbac.authorization.k8s.io
```

### Considerations

*Job lifetime* - Once a job is created and executed in the cluster, the lakeFS server will not take ownership of the object. A mechanism should be in place to clean up all jobs that lakeFS created and that have completed (successfully or not).
The [Automatic Clean-up for Finished Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/) capability, which can help with that, is currently available in Kubernetes 1.23 (which we don’t have yet on AWS, for example).

### Decision

No changes to lakeFS itself are required to enable the above, so we will keep it external. Stash the proposal under rejected.