# Kubernetes Job Hook Proposal

## Goals

- Provide a short-term solution that enables lakeFS to trigger job execution on Kubernetes
- Enable non-blocking execution for post-commit and post-merge
- Support successful job completion as a condition for the commit and merge operations

## Non Goals

- GitHub Actions - fully maintained, VM with pre-built images to execute pre/post hooks
- Manage or control any aspect of the Kubernetes Job

## Proposition

### Overview

The current hooks mechanism is described [here](https://docs.lakefs.io/setup/hooks.html).

Adding a sidecar to the lakeFS deployment will enable the current webhook hook type to trigger a new job on a Kubernetes cluster.

The user will provide a webhook definition as part of the uploaded action YAML, which will include query parameters with the image and command to execute. The sidecar uses the posted information to execute a job based on the image and command-line parameters supplied in the hook.
During commit/merge, lakeFS will trigger the webhook, which posts the request to the sidecar; the sidecar then creates a job to run in the cluster.

![Overview Diagram](diagrams/k8s-job-hook.png)

### Kubernetes Job Sidecar

To encapsulate the new functionality, we will use a sidecar in the lakeFS pod that accepts the webhook requests from lakeFS and creates a Kubernetes job.
A request to the sidecar will include the image, the arguments, and whether to wait for the job to complete.
The sidecar will issue the request to create the job in the Kubernetes cluster.
When waiting for completion, the sidecar will respond to the webhook once the job completes.
When not waiting, the sidecar will respond as soon as the job creation request to the cluster completes.

Using the current actions mechanism, the user will create and upload a YAML file into the repository's `_lakefs_actions` folder.

Example of an action using the new hook definition:

```yaml
name: Branch version tagger
description: set version tag on each merge to main
on:
  post-merge:
    branches:
      - main
hooks:
  - id: update_tag
    type: webhook
    description: Create a tag based on last merge
    properties:
      url: "http://localhost:8008/job"
      query_params:
        image: "myregistry/myhook:4"
        command: ["python"]
        args: ["bump-version.py"]
      timeout: 30s
```

In this example we specified a post-merge hook that triggers job creation in the cluster through our sidecar.
The job will use the image `myregistry/myhook:4` and run the command `python` with the argument `bump-version.py`.
The sidecar will populate the following environment variables based on the event information delivered by the webhook:

```
LAKEFS_HOOK_EVENTTYPE - Type of the event that triggered the action
LAKEFS_HOOK_EVENTTIME - Time of the event that triggered the action
LAKEFS_HOOK_ACTIONNAME - Containing Hook Action's name
LAKEFS_HOOK_HOOKID - ID of the hook
LAKEFS_HOOK_REPOSITORYID - ID of the repository
LAKEFS_HOOK_BRANCHID - ID of the branch
LAKEFS_HOOK_SOURCEREF - Reference to the source that triggered the event
LAKEFS_HOOK_COMMITMESSAGE - The message for the commit
LAKEFS_HOOK_COMMITTER - Name of the committer
LAKEFS_HOOK_COMMIT_METADATA - Commit metadata (json serialized string)
```
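
To illustrate how a hook script such as `bump-version.py` might consume these variables, here is a minimal sketch. The variable names come from the list above; the helper name `read_hook_context` and the shape of the returned dict are assumptions made for this example, not part of the proposal.

```python
import json
import os

# Prefix shared by all variables listed above.
HOOK_ENV_PREFIX = "LAKEFS_HOOK_"


def read_hook_context(environ=None):
    """Collect LAKEFS_HOOK_* variables into a dict keyed by the lower-cased suffix.

    Hypothetical helper: the proposal defines the variables only, not how a
    hook script should read them.
    """
    environ = os.environ if environ is None else environ
    context = {
        name[len(HOOK_ENV_PREFIX):].lower(): value
        for name, value in environ.items()
        if name.startswith(HOOK_ENV_PREFIX)
    }
    # COMMIT_METADATA is described as a JSON-serialized string; decode it if present.
    raw_metadata = context.get("commit_metadata")
    context["commit_metadata"] = json.loads(raw_metadata) if raw_metadata else {}
    return context
```

Inside the job container, calling `read_hook_context()` with no arguments reads the process environment, so a script could branch on `context["eventtype"]` or `context["branchid"]`.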

By default, a job created by the sidecar will use the following definition as the base for scheduling:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ""
  namespace: lakefs-hooks
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: hook
        image: ""
        command: []
        args: []
```

Note that _metadata.name_, _image_, _command_, _args_ and the environment variables will be set by the sidecar.
The name will include a unique identifier specific to the job execution.
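
The way the sidecar could fill in the base definition can be sketched as follows. This is only an illustration: the `build_job` and `job_name` helpers, the `<hook id>-<random suffix>` naming scheme, and the sanitizing of the hook ID are assumptions, since the proposal states only that the name includes a unique identifier.

```python
import copy
import re
import uuid

# Base definition from the proposal, expressed as a Python dict.
BASE_JOB = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "", "namespace": "lakefs-hooks"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": "hook", "image": "", "command": [], "args": []}
                ],
            }
        }
    },
}


def job_name(hook_id):
    """Build a unique name (assumed scheme: sanitized hook ID plus random suffix)."""
    # Kubernetes names allow lowercase alphanumerics and '-', so replace the rest.
    prefix = re.sub(r"[^a-z0-9-]+", "-", hook_id.lower()).strip("-") or "hook"
    suffix = uuid.uuid4().hex[:8]
    # Keep the whole name within 63 characters.
    return f"{prefix[:54].rstrip('-')}-{suffix}"


def build_job(hook_id, image, command, args, env):
    """Return a copy of BASE_JOB with the per-request fields filled in."""
    job = copy.deepcopy(BASE_JOB)
    job["metadata"]["name"] = job_name(hook_id)
    container = job["spec"]["template"]["spec"]["containers"][0]
    container["image"] = image
    container["command"] = list(command)
    container["args"] = list(args)
    container["env"] = [{"name": k, "value": v} for k, v in env.items()]
    return job
```

For the action shown earlier, `build_job("update_tag", "myregistry/myhook:4", ["python"], ["bump-version.py"], env)` would yield a manifest whose name starts with `update-tag-`.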

*Limit the end-user image use*

Using a configuration file read by the sidecar, we can specify a list of allowed images that the end user can use. When `allowed_images` is populated, the sidecar will validate each request and reject any job whose image is not in the list.

```yaml
allowed_images:
  - rclone/rclone:1.57
  - alpine
```

Each item in the images list matches on image name and tag. When an item has no tag, it matches any tag of that image.
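
The matching rule can be sketched as below. The function names are hypothetical, and the sketch assumes plain `name[:tag]` references; digest references such as `name@sha256:...` are not handled.

```python
def split_image(reference):
    """Split 'registry:5000/repo/name:tag' into (name, tag); tag is None if absent."""
    # Only a colon after the last '/' separates a tag; an earlier colon
    # belongs to a registry port.
    last_segment_start = reference.rfind("/") + 1
    colon = reference.find(":", last_segment_start)
    if colon == -1:
        return reference, None
    return reference[:colon], reference[colon + 1:]


def is_image_allowed(image, allowed_images):
    """Apply the allow-list rule: untagged entries match any tag of that image."""
    if not allowed_images:
        # The list is only enforced when it is populated.
        return True
    name, tag = split_image(image)
    for entry in allowed_images:
        allowed_name, allowed_tag = split_image(entry)
        if name == allowed_name and (allowed_tag is None or tag == allowed_tag):
            return True
    return False
```

With the configuration above, `rclone/rclone:1.57` and `alpine:3.19` would be accepted, while `rclone/rclone:1.58` would be rejected.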

### Execution

Using the lakeFS webhook we can trigger job creation on our Kubernetes cluster.
The information about the created job will be captured and returned with a success response.
If we specify `wait_for_complete: true` as an additional query parameter, the sidecar will wait until the job status turns to complete or the request times out based on the webhook parameters.
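
As a rough sketch of how the sidecar might decode such a request, the example below parses the query string into a small request object. It assumes that list-valued parameters (`command`, `args`) arrive as repeated query keys and that `wait_for_complete` is sent as the string `true`; the proposal does not specify the wire format, and `JobRequest` and `parse_job_request` are hypothetical names.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlsplit


@dataclass
class JobRequest:
    image: str
    command: list
    args: list
    wait_for_complete: bool


def parse_job_request(url):
    """Extract the job parameters from the webhook request URL."""
    query = parse_qs(urlsplit(url).query)
    images = query.get("image", [])
    if len(images) != 1:
        raise ValueError("exactly one 'image' query parameter is required")
    # Default to non-blocking behavior when the parameter is absent.
    wait = query.get("wait_for_complete", ["false"])[-1].lower() == "true"
    return JobRequest(
        image=images[0],
        command=query.get("command", []),
        args=query.get("args", []),
        wait_for_complete=wait,
    )
```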

Note that using `wait_for_complete` will block the webhook, which blocks the commit/merge operation, which in turn blocks writes to the branch. During the call to commit/merge, the client request can also be timed out by the load balancer. Job execution time for blocking events should therefore be shorter than any network timeout along the request route.
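
A deadline-bounded wait of this kind can be sketched as a polling loop. The sketch assumes the sidecar can fetch the Job's `status` as a dict through some `fetch_status` callable (hypothetical) and relies on the `Complete` and `Failed` condition types that Kubernetes reports for Jobs; a real sidecar might use a watch instead of polling.

```python
import time


def job_outcome(job_status):
    """Map a Job status dict to 'succeeded', 'failed', or 'running'."""
    for condition in job_status.get("conditions") or []:
        if condition.get("status") != "True":
            continue
        if condition.get("type") == "Complete":
            return "succeeded"
        if condition.get("type") == "Failed":
            return "failed"
    return "running"


def wait_for_job(fetch_status, timeout_seconds, poll_seconds=1.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll until the job finishes or the deadline passes.

    Returns 'succeeded', 'failed', or 'timed_out'.
    """
    deadline = clock() + timeout_seconds
    while True:
        outcome = job_outcome(fetch_status())
        if outcome != "running":
            return outcome
        if clock() >= deadline:
            return "timed_out"
        # Never sleep past the deadline.
        sleep(min(poll_seconds, max(0.0, deadline - clock())))
```

Choosing `timeout_seconds` below the webhook `timeout` (30s in the example action) and below any load-balancer timeout lets the sidecar answer before the caller gives up.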

### Authorizations

Based on the above, the lakeFS deployment will require the following permissions:

- `job` get, create and watch
- `pod` get
- `pod/log` get, list, watch

The following describes a possible `Role` that enables the above.
Note that these rules need to be added to the current set used by the lakeFS deployment; this document describes only the requirements for this feature.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: default
  name: lakefs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: lakefs
  namespace: lakefs-hooks
rules:
  # Jobs live in the "batch" API group; RBAC resource names are plural.
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "create", "watch"]
  # Pods and the pod log subresource live in the core ("") API group.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: lakefs
  namespace: lakefs-hooks
subjects:
  - kind: ServiceAccount
    name: lakefs
    namespace: default
    apiGroup: ""
roleRef:
  # Refers to the namespaced Role defined above; roleRef has no namespace field.
  kind: Role
  name: lakefs
  apiGroup: rbac.authorization.k8s.io
```

### Considerations

*Job lifetime* - Once a job is created and executed in the cluster, the lakeFS server will not take ownership of the object. A mechanism should be in place to clean up all jobs that lakeFS created and that have completed (successfully or not).
The [Automatic Clean-up for Finished Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/) capability, currently found in Kubernetes 1.23 (which we don't have yet on AWS, for example), can help with that.
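
On clusters that support this capability, the clean-up is requested through the Job's `spec.ttlSecondsAfterFinished` field. A minimal sketch of how the sidecar could set it on a manifest before submission follows; the `with_ttl` helper and the idea of making the TTL configurable are assumptions, not part of the proposal.

```python
import copy


def with_ttl(job, seconds):
    """Return a copy of a Job manifest that is deleted `seconds` after it finishes."""
    if seconds < 0:
        raise ValueError("TTL must be non-negative")
    updated = copy.deepcopy(job)
    # The field applies to Jobs that finish either as Complete or as Failed.
    updated.setdefault("spec", {})["ttlSecondsAfterFinished"] = int(seconds)
    return updated
```

On clusters without this capability the field has no effect, so a separate clean-up mechanism would still be needed there.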

### Decision

No changes to lakeFS itself are required to enable the above, so we will keep it external. Stash the proposal under rejected.