---
title: Kubeflow
description: Easily build reproducible data pipelines with Kubeflow and lakeFS using commits, without modifying the code or logic of your job.
parent: Integrations

---
# Using lakeFS with Kubeflow pipelines
[Kubeflow](https://www.kubeflow.org/docs/about/kubeflow/) is a project dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.
A Kubeflow pipeline is a portable and scalable definition of an ML workflow composed of steps. Each step in the pipeline is an instance of a component, represented as a [ContainerOp](https://kf-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#kfp.dsl.ContainerOp).

{% include toc_2-3.html %}


## Add pipeline steps for lakeFS operations

To integrate lakeFS into your Kubeflow pipeline, you need to create Kubeflow components that perform lakeFS operations.
Currently, there are two methods to create lakeFS ContainerOps:
1. Implement a function-based ContainerOp that uses the lakeFS Python API to invoke lakeFS operations.
1. Implement a ContainerOp that uses the `lakectl` CLI Docker image to invoke lakeFS operations.

### Function-based ContainerOps

To implement a [function-based component](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/) that invokes lakeFS operations,
use the [Python OpenAPI client](python.md) that lakeFS provides. The example below demonstrates how to make the client's package available to your ContainerOp.

#### Example operations

Create a new branch: A function-based ContainerOp that creates a branch called `example-branch` based on the `main` branch of `example-repo`.

```python
from kfp import components

def create_branch(repo_name, branch_name, source_branch):
    import lakefs_client
    from lakefs_client import models
    from lakefs_client.client import LakeFSClient

    # lakeFS credentials and endpoint
    configuration = lakefs_client.Configuration()
    configuration.username = 'AKIAIOSFODNN7EXAMPLE'
    configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
    configuration.host = 'https://lakefs.example.com'
    client = LakeFSClient(configuration)

    client.branches.create_branch(repository=repo_name, branch_creation=models.BranchCreation(name=branch_name, source=source_branch))

# Convert the function to a lakeFS pipeline step.
create_branch_op = components.func_to_container_op(
    func=create_branch,
    packages_to_install=['lakefs_client==<lakeFS version>']) # Type in the lakeFS version you are using
```

You can invoke any lakeFS operation supported by the lakeFS OpenAPI. For example, you could implement commit and merge operations as function-based ContainerOps.
Check out the full API [reference](https://docs.lakefs.io/reference/api.html).

### Non-function-based ContainerOps

To implement a non-function-based ContainerOp, use the [`treeverse/lakectl`](https://hub.docker.com/r/treeverse/lakectl) Docker image.
With this image, you can run [lakectl]({% link reference/cli.md %}) commands to execute the desired lakeFS operation.

For `lakectl` to work with Kubeflow, you need to pass your lakeFS configuration as the following environment variables:

* `LAKECTL_CREDENTIALS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE`
* `LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`
* `LAKECTL_SERVER_ENDPOINT_URL: https://lakefs.example.com`

#### Example operations

1. Commit changes to a branch: A ContainerOp that commits uncommitted changes to `example-branch` on `example-repo`.

   ```python
   from kfp import dsl
   from kubernetes.client.models import V1EnvVar

   def commit_op():
       # Run `lakectl commit` in the lakectl image, configured through environment variables.
       return (
           dsl.ContainerOp(
               name='commit',
               image='treeverse/lakectl',
               arguments=['commit', 'lakefs://example-repo/example-branch', '-m', 'commit message'])
           .add_env_variable(V1EnvVar(name='LAKECTL_CREDENTIALS_ACCESS_KEY_ID', value='AKIAIOSFODNN7EXAMPLE'))
           .add_env_variable(V1EnvVar(name='LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY', value='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'))
           .add_env_variable(V1EnvVar(name='LAKECTL_SERVER_ENDPOINT_URL', value='https://lakefs.example.com')))
   ```

1. Merge two lakeFS branches: A ContainerOp that merges `example-branch` into the `main` branch of `example-repo`.

   ```python
   from kfp import dsl
   from kubernetes.client.models import V1EnvVar

   def merge_op():
       # Run `lakectl merge` from example-branch into main.
       return (
           dsl.ContainerOp(
               name='merge',
               image='treeverse/lakectl',
               arguments=['merge', 'lakefs://example-repo/example-branch', 'lakefs://example-repo/main'])
           .add_env_variable(V1EnvVar(name='LAKECTL_CREDENTIALS_ACCESS_KEY_ID', value='AKIAIOSFODNN7EXAMPLE'))
           .add_env_variable(V1EnvVar(name='LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY', value='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'))
           .add_env_variable(V1EnvVar(name='LAKECTL_SERVER_ENDPOINT_URL', value='https://lakefs.example.com')))
   ```

You can invoke any lakeFS operation supported by `lakectl` by implementing it as a ContainerOp. Check out the complete [CLI reference]({% link reference/cli.md %}) for the list of supported operations.


**Note**
The lakeFS Kubeflow integration that uses `lakectl` is supported on lakeFS version >= v0.43.0.
{: .note }

## Add the lakeFS steps to your pipeline

Add the steps created in the previous section to your pipeline before compiling it.

### Example pipeline

A pipeline that implements a simple ETL with steps for branch creation and commits.

```python
def lakectl_pipeline():
    create_branch_task = create_branch_op('example-repo', 'example-branch', 'main') # A function-based component
    extract_task = example_extract_op()
    extract_commit_task = commit_op()  # Commit the extracted data
    transform_task = example_transform_op()
    transform_commit_task = commit_op()  # Commit the transformed data
    load_task = example_load_op()
```


**Note**
It's recommended to store credentials as Kubernetes secrets and pass them as [environment variables](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-environment-variables) to Kubeflow operations using [V1EnvVarSource](https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1EnvVarSource.md).
{: .note }