---
title: Kubeflow
description: Easily build reproducible data pipelines with Kubeflow and lakeFS using commits, without modifying the code or logic of your job.
parent: Integrations

---
# Using lakeFS with Kubeflow pipelines
[Kubeflow](https://www.kubeflow.org/docs/about/kubeflow/) is a project dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.
A Kubeflow pipeline is a portable and scalable definition of an ML workflow composed of steps. Each step in the pipeline is an instance of a component, represented as a [ContainerOp](https://kf-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#kfp.dsl.ContainerOp).

{% include toc_2-3.html %}

## Add pipeline steps for lakeFS operations

To integrate lakeFS into your Kubeflow pipeline, you need to create Kubeflow components that perform lakeFS operations.
Currently, there are two methods to create lakeFS ContainerOps:
1. Implement a function-based ContainerOp that uses the lakeFS Python API to invoke lakeFS operations.
1. Implement a ContainerOp that uses the `lakectl` CLI Docker image to invoke lakeFS operations.

### Function-based ContainerOps

To implement a [function-based component](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/) that invokes lakeFS operations,
use the [Python OpenAPI client](python.md) that lakeFS provides. The example below shows how to make the client's package available to your ContainerOp.

#### Example operations

Create a new branch: A function-based ContainerOp that creates a branch from a source branch, for example `example-branch` based on the `main` branch of `example-repo`.

```python
from kfp import components

def create_branch(repo_name, branch_name, source_branch):
   # Imports live inside the function because Kubeflow runs its body in a separate container.
   import lakefs_client
   from lakefs_client import models
   from lakefs_client.client import LakeFSClient

   # lakeFS credentials and endpoint
   configuration = lakefs_client.Configuration()
   configuration.username = 'AKIAIOSFODNN7EXAMPLE'
   configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
   configuration.host = 'https://lakefs.example.com'
   client = LakeFSClient(configuration)

   client.branches.create_branch(
      repository=repo_name,
      branch_creation=models.BranchCreation(name=branch_name, source=source_branch))

# Convert the function to a lakeFS pipeline step.
create_branch_op = components.func_to_container_op(
   func=create_branch,
   packages_to_install=['lakefs_client==<lakeFS version>']) # Type in the lakeFS version you are using
```

You can invoke any lakeFS operation supported by the lakeFS OpenAPI. For example, you could implement commit and merge operations as function-based ContainerOps.
Check out the full API [reference](https://docs.lakefs.io/reference/api.html).
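
As an illustration of the commit and merge operations mentioned above, the following is a minimal sketch rather than an official example. It assumes the same `lakefs_client` package and placeholder credentials as the branch-creation example, and that the installed client version exposes `client.commits.commit` and `client.refs.merge_into_branch`; check both against the API reference for your lakeFS version. The names `commit_changes`, `merge_branch`, and `build_lakefs_ops` are hypothetical.

```python
def commit_changes(repo_name: str, branch_name: str, message: str):
    # Imports stay inside the function: Kubeflow runs the body in its own container.
    import lakefs_client
    from lakefs_client import models
    from lakefs_client.client import LakeFSClient

    # Placeholder credentials and endpoint, as in the branch-creation example.
    configuration = lakefs_client.Configuration()
    configuration.username = 'AKIAIOSFODNN7EXAMPLE'
    configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
    configuration.host = 'https://lakefs.example.com'
    client = LakeFSClient(configuration)

    # Commit the uncommitted changes on the branch.
    client.commits.commit(
        repository=repo_name,
        branch=branch_name,
        commit_creation=models.CommitCreation(message=message))


def merge_branch(repo_name: str, source_ref: str, destination_branch: str):
    import lakefs_client
    from lakefs_client.client import LakeFSClient

    configuration = lakefs_client.Configuration()
    configuration.username = 'AKIAIOSFODNN7EXAMPLE'
    configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
    configuration.host = 'https://lakefs.example.com'
    client = LakeFSClient(configuration)

    # Merge source_ref into destination_branch.
    client.refs.merge_into_branch(
        repository=repo_name,
        source_ref=source_ref,
        destination_branch=destination_branch)


def build_lakefs_ops(lakefs_version: str):
    """Convert both functions to pipeline steps (needs the kfp package)."""
    from kfp import components
    packages = [f'lakefs_client=={lakefs_version}']
    commit_func_op = components.func_to_container_op(
        func=commit_changes, packages_to_install=packages)
    merge_func_op = components.func_to_container_op(
        func=merge_branch, packages_to_install=packages)
    return commit_func_op, merge_func_op
```

Each function repeats the client setup because a function-based component must be self-contained: helpers defined outside the function body are not available inside the container.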

### Non-function-based ContainerOps

To implement a non-function-based ContainerOp, use the [`treeverse/lakectl`](https://hub.docker.com/r/treeverse/lakectl) Docker image.
With this image, you can run [lakectl]({% link reference/cli.md %}) commands to execute the desired lakeFS operation.

For `lakectl` to work with Kubeflow, pass your lakeFS configuration as the following environment variables:

* `LAKECTL_CREDENTIALS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE`
* `LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`
* `LAKECTL_SERVER_ENDPOINT_URL: https://lakefs.example.com`
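
Since every `lakectl` step needs the same three settings, it can help to collect and validate them once. The following is a minimal sketch, not part of lakeFS or Kubeflow: `lakectl_env` is a hypothetical helper, and it assumes the values are available in the environment of the process that builds the pipeline. The variable names are the ones used by the example ContainerOps on this page.

```python
import os

# Environment variable names that lakectl reads its configuration from.
LAKECTL_ENV_NAMES = (
    'LAKECTL_CREDENTIALS_ACCESS_KEY_ID',
    'LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY',
    'LAKECTL_SERVER_ENDPOINT_URL',
)


def lakectl_env(environ=None):
    """Collect the lakectl settings from a mapping (defaults to os.environ)."""
    environ = os.environ if environ is None else environ
    missing = [name for name in LAKECTL_ENV_NAMES if not environ.get(name)]
    if missing:
        # Fail early instead of producing a step that cannot authenticate.
        raise KeyError('missing lakectl settings: ' + ', '.join(missing))
    return {name: environ[name] for name in LAKECTL_ENV_NAMES}
```

Each entry of the returned dictionary can then be attached to a ContainerOp with `add_env_variable(V1EnvVar(name=..., value=...))`.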

#### Example operations

1. Commit changes to a branch: A ContainerOp that commits uncommitted changes to `example-branch` on `example-repo`.

   ```python
   from kfp import dsl
   from kubernetes.client.models import V1EnvVar

   def commit_op():
      return dsl.ContainerOp(
         name='commit',
         image='treeverse/lakectl',
         arguments=['commit', 'lakefs://example-repo/example-branch', '-m', 'commit message']
      ).add_env_variable(V1EnvVar(name='LAKECTL_CREDENTIALS_ACCESS_KEY_ID', value='AKIAIOSFODNN7EXAMPLE')
      ).add_env_variable(V1EnvVar(name='LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY', value='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY')
      ).add_env_variable(V1EnvVar(name='LAKECTL_SERVER_ENDPOINT_URL', value='https://lakefs.example.com'))
   ```

1. Merge two lakeFS branches: A ContainerOp that merges `example-branch` into the `main` branch of `example-repo`.

   ```python
   # Uses the same dsl and V1EnvVar imports as the commit example.
   def merge_op():
      return dsl.ContainerOp(
         name='merge',
         image='treeverse/lakectl',
         arguments=['merge', 'lakefs://example-repo/example-branch', 'lakefs://example-repo/main']
      ).add_env_variable(V1EnvVar(name='LAKECTL_CREDENTIALS_ACCESS_KEY_ID', value='AKIAIOSFODNN7EXAMPLE')
      ).add_env_variable(V1EnvVar(name='LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY', value='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY')
      ).add_env_variable(V1EnvVar(name='LAKECTL_SERVER_ENDPOINT_URL', value='https://lakefs.example.com'))
   ```

You can invoke any lakeFS operation supported by `lakectl` by implementing it as a ContainerOp. Check out the complete [CLI reference]({% link reference/cli.md %}) for the list of supported operations.
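
Because the commit and merge steps differ only in their name and arguments, one way to cover further `lakectl` commands is a small factory. The sketch below is illustrative only: `lakefs_uri`, `create_branch_args`, and `lakectl_op` are hypothetical helpers, `env` is assumed to be a dictionary holding the three `LAKECTL_*` settings, and the `branch create` arguments should be checked against the CLI reference for your lakeFS version.

```python
def lakefs_uri(repo, ref):
    """Format a lakeFS URI such as lakefs://example-repo/main."""
    return f'lakefs://{repo}/{ref}'


def create_branch_args(repo, branch, source_branch):
    """Arguments for `lakectl branch create <branch URI> --source <source URI>`."""
    return ['branch', 'create', lakefs_uri(repo, branch),
            '--source', lakefs_uri(repo, source_branch)]


def lakectl_op(name, arguments, env):
    """Build a ContainerOp that runs lakectl with the given arguments."""
    # Imported lazily so the pure helpers above work without kfp installed.
    from kfp import dsl
    from kubernetes.client.models import V1EnvVar

    op = dsl.ContainerOp(name=name, image='treeverse/lakectl', arguments=arguments)
    for key, value in env.items():
        op = op.add_env_variable(V1EnvVar(name=key, value=value))
    return op
```

With these helpers, `lakectl_op('create-branch', create_branch_args('example-repo', 'example-branch', 'main'), env)` would return a step that creates `example-branch` from `main`.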

**Note**
The lakeFS Kubeflow integration that uses `lakectl` is supported on lakeFS v0.43.0 and later.
{: .note }

## Add the lakeFS steps to your pipeline

Add the steps created in the previous section to your pipeline before compiling it.

### Example pipeline

A pipeline that implements a simple ETL with steps for branch creation and commits.

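```python
def lakectl_pipeline():
   # The steps exchange no data, so .after() is used to enforce the execution order.
   create_branch_task = create_branch_op('example-repo', 'example-branch', 'main') # A function-based component
   extract_task = example_extract_op().after(create_branch_task)
   commit_extract_task = commit_op().after(extract_task)
   transform_task = example_transform_op().after(commit_extract_task)
   commit_transform_task = commit_op().after(transform_task)
   load_task = example_load_op().after(commit_transform_task)
```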

**Note**
It's recommended to store credentials as Kubernetes secrets and pass them as [environment variables](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-environment-variables) to Kubeflow operations using [V1EnvVarSource](https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1EnvVarSource.md).
{: .note }
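
The recommendation above can be sketched as follows. This is a minimal example under stated assumptions: a Kubernetes Secret named `lakefs-credentials` with keys `access_key_id` and `secret_access_key` is hypothetical and must already exist in the namespace where the pipeline runs, and `secret_env_vars` and `add_env_vars` are illustrative helper names.

```python
# Hypothetical Secret name and keys; adjust them to match your cluster.
LAKEFS_SECRET_NAME = 'lakefs-credentials'
SECRET_KEYS = {
    'LAKECTL_CREDENTIALS_ACCESS_KEY_ID': 'access_key_id',
    'LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY': 'secret_access_key',
}


def secret_env_vars(secret_name=LAKEFS_SECRET_NAME, keys=SECRET_KEYS):
    """Build V1EnvVar objects whose values are read from a Kubernetes Secret."""
    # Imported lazily so this module loads without the kubernetes package.
    from kubernetes.client.models import V1EnvVar, V1EnvVarSource, V1SecretKeySelector
    return [
        V1EnvVar(
            name=env_name,
            value_from=V1EnvVarSource(
                secret_key_ref=V1SecretKeySelector(name=secret_name, key=secret_key)))
        for env_name, secret_key in keys.items()
    ]


def add_env_vars(op, env_vars):
    """Attach each environment variable to a ContainerOp and return the op."""
    for env_var in env_vars:
        op = op.add_env_variable(env_var)
    return op
```

For example, `add_env_vars(op, secret_env_vars())` attaches both credentials to a `lakectl` ContainerOp `op` without embedding their values in the pipeline definition. The endpoint URL is not sensitive and can still be passed as a plain `V1EnvVar` value.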