
---
layout: default
title: CI/CD for Data Lakes
description: In this tutorial, we will explore how to use lakeFS to build a CI/CD pipeline for data lakes.
parent: Use Cases
grand_parent: Understanding lakeFS
redirect_from:
   - /use_cases/cicd_for_data.html
   - /usecases/ci.html
   - /usecases/cd.html
---
    12  
# CI/CD for Data

{% include toc.html %}
    16  
## Why do I need CI/CD?

Data pipelines feed processed data from data lakes to downstream consumers like business dashboards and machine learning models. As more and more organizations rely on data to make business-critical decisions, data reliability and trust are of paramount concern. Thus, it's important to ensure that production data adheres to an organization's data governance policies. These requirements can be as simple as a **file format validation or schema check, or as exhaustive as PII (Personally Identifiable Information) removal** from all of an organization's data.

Thus, to ensure quality and reliability at each stage of the data lifecycle, data quality gates need to be implemented. That is, we need to run **Continuous Integration (CI)** tests on the data, and only if the data governance requirements are met can the data be promoted to production for business use.

Every time there is an update to production data, the best practice is to run CI tests and then promote (deploy) the data to production.
    24  
## How do I implement CI/CD for data with lakeFS?

lakeFS makes implementing CI/CD pipelines for data simpler. lakeFS provides a feature called hooks that allows automating checks and validations of data on lakeFS branches. These checks can be triggered by certain data operations, such as committing or merging.

Functionally, lakeFS hooks are similar to Git hooks. lakeFS hooks run remotely on a server, and they are guaranteed to run when the appropriate event is triggered.

Here are some examples of the events lakeFS hooks support:
* pre-merge
* pre-commit
* post-merge
* post-commit
* pre-create-branch
* post-create-branch

and so on.

By leveraging the pre-commit and pre-merge hooks with lakeFS, you can implement CI/CD pipelines on your data lakes.

Specific trigger rules, quality checks, and the branch on which the rules are to be applied are declared in an `actions.yaml` file. When a specific event (say, pre-merge) occurs, lakeFS runs all the validations declared in the `actions.yaml` file. If a validation fails, the merge event is blocked.

Here is a sample `actions.yaml` file with a pre-merge hook configured to allow only Parquet and Delta Lake file formats on the main branch.
    46  
```yaml
name: ParquetOnlyInProduction
description: This webhook ensures that only parquet files are written under production/
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: production_format_validator
    type: webhook
    description: Validate file formats
    properties:
      url: "http://lakefs-hooks:5001/webhooks/format"
      query_params:
        allow: ["parquet", "delta_lake"]
        prefix: analytics/
```
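When the pre-merge event fires, lakeFS sends an HTTP POST with a JSON body describing the event to the configured `url`, and a non-2xx response blocks the merge. As a rough sketch of how a webhook might read that body (the field names below are illustrative assumptions; consult the lakeFS hooks documentation for the exact payload schema of your version):

```python
import json

# Illustrative example of the kind of JSON body a lakeFS webhook receives.
# Field names are assumptions for demonstration purposes only.
raw_event = """
{
  "event_type": "pre-merge",
  "action_name": "ParquetOnlyInProduction",
  "hook_id": "production_format_validator",
  "repository_id": "example-repo",
  "branch_id": "main",
  "source_ref": "dev"
}
"""

def summarize_event(body: str) -> str:
    """Return a one-line summary of a hook event payload."""
    event = json.loads(body)
    return (f"{event['event_type']} on {event['repository_id']}: "
            f"{event['source_ref']} -> {event['branch_id']}")

print(summarize_event(raw_event))
# -> pre-merge on example-repo: dev -> main
```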
    64  
## Using hooks as data quality gates

Hooks run on a remote server that can serve HTTP requests from the lakeFS server. lakeFS supports two types of hooks:
1. Webhooks (run remotely on a web server, e.g. a Flask server in Python)
2. Airflow hooks (a DAG of complex data quality checks/tasks that run on an Airflow server)

In this tutorial, we will show how to use webhooks (a Python Flask web server) to implement quality gates on your data branches. Specifically, we will configure hooks to allow only Parquet and Delta Lake format files in the main branch.

The tutorial provides a lakeFS environment, a Python Flask server, a Jupyter notebook, and sample data sets to demonstrate the integration of lakeFS hooks with Apache Spark and Python. It runs on Docker Compose.

To understand how hooks work and how to configure hooks in your production system, refer to the documentation: [Hooks][data-quality-gates].
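The core of such a webhook is a small check over the changed object paths: the merge is rejected unless every object under the configured prefix has an allowed format. Here is a minimal sketch of that logic using only the Python standard library; the function and constant names are hypothetical, and the real lakeFS-hooks Flask server implements this differently (for instance, mapping Delta Lake files by extension is a simplification made here for illustration):

```python
# Hypothetical sketch of a format-validation check like the one the
# lakeFS-hooks webhook performs. In a webhook, a non-empty result would
# translate into a non-2xx HTTP response, blocking the pre-merge event.
ALLOWED = {"parquet", "delta_lake"}
PREFIX = "analytics/"

# Map file extensions to the logical format names used in actions.yaml.
EXTENSION_TO_FORMAT = {
    ".parquet": "parquet",
    ".delta": "delta_lake",  # assumption: Delta files identified by extension
}

def check_paths(paths, allow=ALLOWED, prefix=PREFIX):
    """Return the list of offending paths; an empty list means the check passes."""
    offending = []
    for path in paths:
        if not path.startswith(prefix):
            continue  # only objects under the configured prefix are validated
        fmt = next((f for ext, f in EXTENSION_TO_FORMAT.items()
                    if path.endswith(ext)), None)
        if fmt not in allow:
            offending.append(path)
    return offending

print(check_paths(["analytics/a.parquet", "analytics/b.csv", "raw/c.csv"]))
# -> ['analytics/b.csv']  (raw/c.csv is outside the prefix, so it is ignored)
```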
    76  
![lakeFS hooks - Promotion workflow]({{ site.baseurl }}/assets/img/promotion_workflow.png)

Follow the steps below to try out CI/CD for data lakes.
    80  
## Implementing a CI/CD pipeline with lakeFS - Demo

The sample below provides a lakeFS environment, a Jupyter notebook, and a server on which the lakeFS webhooks run.

### Prerequisites & Setup

Before we get started, make sure [Docker](https://docs.docker.com/engine/install/) is installed on your machine.

* Start by cloning the lakeFS samples Git repository:

    ```bash
    git clone https://github.com/treeverse/lakeFS-samples.git
    cd lakeFS-samples
    ```

* Run the following commands to start the components:

    ```bash
    git submodule init
    git submodule update
    docker compose up
    ```

Open the [local Jupyter notebook](http://localhost:8888) and go to the `hooks-demo.ipynb` notebook.
   107  
## Resources

To explore different checks and validations on your data, refer to the [pre-built hooks config](https://github.com/treeverse/lakeFS-hooks#included-webhooks) published by the lakeFS team.

To understand the comprehensive list of hooks supported by lakeFS, refer to the [documentation](https://github.com/treeverse/lakeFS-hooks).


[data-quality-gates]:  {% link understand/use_cases/cicd_for_data.md %}#using-hooks-as-data-quality-gates