---
title: ETL Testing Environment
description: In this tutorial, we will explore how to safely run ETL testing using lakeFS to create isolated dev/test data environments to run data pipelines.
parent: Use Cases
grand_parent: Understanding lakeFS
redirect_from:
    - /use_cases/etl_testing.html
    - /use_cases/iso_env.html
    - /quickstart/iso_env.html
---

# ETL Testing with Isolated Dev/Test Environments

{% include toc.html %}

## Why are multiple environments so important?

When working with a data lake, it's useful to have replicas of your production environment. These replicas allow you to test changes to your ETL pipelines and understand their effect on your data without impacting the consumers of the production data.

Running ETL and transformation jobs directly in production without proper ETL testing presents a huge risk of data issues flowing into dashboards, ML models, and other consumers sooner or later.

The most common approach to avoiding changes directly in production is to create and maintain multiple data environments and perform ETL testing on them. Dev environments give you a space in which to develop data pipelines, and test environments are where pipeline changes are validated before being pushed to production.

Without lakeFS, the challenge with this approach is that it can be time-consuming and costly to maintain separate dev/test environments that enable thorough, effective ETL testing. For larger teams it also forces multiple people to share these environments, requiring significant coordination. Depending on the size of the data involved, there can also be high costs due to the duplication of data.

## How does lakeFS help with Dev/Test environments?

lakeFS makes creating isolated dev/test environments for ETL testing quick and cheap. lakeFS uses zero-copy branching, which means that there is no duplication of data when you create a new environment. This frees you from spending time on environment maintenance and makes it possible to create as many environments as needed.

In a lakeFS repository, data is always located on a `branch`. You can think of each `branch` in lakeFS as its own environment. This is because branches are isolated, meaning changes on one branch have no effect on other branches.

Objects that remain unchanged between two branches are not copied, but rather shared by both branches via metadata pointers that lakeFS manages. If you make a change on one branch and want it reflected on another, you can perform a `merge` operation to update one branch with the changes from another.
{: .note }

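For example, promoting a tested branch back into `main` is a single merge call. The following is a minimal sketch using the Python `lakefs_client` package configured later in this tutorial; `merge_into_branch` belongs to the generated API, so the exact method name and arguments may vary slightly between client versions.

```python
# Sketch: merge a test branch back into main once its changes are validated.
# Assumes `client` is a configured LakeFSClient (see the Python client setup
# later in this tutorial); API names may differ between client versions.
client.refs.merge_into_branch(
    repository="my-repo",
    source_ref="test-env",
    destination_branch="main")
```
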
## Using branches as development and testing environments

The key difference when using lakeFS for isolated data environments is that you can create them immediately before testing a change. And once new data is merged into production, you can delete the branch - effectively deleting the old environment.

This is different from creating a long-lived test environment used as a staging area to test all updates. With lakeFS, **we create a new branch for each change to production** that we want to make. One benefit of this is the ability to test multiple changes at the same time.

![dev/test branches as environments]({{ site.baseurl }}/assets/img/iso_env_dev_test_branching.png)

## Try it out! Creating Dev/Test Environments with lakeFS for ETL Testing

lakeFS supports a UI, a CLI (the `lakectl` command-line utility), and several clients for the [API]({% link reference/api.md %}) to run its Git-like operations. Let us explore how to create dev/test environments using each of these options below.

There are two ways that you can try out lakeFS:

* The lakeFS Playground on lakeFS Cloud - fully managed lakeFS with a 30-day free trial
* Local Docker-based [quickstart]({% link quickstart/index.md %}) and [samples](https://github.com/treeverse/lakeFS-samples/)

You can also [deploy lakeFS]({% link howto/deploy/index.md %}) locally or self-managed on your cloud of choice.

### Using lakeFS Playground on lakeFS Cloud

In this tutorial, we will use [a lakeFS playground environment](https://lakefs.cloud/) to create dev/test data environments for ETL testing. This allows you to spin up a lakeFS instance in a click, create different data environments by simply branching out of your data repository, and develop & test data pipelines in these isolated branches.

First, let us spin up a [playground](https://lakefs.cloud/) instance. Once you have a live environment, log in to your instance with your access and secret keys. Then, you can work with the sample data repository `my-repo` that is created for you.

![sample repository]({{ site.baseurl }}/assets/img/iso_env_myrepo.png)

Click on `my-repo` and notice that by default, the repo has a `main` branch created and `sample_data` preloaded to work with.

![main branch]({{ site.baseurl }}/assets/img/iso_env_sampledata.png)

You can create a new branch (say, `test-env`) by going to the _Branches_ tab and clicking _Create Branch_. Once it is successful, you will see two branches under the repo: `main` and `test-env`.

![test-env branch]({{ site.baseurl }}/assets/img/iso_env_testenv_branch.png)

Now you can add, modify, or delete objects under the `test-env` branch without affecting the data in the `main` branch.

### Trying out lakeFS with Docker and Jupyter Notebooks

This use case shows how to create dev/test data environments for ETL testing using lakeFS branches. The following tutorial provides a lakeFS environment, a Jupyter notebook, and the Python `lakefs_client` API to demonstrate the integration of lakeFS with [Spark]({% link integrations/spark.md %}). You can run this tutorial on your local machine.

Follow the tutorial video below to get started with the playground and Jupyter notebook, or follow the instructions on this page.

<iframe width="420" height="315" src="https://www.youtube.com/embed/fprpDZ96JQo"></iframe>

#### Prerequisites

Before getting started, you will need [Docker](https://docs.docker.com/engine/install/) installed on your machine.

#### Running lakeFS and Jupyter Notebooks

Follow the steps below to create a dev/test environment with lakeFS.

* Start by cloning the lakeFS samples Git repository:

    ```bash
    git clone https://github.com/treeverse/lakeFS-samples.git
    cd lakeFS-samples
    ```

* Run the following commands to download and run the Docker container, which includes Python, Spark, Jupyter Notebook, JDK, Hadoop binaries, the lakeFS Python client, and Airflow (the Docker image size is around 4.5GB):

    ```bash
    git submodule init
    git submodule update
    docker compose up
    ```

Open the [local Jupyter Notebook](http://localhost:8888) and go to the `spark-demo.ipynb` notebook.

#### Configuring lakeFS Python Client

Set up the access credentials for the running lakeFS instance. The defaults used by the samples repository's Docker Compose setup are shown here:

```python
lakefsAccessKey = 'AKIAIOSFODNN7EXAMPLE'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
lakefsEndPoint = 'http://lakefs:8000'
```

Next, set the storage namespace to a location in the bucket you have configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

```python
storageNamespace = 's3://example/'
```

You can use lakeFS through the UI, the API, or the `lakectl` command line. For this use case, we use the Python `lakefs_client` package to run lakeFS core operations.

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

client = LakeFSClient(configuration)
```

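If you are not working against the preconfigured `my-repo` repository, you can create it with the client using the storage namespace defined above. This is only a sketch; `create_repository` and `models.RepositoryCreation` are part of the same generated `lakefs_client` API, so check your client version for the exact names.

```python
# Optional sketch: create the example repository using the storage namespace above.
# Skip this step if `my-repo` already exists in your lakeFS instance.
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name="my-repo",
        storage_namespace=storageNamespace,
        default_branch="main"))
```
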
lakeFS can be configured to work with Spark in two ways:
* Access lakeFS using the [S3-compatible API][spark-s3a] (a minimal configuration sketch is shown after this list)
* Access lakeFS using the [lakeFS-specific Hadoop FileSystem][hadoopfs]

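For the S3-compatible API, Spark's S3A filesystem must be pointed at the lakeFS endpoint and credentials. The samples container already ships preconfigured, so the following is only an illustrative sketch of the relevant settings; see the [Spark integration][spark-s3a] page for the authoritative configuration.

```python
# Illustrative sketch: point Spark's S3A filesystem at the lakeFS endpoint.
# In the samples container this is already configured for you.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", lakefsEndPoint)
hadoop_conf.set("fs.s3a.access.key", lakefsAccessKey)
hadoop_conf.set("fs.s3a.secret.key", lakefsSecretKey)
hadoop_conf.set("fs.s3a.path.style.access", "true")
```
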
#### Upload the Sample Data to Main Branch

To upload an object to `my-repo`, use the following code.

```python
fileName = 'lakefs_test.csv'
contentToUpload = open('/data/lakefs_test.csv', 'rb')  # open the local file to upload
client.objects.upload_object(
    repository="my-repo",
    branch="main",
    path=fileName, content=contentToUpload)
```

Once uploaded, commit the changes to the `main` branch and attach some metadata to the commit as well.

```python
client.commits.commit(
    repository="my-repo",
    branch="main",
    commit_creation=models.CommitCreation(
        message='Added my first object!',
        metadata={'using': 'python_api'}))
```

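To confirm the commit landed on `main`, you can read back the branch's commit log. A minimal sketch; `log_commits` belongs to the generated `lakefs_client` refs API, so the exact name and parameters may vary between client versions.

```python
# Sketch: list recent commits on main to verify the upload was committed.
commits = client.refs.log_commits(repository="my-repo", ref="main", amount=5)
for c in commits.results:
    print(c.id, c.message)
```
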
In this example, we use the lakeFS S3A gateway to read data from the storage bucket.

```python
dataPath = "s3a://my-repo/main/lakefs_test.csv"
df = spark.read.csv(dataPath)
df.show()
```

#### Create a Test Branch

Let us start by creating a new branch `test-env` on the example repo `my-repo`.

```python
client.branches.create_branch(
    repository="my-repo",
    branch_creation=models.BranchCreation(
        name="test-env",
        source="main"))
```

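As a quick check, you can list the repository's branches and confirm that `test-env` now exists alongside `main`. This is only a sketch; `list_branches` is part of the generated branches API, so field names may differ between client versions.

```python
# Sketch: list branches to confirm test-env was created.
branches = client.branches.list_branches(repository="my-repo")
print([b.id for b in branches.results])
```
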
Now we can use Spark to write the CSV file from the `main` branch as a Parquet file to the `test-env` branch of our lakeFS repo. Suppose we then accidentally write the DataFrame to the `test-env` branch again, this time in append mode.

```python
df.write.mode('overwrite').parquet('s3a://my-repo/test-env/')
df.write.mode('append').parquet('s3a://my-repo/test-env/')
```

What happens if we re-read the data on both branches and perform a count on the resulting DataFrames?
There will be twice as many rows in the `test-env` branch. That is, we accidentally duplicated our data! Oh no!

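A minimal sketch of that comparison, using the same `s3a://<repo>/<branch>/` path layout as above:

```python
# Re-read the data from both branches and compare row counts.
main_df = spark.read.csv("s3a://my-repo/main/lakefs_test.csv")   # original CSV on main
test_df = spark.read.parquet("s3a://my-repo/test-env/")          # Parquet written twice on test-env

print("main rows:    ", main_df.count())
print("test-env rows:", test_df.count())  # twice the main count after the duplicate write
```
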
Data duplication introduces errors into our data analytics, BI, and machine learning efforts; hence we would like to avoid duplicating our data.

On the `main` branch, however, there is still just the original data - untouched by our Spark code. This shows the utility of branch-based isolated environments with lakeFS.

You can safely continue working with the data from `main`, which is unharmed thanks to lakeFS's isolation capabilities.

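When you are done experimenting, you can simply delete the `test-env` branch, effectively discarding the polluted environment. A minimal sketch; `delete_branch` is part of the generated branches API, so check your client version for the exact name.

```python
# Sketch: discard the test environment by deleting its branch.
client.branches.delete_branch(repository="my-repo", branch="test-env")
```
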
## Further Reading

* Case Study: [How Enigma uses lakeFS for isolated development and staging environments](https://lakefs.io/blog/improving-our-research-velocity-with-lakefs/)
* [ETL Testing: A Practical Guide](https://lakefs.io/blog/etl-testing/)
* [Top 5 ETL Testing Challenges - Solved!](https://lakefs.io/wp-content/uploads/2023/03/Top-5-ETL-Testing-Challenges-Solved.pdf)


[hadoopfs]:  {% link integrations/spark.md %}#lakefs-hadoop-filesystem
[spark-s3a]:  {% link integrations/spark.md %}#use-the-s3-compatible-api