github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/understand/data_lifecycle_management/data-devenv.md

---
title: In Test
parent: Data Lifecycle Management
grand_parent: Understanding lakeFS
description: lakeFS enables a safe test environment on your data lake without the need to copy or mock data
redirect_from:
  - /data_lifecycle_management/data-devenv.html
  - /usecases/data-devenv.html
---

## In Test

As part of our routine work with data, we develop new code, improve and upgrade old code, upgrade infrastructure, and test new technologies. lakeFS enables a safe test environment on your data lake without the need to copy or mock data, modify your pipelines, or involve DevOps.

Creating a branch gives you an isolated environment with a snapshot of your repository (any part of your data lake you choose to manage with lakeFS). While you work on your own branch in isolation, all other data users keep looking at the repository's main branch. They can't see your changes, and you don't see changes made to main after you created the branch.

No data is duplicated in the process; it's all metadata management behind the scenes.
Let's look at two examples of a test environment and their branching models.

### Example 1: Upgrading Spark and using Reset action

You installed the latest version of Apache Spark. As a first step, you'll test your Spark jobs to verify that the upgrade has no undesired side effects.

For this purpose, you can create a branch (`testing-spark-3`) that is used only to test the Spark upgrade and is discarded later. Jobs may run smoothly (the theoretical possibility exists!), or they may fail halfway through, leaving you with intermediate partitions, data, and metadata. In that case, you can simply *reset* the branch to its original state, without worrying about the intermediate results of your last experiment, and run another (hopefully successful) test in the isolated branch. Reset actions are atomic and immediate, so no manual cleanup is required.

Once testing is complete and you have achieved the desired result, you can delete the experimental branch, and all data not used on any other branch will be deleted with it.

<img src="{{ site.baseurl }}/assets/img/branching_1.png" alt="branching_1" width="500px"/>

_Creating a testing branch:_

   ```shell
   lakectl branch create \
      lakefs://example-repo/testing-spark-3 \
      --source lakefs://example-repo/main
   # output:
   # created branch 'testing-spark-3'
   ```

_Resetting changes to a branch:_

   ```shell
   lakectl branch reset lakefs://example-repo/testing-spark-3
   # are you sure you want to reset all uncommitted changes?: y█
   ```
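
The test, reset, and delete flow described in this example could be wrapped in a small script. The following is a minimal sketch, not an official workflow: the function names `run_test_job` and `delete_test_branch` and the `LAKECTL` override are illustrative assumptions, and it assumes your `lakectl` version accepts `--yes` to skip the confirmation prompt.

```shell
#!/usr/bin/env bash
# Sketch only: the variable and function names below are illustrative.
# LAKECTL can be overridden (for example with `echo`) to do a dry run.
LAKECTL="${LAKECTL:-lakectl}"
REPO="example-repo"
BRANCH="testing-spark-3"
BRANCH_URI="lakefs://${REPO}/${BRANCH}"

# Run a job (given as arguments) that writes to the test branch.
# If the job fails, discard all uncommitted changes on the branch.
run_test_job() {
  if "$@"; then
    return 0
  fi
  "$LAKECTL" branch reset "$BRANCH_URI" --yes
  return 1
}

# Delete the experimental branch once testing is complete.
delete_test_branch() {
  "$LAKECTL" branch delete "$BRANCH_URI" --yes
}
```

For example, `run_test_job spark-submit my_job.py` (where `my_job.py` is a hypothetical job) leaves the branch untouched on success and resets it on failure, after which `delete_test_branch` removes the branch.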

**Note:** lakeFS versions up to and including v0.33.1 use '@' (instead of '/') as the separator between repository and branch.

### Example 2: Collaborate & Compare - Which option is better?

When you have several options to choose from, test each one to see which performs better on your dataset.
Examples include:

* Different computation tools, e.g., Spark vs. Presto
* Different compression algorithms
* Different Spark configurations
* Different code versions of an ETL

Run each experiment on its own independent branch, while main remains untouched. Once both experiments are done, create a comparison query (using Hive, Presto, or any other tool of your choice) to compare data characteristics, performance, or any other metric you see fit.

With lakeFS, you don't need to worry about creating data paths for the experiments, copying data, and remembering to delete it. It's substantially easier to avoid errors and keep the lake clean afterward.

<img src="{{ site.baseurl }}/assets/img/branching_2.png" alt="branching_2" width="500px"/>

_Reading from and comparing branches using Spark:_

   ```scala
   // `spark` is the SparkSession; the DataFrame reader is exposed there, not on the SparkContext (`sc`)
   val dfExperiment1 = spark.read.parquet("s3a://example-repo/experiment-1/events/by-date")
   val dfExperiment2 = spark.read.parquet("s3a://example-repo/experiment-2/events/by-date")

   dfExperiment1.groupBy("...").count()
   dfExperiment2.groupBy("...").count() // now we can compare the properties of the data itself
   ```
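
The experiment branches read above have to be created first. The following is a minimal sketch of creating one branch per experiment and asking `lakectl` for the object-level changes between two of them. The function names and the `LAKECTL` override are illustrative assumptions; check `lakectl diff --help` for the exact comparison semantics in your version.

```shell
#!/usr/bin/env bash
# Sketch only: the function names are illustrative.
# LAKECTL can be overridden (for example with `echo`) to do a dry run.
LAKECTL="${LAKECTL:-lakectl}"
REPO="example-repo"   # repository name used in the examples above

# Create one branch per experiment name, each starting from main.
create_experiment_branches() {
  local name
  for name in "$@"; do
    "$LAKECTL" branch create "lakefs://${REPO}/${name}" \
      --source "lakefs://${REPO}/main"
  done
}

# Ask lakectl for the changes between two experiment branches.
diff_experiments() {
  "$LAKECTL" diff "lakefs://${REPO}/$1" "lakefs://${REPO}/$2"
}
```

For example, `create_experiment_branches experiment-1 experiment-2` followed, once both experiments have run, by `diff_experiments experiment-1 experiment-2`.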