---
title: 7️⃣ Work with lakeFS data locally
description: lakeFS quickstart / Bring lakeFS data to a local environment to show how lakeFS can be used for ML experiment development.
parent: ⭐ Quickstart
nav_order: 35
next: ["Resources for learning more about lakeFS", "./learning-more-lakefs.html"]
previous: ["Using Actions and Hooks in lakeFS", "./actions-and-hooks.html"]
---

# Work with lakeFS Data Locally

When working with lakeFS, there are scenarios where we need to access and manipulate data locally. One example is machine
learning model development, which is dynamic and iterative. To keep that process efficient, experiments need to be fast,
easy to track, and reproducible. Keeping model data local during development helps by enabling interactive and offline
work and by reducing data access latency.

We're going to use [lakectl local](../howto/local-checkouts.md) to bring a subset of our lakeFS data to a local directory within the lakeFS
container and edit an image dataset used for ML model development.

## Cloning a Subset of lakeFS Data into a Local Directory

1. In lakeFS, create a new branch called `my-experiment`. You can do this through the UI or with `lakectl`:

    ```bash
    docker exec lakefs \
        lakectl branch create \
            lakefs://quickstart/my-experiment \
            --source lakefs://quickstart/main
    ```

2. Clone the images from your quickstart repository into a local directory named `my_local_dir` within your container:

    ```bash
    docker exec lakefs \
        lakectl local clone lakefs://quickstart/my-experiment/images my_local_dir
    ```

3. Verify that `my_local_dir` is linked to the correct path in your lakeFS remote:

    ```bash
    docker exec lakefs \
        lakectl local list
    ```

    You should see confirmation that `my_local_dir` is tracking the desired lakeFS path:

    ```bash
    my_local_dir	lakefs://quickstart/my-experiment/images/	8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53
    ```

4. Verify that your local environment is up-to-date with its remote path:

    ```bash
    docker exec lakefs \
        lakectl local status my_local_dir
    ```

    You should get a confirmation message like this, showing that there is no difference between your local environment and the lakeFS remote:

    ```bash
    diff 'local:///home/lakefs/my_local_dir' <--> 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/'...
    diff 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/' <--> 'lakefs://quickstart/my-experiment/images/'...

    No diff found.
    ```

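The `lakectl local list` output from step 3 lends itself to a scripted check. The sketch below assumes the three-column, whitespace-separated layout shown above (local directory, remote URI, commit ID) and that it stays stable across versions. `tracked_remote` is a hypothetical helper name, and the sample line is copied from the listing above rather than read from a live container.

```bash
# tracked_remote DIR: read `lakectl local list` output on stdin and print the
# remote URI tracked by DIR. Exits non-zero if DIR is not listed.
# Hypothetical helper; assumes columns: directory, remote URI, commit ID.
tracked_remote() {
    awk -v dir="$1" '$1 == dir { print $2; found = 1 } END { exit !found }'
}

# Sample line copied from the listing in step 3 (not a live call).
sample_listing="$(printf 'my_local_dir\tlakefs://quickstart/my-experiment/images/\t8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53')"

remote="$(printf '%s\n' "$sample_listing" | tracked_remote my_local_dir)"
echo "$remote"   # lakefs://quickstart/my-experiment/images/
```

Against the running quickstart container, the same function could be fed from `docker exec lakefs lakectl local list` instead of the sample line.
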
## Making Changes to Data Locally

1. Download a new image of an axolotl and copy it into the dataset cloned into `my_local_dir`. The status output in step 3 reports `axolotl.png` as modified, so this replaces an image that already exists in the dataset:

    ```bash
    curl -L https://go.lakefs.io/43ENDyS > axolotl.png

    docker cp axolotl.png lakefs:/home/lakefs/my_local_dir
    ```

2. Clean the dataset by removing images larger than 225 KB:

    ```bash
    docker exec lakefs \
        find my_local_dir -type f -size +225k -delete
    ```

3. Check the status of your local changes compared to the lakeFS remote path:

    ```bash
    docker exec lakefs \
        lakectl local status my_local_dir
    ```

    You should get a message like this, showing the modifications you made locally:

    ```bash
    diff 'local:///home/lakefs/my_local_dir' <--> 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/'...
    diff 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/' <--> 'lakefs://quickstart/my-experiment/images/'...

    ╔════════╦══════════╦═════════════════════╗
    ║ SOURCE ║ CHANGE   ║ PATH                ║
    ╠════════╬══════════╬═════════════════════╣
    ║ local  ║ modified ║ axolotl.png         ║
    ║ local  ║ removed  ║ duckdb-main-02.png  ║
    ║ local  ║ removed  ║ empty-repo-list.png ║
    ║ local  ║ removed  ║ repo-contents.png   ║
    ╚════════╩══════════╩═════════════════════╝
    ```

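Since `-delete` in step 2 removes files without asking for confirmation, it can help to preview the matches first. The sketch below swaps `-delete` for `-print` as a dry run. It works on a throwaway directory holding two placeholder files (hypothetical stand-ins for the image dataset), not on `my_local_dir`. Note that `find` treats the `k` suffix as units of 1024 bytes.

```bash
# Scratch directory standing in for my_local_dir (hypothetical sample data).
scratch="$(mktemp -d)"
dd if=/dev/zero of="$scratch/small.png" bs=1024 count=100 2>/dev/null   # 100 KiB
dd if=/dev/zero of="$scratch/large.png" bs=1024 count=300 2>/dev/null   # 300 KiB

# Dry run: same predicates as step 2, but -print lists matches instead of deleting them.
to_delete="$(find "$scratch" -type f -size +225k -print)"
echo "Would delete: $to_delete"

# After reviewing the list, run the destructive form.
find "$scratch" -type f -size +225k -delete
remaining="$(ls "$scratch")"
echo "Remaining: $remaining"

# Remove the scratch directory again.
rm -rf "$scratch"
```

In the quickstart container, the equivalent dry run would be `docker exec lakefs find my_local_dir -type f -size +225k -print`.
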
## Pushing Local Changes to lakeFS

Once we are done editing the image dataset in our local environment, we will push our changes to the lakeFS remote so that
the improved dataset is shared and versioned.

1. Commit your local changes to lakeFS:

    ```bash
    docker exec lakefs \
        lakectl local commit \
            -m 'Deleted images larger than 225KB in size and changed the Axolotl image' my_local_dir
    ```

    In your branch, you should see the commit including your local changes:

    <img width="75%" src="{{ site.baseurl }}/assets/img/quickstart/lakectl-local-01.png" alt="A lakectl local commit to lakeFS" class="quickstart"/>

2. Compare the `my-experiment` branch with the `main` branch to visualize your changes:

    <img width="75%" src="{{ site.baseurl }}/assets/img/quickstart/lakectl-local-02.png" alt="A comparison between a branch that includes local changes to the main branch" class="quickstart"/>

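A script that wraps the commit in step 1 may want to skip it when nothing has changed. One option, sketched below, is to look for the `No diff found.` line that `lakectl local status` printed earlier for a clean directory. `has_local_changes` is a hypothetical helper and the sample strings are abbreviated from the outputs shown in this guide. The exact wording of the status output is an assumption that may not hold across lakectl versions.

```bash
# has_local_changes: read `lakectl local status` output on stdin and succeed
# (exit 0) when it does NOT contain the "No diff found." line.
# Hypothetical helper; relies on the message text shown earlier in this guide.
has_local_changes() {
    ! grep -q 'No diff found\.'
}

# Sample outputs abbreviated from the status examples above (not live calls).
clean_status='No diff found.'
dirty_status='║ local  ║ removed  ║ repo-contents.png   ║'

if printf '%s\n' "$dirty_status" | has_local_changes; then
    echo "changes detected: proceed with lakectl local commit"
fi
if ! printf '%s\n' "$clean_status" | has_local_changes; then
    echo "nothing to commit"
fi
```

With the quickstart container running, the input would come from `docker exec lakefs lakectl local status my_local_dir`. Empty input also counts as changed here, so a real script should check that command's exit status as well.
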
## Bonus Challenge

And so with that, this quickstart for lakeFS draws to a close. If you're simply having _too much fun_ to stop then here's an exercise for you.

Implement the requirement from the beginning of this quickstart *correctly*, such that you write `denmark-lakes.parquet` in the respective branch and successfully merge it back into main. Look up how to list the contents of the `main` branch and verify that it looks like this:

```
object          2023-03-21 17:33:51 +0000 UTC    20.9 kB         denmark-lakes.parquet
object          2023-03-21 14:45:38 +0000 UTC    916.4 kB        lakes.parquet
```

# Finishing Up

Once you've finished the quickstart, shut down your local environment with the following command:

```bash
docker stop lakefs
```