---
title: 7️⃣ Work with lakeFS data locally
description: lakeFS quickstart / Bring lakeFS data to a local environment to show how lakeFS can be used for ML experiments development.
parent: ⭐ Quickstart
nav_order: 35
next: ["Resources for learning more about lakeFS", "./learning-more-lakefs.html"]
previous: ["Using Actions and Hooks in lakeFS", "./actions-and-hooks.html"]
---

# Work with lakeFS Data Locally

When working with lakeFS, there are scenarios where we need to access and manipulate data locally. An example use case for working
locally is machine learning model development. Machine learning model development is dynamic and iterative. To optimize this
process, experiments need to be fast, easy to track, and reproducible. Localizing model data during development
accelerates the process by enabling interactive and offline development and reducing data access latency.

We're going to use [lakectl local](../howto/local-checkouts.md) to bring a subset of our lakeFS data to a local directory within the lakeFS
container and edit an image dataset used for ML model development.

## Cloning a Subset of lakeFS Data into a Local Directory

1. In lakeFS create a new branch called `my-experiment`. You can do this through the UI or with `lakectl`:

    ```bash
    docker exec lakefs \
        lakectl branch create \
            lakefs://quickstart/my-experiment \
            --source lakefs://quickstart/main
    ```

2. Clone images from your quickstart repository into a local directory named `my_local_dir` within your container:

    ```bash
    docker exec lakefs \
        lakectl local clone lakefs://quickstart/my-experiment/images my_local_dir
    ```

3. Verify that `my_local_dir` is linked to the correct path in your lakeFS remote:

    ```bash
    docker exec lakefs \
        lakectl local list
    ```

    You should see confirmation that `my_local_dir` is tracking the desired lakeFS path:

    ```bash
    my_local_dir    lakefs://quickstart/my-experiment/images/    8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53
    ```

4. Verify that your local environment is up-to-date with its remote path:

    ```bash
    docker exec lakefs \
        lakectl local status my_local_dir
    ```

    You should get a confirmation message like this showing that there is no difference between your local environment and the lakeFS remote:

    ```bash
    diff 'local:///home/lakefs/my_local_dir' <--> 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/'...
    diff 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/' <--> 'lakefs://quickstart/my-experiment/images/'...

    No diff found.
    ```

## Making Changes to Data Locally

1. Download a new image of an Axolotl and add it to the dataset cloned into `my_local_dir`:

    ```bash
    curl -L https://go.lakefs.io/43ENDyS > axolotl.png

    docker cp axolotl.png lakefs:/home/lakefs/my_local_dir
    ```

2. Clean the dataset by removing images larger than 225 KB:

    ```bash
    docker exec lakefs \
        find my_local_dir -type f -size +225k -delete
    ```

3. Check the status of your local changes compared to the lakeFS remote path:

    ```bash
    docker exec lakefs \
        lakectl local status my_local_dir
    ```

    You should get a confirmation message like this, showing the modifications you made locally:

    ```bash
    diff 'local:///home/lakefs/my_local_dir' <--> 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/'...
    diff 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/' <--> 'lakefs://quickstart/my-experiment/images/'...

    ╔════════╦══════════╦═════════════════════╗
    ║ SOURCE ║ CHANGE   ║ PATH                ║
    ╠════════╬══════════╬═════════════════════╣
    ║ local  ║ modified ║ axolotl.png         ║
    ║ local  ║ removed  ║ duckdb-main-02.png  ║
    ║ local  ║ removed  ║ empty-repo-list.png ║
    ║ local  ║ removed  ║ repo-contents.png   ║
    ╚════════╩══════════╩═════════════════════╝
    ```

## Pushing Local Changes to lakeFS

Once we are done editing the image dataset in our local environment, we will push our changes to the lakeFS remote so that
the improved dataset is shared and versioned.

1. Commit your local changes to lakeFS:

    ```bash
    docker exec lakefs \
        lakectl local commit \
            -m 'Deleted images larger than 225KB in size and changed the Axolotl image' my_local_dir
    ```

    In your branch, you should see the commit including your local changes:

    <img width="75%" src="{{ site.baseurl }}/assets/img/quickstart/lakectl-local-01.png" alt="A lakectl local commit to lakeFS" class="quickstart"/>

2. Compare the `my-experiment` branch to the `main` branch to visualize your changes:

    <img width="75%" src="{{ site.baseurl }}/assets/img/quickstart/lakectl-local-02.png" alt="A comparison between a branch that includes local changes to the main branch" class="quickstart"/>

## Bonus Challenge

And so with that, this quickstart for lakeFS draws to a close. If you're simply having _too much fun_ to stop then here's an exercise for you.

Implement the requirement from the beginning of this quickstart *correctly*, such that you write `denmark-lakes.parquet` in the respective branch and successfully merge it back into main. Look up how to list the contents of the `main` branch and verify that it looks like this:

```
object 2023-03-21 17:33:51 +0000 UTC 20.9 kB denmark-lakes.parquet
object 2023-03-21 14:45:38 +0000 UTC 916.4 kB lakes.parquet
```

# Finishing Up

Once you've finished the quickstart, shut down your local environment with the following command:

```bash
docker stop lakefs
```
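
If you come back to this exercise later and want to see what the `find ... -size +225k -delete` filter from "Making Changes to Data Locally" actually removes, you can try it on throwaway files first. The following is a minimal sketch, assuming a `find` that supports the `k` size suffix and `-delete` (GNU and BSD `find` do). The scratch directory and the file names are placeholders made up for illustration and are not part of the quickstart dataset.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory standing in for my_local_dir (hypothetical, not the real clone).
scratch="$(mktemp -d)"

# Placeholder files: one small, one exactly 225 KiB, one clearly above the limit.
head -c 1024 /dev/zero > "$scratch/small.png"
head -c $((225 * 1024)) /dev/zero > "$scratch/at-limit.png"
head -c $((300 * 1024)) /dev/zero > "$scratch/large.png"

# Dry run: -print lists the matches without deleting anything.
find "$scratch" -type f -size +225k -print

# Same filter as the quickstart step; this time the matches are deleted.
find "$scratch" -type f -size +225k -delete

# Show what survived the cleanup.
ls "$scratch"
```

With GNU `find`, `+225k` matches files that occupy more than 225 units of 1024 bytes, so a file of exactly 225 KiB is kept. Running the `-print` variant before `-delete` is a cheap way to confirm the selection before changing a directory that `lakectl local` is tracking.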