---
title: 3️⃣ Create a branch
description: lakeFS quickstart / Create a branch in lakeFS without copying data on disk, make a change to the branch, see that the original version of the data is unchanged.
parent: ⭐ Quickstart
nav_order: 15
next: ["Merge the branch back into main", "./commit-and-merge.html"]
previous: ["Query the pre-populated data", "./query.html"]
---

# Create a Branch

lakeFS uses branches in a similar way to Git. It's a great way to isolate changes until, or if, we are ready to re-integrate them. lakeFS uses a zero-copy branching technique, which means that it's very efficient to create branches of your data.

Having seen the lakes data in the previous step, we're now going to create a new dataset to hold data only for lakes in Denmark. Why? Well, because :)

The first thing we'll do is create a branch for us to do this development against. We'll use the `lakectl` tool to create the branch, which we first need to configure with our credentials. In a new terminal window run the following:

```bash
docker exec -it lakefs lakectl config
```

Follow the prompts to enter the credentials that you got in the first step. Leave the **Server endpoint URL** as `http://127.0.0.1:8000`.

Now that lakectl is configured, we can use it to create the branch. Run the following:

```bash
docker exec lakefs \
    lakectl branch create \
        lakefs://quickstart/denmark-lakes \
        --source lakefs://quickstart/main
```

You should get a confirmation message like this:

```
Source ref: lakefs://quickstart/main
created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816
```

## Transforming the Data

Now we'll make a change to the data. lakeFS has several native clients, as well as an [S3-compatible endpoint](https://docs.lakefs.io/understand/architecture.html#s3-gateway). This means that anything that can use S3 will work with lakeFS. Pretty neat.

We're going to use DuckDB, which is embedded within the web interface of lakeFS.

From the lakeFS **Objects** page select the `lakes.parquet` file to open the DuckDB editor:

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-main-01.png" alt="The lakeFS object viewer with embedded DuckDB to query parquet files. A query has run automagically to preview the contents of the selected parquet file." class="quickstart"/>

To start with, we'll load the lakes data into a DuckDB table so that we can manipulate it. Replace the previous text in the DuckDB editor with this:

```sql
CREATE OR REPLACE TABLE lakes AS
    SELECT * FROM READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet');
```

You'll see a row count of 100,000 to confirm that the DuckDB table has been created.

Just to check that it's the same data that we saw before, we'll run the same query. Note that we are querying a DuckDB table (`lakes`), rather than using a function to query a parquet file directly.

```sql
SELECT   country, COUNT(*)
FROM     lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-02.png" alt="The DuckDB editor pane querying the lakes table" class="quickstart"/>

### Making a Change to the Data

Now we can change our table, which was loaded from the original `lakes.parquet`, to remove all rows not for Denmark:

```sql
DELETE FROM lakes WHERE Country != 'Denmark';
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-03.png" alt="The DuckDB editor pane deleting rows from the lakes table" class="quickstart"/>

We can verify that it's worked by reissuing the same query as before:

```sql
SELECT   country, COUNT(*)
FROM     lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-04.png" alt="The DuckDB editor pane querying the lakes table showing only rows for Denmark remain" class="quickstart"/>

## Write the Data back to lakeFS

The changes so far have only been to DuckDB's copy of the data. Let's now push it back to lakeFS. Note the path is different this time as we're writing it to the `denmark-lakes` branch, not `main`:

```sql
COPY lakes TO 'lakefs://quickstart/denmark-lakes/lakes.parquet';
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-05.png" alt="The DuckDB editor pane writing data back to the denmark-lakes branch" class="quickstart"/>

## Verify that the Data's Changed on the Branch

Let's just confirm for ourselves that the parquet file itself has the new data. We'll drop the `lakes` table just to be sure, and then query the parquet file directly:

```sql
DROP TABLE lakes;

SELECT   country, COUNT(*)
FROM     READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-06.png" alt="The DuckDB editor pane showing that the parquet file on the denmark-lakes branch has been changed" class="quickstart"/>

## What about the data in `main`?

So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing! See for yourself by running the same query as above, but against the `main` branch:

```sql
SELECT   country, COUNT(*)
FROM     READ_PARQUET('lakefs://quickstart/main/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-main-02.png" alt="The lakeFS object browser showing DuckDB querying lakes.parquet on the main branch. The results are the same as they were before we made the changes to the denmark-lakes branch, which is as expected." class="quickstart"/>

In the next step we'll see how to commit our changes and merge our branch back into main.