github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/quickstart/branch.md

---
title: 3️⃣ Create a branch
description: lakeFS quickstart / Create a branch in lakeFS without copying data on disk, make a change to the branch, see that the original version of the data is unchanged.
parent: ⭐ Quickstart
nav_order: 15
next: ["Merge the branch back into main", "./commit-and-merge.html"]
previous: ["Query the pre-populated data", "./query.html"]
---

# Create a Branch

lakeFS uses branches in much the same way as Git: a branch isolates changes until, or unless, we are ready to re-integrate them. lakeFS uses a zero-copy branching technique, which means that creating a branch of your data is very efficient.

Having seen the lakes data in the previous step, we're now going to create a new dataset to hold data only for lakes in Denmark. Why? Well, because :)

The first thing we'll do is create a branch to do this development against. We'll use the `lakectl` tool to create the branch, but first we need to configure it with our credentials. In a new terminal window, run the following:

```bash
docker exec -it lakefs lakectl config
```

Follow the prompts to enter the credentials that you got in the first step. Leave the **Server endpoint URL** as `http://127.0.0.1:8000`.

Now that `lakectl` is configured, we can use it to create the branch. Run the following:

```bash
docker exec lakefs \
    lakectl branch create \
        lakefs://quickstart/denmark-lakes \
        --source lakefs://quickstart/main
```

You should get a confirmation message like this:

```
Source ref: lakefs://quickstart/main
created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816
```
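
To double-check that the branch exists, `lakectl branch list` can be pointed at the repository. The following is a minimal sketch rather than part of the official walkthrough. It assumes the container is still named `lakefs`, as in the commands above; the `REPO_URI` variable and the reachability guard are illustrative additions.

```bash
# Repository URI used throughout this quickstart
REPO_URI="lakefs://quickstart"

# Guard (illustrative addition): only call lakectl if the container responds
if docker exec lakefs lakectl --version >/dev/null 2>&1; then
    # List the branches in the repository; denmark-lakes should appear next to main
    docker exec lakefs lakectl branch list "${REPO_URI}"
else
    echo "lakefs container is not reachable; skipping branch list" >&2
fi
```

If the new branch is missing from the list, re-running the `lakectl config` step is a reasonable first thing to check.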

## Transforming the Data

Now we'll make a change to the data. lakeFS has several native clients, as well as an [S3-compatible endpoint](https://docs.lakefs.io/understand/architecture.html#s3-gateway). This means that anything that can use S3 will work with lakeFS. Pretty neat.
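
As a rough illustration of the S3-compatible endpoint, the sketch below lists the objects on `main` with the AWS CLI. It is not part of the original walkthrough and rests on several assumptions: the AWS CLI is installed on the host, the lakeFS access key and secret from the first step have been exported as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and the S3 gateway is reachable at the same `http://127.0.0.1:8000` endpoint used for `lakectl`. Through the gateway, the repository plays the role of the bucket and the branch is the first path component.

```bash
# Endpoint from the lakectl configuration; assumed to also serve the S3 gateway
LAKEFS_ENDPOINT="http://127.0.0.1:8000"
# Repository as the bucket, branch as the first path component
S3_PATH="s3://quickstart/main/"

if command -v aws >/dev/null 2>&1; then
    # Credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    aws s3 ls "${S3_PATH}" --endpoint-url "${LAKEFS_ENDPOINT}" \
        || echo "could not list ${S3_PATH}; is lakeFS running and are credentials set?" >&2
else
    echo "AWS CLI not found; skipping the S3 gateway example" >&2
fi
```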

We're going to use DuckDB, which is embedded within the web interface of lakeFS.

From the lakeFS **Objects** page, select the `lakes.parquet` file to open the DuckDB editor:

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-main-01.png" alt="The lakeFS object viewer with embedded DuckDB to query parquet files. A query has run automagically to preview the contents of the selected parquet file." class="quickstart"/>

To start with, we'll load the lakes data into a DuckDB table so that we can manipulate it. Replace the previous text in the DuckDB editor with this:

```sql
CREATE OR REPLACE TABLE lakes AS
    SELECT * FROM READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet');
```

You'll see a row count of 100,000 to confirm that the DuckDB table has been created.

Just to check that it's the same data that we saw before, we'll run the same query. Note that we are querying a DuckDB table (`lakes`), rather than using a function to query a parquet file directly.

```sql
SELECT   country, COUNT(*)
FROM     lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-02.png" alt="The DuckDB editor pane querying the lakes table" class="quickstart"/>

### Making a Change to the Data

Now we can change our table, which was loaded from the original `lakes.parquet`, by removing every row that is not for Denmark:

```sql
DELETE FROM lakes WHERE country != 'Denmark';
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-03.png" alt="The DuckDB editor pane deleting rows from the lakes table" class="quickstart"/>

We can verify that it's worked by reissuing the same query as before:

```sql
SELECT   country, COUNT(*)
FROM     lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-04.png" alt="The DuckDB editor pane querying the lakes table showing only rows for Denmark remain" class="quickstart"/>

## Write the Data back to lakeFS

The changes so far have only been to DuckDB's copy of the data. Let's now push it back to lakeFS. Note that the target path is on the `denmark-lakes` branch, not `main`:

```sql
COPY lakes TO 'lakefs://quickstart/denmark-lakes/lakes.parquet';
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-05.png" alt="The DuckDB editor pane writing data back to the denmark-lakes branch" class="quickstart"/>
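
The write should land on the branch as an uncommitted change, which `lakectl` can show from the terminal. This is a hedged sketch outside the original walkthrough. It assumes `lakectl` was configured as described earlier; the `BRANCH_URI` variable and the guard are illustrative.

```bash
# Branch that received the rewritten lakes.parquet
BRANCH_URI="lakefs://quickstart/denmark-lakes"

if docker exec lakefs lakectl --version >/dev/null 2>&1; then
    # With a single branch URI, lakectl diff reports uncommitted changes on that branch
    docker exec lakefs lakectl diff "${BRANCH_URI}"
else
    echo "lakefs container is not reachable; skipping diff" >&2
fi
```

If the write succeeded, `lakes.parquet` is expected to be listed as changed.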

## Verify that the Data's Changed on the Branch

Let's just confirm for ourselves that the parquet file itself has the new data. We'll drop the `lakes` table just to be sure, and then query the parquet file directly:

```sql
DROP TABLE lakes;

SELECT   country, COUNT(*)
FROM     READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-editor-06.png" alt="The DuckDB editor pane showing that the parquet file on the denmark-lakes branch has been changed" class="quickstart"/>

## What about the data in `main`?

So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing! See for yourself by running the same query as above, but against the `main` branch:

```sql
SELECT   country, COUNT(*)
FROM     READ_PARQUET('lakefs://quickstart/main/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

<img src="{{ site.baseurl }}/assets/img/quickstart/duckdb-main-02.png" alt="The lakeFS object browser showing DuckDB querying lakes.parquet on the main branch. The results are the same as they were before we made the changes to the denmark-lakes branch, which is as expected." class="quickstart"/>
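
Another way to see that the two branches now hold different objects is to compare the object metadata from the terminal. The sketch below is an illustrative addition, not part of the original walkthrough. It assumes `lakectl` is configured as above and uses `lakectl fs stat` on the same path in both branches; the variable names are hypothetical.

```bash
# The same object path on each branch
MAIN_URI="lakefs://quickstart/main/lakes.parquet"
BRANCH_URI="lakefs://quickstart/denmark-lakes/lakes.parquet"

if docker exec lakefs lakectl --version >/dev/null 2>&1; then
    for uri in "${MAIN_URI}" "${BRANCH_URI}"; do
        # Print the object's metadata so the two entries can be compared
        docker exec lakefs lakectl fs stat "${uri}"
    done
else
    echo "lakefs container is not reachable; skipping fs stat" >&2
fi
```

The two entries would be expected to differ in the reported size, since the branch copy only holds the rows for Denmark.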

In the next step we'll see how to commit our changes and merge our branch back into main.