---
title: DuckDB
description: How to use lakeFS with DuckDB, an open-source SQL OLAP database management system.
parent: Integrations
---

# Using lakeFS with DuckDB

[DuckDB](https://duckdb.org/){:target="_blank"} is an in-process SQL OLAP database management system. You can access data in lakeFS from DuckDB, as well as use DuckDB from within the web interface of lakeFS.

{% include toc_2-3.html %}

## Accessing lakeFS from DuckDB

### Configuration

Querying data in lakeFS from DuckDB is similar to querying data in S3 from DuckDB. It is done using the [httpfs extension](https://duckdb.org/docs/extensions/httpfs.html){:target="_blank"}, connecting to the [S3 Gateway that lakeFS provides](https://docs.lakefs.io/understand/architecture.html#s3-gateway).

If not already loaded, install and load the HTTPFS extension:

```sql
INSTALL httpfs;
LOAD httpfs;
```

Then run the following to configure the connection:

```sql
-- "s3_region" is the S3 region in which your bucket resides. For local storage, or any non-S3 storage, set it to "us-east-1".
SET s3_region='us-east-1';
-- the host (and port, if necessary) of your lakeFS server
SET s3_endpoint='lakefs.example.com';
-- the access key ID of your lakeFS user
SET s3_access_key_id='AKIAIOSFODNN7EXAMPLE';
-- the secret access key of your lakeFS user
SET s3_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY';
SET s3_url_style='path';
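-- As an alternative to the SET statements above, DuckDB 0.10 and later can hold
-- the same values in its Secrets Manager. A sketch, using the same example
-- values as above (shown commented out):
-- CREATE SECRET lakefs (
--     TYPE S3,
--     KEY_ID 'AKIAIOSFODNN7EXAMPLE',
--     SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
--     REGION 'us-east-1',
--     ENDPOINT 'lakefs.example.com',
--     URL_STYLE 'path'
-- );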
-- Uncomment if your lakeFS endpoint does not use TLS, for example when running lakeFS locally.
-- SET s3_use_ssl=false;
```

### Querying Data

Once configured, you can query data through the lakeFS S3 Gateway using the following URI pattern:

```text
s3://<REPOSITORY NAME>/<REFERENCE ID>/<PATH TO DATA>
```

Since the S3 Gateway implements all S3 functionality required by DuckDB, you can query using globs and patterns, including support for Hive-partitioned data.

Example:

```sql
SELECT *
FROM parquet_scan('s3://example-repo/main/data/population/by-region/*.parquet', HIVE_PARTITIONING=1)
ORDER BY name;
```

### Writing Data

No special configuration is required for writing to a branch. Assuming the configuration above and write permissions to a `dev` branch,
a write operation would look like any DuckDB write:

```sql
CREATE TABLE sampled_population AS SELECT *
FROM parquet_scan('s3://example-repo/main/data/population/by-region/*.parquet', HIVE_PARTITIONING=1)
USING SAMPLE reservoir(50000 ROWS) REPEATABLE (100);

COPY sampled_population TO 's3://example-repo/dev/data/population/sample.parquet'; -- actual write happens here
```

## Using DuckDB in the lakeFS web UI

The lakeFS web UI includes DuckDB in the Object viewer page.

Using this, you can query objects in lakeFS directly using a `lakefs` path:

```text
lakefs://<repository>/<branch>/object/path/foo.parquet
```

The DuckDB query editor is provided by [DuckDB WASM](https://github.com/duckdb/duckdb-wasm). It renders and provides querying capabilities for any objects of the following types:

* Parquet
* CSV
* TSV
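For example, a query in the embedded editor might look like the following, assuming the `lakefs` addressing scheme shown above (the repository and object path are illustrative):

```sql
SELECT *
FROM read_parquet('lakefs://example-repo/main/data/population/sample.parquet')
LIMIT 100;
```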