---
title: DuckDB
description: How to use lakeFS with DuckDB, an open-source SQL OLAP database management system.
parent: Integrations
---

# Using lakeFS with DuckDB

[DuckDB](https://duckdb.org/){:target="_blank"} is an in-process SQL OLAP database management system. You can access data in lakeFS from DuckDB, as well as use DuckDB from within the web interface of lakeFS.

{% include toc_2-3.html %}

## Accessing lakeFS from DuckDB

### Configuration

Querying data in lakeFS from DuckDB is similar to querying data in S3 from DuckDB. It is done using the [httpfs extension](https://duckdb.org/docs/extensions/httpfs.html){:target="_blank"}, connecting to the [S3 Gateway that lakeFS provides](https://docs.lakefs.io/understand/architecture.html#s3-gateway).

If not already loaded, install and load the HTTPFS extension:

```sql
INSTALL httpfs;
LOAD httpfs;
```

Then run the following to configure the connection:

```sql
-- "s3_region" is the S3 region in which your bucket resides. For local storage, or any non-S3 storage, set it to "us-east-1".
SET s3_region='us-east-1';
-- the host (and port, if necessary) of your lakeFS server
SET s3_endpoint='lakefs.example.com';
-- the access key ID of your lakeFS user
SET s3_access_key_id='AKIAIOSFODNN7EXAMPLE';
-- the secret access key of your lakeFS user
SET s3_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY';
SET s3_url_style='path';
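-- As an alternative to the SET statements above, DuckDB 0.10 and later can hold
-- the same values in its Secrets Manager. A sketch, using the same example
-- values as above (shown commented out):
-- CREATE SECRET lakefs (
--     TYPE S3,
--     KEY_ID 'AKIAIOSFODNN7EXAMPLE',
--     SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
--     REGION 'us-east-1',
--     ENDPOINT 'lakefs.example.com',
--     URL_STYLE 'path'
-- );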
-- Uncomment if your lakeFS endpoint does not use TLS, for example when running lakeFS locally.
-- SET s3_use_ssl=false;
```

### Querying Data

Once configured, you can query data through the lakeFS S3 Gateway using the following URI pattern:

```text
s3://<REPOSITORY NAME>/<REFERENCE ID>/<PATH TO DATA>
```

Since the S3 Gateway implements all S3 functionality required by DuckDB, you can query using globs and patterns, including support for Hive-partitioned data.

Example:

```sql
SELECT *
FROM parquet_scan('s3://example-repo/main/data/population/by-region/*.parquet', HIVE_PARTITIONING=1)
ORDER BY name;
```

### Writing Data

No special configuration is required for writing to a branch. Assuming the configuration above and write permissions to a `dev` branch,
a write operation would look like any DuckDB write:

```sql
CREATE TABLE sampled_population AS SELECT *
FROM parquet_scan('s3://example-repo/main/data/population/by-region/*.parquet', HIVE_PARTITIONING=1)
USING SAMPLE reservoir(50000 ROWS) REPEATABLE (100);

COPY sampled_population TO 's3://example-repo/dev/data/population/sample.parquet'; -- actual write happens here
```

## Using DuckDB in the lakeFS web UI

The lakeFS web UI includes DuckDB in the Object viewer page.

Using this, you can query objects in lakeFS directly using a `lakefs` path:

```text
lakefs://<repository>/<branch>/object/path/foo.parquet
```

The DuckDB query editor is provided by [DuckDB WASM](https://github.com/duckdb/duckdb-wasm). It renders and provides querying capabilities for any objects of the following types:

* Parquet
* CSV
* TSV
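For example, a query in the embedded editor might look like the following, assuming the `lakefs` addressing scheme shown above (the repository and object path are illustrative):

```sql
SELECT *
FROM read_parquet('lakefs://example-repo/main/data/population/sample.parquet')
LIMIT 100;
```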