---
title: Apache Iceberg
description: How to integrate lakeFS with Apache Iceberg
parent: Integrations
---

# Using lakeFS with Apache Iceberg

{% include toc_2-3.html %}

To enrich your Iceberg tables with lakeFS capabilities, you can use the lakeFS implementation of the Iceberg catalog.
You will then be able to query your Iceberg tables using lakeFS references, such as branches, tags, and commit hashes:

```sql
SELECT * FROM catalog.ref.db.table
```

## Setup

<div class="tabs">
<ul>
  <li><a href="#maven">Maven</a></li>
  <li><a href="#pyspark">PySpark</a></li>
</ul>
<div markdown="1" id="maven">

Use the following Maven dependency to install the lakeFS custom catalog:

```xml
<dependency>
  <groupId>io.lakefs</groupId>
  <artifactId>lakefs-iceberg</artifactId>
  <version>0.1.4</version>
</dependency>
```

</div>
<div markdown="1" id="pyspark">

Include the `lakefs-iceberg` jar in your package list along with Iceberg. For example:

```python
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,io.lakefs:lakefs-iceberg:0.1.4")
```

</div>
</div>

## Configure

<div class="tabs">
<ul>
  <li><a href="#conf-pyspark">PySpark</a></li>
  <li><a href="#conf-sparkshell">Spark Shell</a></li>
</ul>
<div markdown="1" id="conf-pyspark">

Set up the Spark SQL catalog:

```python
.config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") \
.config("spark.sql.catalog.lakefs.warehouse", f"lakefs://{repo_name}") \
.config("spark.sql.catalog.lakefs.cache-enabled", "false")
```
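These `.config(...)` lines are fragments of a single `SparkSession` builder chain. As a minimal sketch of how they fit together (assuming a repository named `example-repo`; the S3A connection settings described next belong on the same builder):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pull in Iceberg and the lakeFS catalog implementation
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,io.lakefs:lakefs-iceberg:0.1.4")
    # Register a catalog named "lakefs" backed by the repository
    .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog")
    .config("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo")
    .config("spark.sql.catalog.lakefs.cache-enabled", "false")
    # ...add the fs.s3a.* settings from the next snippet here as well
    .getOrCreate()
)
```

Disabling the catalog cache (`cache-enabled: false`) helps avoid serving stale table metadata when the same session reads a table through different lakeFS references.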
Configure the S3A Hadoop FileSystem with your lakeFS connection details.
Note that these are your lakeFS endpoint and credentials, not your S3 ones.

```python
.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io") \
.config("spark.hadoop.fs.s3a.access.key", "AKIAIO5FODNN7EXAMPLE") \
.config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY") \
.config("spark.hadoop.fs.s3a.path.style.access", "true")
```

</div>

<div markdown="1" id="conf-sparkshell">

```shell
spark-shell --conf spark.sql.catalog.lakefs="org.apache.iceberg.spark.SparkCatalog" \
  --conf spark.sql.catalog.lakefs.catalog-impl="io.lakefs.iceberg.LakeFSCatalog" \
  --conf spark.sql.catalog.lakefs.warehouse="lakefs://example-repo" \
  --conf spark.sql.catalog.lakefs.cache-enabled="false" \
  --conf spark.hadoop.fs.s3.impl="org.apache.hadoop.fs.s3a.S3AFileSystem" \
  --conf spark.hadoop.fs.s3a.endpoint="https://example-org.us-east-1.lakefscloud.io" \
  --conf spark.hadoop.fs.s3a.access.key="AKIAIO5FODNN7EXAMPLE" \
  --conf spark.hadoop.fs.s3a.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  --conf spark.hadoop.fs.s3a.path.style.access="true"
```

</div>
</div>

## Using Iceberg tables with lakeFS

### Create a table

To create a table on your main branch, use the following syntax:

```sql
CREATE TABLE lakefs.main.db1.table1 (id int, data string);
```

### Insert data into the table

```sql
INSERT INTO lakefs.main.db1.table1 VALUES (1, 'data1');
INSERT INTO lakefs.main.db1.table1 VALUES (2, 'data2');
```

### Commit and create a branch

We can now commit the creation of the table to the main branch:

```shell
lakectl commit lakefs://example-repo/main -m "my first iceberg commit"
```

Then, create a branch:

```shell
lakectl branch create lakefs://example-repo/dev -s lakefs://example-repo/main
```

### Make changes on the branch

We can now make changes on the branch:

```sql
INSERT INTO lakefs.dev.db1.table1 VALUES (3, 'data3');
```

### Query the table

If we query the table on the branch, we will see the data we inserted:

```sql
SELECT * FROM lakefs.dev.db1.table1;
```

Results in:

```
+----+-------+
| id | data  |
+----+-------+
| 1  | data1 |
| 2  | data2 |
| 3  | data3 |
+----+-------+
```

However, if we query the table on the main branch, we will not see the new changes:

```sql
SELECT * FROM lakefs.main.db1.table1;
```

Results in:

```
+----+-------+
| id | data  |
+----+-------+
| 1  | data1 |
| 2  | data2 |
+----+-------+
```
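To bring the branch's changes into `main`, you can commit them on `dev` and merge the branch back. A sketch using the same `lakectl` commands as above (the commit message is illustrative):

```shell
lakectl commit lakefs://example-repo/dev -m "insert data3 on dev"
lakectl merge lakefs://example-repo/dev lakefs://example-repo/main
```

After the merge, querying `lakefs.main.db1.table1` should return all three rows.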
## Migrating an existing Iceberg Table to lakeFS Catalog

This is done by copying the original table's data into lakeFS.

1. Create a new lakeFS repository: `lakectl repo create lakefs://example-repo <base storage path>`
2. Start a Spark session that can interact with both the source Iceberg table and the target lakeFS catalog.
   Here's an example that configures a Hadoop-on-S3 source catalog alongside the lakeFS target catalog, using [per-bucket configuration](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/bk_cloud-data-access/content/s3-per-bucket-configs.html):

   ```java
   SparkConf conf = new SparkConf();
   conf.set("spark.hadoop.fs.s3a.path.style.access", "true");
   conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");

   // Hadoop-on-S3 catalog (the source tables we want to copy)
   conf.set("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog");
   conf.set("spark.sql.catalog.hadoop_prod.type", "hadoop");
   conf.set("spark.sql.catalog.hadoop_prod.warehouse", "s3a://my-bucket/warehouse/hadoop/");
   conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.access.key", "<AWS_ACCESS_KEY>");
   conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.secret.key", "<AWS_SECRET_KEY>");

   // lakeFS catalog (the target catalog and repository)
   conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog");
   conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog");
   conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo");
   conf.set("spark.hadoop.fs.s3a.bucket.example-repo.access.key", "<LAKEFS_ACCESS_KEY>");
   conf.set("spark.hadoop.fs.s3a.bucket.example-repo.secret.key", "<LAKEFS_SECRET_KEY>");
   conf.set("spark.hadoop.fs.s3a.bucket.example-repo.endpoint", "<LAKEFS_ENDPOINT>");
   ```

3. Create the schema in lakeFS and copy the data, for example with spark-sql:

   ```sql
   -- Create the Iceberg schema in lakeFS
   CREATE SCHEMA IF NOT EXISTS <lakefs-catalog>.<branch>.<db>;
   -- Create a new Iceberg table in lakeFS from the source (pre-lakeFS) table
   CREATE TABLE IF NOT EXISTS <lakefs-catalog>.<branch>.<db>.<table> USING iceberg AS SELECT * FROM <iceberg-original-table>;
   ```
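For example, with the catalogs configured in step 2 and the `db1.table1` table from the walkthrough above, the copy might look like this (a sketch; the catalog, schema, and table names are illustrative):

```sql
-- Create the Iceberg schema on the target branch
CREATE SCHEMA IF NOT EXISTS lakefs.main.db1;

-- Copy the source table from the Hadoop catalog into lakeFS
CREATE TABLE IF NOT EXISTS lakefs.main.db1.table1
USING iceberg AS SELECT * FROM hadoop_prod.db1.table1;
```

Once the copy completes, commit the new table with `lakectl commit` as shown earlier.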