---
title: Apache Iceberg
description: How to integrate lakeFS with Apache Iceberg
parent: Integrations
---

# Using lakeFS with Apache Iceberg

{% include toc_2-3.html %}

To enrich your Iceberg tables with lakeFS capabilities, you can use the lakeFS implementation of the Iceberg catalog.
You will then be able to query your Iceberg tables using lakeFS references, such as branches, tags, and commit hashes:

```sql
SELECT * FROM catalog.ref.db.table
```

## Setup

<div class="tabs">
<ul>
  <li><a href="#maven">Maven</a></li>
  <li><a href="#pyspark">PySpark</a></li>
</ul>
<div markdown="1" id="maven">

Use the following Maven dependency to install the lakeFS custom catalog:

```xml
<dependency>
  <groupId>io.lakefs</groupId>
  <artifactId>lakefs-iceberg</artifactId>
  <version>0.1.4</version>
</dependency>
```

</div>
<div markdown="1" id="pyspark">

Include the `lakefs-iceberg` jar in your package list along with Iceberg. For example:

```python
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,io.lakefs:lakefs-iceberg:0.1.4")
```

</div>
</div>

## Configure

<div class="tabs">
<ul>
  <li><a href="#conf-pyspark">PySpark</a></li>
  <li><a href="#conf-sparkshell">Spark Shell</a></li>
</ul>
<div markdown="1" id="conf-pyspark">

Set up the Spark SQL catalog:

```python
.config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") \
.config("spark.sql.catalog.lakefs.warehouse", f"lakefs://{repo_name}") \
.config("spark.sql.catalog.lakefs.cache-enabled", "false")
```
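These `.config(...)` lines are fragments of a single `SparkSession` builder chain. As a minimal sketch of how they fit together (assuming a repository named `example-repo`; the S3A connection settings described next belong on the same builder):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pull in Iceberg and the lakeFS catalog implementation
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,io.lakefs:lakefs-iceberg:0.1.4")
    # Register a catalog named "lakefs" backed by the repository
    .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog")
    .config("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo")
    .config("spark.sql.catalog.lakefs.cache-enabled", "false")
    # ...add the fs.s3a.* settings from the next snippet here as well
    .getOrCreate()
)
```

Disabling the catalog cache (`cache-enabled: false`) helps avoid serving stale table metadata when the same session reads a table through different lakeFS references.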
Configure the S3A Hadoop FileSystem with your lakeFS connection details.
Note that these are your lakeFS endpoint and credentials, not your S3 ones.

```python
.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io") \
.config("spark.hadoop.fs.s3a.access.key", "AKIAIO5FODNN7EXAMPLE") \
.config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY") \
.config("spark.hadoop.fs.s3a.path.style.access", "true")
```

</div>

<div markdown="1" id="conf-sparkshell">

```shell
spark-shell --conf spark.sql.catalog.lakefs="org.apache.iceberg.spark.SparkCatalog" \
  --conf spark.sql.catalog.lakefs.catalog-impl="io.lakefs.iceberg.LakeFSCatalog" \
  --conf spark.sql.catalog.lakefs.warehouse="lakefs://example-repo" \
  --conf spark.sql.catalog.lakefs.cache-enabled="false" \
  --conf spark.hadoop.fs.s3.impl="org.apache.hadoop.fs.s3a.S3AFileSystem" \
  --conf spark.hadoop.fs.s3a.endpoint="https://example-org.us-east-1.lakefscloud.io" \
  --conf spark.hadoop.fs.s3a.access.key="AKIAIO5FODNN7EXAMPLE" \
  --conf spark.hadoop.fs.s3a.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  --conf spark.hadoop.fs.s3a.path.style.access="true"
```

</div>
</div>

## Using Iceberg tables with lakeFS

### Create a table

To create a table on your main branch, use the following syntax:

```sql
CREATE TABLE lakefs.main.db1.table1 (id int, data string);
```

### Insert data into the table

```sql
INSERT INTO lakefs.main.db1.table1 VALUES (1, 'data1');
INSERT INTO lakefs.main.db1.table1 VALUES (2, 'data2');
```

### Commit and create a branch

We can now commit the creation of the table to the main branch:

```shell
lakectl commit lakefs://example-repo/main -m "my first iceberg commit"
```

Then, create a branch:

```shell
lakectl branch create lakefs://example-repo/dev -s lakefs://example-repo/main
```

### Make changes on the branch

We can now make changes on the branch:

```sql
INSERT INTO lakefs.dev.db1.table1 VALUES (3, 'data3');
```

### Query the table

If we query the table on the branch, we will see the data we inserted:

```sql
SELECT * FROM lakefs.dev.db1.table1;
```

Results in:

```
+----+-------+
| id | data  |
+----+-------+
| 1  | data1 |
| 2  | data2 |
| 3  | data3 |
+----+-------+
```

However, if we query the table on the main branch, we will not see the new changes:

```sql
SELECT * FROM lakefs.main.db1.table1;
```

Results in:

```
+----+-------+
| id | data  |
+----+-------+
| 1  | data1 |
| 2  | data2 |
+----+-------+
```
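To bring the branch's changes into `main`, you can commit them on `dev` and merge the branch back. A sketch using the same `lakectl` commands as above (the commit message is illustrative):

```shell
lakectl commit lakefs://example-repo/dev -m "insert data3 on dev"
lakectl merge lakefs://example-repo/dev lakefs://example-repo/main
```

After the merge, querying `lakefs.main.db1.table1` should return all three rows.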
## Migrating an existing Iceberg Table to lakeFS Catalog

This is done by copying the original table's data into lakeFS.

1. Create a new lakeFS repository: `lakectl repo create lakefs://example-repo <base storage path>`
2. Start a Spark session that can interact with both the source Iceberg table and the target lakeFS catalog.
   Here's an example that configures a Hadoop-on-S3 source catalog alongside the lakeFS target catalog, using [per-bucket configuration](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/bk_cloud-data-access/content/s3-per-bucket-configs.html):

   ```java
   SparkConf conf = new SparkConf();
   conf.set("spark.hadoop.fs.s3a.path.style.access", "true");
   conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");

   // Hadoop-on-S3 catalog (the source tables we want to copy)
   conf.set("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog");
   conf.set("spark.sql.catalog.hadoop_prod.type", "hadoop");
   conf.set("spark.sql.catalog.hadoop_prod.warehouse", "s3a://my-bucket/warehouse/hadoop/");
   conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.access.key", "<AWS_ACCESS_KEY>");
   conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.secret.key", "<AWS_SECRET_KEY>");

   // lakeFS catalog (the target catalog and repository)
   conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog");
   conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog");
   conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo");
   conf.set("spark.hadoop.fs.s3a.bucket.example-repo.access.key", "<LAKEFS_ACCESS_KEY>");
   conf.set("spark.hadoop.fs.s3a.bucket.example-repo.secret.key", "<LAKEFS_SECRET_KEY>");
   conf.set("spark.hadoop.fs.s3a.bucket.example-repo.endpoint", "<LAKEFS_ENDPOINT>");
   ```

3. Create the schema in lakeFS and copy the data, for example with spark-sql:

   ```sql
   -- Create the Iceberg schema in lakeFS
   CREATE SCHEMA IF NOT EXISTS <lakefs-catalog>.<branch>.<db>;
   -- Create a new Iceberg table in lakeFS from the source (pre-lakeFS) table
   CREATE TABLE IF NOT EXISTS <lakefs-catalog>.<branch>.<db>.<table> USING iceberg AS SELECT * FROM <iceberg-original-table>;
   ```
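For example, with the catalogs configured in step 2 and the `db1.table1` table from the walkthrough above, the copy might look like this (a sketch; the catalog, schema, and table names are illustrative):

```sql
-- Create the Iceberg schema on the target branch
CREATE SCHEMA IF NOT EXISTS lakefs.main.db1;

-- Copy the source table from the Hadoop catalog into lakeFS
CREATE TABLE IF NOT EXISTS lakefs.main.db1.table1
USING iceberg AS SELECT * FROM hadoop_prod.db1.table1;
```

Once the copy completes, commit the new table with `lakectl commit` as shown earlier.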