---
title: Apache Spark
description: Accessing data in lakeFS from Apache Spark works the same as accessing S3 data from Apache Spark.
parent: Integrations
redirect_from:
  - /integrations/databricks.html
  - /integrations/emr.html
  - /integrations/glue_etl.html
  - /using/databricks.html
  - /using/emr.html
  - /using/glue_etl.html
  - /using/spark.html
---

# Using lakeFS with Apache Spark

There are several ways to use lakeFS with Spark:

* [The S3-compatible API](#s3-compatible-api): Scalable and best to get started. <span class="badge">All Storage Vendors</span>
* [The lakeFS FileSystem](#lakefs-hadoop-filesystem): Direct data flow from client to storage, highly scalable. <span class="badge">AWS S3</span>
* [lakeFS FileSystem in Presigned mode](#hadoop-filesystem-in-presigned-mode): Best of both worlds. <span class="badge mr-1">AWS S3</span><span class="badge">Azure Blob</span>

See how SimilarWeb is using lakeFS with Spark to [manage algorithm changes in data pipelines](https://grdoron.medium.com/a-smarter-way-to-manage-algorithm-changes-in-data-pipelines-with-lakefs-a4e284f8c756).
{: .note }

{% include toc.html %}

## S3-compatible API

lakeFS has an S3-compatible endpoint which you can point Spark at to get started quickly.

You will access your data using S3-style URIs, e.g. `s3a://example-repo/example-branch/example-table`.

You can use the S3-compatible API regardless of where your data is hosted.

### Configuration

To configure Spark to work with lakeFS, set the S3A Hadoop configuration to the lakeFS endpoint and credentials:

* `fs.s3a.access.key`: lakeFS access key
* `fs.s3a.secret.key`: lakeFS secret key
* `fs.s3a.endpoint`: lakeFS S3-compatible API endpoint (e.g. `https://example-org.us-east-1.lakefscloud.io`)
* `fs.s3a.path.style.access`: `true`

Here is how to do it:
<div class="tabs">
<ul>
  <li><a href="#s3-config-tabs-cli">CLI</a></li>
  <li><a href="#s3-config-tabs-code">Scala</a></li>
  <li><a href="#s3-config-tabs-xml">XML Configuration</a></li>
  <li><a href="#s3-config-tabs-emr">EMR</a></li>
</ul>
<div markdown="1" id="s3-config-tabs-cli">
```shell
spark-shell --conf spark.hadoop.fs.s3a.access.key='AKIAlakefs12345EXAMPLE' \
            --conf spark.hadoop.fs.s3a.secret.key='abc/lakefs/1234567bPxRfiCYEXAMPLEKEY' \
            --conf spark.hadoop.fs.s3a.path.style.access=true \
            --conf spark.hadoop.fs.s3a.endpoint='https://example-org.us-east-1.lakefscloud.io' ...
```
</div>
<div markdown="1" id="s3-config-tabs-code">
```scala
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "AKIAlakefs12345EXAMPLE")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
```
</div>
<div markdown="1" id="s3-config-tabs-xml">
Add these into a configuration file, e.g. `$SPARK_HOME/conf/hdfs-site.xml`:
```xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>AKIAlakefs12345EXAMPLE</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>https://example-org.us-east-1.lakefscloud.io</value>
    </property>
    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>
</configuration>
```
</div>
<div markdown="1" id="s3-config-tabs-emr">
Use the configuration below when creating the cluster. You may delete any app configuration that is not suitable for your use case:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.sql.catalogImplementation": "hive"
    }
  },
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3.path.style.access": "true",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3a.path.style.access": "true"
    }
  },
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3.path.style.access": "true",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3a.path.style.access": "true"
    }
  },
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.s3.aws-access-key": "AKIAIOSFODNN7EXAMPLE",
      "hive.s3.aws-secret-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "hive.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "hive.s3.path-style-access": "true",
      "hive.s3-file-system-type": "PRESTO"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3.path.style.access": "true",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3a.path.style.access": "true"
    }
  },
  {
    "Classification": "hdfs-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3.path.style.access": "true",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3a.path.style.access": "true"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3.path.style.access": "true",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io",
      "fs.s3a.path.style.access": "true"
    }
  }
]
```

Alternatively, you can pass these configuration values when adding a step.

For example:

```bash
aws emr add-steps --cluster-id j-197B3AEGQ9XE4 \
  --steps="Type=Spark,Name=SparkApplication,ActionOnFailure=CONTINUE, \
  Args=[--conf,spark.hadoop.fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE, \
  --conf,spark.hadoop.fs.s3a.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY, \
  --conf,spark.hadoop.fs.s3a.endpoint=https://example-org.us-east-1.lakefscloud.io, \
  --conf,spark.hadoop.fs.s3a.path.style.access=true, \
  s3a://<lakefs-repo>/<lakefs-branch>/path/to/jar]"
```

</div>
</div>

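If you build the `SparkSession` yourself in PySpark, the same settings can also be passed through the session builder. Below is a minimal sketch; the endpoint, credentials, and application name are placeholders for your own values:

```python
from pyspark.sql import SparkSession

# All values are placeholders -- substitute your own lakeFS endpoint and credentials.
spark = (
    SparkSession.builder.appName("lakefs-s3a-example")
    .config("spark.hadoop.fs.s3a.access.key", "AKIAlakefs12345EXAMPLE")
    .config("spark.hadoop.fs.s3a.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
    .config("spark.hadoop.fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```
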
"fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io", 172 "fs.s3.path.style.access": "true", 173 "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE", 174 "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 175 "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io", 176 "fs.s3a.path.style.access": "true" 177 } 178 } 179 ] 180 ``` 181 182 Alternatively, you can pass these configuration values when adding a step. 183 184 For example: 185 186 ```bash 187 aws emr add-steps --cluster-id j-197B3AEGQ9XE4 \ 188 --steps="Type=Spark,Name=SparkApplication,ActionOnFailure=CONTINUE, \ 189 Args=[--conf,spark.hadoop.fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE, \ 190 --conf,spark.hadoop.fs.s3a.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY, \ 191 --conf,spark.hadoop.fs.s3a.endpoint=https://example-org.us-east-1.lakefscloud.io, \ 192 --conf,spark.hadoop.fs.s3a.path.style.access=true, \ 193 s3a://<lakefs-repo>/<lakefs-branch>/path/to/jar]" 194 ``` 195 196 </div> 197 </div> 198 199 #### Per-bucket configuration 200 201 The above configuration will use lakeFS as the sole S3 endpoint. To use lakeFS in parallel with S3, you can configure Spark to use lakeFS only for specific bucket names. 202 For example, to configure only `example-repo` to use lakeFS, set the following configurations: 203 204 <div class="tabs"> 205 <ul> 206 <li><a href="#s3-bucket-config-tabs-cli">CLI</a></li> 207 <li><a href="#s3-bucket-config-tabs-code">Scala</a></li> 208 <li><a href="#s3-bucket-config-tabs-xml">XML Configuration</a></li> 209 <li><a href="#s3-bucket-config-tabs-emr">EMR</a></li> 210 </ul> 211 <div markdown="1" id="s3-bucket-config-tabs-cli"> 212 ```sh 213 spark-shell --conf spark.hadoop.fs.s3a.bucket.example-repo.access.key='AKIAlakefs12345EXAMPLE' \ 214 --conf spark.hadoop.fs.s3a.bucket.example-repo.secret.key='abc/lakefs/1234567bPxRfiCYEXAMPLEKEY' \ 215 --conf spark.hadoop.fs.s3a.bucket.example-repo.endpoint='https://example-org.us-east-1.lakefscloud.io' \ 216 --conf spark.hadoop.fs.s3a.path.style.access=true 217 ``` 218 </div> 219 <div markdown="1" id="s3-bucket-config-tabs-code"> 220 ```scala 221 spark.sparkContext.hadoopConfiguration.set("fs.s3a.bucket.example-repo.access.key", "AKIAlakefs12345EXAMPLE") 222 spark.sparkContext.hadoopConfiguration.set("fs.s3a.bucket.example-repo.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY") 223 spark.sparkContext.hadoopConfiguration.set("fs.s3a.bucket.example-repo.endpoint", "https://example-org.us-east-1.lakefscloud.io") 224 spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true") 225 ``` 226 </div> 227 <div markdown="1" id="s3-bucket-config-tabs-xml"> 228 Add these into a configuration file, e.g. `$SPARK_HOME/conf/hdfs-site.xml`: 229 ```xml 230 <?xml version="1.0"?> 231 <configuration> 232 <property> 233 <name>fs.s3a.bucket.example-repo.access.key</name> 234 <value>AKIAlakefs12345EXAMPLE</value> 235 </property> 236 <property> 237 <name>fs.s3a.bucket.example-repo.secret.key</name> 238 <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value> 239 </property> 240 <property> 241 <name>fs.s3a.bucket.example-repo.endpoint</name> 242 <value>https://example-org.us-east-1.lakefscloud.io</value> 243 </property> 244 <property> 245 <name>fs.s3a.path.style.access</name> 246 <value>true</value> 247 </property> 248 </configuration> 249 ``` 250 </div> 251 <div markdown="1" id="s3-bucket-config-tabs-emr"> 252 Use the below configuration when creating the cluster. 
### Usage

Here's an example for reading a Parquet file from lakeFS to a Spark DataFrame:

```scala
val repo = "example-repo"
val branch = "main"
val df = spark.read.parquet(s"s3a://${repo}/${branch}/example-path/example-file.parquet")
```

Here's how to write some results back to a lakeFS path:

```scala
df.write.partitionBy("example-column").parquet(s"s3a://${repo}/${branch}/output-path/")
```

The data is now created in lakeFS as new changes in your branch. You can now commit these changes or revert them.

### Configuring Azure Databricks with the S3-compatible API

If you use Azure Databricks, you can take advantage of the lakeFS S3-compatible API with your Azure account and the S3A FileSystem.
This requires installing the `hadoop-aws` package (with the same version as your `hadoop-azure` package) on your Databricks cluster.

Define your FileSystem configurations in the following way:

```
spark.hadoop.fs.lakefs.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.lakefs.access.key='AKIAlakefs12345EXAMPLE' // The access key to your lakeFS server
spark.hadoop.fs.lakefs.secret.key='abc/lakefs/1234567bPxRfiCYEXAMPLEKEY' // The secret key to your lakeFS server
spark.hadoop.fs.lakefs.path.style.access=true
spark.hadoop.fs.lakefs.endpoint='https://example-org.us-east-1.lakefscloud.io' // The endpoint of your lakeFS server
```

For more details, see [Mounting cloud object storage on Databricks](https://docs.databricks.com/dbfs/mounts.html).

### Configuring Databricks SQL Warehouse with the S3-compatible API

A SQL warehouse is a compute resource that lets you run SQL commands on data
objects within Databricks SQL.

If you use Databricks SQL warehouse, you can take advantage of the lakeFS
S3-compatible API with the S3A FileSystem.

Define your SQL Warehouse configurations in the following way:

1. In the top right, select `Admin Settings` and then `SQL warehouse settings`.

2. Under `Data Access Configuration`, add the following key-value pairs for
   each lakeFS repository you want to access:

   ```
   spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
   spark.hadoop.fs.s3a.bucket.example-repo.access.key AKIAIOSFODNN7EXAMPLE // The access key to your lakeFS server
   spark.hadoop.fs.s3a.bucket.example-repo.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY // The secret key to your lakeFS server
   spark.hadoop.fs.s3a.bucket.example-repo.endpoint https://example-org.us-east-1.lakefscloud.io // The endpoint of your lakeFS server
   spark.hadoop.fs.s3a.bucket.example-repo.path.style.access true
   ```

3. Changes are applied automatically after the SQL Warehouse restarts.
4. You can now use the lakeFS S3-compatible API with your SQL Warehouse, e.g.:

   ```sql
   SELECT * FROM delta.`s3a://example-repo/main/datasets/delta-table/` LIMIT 100
   ```

### ⚠️ Experimental: Pre-signed mode for S3A

In Hadoop version 3.1.4 and above (as tested using our lakeFS Hadoop FS), it is possible to use pre-signed URLs as return values from the lakeFS S3 Gateway.

This has the immediate benefit of reducing the amount of traffic that has to go through the lakeFS server, thus improving IO performance.
To read more about pre-signed URLs, see [this guide](../reference/security/presigned-url.html).

Here's an example Spark configuration to enable this support:

```
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.bucket.example-repo.access.key AKIAIOSFODNN7EXAMPLE // The access key to your lakeFS server
spark.hadoop.fs.s3a.bucket.example-repo.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY // The secret key to your lakeFS server
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.bucket.example-repo.signing-algorithm QueryStringSignerType
spark.hadoop.fs.s3a.bucket.example-repo.user.agent.prefix s3RedirectionSupport
```

`user.agent.prefix` should **contain** the string `s3RedirectionSupport` but does not have to match the string exactly.
{: .note }

Once configured, requests will include the string `s3RedirectionSupport` in the `User-Agent` HTTP header sent with GetObject requests, resulting in lakeFS responding with a pre-signed URL.
Setting the `signing-algorithm` to `QueryStringSignerType` is required to stop S3A from signing a pre-signed URL, since the presence of more than one signature method will cause S3 to return an error.

ℹ This feature requires a lakeFS server of version `>1.18.0`
{: .note }

## lakeFS Hadoop FileSystem

If you're using lakeFS on top of S3, this mode will enhance your application's performance.
In this mode, Spark will read and write objects directly from S3, reducing the load on the lakeFS server.
It will still access the lakeFS server for metadata operations.

After configuring the lakeFS Hadoop FileSystem below, use URIs of the form `lakefs://example-repo/ref/path/to/data` to
interact with your data on lakeFS.

### Installation

<div class="tabs">
<ul>
  <li><a href="#install-standalone">Spark Standalone</a></li>
  <li><a href="#install-databricks">Databricks</a></li>
  <li><a href="#install-cloudera-spark">Cloudera Spark</a></li>
</ul>
<div markdown="1" id="install-standalone">

Add the package to your `spark-submit` command:

```
--packages io.lakefs:hadoop-lakefs-assembly:0.2.4
```

</div>
<div markdown="2" id="install-databricks">
In your cluster settings, under the _Libraries_ tab, add the following Maven package:

```
io.lakefs:hadoop-lakefs-assembly:0.2.4
```

Once installed, it should look something like this:

</div>
<div markdown="3" id="install-cloudera-spark">

Add the package to your `pyspark` or `spark-submit` command:

```
--packages io.lakefs:hadoop-lakefs-assembly:0.2.4
```

Add the configuration to access the S3 bucket used by lakeFS to your `pyspark` or `spark-submit` command, or add this configuration at the Cloudera cluster level (see below):

```
--conf spark.yarn.access.hadoopFileSystems=s3a://bucket-name
```

Add the configuration to access the S3 bucket used by lakeFS at the Cloudera cluster level:
1. Log in to the CDP (Cloudera Data Platform) web interface.
1. From the CDP home screen, click the `Management Console` icon.
1. In the Management Console, select `Data Hub Clusters` from the navigation pane.
1. Select the cluster you want to configure. Click on the `CM-UI` link under Services:
1. In the Cloudera Manager web interface, click on `Clusters` from the navigation pane and click on the `spark_on_yarn` option:
1. Click on the `Configuration` tab and search for `spark.yarn.access.hadoopFileSystems` in the search box:
1. Add the S3 bucket used by lakeFS, `s3a://bucket-name`, to the `spark.yarn.access.hadoopFileSystems` list:
</div>
</div>

### Configuration

Set the `fs.lakefs.*` Hadoop configurations to point to your lakeFS installation:

* `fs.lakefs.impl`: `io.lakefs.LakeFSFileSystem`
* `fs.lakefs.access.key`: lakeFS access key
* `fs.lakefs.secret.key`: lakeFS secret key
* `fs.lakefs.endpoint`: lakeFS API URL (e.g. `https://example-org.us-east-1.lakefscloud.io/api/v1`)

To configure the lakeFS client to use a temporary token instead of static credentials:

* `fs.lakefs.auth.provider`: The default is `basic_auth` with `fs.lakefs.access.key` and `fs.lakefs.secret.key` for basic authentication.
It can be set to `io.lakefs.auth.TemporaryAWSCredentialsLakeFSTokenProvider` to use temporary AWS credentials; you can read more about it [here]({% link reference/security/external-principals-aws.md %}).

When using `io.lakefs.auth.TemporaryAWSCredentialsLakeFSTokenProvider` as the auth provider, the following configurations are relevant:

* `fs.lakefs.token.aws.access.key`: AWS assumed role access key
* `fs.lakefs.token.aws.secret.key`: AWS assumed role secret key
* `fs.lakefs.token.aws.session.token`: AWS assumed role temporary session token
* `fs.lakefs.token.aws.sts.endpoint`: AWS STS regional endpoint for generating the presigned URL (e.g. `https://sts.us-west-2.amazonaws.com`)
* `fs.lakefs.token.aws.sts.duration_seconds`: Optional, the duration in seconds for the initial identity token (default is 60)
* `fs.lakefs.token.duration_seconds`: Optional, the duration in seconds for the lakeFS token (the default is set in the lakeFS configuration [auth.login_duration]({% link reference/configuration.md %}))
* `fs.lakefs.token.sts.additional_headers`: Optional, comma-separated list of `header:value` pairs to attach when generating the presigned STS request. The default is `X-Lakefs-Server-ID:fs.lakefs.endpoint`.

Configure the S3A FileSystem to access your S3 storage, for example using the `fs.s3a.*` configurations (these are **not** your lakeFS credentials):

* `fs.s3a.access.key`: AWS S3 access key
* `fs.s3a.secret.key`: AWS S3 secret key

Here are some configuration examples:
<div class="tabs">
<ul>
  <li><a href="#config-cli">CLI</a></li>
  <li><a href="#config-scala">Scala</a></li>
  <li><a href="#config-pyspark">PySpark</a></li>
  <li><a href="#config-xml">XML Configuration</a></li>
  <li><a href="#config-databricks">Databricks</a></li>
</ul>
<div markdown="1" id="config-cli">
```shell
spark-shell --conf spark.hadoop.fs.s3a.access.key='AKIAIOSFODNN7EXAMPLE' \
            --conf spark.hadoop.fs.s3a.secret.key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY' \
            --conf spark.hadoop.fs.s3a.endpoint='https://s3.eu-central-1.amazonaws.com' \
            --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
            --conf spark.hadoop.fs.lakefs.access.key=AKIAlakefs12345EXAMPLE \
            --conf spark.hadoop.fs.lakefs.secret.key=abc/lakefs/1234567bPxRfiCYEXAMPLEKEY \
            --conf spark.hadoop.fs.lakefs.endpoint=https://example-org.us-east-1.lakefscloud.io/api/v1 \
            --packages io.lakefs:hadoop-lakefs-assembly:0.2.4 \
            io.example.ExampleClass
```
</div>
<div markdown="1" id="config-scala">

```scala
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.eu-central-1.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1")
```
</div>
<div markdown="1" id="config-pyspark">

```python
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "https://s3.eu-central-1.amazonaws.com")
sc._jsc.hadoopConfiguration().set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
sc._jsc.hadoopConfiguration().set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
sc._jsc.hadoopConfiguration().set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
sc._jsc.hadoopConfiguration().set("fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1")
```
</div>
<div markdown="1" id="config-xml">

Make sure that you load the lakeFS FileSystem into Spark by running it with `--packages` or `--jars`,
and then add these into a configuration file, e.g., `$SPARK_HOME/conf/hdfs-site.xml`:

```xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>AKIAIOSFODNN7EXAMPLE</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>https://s3.eu-central-1.amazonaws.com</value>
    </property>
    <property>
        <name>fs.lakefs.impl</name>
        <value>io.lakefs.LakeFSFileSystem</value>
    </property>
    <property>
        <name>fs.lakefs.access.key</name>
        <value>AKIAlakefs12345EXAMPLE</value>
    </property>
    <property>
        <name>fs.lakefs.secret.key</name>
        <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value>
    </property>
    <property>
        <name>fs.lakefs.endpoint</name>
        <value>https://example-org.us-east-1.lakefscloud.io/api/v1</value>
    </property>
</configuration>
```
</div>
<div markdown="1" id="config-databricks">

Add the following to the cluster's configuration under `Configuration ➡️ Advanced options`:

```
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.access.key AKIAlakefs12345EXAMPLE
spark.hadoop.fs.lakefs.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.s3a.access.key AKIAIOSFODNN7EXAMPLE
spark.hadoop.fs.s3a.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.lakefs.endpoint https://example-org.us-east-1.lakefscloud.io/api/v1
```

Alternatively, follow this [step by step Databricks integration tutorial, including lakeFS Hadoop File System, Python client and lakeFS SPARK client](https://lakefs.io/blog/databricks-lakefs-integration-tutorial/).
</div>
</div>

⚠️ If your bucket is in a region other than us-east-1, you may also need to configure `fs.s3a.endpoint` with the correct region.
Amazon provides [S3 endpoints](https://docs.aws.amazon.com/general/latest/gr/s3.html) you can use.
{: .note }

sc._jsc.hadoopConfiguration().set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem") 587 sc._jsc.hadoopConfiguration().set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE") 588 sc._jsc.hadoopConfiguration().set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY") 589 sc._jsc.hadoopConfiguration().set("fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1") 590 ``` 591 </div> 592 <div markdown="1" id="config-xml"> 593 594 Make sure that you load the lakeFS FileSystem into Spark by running it with `--packages` or `--jars`, 595 and then add these into a configuration file, e.g., `$SPARK_HOME/conf/hdfs-site.xml`: 596 597 ```xml 598 <?xml version="1.0"?> 599 <configuration> 600 <property> 601 <name>fs.s3a.access.key</name> 602 <value>AKIAIOSFODNN7EXAMPLE</value> 603 </property> 604 <property> 605 <name>fs.s3a.secret.key</name> 606 <value>wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY</value> 607 </property> 608 <property> 609 <name>fs.s3a.endpoint</name> 610 <value>https://s3.eu-central-1.amazonaws.com</value> 611 </property> 612 <property> 613 <name>fs.lakefs.impl</name> 614 <value>io.lakefs.LakeFSFileSystem</value> 615 </property> 616 <property> 617 <name>fs.lakefs.access.key</name> 618 <value>AKIAlakefs12345EXAMPLE</value> 619 </property> 620 <property> 621 <name>fs.lakefs.secret.key</name> 622 <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value> 623 </property> 624 <property> 625 <name>fs.lakefs.endpoint</name> 626 <value>https://example-org.us-east-1.lakefscloud.io/api/v1</value> 627 </property> 628 </configuration> 629 ``` 630 </div> 631 <div markdown="1" id="config-databricks"> 632 633 Add the following the cluster's configuration under `Configuration ➡️ Advanced options`: 634 635 ``` 636 spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem 637 spark.hadoop.fs.lakefs.access.key AKIAlakefs12345EXAMPLE 638 spark.hadoop.fs.lakefs.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY 639 spark.hadoop.fs.s3a.access.key AKIAIOSFODNN7EXAMPLE 640 spark.hadoop.fs.s3a.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY 641 spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem 642 spark.hadoop.fs.lakefs.endpoint https://example-org.us-east-1.lakefscloud.io/api/v1 643 ``` 644 645 Alternatively, follow this [step by step Databricks integration tutorial, including lakeFS Hadoop File System, Python client and lakeFS SPARK client](https://lakefs.io/blog/databricks-lakefs-integration-tutorial/). 646 </div> 647 </div> 648 649 ⚠️ If your bucket is on a region other than us-east-1, you may also need to configure `fs.s3a.endpoint` with the correct region. 650 Amazon provides [S3 endpoints](https://docs.aws.amazon.com/general/latest/gr/s3.html) you can use. 651 {: .note } 652 653 ### Usage with TemporaryAWSCredentialsLakeFSTokenProvider 654 655 An initial setup is required - you must have [AWS Auth configured]({% link reference/security/external-principals-aws.md %}) with lakeFS. 656 The `TemporaryAWSCredentialsLakeFSTokenProvider` depends on the caller to provide AWS credentials (e.g Assumed Role Key,Secret and Token) as input to the lakeFS client. 657 658 ⚠️ Configure `sts.endpoint` with a valid [sts regional service endpoint](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html) and it must be be equal to the region that is used for authentication first place. The only exception is `us-east-1` which is the default region for STS. 
### Usage

Hadoop FileSystem paths use the `lakefs://` protocol, with paths taking the form `lakefs://<repository>/<ref>/path/to/object`.
`<ref>` can be a branch, tag, or commit ID in lakeFS.
Here's an example for reading a Parquet file from lakeFS to a Spark DataFrame:

```scala
val repo = "example-repo"
val branch = "main"
val df = spark.read.parquet(s"lakefs://${repo}/${branch}/example-path/example-file.parquet")
```

Here's how to write some results back to a lakeFS path:

```scala
df.write.partitionBy("example-column").parquet(s"lakefs://${repo}/${branch}/output-path/")
```

The data is now created in lakeFS as new changes in your branch. You can now commit these changes or revert them.

## Hadoop FileSystem in Presigned mode

_Available starting version 0.1.13 of the FileSystem_

In this mode, the lakeFS server is responsible for authenticating with your storage.
The client will still perform data operations directly on the storage.
To do so, it will use pre-signed storage URLs provided by the lakeFS server.

When using this mode, you don't need to configure the client with access to your storage:

<div class="tabs">
<ul>
  <li><a href="#config-cli">CLI</a></li>
  <li><a href="#config-scala">Scala</a></li>
  <li><a href="#config-pyspark">PySpark</a></li>
  <li><a href="#config-xml">XML Configuration</a></li>
  <li><a href="#config-databricks">Databricks</a></li>
</ul>
<div markdown="1" id="config-cli">
```shell
spark-shell --conf spark.hadoop.fs.lakefs.access.mode=presigned \
            --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
            --conf spark.hadoop.fs.lakefs.access.key=AKIAlakefs12345EXAMPLE \
            --conf spark.hadoop.fs.lakefs.secret.key=abc/lakefs/1234567bPxRfiCYEXAMPLEKEY \
            --conf spark.hadoop.fs.lakefs.endpoint=https://example-org.us-east-1.lakefscloud.io/api/v1 \
            --packages io.lakefs:hadoop-lakefs-assembly:0.2.4
```
</div>
<div markdown="1" id="config-scala">

```scala
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.mode", "presigned")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1")
```
</div>
<div markdown="1" id="config-pyspark">

```python
sc._jsc.hadoopConfiguration().set("fs.lakefs.access.mode", "presigned")
sc._jsc.hadoopConfiguration().set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
sc._jsc.hadoopConfiguration().set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
sc._jsc.hadoopConfiguration().set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
sc._jsc.hadoopConfiguration().set("fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1")
```
</div>
<div markdown="1" id="config-xml">

Make sure that you load the lakeFS FileSystem into Spark by running it with `--packages` or `--jars`,
and then add these into a configuration file, e.g., `$SPARK_HOME/conf/hdfs-site.xml`:

```xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.lakefs.access.mode</name>
        <value>presigned</value>
    </property>
    <property>
        <name>fs.lakefs.impl</name>
        <value>io.lakefs.LakeFSFileSystem</value>
    </property>
    <property>
        <name>fs.lakefs.access.key</name>
        <value>AKIAlakefs12345EXAMPLE</value>
    </property>
    <property>
        <name>fs.lakefs.secret.key</name>
        <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value>
    </property>
    <property>
        <name>fs.lakefs.endpoint</name>
        <value>https://example-org.us-east-1.lakefscloud.io/api/v1</value>
    </property>
</configuration>
```
</div>
<div markdown="1" id="config-databricks">

Add the following to the cluster's configuration under `Configuration ➡️ Advanced options`:

```
spark.hadoop.fs.lakefs.access.mode presigned
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.access.key AKIAlakefs12345EXAMPLE
spark.hadoop.fs.lakefs.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.lakefs.endpoint https://example-org.us-east-1.lakefscloud.io/api/v1
```
</div>
</div>
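
Once presigned mode is configured, reads and writes work exactly as with the regular lakeFS Hadoop FileSystem, using `lakefs://` URIs. A minimal PySpark sketch (repository, branch, and paths are illustrative):

```python
df = spark.read.parquet("lakefs://example-repo/main/example-path/example-file.parquet")
df.write.partitionBy("example-column").parquet("lakefs://example-repo/main/output-path/")
```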