github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/integrations/spark.md (about) 1 --- 2 title: Apache Spark 3 description: Accessing data in lakeFS from Apache Spark works the same as accessing S3 data from Apache Spark. 4 parent: Integrations 5 redirect_from: 6 - /integrations/databricks.html 7 - /integrations/emr.html 8 - /integrations/glue_etl.html 9 - /using/databricks.html 10 - /using/emr.html 11 - /using/glue_etl.html 12 - /using/spark.html 13 --- 14 15 # Using lakeFS with Apache Spark 16 17 There are several ways to use lakeFS with Spark: 18 19 * [The S3-compatible API](#s3-compatible-api): Scalable and best to get started. <span class="badge">All Storage Vendors</span> 20 * [The lakeFS FileSystem](#lakefs-hadoop-filesystem): Direct data flow from client to storage, highly scalable. <span class="badge">AWS S3</span> 21 * [lakeFS FileSystem in Presigned mode](#hadoop-filesystem-in-presigned-mode): Best of both worlds. <span class="badge mr-1">AWS S3</span><span class="badge">Azure Blob</span> 22 23 See how SimilarWeb is using lakeFS with Spark to [manage algorithm changes in data pipelines](https://grdoron.medium.com/a-smarter-way-to-manage-algorithm-changes-in-data-pipelines-with-lakefs-a4e284f8c756). 24 {: .note } 25 26 {% include toc.html %} 27 28 ## S3-compatible API 29 30 lakeFS has an S3-compatible endpoint which you can point Spark at to get started quickly. 31 32 You will access your data using S3-style URIs, e.g. `s3a://example-repo/example-branch/example-table`. 33 34 You can use the S3-compatible API regardless of where your data is hosted. 35 36 ### Configuration 37 38 To configure Spark to work with lakeFS, we set S3A Hadoop configuration to the lakeFS endpoint and credentials: 39 40 * `fs.s3a.access.key`: lakeFS access key 41 * `fs.s3a.secret.key`: lakeFS secret key 42 * `fs.s3a.endpoint`: lakeFS S3-compatible API endpoint (e.g. https://example-org.us-east-1.lakefscloud.io) 43 * `fs.s3a.path.style.access`: `true` 44 45 Here is how to do it: 46 <div class="tabs"> 47 <ul> 48 <li><a href="#s3-config-tabs-cli">CLI</a></li> 49 <li><a href="#s3-config-tabs-code">Scala</a></li> 50 <li><a href="#s3-config-tabs-xml">XML Configuration</a></li> 51 <li><a href="#s3-config-tabs-emr">EMR</a></li> 52 </ul> 53 <div markdown="1" id="s3-config-tabs-cli"> 54 ```shell 55 spark-shell --conf spark.hadoop.fs.s3a.access.key='AKIAlakefs12345EXAMPLE' \ 56 --conf spark.hadoop.fs.s3a.secret.key='abc/lakefs/1234567bPxRfiCYEXAMPLEKEY' \ 57 --conf spark.hadoop.fs.s3a.path.style.access=true \ 58 --conf spark.hadoop.fs.s3a.endpoint='https://example-org.us-east-1.lakefscloud.io' ... 59 ``` 60 </div> 61 <div markdown="1" id="s3-config-tabs-code"> 62 ```scala 63 spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "AKIAlakefs12345EXAMPLE") 64 spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY") 65 spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io") 66 spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true") 67 ``` 68 </div> 69 <div markdown="1" id="s3-config-tabs-xml"> 70 Add these into a configuration file, e.g. `$SPARK_HOME/conf/hdfs-site.xml`: 71 ```xml 72 <?xml version="1.0"?> 73 <configuration> 74 <property> 75 <name>fs.s3a.access.key</name> 76 <value>AKIAlakefs12345EXAMPLE</value> 77 </property> 78 <property> 79 <name>fs.s3a.secret.key</name> 80 <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value> 81 </property> 82 <property> 83 <name>fs.s3a.endpoint</name> 84 <value>https://example-org.us-east-1.lakefscloud.io</value> 85 </property> 86 <property> 87 <name>fs.s3a.path.style.access</name> 88 <value>true</value> 89 </property> 90 </configuration> 91 ``` 92 </div> 93 <div markdown="1" id="s3-config-tabs-emr"> 94 Use the below configuration when creating the cluster. You may delete any app configuration that is not suitable for your use case: 95 96 ```json 97 [ 98 { 99 "Classification": "spark-defaults", 100 "Properties": { 101 "spark.sql.catalogImplementation": "hive" 102 } 103 }, 104 { 105 "Classification": "core-site", 106 "Properties": { 107 "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE", 108 "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 109 "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io", 110 "fs.s3.path.style.access": "true", 111 "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE", 112 "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 113 "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io", 114 "fs.s3a.path.style.access": "true" 115 } 116 }, 117 { 118 "Classification": "emrfs-site", 119 "Properties": { 120 "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE", 121 "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 122 "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io", 123 "fs.s3.path.style.access": "true", 124 "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE", 125 "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 126 "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io", 127 "fs.s3a.path.style.access": "true" 128 } 129 }, 130 { 131 "Classification": "presto-connector-hive", 132 "Properties": { 133 "hive.s3.aws-access-key": "AKIAIOSFODNN7EXAMPLE", 134 "hive.s3.aws-secret-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 135 "hive.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io", 136 "hive.s3.path-style-access": "true", 137 "hive.s3-file-system-type": "PRESTO" 138 } 139 }, 140 { 141 "Classification": "hive-site", 142 "Properties": { 143 "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE", 144 "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 145 "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io", 146 "fs.s3.path.style.access": "true", 147 "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE", 148 "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 149 "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io", 150 "fs.s3a.path.style.access": "true" 151 } 152 }, 153 { 154 "Classification": "hdfs-site", 155 "Properties": { 156 "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE", 157 "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 158 "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io", 159 "fs.s3.path.style.access": "true", 160 "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE", 161 "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 162 "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io", 163 "fs.s3a.path.style.access": "true" 164 } 165 }, 166 { 167 "Classification": "mapred-site", 168 "Properties": { 169 "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE", 170 "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 171 "fs.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io", 172 "fs.s3.path.style.access": "true", 173 "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE", 174 "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 175 "fs.s3a.endpoint": "https://example-org.us-east-1.lakefscloud.io", 176 "fs.s3a.path.style.access": "true" 177 } 178 } 179 ] 180 ``` 181 182 Alternatively, you can pass these configuration values when adding a step. 183 184 For example: 185 186 ```bash 187 aws emr add-steps --cluster-id j-197B3AEGQ9XE4 \ 188 --steps="Type=Spark,Name=SparkApplication,ActionOnFailure=CONTINUE, \ 189 Args=[--conf,spark.hadoop.fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE, \ 190 --conf,spark.hadoop.fs.s3a.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY, \ 191 --conf,spark.hadoop.fs.s3a.endpoint=https://example-org.us-east-1.lakefscloud.io, \ 192 --conf,spark.hadoop.fs.s3a.path.style.access=true, \ 193 s3a://<lakefs-repo>/<lakefs-branch>/path/to/jar]" 194 ``` 195 196 </div> 197 </div> 198 199 #### Per-bucket configuration 200 201 The above configuration will use lakeFS as the sole S3 endpoint. To use lakeFS in parallel with S3, you can configure Spark to use lakeFS only for specific bucket names. 202 For example, to configure only `example-repo` to use lakeFS, set the following configurations: 203 204 <div class="tabs"> 205 <ul> 206 <li><a href="#s3-bucket-config-tabs-cli">CLI</a></li> 207 <li><a href="#s3-bucket-config-tabs-code">Scala</a></li> 208 <li><a href="#s3-bucket-config-tabs-xml">XML Configuration</a></li> 209 <li><a href="#s3-bucket-config-tabs-emr">EMR</a></li> 210 </ul> 211 <div markdown="1" id="s3-bucket-config-tabs-cli"> 212 ```sh 213 spark-shell --conf spark.hadoop.fs.s3a.bucket.example-repo.access.key='AKIAlakefs12345EXAMPLE' \ 214 --conf spark.hadoop.fs.s3a.bucket.example-repo.secret.key='abc/lakefs/1234567bPxRfiCYEXAMPLEKEY' \ 215 --conf spark.hadoop.fs.s3a.bucket.example-repo.endpoint='https://example-org.us-east-1.lakefscloud.io' \ 216 --conf spark.hadoop.fs.s3a.path.style.access=true 217 ``` 218 </div> 219 <div markdown="1" id="s3-bucket-config-tabs-code"> 220 ```scala 221 spark.sparkContext.hadoopConfiguration.set("fs.s3a.bucket.example-repo.access.key", "AKIAlakefs12345EXAMPLE") 222 spark.sparkContext.hadoopConfiguration.set("fs.s3a.bucket.example-repo.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY") 223 spark.sparkContext.hadoopConfiguration.set("fs.s3a.bucket.example-repo.endpoint", "https://example-org.us-east-1.lakefscloud.io") 224 spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true") 225 ``` 226 </div> 227 <div markdown="1" id="s3-bucket-config-tabs-xml"> 228 Add these into a configuration file, e.g. `$SPARK_HOME/conf/hdfs-site.xml`: 229 ```xml 230 <?xml version="1.0"?> 231 <configuration> 232 <property> 233 <name>fs.s3a.bucket.example-repo.access.key</name> 234 <value>AKIAlakefs12345EXAMPLE</value> 235 </property> 236 <property> 237 <name>fs.s3a.bucket.example-repo.secret.key</name> 238 <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value> 239 </property> 240 <property> 241 <name>fs.s3a.bucket.example-repo.endpoint</name> 242 <value>https://example-org.us-east-1.lakefscloud.io</value> 243 </property> 244 <property> 245 <name>fs.s3a.path.style.access</name> 246 <value>true</value> 247 </property> 248 </configuration> 249 ``` 250 </div> 251 <div markdown="1" id="s3-bucket-config-tabs-emr"> 252 Use the below configuration when creating the cluster. You may delete any app configuration that is not suitable for your use case: 253 254 ```json 255 [ 256 { 257 "Classification": "spark-defaults", 258 "Properties": { 259 "spark.sql.catalogImplementation": "hive" 260 } 261 }, 262 { 263 "Classification": "core-site", 264 "Properties": { 265 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 266 "fs.s3.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 267 "fs.s3.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 268 "fs.s3.bucket.example-repo.path.style.access": "true", 269 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 270 "fs.s3a.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 271 "fs.s3a.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 272 "fs.s3a.bucket.example-repo.path.style.access": "true" 273 } 274 }, 275 { 276 "Classification": "emrfs-site", 277 "Properties": { 278 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 279 "fs.s3.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 280 "fs.s3.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 281 "fs.s3.bucket.example-repo.path.style.access": "true", 282 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 283 "fs.s3a.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 284 "fs.s3a.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 285 "fs.s3a.bucket.example-repo.path.style.access": "true" 286 } 287 }, 288 { 289 "Classification": "presto-connector-hive", 290 "Properties": { 291 "hive.s3.aws-access-key": "AKIAIOSFODNN7EXAMPLE", 292 "hive.s3.aws-secret-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 293 "hive.s3.endpoint": "https://example-org.us-east-1.lakefscloud.io", 294 "hive.s3.path-style-access": "true", 295 "hive.s3-file-system-type": "PRESTO" 296 } 297 }, 298 { 299 "Classification": "hive-site", 300 "Properties": { 301 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 302 "fs.s3.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 303 "fs.s3.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 304 "fs.s3.bucket.example-repo.path.style.access": "true", 305 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 306 "fs.s3a.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 307 "fs.s3a.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 308 "fs.s3a.bucket.example-repo.path.style.access": "true" 309 } 310 }, 311 { 312 "Classification": "hdfs-site", 313 "Properties": { 314 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 315 "fs.s3.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 316 "fs.s3.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 317 "fs.s3.bucket.example-repo.path.style.access": "true", 318 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 319 "fs.s3a.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 320 "fs.s3a.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 321 "fs.s3a.bucket.example-repo.path.style.access": "true" 322 } 323 }, 324 { 325 "Classification": "mapred-site", 326 "Properties": { 327 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 328 "fs.s3.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 329 "fs.s3.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 330 "fs.s3.bucket.example-repo.path.style.access": "true", 331 "fs.s3a.bucket.example-repo.access.key": "AKIAIOSFODNN7EXAMPLE", 332 "fs.s3a.bucket.example-repo.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 333 "fs.s3a.bucket.example-repo.endpoint": "https://example-org.us-east-1.lakefscloud.io", 334 "fs.s3a.bucket.example-repo.path.style.access": "true" 335 } 336 } 337 ] 338 ``` 339 340 Alternatively, you can pass these configuration values when adding a step. 341 342 For example: 343 344 ```bash 345 aws emr add-steps --cluster-id j-197B3AEGQ9XE4 \ 346 --steps="Type=Spark,Name=SparkApplication,ActionOnFailure=CONTINUE, \ 347 Args=[--conf,spark.hadoop.fs.s3a.bucket.example-repo.access.key=AKIAIOSFODNN7EXAMPLE, \ 348 --conf,spark.hadoop.fs.s3a.bucket.example-repo.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY, \ 349 --conf,spark.hadoop.fs.s3a.bucket.example-repo.endpoint=https://example-org.us-east-1.lakefscloud.io, \ 350 --conf,spark.hadoop.fs.s3a.path.style.access=true, \ 351 s3a://<lakefs-repo>/<lakefs-branch>/path/to/jar]" 352 ``` 353 354 </div> 355 </div> 356 357 With this configuration set, you read S3A paths with `example-repo` as the bucket will use lakeFS, while all other buckets will use AWS S3. 358 359 ### Usage 360 361 Here's an example for reading a Parquet file from lakeFS to a Spark DataFrame: 362 363 ```scala 364 val repo = "example-repo" 365 val branch = "main" 366 val df = spark.read.parquet(s"s3a://${repo}/${branch}/example-path/example-file.parquet") 367 ``` 368 369 Here's how to write some results back to a lakeFS path: 370 ```scala 371 df.write.partitionBy("example-column").parquet(s"s3a://${repo}/${branch}/output-path/") 372 ``` 373 374 The data is now created in lakeFS as new changes in your branch. You can now commit these changes or revert them. 375 376 ### Configuring Azure Databricks with the S3-compatible API 377 378 If you use Azure Databricks, you can take advantage of the lakeFS S3-compatible API with your Azure account and the S3A FileSystem. 379 This will require installing the `hadoop-aws` package (with the same version as your `hadoop-azure` package) to your Databricks cluster. 380 381 Define your FileSystem configurations in the following way: 382 383 ``` 384 spark.hadoop.fs.lakefs.impl=org.apache.hadoop.fs.s3a.S3AFileSystem 385 spark.hadoop.fs.lakefs.access.key=‘AKIAlakefs12345EXAMPLE’ // The access key to your lakeFS server 386 spark.hadoop.fs.lakefs.secret.key=‘abc/lakefs/1234567bPxRfiCYEXAMPLEKEY’ // The secret key to your lakeFS server 387 spark.hadoop.fs.lakefs.path.style.access=true 388 spark.hadoop.fs.lakefs.endpoint=‘https://example-org.us-east-1.lakefscloud.io’ // The endpoint of your lakeFS server 389 ``` 390 391 For more details about [Mounting cloud object storage on Databricks](https://docs.databricks.com/dbfs/mounts.html). 392 393 ### Configuring Databricks SQL Warehouse with the S3-compatible API 394 395 A SQL warehouse is a compute resource that lets you run SQL commands on data 396 objects within Databricks SQL. 397 398 If you use Databricks SQL warehouse, you can take advantage of the lakeFS 399 S3-compatible API with the S3A FileSystem. 400 401 Define your SQL Warehouse configurations in the following way: 402 403 1. In the top right, select `Admin Settings` and then `SQL warehouse settings`. 404 405 2. Under `Data Access Configuration` add the following key-value pairs for 406 each lakeFS repository you want to access: 407 408 ``` 409 spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem 410 spark.hadoop.fs.s3a.bucket.example-repo.access.key AKIAIOSFODNN7EXAMPLE // The access key to your lakeFS server 411 spark.hadoop.fs.s3a.bucket.example-repo.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY // The secret key to your lakeFS server 412 spark.hadoop.fs.s3a.bucket.example-repo.endpoint https://example-org.us-east-1.lakefscloud.io // The endpoint of your lakeFS server 413 spark.hadoop.fs.s3a.bucket.example-repo.path.style.access true 414 ``` 415 416 3. Changes are applied automatically after the SQL Warehouse restarts. 417 4. You can now use the lakeFS S3-compatible API with your SQL Warehouse, e.g.: 418 419 ```sql 420 SELECT * FROM delta.`s3a://example-repo/main/datasets/delta-table/` LIMIT 100 421 ``` 422 ### ⚠️ Experimental: Pre-signed mode for S3A 423 424 In Hadoop 3.1.4 version and above (as tested using our lakeFS Hadoop FS), it is possible to use pre-signed URLs as return values from the lakeFS S3 Gateway. 425 426 This has the immediate benefit of reducing the amount of traffic that has to go through the lakeFS server thus improving IO performance. 427 To read more about pre-signed URLs, see [this guide](../reference/security/presigned-url.html). 428 429 Here's an example Spark configuration to enable this support: 430 431 ``` 432 spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem 433 spark.hadoop.fs.s3a.bucket.example-repo.access.key AKIAIOSFODNN7EXAMPLE // The access key to your lakeFS server 434 spark.hadoop.fs.s3a.bucket.example-repo.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY // The secret key to your lakeFS server 435 spark.hadoop.fs.s3a.path.style.access true 436 spark.hadoop.fs.s3a.bucket.example-repo.signing-algorithm QueryStringSignerType 437 spark.hadoop.fs.s3a.bucket.example-repo.user.agent.prefix s3RedirectionSupport 438 ``` 439 440 `user.agent.prefix` should **contain** the string `s3RedirectionSupport` but does not have to match the string exactly. 441 {: .note } 442 443 444 Once configured, requests will include the string `s3RedirectionSupport` in the `User-Agent` HTTP header sent with GetObject requests, resulting in lakeFS responding with a pre-signed URL. 445 Setting the `signing-algorithm` to `QueryStringSignerType` is required to stop S3A from signing a pre-signed URL, since the existence of more than one signature method will return an error from S3. 446 447 ℹ This feature requires a lakeFS server of version `>1.18.0` 448 {: .note } 449 450 ## lakeFS Hadoop FileSystem 451 452 If you're using lakeFS on top of S3, this mode will enhance your application's performance. 453 In this mode, Spark will read and write objects directly from S3, reducing the load on the lakeFS server. 454 It will still access the lakeFS server for metadata operations. 455 456 After configuring the lakeFS Hadoop FileSystem below, use URIs of the form `lakefs://example-repo/ref/path/to/data` to 457 interact with your data on lakeFS. 458 459 ### Installation 460 461 <div class="tabs"> 462 <ul> 463 <li><a href="#install-standalone">Spark Standalone</a></li> 464 <li><a href="#install-databricks">Databricks</a></li> 465 <li><a href="#install-cloudera-spark">Cloudera Spark</a></li> 466 </ul> 467 <div markdown="1" id="install-standalone"> 468 469 Add the package to your `spark-submit` command: 470 471 ``` 472 --packages io.lakefs:hadoop-lakefs-assembly:0.2.4 473 ``` 474 475 </div> 476 <div markdown="2" id="install-databricks"> 477 In your cluster settings, under the _Libraries_ tab, add the following Maven package: 478 479 ``` 480 io.lakefs:hadoop-lakefs-assembly:0.2.4 481 ``` 482 483 Once installed, it should look something like this: 484 485  486 487 </div> 488 <div markdown="3" id="install-cloudera-spark"> 489 490 Add the package to your `pyspark` or `spark-submit` command: 491 492 ``` 493 --packages io.lakefs:hadoop-lakefs-assembly:0.2.4 494 ``` 495 496 Add the configuration to access the S3 bucket used by lakeFS to your `pyspark` or `spark-submit` command or add this configuration at the Cloudera cluster level (see below): 497 498 ``` 499 --conf spark.yarn.access.hadoopFileSystems=s3a://bucket-name 500 ``` 501 502 Add the configuration to access the S3 bucket used by lakeFS at the Cloudera cluster level: 503 1. Log in to the CDP (Cloudera Data Platform) web interface. 504 1. From the CDP home screen, click the `Management Console` icon. 505 1. In the Management Console, select `Data Hub Clusters` from the navigation pane. 506 1. Select the cluster you want to configure. Click on `CM-UI` link under Services: 507  508 1. In Cloudera Manager web interface, click on `Clusters` from the navigation pane and click on `spark_on_yarn` option: 509  510 1. Click on `Configuration` tab and search for `spark.yarn.access.hadoopFileSystems` in the search box: 511  512 1. Add S3 bucket used by lakeFS `s3a://bucket-name` in the `spark.yarn.access.hadoopFileSystems` list: 513  514 </div> 515 </div> 516 517 518 ### Configuration 519 520 Set the `fs.lakefs.*` Hadoop configurations to point to your lakeFS installation: 521 * `fs.lakefs.impl`: `io.lakefs.LakeFSFileSystem` 522 * `fs.lakefs.access.key`: lakeFS access key 523 * `fs.lakefs.secret.key`: lakeFS secret key 524 * `fs.lakefs.endpoint`: lakeFS API URL (e.g. `https://example-org.us-east-1.lakefscloud.io/api/v1`) 525 526 Configure the lakeFS client to use a temporary token instead of static credentials: 527 528 * `fs.lakefs.auth.provider`: The default is `basic_auth` with `fs.lakefs.access.key` and `fs.lakefs.secret.key` for basic authentication. 529 Can be set to `io.lakefs.auth.TemporaryAWSCredentialsLakeFSTokenProvider` for using temporary AWS credentials, you can read more about it [here]({% link reference/security/external-principals-aws.md %}). 530 531 When using `io.lakefs.auth.TemporaryAWSCredentialsLakeFSTokenProvider` as the auth provider the following configuration are relevant: 532 533 * `fs.lakefs.token.aws.access.key`: AWS assumed role access key 534 * `fs.lakefs.token.aws.secret.key`: AWS assumed role secret key 535 * `fs.lakefs.token.aws.session.token`: AWS assumed role temporary session token 536 * `fs.lakefs.token.aws.sts.endpoint`: AWS STS regional endpoint for generated the presigned-url (i.e `https://sts.us-west-2.amazonaws.com`) 537 * `fs.lakefs.token.aws.sts.duration_seconds`: Optional, the duration in seconds for the initial identity token (default is 60) 538 * `fs.lakefs.token.duration_seconds`: Optional, the duration in seconds for the lakeFS token (default is set in the lakeFS configuration [auth.login_duration]({% link reference/configuration.md %})) 539 * `fs.lakefs.token.sts.additional_headers`: Optional, comma separated list of `header:value` to attach when generating presigned sts request. Default is `X-Lakefs-Server-ID:fs.lakefs.endpoint`. 540 541 Configure the S3A FileSystem to access your S3 storage, for example using the `fs.s3a.*` configurations (these are **not** your lakeFS credentials): 542 543 * `fs.s3a.access.key`: AWS S3 access key 544 * `fs.s3a.secret.key`: AWS S3 secret key 545 546 Here are some configuration examples: 547 <div class="tabs"> 548 <ul> 549 <li><a href="#config-cli">CLI</a></li> 550 <li><a href="#config-scala">Scala</a></li> 551 <li><a href="#config-pyspark">PySpark</a></li> 552 <li><a href="#config-xml">XML Configuration</a></li> 553 <li><a href="#config-databricks">Databricks</a></li> 554 </ul> 555 <div markdown="1" id="config-cli"> 556 ```shell 557 spark-shell --conf spark.hadoop.fs.s3a.access.key='AKIAIOSFODNN7EXAMPLE' \ 558 --conf spark.hadoop.fs.s3a.secret.key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY' \ 559 --conf spark.hadoop.fs.s3a.endpoint='https://s3.eu-central-1.amazonaws.com' \ 560 --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \ 561 --conf spark.hadoop.fs.lakefs.access.key=AKIAlakefs12345EXAMPLE \ 562 --conf spark.hadoop.fs.lakefs.secret.key=abc/lakefs/1234567bPxRfiCYEXAMPLEKEY \ 563 --conf spark.hadoop.fs.lakefs.endpoint=https://example-org.us-east-1.lakefscloud.io/api/v1 \ 564 --packages io.lakefs:hadoop-lakefs-assembly:0.2.4 \ 565 io.example.ExampleClass 566 ``` 567 </div> 568 <div markdown="1" id="config-scala"> 569 570 ```scala 571 spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE") 572 spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY") 573 spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.eu-central-1.amazonaws.com") 574 spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem") 575 spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE") 576 spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY") 577 spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1") 578 ``` 579 </div> 580 <div markdown="1" id="config-pyspark"> 581 582 ```python 583 sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE") 584 sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY") 585 sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "https://s3.eu-central-1.amazonaws.com") 586 sc._jsc.hadoopConfiguration().set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem") 587 sc._jsc.hadoopConfiguration().set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE") 588 sc._jsc.hadoopConfiguration().set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY") 589 sc._jsc.hadoopConfiguration().set("fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1") 590 ``` 591 </div> 592 <div markdown="1" id="config-xml"> 593 594 Make sure that you load the lakeFS FileSystem into Spark by running it with `--packages` or `--jars`, 595 and then add these into a configuration file, e.g., `$SPARK_HOME/conf/hdfs-site.xml`: 596 597 ```xml 598 <?xml version="1.0"?> 599 <configuration> 600 <property> 601 <name>fs.s3a.access.key</name> 602 <value>AKIAIOSFODNN7EXAMPLE</value> 603 </property> 604 <property> 605 <name>fs.s3a.secret.key</name> 606 <value>wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY</value> 607 </property> 608 <property> 609 <name>fs.s3a.endpoint</name> 610 <value>https://s3.eu-central-1.amazonaws.com</value> 611 </property> 612 <property> 613 <name>fs.lakefs.impl</name> 614 <value>io.lakefs.LakeFSFileSystem</value> 615 </property> 616 <property> 617 <name>fs.lakefs.access.key</name> 618 <value>AKIAlakefs12345EXAMPLE</value> 619 </property> 620 <property> 621 <name>fs.lakefs.secret.key</name> 622 <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value> 623 </property> 624 <property> 625 <name>fs.lakefs.endpoint</name> 626 <value>https://example-org.us-east-1.lakefscloud.io/api/v1</value> 627 </property> 628 </configuration> 629 ``` 630 </div> 631 <div markdown="1" id="config-databricks"> 632 633 Add the following the cluster's configuration under `Configuration ➡️ Advanced options`: 634 635 ``` 636 spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem 637 spark.hadoop.fs.lakefs.access.key AKIAlakefs12345EXAMPLE 638 spark.hadoop.fs.lakefs.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY 639 spark.hadoop.fs.s3a.access.key AKIAIOSFODNN7EXAMPLE 640 spark.hadoop.fs.s3a.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY 641 spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem 642 spark.hadoop.fs.lakefs.endpoint https://example-org.us-east-1.lakefscloud.io/api/v1 643 ``` 644 645 Alternatively, follow this [step by step Databricks integration tutorial, including lakeFS Hadoop File System, Python client and lakeFS SPARK client](https://lakefs.io/blog/databricks-lakefs-integration-tutorial/). 646 </div> 647 </div> 648 649 ⚠️ If your bucket is on a region other than us-east-1, you may also need to configure `fs.s3a.endpoint` with the correct region. 650 Amazon provides [S3 endpoints](https://docs.aws.amazon.com/general/latest/gr/s3.html) you can use. 651 {: .note } 652 653 ### Usage with TemporaryAWSCredentialsLakeFSTokenProvider 654 655 An initial setup is required - you must have [AWS Auth configured]({% link reference/security/external-principals-aws.md %}) with lakeFS. 656 The `TemporaryAWSCredentialsLakeFSTokenProvider` depends on the caller to provide AWS credentials (e.g Assumed Role Key,Secret and Token) as input to the lakeFS client. 657 658 ⚠️ Configure `sts.endpoint` with a valid [sts regional service endpoint](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html) and it must be be equal to the region that is used for authentication first place. The only exception is `us-east-1` which is the default region for STS. 659 {: .note } 660 661 ⚠️ Using the current provider the lakeFS token will not renew upon expiry and the user will need to re-authenticate. 662 {: .note } 663 664 PySpark example using `TemporaryAWSCredentialsLakeFSTokenProvider` with boto3 and AWS session credentials: 665 666 ```python 667 import boto3 668 669 session = boto3.session.Session() 670 671 # AWS credentials used s3a to access lakeFS bucket 672 sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE") 673 sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY") 674 sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "https://s3.us-west-2.amazonaws.com") 675 sc._jsc.hadoopConfiguration().set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem") 676 sc._jsc.hadoopConfiguration().set("fs.lakefs.endpoint", "https://example-org.us-west-2.lakefscloud.io/api/v1") 677 sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.s3a.path.style.access", "true") 678 sc._jsc.hadoopConfiguration().set("fs.lakefs.auth.provider", "io.lakefs.auth.TemporaryAWSCredentialsLakeFSTokenProvider") 679 # AWS tempporary session credentials to use with lakeFS 680 sc._jsc.hadoopConfiguration().set("fs.lakefs.token.aws.access.key", session.get_credentials().access_key) 681 sc._jsc.hadoopConfiguration().set("fs.lakefs.token.aws.secret.key", session.get_credentials().secret_key) 682 sc._jsc.hadoopConfiguration().set("fs.lakefs.token.aws.session.token", session.get_credentials().token) 683 sc._jsc.hadoopConfiguration().set("fs.lakefs.token.aws.sts.endpoint", "https://sts.us-west-2.amazonaws.com") 684 ``` 685 686 ### Usage 687 688 Hadoop FileSystem paths use the `lakefs://` protocol, with paths taking the form `lakefs://<repository>/<ref>/path/to/object`. 689 `<ref>` can be a branch, tag, or commit ID in lakeFS. 690 Here's an example for reading a Parquet file from lakeFS to a Spark DataFrame: 691 692 ```scala 693 val repo = "example-repo" 694 val branch = "main" 695 val df = spark.read.parquet(s"lakefs://${repo}/${branch}/example-path/example-file.parquet") 696 ``` 697 698 Here's how to write some results back to a lakeFS path: 699 700 ```scala 701 df.write.partitionBy("example-column").parquet(s"lakefs://${repo}/${branch}/output-path/") 702 ``` 703 704 The data is now created in lakeFS as new changes in your branch. You can now commit these changes or revert them. 705 706 ## Hadoop FileSystem in Presigned mode 707 708 _Available starting version 0.1.13 of the FileSystem_ 709 710 In this mode, the lakeFS server is responsible for authenticating with your storage. 711 The client will still perform data operations directly on the storage. 712 To do so, it will use pre-signed storage URLs provided by the lakeFS server. 713 714 When using this mode, you don't need to configure the client with access to your storage: 715 716 <div class="tabs"> 717 <ul> 718 <li><a href="#config-cli">CLI</a></li> 719 <li><a href="#config-scala">Scala</a></li> 720 <li><a href="#config-pyspark">PySpark</a></li> 721 <li><a href="#config-xml">XML Configuration</a></li> 722 <li><a href="#config-databricks">Databricks</a></li> 723 </ul> 724 <div markdown="1" id="config-cli"> 725 ```shell 726 spark-shell --conf spark.hadoop.fs.lakefs.access.mode=presigned \ 727 --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \ 728 --conf spark.hadoop.fs.lakefs.access.key=AKIAlakefs12345EXAMPLE \ 729 --conf spark.hadoop.fs.lakefs.secret.key=abc/lakefs/1234567bPxRfiCYEXAMPLEKEY \ 730 --conf spark.hadoop.fs.lakefs.endpoint=https://example-org.us-east-1.lakefscloud.io/api/v1 \ 731 --packages io.lakefs:hadoop-lakefs-assembly:0.2.4 732 ``` 733 </div> 734 <div markdown="1" id="config-scala"> 735 736 ```scala 737 spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.mode", "presigned") 738 spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem") 739 spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE") 740 spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY") 741 spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1") 742 ``` 743 </div> 744 <div markdown="1" id="config-pyspark"> 745 746 ```python 747 sc._jsc.hadoopConfiguration().set("fs.lakefs.access.mode", "presigned") 748 sc._jsc.hadoopConfiguration().set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem") 749 sc._jsc.hadoopConfiguration().set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE") 750 sc._jsc.hadoopConfiguration().set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY") 751 sc._jsc.hadoopConfiguration().set("fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1") 752 ``` 753 </div> 754 <div markdown="1" id="config-xml"> 755 756 Make sure that you load the lakeFS FileSystem into Spark by running it with `--packages` or `--jars`, 757 and then add these into a configuration file, e.g., `$SPARK_HOME/conf/hdfs-site.xml`: 758 759 ```xml 760 <?xml version="1.0"?> 761 <configuration> 762 <property> 763 <name>fs.lakefs.access.mode</name> 764 <value>presigned</value> 765 </property> 766 <property> 767 <name>fs.lakefs.impl</name> 768 <value>io.lakefs.LakeFSFileSystem</value> 769 </property> 770 <property> 771 <name>fs.lakefs.access.key</name> 772 <value>AKIAlakefs12345EXAMPLE</value> 773 </property> 774 <property> 775 <name>fs.lakefs.secret.key</name> 776 <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value> 777 </property> 778 <property> 779 <name>fs.lakefs.endpoint</name> 780 <value>https://example-org.us-east-1.lakefscloud.io/api/v1</value> 781 </property> 782 </configuration> 783 ``` 784 </div> 785 <div markdown="1" id="config-databricks"> 786 787 Add the following the cluster's configuration under `Configuration ➡️ Advanced options`: 788 789 ``` 790 spark.hadoop.fs.lakefs.access.mode presigned 791 spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem 792 spark.hadoop.fs.lakefs.access.key AKIAlakefs12345EXAMPLE 793 spark.hadoop.fs.lakefs.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY 794 spark.hadoop.fs.lakefs.endpoint https://example-org.us-east-1.lakefscloud.io/api/v1 795 ``` 796 </div> 797 </div>