---
title: Spark Client
description: The lakeFS Spark client performs operations on lakeFS committed metadata stored in the object store.
parent: Reference
---

# lakeFS Spark Metadata Client

Utilize the power of Spark to interact with the metadata on lakeFS. Possible use cases include:

* Creating a DataFrame for listing the objects in a specific commit or branch.
* Computing changes between two commits.
* Exporting your data for consumption outside lakeFS.
* Bulk operations on the underlying storage.

## Getting Started

Please note that Spark 2 is no longer supported with the lakeFS metadata client.
{: .note }

The Spark metadata client is compiled for Spark 3.1.2 with Hadoop 3.2.1, but
should also work with other Spark versions and newer Hadoop versions.

<div class="tabs">
<ul>
  <li><a href="#spark-shell">PySpark, spark-shell, spark-submit, spark-sql</a></li>
  <li><a href="#databricks">Databricks</a></li>
</ul>
<div markdown="1" id="spark-shell">
Start Spark Shell / PySpark with the `--packages` flag, for instance:

```bash
spark-shell --packages io.lakefs:lakefs-spark-client_2.12:0.13.0
```

Alternatively, use the assembled jar (an "Überjar") on S3, from
`s3://treeverse-clients-us-east/lakefs-spark-client/0.13.0/lakefs-spark-client-assembly-0.13.0.jar`,
by passing its path to `--jars`.
The assembled jar is larger but shades several common libraries. Use it if Spark
complains about bad classes or missing methods.
</div>
<div markdown="1" id="databricks">
Include the assembled jar (an "Überjar") from S3, at
`s3://treeverse-clients-us-east/lakefs-spark-client/0.13.0/lakefs-spark-client-assembly-0.13.0.jar`.
</div>
</div>

## Configuration

1. To read metadata from lakeFS, the client should be configured with your lakeFS endpoint and credentials, using the following Hadoop configurations:

   | Configuration                        | Description                                                    |
   |--------------------------------------|----------------------------------------------------------------|
   | `spark.hadoop.lakefs.api.url`        | lakeFS API endpoint, e.g., `http://lakefs.example.com/api/v1`  |
   | `spark.hadoop.lakefs.api.access_key` | The access key to use for fetching metadata from lakeFS        |
   | `spark.hadoop.lakefs.api.secret_key` | Corresponding lakeFS secret key                                |

1. The client will also interact directly with your storage using the Hadoop FileSystem.
   Therefore, your Spark session must be able to access the underlying storage of your lakeFS repository.
   There are [various ways to do this](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3){:target="_blank"},
   but for a non-production environment you can use the following Hadoop configurations (a combined configuration example follows this list):

   | Configuration                    | Description                                               |
   |----------------------------------|-----------------------------------------------------------|
   | `spark.hadoop.fs.s3a.access.key` | Access key to use for accessing underlying storage on S3  |
   | `spark.hadoop.fs.s3a.secret.key` | Corresponding secret key to use with S3 access key        |
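Putting the two sets of configurations together, here is a minimal sketch of a Spark session for a `spark-submit` application; the application name is arbitrary, and the endpoint and all credential values are placeholders you must replace:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: configure lakeFS API access and (non-production) S3A
// credentials on the Spark session. All values below are placeholders.
val spark = SparkSession.builder()
  .appName("lakefs-metadata-client-example")
  .config("spark.hadoop.lakefs.api.url", "http://lakefs.example.com/api/v1")
  .config("spark.hadoop.lakefs.api.access_key", "<lakeFS access key>")
  .config("spark.hadoop.lakefs.api.secret_key", "<lakeFS secret key>")
  .config("spark.hadoop.fs.s3a.access.key", "<S3 access key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<S3 secret key>")
  .getOrCreate()
```

When using spark-shell or PySpark, the same keys can instead be passed as `--conf` flags on the command line.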
### Assuming role on S3 (Hadoop 3 only)

The client includes support for assuming a separate role on S3 when
running on Hadoop 3. It uses the same configuration used by
`S3AFileSystem` to assume the role on S3A. The Apache Hadoop AWS
documentation has details under "[Working with IAM Assumed
Roles][s3a-assumed-role]". You will need to use the following Hadoop
configurations:

| Configuration                     | Description                                                           |
|-----------------------------------|-----------------------------------------------------------------------|
| `fs.s3a.aws.credentials.provider` | Set to `org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider`  |
| `fs.s3a.assumed.role.arn`         | Set to the ARN of the role to assume                                  |

[s3a-assumed-role]: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/assumed_roles.html#Configuring_Assumed_Roles

## Examples

1. Get a DataFrame for listing all objects in a commit:

   ```scala
   import io.treeverse.clients.LakeFSContext

   val commitID = "a1b2c3d4"
   val df = LakeFSContext.newDF(spark, "example-repo", commitID)
   df.show
   /* output example:
   +------------+--------------------+--------------------+-------------------+----+
   |         key|             address|                etag|      last_modified|size|
   +------------+--------------------+--------------------+-------------------+----+
   |      file_1|791457df80a0465a8...|7b90878a7c9be5a27...|2021-03-05 11:23:30|  36|
   |      file_2|e15be8f6e2a74c329...|95bee987e9504e2c3...|2021-03-05 11:45:25|  36|
   |      file_3|f6089c25029240578...|32e2f296cb3867d57...|2021-03-07 13:43:19|  36|
   |      file_4|bef38ef97883445c8...|e920efe2bc220ffbb...|2021-03-07 13:43:11|  13|
   +------------+--------------------+--------------------+-------------------+----+
   */
   ```

1. Run SQL queries on your metadata:

   ```scala
   df.createOrReplaceTempView("files")
   spark.sql("SELECT DATE(last_modified) AS dt, COUNT(*) FROM files GROUP BY 1 ORDER BY 1").show
   /* output example:
   +----------+--------+
   |        dt|count(1)|
   +----------+--------+
   |2021-03-05|       2|
   |2021-03-07|       2|
   +----------+--------+
   */
   ```
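1. As a further illustration (this uses only standard Spark functions, not additional lakeFS client APIs), summarize the listing DataFrame by a hypothetical top-level key prefix, which is most useful when object keys contain directory-like paths:

   ```scala
   import org.apache.spark.sql.functions.{col, count, split, sum}

   // Derive a "prefix" column from the first path segment of each key, then
   // aggregate object counts and total bytes per prefix.
   df.withColumn("prefix", split(col("key"), "/").getItem(0))
     .groupBy("prefix")
     .agg(count(col("key")).as("objects"), sum(col("size")).as("total_bytes"))
     .orderBy("prefix")
     .show()
   ```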