---
title: Spark Client
description: The lakeFS Spark client performs operations on lakeFS committed metadata stored in the object store.
parent: Reference
---

# lakeFS Spark Metadata Client

Utilize the power of Spark to interact with the metadata on lakeFS. Possible use cases include:

* Creating a DataFrame for listing the objects in a specific commit or branch.
* Computing changes between two commits.
* Exporting your data for consumption outside lakeFS.
* Bulk operations on the underlying storage.

## Getting Started

Please note that Spark 2 is no longer supported with the lakeFS metadata client.
{: .note }

The Spark metadata client is compiled for Spark 3.1.2 with Hadoop 3.2.1, but
should also work with other Spark versions and newer Hadoop versions.

<div class="tabs">
<ul>
  <li><a href="#spark-shell">PySpark, spark-shell, spark-submit, spark-sql</a></li>
  <li><a href="#databricks">Databricks</a></li>
</ul>
<div markdown="1" id="spark-shell">
Start Spark Shell / PySpark with the `--packages` flag, for instance:

```bash
spark-shell --packages io.lakefs:lakefs-spark-client_2.12:0.13.0
```

Alternatively, use the assembled jar (an "Überjar") on S3, from
`s3://treeverse-clients-us-east/lakefs-spark-client/0.13.0/lakefs-spark-client-assembly-0.13.0.jar`,
by passing its path to `--jars`.
The assembled jar is larger but shades several common libraries. Use it if Spark
complains about bad classes or missing methods.
</div>
<div markdown="1" id="databricks">
Include the assembled jar (an "Überjar") from S3, at
`s3://treeverse-clients-us-east/lakefs-spark-client/0.13.0/lakefs-spark-client-assembly-0.13.0.jar`.
</div>
</div>

## Configuration

1. To read metadata from lakeFS, the client should be configured with your lakeFS endpoint and credentials, using the following Hadoop configurations:

   | Configuration                        | Description                                                    |
   |--------------------------------------|----------------------------------------------------------------|
   | `spark.hadoop.lakefs.api.url`        | lakeFS API endpoint, e.g., `http://lakefs.example.com/api/v1`  |
   | `spark.hadoop.lakefs.api.access_key` | The access key to use for fetching metadata from lakeFS        |
   | `spark.hadoop.lakefs.api.secret_key` | Corresponding lakeFS secret key                                |

1. The client will also interact directly with your storage using the Hadoop FileSystem.
   Therefore, your Spark session must be able to access the underlying storage of your lakeFS repository.
   There are [various ways to do this](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3){:target="_blank"},
   but for a non-production environment you can use the following Hadoop configurations (a combined configuration example follows this list):

   | Configuration                    | Description                                               |
   |----------------------------------|-----------------------------------------------------------|
   | `spark.hadoop.fs.s3a.access.key` | Access key to use for accessing underlying storage on S3  |
   | `spark.hadoop.fs.s3a.secret.key` | Corresponding secret key to use with S3 access key        |
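Putting the two sets of configurations together, here is a minimal sketch of a Spark session for a `spark-submit` application; the application name is arbitrary, and the endpoint and all credential values are placeholders you must replace:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: configure lakeFS API access and (non-production) S3A
// credentials on the Spark session. All values below are placeholders.
val spark = SparkSession.builder()
  .appName("lakefs-metadata-client-example")
  .config("spark.hadoop.lakefs.api.url", "http://lakefs.example.com/api/v1")
  .config("spark.hadoop.lakefs.api.access_key", "<lakeFS access key>")
  .config("spark.hadoop.lakefs.api.secret_key", "<lakeFS secret key>")
  .config("spark.hadoop.fs.s3a.access.key", "<S3 access key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<S3 secret key>")
  .getOrCreate()
```

When using spark-shell or PySpark, the same keys can instead be passed as `--conf` flags on the command line.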
### Assuming role on S3 (Hadoop 3 only)

The client includes support for assuming a separate role on S3 when
running on Hadoop 3. It uses the same configuration used by
`S3AFileSystem` to assume the role on S3A. The Apache Hadoop AWS
documentation has details under "[Working with IAM Assumed
Roles][s3a-assumed-role]". You will need to use the following Hadoop
configurations:

| Configuration                     | Description                                                           |
|-----------------------------------|-----------------------------------------------------------------------|
| `fs.s3a.aws.credentials.provider` | Set to `org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider`  |
| `fs.s3a.assumed.role.arn`         | Set to the ARN of the role to assume                                  |

[s3a-assumed-role]: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/assumed_roles.html#Configuring_Assumed_Roles

## Examples

1. Get a DataFrame for listing all objects in a commit:

   ```scala
   import io.treeverse.clients.LakeFSContext

   val commitID = "a1b2c3d4"
   val df = LakeFSContext.newDF(spark, "example-repo", commitID)
   df.show
   /* output example:
   +------------+--------------------+--------------------+-------------------+----+
   |         key|             address|                etag|      last_modified|size|
   +------------+--------------------+--------------------+-------------------+----+
   |      file_1|791457df80a0465a8...|7b90878a7c9be5a27...|2021-03-05 11:23:30|  36|
   |      file_2|e15be8f6e2a74c329...|95bee987e9504e2c3...|2021-03-05 11:45:25|  36|
   |      file_3|f6089c25029240578...|32e2f296cb3867d57...|2021-03-07 13:43:19|  36|
   |      file_4|bef38ef97883445c8...|e920efe2bc220ffbb...|2021-03-07 13:43:11|  13|
   +------------+--------------------+--------------------+-------------------+----+
   */
   ```

1. Run SQL queries on your metadata:

   ```scala
   df.createOrReplaceTempView("files")
   spark.sql("SELECT DATE(last_modified) AS dt, COUNT(*) FROM files GROUP BY 1 ORDER BY 1").show
   /* output example:
   +----------+--------+
   |        dt|count(1)|
   +----------+--------+
   |2021-03-05|       2|
   |2021-03-07|       2|
   +----------+--------+
   */
   ```
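1. As a further illustration (this uses only standard Spark functions, not additional lakeFS client APIs), summarize the listing DataFrame by a hypothetical top-level key prefix, which is most useful when object keys contain directory-like paths:

   ```scala
   import org.apache.spark.sql.functions.{col, count, split, sum}

   // Derive a "prefix" column from the first path segment of each key, then
   // aggregate object counts and total bytes per prefix.
   df.withColumn("prefix", split(col("key"), "/").getItem(0))
     .groupBy("prefix")
     .agg(count(col("key")).as("objects"), sum(col("size")).as("total_bytes"))
     .orderBy("prefix")
     .show()
   ```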