# lakeFS Spark Metadata Client

Read metadata from lakeFS into Spark.

## Features

1. Read Graveler meta-ranges, ranges and entries (see the sketch below).
1. Export data from lakeFS to any object storage (see [docs](https://docs.lakefs.io/reference/export.html)).

_Please note that starting with version 0.9.0, Spark 2 is no longer supported by the lakeFS metadata client._

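As a rough illustration of reading entries, the sketch below loads the entries of one commit into a DataFrame from `spark-shell`. It assumes the `io.treeverse.clients.LakeFSContext.newDF` entry point described in the lakeFS Spark client documentation; the exact method name and signature may vary between client versions, and `example-repo` / `<commit-id>` are placeholders.

```
// Sketch only: run from spark-shell launched with the client on the classpath
// (see Installation below). Assumes LakeFSContext.newDF(spark, repo, commitID)
// as documented for the lakeFS Spark metadata client; check the docs of the
// client version you actually use. The lakeFS endpoint and credentials come
// from the same spark.hadoop.lakefs.api.* settings shown in the spark-submit
// examples further down.
import io.treeverse.clients.LakeFSContext

val df = LakeFSContext.newDF(spark, "example-repo", "<commit-id>")

df.printSchema()   // entry fields such as key, address, size, ...
df.show(10, false) // peek at the first entries of the commit
```
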
## Installation

### Uber-jar

The uber-jar can be found at a public S3 location:
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/${CLIENT_VERSION}/lakefs-spark-client-assembly-${CLIENT_VERSION}.jar

### Maven

```
io.lakefs:lakefs-spark-client_2.12:${CLIENT_VERSION}
```
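
If you build with sbt rather than consuming the raw Maven coordinates, a minimal sketch of the equivalent dependency line (assuming `scalaVersion` is set to a 2.12.x release, so that `%%` resolves to the `_2.12` artifact) would be:

```
// build.sbt sketch: same coordinates as above, resolved from Maven Central.
// "%%" appends the Scala binary suffix, yielding lakefs-spark-client_2.12.
libraryDependencies += "io.lakefs" %% "lakefs-spark-client" % "0.11.0" // use the latest available version
```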

## Usage Examples

### Export using spark-submit

Set `CLIENT_VERSION` below to the latest version available. See [available versions](https://mvnrepository.com/artifact/io.lakefs/lakefs-spark-client_2.12).

```
CLIENT_VERSION=0.11.0
spark-submit --conf spark.hadoop.lakefs.api.url=https://lakefs.example.com/api/v1 \
    --conf spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
    --conf spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
    --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
    --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
    --packages org.apache.hadoop:hadoop-aws:2.7.7,io.lakefs:lakefs-spark-client_2.12:${CLIENT_VERSION} \
    --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/exported-data/ \
    --branch=main
```

### Export using spark-submit (uber-jar)

Set `CLIENT_VERSION` below to the latest version available. See [available versions](https://mvnrepository.com/artifact/io.lakefs/lakefs-spark-client_2.12).

```
CLIENT_VERSION=0.11.0
spark-submit --conf spark.hadoop.lakefs.api.url=https://lakefs.example.com/api/v1 \
    --conf spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
    --conf spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
    --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
    --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
    --packages org.apache.hadoop:hadoop-aws:2.7.7 \
    --jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/${CLIENT_VERSION}/lakefs-spark-client-assembly-${CLIENT_VERSION}.jar \
    --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/exported-data/ \
    --branch=main
```

## Publishing a new version

Follow the [Spark client release checklist](https://github.com/treeverse/dev/blob/main/pages/lakefs-clients-release.md#spark-metadata-client).

## Debugging

To debug the Exporter or the Garbage Collector from your IDE, attach a remote JVM debugger. You can follow [these instructions](https://sparkbyexamples.com/spark/how-to-debug-spark-application-locally-or-remote/) to connect one.
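
If the debugger fails to attach, one quick sanity check (a sketch, not part of the client) is to print the driver JVM's arguments and confirm that the `-agentlib:jdwp=...` option you passed, for example via `spark.driver.extraJavaOptions`, actually reached the process:

```
// Sketch: run on the driver (e.g. paste into spark-shell) to verify that the
// JDWP agent flags were picked up by the JVM you intend to debug.
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._ // Scala 2.12; use scala.jdk.CollectionConverters on 2.13+

val jvmArgs = ManagementFactory.getRuntimeMXBean.getInputArguments.asScala
jvmArgs.filter(_.contains("jdwp")).foreach(println)
// Expect something like (port is an arbitrary example):
//   -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
```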