# lakeFS Spark Metadata Client

Read metadata from lakeFS into Spark.

## Features

1. Read Graveler meta-ranges, ranges and entries.
1. Export data from lakeFS to any object storage (see [docs](https://docs.lakefs.io/reference/export.html)).

_Please note that starting with version 0.9.0, Spark 2 is not supported with the lakeFS metadata client._

## Installation

### Uber-jar

The uber-jar can be found at a public S3 location:

http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/${CLIENT_VERSION}/lakefs-spark-client-assembly-${CLIENT_VERSION}.jar

### Maven

```
io.lakefs:lakefs-spark-client_2.12:${CLIENT_VERSION}
```

## Usage Examples

### Export using spark-submit

Set `CLIENT_VERSION` below to the latest version available. See [available versions](https://mvnrepository.com/artifact/io.lakefs/lakefs-spark-client_2.12).

```
CLIENT_VERSION=0.11.0
spark-submit --conf spark.hadoop.lakefs.api.url=https://lakefs.example.com/api/v1 \
    --conf spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
    --conf spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
    --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
    --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
    --packages org.apache.hadoop:hadoop-aws:2.7.7,io.lakefs:lakefs-spark-client_2.12:${CLIENT_VERSION} \
    --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/exported-data/ \
    --branch=main
```

### Export using spark-submit (uber-jar)

Set `CLIENT_VERSION` below to the latest version available. See [available versions](https://mvnrepository.com/artifact/io.lakefs/lakefs-spark-client_2.12).

```
CLIENT_VERSION=0.11.0
spark-submit --conf spark.hadoop.lakefs.api.url=https://lakefs.example.com/api/v1 \
    --conf spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
    --conf spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
    --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
    --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
    --packages org.apache.hadoop:hadoop-aws:2.7.7 \
    --jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client/${CLIENT_VERSION}/lakefs-spark-client-assembly-${CLIENT_VERSION}.jar \
    --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/exported-data/ \
    --branch=main
```

## Publishing a new version

Follow the [Spark client release checklist](https://github.com/treeverse/dev/blob/main/pages/lakefs-clients-release.md#spark-metadata-client).

## Debugging

To debug the Exporter or the Garbage Collector from your IDE, you can use a remote JVM debugger. Follow [these instructions](https://sparkbyexamples.com/spark/how-to-debug-spark-application-locally-or-remote/) to connect one.
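
As a minimal sketch (not part of the linked instructions), one way to do this is to have the driver JVM listen for a debugger by passing the standard JDWP agent options through Spark's `spark.driver.extraJavaOptions`. The lakeFS-specific flags below simply mirror the export example above, and the port (5005) is an arbitrary choice:

```
# Sketch: run the export job with the driver JVM waiting for a remote debugger.
# suspend=y pauses the driver until an IDE attaches on port 5005 (arbitrary).
CLIENT_VERSION=0.11.0
spark-submit --conf spark.hadoop.lakefs.api.url=https://lakefs.example.com/api/v1 \
    --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
    --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
    --conf spark.driver.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
    --packages org.apache.hadoop:hadoop-aws:2.7.7,io.lakefs:lakefs-spark-client_2.12:${CLIENT_VERSION} \
    --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/exported-data/ \
    --branch=main
```

With `suspend=y`, the job does not start until a "Remote JVM Debug" run configuration in your IDE attaches to the driver host on port 5005.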