
# Test Spark App

In order to test lakeFS with Spark, we use docker compose to run the REPO/TAG lakeFS image together with a Spark master (`spark`) and two workers (`spark-worker-1` and `spark-worker-2`).
Spark accesses lakeFS through the s3a endpoint, which we configure when running `spark-submit`; the endpoint will try to resolve the names `s3.docker.lakefs.io` and `example.s3.docker.lakefs.io`.
Running the services inside the default network with a specific subnet lets us assign a fixed IP address to lakeFS and add additional hosts that resolve to the same address - 10.5.0.55 in our case.
Every new bucket we would like to access through our gateway requires an additional host entry mapped to lakeFS (10.0.1.50).
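
As a rough illustration of the kind of `spark-submit` configuration this implies (not the exact command the tests use), the s3a endpoint and credentials can be passed as Hadoop options; the master URL, port, credential values, jar path, and object path below are placeholders:

```shell
# Illustrative only: the endpoint host comes from the compose network setup above;
# everything in angle brackets is a placeholder, not a value from this repo.
spark-submit \
  --master spark://spark:7077 \
  --conf spark.hadoop.fs.s3a.endpoint=s3.docker.lakefs.io:8000 \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --conf spark.hadoop.fs.s3a.access.key=<lakefs-access-key-id> \
  --conf spark.hadoop.fs.s3a.secret.key=<lakefs-secret-access-key> \
  <path-to-app-jar> s3a://example/main/<input-path>
```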

### Running it

After the docker compose environment is up and running, we need to verify that the Spark application, found under the `app/` folder, is compiled and packaged:

```shell
sbt package
```

The docker compose yaml holds a service entry called `spark-submit`, which uses the same settings as a Spark worker but belongs to the `command` profile, so it does not start unless that profile is specified.
Using this entry and the volume mount of our workspace, we submit the Spark app.
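
Assuming the service's entrypoint accepts a `spark-submit` command line (the service and profile names come from the compose file; the jar path is a placeholder for whatever the workspace mount provides), a submission could look roughly like this:

```shell
# Illustrative sketch: `spark-submit` here is the compose service name;
# the inner spark-submit command and jar path are placeholders.
docker compose --profile command run --rm spark-submit \
  spark-submit --master spark://spark:7077 <path-to-packaged-app-jar>
```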

### Exporter test

- `setup-exporter-test.sh` is responsible for setting up the repository for the Exporter test.
- `run-exporter-test.sh` is responsible for preparing the test environment and running the respective `spark-submit` test job, as sketched below.
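
A typical run executes the two scripts in order; this sketch assumes they are invoked from this directory and read their configuration from the environment (check the scripts for the exact variables and arguments they expect):

```shell
# Run the Exporter test end to end.
./setup-exporter-test.sh   # create and populate the repository for the Exporter test
./run-exporter-test.sh     # prepare the environment and submit the exporter test job
```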

The app itself is the word count example, which stores its output in CSV format. The test then loads the CSV data and validates specific values to verify that the right data was loaded.