> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Estimate Pi Using Spark

This example demonstrates integration of Spark with Pachyderm by launching
a Spark job on an existing cluster from within a Pachyderm job. The job uses
configuration info that is versioned within Pachyderm, and stores its reduced
result back into a Pachyderm output repo, maintaining full provenance and
version history within Pachyderm while taking advantage of Spark for
computation.

The example assumes that you have:

- A Pachyderm cluster running - see [this guide](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/) to get up and running with a local Pachyderm cluster in just a few minutes.
- Kubernetes access to the cluster Pachyderm is installed in.
- The `pachctl` CLI tool installed and connected to your Pachyderm cluster - see [the relevant deploy docs](https://docs.pachyderm.com/1.13.x/deploy-manage/deploy/) for instructions.
- The `kubectl` CLI tool installed (you will likely have installed this while [setting up your local Pachyderm cluster](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/)).

Note: if deploying on Minikube, you'll need to increase the default memory
allocation to accommodate the deployment of a Spark cluster. When running
`minikube start`, append `--memory 4096`, as shown below.
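For example, assuming a fresh local cluster (exact flag syntax may vary slightly across Minikube versions):

```
# allocate 4 GB to the Minikube VM so Pachyderm and Spark can run side by side
minikube start --memory 4096
```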
## Set up Spark Cluster

The simplest way to run this example is by deploying a Spark cluster into the
same Kubernetes cluster on which Pachyderm is running. We'll do so with Helm.
(Note: if you already have an external Spark cluster running, you can skip this
section. Be sure to read [the note about connecting to an existing Spark
cluster](#connecting-to-an-existing-spark-cluster).)

### Install Helm

If you don't already have the Helm client installed, you can do so by following
[the instructions
here](https://docs.helm.sh/using_helm/#installing-the-helm-client) (or, for the
bold, by running `curl
https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash`.)

### Set up Helm/Tiller

In order to use Helm with your Kubernetes cluster, you'll need to install
Tiller:

```
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --upgrade
```

Tiller will take about a minute to initialize and enter `Running` status. You
can check its status by running `kubectl get pod -n kube-system -l
name=tiller`.

### Install Spark

Finally, once Tiller is `Running`, use Helm to install Spark:

```
helm install --name spark stable/spark
```

This will again take several minutes to pull the relevant Docker images and
start running. You can check the status with `kubectl get pod -l
release=spark`.

## Deploy Pachyderm Pipeline

Once your Spark cluster is running, you're ready to deploy the Pachyderm
pipeline:

```
# create a repo to hold configuration data that acts as input to the pipeline
pachctl create repo estimate_pi_config

# create the actual processing pipeline
pachctl create pipeline -f estimate_pi_pipeline.json

# kick off a job with 1000 samples
echo 1000 | pachctl put file estimate_pi_config@master:num_samples

# check job status
pachctl list job --pipeline estimate_pi

# once the job has completed, retrieve the results
pachctl get file estimate_pi@master:pi_estimate
```

## Connecting to an existing Spark cluster

By default, this example makes use of Kubernetes' service discovery to connect
your Pachyderm pipeline code to your Spark cluster. If you wish to connect to
a different Spark cluster, you can do so by adding the `--master` flag to the
list of arguments provided to `cmd` in the pipeline spec: append `"--master"`
and `"spark://$MYSPARK_MASTER_SERVICE_HOST:$MYSPARK_MASTER_SERVICE_PORT"` to
the `cmd` array.

To test a manually-specified connection, deploy a Spark cluster under
a different name in Kubernetes:

```
helm install --name my-custom-spark stable/spark
```
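For illustration, here is a hypothetical `transform` fragment with those two
arguments appended. The `sh /estimate-pi.sh` entries are placeholders, not this
example's actual command (edit the real `cmd` in `estimate_pi_pipeline.json`
instead), and the variable names assume Kubernetes' standard service-discovery
convention applied to a master service named `my-custom-spark-master` (service
name uppercased, dashes replaced with underscores):

```
"transform": {
  "cmd": [
    "sh",
    "/estimate-pi.sh",
    "--master",
    "spark://$MY_CUSTOM_SPARK_MASTER_SERVICE_HOST:$MY_CUSTOM_SPARK_MASTER_SERVICE_PORT"
  ]
}
```

After editing the spec, re-run `pachctl create pipeline -f
estimate_pi_pipeline.json` (or `pachctl update pipeline -f
estimate_pi_pipeline.json` if the pipeline already exists) so the new arguments
take effect.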