>![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
# Estimate Pi Using Spark

This example demonstrates integration of Spark with Pachyderm by launching
a Spark job on an existing cluster from within a Pachyderm Job. The job uses
configuration info that is versioned within Pachyderm, and stores its reduced
result back into a Pachyderm output repo, maintaining full provenance and
version history within Pachyderm, while taking advantage of Spark for
computation.

The example assumes that you have:

- A Pachyderm cluster running - see [this guide](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/) to get up and running with a local Pachyderm cluster in just a few minutes.
- Kubernetes access to the cluster Pachyderm is installed in.
- The `pachctl` CLI tool installed and connected to your Pachyderm cluster - see [the relevant deploy docs](https://docs.pachyderm.com/1.13.x/deploy-manage/deploy/) for instructions.
- The `kubectl` CLI tool installed (you will likely have installed this while [setting up your local Pachyderm cluster](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/)).

Note: if deploying on Minikube, you'll need to increase the default memory
allocation to accommodate the deployment of a Spark cluster. When running
`minikube start`, append `--memory 4096`.
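
That is, the full invocation looks like:

```
minikube start --memory 4096
```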

## Set up Spark Cluster

The simplest way to run this example is by deploying a Spark cluster into the
same Kubernetes cluster on which Pachyderm is running. We'll do so with Helm.
(Note: if you already have an external Spark cluster running, you can skip this
section. Be sure to read [the note about connecting to an existing Spark
cluster](#connecting-to-an-existing-spark-cluster).)

### Install Helm

If you don't already have the Helm client installed, you can do so by following
[the instructions
here](https://docs.helm.sh/using_helm/#installing-the-helm-client) (or, for the
bold, by running `curl
https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash`).
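
Either way, you can verify the client is in place before wiring it up to your
cluster:

```
helm version --client
```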

### Set up Helm/Tiller

In order to use Helm with your Kubernetes cluster, you'll need to install
Tiller:

```
# create a service account for Tiller in the kube-system namespace
kubectl create serviceaccount --namespace kube-system tiller
# grant it cluster-admin so Tiller can manage resources
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
# install (or upgrade) Tiller using that service account
helm init --service-account tiller --upgrade
```

Tiller will take about a minute to initialize and enter `Running` status. You
can check its status by running `kubectl get pod -n kube-system -l
name=tiller`.
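
If you'd rather block until it's ready than poll, you can wait on the Tiller
deployment (named `tiller-deploy` by default when installed via `helm init`):

```
kubectl -n kube-system rollout status deployment tiller-deploy
```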

### Install Spark

Finally, once Tiller is `Running`, use Helm to install Spark:

```
helm install --name spark stable/spark
```

This will again take several minutes to pull the relevant Docker images and
start running. You can check the status with `kubectl get pod -l
release=spark`.
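
Once the pods are up, Helm can also summarize the release and the resources it
created:

```
helm status spark
```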

## Deploy Pachyderm Pipeline

Once your Spark cluster is running, you're ready to deploy the Pachyderm
pipeline:

```
# create a repo to hold configuration data that acts as input to the pipeline
pachctl create repo estimate_pi_config

# create the actual processing pipeline
pachctl create pipeline -f estimate_pi_pipeline.json

# kick off a job with 1000 samples
echo 1000 | pachctl put file estimate_pi_config@master:num_samples

# check job status
pachctl list job --pipeline estimate_pi

# once the job has completed, retrieve the results
pachctl get file estimate_pi@master:pi_estimate
```
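
For orientation, `estimate_pi_pipeline.json` follows the standard 1.13-era
pipeline spec shape. The sketch below is illustrative only: the pipeline name
and input repo match the commands above, but the `image` and `cmd` values are
placeholders, so take the real transform from the file in this directory:

```
{
  "pipeline": {
    "name": "estimate_pi"
  },
  "transform": {
    "image": "<spark-client-image>",
    "cmd": ["<spark-submit wrapper and arguments>"]
  },
  "input": {
    "pfs": {
      "repo": "estimate_pi_config",
      "glob": "/"
    }
  }
}
```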

## Connecting to an existing Spark cluster

By default, this example makes use of Kubernetes' service discovery to connect
your Pachyderm pipeline code to your Spark cluster. If you wish to connect to
a different Spark cluster, you can do so by adding the `--master` flag to the
list of arguments provided to `cmd` in the pipeline spec: append `"--master"`
and `"spark://$MYSPARK_MASTER_SERVICE_HOST:$MYSPARK_MASTER_SERVICE_PORT"` to
the `cmd` array.
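
For example, the tail of the `cmd` array would gain two entries; the leading
entry below stands in for whatever the spec already passes, and the exact
environment variable names depend on your Spark master's Kubernetes service
name:

```
"cmd": [
  "<existing command and arguments>",
  "--master",
  "spark://$MYSPARK_MASTER_SERVICE_HOST:$MYSPARK_MASTER_SERVICE_PORT"
]
```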

To test a manually-specified connection, deploy a Spark cluster under
a different release name in Kubernetes:

```
helm install --name my-custom-spark stable/spark
```
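
To find the master service's address for the new release (and thus the
environment variable names to use in the `--master` URL), you can list the
services Helm created, assuming the chart labels its services with the release
name the same way it labels its pods above:

```
kubectl get svc -l release=my-custom-spark
```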