# Nomad / Spark integration

The Nomad ecosystem includes a fork of Apache Spark that natively supports using
a Nomad cluster to run Spark applications. When running on Nomad, the Spark
executors that run Spark tasks for your application, and optionally the
application driver itself, run as Nomad tasks in a Nomad job. See the
[usage guide](./RunningSparkOnNomad.pdf) for more details.

Clusters provisioned with Nomad's Terraform templates are automatically
configured to run the Spark integration. The sample job files found here are
also provisioned onto every client and server.

## Setup

To give the Spark integration a test drive, provision a cluster and SSH to any
one of the clients or servers (the public IPs are displayed when the Terraform
provisioning process completes):

```bash
$ ssh -i /path/to/key ubuntu@PUBLIC_IP
```

The Spark history server and several of the sample Spark jobs below require
HDFS. Using the included job file, deploy an HDFS cluster on Nomad:

```bash
$ cd $HOME/examples/spark
$ nomad run hdfs.nomad
$ nomad status hdfs
```

When all of the allocations are in the `running` state (as shown by `nomad
status hdfs`), query Consul to verify that the HDFS service has been registered:

```bash
$ dig hdfs.service.consul
```

Next, create directories and files in HDFS for use by the history server and
the sample Spark jobs:

```bash
$ hdfs dfs -mkdir /foo
$ hdfs dfs -put /var/log/apt/history.log /foo
$ hdfs dfs -mkdir /spark-events
$ hdfs dfs -ls /
```

Finally, deploy the Spark history server:

```bash
$ nomad run spark-history-server-hdfs.nomad
```

You can get the private IP for the history server with a Consul DNS lookup:

```bash
$ dig spark-history.service.consul
```

Cross-reference the private IP with the `terraform apply` output to get the
corresponding public IP. You can access the history server at
`http://PUBLIC_IP:18080`.

## Sample Spark jobs

The sample `spark-submit` commands listed below demonstrate several of the
official Spark examples. Features like `spark-sql`, `spark-shell`, and
`pyspark` are included. The commands can be executed from any client or server.

You can monitor the status of a Spark job in a second terminal session with:

```bash
$ nomad status
$ nomad status JOB_ID
$ nomad alloc-status DRIVER_ALLOC_ID
$ nomad logs DRIVER_ALLOC_ID
```

To view the output of the job, run `nomad logs` for the driver's allocation ID.
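If you want to follow the driver's output while a job runs, a small helper like
the one below can save some copy-pasting. This is a minimal sketch, not one of
the provisioned files: it assumes `JOB_ID` is the Nomad job name created by
`spark-submit`, and that the driver's allocation is the first `running` entry
in the job's Allocations table (adjust the `awk` filter if your job's task
groups are laid out differently):

```bash
#!/usr/bin/env bash
# Hypothetical helper: follow the driver logs for a Spark job on Nomad.
JOB_ID="$1"

# Pick the first running allocation from the Allocations table.
# Short alloc IDs are 8 hex characters, which distinguishes alloc rows
# from the "Status = running" summary line. Assumption: the driver's
# allocation is listed first among the running ones.
DRIVER_ALLOC_ID=$(nomad status "$JOB_ID" | awk 'length($1) == 8 && /running/ {print $1; exit}')

# Stream the driver task's stdout as the job runs.
nomad logs -f "$DRIVER_ALLOC_ID"
```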
### SparkPi (Java)

```bash
spark-submit \
  --class org.apache.spark.examples.JavaSparkPi \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  https://nomad-spark.s3.amazonaws.com/spark-examples_2.11-2.1.0-SNAPSHOT.jar 100
```

### Word count (Java)

```bash
spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  https://nomad-spark.s3.amazonaws.com/spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  hdfs://hdfs.service.consul/foo/history.log
```

### DFSReadWriteTest (Scala)

```bash
spark-submit \
  --class org.apache.spark.examples.DFSReadWriteTest \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  https://nomad-spark.s3.amazonaws.com/spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  /etc/sudoers hdfs://hdfs.service.consul/foo
```

### spark-shell

Start the shell:

```bash
spark-shell \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz
```

Run a few commands:

```scala
spark.version

val data = 1 to 10000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect()
```

### sql-shell

Start the shell:

```bash
spark-sql \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  jars/spark-sql_2.11-2.1.0-SNAPSHOT.jar
```

Run a few commands:

```sql
CREATE TEMPORARY VIEW usersTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "/usr/local/bin/spark/examples/src/main/resources/users.parquet"
);

SELECT * FROM usersTable;
```

### pyspark

Start the shell:

```bash
pyspark \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz
```

Run a few commands:

```python
df = spark.read.json("/usr/local/bin/spark/examples/src/main/resources/people.json")
df.show()
df.printSchema()
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
```
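Beyond the interactive shell, the same `spark-submit` pattern used for the
JAR-based examples above also works for standalone Python applications. The
sketch below is illustrative rather than one of the provisioned samples:
`my_app.py` is a hypothetical script, staged in HDFS because in cluster mode
the application file must be reachable from the cluster (just as the examples
above fetch their JAR from a URL):

```bash
# Sketch only: my_app.py is a hypothetical script, not a provisioned sample.
# Stage it in HDFS so the driver can fetch it, as the setup section did for
# history.log:
hdfs dfs -put my_app.py /foo

# Submit it with the same Nomad-specific flags as the JAR examples above.
spark-submit \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  hdfs://hdfs.service.consul/foo/my_app.py
```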