# Nomad / Spark integration

The Nomad ecosystem includes a fork of Apache Spark that natively supports using
a Nomad cluster to run Spark applications. When running on Nomad, the Spark
executors that run Spark tasks for your application, and optionally the
application driver itself, run as Nomad tasks in a Nomad job. See the
[usage guide](./RunningSparkOnNomad.pdf) for more details.

Clusters provisioned with Nomad's Terraform templates are automatically
configured to run the Spark integration. The sample job files found here are
also provisioned onto every client and server.

## Setup

To give the Spark integration a test drive, provision a cluster and SSH to any
one of the clients or servers (the public IPs are displayed when the Terraform
provisioning process completes):

```bash
$ ssh -i /path/to/key ubuntu@PUBLIC_IP
```

The Spark history server and several of the sample Spark jobs below require
HDFS. Using the included job file, deploy an HDFS cluster on Nomad:

```bash
$ cd $HOME/examples/spark
$ nomad run hdfs.nomad
$ nomad status hdfs
```

When all of the allocations are in the `running` state (as shown by `nomad
status hdfs`), query Consul to verify that the HDFS service has been registered:

```bash
$ dig hdfs.service.consul
```

Next, create directories and files in HDFS for use by the history server and
the sample Spark jobs:

```bash
$ hdfs dfs -mkdir /foo
$ hdfs dfs -put /var/log/apt/history.log /foo
$ hdfs dfs -mkdir /spark-events
$ hdfs dfs -ls /
```

Finally, deploy the Spark history server:

```bash
$ nomad run spark-history-server-hdfs.nomad
```

You can get the private IP for the history server with a Consul DNS lookup:

```bash
$ dig spark-history.service.consul
```

Cross-reference the private IP with the `terraform apply` output to get the
corresponding public IP. You can access the history server at
`http://PUBLIC_IP:18080`.

## Sample Spark jobs

The sample `spark-submit` commands listed below demonstrate several of the
official Spark examples. Features like `spark-sql`, `spark-shell`, and
`pyspark` are included. The commands can be executed from any client or server.

You can monitor the status of a Spark job in a second terminal session with:

```bash
$ nomad status
$ nomad status JOB_ID
$ nomad alloc-status DRIVER_ALLOC_ID
$ nomad logs DRIVER_ALLOC_ID
```

To view the output of the job, run `nomad logs` for the driver's allocation ID.
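If you want to follow the driver's output while a job runs, a small helper like
the one below can save some copy-pasting. This is a minimal sketch, not one of
the provisioned files: it assumes `JOB_ID` is the Nomad job name created by
`spark-submit`, and that the driver's allocation is the first `running` entry
in the job's Allocations table (adjust the `awk` filter if your job's task
groups are laid out differently):

```bash
#!/usr/bin/env bash
# Hypothetical helper: follow the driver logs for a Spark job on Nomad.
JOB_ID="$1"

# Pick the first running allocation from the Allocations table.
# Short alloc IDs are 8 hex characters, which distinguishes alloc rows
# from the "Status = running" summary line. Assumption: the driver's
# allocation is listed first among the running ones.
DRIVER_ALLOC_ID=$(nomad status "$JOB_ID" | awk 'length($1) == 8 && /running/ {print $1; exit}')

# Stream the driver task's stdout as the job runs.
nomad logs -f "$DRIVER_ALLOC_ID"
```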
### SparkPi (Java)

```bash
spark-submit \
  --class org.apache.spark.examples.JavaSparkPi \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  https://nomad-spark.s3.amazonaws.com/spark-examples_2.11-2.1.0-SNAPSHOT.jar 100
```

### Word count (Java)

```bash
spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  https://nomad-spark.s3.amazonaws.com/spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  hdfs://hdfs.service.consul/foo/history.log
```

### DFSReadWriteTest (Scala)

```bash
spark-submit \
  --class org.apache.spark.examples.DFSReadWriteTest \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  https://nomad-spark.s3.amazonaws.com/spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  /etc/sudoers hdfs://hdfs.service.consul/foo
```

### spark-shell

Start the shell:

```bash
spark-shell \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz
```

Run a few commands:

```scala
spark.version

val data = 1 to 10000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect()
```

### sql-shell

Start the shell:

```bash
spark-sql \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  jars/spark-sql_2.11-2.1.0-SNAPSHOT.jar
```

Run a few commands:

```sql
CREATE TEMPORARY VIEW usersTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "/usr/local/bin/spark/examples/src/main/resources/users.parquet"
);

SELECT * FROM usersTable;
```

### pyspark

Start the shell:

```bash
pyspark \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz
```

Run a few commands:

```python
df = spark.read.json("/usr/local/bin/spark/examples/src/main/resources/people.json")
df.show()
df.printSchema()
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
```
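Beyond the interactive shell, the same `spark-submit` pattern used for the
JAR-based examples above also works for standalone Python applications. The
sketch below is illustrative rather than one of the provisioned samples:
`my_app.py` is a hypothetical script, staged in HDFS because in cluster mode
the application file must be reachable from the cluster (just as the examples
above fetch their JAR from a URL):

```bash
# Sketch only: my_app.py is a hypothetical script, not a provisioned sample.
# Stage it in HDFS so the driver can fetch it, as the setup section did for
# history.log:
hdfs dfs -put my_app.py /foo

# Submit it with the same Nomad-specific flags as the JAR examples above.
spark-submit \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.nomad.sparkDistribution=https://nomad-spark.s3.amazonaws.com/spark-2.1.0-bin-nomad.tgz \
  hdfs://hdfs.service.consul/foo/my_app.py
```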