# Spark example

Following this example, you will create a functional [Apache
Spark](http://spark.apache.org/) cluster using Kubernetes and
[Docker](http://docker.io).

You will set up a Spark master service and a set of Spark workers using Spark's [standalone mode](http://spark.apache.org/docs/latest/spark-standalone.html).

For the impatient expert, jump straight to the [tl;dr](#tldr)
section.

### Sources

The Docker images are heavily based on https://github.com/mattf/docker-spark
and are curated in https://github.com/kubernetes/application-images/tree/master/spark.

The Spark UI Proxy is taken from https://github.com/aseigneurin/spark-ui-proxy.

The PySpark examples are taken from http://stackoverflow.com/questions/4114167/checking-if-a-number-is-a-prime-number-in-python/27946768#27946768

## Step Zero: Prerequisites

This example assumes that:

- You have a Kubernetes cluster installed and running.
- You have the `kubectl` command line tool installed in your path and configured to talk to your Kubernetes cluster.
- Your Kubernetes cluster is running [kube-dns](https://github.com/kubernetes/dns) or an equivalent integration.

Optionally, your Kubernetes cluster should be configured with a LoadBalancer integration (automatically configured via kube-up or GKE).

## Step One: Create namespace

```sh
$ kubectl create -f examples/staging/spark/namespace-spark-cluster.yaml
```

Now list all namespaces:

```sh
$ kubectl get namespaces
NAME            LABELS               STATUS
default         <none>               Active
spark-cluster   name=spark-cluster   Active
```

To configure kubectl to work with our namespace, we will create a new context using our current context as a base:

```sh
$ CURRENT_CONTEXT=$(kubectl config view -o jsonpath='{.current-context}')
$ USER_NAME=$(kubectl config view -o jsonpath='{.contexts[?(@.name == "'"${CURRENT_CONTEXT}"'")].context.user}')
$ CLUSTER_NAME=$(kubectl config view -o jsonpath='{.contexts[?(@.name == "'"${CURRENT_CONTEXT}"'")].context.cluster}')
$ kubectl config set-context spark --namespace=spark-cluster --cluster=${CLUSTER_NAME} --user=${USER_NAME}
$ kubectl config use-context spark
```

## Step Two: Start your Master service

The Master [service](https://kubernetes.io/docs/concepts/services-networking/service/) is the controlling service
for a Spark cluster.

Use the
[`examples/staging/spark/spark-master-controller.yaml`](spark-master-controller.yaml)
file to create a
[replication controller](https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/)
running the Spark Master service.

```console
$ kubectl create -f examples/staging/spark/spark-master-controller.yaml
replicationcontroller "spark-master-controller" created
```

Then, use the
[`examples/staging/spark/spark-master-service.yaml`](spark-master-service.yaml) file to
create a logical service endpoint that Spark workers can use to access the
Master pod:

```console
$ kubectl create -f examples/staging/spark/spark-master-service.yaml
service "spark-master" created
```

### Check to see if Master is running and accessible

```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          8m
```
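If the pod is stuck in `Pending` or `ImagePullBackOff` instead, `kubectl describe` will show the scheduling and image-pull events. This is a quick diagnostic sketch; substitute the pod name from your own output:

```console
$ kubectl describe pod spark-master-controller-5u0q5
```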
Check the logs to see the status of the master (use the pod name from the previous output):

```sh
$ kubectl logs spark-master-controller-5u0q5
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.5.1-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master-controller-g0oao.out
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/10/27 21:25:05 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/27 21:25:05 INFO SecurityManager: Changing view acls to: root
15/10/27 21:25:05 INFO SecurityManager: Changing modify acls to: root
15/10/27 21:25:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/27 21:25:06 INFO Slf4jLogger: Slf4jLogger started
15/10/27 21:25:06 INFO Remoting: Starting remoting
15/10/27 21:25:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/10/27 21:25:06 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077
15/10/27 21:25:07 INFO Master: Running Spark version 1.5.1
15/10/27 21:25:07 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/10/27 21:25:07 INFO MasterWebUI: Started MasterWebUI at http://spark-master:8080
15/10/27 21:25:07 INFO Utils: Successfully started service on port 6066.
15/10/27 21:25:07 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```
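A quick way to confirm that the master came up healthy is to grep the log for the leader-election line shown above (a small convenience sketch; again, substitute your own pod name):

```console
$ kubectl logs spark-master-controller-5u0q5 | grep "New state: ALIVE"
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```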
Once the master is started, we'll want to have a look at the Spark WebUI. To access it, we will deploy a [specialized proxy](https://github.com/aseigneurin/spark-ui-proxy). This proxy is necessary for accessing worker logs from the Spark UI.

Deploy the proxy controller with [`examples/staging/spark/spark-ui-proxy-controller.yaml`](spark-ui-proxy-controller.yaml):

```console
$ kubectl create -f examples/staging/spark/spark-ui-proxy-controller.yaml
replicationcontroller "spark-ui-proxy-controller" created
```

We'll also need a corresponding load-balanced service for our proxy, [`examples/staging/spark/spark-ui-proxy-service.yaml`](spark-ui-proxy-service.yaml):

```console
$ kubectl create -f examples/staging/spark/spark-ui-proxy-service.yaml
service "spark-ui-proxy" created
```

After creating the service, you should eventually get a load-balanced endpoint:

```console
$ kubectl get svc spark-ui-proxy -o wide
NAME             CLUSTER-IP    EXTERNAL-IP                                                               PORT(S)   AGE       SELECTOR
spark-ui-proxy   10.0.51.107   aad59283284d611e6839606c214502b5-833417581.us-east-1.elb.amazonaws.com   80/TCP    9m        component=spark-ui-proxy
```

The Spark UI in the above example output will be available at http://aad59283284d611e6839606c214502b5-833417581.us-east-1.elb.amazonaws.com

If your Kubernetes cluster is not equipped with a LoadBalancer integration, you will need to use [kubectl proxy](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#using-kubectl-proxy) to
connect to the Spark WebUI:

```console
kubectl proxy --port=8001
```

At which point the UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-master:8080/](http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-master:8080/).

## Step Three: Start your Spark workers

The Spark workers do the heavy lifting in a Spark cluster. They
provide execution resources and data cache capabilities for your
program.

The Spark workers need the Master service to be running.

Use the [`examples/staging/spark/spark-worker-controller.yaml`](spark-worker-controller.yaml) file to create a
[replication controller](https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/) that manages the worker pods.

```console
$ kubectl create -f examples/staging/spark/spark-worker-controller.yaml
replicationcontroller "spark-worker-controller" created
```

### Check to see if the workers are running

If you launched the Spark WebUI, your workers should just appear in the UI when
they're ready. (It may take a little bit to pull the images and launch the
pods.) You can also interrogate the status in the following way:

```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          25m
spark-worker-controller-e8otp   1/1       Running   0          6m
spark-worker-controller-fiivl   1/1       Running   0          6m
spark-worker-controller-ytc7o   1/1       Running   0          6m

$ kubectl logs spark-master-controller-5u0q5
[...]
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM
```
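Because the workers are managed by a replication controller, the cluster can be resized with `kubectl scale`. A sketch, where the replica count of 4 is an arbitrary example; pick a number that fits your nodes:

```console
$ kubectl scale rc spark-worker-controller --replicas=4
```

New workers find the master through the `spark-master` DNS name, which is why kube-dns (or an equivalent) is listed as a prerequisite.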
## Step Four: Start the Zeppelin UI to launch jobs on your Spark cluster

The Zeppelin UI pod can be used to launch jobs into the Spark cluster either via
a web notebook frontend or the traditional Spark command line. See
[Zeppelin](https://zeppelin.incubator.apache.org/) and the
[Spark architecture overview](https://spark.apache.org/docs/latest/cluster-overview.html)
for more details.

Zeppelin needs the spark-master service to be running.

Deploy Zeppelin:

```console
$ kubectl create -f examples/staging/spark/zeppelin-controller.yaml
replicationcontroller "zeppelin-controller" created
```

And the corresponding service:

```console
$ kubectl create -f examples/staging/spark/zeppelin-service.yaml
service "zeppelin" created
```

### Check to see if Zeppelin is running

```console
$ kubectl get pods -l component=zeppelin
NAME                        READY     STATUS    RESTARTS   AGE
zeppelin-controller-ja09s   1/1       Running   0          53s
```

## Step Five: Do something with the cluster

Now you have two choices, depending on your predilections. You can do something
graphical with the Spark cluster, or you can stay in the CLI.

For both choices, we will be working with this Python snippet (note that it is Python 2, matching the interpreter shipped in the Zeppelin image):

```python
from math import sqrt; from itertools import count, islice

def isprime(n):
    return n > 1 and all(n%i for i in islice(count(2), int(sqrt(n)-1)))

nums = sc.parallelize(xrange(10000000))
print nums.filter(isprime).count()
```

### Do something fast with pyspark!

Simply copy and paste the Python snippet into `pyspark` from within the Zeppelin pod:

```console
$ kubectl exec zeppelin-controller-ja09s -it pyspark
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.9 (default, Mar 1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from math import sqrt; from itertools import count, islice
>>>
>>> def isprime(n):
...     return n > 1 and all(n%i for i in islice(count(2), int(sqrt(n)-1)))
...
>>> nums = sc.parallelize(xrange(10000000))

>>> print nums.filter(isprime).count()
664579
```

Congratulations, you now know how many prime numbers there are within the first 10 million numbers!

### Do something graphical and shiny!

Creating the Zeppelin service should have yielded you a LoadBalancer endpoint:

```console
$ kubectl get svc zeppelin -o wide
NAME       CLUSTER-IP   EXTERNAL-IP                                                               PORT(S)   AGE       SELECTOR
zeppelin   10.0.154.1   a596f143884da11e6839506c114532b5-121893930.us-east-1.elb.amazonaws.com   80/TCP    3m        component=zeppelin
```

If your Kubernetes cluster does not have a LoadBalancer integration, then we will have to use port forwarding.

Take the Zeppelin pod from before and port-forward the WebUI port:

```console
$ kubectl port-forward zeppelin-controller-ja09s 8080:8080
```

This forwards `localhost` port 8080 to container port 8080. You can then find
Zeppelin at [http://localhost:8080/](http://localhost:8080/).
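Rather than copying the pod name by hand, you can look it up with the same label selector and jsonpath technique used in Step One (a small convenience sketch):

```console
$ ZEPPELIN_POD=$(kubectl get pods -l component=zeppelin -o jsonpath='{.items[0].metadata.name}')
$ kubectl port-forward ${ZEPPELIN_POD} 8080:8080
```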
Once you've loaded up the Zeppelin UI, create a "New Notebook". In there we will paste our Python snippet, but we need to add a `%pyspark` hint for Zeppelin to understand it:

```
%pyspark
from math import sqrt; from itertools import count, islice

def isprime(n):
    return n > 1 and all(n%i for i in islice(count(2), int(sqrt(n)-1)))

nums = sc.parallelize(xrange(10000000))
print nums.filter(isprime).count()
```

After pasting in our code, press shift+enter or click the play icon to the right of our snippet. The Spark job will run and once again we'll have our result!

## Result

You now have services and replication controllers for the Spark master, the Spark
workers and the Spark driver. You can take this example to the next step and start
using the Apache Spark cluster you just created; see the
[Spark documentation](https://spark.apache.org/documentation.html) for more
information.

## tl;dr

```console
kubectl create ns spark-cluster
kubectl create -f examples/staging/spark -n spark-cluster
```

After it's set up:

```console
kubectl get pods -n spark-cluster # Make sure everything is running
kubectl get svc -o wide -n spark-cluster # Get the LoadBalancer endpoints for spark-ui-proxy and zeppelin
```

At which point the Master UI and Zeppelin will be available at the URLs under the `EXTERNAL-IP` field.

You can also interact with the Spark cluster using the traditional `spark-shell` /
`spark-submit` / `pyspark` commands by using `kubectl exec` against the
`zeppelin-controller` pod.

If your Kubernetes cluster does not have a LoadBalancer integration, use `kubectl proxy` and `kubectl port-forward` to access the Spark UI and Zeppelin.

For the Spark UI:

```console
kubectl proxy --port=8001
```

Then visit [http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-ui-proxy/](http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-ui-proxy/).

For Zeppelin:

```console
kubectl port-forward zeppelin-controller-abc123 8080:8080 &
```

Then visit [http://localhost:8080/](http://localhost:8080/).

## Known Issues With Spark

* This provides a Spark configuration that is restricted to the cluster network,
  meaning the Spark master is only available as a cluster service. If you need
  to submit jobs using an external client other than Zeppelin or `spark-submit` on
  the `zeppelin` pod, you will need to provide a way for your clients to reach the
  service defined in
  [`examples/staging/spark/spark-master-service.yaml`](spark-master-service.yaml). See
  [Services](https://kubernetes.io/docs/concepts/services-networking/service/) for more information.

## Known Issues With Zeppelin

* The Zeppelin pod is large, so it may take a while to pull depending on your
  network. The size of the Zeppelin pod is something we're working on; see issue #17231.

* Zeppelin may take some time (about a minute) to run this pipeline the first
  time. It seems to take considerable time to load.

* On GKE, `kubectl port-forward` may not be stable over long periods of time. If
  you see Zeppelin go into a `Disconnected` state (there will be a red dot on the
  top right as well), the `port-forward` probably failed and needs to be
  restarted. See #12179.
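If the port-forward drops frequently, a crude but effective workaround is to run it in a retry loop so it reconnects automatically (a sketch; substitute your actual Zeppelin pod name):

```console
$ while true; do kubectl port-forward zeppelin-controller-abc123 8080:8080; done
```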