
# Spark example

Following this example, you will create a functional [Apache
Spark](http://spark.apache.org/) cluster using Kubernetes and
[Docker](http://docker.io).

You will set up a Spark master service and a set of Spark workers using Spark's [standalone mode](http://spark.apache.org/docs/latest/spark-standalone.html).

For the impatient expert, jump straight to the [tl;dr](#tldr)
section.

### Sources

The Docker images are heavily based on https://github.com/mattf/docker-spark
and are curated in https://github.com/kubernetes/application-images/tree/master/spark.

The Spark UI Proxy is taken from https://github.com/aseigneurin/spark-ui-proxy.

The PySpark examples are taken from http://stackoverflow.com/questions/4114167/checking-if-a-number-is-a-prime-number-in-python/27946768#27946768.

## Step Zero: Prerequisites

This example assumes

- You have a Kubernetes cluster installed and running.
- You have the `kubectl` command line tool installed in your path and configured to talk to your Kubernetes cluster.
- Your Kubernetes cluster is running [kube-dns](https://github.com/kubernetes/dns) or an equivalent integration.

Optionally, your Kubernetes cluster should be configured with a load balancer integration (automatically configured via kube-up or GKE).

## Step One: Create namespace

```sh
$ kubectl create -f examples/staging/spark/namespace-spark-cluster.yaml
```

Now list all namespaces:

```sh
$ kubectl get namespaces
NAME            LABELS               STATUS
default         <none>               Active
spark-cluster   name=spark-cluster   Active
```

To configure kubectl to work with our namespace, we will create a new context using our current context as a base:

```sh
$ CURRENT_CONTEXT=$(kubectl config view -o jsonpath='{.current-context}')
$ USER_NAME=$(kubectl config view -o jsonpath='{.contexts[?(@.name == "'"${CURRENT_CONTEXT}"'")].context.user}')
$ CLUSTER_NAME=$(kubectl config view -o jsonpath='{.contexts[?(@.name == "'"${CURRENT_CONTEXT}"'")].context.cluster}')
$ kubectl config set-context spark --namespace=spark-cluster --cluster=${CLUSTER_NAME} --user=${USER_NAME}
$ kubectl config use-context spark
```

## Step Two: Start your Master service

The Master [service](https://kubernetes.io/docs/concepts/services-networking/service/) is the controlling service
for a Spark cluster.

Use the
[`examples/staging/spark/spark-master-controller.yaml`](spark-master-controller.yaml)
file to create a
[replication controller](https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/)
running the Spark Master service.

```console
$ kubectl create -f examples/staging/spark/spark-master-controller.yaml
replicationcontroller "spark-master-controller" created
```

Then, use the
[`examples/staging/spark/spark-master-service.yaml`](spark-master-service.yaml) file to
create a logical service endpoint that Spark workers can use to access the
Master pod:

```console
$ kubectl create -f examples/staging/spark/spark-master-service.yaml
service "spark-master" created
```

### Check to see if Master is running and accessible

```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          8m
```

Check logs to see the status of the master. (Use the pod name retrieved from the previous output.)

```sh
$ kubectl logs spark-master-controller-5u0q5
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.5.1-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master-controller-g0oao.out
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/10/27 21:25:05 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/27 21:25:05 INFO SecurityManager: Changing view acls to: root
15/10/27 21:25:05 INFO SecurityManager: Changing modify acls to: root
15/10/27 21:25:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/27 21:25:06 INFO Slf4jLogger: Slf4jLogger started
15/10/27 21:25:06 INFO Remoting: Starting remoting
15/10/27 21:25:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/10/27 21:25:06 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077
15/10/27 21:25:07 INFO Master: Running Spark version 1.5.1
15/10/27 21:25:07 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/10/27 21:25:07 INFO MasterWebUI: Started MasterWebUI at http://spark-master:8080
15/10/27 21:25:07 INFO Utils: Successfully started service on port 6066.
15/10/27 21:25:07 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```

Once the master is started, we'll want to check the Spark WebUI. To access it, we will deploy a [specialized proxy](https://github.com/aseigneurin/spark-ui-proxy), which is necessary to reach worker logs from the Spark UI.

Deploy the proxy controller with [`examples/staging/spark/spark-ui-proxy-controller.yaml`](spark-ui-proxy-controller.yaml):

```console
$ kubectl create -f examples/staging/spark/spark-ui-proxy-controller.yaml
replicationcontroller "spark-ui-proxy-controller" created
```

We'll also need a corresponding load-balanced service for our Spark UI proxy, [`examples/staging/spark/spark-ui-proxy-service.yaml`](spark-ui-proxy-service.yaml):

```console
$ kubectl create -f examples/staging/spark/spark-ui-proxy-service.yaml
service "spark-ui-proxy" created
```

After creating the service, you should eventually get a load-balanced endpoint:

```console
$ kubectl get svc spark-ui-proxy -o wide
NAME             CLUSTER-IP    EXTERNAL-IP                                                               PORT(S)   AGE       SELECTOR
spark-ui-proxy   10.0.51.107   aad59283284d611e6839606c214502b5-833417581.us-east-1.elb.amazonaws.com    80/TCP    9m        component=spark-ui-proxy
```

The Spark UI in the above example output will be available at http://aad59283284d611e6839606c214502b5-833417581.us-east-1.elb.amazonaws.com.

If your Kubernetes cluster is not equipped with a load balancer integration, you will need to use [kubectl proxy](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#using-kubectl-proxy) to
connect to the Spark WebUI:

```console
kubectl proxy --port=8001
```

The UI will then be available at
[http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-master:8080/](http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-master:8080/).
## Step Three: Start your Spark workers

The Spark workers do the heavy lifting in a Spark cluster. They
provide execution resources and data cache capabilities for your
program.

The Spark workers need the Master service to be running.

Use the [`examples/staging/spark/spark-worker-controller.yaml`](spark-worker-controller.yaml) file to create a
[replication controller](https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/) that manages the worker pods.

```console
$ kubectl create -f examples/staging/spark/spark-worker-controller.yaml
replicationcontroller "spark-worker-controller" created
```
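
The worker manifest is not inlined here. Structurally, a replication controller of this kind looks roughly like the sketch below; the selector label, replica count, and image reference are illustrative assumptions (the sample output in this example happens to show three workers, and 8081 is the standalone worker's default WebUI port), so consult the actual file in the repository:

```yaml
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-worker-controller
spec:
  replicas: 3                       # illustrative; matches the three workers shown below
  selector:
    component: spark-worker         # assumed label
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: <spark-worker-image>   # placeholder; see the repository's curated image
          ports:
            - containerPort: 8081       # standalone worker WebUI default
```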

### Check to see if the workers are running

If you launched the Spark WebUI, your workers should appear in the UI as soon as
they're ready. (It may take a little while to pull the images and launch the
pods.) You can also interrogate the status in the following way:

```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          25m
spark-worker-controller-e8otp   1/1       Running   0          6m
spark-worker-controller-fiivl   1/1       Running   0          6m
spark-worker-controller-ytc7o   1/1       Running   0          6m

$ kubectl logs spark-master-controller-5u0q5
[...]
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM
```

## Step Four: Start the Zeppelin UI to launch jobs on your Spark cluster

The Zeppelin UI pod can be used to launch jobs into the Spark cluster either via
a web notebook frontend or the traditional Spark command line. See
[Zeppelin](https://zeppelin.incubator.apache.org/) and
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html)
for more details.

Zeppelin needs the spark-master service to be running.

Deploy Zeppelin:

```console
$ kubectl create -f examples/staging/spark/zeppelin-controller.yaml
replicationcontroller "zeppelin-controller" created
```

And the corresponding service:

```console
$ kubectl create -f examples/staging/spark/zeppelin-service.yaml
service "zeppelin" created
```

### Check to see if Zeppelin is running

```console
$ kubectl get pods -l component=zeppelin
NAME                        READY     STATUS    RESTARTS   AGE
zeppelin-controller-ja09s   1/1       Running   0          53s
```

## Step Five: Do something with the cluster

Now you have two choices, depending on your predilections. You can do something
graphical with the Spark cluster, or you can stay in the CLI.

For both choices, we will be working with this Python snippet (note that it
targets the Python 2 interpreter shipped in the Zeppelin pod, hence the `print`
statement and `xrange`):

```python
from math import sqrt; from itertools import count, islice

def isprime(n):
    return n > 1 and all(n%i for i in islice(count(2), int(sqrt(n)-1)))

nums = sc.parallelize(xrange(10000000))
print nums.filter(isprime).count()
```
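
If you want to sanity-check the `isprime` predicate locally on Python 3 before sending it to the cluster, the same logic works without Spark; the RDD is replaced here by an ordinary `range` (a sketch for local experimentation, not part of the example's manifests):

```python
from math import sqrt
from itertools import count, islice

def isprime(n):
    # Trial division over candidate divisors starting at 2, exactly as in
    # the PySpark snippet above; islice bounds the infinite counter near sqrt(n).
    return n > 1 and all(n % i for i in islice(count(2), int(sqrt(n) - 1)))

# Without Spark, filter an ordinary range instead of an RDD.
print(sum(1 for n in range(10000) if isprime(n)))  # -> 1229 primes below 10,000
```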

### Do something fast with pyspark!

Simply copy and paste the Python snippet into pyspark from within the Zeppelin pod:

```console
$ kubectl exec zeppelin-controller-ja09s -it pyspark
Python 2.7.9 (default, Mar  1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.9 (default, Mar  1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from math import sqrt; from itertools import count, islice
>>>
>>> def isprime(n):
...     return n > 1 and all(n%i for i in islice(count(2), int(sqrt(n)-1)))
...
>>> nums = sc.parallelize(xrange(10000000))
>>> print nums.filter(isprime).count()
664579
```

Congratulations, you now know how many prime numbers there are within the first 10 million numbers!
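
That figure is easy to cross-check without a cluster: a plain sieve of Eratosthenes on a single machine reproduces the same count in a few seconds (a local verification sketch, unrelated to the Spark deployment itself):

```python
def count_primes_below(limit):
    """Sieve of Eratosthenes: count the primes in [2, limit)."""
    if limit < 3:
        return 0
    sieve = bytearray([1]) * limit      # sieve[n] == 1 means "n still presumed prime"
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            # Cross off every multiple of i, starting at i*i.
            sieve[i * i :: i] = bytearray(len(range(i * i, limit, i)))
    return sum(sieve)

print(count_primes_below(10000000))  # -> 664579, matching the Spark job above
```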

### Do something graphical and shiny!

Creating the Zeppelin service should have yielded you a load balancer endpoint:

```console
$ kubectl get svc zeppelin -o wide
NAME       CLUSTER-IP   EXTERNAL-IP                                                               PORT(S)   AGE       SELECTOR
zeppelin   10.0.154.1   a596f143884da11e6839506c114532b5-121893930.us-east-1.elb.amazonaws.com    80/TCP    3m        component=zeppelin
```

If your Kubernetes cluster does not have a load balancer integration, you will have to use port forwarding instead.

Take the Zeppelin pod from before and port-forward the WebUI port:

```console
$ kubectl port-forward zeppelin-controller-ja09s 8080:8080
```

This forwards `localhost` port 8080 to container port 8080. You can then find
Zeppelin at [http://localhost:8080/](http://localhost:8080/).

Once you've loaded up the Zeppelin UI, create a "New Notebook". In there we will paste our Python snippet, but we need to add a `%pyspark` hint for Zeppelin to understand it:

```
%pyspark
from math import sqrt; from itertools import count, islice

def isprime(n):
    return n > 1 and all(n%i for i in islice(count(2), int(sqrt(n)-1)))

nums = sc.parallelize(xrange(10000000))
print nums.filter(isprime).count()
```

After pasting in our code, press shift+enter or click the play icon to the right of the snippet. The Spark job will run and once again we'll have our result!

## Result

You now have services and replication controllers for the Spark master, Spark
workers and Spark driver. You can take this example to the next step and start
using the Apache Spark cluster you just created; see the
[Spark documentation](https://spark.apache.org/documentation.html) for more
information.

## tl;dr

```console
kubectl create ns spark-cluster
kubectl create -f examples/staging/spark -n spark-cluster
```

After it's set up:

```console
kubectl get pods -n spark-cluster # Make sure everything is running
kubectl get svc -o wide -n spark-cluster # Get the load balancer endpoints for spark-ui-proxy and zeppelin
```

At that point the Master UI and Zeppelin will be available at the URLs under the `EXTERNAL-IP` field.

You can also interact with the Spark cluster using the traditional `spark-shell` /
`spark-submit` / `pyspark` commands by using `kubectl exec` against the
`zeppelin-controller` pod.

If your Kubernetes cluster does not have a load balancer integration, use `kubectl proxy` and `kubectl port-forward` to access the Spark UI and Zeppelin.

For the Spark UI:

```console
kubectl proxy --port=8001
```

Then visit [http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-ui-proxy/](http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-ui-proxy/).

For Zeppelin:

```console
kubectl port-forward zeppelin-controller-abc123 8080:8080 &
```

Then visit [http://localhost:8080/](http://localhost:8080/).

## Known Issues With Spark

* This example provides a Spark configuration that is restricted to the cluster network,
  meaning the Spark master is only available as a cluster service. If you need
  to submit jobs using an external client other than Zeppelin or `spark-submit` on
  the `zeppelin` pod, you will need to provide a way for your clients to reach the
  service defined in
  [`examples/staging/spark/spark-master-service.yaml`](spark-master-service.yaml). See
  [Services](https://kubernetes.io/docs/concepts/services-networking/service/) for more information.

## Known Issues With Zeppelin

* The Zeppelin pod is large, so it may take a while to pull depending on your
  network. The size of the Zeppelin pod is something we're working on; see issue #17231.

* Zeppelin may take some time (about a minute) to run this pipeline the first
  time you use it; it seems to take considerable time to load.

* On GKE, `kubectl port-forward` may not be stable over long periods of time. If
  you see Zeppelin go into a `Disconnected` state (there will be a red dot on the
  top right as well), the `port-forward` probably failed and needs to be
  restarted. See #12179.