# Spark on GlusterFS example

This guide extends the standard [Spark on Kubernetes Guide](../../../examples/spark/) and describes how to run Spark on GlusterFS using the [Kubernetes Volume Plugin for GlusterFS](../../../examples/volumes/glusterfs/).

As in the standard Spark guide, you set up a Spark Master Service, but here you deploy a modified Spark Master ReplicationController and a modified Spark Worker ReplicationController. Both are modified to use the GlusterFS volume plugin, which mounts a GlusterFS volume into the Spark Master and Spark Worker containers. This example can also serve as a guide for using any of the other Kubernetes volume plugins with the Spark example.

[A video walkthrough of how to set this solution up is also available](https://youtu.be/xyIaoM0-gM0).

## Step Zero: Prerequisites

This example assumes that you have already gotten the standard Spark example working in Kubernetes and that you have a GlusterFS cluster that is accessible from your Kubernetes cluster. You should also be familiar with the GlusterFS Volume Plugin and how to configure it.

## Step One: Define the endpoints for your GlusterFS Cluster

Modify the `examples/spark/spark-gluster/glusterfs-endpoints.yaml` file to list the IP addresses of some of the servers in your GlusterFS cluster. The GlusterFS Volume Plugin uses these IP addresses to perform a FUSE mount of the GlusterFS volume into the Spark Master and Spark Worker containers that are launched by the ReplicationControllers in the following sections.

Register your endpoints by running the following command:

```console
$ kubectl create -f examples/spark/spark-gluster/glusterfs-endpoints.yaml
```
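
To make the shape of the endpoints definition concrete, here is a minimal Python sketch that builds a Kubernetes `v1` `Endpoints` object for a list of GlusterFS server IPs and prints it as JSON. The function name `build_glusterfs_endpoints`, the endpoints name `glusterfs-cluster`, the placeholder port `1`, and the example IP addresses are assumptions made for illustration; the file shipped with this example may differ.

```python
import json


def build_glusterfs_endpoints(name, server_ips, port=1):
    """Build a v1 Endpoints object with one subset per GlusterFS server.

    `name` and `port` are illustrative assumptions, not values taken from
    the example's glusterfs-endpoints.yaml.
    """
    if not server_ips:
        raise ValueError("at least one GlusterFS server IP is required")
    subsets = [
        {"addresses": [{"ip": ip}], "ports": [{"port": port}]}
        for ip in server_ips
    ]
    return {
        "kind": "Endpoints",
        "apiVersion": "v1",
        "metadata": {"name": name},
        "subsets": subsets,
    }


if __name__ == "__main__":
    # Hypothetical server addresses; replace them with your GlusterFS servers.
    manifest = build_glusterfs_endpoints(
        "glusterfs-cluster", ["192.0.2.10", "192.0.2.11"]
    )
    print(json.dumps(manifest, indent=2))
```

Because `kubectl create -f` accepts JSON as well as YAML, the printed object could be saved to a file and registered the same way as the YAML file above.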

## Step Two: Modify and Submit your Spark Master ReplicationController

Modify the `examples/spark/spark-gluster/spark-master-controller.yaml` file so that the `path` parameter in the `volumes` subsection names the GlusterFS volume that you wish to use.
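
As a rough illustration of what that subsection contains, the sketch below builds the two pieces a pod template needs for a GlusterFS mount: a `volumes` entry whose `glusterfs` source names the endpoints and the volume path, and a matching `volumeMounts` entry for the container. The helper names and the values `glusterfsvol`, `glusterfs-cluster`, and `MyVolume` are hypothetical; only the `/mnt/glusterfs` mount path comes from this guide.

```python
import json


def glusterfs_volume(volume_name, endpoints_name, gluster_volume, read_only=False):
    """Return a pod-level `volumes` entry backed by the GlusterFS plugin."""
    return {
        "name": volume_name,
        "glusterfs": {
            "endpoints": endpoints_name,  # name of the Endpoints object
            "path": gluster_volume,       # name of the GlusterFS volume
            "readOnly": read_only,
        },
    }


def volume_mount(volume_name, mount_path):
    """Return a container-level `volumeMounts` entry for the same volume."""
    return {"name": volume_name, "mountPath": mount_path}


if __name__ == "__main__":
    # All names below are placeholders chosen for this sketch.
    fragment = {
        "volumes": [glusterfs_volume("glusterfsvol", "glusterfs-cluster", "MyVolume")],
        "volumeMounts": [volume_mount("glusterfsvol", "/mnt/glusterfs")],
    }
    print(json.dumps(fragment, indent=2))
```

The `name` field is what ties the container's mount to the pod's volume, so it must be identical in both entries.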

Submit the Spark Master ReplicationController to create the Spark Master Pod:

```console
$ kubectl create -f examples/spark/spark-gluster/spark-master-controller.yaml
```

Verify that the Spark Master Pod deployed successfully:

```console
$ kubectl get pods
```

Submit the Spark Master Service:

```console
$ kubectl create -f examples/spark/spark-gluster/spark-master-service.yaml
```

Verify that the Spark Master Service deployed successfully:

```console
$ kubectl get services
```

## Step Three: Start your Spark workers

Modify the `examples/spark/spark-gluster/spark-worker-controller.yaml` file so that the `path` parameter in the `volumes` subsection names the GlusterFS volume that you wish to use.

Make sure that the replication factor for the pods is not greater than the number of nodes available in your Kubernetes cluster.
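
One way to check this constraint before submitting the controller is sketched below. It assumes you have captured the output of a node listing that prints one node per line (for example `kubectl get nodes -o name` on a recent kubectl); the function names and the sample node names are hypothetical.

```python
def count_nodes(node_listing):
    """Count non-empty lines in a one-node-per-line listing."""
    return sum(1 for line in node_listing.splitlines() if line.strip())


def replicas_fit(replicas, node_count):
    """Return True when the replica count is positive and does not exceed the node count."""
    return 0 < replicas <= node_count


if __name__ == "__main__":
    # Hypothetical listing; in practice, paste or pipe in your own output.
    listing = "node/worker-1\nnode/worker-2\nnode/worker-3\n"
    nodes = count_nodes(listing)
    print(nodes, replicas_fit(3, nodes), replicas_fit(4, nodes))
```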

Submit your Spark Worker ReplicationController by running the following command:

```console
$ kubectl create -f examples/spark/spark-gluster/spark-worker-controller.yaml
```

Verify that the Spark Worker ReplicationController deployed its pods successfully:

```console
$ kubectl get pods
```

Follow the steps from the standard example to verify that the Spark Worker pods have registered successfully with the Spark Master.

## Step Four: Submit a Spark Job

All the Spark Workers and the Spark Master in your cluster have the GlusterFS volume mounted, so any of them can act as the Spark client that submits a job. For simplicity, let's use the Spark Master as an example.

The Spark Worker and Spark Master containers include a `setup_client` utility script that takes two parameters: the Service IP of the Spark Master and the port that it is running on. This script must be run to set up the container as a Spark client before submitting any Spark jobs.

Obtain the Service IP (listed as `IP:`) and the full Pod name by running:

```console
$ kubectl describe pod spark-master-controller
```

Now we will shell into the Spark Master container and run a Spark job. In the example below, we run the Spark word count example with the input file and output directory located where GlusterFS is mounted in the Spark Master container. This submits the job to the Spark Master, which distributes the work to all the Spark Worker containers.

All the Spark Worker containers can access the data because they all have the same GlusterFS volume mounted at `/mnt/glusterfs`. We submit the job from the Spark Master rather than from an additional Spark Base container (as in the standard Spark example) because the Spark instance submitting the job must be able to access the data, and only the Spark Master and Spark Worker containers have GlusterFS mounted.

Shell into the Spark Master node (`spark-master-controller`) by running:

```console
$ kubectl exec spark-master-controller-<ID> -i -t -- bash -i

root@spark-master-controller-c1sqd:/# . /setup_client.sh <Service IP> 7077
root@spark-master-controller-c1sqd:/# pyspark

Python 2.7.9 (default, Mar  1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
15/06/26 14:25:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/
Using Python version 2.7.9 (default, Mar  1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> file = sc.textFile("/mnt/glusterfs/somefile.txt")
>>> counts = file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
>>> counts.saveAsTextFile("/mnt/glusterfs/output")
```
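
To see what the `flatMap`, `map`, and `reduceByKey` chain in that session computes, the following plain-Python sketch performs the same word count on an in-memory list of lines, without Spark. The function name `word_count` and the sample lines are made up for illustration; it is meant as a way to sanity-check results on a small input, not as a replacement for the distributed job.

```python
from collections import Counter


def word_count(lines):
    """Count words using the same tokenization as the PySpark session.

    Splitting on a single space means consecutive spaces produce
    empty-string "words", exactly as line.split(" ") does in Spark.
    """
    counts = Counter()
    for line in lines:
        for word in line.split(" "):  # flatMap: one line becomes many words
            counts[word] += 1         # map to (word, 1), then sum per key
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical input standing in for the lines of somefile.txt.
    print(word_count(["a b a", "b  c"]))
```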

While still in the container, you can see the output of your Spark job in the distributed file system by running the following:

```console
root@spark-master-controller-c1sqd:/# ls -l /mnt/glusterfs/output
```
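
The output directory normally holds one `part-NNNNN` file per partition, and PySpark's `saveAsTextFile` writes the string representation of each element, so each line should look like a Python tuple such as `(u'word', 1)`. Assuming that layout, a small sketch for loading the results back into a dictionary could look like this; the function name `read_counts` is hypothetical.

```python
import ast
import glob
import os


def read_counts(output_dir):
    """Load (word, count) tuples from every part-* file in output_dir."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                # Each line is a tuple literal such as (u'word', 1).
                word, count = ast.literal_eval(line)
                # Keys should already be unique after reduceByKey;
                # adding keeps the result correct if they are not.
                counts[word] = counts.get(word, 0) + count
    return counts


if __name__ == "__main__":
    # Path taken from the session above; adjust it if yours differs.
    print(read_counts("/mnt/glusterfs/output"))
```

Files that do not start with `part-`, such as the `_SUCCESS` marker, are skipped by the glob pattern.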