---
layout: "guides"
page_title: "Apache Spark Integration - Using HDFS"
sidebar_current: "guides-spark-hdfs"
description: |-
  Learn how to deploy HDFS on Nomad and integrate it with Spark.
---

# Using HDFS

[HDFS](https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)
is a distributed, replicated, and scalable file system written for the Hadoop
framework. Spark was designed to read from and write to HDFS, since it is
common for Spark applications to perform data-intensive processing over large
datasets. HDFS can be deployed as its own Nomad job.

## Running HDFS on Nomad

A sample HDFS job file can be found [here](https://github.com/hashicorp/nomad/blob/master/terraform/examples/spark/hdfs.nomad).
It has two task groups, one for the HDFS NameNode and one for the
DataNodes. Both task groups use a [Docker image](https://github.com/hashicorp/nomad/tree/master/terraform/examples/spark/docker/hdfs) that includes Hadoop:

```hcl
group "NameNode" {

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "NameNode" {

    driver = "docker"

    config {
      image   = "rcgenova/hadoop-2.7.3"
      command = "bash"
      args    = ["-c", "hdfs namenode -format && exec hdfs namenode -D fs.defaultFS=hdfs://${NOMAD_ADDR_ipc}/ -D dfs.permissions.enabled=false"]
      network_mode = "host"
      port_map {
        ipc = 8020
        ui  = 50070
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "ipc" {
          static = "8020"
        }
        port "ui" {
          static = "50070"
        }
      }
    }

    service {
      name = "hdfs"
      port = "ipc"
    }
  }
}
```

The NameNode task registers itself in Consul as `hdfs`. This enables the
DataNodes to generically reference the NameNode:

```hcl
group "DataNode" {

  count = 3

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "DataNode" {

    driver = "docker"

    config {
      network_mode = "host"
      image        = "rcgenova/hadoop-2.7.3"
      args = [
        "hdfs", "datanode",
        "-D", "fs.defaultFS=hdfs://hdfs.service.consul/",
        "-D", "dfs.permissions.enabled=false",
      ]
      port_map {
        data = 50010
        ipc  = 50020
        ui   = 50075
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "data" {
          static = "50010"
        }
        port "ipc" {
          static = "50020"
        }
        port "ui" {
          static = "50075"
        }
      }
    }
  }
}
```

Another viable option for the DataNode task group is to use a dedicated
[system](https://www.nomadproject.io/docs/runtime/schedulers.html#system) job.
This deploys a DataNode to every client node in the cluster, which may or may
not be desirable depending on your use case.

The HDFS job can be deployed using the `nomad job run` command:

```shell
$ nomad job run hdfs.nomad
```

## Production Deployment Considerations

A production deployment will typically have redundant NameNodes in an
active/passive configuration (which requires ZooKeeper). See [HDFS High
Availability](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html).

## Next Steps

Learn how to [monitor the output](/guides/spark/monitoring.html) of your
Spark applications.
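
---

As an aside to the DataNode discussion above, the system-job variant can be sketched as follows. This is a minimal, hypothetical fragment, not part of the sample job file: the job name and `datacenters` value are illustrative, and the image and arguments are assumed to match the sample job. The `system` scheduler places one allocation per eligible client node, so neither `count` nor a `distinct_hosts` constraint is needed:

```hcl
# Hypothetical sketch: DataNodes as a system job (one per client node).
job "hdfs-datanodes" {
  datacenters = ["dc1"]   # illustrative value
  type        = "system"  # system scheduler: one allocation per client node

  group "DataNode" {
    task "DataNode" {
      driver = "docker"

      config {
        network_mode = "host"
        image        = "rcgenova/hadoop-2.7.3"
        args = [
          "hdfs", "datanode",
          "-D", "fs.defaultFS=hdfs://hdfs.service.consul/",
          "-D", "dfs.permissions.enabled=false",
        ]
      }
    }
  }
}
```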