---
layout: "guides"
page_title: "Apache Spark Integration - Using HDFS"
sidebar_current: "guides-spark-hdfs"
description: |-
  Learn how to deploy HDFS on Nomad and integrate it with Spark.
---

# Using HDFS

[HDFS](https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)
is a distributed, replicated, and scalable file system written for the Hadoop
framework. Spark was designed to read from and write to HDFS, since it is
common for Spark applications to perform data-intensive processing over large
datasets. HDFS can be deployed as its own Nomad job.

## Running HDFS on Nomad

A sample HDFS job file can be found [here](https://github.com/hashicorp/nomad/blob/master/terraform/examples/spark/hdfs.nomad).
It has two task groups, one for the HDFS NameNode and one for the
DataNodes. Both task groups use a [Docker image](https://github.com/hashicorp/nomad/tree/master/terraform/examples/spark/docker/hdfs) that includes Hadoop:

```hcl
group "NameNode" {

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "NameNode" {

    driver = "docker"

    config {
      image   = "rcgenova/hadoop-2.7.3"
      command = "bash"
      args    = ["-c", "hdfs namenode -format && exec hdfs namenode -D fs.defaultFS=hdfs://${NOMAD_ADDR_ipc}/ -D dfs.permissions.enabled=false"]
      network_mode = "host"
      port_map {
        ipc = 8020
        ui  = 50070
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "ipc" {
          static = "8020"
        }
        port "ui" {
          static = "50070"
        }
      }
    }

    service {
      name = "hdfs"
      port = "ipc"
    }
  }
}
```

The NameNode task registers itself in Consul as `hdfs`. This enables the
DataNodes to generically reference the NameNode:

```hcl
group "DataNode" {

  count = 3

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "DataNode" {

    driver = "docker"

    config {
      network_mode = "host"
      image        = "rcgenova/hadoop-2.7.3"
      args = [
        "hdfs", "datanode",
        "-D", "fs.defaultFS=hdfs://hdfs.service.consul/",
        "-D", "dfs.permissions.enabled=false",
      ]
      port_map {
        data = 50010
        ipc  = 50020
        ui   = 50075
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "data" {
          static = "50010"
        }
        port "ipc" {
          static = "50020"
        }
        port "ui" {
          static = "50075"
        }
      }
    }
  }
}
```

Another viable option for the DataNode task group is to use a dedicated
[system](https://www.nomadproject.io/docs/runtime/schedulers.html#system) job.
This deploys a DataNode to every client node in the cluster, which may or may
not be desirable depending on your use case.

The HDFS job can be deployed using the `nomad job run` command:

```shell
$ nomad job run hdfs.nomad
```

## Production Deployment Considerations

A production deployment will typically have redundant NameNodes in an
active/passive configuration (which requires ZooKeeper). See [HDFS High
Availability](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html).

## Next Steps

Learn how to [monitor the output](/guides/spark/monitoring.html) of your
Spark applications.
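
---

As an aside to the DataNode discussion above, the system-job variant can be sketched as follows. This is a minimal, hypothetical fragment, not part of the sample job file: the job name and `datacenters` value are illustrative, and the image and arguments are assumed to match the sample job. The `system` scheduler places one allocation per eligible client node, so neither `count` nor a `distinct_hosts` constraint is needed:

```hcl
# Hypothetical sketch: DataNodes as a system job (one per client node).
job "hdfs-datanodes" {
  datacenters = ["dc1"]   # illustrative value
  type        = "system"  # system scheduler: one allocation per client node

  group "DataNode" {
    task "DataNode" {
      driver = "docker"

      config {
        network_mode = "host"
        image        = "rcgenova/hadoop-2.7.3"
        args = [
          "hdfs", "datanode",
          "-D", "fs.defaultFS=hdfs://hdfs.service.consul/",
          "-D", "dfs.permissions.enabled=false",
        ]
      }
    }
  }
}
```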