# **Disaggregated HDP Spark and Hive with MinIO**

## **1. Cloud-native Architecture**

![cloud-native](https://github.com/minio/minio/blob/master/docs/bigdata/images/image1.png?raw=true "cloud native architecture")

Kubernetes manages stateless Spark and Hive containers elastically on the compute nodes. Spark has native scheduler integration with Kubernetes. Hive, for legacy reasons, uses the YARN scheduler on top of Kubernetes.

All access to MinIO object storage is via the S3/SQL SELECT API. In addition to the compute nodes, MinIO containers are also managed by Kubernetes as stateful containers with local storage (JBOD/JBOF) mapped as persistent local volumes. This architecture enables multi-tenant MinIO, allowing isolation of data between customers.

MinIO also supports multi-cluster, multi-site federation similar to AWS regions and tiers. Using MinIO Information Lifecycle Management (ILM), you can configure data to be tiered between NVMe-based hot storage and HDD-based warm storage. All data is encrypted with a per-object key. Access control and identity management between the tenants are handled by MinIO using OpenID Connect or Kerberos/LDAP/AD.

## **2. Prerequisites**

- Install the Hortonworks Distribution using this [guide](https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.1.0/bk_ambari-installation/content/ch_Installing_Ambari.html).
- [Set up Ambari](https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.1.0/bk_ambari-installation/content/set_up_the_ambari_server.html), which automatically sets up YARN.
- [Install Spark](https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/installing-spark/content/installing_spark.html).
- Install the MinIO Distributed Server using one of the guides below.
  - [Deployment based on Kubernetes](https://min.io/docs/minio/kubernetes/upstream/index.html#quickstart-for-kubernetes)
  - [Deployment based on MinIO Helm Chart](https://github.com/helm/charts/tree/master/stable/minio)

## **3. Configure Hadoop, Spark, Hive to use MinIO**

After a successful installation, navigate to the Ambari UI at `http://<ambari-server>:8080/` and log in using the default credentials: [**_username: admin, password: admin_**]

![ambari-login](https://github.com/minio/minio/blob/master/docs/bigdata/images/image3.png?raw=true "ambari login")

### **3.1 Configure Hadoop**

Navigate to **Services** -> **HDFS** -> **CONFIGS** -> **ADVANCED** as shown below

![hdfs-configs](https://github.com/minio/minio/blob/master/docs/bigdata/images/image2.png?raw=true "hdfs advanced configs")

Navigate to **Custom core-site** to configure MinIO parameters for the `s3a` connector

![s3a-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image5.png?raw=true "custom core-site")

```
# Install yq and define a helper to print the config as key=value pairs
sudo pip install yq
alias kv-pairify='yq ".configuration[]" | jq ".[]" | jq -r ".name + \"=\" + .value"'
```

Let's take, for example, a set of 12 compute nodes with an aggregate memory of _1.2TiB_; we need the following settings for optimal results. Add the following optimal entries to _core-site.xml_ to configure _s3a_ with **MinIO**.
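In Ambari these values are entered as key/value pairs under **Custom core-site**; if you maintain _core-site.xml_ by hand instead, each pair becomes a `<property>` element. A minimal sketch, using the placeholder endpoint and credentials from this guide:

```
<!-- placeholder endpoint and credentials; substitute your deployment's values -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>http://minio:9000</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>minio</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>minio123</value>
</property>
```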
The most important options here are:

```
cat ${HADOOP_CONF_DIR}/core-site.xml | kv-pairify | grep "mapred"

mapred.maxthreads.generate.mapoutput=2 # Num threads to write map outputs
mapred.maxthreads.partition.closer=0 # Asynchronous map flushers
mapreduce.fileoutputcommitter.algorithm.version=2 # Use the latest committer version
mapreduce.job.reduce.slowstart.completedmaps=0.99 # 99% map, then reduce
mapreduce.reduce.shuffle.input.buffer.percent=0.9 # Min % buffer in RAM
mapreduce.reduce.shuffle.merge.percent=0.9 # Minimum % merges in RAM
mapreduce.reduce.speculative=false # Disable speculation for reducers
mapreduce.task.io.sort.factor=999 # Threshold before writing to disk
mapreduce.task.sort.spill.percent=0.9 # Minimum % before spilling to disk
```

S3A is the connector used to access S3 and other S3-compatible object stores such as MinIO. MapReduce workloads typically interact with object stores in the same way they do with HDFS. These workloads rely on HDFS atomic rename functionality to complete writing data to the datastore. Object storage operations are atomic by nature and do not require or implement a rename API. The default S3A committer emulates renames through copy and delete APIs, an interaction pattern that causes a significant loss of performance because of write amplification. _Netflix_, for example, developed two new staging committers - the Directory staging committer and the Partitioned staging committer - to take full advantage of native object storage operations. These committers do not require a rename operation. The two staging committers, along with another new addition called the Magic committer, were evaluated for benchmarking.

The Directory staging committer was found to be the fastest of the three, and the S3A connector should be configured with the following parameters for optimal results:

```
cat ${HADOOP_CONF_DIR}/core-site.xml | kv-pairify | grep "s3a"

fs.s3a.access.key=minio
fs.s3a.secret.key=minio123
fs.s3a.path.style.access=true
fs.s3a.block.size=512M
fs.s3a.buffer.dir=${hadoop.tmp.dir}/s3a
fs.s3a.committer.magic.enabled=false
fs.s3a.committer.name=directory
fs.s3a.committer.staging.abort.pending.uploads=true
fs.s3a.committer.staging.conflict-mode=append
fs.s3a.committer.staging.tmp.path=/tmp/staging
fs.s3a.committer.staging.unique-filenames=true
fs.s3a.connection.establish.timeout=5000
fs.s3a.connection.ssl.enabled=false
fs.s3a.connection.timeout=200000
fs.s3a.endpoint=http://minio:9000
fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

fs.s3a.committer.threads=2048 # Number of threads writing to MinIO
fs.s3a.connection.maximum=8192 # Maximum number of concurrent conns
fs.s3a.fast.upload.active.blocks=2048 # Number of parallel uploads
fs.s3a.fast.upload.buffer=disk # Use disk as the buffer for uploads
fs.s3a.fast.upload=true # Turn on fast upload mode
fs.s3a.max.total.tasks=2048 # Maximum number of parallel tasks
fs.s3a.multipart.size=512M # Size of each multipart chunk
fs.s3a.multipart.threshold=512M # Size before using multipart uploads
fs.s3a.socket.recv.buffer=65536 # Read socket buffer hint
fs.s3a.socket.send.buffer=65536 # Write socket buffer hint
fs.s3a.threads.max=2048 # Maximum number of threads for S3A
```
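With these values in place, a quick smoke test is to touch a bucket over `s3a://` from any node with the Hadoop client. The bucket name below is only an example and must already exist on the MinIO deployment:

```
# Create a prefix under an existing bucket and list it back through the S3A connector
hadoop fs -mkdir -p s3a://testbucket/smoke-test
hadoop fs -ls s3a://testbucket/
```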
The rest of the optimization options are discussed in the links below:

- [https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)
- [https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html](https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html)

Once the config changes are applied, proceed to restart **Hadoop** services.

![hdfs-services](https://github.com/minio/minio/blob/master/docs/bigdata/images/image7.png?raw=true "hdfs restart services")

### **3.2 Configure Spark2**

Navigate to **Services** -> **Spark2** -> **CONFIGS** as shown below

![spark-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image6.png?raw=true "spark config")

Navigate to “**Custom spark-defaults**” to configure MinIO parameters for the `s3a` connector

![spark-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image9.png?raw=true "spark defaults")

Add the following optimal entries to _spark-defaults.conf_ to configure Spark with **MinIO**.

```
spark.hadoop.fs.s3a.access.key minio
spark.hadoop.fs.s3a.secret.key minio123
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.block.size 512M
spark.hadoop.fs.s3a.buffer.dir ${hadoop.tmp.dir}/s3a
spark.hadoop.fs.s3a.committer.magic.enabled false
spark.hadoop.fs.s3a.committer.name directory
spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads true
spark.hadoop.fs.s3a.committer.staging.conflict-mode append
spark.hadoop.fs.s3a.committer.staging.tmp.path /tmp/staging
spark.hadoop.fs.s3a.committer.staging.unique-filenames true
spark.hadoop.fs.s3a.committer.threads 2048 # number of threads writing to MinIO
spark.hadoop.fs.s3a.connection.establish.timeout 5000
spark.hadoop.fs.s3a.connection.maximum 8192 # maximum number of concurrent conns
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.connection.timeout 200000
spark.hadoop.fs.s3a.endpoint http://minio:9000
spark.hadoop.fs.s3a.fast.upload.active.blocks 2048 # number of parallel uploads
spark.hadoop.fs.s3a.fast.upload.buffer disk # use disk as the buffer for uploads
spark.hadoop.fs.s3a.fast.upload true # turn on fast upload mode
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.max.total.tasks 2048 # maximum number of parallel tasks
spark.hadoop.fs.s3a.multipart.size 512M # size of each multipart chunk
spark.hadoop.fs.s3a.multipart.threshold 512M # size before using multipart uploads
spark.hadoop.fs.s3a.socket.recv.buffer 65536 # read socket buffer hint
spark.hadoop.fs.s3a.socket.send.buffer 65536 # write socket buffer hint
spark.hadoop.fs.s3a.threads.max 2048 # maximum number of threads for S3A
```
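Before committing these values cluster-wide in Ambari, the same settings can be supplied per job with `--conf`, which is a convenient way to test the connector from a single node. A minimal sketch using the placeholder endpoint and credentials from this guide:

```
# Launch a Spark shell with the S3A settings passed on the command line
# (placeholder endpoint and credentials; substitute your deployment's values)
./bin/spark-shell \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \
  --conf spark.hadoop.fs.s3a.access.key=minio \
  --conf spark.hadoop.fs.s3a.secret.key=minio123 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```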
Once the config changes are applied, proceed to restart **Spark** services.

![spark-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image12.png?raw=true "spark restart services")

### **3.3 Configure Hive**

Navigate to **Services** -> **Hive** -> **CONFIGS** -> **ADVANCED** as shown below

![hive-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image10.png?raw=true "hive advanced config")

Navigate to “**Custom hive-site**” to configure MinIO parameters for the `s3a` connector

![hive-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image11.png?raw=true "hive advanced config")

Add the following optimal entries to `hive-site.xml` to configure Hive with **MinIO**.

```
hive.blobstore.use.blobstore.as.scratchdir=true
hive.exec.input.listing.max.threads=50
hive.load.dynamic.partitions.thread=25
hive.metastore.fshandler.threads=50
hive.mv.files.threads=40
mapreduce.input.fileinputformat.list-status.num-threads=50
```

For more information about these options, please visit [https://www.cloudera.com/documentation/enterprise/5-11-x/topics/admin_hive_on_s3_tuning.html](https://www.cloudera.com/documentation/enterprise/5-11-x/topics/admin_hive_on_s3_tuning.html)

![hive-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image13.png?raw=true "hive advanced custom config")

Once the config changes are applied, proceed to restart all Hive services.

![hive-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image14.png?raw=true "restart hive services")

## **4. Run Sample Applications**

After successfully installing Hive, Hadoop and Spark, we can now run some sample applications to check that they are configured appropriately. We can use the Spark Pi and Spark WordCount programs to validate our Spark installation. We can also explore how to run Spark jobs from the command line and from the Spark shell.

### **4.1 Spark Pi**

Test the Spark installation by running the following compute-intensive example, which calculates pi by “throwing darts” at a circle. The program generates points in the unit square ((0,0) to (1,1)) and counts how many points fall within the unit circle within the square. The result approximates pi.

Follow these steps to run the Spark Pi example:

- Log in as user **‘spark’**.
- When the job runs, the library can now use **MinIO** during intermediate processing.
- Navigate to a node with the Spark client and access the spark2-client directory:

```
cd /usr/hdp/current/spark2-client
su spark
```

- Run the Apache Spark Pi job in yarn-client mode, using code from **org.apache.spark**:

```
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-client \
    --num-executors 1 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    examples/jars/spark-examples*.jar 10
```

The job should produce an output as shown below. Note the value of pi in the output.

```
17/03/22 23:21:10 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.302805 s
Pi is roughly 3.1445191445191445
```

Job status can also be viewed in a browser by navigating to the YARN ResourceManager Web UI and clicking on the job history server information.
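The same status is available from the command line through the YARN CLI. A short sketch (the application ID shown is illustrative; use the one reported by `yarn application -list` for your run, and note that `yarn logs` requires log aggregation to be enabled):

```
# Find the application ID of the SparkPi run
yarn application -list -appStates ALL | grep SparkPi

# Fetch the aggregated logs for that application (illustrative ID)
yarn logs -applicationId application_1490217230866_0001
```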
### **4.2 WordCount**

WordCount is a simple program that counts how often a word occurs in a text file. The code builds a dataset of (String, Int) pairs called counts, and saves the dataset to a file.

The following example submits the WordCount code to the Scala shell. Select an input file for the Spark WordCount example; any text file can be used as input.

- Log in as user **‘spark’**.
- When the job runs, the library can now use **MinIO** during intermediate processing.
- Navigate to a node with the Spark client and access the spark2-client directory:

```
cd /usr/hdp/current/spark2-client
su spark
```

The following example uses _log4j.properties_ as the input file:

#### **4.2.1 Upload the input file to MinIO:**

```
hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties s3a://testbucket/testdata
```

#### **4.2.2 Run the Spark shell:**

```
./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
```

The command should produce output as shown below (with additional status messages):

```
Spark context Web UI available at http://172.26.236.247:4041
Spark context available as 'sc' (master = yarn, app id = application_1490217230866_0002).
Spark session available as 'spark'.
Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.0-598
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```

- At the `scala>` prompt, submit the job by typing the following commands. Replace node names, file name, and file location with your own values:

```
scala> val file = sc.textFile("s3a://testbucket/testdata")
file: org.apache.spark.rdd.RDD[String] = s3a://testbucket/testdata MapPartitionsRDD[1] at textFile at <console>:24

scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:25

scala> counts.saveAsTextFile("s3a://testbucket/wordcount")
```

Use one of the following approaches to view the job output:

View the output in the Scala shell:

```
scala> counts.count()
364
```

To view the output from MinIO, exit the Scala shell and list the WordCount output directory:

```
hadoop fs -ls s3a://testbucket/wordcount
```

The output should be similar to the following:

```
Found 3 items
-rw-rw-rw-   1 spark spark          0 2019-05-04 01:36 s3a://testbucket/wordcount/_SUCCESS
-rw-rw-rw-   1 spark spark       4956 2019-05-04 01:36 s3a://testbucket/wordcount/part-00000
-rw-rw-rw-   1 spark spark       5616 2019-05-04 01:36 s3a://testbucket/wordcount/part-00001
```
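To inspect the counted words themselves rather than just the file listing, the part files can be streamed straight from MinIO; a short example using the output path from the run above:

```
# Print the first few (word, count) pairs written by the job
hadoop fs -cat s3a://testbucket/wordcount/part-00000 | head -n 10
```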