
# **Disaggregated HDP Spark and Hive with MinIO**

## **1. Cloud-native Architecture**

![cloud-native](https://github.com/minio/minio/blob/master/docs/bigdata/images/image1.png?raw=true "cloud native architecture")

Kubernetes manages stateless Spark and Hive containers elastically on the compute nodes. Spark has native scheduler integration with Kubernetes. Hive, for legacy reasons, uses the YARN scheduler on top of Kubernetes.

All access to MinIO object storage is via the S3/SQL SELECT API. In addition to the compute nodes, MinIO containers are also managed by Kubernetes as stateful containers with local storage (JBOD/JBOF) mapped as persistent local volumes. This architecture enables multi-tenant MinIO, allowing isolation of data between customers.

MinIO also supports multi-cluster, multi-site federation similar to AWS regions and tiers. Using MinIO Information Lifecycle Management (ILM), you can configure data to be tiered between NVMe-based hot storage and HDD-based warm storage. All data is encrypted with a per-object key. Access control and identity management between the tenants is managed by MinIO using OpenID Connect or Kerberos/LDAP/AD.
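
As a rough sketch of such a tiering setup, the commands below register a warm tier and add a transition rule with the MinIO client (`mc`). The alias `myminio`, the warm-tier endpoint, credentials, bucket names, and the tier name are placeholders, and the exact `mc ilm` syntax can vary between releases, so treat this as illustrative rather than prescriptive.

```
# Register a remote MinIO deployment as a warm tier (placeholder endpoint and credentials)
mc ilm tier add minio myminio WARM-TIER \
  --endpoint https://warm-minio:9000 \
  --access-key WARM-ACCESS-KEY \
  --secret-key WARM-SECRET-KEY \
  --bucket warm-bucket

# Transition objects older than 30 days from hot (NVMe) storage to the warm tier
mc ilm rule add myminio/testbucket --transition-days 30 --transition-tier WARM-TIER
```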

## **2. Prerequisites**

- Install the Hortonworks Distribution using this [guide](https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.1.0/bk_ambari-installation/content/ch_Installing_Ambari.html).
  - [Set up Ambari](https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.1.0/bk_ambari-installation/content/set_up_the_ambari_server.html), which automatically sets up YARN
  - [Install Spark](https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/installing-spark/content/installing_spark.html)
- Install the MinIO Distributed Server using one of the guides below, then verify connectivity as sketched after this list.
  - [Deployment based on Kubernetes](https://min.io/docs/minio/kubernetes/upstream/index.html#quickstart-for-kubernetes)
  - [Deployment based on MinIO Helm Chart](https://github.com/helm/charts/tree/master/stable/minio)
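
Once MinIO is running, a quick connectivity check with the MinIO client (`mc`) helps before wiring up Hadoop. This is a minimal sketch; the alias `myminio`, the endpoint `http://minio:9000`, the credentials `minio`/`minio123`, and the bucket `testbucket` mirror the values used in the configuration and sample-application sections below, so substitute your own.

```
# Point mc at the MinIO deployment (endpoint and credentials as used later in this guide)
mc alias set myminio http://minio:9000 minio minio123

# Create the bucket referenced by the sample applications
mc mb myminio/testbucket

# List buckets to confirm connectivity and credentials
mc ls myminio
```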

## **3. Configure Hadoop, Spark, Hive to use MinIO**

After a successful installation, navigate to the Ambari UI `http://<ambari-server>:8080/` and log in using the default credentials (**username: admin, password: admin**).

![ambari-login](https://github.com/minio/minio/blob/master/docs/bigdata/images/image3.png?raw=true "ambari login")

### **3.1 Configure Hadoop**

Navigate to **Services** -> **HDFS** -> **CONFIGS** -> **ADVANCED** as shown below

![hdfs-configs](https://github.com/minio/minio/blob/master/docs/bigdata/images/image2.png?raw=true "hdfs advanced configs")

Navigate to **Custom core-site** to configure MinIO parameters for the `s3a` connector

![s3a-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image5.png?raw=true "custom core-site")

The helper below flattens configuration listings into `name=value` pairs, which makes the settings in this section easy to inspect:

```
# Install the yq helper used by the kv-pairify alias below
sudo pip install yq
# kv-pairify prints configuration entries as name=value pairs
alias kv-pairify='yq ".configuration[]" | jq ".[]" | jq -r ".name + \"=\" + .value"'
```

Take, for example, a set of 12 compute nodes with an aggregate memory of _1.2TiB_: the settings below give optimal results. Add these entries to _core-site.xml_ to configure _s3a_ with **MinIO**. The most important options here are:

```
cat ${HADOOP_CONF_DIR}/core-site.xml | kv-pairify | grep "mapred"

mapred.maxthreads.generate.mapoutput=2 # Num threads to write map outputs
mapred.maxthreads.partition.closer=0 # Asynchronous map flushers
mapreduce.fileoutputcommitter.algorithm.version=2 # Use the latest committer version
mapreduce.job.reduce.slowstart.completedmaps=0.99 # 99% map, then reduce
mapreduce.reduce.shuffle.input.buffer.percent=0.9 # Min % buffer in RAM
mapreduce.reduce.shuffle.merge.percent=0.9 # Minimum % merges in RAM
mapreduce.reduce.speculative=false # Disable speculative execution for reducers
mapreduce.task.io.sort.factor=999 # Number of streams merged at once while sorting
mapreduce.task.sort.spill.percent=0.9 # Minimum % before spilling to disk
```

S3A is the connector used for S3 and other S3-compatible object stores such as MinIO. MapReduce workloads typically interact with object stores in the same way they do with HDFS. These workloads rely on HDFS atomic rename functionality to finish writing data to the datastore. Object storage operations are atomic by nature and do not require or implement a rename API. The default S3A committer emulates renames through copy and delete APIs, an interaction pattern that causes a significant loss of performance due to write amplification. _Netflix_, for example, developed two new staging committers - the Directory staging committer and the Partitioned staging committer - to take full advantage of native object storage operations. These committers do not require a rename operation. The two staging committers were benchmarked along with another new addition, the Magic committer.

The Directory staging committer was found to be the fastest of the three. Configure the S3A connector with the following parameters for optimal results:

```
cat ${HADOOP_CONF_DIR}/core-site.xml | kv-pairify | grep "s3a"

fs.s3a.access.key=minio
fs.s3a.secret.key=minio123
fs.s3a.path.style.access=true
fs.s3a.block.size=512M
fs.s3a.buffer.dir=${hadoop.tmp.dir}/s3a
fs.s3a.committer.magic.enabled=false
fs.s3a.committer.name=directory
fs.s3a.committer.staging.abort.pending.uploads=true
fs.s3a.committer.staging.conflict-mode=append
fs.s3a.committer.staging.tmp.path=/tmp/staging
fs.s3a.committer.staging.unique-filenames=true
fs.s3a.connection.establish.timeout=5000
fs.s3a.connection.ssl.enabled=false
fs.s3a.connection.timeout=200000
fs.s3a.endpoint=http://minio:9000
fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

fs.s3a.committer.threads=2048 # Number of threads writing to MinIO
fs.s3a.connection.maximum=8192 # Maximum number of concurrent conns
fs.s3a.fast.upload.active.blocks=2048 # Number of parallel uploads
fs.s3a.fast.upload.buffer=disk # Use disk as the buffer for uploads
fs.s3a.fast.upload=true # Turn on fast upload mode
fs.s3a.max.total.tasks=2048 # Maximum number of parallel tasks
fs.s3a.multipart.size=512M # Size of each multipart chunk
fs.s3a.multipart.threshold=512M # Size before using multipart uploads
fs.s3a.socket.recv.buffer=65536 # Read socket buffer hint
fs.s3a.socket.send.buffer=65536 # Write socket buffer hint
fs.s3a.threads.max=2048 # Maximum number of threads for S3A
```

The rest of the optimization options are discussed in the links below:

- [https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)
- [https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html](https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html)

Once the config changes are applied, proceed to restart **Hadoop** services.

![hdfs-services](https://github.com/minio/minio/blob/master/docs/bigdata/images/image7.png?raw=true "hdfs restart services")
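
After the restart, a quick smoke test confirms that the `s3a` connector can reach MinIO. The snippet below is a minimal sketch that assumes the `testbucket` bucket created in the prerequisites; any small local file will do.

```
# Copy a small local file to MinIO through the s3a connector
hadoop fs -copyFromLocal /etc/hadoop/conf/core-site.xml s3a://testbucket/smoke-test/core-site.xml

# List it back to confirm reads work as well
hadoop fs -ls s3a://testbucket/smoke-test/
```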

### **3.2 Configure Spark2**

Navigate to **Services** -> **Spark2** -> **CONFIGS** as shown below

![spark-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image6.png?raw=true "spark config")

Navigate to “**Custom spark-defaults**” to configure MinIO parameters for the `s3a` connector

![spark-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image9.png?raw=true "spark defaults")

Add the following optimal entries for _spark-defaults.conf_ to configure Spark with **MinIO**.

```
spark.hadoop.fs.s3a.access.key minio
spark.hadoop.fs.s3a.secret.key minio123
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.block.size 512M
spark.hadoop.fs.s3a.buffer.dir ${hadoop.tmp.dir}/s3a
spark.hadoop.fs.s3a.committer.magic.enabled false
spark.hadoop.fs.s3a.committer.name directory
spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads true
spark.hadoop.fs.s3a.committer.staging.conflict-mode append
spark.hadoop.fs.s3a.committer.staging.tmp.path /tmp/staging
spark.hadoop.fs.s3a.committer.staging.unique-filenames true
spark.hadoop.fs.s3a.committer.threads 2048 # number of threads writing to MinIO
spark.hadoop.fs.s3a.connection.establish.timeout 5000
spark.hadoop.fs.s3a.connection.maximum 8192 # maximum number of concurrent conns
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.connection.timeout 200000
spark.hadoop.fs.s3a.endpoint http://minio:9000
spark.hadoop.fs.s3a.fast.upload.active.blocks 2048 # number of parallel uploads
spark.hadoop.fs.s3a.fast.upload.buffer disk # use disk as the buffer for uploads
spark.hadoop.fs.s3a.fast.upload true # turn on fast upload mode
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.max.total.tasks 2048 # maximum number of parallel tasks
spark.hadoop.fs.s3a.multipart.size 512M # size of each multipart chunk
spark.hadoop.fs.s3a.multipart.threshold 512M # size before using multipart uploads
spark.hadoop.fs.s3a.socket.recv.buffer 65536 # read socket buffer hint
spark.hadoop.fs.s3a.socket.send.buffer 65536 # write socket buffer hint
spark.hadoop.fs.s3a.threads.max 2048 # maximum number of threads for S3A
```
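
The same settings can also be supplied per job at submit time rather than cluster-wide, which is handy while experimenting. The sketch below passes a few of the entries above through `--conf` flags of `spark-submit`; it is not an exhaustive list, and it assumes the same MinIO endpoint and credentials.

```
./bin/spark-submit \
    --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \
    --conf spark.hadoop.fs.s3a.access.key=minio \
    --conf spark.hadoop.fs.s3a.secret.key=minio123 \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --class org.apache.spark.examples.SparkPi \
    examples/jars/spark-examples*.jar 10
```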

Once the config changes are applied, proceed to restart **Spark** services.

![spark-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image12.png?raw=true "spark restart services")

### **3.3 Configure Hive**

Navigate to **Services** -> **Hive** -> **CONFIGS** -> **ADVANCED** as shown below

![hive-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image10.png?raw=true "hive advanced config")

Navigate to “**Custom hive-site**” to configure MinIO parameters for the `s3a` connector

![hive-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image11.png?raw=true "hive advanced config")

Add the following optimal entries for `hive-site.xml` to configure Hive with **MinIO**.

```
hive.blobstore.use.blobstore.as.scratchdir=true
hive.exec.input.listing.max.threads=50
hive.load.dynamic.partitions.thread=25
hive.metastore.fshandler.threads=50
hive.mv.files.threads=40
mapreduce.input.fileinputformat.list-status.num-threads=50
```

For more information about these options, please visit [https://www.cloudera.com/documentation/enterprise/5-11-x/topics/admin_hive_on_s3_tuning.html](https://www.cloudera.com/documentation/enterprise/5-11-x/topics/admin_hive_on_s3_tuning.html)

![hive-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image13.png?raw=true "hive advanced custom config")

Once the config changes are applied, proceed to restart all Hive services.

![hive-config](https://github.com/minio/minio/blob/master/docs/bigdata/images/image14.png?raw=true "restart hive services")
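
After the restart, Hive can address MinIO directly through `s3a://` locations. The statement below is a minimal sketch of an external table backed by MinIO, run through `beeline`; the HiveServer2 host, table definition, and bucket path are placeholders for illustration.

```
# Connect to HiveServer2 (placeholder host) and create an external table stored in MinIO
beeline -u "jdbc:hive2://<hiveserver2-host>:10000" -n hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS events (id INT, payload STRING)
STORED AS ORC
LOCATION 's3a://testbucket/hive/events';"
```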

## **4. Run Sample Applications**

After installing Hive, Hadoop and Spark successfully, we can now proceed to run some sample applications to see if they are configured appropriately. We can use the Spark Pi and Spark WordCount programs to validate our Spark installation, and explore how to run Spark jobs from the command line and the Spark shell.

### **4.1 Spark Pi**

Test the Spark installation by running the following compute-intensive example, which calculates pi by “throwing darts” at a circle. The program generates points in the unit square ((0,0) to (1,1)) and counts how many points fall within the unit circle within the square. Since that ratio approximates pi/4, multiplying it by 4 approximates pi.

Follow these steps to run the Spark Pi example:

- Log in as user **‘spark’**.
- When the job runs, the library can now use **MinIO** during intermediate processing.
- Navigate to a node with the Spark client and access the spark2-client directory:

```
cd /usr/hdp/current/spark2-client
su spark
```

- Run the Apache Spark Pi job in yarn-client mode, using code from **org.apache.spark**:

```
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-client \
    --num-executors 1 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    examples/jars/spark-examples*.jar 10
```

The job should produce output as shown below. Note the value of pi in the output.

```
17/03/22 23:21:10 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.302805 s
Pi is roughly 3.1445191445191445
```

Job status can also be viewed in a browser by navigating to the YARN ResourceManager web UI and clicking through to the job history server information.
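
The YARN CLI offers a quick alternative to the web UI; the commands below are a small sketch, and the application ID shown is a placeholder.

```
# List recently finished applications
yarn application -list -appStates FINISHED

# Show the status of a specific application (placeholder ID)
yarn application -status application_1490217230866_0002
```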

### **4.2 WordCount**

WordCount is a simple program that counts how often a word occurs in a text file. The code builds a dataset of (String, Int) pairs called counts and saves the dataset to a file.

The following example submits the WordCount code to the Scala shell. Select an input file for the Spark WordCount example; any text file can be used as input.

- Log in as user **‘spark’**.
- When the job runs, the library can now use **MinIO** during intermediate processing.
- Navigate to a node with the Spark client and access the spark2-client directory:

```
cd /usr/hdp/current/spark2-client
su spark
```

The following example uses _log4j.properties_ as the input file:

#### **4.2.1 Upload the input file to MinIO:**

```
hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties s3a://testbucket/testdata
```
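
To confirm the object landed in MinIO, list it either through the `s3a` connector or with the `mc` client (assuming the `myminio` alias sketched in the prerequisites):

```
# Through the Hadoop s3a connector
hadoop fs -ls s3a://testbucket/testdata

# Or directly against MinIO with mc
mc ls myminio/testbucket
```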

#### **4.2.2 Run the Spark shell:**

```
./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
```

The command should produce output as shown below (with additional status messages):

```
Spark context Web UI available at http://172.26.236.247:4041
Spark context available as 'sc' (master = yarn, app id = application_1490217230866_0002).
Spark session available as 'spark'.
Welcome to


      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.0-598
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```

- At the _scala>_ prompt, submit the job by entering the following commands. Replace the bucket name, file name, and file location with your values:

```
scala> val file = sc.textFile("s3a://testbucket/testdata")
file: org.apache.spark.rdd.RDD[String] = s3a://testbucket/testdata MapPartitionsRDD[1] at textFile at <console>:24

scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:25

scala> counts.saveAsTextFile("s3a://testbucket/wordcount")
```

Use one of the following approaches to view job output:

View output in the Scala shell:

```
scala> counts.count()
364
```

To view the output in MinIO, exit the Scala shell and list the WordCount output:

```
hadoop fs -ls s3a://testbucket/wordcount
```

The output should be similar to the following:

```
Found 3 items
-rw-rw-rw-   1 spark spark          0 2019-05-04 01:36 s3a://testbucket/wordcount/_SUCCESS
-rw-rw-rw-   1 spark spark       4956 2019-05-04 01:36 s3a://testbucket/wordcount/part-00000
-rw-rw-rw-   1 spark spark       5616 2019-05-04 01:36 s3a://testbucket/wordcount/part-00001
```
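
The part files can also be read back through the same connector to spot-check the actual word counts, for example:

```
# Print the first few (word, count) pairs from the job output
hadoop fs -cat s3a://testbucket/wordcount/part-00000 | head -n 10
```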