Hadoop 3.3.6 Distributed Cluster Setup

1 System Configuration

Three CentOS virtual machines have been prepared: master, slave1, and slave2.

  1. Configure hosts resolution
vim /etc/hosts

192.168.10.11 master
192.168.10.12 slave1
192.168.10.13 slave2
  2. Modify the hostname
#Execute the following commands on the corresponding server
hostnamectl set-hostname master
hostnamectl set-hostname slave1
hostnamectl set-hostname slave2
  3. Turn off the firewall
#Close firewall
systemctl stop firewalld.service
#Disable boot startup
systemctl disable firewalld.service
#View firewall status
systemctl status firewalld.service
# Note: the firewall must be turned off on all three servers. A production environment does not allow disabling the firewall outright; there you can only configure firewall policies and open the specific ports that are needed (see the sketch below).
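
A minimal sketch of that approach, assuming firewalld and showing only the web/RPC ports this guide uses (a real deployment needs the full set of Hadoop ports):

# Keep firewalld running and open only the required ports instead of disabling it
sudo firewall-cmd --permanent --add-port=8020/tcp    # NameNode RPC
sudo firewall-cmd --permanent --add-port=9870/tcp    # NameNode web UI
sudo firewall-cmd --permanent --add-port=8088/tcp    # YARN ResourceManager web UI
sudo firewall-cmd --permanent --add-port=19888/tcp   # JobHistoryServer web UI
sudo firewall-cmd --reload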
  4. Disable SELinux
Modify /etc/selinux/config to set SELINUX=disabled
vim /etc/selinux/config
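
For example, the change can be made non-interactively and then checked like this (a sketch; the new value only takes effect after the reboot in the next step):

# Set SELINUX=disabled (adjust the pattern if the current value is permissive)
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/selinux/config
# Optionally switch to permissive mode immediately, without waiting for a reboot
sudo setenforce 0
# Check the runtime state and the configured value
getenforce
grep '^SELINUX=' /etc/selinux/config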
  5. Restart so that the hostname and other configuration changes take effect.
shutdown -r now
  6. Create a hadoop user and set a password. Do this on every host; any password will do.
useradd hadoop
passwd hadoop
  7. Give the hadoop user sudo (root) privileges
vim /etc/sudoers

Add the following line directly below the %wheel line:

hadoop ALL=(ALL) NOPASSWD:ALL
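
After saving, the sudoers syntax can optionally be checked (a small sanity check; visudo -c validates the file without opening an editor):

# Should report that /etc/sudoers parses OK
sudo visudo -c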
  8. Set up passwordless SSH login
# Remove any old keys (the ~/.ssh directory may not exist yet)
rm -rf ~/.ssh
# Generate a key pair (press Enter at every prompt)
ssh-keygen -t rsa
# Append the public key to the authorized_keys file used by ssh
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Fix file permissions (skipping this may cause passwordless login to fail)
chmod 600 ~/.ssh/authorized_keys
# Test password-free login
ssh -vvv master
# Exit ssh login
exit

The -vvv option makes ssh print debugging information; if passwordless login fails, you can find the reason in that output. In my tests, non-root users had to fix the permissions of the authorized_keys file, otherwise passwordless login did not work.
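
For reference, these are the permissions that usually need to hold for passwordless login to work (a quick check, assuming the default home-directory layout):

# The home directory must not be group/world writable;
# ~/.ssh should be 700 and authorized_keys 600
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
ls -ld ~ ~/.ssh ~/.ssh/authorized_keys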

Note: run the commands above on every server before proceeding to the next step.

  9. Add the master's public key to ~/.ssh/authorized_keys on the other hosts, so that master can log in to them without a password
# Execute on master
ssh-copy-id master
ssh-copy-id slave1
ssh-copy-id slave2

Run the same commands on the other hosts as well.

2 JDK installation

  1. Unzip the installation package to /opt/module
# Create directory
sudo mkdir -p /opt/module
# Modify permissions
sudo chown -R hadoop:hadoop /opt/module
# Unzip the installation package to /opt/module
tar -xvf jdk-8u341-linux-x64.tar.gz -C /opt/module/
# Modify jdk directory name
 mv /opt/module/jdk1.8.0_341/ /opt/module/jdk
  2. Configure environment variables
vim /etc/profile
# Add content
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin

# Make the variable effective
source /etc/profile

# Test jdk
java -version

3 Install hadoop

  1. Download the Hadoop 3.3.6 binary package (hadoop-3.3.6.tar.gz)
  2. Unzip to /opt/module/hadoop
tar -xvf hadoop-3.3.6.tar.gz -C /opt/module
mv /opt/module/hadoop-3.3.6 /opt/module/hadoop
  3. Add environment variables to /etc/profile
# Add content
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/opt/module/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
  4. Make the environment variables take effect
source /etc/profile
  5. Configure the JDK path in hadoop-env.sh
# Open the file
vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Add the JAVA_HOME configuration

export JAVA_HOME=/opt/module/jdk

#Verify configuration
hadoop version
  6. Modify permissions
chown -R hadoop:hadoop /opt/module

Repeat steps 1 through 6 above on the other servers (or copy the installed directories, as sketched below).
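
A sketch of the copy-based alternative, assuming /opt/module already exists on slave1 and slave2 with the ownership set earlier and that root login over ssh is allowed for copying /etc/profile:

# Run on master: copy the JDK and Hadoop installations
scp -r /opt/module/jdk /opt/module/hadoop slave1:/opt/module/
scp -r /opt/module/jdk /opt/module/hadoop slave2:/opt/module/
# Copy the environment variables, then run source /etc/profile on each host
scp /etc/profile root@slave1:/etc/profile
scp /etc/profile root@slave2:/etc/profile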

4 Hadoop configuration files

Create a new bin directory under /home/hadoop and add it to PATH.

  1. Create a directory
mkdir -p /home/hadoop/bin
  2. Modify environment variables
vim /etc/profile
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/opt/module/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:/home/hadoop/bin
  3. Write the cluster file distribution script xsync

Create a new xsync file

vim ~/bin/xsync

Add the following content

#!/bin/bash

# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi

# 2. Iterate over the other machines in the cluster
for host in slave1 slave2
do
    echo ==================== $host ====================
    # 3. Iterate over all files and directories, sending them one by one
    for file in "$@"
    do
        # 4. Check whether the file exists
        if [ -e "$file" ]
        then
            # 5. Get the absolute parent directory (resolving symlinks)
            pdir=$(cd -P "$(dirname "$file")"; pwd)
            # 6. Get the name of the current file
            fname=$(basename "$file")
            # Make sure the directory exists on the target host, then sync
            ssh $host "mkdir -p $pdir"
            rsync -av "$pdir/$fname" $host:"$pdir"
        else
            echo "$file does not exist!"
        fi
    done
done

Add execution permissions

chmod +x ~/bin/xsync

Test copying files to other hosts

xsync /home/hadoop/bin

5 Cluster planning and deployment

        master                slave1                          slave2
HDFS    NameNode, DataNode    DataNode                        SecondaryNameNode, DataNode
YARN    NodeManager           ResourceManager, NodeManager    NodeManager

Planning principle: the NameNode, SecondaryNameNode, and ResourceManager each use a relatively large amount of memory, so they are deployed on different hosts.

  1. Configuration file description

The configuration files that need to be configured are located in the hadoop installation directory $HADOOP_HOME/etc/hadoop

Configuration file    What it configures
core-site.xml         1 NameNode address
                      2 Hadoop data storage directory
                      3 Static user name for HDFS web UI login
hdfs-site.xml         1 NameNode web access address
                      2 SecondaryNameNode web access address
yarn-site.xml         1 Auxiliary shuffle service for MapReduce
                      2 ResourceManager address
mapred-site.xml       1 Run MapReduce programs on YARN
                      2 JobHistoryServer addresses
  2. Configure core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the address of NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:8020</value>
    </property>

    <!-- Specify the storage directory for hadoop data, you need to create the directory manually -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop/data</value>
    </property>

    <!-- Configure the static user used for HDFS web page login as hadoop -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>hadoop</value>
    </property>
</configuration>
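
The hadoop.tmp.dir directory referenced above does not exist yet; it can be created on each host, for example:

# Run as the hadoop user on every host
mkdir -p /opt/module/hadoop/data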
  3. Configure hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- NameNode web access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>master:9870</value>
    </property>

    <!-- SecondaryNameNode (2nn) web access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>slave2:9868</value>
    </property>
</configuration>
  4. Configure yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify MR to perform shuffle -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Specify the address of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>slave1</value>
    </property>

</configuration>
  5. Configure mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
        <!-- Specify the MapReduce program to run on Yarn -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
  6. Synchronize the configuration files to the other hosts
xsync $HADOOP_HOME/etc/hadoop/

Log in to the other hosts to check that the configuration files were synchronized successfully, for example as shown below.
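
A quick remote check of one of the values set above (a sketch using the fs.defaultFS property from core-site.xml):

# Verify that core-site.xml on slave1 contains the NameNode address
ssh slave1 "grep -A1 fs.defaultFS /opt/module/hadoop/etc/hadoop/core-site.xml"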

  7. Start the cluster

Configure workers

vim $HADOOP_HOME/etc/hadoop/workers

Add content

master
slave1
slave2

The file must contain no blank lines and no trailing spaces at the end of any line (a quick check is shown below).
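
A quick way to spot stray blank lines or trailing spaces is to print the file with visible line endings (cat -A marks each line end with $):

# Every line should end directly with $, with no empty lines
cat -A $HADOOP_HOME/etc/hadoop/workers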

Synchronize files to other hosts

xsync $HADOOP_HOME/etc

Start the cluster

If this is the first start, the NameNode must be formatted first.

# Execute on the NameNode node, i.e. on master
hdfs namenode -format

Note: formatting the NameNode generates a new cluster ID. If the DataNodes still hold the old ID, the cluster IDs of the NameNode and DataNodes no longer match and the cluster cannot find its past data. If the cluster reports errors and the NameNode really needs to be reformatted, be sure to stop the namenode and datanode processes first, and delete the $HADOOP_HOME/data and $HADOOP_HOME/logs directories on all machines before formatting (a sketch of the procedure follows).
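
A sketch of that clean-up procedure, using the paths from this guide (double-check before deleting anything):

# Stop HDFS and YARN first (on master and slave1 respectively)
stop-dfs.sh
stop-yarn.sh
# Remove the data and log directories on every machine
for host in master slave1 slave2; do
    ssh $host "rm -rf /opt/module/hadoop/data /opt/module/hadoop/logs"
done
# Format the NameNode again (on master)
hdfs namenode -format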

Start HDFS on the master host

# Execute on the NameNode node, i.e. on master
start-dfs.sh

Start YARN on slave1 host

# Execute on the ResourceManager node, i.e. on slave1
start-yarn.sh

View the NameNode and ResourceManager through their web UIs

# View the HDFS NameNode
http://master:9870
# View the YARN ResourceManager
http://slave1:8088

If the page can be opened normally, the cluster is started successfully.
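
To further verify that HDFS and YARN work end to end, you can run one of the example jobs bundled with the distribution (the examples jar ships under share/hadoop/mapreduce; adjust the version in the path if yours differs):

# Submit a small sample MapReduce job to the cluster
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 10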

  8. View the processes on each node
# master host
[root@master hadoop]$ jps
21745 NodeManager
21860 Jps
20682 NameNode
20797 DataNode

# slave1 host
[root@slave1 hadoop]$ jps
9986 DataNode
10535 NodeManager
10410 ResourceManager
10891 Jps

# slave2 host
[root@slave2 hadoop]$ jps
17488 NodeManager
17584 Jps
17397 SecondaryNameNode
17289 DataNode
  9. Configure the history server

To be able to view the historical runs of completed programs, you can configure a history server.

Configure mapred-site.xml

# Add the following content
<!-- Historical server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
</property>

<!-- History server web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
</property>

Synchronize configuration to other hosts

xsync $HADOOP_HOME/etc

Start the history server on master

mapred --daemon start historyserver

Check the processes

[root@master hadoop]$ jps
2176 NodeManager
1714 NameNode
2292 Jps
1865 DataNode
1533 JobHistoryServer

Visit the web UI

http://master:19888/jobhistory
  10. Configure log aggregation

    Log aggregation concept: after an application finishes, its run logs are uploaded to HDFS.

    Benefit of log aggregation: you can conveniently view the details of program runs, which helps development and debugging.

    Note: enabling log aggregation requires restarting the NodeManager, ResourceManager and HistoryServer.

Configure yarn-site.xml

Add content

<!-- Enable log aggregation function -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Set the log aggregation server address -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://master:19888/jobhistory/logs</value>
</property>
<!--Set the log retention time to 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

Sync configuration

xsync $HADOOP_HOME/etc

Stop the NodeManager, ResourceManager and HistoryServer

# Execute on slave1
stop-yarn.sh
# Execute on master
mapred --daemon stop historyserver

Restart the NodeManager, ResourceManager and HistoryServer

# Execute on slave1
start-yarn.sh
# Execute on master
mapred --daemon start historyserver
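
Once log aggregation is enabled and a job has run, its aggregated logs can also be retrieved from the command line (the application id comes from the ResourceManager web UI or from yarn application -list):

# List finished applications, then fetch the aggregated logs for one of them
yarn application -list -appStates FINISHED
yarn logs -applicationId <application_id>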
  11. Write common scripts for the Hadoop cluster
  • hadoop cluster start and stop script

Create myhadoop.sh in the ~/bin directory

vim ~/bin/myhadoop.sh

Add content

#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi

case $1 in
"start")
        echo " =================== Start hadoop cluster ==================="

        echo " --------------- Start hdfs ---------------"
        ssh master "/opt/module/hadoop/sbin/start-dfs.sh"
        echo " --------------- start yarn ---------------"
        ssh slave1 "/opt/module/hadoop/sbin/start-yarn.sh"
        echo " --------------- Start historyserver ---------------"
        ssh slave2 "/opt/module/hadoop/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== Shut down the hadoop cluster ==================="

        echo " --------------- Close historyserver ---------------"
        ssh master "/opt/module/hadoop/bin/mapred --daemon stop historyserver"
        echo " --------------- close yarn ---------------"
        ssh slave1 "/opt/module/hadoop/sbin/stop-yarn.sh"
        echo " --------------- close hdfs ---------------"
        ssh slave2 "/opt/module/hadoop/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

Add execution permissions

chmod +x ~/bin/myhadoop.sh

Test the script

# Stop
myhadoop.sh stop
# Start
myhadoop.sh start

When the scripts are run as root, errors like the following may appear:

ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.

Solution: add the following parameters to start-dfs.sh, stop-dfs.sh, start-yarn.sh and stop-yarn.sh.

Add the following parameters at the top of the start-dfs.sh and stop-dfs.sh files

#!/usr/bin/env bash
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

start-yarn.sh and stop-yarn.sh also need the following parameters added at the top

#!/usr/bin/env bash
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
# Then run the script again
# Stop
myhadoop.sh stop
# Start
myhadoop.sh start

Sync files

xsync ~/bin
  • Script to view the Java processes on all three hosts

Create jpsall in the ~/bin directory
vim ~/bin/jpsall

Add content

#!/bin/bash

for host in master slave1 slave2
do
        echo =============== $host ===============
        ssh $host 'jps'
done

Add execution permissions

chmod +x ~/bin/jpsall

Test the script
Output like the following is normal:

[hadoop@master ~]$ jpsall
=============== master ===============
4448 DataNode
4723 NodeManager
4889 JobHistoryServer
4985 Jps
4299 NameNode
=============== slave1 ===============
4372 DataNode
4550 ResourceManager
5053 Jps
4686 NodeManager
=============== slave2 ===============
3420 SecondaryNameNode
3677 Jps
3310 DataNode
3503 NodeManager

If you run this as the root user, you may see bash: jps: command not found. This is because the JAVA environment variables were set in /etc/profile, which is only sourced for login shells; a command run via ssh master jps is non-interactive, so only ~/.bashrc is read on the remote node.
Therefore the JAVA environment variables also need to be configured in the user's ~/.bashrc.
The same approach may also resolve the root-user issue with the start/stop script above; you can try it.

vim ~/.bashrc
# Add the following
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin

Sync to other hosts

xsync ~/bin/

Commonly used default port numbers

Port                                             Hadoop 2.x   Hadoop 3.x
NameNode internal communication port             8020/9000    8020/9000/9820
NameNode HTTP UI                                 50070        9870
Port for viewing MapReduce tasks (YARN web UI)   8088         8088
History server communication port                19888        19888