Hadoop 3.3.4 Setup (HDFS and YARN Deployment)

Planning

Node    CPU    Memory    Disk
node1   1      4G        20G
node2   1      2G        20G
node3   1      2G        20G

Create a new virtual machine

I use the 1810 image

After the setup is complete, start the installation

After logging in successfully, run the init 0 command to shut down, then clone the virtual machine

Clone virtual machine

One is named node2 and the other is named node3

The memory of node2 and node3 is set to 2G

1. Preparation

1.1 Modify host name

# On node1
[root@localhost ~]# hostnamectl set-hostname node1

# On node2
[root@localhost ~]# hostnamectl set-hostname node2

# On node3
[root@localhost ~]# hostnamectl set-hostname node3

1.2 Modify IP

#node1 IP changed to 192.168.59.101
#node2 IP changed to 192.168.59.102
#node3 IP changed to 192.168.59.103
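
A minimal sketch of setting a static IP on CentOS 7, assuming the interface is named ens33 (check yours with ip addr) and the VMware NAT gateway/DNS is 192.168.59.2; adjust both to match your environment. Shown for node1 only; repeat on node2 and node3 with their own IPADDR:

[root@node1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-ens33
# Set or confirm the following entries:
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.59.101
NETMASK=255.255.255.0
GATEWAY=192.168.59.2
DNS1=192.168.59.2
# Restart the network service to apply the change
[root@node1 ~]# systemctl restart network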

1.3 Modify the hosts file on your computer

The file is in the C:\Windows\System32\drivers\etc directory
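
Add the following mappings to that file (open your editor as administrator):

192.168.59.101 node1
192.168.59.102 node2
192.168.59.103 node3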

1.4 Write the /etc/hosts file on the node

# The file must be edited on node1, node2, and node3. Only node1 is shown below.
[root@node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.59.101 node1
192.168.59.102 node2
192.168.59.103 node3

1.5 Configure SSH password-free login

# The following commands must be executed on node1, node2, and node3. Only node1 is shown below.
[root@node1 ~]# ssh-keygen -t rsa -b 4096   # Press Enter at every prompt
[root@node1 ~]# ssh-copy-id node1
[root@node1 ~]# ssh-copy-id node2
[root@node1 ~]# ssh-copy-id node3

# Create a hadoop user. This must be done on node1, node2, and node3. Only node1 is shown below.
[root@node1 ~]# useradd hadoop
[root@node1 ~]# passwd hadoop
Changing password for user hadoop.
New password:
BAD PASSWORD: The password is shorter than 8 characters
Retype new password:
passwd: all authentication tokens updated successfully.
[root@node1 ~]# su - hadoop   # Switch to the hadoop user

# Configure password-free SSH for the hadoop user as well. Switch to the hadoop user and run the following on node1, node2, and node3. Only node1 is shown below.
# Prerequisite: the hadoop user has been created on node1, node2, and node3.
[hadoop@node1 ~]$ ssh-keygen -t rsa -b 4096   # Press Enter at every prompt
[hadoop@node1 ~]$ ssh-copy-id node1
[hadoop@node1 ~]$ ssh-copy-id node2
[hadoop@node1 ~]$ ssh-copy-id node3
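
To confirm that password-free login works, a quick check: each node should answer without asking for a password.

[hadoop@node1 ~]$ ssh node2 hostname
node2
[hadoop@node1 ~]$ ssh node3 hostname
node3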

1.6 Configuring JDK environment

Upload JDK files

Configure the environment

Remember to switch back to the root user first.

[hadoop@node1 ~]$ su - root
# The following is only executed on node1
[root@node1 ~]# ls
anaconda-ks.cfg jdk-8u381-linux-x64.tar.gz

[root@node1 ~]# mkdir -p /export/server
[root@node1 ~]# tar -zxvf jdk-8u381-linux-x64.tar.gz -C /export/server/

# Create a soft link
[root@node1 ~]# ln -s /export/server/jdk1.8.0_381 /export/server/jdk

[root@node1 ~]# ll /export/server/
total 4
lrwxrwxrwx. 1 root root 27 Oct 22 10:49 jdk -> /export/server/jdk1.8.0_381
drwxr-xr-x. 8 root root 4096 Oct 22 10:49 jdk1.8.0_381

[root@node1 ~]# vi /etc/profile
# Add at the bottom of the file:
export JAVA_HOME=/export/server/jdk
export PATH=$PATH:$JAVA_HOME/bin

# Reload /etc/profile
[root@node1 ~]# source /etc/profile

# Test the Java environment
[root@node1 ~]# java -version
java version "1.8.0_381"
Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)

[root@node1 ~]# javac -version
javac 1.8.0_381

Synchronize files to node2, node3

# Before synchronizing, /export/server must exist on node2 and node3; if it does not, run mkdir -p /export/server there first.

[root@node1 ~]# scp -r /export/server/jdk1.8.0_381 node2:/export/server/
[root@node1 ~]# scp -r /export/server/jdk1.8.0_381 node3:/export/server/

# After synchronization, check that jdk1.8.0_381 exists on node2 and node3. Only node2 is shown below.
[root@node2 ~]# ls /export/server/
jdk1.8.0_381

# Synchronize the /etc/profile file to node2 and node3 (run on node1).
[root@node1 ~]# scp /etc/profile node2:/etc/profile
profile 100% 1819 1.0MB/s 00:00
[root@node1 ~]# scp /etc/profile node3:/etc/profile
profile 100% 1819 943.9KB/s 00:00

# After synchronization, create the soft link on node2 and node3. Only node2 is shown below.
[root@node2 ~]# ln -s /export/server/jdk1.8.0_381 /export/server/jdk
[root@node2 ~]# ll /export/server/
total 4
lrwxrwxrwx. 1 root root 27 Oct 22 10:58 jdk -> /export/server/jdk1.8.0_381
drwxr-xr-x. 8 root root 4096 Oct 22 10:53 jdk1.8.0_381

# Finally, reload /etc/profile and test the Java environment on node2 and node3. Only node2 is shown below.
[root@node2 ~]# source /etc/profile
[root@node2 ~]# java -version
java version "1.8.0_381"
Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)
[root@node2 ~]# javac -version
javac 1.8.0_381

1.7 Turn off the firewall and SELinux

# node1, node2, and node3 must all execute the following commands. Only node1 is shown below.
[root@node1 ~]# systemctl stop firewalld
[root@node1 ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.

[root@node1 ~]# vi /etc/selinux/config
# Change SELINUX=enforcing to SELINUX=disabled
[root@node1 ~]# setenforce 0
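
A quick way to confirm both changes took effect (after setenforce 0, getenforce reports Permissive; it becomes Disabled after the next reboot):

[root@node1 ~]# systemctl is-enabled firewalld
disabled
[root@node1 ~]# getenforce
Permissive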

1.8 Create a snapshot

Create a snapshot for each of node1, node2, and node3.

2. Hadoop planning

NameNode: master node (manager)

DataNode: slave node (worker)

SecondaryNameNode: auxiliary to the master node

node1   NameNode, DataNode, SecondaryNameNode
node2   DataNode
node3   DataNode

3. Deploy HDFS cluster

3.1 Upload & decompress Hadoop to node1

# The following is only executed on node1
[root@node1 ~]# ls
anaconda-ks.cfg hadoop-3.3.4.tar.gz jdk-8u381-linux-x64.tar.gz

[root@node1 ~]# tar -zxvf hadoop-3.3.4.tar.gz -C /export/server/

# Create a soft link
[root@node1 ~]# ln -s /export/server/hadoop-3.3.4 /export/server/hadoop
[root@node1 ~]# ll /export/server/
total 4
lrwxrwxrwx 1 root root 27 Oct 23 06:22 hadoop -> /export/server/hadoop-3.3.4
drwxr-xr-x 10 1024 1024 215 Jul 29 2022 hadoop-3.3.4
lrwxrwxrwx. 1 root root 27 Oct 22 10:49 jdk -> /export/server/jdk1.8.0_381
drwxr-xr-x. 8 root root 4096 Oct 22 10:49 jdk1.8.0_381

3.2 Configuration files

workers         lists the slave nodes (DataNodes)
hadoop-env.sh   Hadoop-related environment variables
core-site.xml   Hadoop core configuration file
hdfs-site.xml   HDFS core configuration file

These files are in /export/server/hadoop/etc/hadoop

3.3 Configuring workers

# The following is only executed on node1
[root@node1 ~]# cd /export/server/hadoop/etc/hadoop/
[root@node1 hadoop]# vi workers
# Delete the localhost line from the file
# Add the following (the hostnames of the three hosts):
node1
node2
node3

3.4 Configuring hadoop-env.sh

# The following is only executed on node1
[root@node1 hadoop]# vi hadoop-env.sh
# Add the following:
export JAVA_HOME=/export/server/jdk
export HADOOP_HOME=/export/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs

3.5 Configure core-site.xml

# The following is only executed on node1
[root@node1 hadoop]# vi core-site.xml
# Add the following between <configuration> and </configuration>:
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node1:8020</value>
    </property>

    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>

3.6 Configure hdfs-site.xml

# The following is only executed on node1
[root@node1 hadoop]# vi hdfs-site.xml
# Add the following between <configuration> and </configuration>:
    <property>
        <name>dfs.datanode.data.dir.perm</name>
        <value>700</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/nn</value>
    </property>

    <property>
        <name>dfs.namenode.hosts</name>
        <value>node1,node2,node3</value>
    </property>

    <property>
        <name>dfs.blocksize</name>
        <value>268435456</value>
    </property>

    <property>
        <name>dfs.namenode.handler.count</name>
        <value>100</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/dn</value>
    </property>

4. Post-configuration operations

# On node1:
[root@node1 hadoop]# mkdir -p /data/nn
[root@node1 hadoop]# mkdir /data/dn

# On node2 and node3:
[root@node2 ~]# mkdir -p /data/dn

[root@node3 ~]# mkdir -p /data/dn

# On node1:
[root@node1 hadoop]# cd /export/server/
[root@node1 server]# scp -r hadoop-3.3.4 node2:`pwd`/
[root@node1 server]# scp -r hadoop-3.3.4 node3:`pwd`/

# Check that node2 and node3 now have hadoop-3.3.4. Only node2 is shown below.
[root@node2 ~]# ll /export/server/
total 4
drwxr-xr-x 10 root root 215 Oct 23 06:50 hadoop-3.3.4
lrwxrwxrwx. 1 root root 27 Oct 22 10:58 jdk -> /export/server/jdk1.8.0_381
drwxr-xr-x. 8 root root 4096 Oct 22 10:53 jdk1.8.0_381

# Create the hadoop soft link on node2 and node3. Only node2 is shown below.
[root@node2 ~]# ln -s /export/server/hadoop-3.3.4 /export/server/hadoop
[root@node2 ~]# ll /export/server/
total 4
lrwxrwxrwx 1 root root 27 Oct 23 07:21 hadoop -> /export/server/hadoop-3.3.4
drwxr-xr-x 10 root root 215 Oct 23 06:50 hadoop-3.3.4
lrwxrwxrwx. 1 root root 27 Oct 22 10:58 jdk -> /export/server/jdk1.8.0_381
drwxr-xr-x. 8 root root 4096 Oct 22 10:53 jdk1.8.0_381

# The following must be executed on node1, node2, and node3. Only node1 is shown below.
[root@node1 ~]# vi /etc/profile
# Add at the bottom of the file:
export HADOOP_HOME=/export/server/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
[root@node1 ~]# source /etc/profile
# Check that hadoop is available
[root@node1 ~]# hadoop version
Hadoop 3.3.4
Source code repository https://github.com/apache/hadoop.git -r a585a73c3e02ac62350c136643a5e7f6095a3dbb
Compiled by stevel on 2022-07-29T12:32Z
Compiled with protoc 3.7.1
From source with checksum fb9dd8918a7b8a5b430d61af858f6ec
This command was run using /export/server/hadoop-3.3.4/share/hadoop/common/hadoop-common-3.3.4.jar

# Execute on node1, node2, and node3 as root
[root@node1 ~]# chown -R hadoop:hadoop /data
[root@node1 ~]# chown -R hadoop:hadoop /export
[root@node1 ~]# ll /
drwxr-xr-x 4 hadoop hadoop 26 Oct 23 06:47 data
drwxr-xr-x. 3 hadoop hadoop 20 Oct 22 10:48 export

5. Format namenode

# Switch to the hadoop user
[root@node1 ~]# su - hadoop
Last login: Sun Oct 22 10:38:44 EDT 2023 on pts/0
[hadoop@node1 ~]$ hadoop namenode -format

# Start the HDFS cluster
[hadoop@node1 ~]$ start-dfs.sh

# node1
[hadoop@node1 ~]$ jps
9089 NameNode
9622 Jps
9193 DataNode
9356 SecondaryNameNode

# node2 and node3
[root@node2 ~]# jps
9092 Jps
9031 DataNode

[root@node3 ~]# jps
8933 Jps
8871 DataNode

# To shut down the HDFS cluster, run:
[hadoop@node1 ~]$ stop-dfs.sh
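
As an extra check from the command line, the dfsadmin report should list all three DataNodes (a quick sketch, run as the hadoop user):

[hadoop@node1 ~]$ hdfs dfsadmin -report | grep -E "Live datanodes|Name:"
# Expect "Live datanodes (3)" and one Name: line per DataNode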

After HDFS starts, open a browser and go to http://node1:9870 or http://192.168.59.101:9870.

Note: to use http://node1:9870 you must have modified the hosts file on your own computer, otherwise the address will not resolve. See 1.3 for details.

If the NameNode web UI loads, the HDFS cluster is basically working.
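
As a final smoke test, a simple file round-trip can be done from node1 (a minimal sketch; the /input path is just an example):

[hadoop@node1 ~]$ hdfs dfs -mkdir -p /input
[hadoop@node1 ~]$ hdfs dfs -put /etc/profile /input/
[hadoop@node1 ~]$ hdfs dfs -ls /input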

6. Create HDFS cluster snapshot

7. Deploy YARN cluster

7.1 Review & Understanding

For the Hadoop HDFS distributed file system, we started:

  • the NameNode process as the management node

  • the DataNode process as the worker

  • the SecondaryNameNode process as the auxiliary

In the same way, for Hadoop YARN distributed resource scheduling we will start:

  • the ResourceManager process as the management node

  • the NodeManager process as the worker node

  • the ProxyServer and JobHistoryServer processes as two auxiliary services

MapReduce runs inside YARN containers and does not start a separate long-running process.

ResourceManager     cluster resource manager
NodeManager         per-node resource manager
ProxyServer         web proxy server, provides security
JobHistoryServer    records job history information and logs

node1   ResourceManager, NodeManager, ProxyServer, JobHistoryServer
node2   NodeManager
node3   NodeManager

7.2 Configuring mapred-env.sh

# The following is only executed on node1
# Remember to switch back to the root user before doing this
[hadoop@node1 hadoop]$ su - root
[root@node1 ~]# cd /export/server/hadoop/etc/hadoop/
[root@node1 hadoop]# vi mapred-env.sh
# Add the following:
export JAVA_HOME=/export/server/jdk
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=1000
export HADOOP_MAPRED_ROOT_LOGGER=INFO,RFA

7.3 Configure mapred-site.xml

# The following is only executed on node1
[root@node1 hadoop]# vi mapred-site.xml
# Add the following between <configuration> and </configuration>:
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>

    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node1:10020</value>
    </property>

    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node1:19888</value>
    </property>

    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>/data/mr-history/tmp</value>
    </property>

    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>/data/mr-history/done</value>
    </property>

    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>

    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>

    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>

7.4 Configure yarn-env.sh

# The following is only executed on node1
[root@node1 hadoop]# vi yarn-env.sh
# Add the following:
export JAVA_HOME=/export/server/jdk
export HADOOP_HOME=/export/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs

7.5 Configure yarn-site.xml

# The following is only executed on node1
[root@node1 hadoop]# vi yarn-site.xml
# Add the following between <configuration> and </configuration>:
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node1</value>
    </property>

    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>/data/nm-local</value>
    </property>

    <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/data/nm-log</value>
    </property>

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <property>
        <name>yarn.log.server.url</name>
        <value>http://node1:19888/jobhistory/logs</value>
    </property>

    <property>
        <name>yarn.web-proxy.address</name>
        <value>node1:8089</value>
    </property>

    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>

    <property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>/tmp/logs</value>
    </property>

    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>

8. Synchronize files to node2 and node3 nodes

# Enter the /export/server/hadoop/etc/hadoop/ directory
[root@node1 hadoop]# cd /export/server/hadoop/etc/hadoop/
# Synchronize all files in this directory to node2 and node3 nodes
[root@node1 hadoop]# scp * node2:`pwd`/
[root@node1 hadoop]# scp * node3:`pwd`/
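
To make sure the synchronization worked, a quick spot check on one of the other nodes (yarn-site.xml is used here only as an example file; this relies on the root SSH keys set up in 1.5):

[root@node1 hadoop]# ssh node2 ls -l /export/server/hadoop/etc/hadoop/yarn-site.xml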

9. Start yarn cluster

# Remember to switch to the hadoop user before starting
[root@node1 hadoop]# su - hadoop
[hadoop@node1 ~]$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers

# To shut down the YARN cluster, run: stop-yarn.sh

# node1
[hadoop@node1 ~]$ jps
9089 NameNode
10389 WebAppProxyServer
9193 DataNode
10233 NodeManager
9356 SecondaryNameNode
10126 ResourceManager
10671 Jps

# node2
[root@node2 ~]# jps
9361 Jps
9268 NodeManager
9031 DataNode

# node3
[root@node3 ~]# jps
9108 NodeManager
8871 DataNode
9199 Jps

# Because the HDFS cluster was not stopped, node1, node2, and node3 still show their HDFS processes as well.

# The history server needs to be started separately
[hadoop@node1 ~]$ mapred --daemon start historyserver
# To stop the history server, run: mapred --daemon stop historyserver
[hadoop@node1 ~]$ jps
9089 NameNode
10755 Jps
10389 WebAppProxyServer
10728 JobHistoryServer
9193 DataNode
10233 NodeManager
9356 SecondaryNameNode
10126 ResourceManager

After YARN starts correctly, open http://node1:8088 or http://192.168.59.101:8088 in your browser.

At this point, the YARN cluster deployment has been basically completed.
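
To verify the whole stack end to end, a test MapReduce job can be submitted, for example the pi estimator bundled with Hadoop 3.3.4 (a quick sketch, run as the hadoop user; the 3 maps / 100 samples arguments are arbitrary):

[hadoop@node1 ~]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 3 100

While the job runs it should appear on the YARN web UI at http://node1:8088, and after it finishes it should be visible in the JobHistoryServer UI at http://node1:19888.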