Hadoop Cluster Setup

Use three virtual machines to build a Hadoop cluster.

Contents

1. CentOS 7 Minimal installation and configuration

2. Hadoop single node installation

① Modify the host name and map the hosts file

② Install JDK

③ Install Hadoop

④ Local mode test (official WordCount)

3. Hadoop cluster configuration

(1) Clone hadoop102, hadoop103

(2) Cluster distribution script xsync

(3) Root user password-free login configuration

(4) Cluster configuration

1. Cluster deployment planning

2. Configuration file description

3. Configure the cluster

4. Distribute the configured Hadoop configuration files on the cluster

5. Check the distributed files on hadoop102 and hadoop103

(5) Start the whole cluster

1. Configure workers

2. Start the cluster

3. Basic cluster test

(6) Summary of cluster start/stop methods

1. Each module starts/stops separately

2. Each service component starts/stops one by one

(7) Description of common port numbers


1. CentOS 7 Minimal installation and configuration

Centos7 Minimal Version Basic Configuration Record_centos7minimal_YuBooy’s Blog-CSDN Blog

2. Hadoop single node installation

Source: Shang Silicon Valley Big Data Hadoop Tutorial (Hadoop 3.x Installation to Cluster Tuning)_哔哩哔哩_bilibili

Machine configuration (for the base OS setup, see: CentOS 7 Minimal basic installation and configuration):

Host name | hadoop101          | hadoop102                    | hadoop103
IP        | 192.168.139.101    | 192.168.139.102              | 192.168.139.103
Username  | root               | root                         | root
Password  | 123                | 123                          | 123
HDFS      | NameNode, DataNode | DataNode                     | SecondaryNameNode, DataNode
YARN      | NodeManager        | ResourceManager, NodeManager | NodeManager

First configure on hadoop101:

① Modify the host name and map the hosts file;

② Install the JDK and Hadoop, then clone hadoop102 and hadoop103;

③ After cloning, modify the hostname and IP of hadoop102 and hadoop103.

① Modify the host name and map the hosts file

# Change the host name
hostnamectl set-hostname hadoop101

# Modify the hosts mapping file
vim /etc/hosts
# Add:
192.168.139.101 hadoop101
192.168.139.102 hadoop102
192.168.139.103 hadoop103

# Reboot for the new host name to take effect
reboot

Modify the Windows hosts mapping file as well:

# File: C:\Windows\System32\drivers\etc\hosts
# Add:
192.168.139.101 hadoop101
192.168.139.102 hadoop102
192.168.139.103 hadoop103

② Install JDK

The non-minimal version needs to uninstall the built-in JDK first; the minimal version does not.
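For reference, on a non-Minimal install the bundled OpenJDK is commonly removed like this (a hedged sketch; run it only if rpm -qa actually lists java packages — it is not needed on the Minimal version):

# List and remove any bundled OpenJDK packages (non-Minimal installs only)
rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps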

1. Put the JDK and Hadoop packages in the /opt/software directory

2. Installation

# Unzip the JDK to the /opt/module directory
[root@hadoop101 software]# tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt/module/

# Configure the JDK environment variables
# (1) Create a new /etc/profile.d/my_env.sh file
vim /etc/profile.d/my_env.sh
# (2) Add the following content
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
# (3) Source /etc/profile so the new PATH takes effect
source /etc/profile
# (4) Check whether the JDK was installed successfully
java -version

③ Install Hadoop

Hadoop download address: Index of /dist/hadoop/common/hadoop-3.1.3 (e.g. https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/)

# Unzip
cd /opt/software
tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

# Add Hadoop to the environment variables
# (1) Open the /etc/profile.d/my_env.sh file
vim /etc/profile.d/my_env.sh
# (2) Add the following content
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
# (3) Source /etc/profile so the new PATH takes effect
source /etc/profile
# (4) Check whether Hadoop was installed successfully
hadoop version

Hadoop directory structure:

(1) bin directory: scripts for operating the Hadoop-related services (hdfs, yarn, mapred)

(2) etc directory: Hadoop configuration file directory, storing Hadoop's configuration files

(3) lib directory: Hadoop's native libraries (used for data compression and decompression)

(4) sbin directory: scripts for starting and stopping Hadoop-related services

(5) share directory: Hadoop's dependency jars, documentation, and official examples

④ Local Mode Test (Official WordCount)

# 1. Create a wcinput folder
mkdir /opt/module/hadoop-3.1.3/wcinput
# 2. Create a word.txt file
cd /opt/module/hadoop-3.1.3/wcinput
vim word.txt
# 3. Add the following content:
hadoop yarn
hadoop mapreduce
yuyu
yuyu
# 4. Run the example job
cd /opt/module/hadoop-3.1.3
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount wcinput wcoutput
# 5. View the result
cat wcoutput/part-r-00000
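With the word.txt contents above, the result should look roughly like this (word and count, sorted by word):

hadoop	2
mapreduce	1
yarn	1
yuyu	2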

3. Hadoop cluster configuration

Host name | hadoop101       | hadoop102       | hadoop103
IP        | 192.168.139.101 | 192.168.139.102 | 192.168.139.103
Username  | root            | root            | root
Password  | 123             | 123             | 123

(1) Clone hadoop102 and hadoop103

After cloning, modify the IP and host name:

# Modify the IP on hadoop102/hadoop103
vim /etc/sysconfig/network-scripts/ifcfg-ens33
# Modify the host name to hadoop102/hadoop103
hostnamectl set-hostname hadoop102

# Restart
reboot
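For reference, the lines that typically need changing in ifcfg-ens33 look roughly like this (a sketch based on the IP plan above; GATEWAY and DNS1 are assumptions for a typical VMware NAT setup, so adjust them to your own network, and use 192.168.139.103 on hadoop103):

# /etc/sysconfig/network-scripts/ifcfg-ens33 (hadoop102)
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.139.102       # 192.168.139.103 on hadoop103
NETMASK=255.255.255.0
GATEWAY=192.168.139.2        # assumption: adjust to your gateway
DNS1=192.168.139.2           # assumption: adjust to your DNS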

(2) Cluster distribution script xsync

① Install rsync (on all three machines)

yum -y install rsync

# Start the service and enable it at boot
systemctl start rsyncd.service
systemctl enable rsyncd.service

② Create xsync file in /root/bin directory

mkdir -p /root/bin
cd /root/bin
vim xsync

Paste the following:

#!/bin/bash
# 1. Get the number of arguments; exit if none were given
pcount=$#
if [ $pcount -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi

# 2. Loop over all machines in the cluster
for host in hadoop101 hadoop102 hadoop103   # change to your own host names
do
    echo ===================== $host =====================

    # 3. Loop over all files/directories passed in and send them one by one
    for file in $@
    do
        # 4. Check that the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory (resolving symlinks)
            pdir=$(cd -P $(dirname $file); pwd)
            echo pdir=$pdir

            # 6. Get the name of the current file
            fname=$(basename $file)
            echo fname=$fname

            # 7. Create the parent directory on $host if it does not exist
            ssh $host "mkdir -p $pdir"

            # 8. Sync the file to $pdir on $host
            rsync -av $pdir/$fname $USER@$host:$pdir
        else
            echo $file does not exist!
        fi
    done
done

③ Make xsync executable:

chmod 777 xsync

④ Add /root/bin to the global PATH

vim /etc/profile

# Add at the end:
PATH=$PATH:/root/bin
export PATH

# Save and exit, then run the following to make it take effect
source /etc/profile
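With the PATH change in effect, xsync can be called from anywhere. As a quick sanity check (a hypothetical usage example), push the script itself to the other two nodes:

# Distribute /root/bin (including xsync) to hadoop102 and hadoop103
xsync /root/bin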

(3) root user password-free login configuration

① Generate a key pair in the ~/.ssh directory (do this on all 3 nodes)

cd ~/.ssh
ssh-keygen -t rsa
# After entering this command, there will be a prompt, just press Enter

# If prompted [-bash: cd: .ssh: No such file or directory], just run ssh-keygen -t rsa to generate the key directly

② On hadoop101, append its public key to the authorized_keys file

[root@hadoop101 .ssh]# cat id_rsa.pub >> authorized_keys

③ On hadoop102 and hadoop103, copy their public keys into hadoop101's authorized_keys file

# Copy hadoop102's public key to hadoop101
[root@hadoop102 .ssh]# ssh-copy-id -i hadoop101

# Copy hadoop103's public key to hadoop101
[root@hadoop103 .ssh]# ssh-copy-id -i hadoop101

④ On hadoop101, fix the permissions of the ~/.ssh directory and the authorized_keys file

[root@hadoop101 .ssh]# chmod 700 ~/.ssh
[root@hadoop101 .ssh]# chmod 644 ~/.ssh/authorized_keys

⑤ Distribute the authorization file to other nodes

# Copy to hadoop102
[root@hadoop101 .ssh]# scp /root/.ssh/authorized_keys hadoop102:/root/.ssh/

# Copy to hadoop103
[root@hadoop101 .ssh]# scp /root/.ssh/authorized_keys hadoop103:/root/.ssh/

Password-free login is now configured. Note that the first time you ssh to a host you may still be prompted; subsequent logins do not require a password.

[root@hadoop101 .ssh]# ssh hadoop102
[root@hadoop101 .ssh]# ssh hadoop103
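A quick way to verify the whole setup from hadoop101 (a minimal sketch; adjust the host names if yours differ) is to run a remote command on every node — each should answer with its host name without asking for a password:

# Should print hadoop101, hadoop102, hadoop103 without password prompts
for host in hadoop101 hadoop102 hadoop103
do
    ssh $host hostname
done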

(4) Cluster configuration

1. Cluster deployment plan

> NameNode and SecondaryNameNode should not be installed on the same server

> ResourceManager also consumes a lot of memory, so it should not be configured on the same machine as NameNode and SecondaryNameNode.

Host name | hadoop101          | hadoop102                    | hadoop103
IP        | 192.168.139.101    | 192.168.139.102              | 192.168.139.103
Username  | root               | root                         | root
Password  | 123                | 123                          | 123
HDFS      | NameNode, DataNode | DataNode                     | SecondaryNameNode, DataNode
YARN      | NodeManager        | ResourceManager, NodeManager | NodeManager

2. Configuration file description

There are two types of Hadoop configuration files: default configuration files and custom configuration files. You only need to edit a custom configuration file when you want to override a default value; change the corresponding property there.

① Default configuration files:

Default file       | Location inside the Hadoop jars
core-default.xml   | hadoop-common-3.1.3.jar/core-default.xml
hdfs-default.xml   | hadoop-hdfs-3.1.3.jar/hdfs-default.xml
yarn-default.xml   | hadoop-yarn-common-3.1.3.jar/yarn-default.xml
mapred-default.xml | hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml

② Custom configuration files:

The four custom configuration files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored under $HADOOP_HOME/etc/hadoop; modify them to override the default values as needed.
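If you want to look up a default value, one option (a sketch; the jar path below assumes the standard Hadoop 3.1.3 layout) is to extract the default file from its jar with the jar tool:

# Extract core-default.xml from the hadoop-common jar into the current directory
cd /tmp
jar xf $HADOOP_HOME/share/hadoop/common/hadoop-common-3.1.3.jar core-default.xml
less core-default.xml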

3. Configure the cluster

① Configure core-site.xml

[root@hadoop101 hadoop]# cd $HADOOP_HOME/etc/hadoop
[root@hadoop101 hadoop]# vim core-site.xml

The content is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 

<configuration>
    <!-- Specify the address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop101:8020</value>
    </property>

    <!-- Specify the storage directory of hadoop data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>

    <!-- Configure the static user used for HDFS web page login as root -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>root</value>
    </property>
</configuration>

② Configure hdfs-site.xml

[root@hadoop101 hadoop]# vim hdfs-site.xml

The content is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- nn web access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop101:9870</value>
    </property>
    <!-- 2nn web access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop103:9868</value>
    </property>
</configuration>

③ Configure yarn-site.xml

[root@hadoop101 hadoop]# vim yarn-site.xml

The content is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Specify the shuffle service for MapReduce -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Specify the address of ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop102</value>
    </property>

    <!-- Inheritance of environment variables -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

④ Configure mapred-site.xml

[root@hadoop101 hadoop]# vim mapred-site.xml

The content is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the MapReduce program to run on Yarn -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

4. Distribute the configured Hadoop configuration files on the cluster

[root@hadoop101 hadoop]# xsync /opt/module/hadoop-3.1.3/etc/hadoop/

5. Check the distributed files on hadoop102 and hadoop103

[root@hadoop102 ~]# cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
[root@hadoop103 ~]# cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml

(5) Start the whole cluster

1. Configure workers

[root@hadoop101 hadoop]# vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

【Note: No spaces are allowed at the end of the content added in this file, and blank lines are not allowed in the file】

Add the following:

hadoop101
hadoop102
hadoop103

Synchronize the configuration files to all nodes:

[root@hadoop101 hadoop]# xsync /opt/module/hadoop-3.1.3/etc

2. Start the cluster

① Needed for the first startup: format the NameNode on the hadoop101 node

[root@hadoop101 hadoop-3.1.3]# cd /opt/module/hadoop-3.1.3/
[root@hadoop101 hadoop-3.1.3]# hdfs namenode -format

[Note: Formatting the NameNode generates a new cluster ID. Reformatting while the DataNodes still hold the old ID makes the NameNode and DataNode cluster IDs inconsistent, and the cluster can no longer find its past data. If the cluster is being started for the first time, simply format the NameNode on hadoop101. If the cluster fails while running and the NameNode really has to be reformatted, be sure to stop the namenode and datanode processes first, delete the data and logs directories on all machines, and only then format, as sketched below.]
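For reference, a sketch of that reset procedure (destructive, only when a re-format is truly required; paths match the configuration used in this article):

# 1. Stop the daemons
[root@hadoop101 hadoop-3.1.3]# sbin/stop-dfs.sh
[root@hadoop102 hadoop-3.1.3]# sbin/stop-yarn.sh
# 2. Delete the data and logs directories on every node (here via ssh from hadoop101)
for host in hadoop101 hadoop102 hadoop103
do
    ssh $host "rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs"
done
# 3. Re-format the NameNode on hadoop101 only
[root@hadoop101 hadoop-3.1.3]# hdfs namenode -format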

② Start HDFS (on NameNode node: hadoop101)

[root@hadoop101 hadoop-3.1.3]# cd /opt/module/hadoop-3.1.3/
[root@hadoop101 hadoop-3.1.3]# sbin/start-dfs.sh

If the startup encounters the following problems:

【ERROR: Attempting to operate on hdfs namenode as root】

Solution: Two ways to solve ERROR: Attempting to operate on hdfs namenode as root_世界水博客-CSDN博客

vim /etc/profile

# Add:
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

# Make it take effect:
source /etc/profile

③ Start YARN on the node (hadoop102) configured with ResourceManager

[root@hadoop102 hadoop-3.1.3]# cd /opt/module/hadoop-3.1.3/
[root@hadoop102 hadoop-3.1.3]# sbin/start-yarn.sh
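At this point it is worth checking that each node runs the daemons from the deployment plan (a quick sketch relying on the password-free ssh configured earlier; if jps is not found over ssh, just run jps locally on each node):

# List the Java daemons on every node from hadoop101
for host in hadoop101 hadoop102 hadoop103
do
    echo ===== $host =====
    ssh $host jps
done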

④ View the NameNode of HDFS on the web

1. Enter in the browser: http://hadoop101:9870

2. Check the data information stored on HDFS

⑤ View YARN’s ResourceManager on the web

1. Enter in the browser: http://hadoop102:8088

2. View job information running on YARN

3. Basic cluster test

① Upload files to the cluster

Upload a small file

[root@hadoop101 ~]# hadoop fs -mkdir /input
[root@hadoop101 ~]# hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input

Upload a large file

[root@hadoop101 ~]# hadoop fs -put /opt/software/jdk-8u144-linux-x64.tar.gz /

You can view the uploaded files in the NameNode web UI (http://hadoop101:9870).

② Download file

[root@hadoop103 ~]# hadoop fs -get /jdk-8u144-linux-x64.tar.gz ./

③ Execute the wordcount program

[root@hadoop101 hadoop-3.1.3]# cd /opt/module/hadoop-3.1.3
[root@hadoop101 hadoop-3.1.3]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
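The job writes its result to HDFS, not to the local filesystem, so check it with the HDFS shell:

# List and view the job output on HDFS
hadoop fs -ls /output
hadoop fs -cat /output/part-r-00000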

(6) Summary of cluster start/stop methods

1. Start/stop each module separately

① Overall start/stop HDFS (need to be executed on the NameNode node)

start-dfs.sh / stop-dfs.sh

[root@hadoop101 hadoop-3.1.3]# /opt/module/hadoop-3.1.3/sbin/start-dfs.sh

② Overall start/stop YARN (need to be executed on the ResourceManager node)

start-yarn.sh / stop-yarn.sh

[root@hadoop102 hadoop-3.1.3]# /opt/module/hadoop-3.1.3/sbin/start-yarn.sh

2. Start/stop each service component one by one

① Start/stop HDFS components separately

hdfs --daemon start/stop namenode/datanode/secondarynamenode

② Start/stop YARN components separately

yarn --daemon start/stop resourcemanager/nodemanager
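For example (assuming the deployment plan above), to restart a single DataNode on hadoop103 or only the ResourceManager on hadoop102:

# Restart just the DataNode on hadoop103
[root@hadoop103 ~]# hdfs --daemon stop datanode
[root@hadoop103 ~]# hdfs --daemon start datanode

# Restart just the ResourceManager on hadoop102
[root@hadoop102 ~]# yarn --daemon stop resourcemanager
[root@hadoop102 ~]# yarn --daemon start resourcemanager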

(7) Description of common port numbers

Port name                            | Hadoop 2.x  | Hadoop 3.x
NameNode internal communication port | 8020 / 9000 | 8020 / 9000 / 9820
NameNode HTTP UI                     | 50070       | 9870
YARN web UI (view MapReduce jobs)    | 8088        | 8088
History server communication port    | 19888       | 19888

Reference: Shang Silicon Valley Big Data Hadoop Tutorial (Hadoop 3.x Installation to Cluster Tuning)_哔哩哔哩_bilibili