Manually building a cluster from the original Hadoop components

1. Hadoop cluster planning

Before building the cluster, some preparatory planning is needed: host planning, software planning, user planning, and directory planning. You also need to create a dedicated account and configure password-free (SSH key) login between nodes. Because the firewall is enabled on each node, the ports required by each daemon must be opened as well.

Add the hadoop user and the hadoop user group on each node:

#Add the hadoop user group
groupadd hadoop

#Add the hadoop user and assign it to the hadoop group
useradd -g hadoop hadoop

#Set a password for the hadoop account (the example password used here is itesthadoop123)
passwd hadoop

Set up password-free login:

su - hadoop

ssh-keygen

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

#Copy the public key to the hadoop user's .ssh directory on the other nodes (not /root/.ssh, since we log in as hadoop)
scp ~/.ssh/id_rsa.pub hadoop@hadoopnode2:/home/hadoop/.ssh/

scp ~/.ssh/id_rsa.pub hadoop@hadoopnode3:/home/hadoop/.ssh/

scp ~/.ssh/id_rsa.pub hadoop@hadoopnode4:/home/hadoop/.ssh/

#On each of hadoopnode2 to hadoopnode4, append the copied key to authorized_keys
cd ~/.ssh && cat id_rsa.pub >> authorized_keys
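Key distribution can also be scripted with ssh-copy-id, which appends the key to the remote authorized_keys and fixes permissions in one step. A minimal sketch, run as the hadoop user on hadoopnode1 (node names taken from the host plan below):

#Distribute the public key to every node with ssh-copy-id
for node in hadoopnode1 hadoopnode2 hadoopnode3 hadoopnode4; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$node
done

#Verify that password-free login works from hadoopnode1
for node in hadoopnode1 hadoopnode2 hadoopnode3 hadoopnode4; do
    ssh hadoop@$node hostname
done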

Since the firewall is enabled, execute the following commands to open the required ports:

firewall-cmd --zone=public --permanent --add-port=2181/tcp

firewall-cmd --zone=public --permanent --add-port=2888/tcp

firewall-cmd --zone=public --permanent --add-port=3888/tcp

firewall-cmd --zone=public --permanent --add-port=9000/tcp

firewall-cmd --zone=public --permanent --add-port=50070/tcp

firewall-cmd --zone=public --permanent --add-port=8485/tcp

firewall-cmd --zone=public --permanent --add-port=50010/tcp

firewall-cmd --zone=public --permanent --add-port=16010/tcp

firewall-cmd --zone=public --permanent --add-port=50020/tcp

firewall-cmd --zone=public --permanent --add-port=50075/tcp

firewall-cmd --zone=public --permanent --add-port=16000/tcp

firewall-cmd --zone=public --permanent --add-port=16020/tcp

firewall-cmd --zone=public --permanent --add-port=16030/tcp

firewall-cmd --zone=public --permanent --add-port=8019/tcp

firewall-cmd --reload

firewall-cmd --zone=public --list-ports

This Hadoop HA cluster depends on ZooKeeper. In addition to ZooKeeper's default peer ports 2888 and 3888, several other ports need to be opened:

ZooKeeper client port: 2181

RPC communication port: 9000 (only needs to be opened on the active and standby NameNode nodes)

HTTP communication port: 50070 (same as above, only needed on the NameNode nodes)

JournalNode port: 8485 (must be reachable from the other JournalNode servers for edit-log synchronization)

DataNode port: 50010

The remaining ports in the list cover the DataNode IPC and HTTP ports (50020, 50075), the ZKFC port (8019), and the HBase Master and RegionServer ports (16000, 16010, 16020, 16030).

The above are the default ports and can be changed as needed.
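All of the firewall openings above can also be scripted in a single loop. A minimal sketch with the same port list, run as root on every node:

#Open all required ports in one pass
for port in 2181 2888 3888 9000 50070 8485 50010 50020 50075 8019 16000 16010 16020 16030; do
    firewall-cmd --zone=public --permanent --add-port=${port}/tcp
done
firewall-cmd --reload
firewall-cmd --zone=public --list-ports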

1.1 Host planning

Due to limited resources, the Hadoop cluster reuses the three hosts of the existing ZooKeeper cluster, plus one additional node (hadoopnode4). Which daemons each host runs is shown in Table 1:

Table 1 host planning

| Daemon process | hadoopnode1 / 192.16.109.57 | hadoopnode2 / 192.16.109.58 | hadoopnode3 / 192.16.109.59 | hadoopnode4 / 192.16.109.60 |
| --- | --- | --- | --- | --- |
| NameNode | yes | yes | | |
| DataNode | yes | yes | yes | yes |
| JournalNode | yes | yes | yes | |
| ZooKeeper | yes | yes | yes | |
| HBase Master | yes | yes | | |
| HRegionServer | | | yes | yes |

1.2 Software Planning

Considering the compatibility between various software versions, the software planning is shown in Table 2.

Table 2 Software planning

| Software | Version | Installation path |
| --- | --- | --- |
| CentOS | CentOS 7 | (virtual machine operating system) |
| JDK | JDK 1.8 | /usr/java/jdk1.8.0_331 |
| ZooKeeper | Apache ZooKeeper 3.4.10 | /home/hadoop/app/zookeeper/zookeeper-3.4.10 |
| Hadoop | Apache Hadoop 2.6.5 | /home/hadoop/app/hadoop/hadoop-2.6.5 |
| HBase | Apache HBase 1.2.6 | /home/hadoop/app/hbase/hbase-1.2.6 |

1.3 User Planning

In order to keep the Hadoop cluster environment independent, we create a dedicated hadoop user and user group on each node. The user plan for each node is shown in Table 3.

Table 3 User Planning

| Node name | User group | User |
| --- | --- | --- |
| hadoopnode1 | hadoop | hadoop |
| hadoopnode2 | hadoop | hadoop |
| hadoopnode3 | hadoop | hadoop |
| hadoopnode4 | hadoop | hadoop |

1.4 Directory planning

In order to facilitate the management of each node of the Hadoop cluster, it is necessary to create relevant directories in advance under the hadoop user. The specific directory planning is shown in Table 4.

Table 4 directory plan

| Name | Path |
| --- | --- |
| All software directory | /home/hadoop/app/ |
| All data and log directories | /home/hadoop/data/ |
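With password-free login in place, the planned directories can be created on every node in one pass. A minimal sketch, run as the hadoop user on hadoopnode1:

#Create the software and data/log directories on all four nodes
for node in hadoopnode1 hadoopnode2 hadoopnode3 hadoopnode4; do
    ssh hadoop@$node "mkdir -p /home/hadoop/app /home/hadoop/data"
done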

2. HDFS distributed cluster construction

A Hadoop cluster is composed of two parts, HDFS and YARN (YARN will not be installed for the time being). Here we first build the HDFS distributed cluster.

2.1 HDFS cluster configuration

2.1.1. Download and decompress Hadoop

First download the stable Hadoop release from the Apache archive (https://archive.apache.org/dist/hadoop/common/), then upload it to the /home/hadoop/app/hadoop directory on the hadoopnode1 node and decompress it. The specific operations are as follows.

[hadoop@hadoopnode1 app]$ tar -zxvf hadoop-2.6.5.tar.gz //Decompress the archive

[hadoop@hadoopnode1 app]$ ln -s hadoop-2.6.5 hadoop //Create a symbolic link

[hadoop@hadoopnode1 app]$ cd /home/hadoop/app/hadoop/hadoop-2.6.5/etc/hadoop //Switch to the configuration directory
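If the node has direct Internet access, the tarball can also be downloaded on hadoopnode1 itself with wget instead of being uploaded; the exact file path below is assumed from the usual layout of the Apache archive:

[hadoop@hadoopnode1 app]$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz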

2.1.2. Modify HDFS configuration file

(1) Modify the hadoop-env.sh configuration file

The hadoop-env.sh file configures environment variables for Hadoop; here we mainly set JAVA_HOME to the JDK installation directory. The specific operation is as follows.

[hadoop@hadoopnode1 hadoop]$ vi hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.8.0_331

(2) Modify the core-site.xml configuration file

The core-site.xml file mainly configures the public properties of Hadoop, and each property that needs to be configured is as follows.

[hadoop@hadoopnode1 hadoop]$ vi core-site.xml
<configuration>

    <property>

        <name>fs.defaultFS</name>

        <value>hdfs://itest-hadoop-cluster</value>

    </property>

    <!--The value here refers to the default HDFS path, named itest-hadoop-cluster-->

    <property>

        <name>hadoop.tmp.dir</name>

        <value>/home/hadoop/data/tmp</value>

    </property>

</configuration>
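Note: because hdfs-site.xml below enables automatic failover (dfs.ha.automatic-failover.enabled) and the cluster is later formatted with hdfs zkfc -formatZK, core-site.xml must also point to the ZooKeeper ensemble. Add the following property inside <configuration>, assuming ZooKeeper runs on hadoopnode1 to hadoopnode3 on the default port 2181 as planned above.

    <property>

        <name>ha.zookeeper.quorum</name>

        <value>hadoopnode1:2181,hadoopnode2:2181,hadoopnode3:2181</value>

    </property>

    <!--ZooKeeper ensemble used by the ZKFC for automatic failover-->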

(3) Modify the hdfs-site.xml configuration file

The hdfs-site.xml file mainly configures attributes related to HDFS. Each attribute that needs to be configured is as follows.

[hadoop@hadoopnode1 hadoop]$ vi hdfs-site.xml
<configuration>

  <property>

      <name>dfs.replication</name>

      <value>3</value>

    </property>

  <!--The number of data block copies is 3-->

  <property>

      <name>dfs.permissions</name>

      <value>false</value>

  </property>

  <property>

      <name>dfs.permissions.enabled</name>

      <value>false</value>

  </property>

  <!--Permission checking is disabled (both properties set to false)-->

  <property>

      <name>dfs.nameservices</name>

      <value>itest-hadoop-cluster</value>

  </property>

  <!--Nameservice ID, matching the value of fs.defaultFS; itest-hadoop-cluster is the unified entry point provided by HDFS-->

  <property>

      <name>dfs.ha.namenodes.itest-hadoop-cluster</name>

      <value>nn1,nn2</value>

  </property>

  <!--Logical names of the NameNodes under itest-hadoop-cluster; the names are arbitrary but must not repeat-->

  <property>

      <name>dfs.namenode.rpc-address.itest-hadoop-cluster.nn1</name>

      <value>hadoopnode1:9000</value>

  </property>

  <!--hadoopnode1 rpc address-->

  <property>

      <name>dfs.namenode.http-address.itest-hadoop-cluster.nn1</name>

      <value>hadoopnode1:50070</value>

  </property>

  <!--nn1 http address-->

  <property>

      <name>dfs.namenode.rpc-address.itest-hadoop-cluster.nn2</name>

      <value>hadoopnode2:9000</value>

  </property>

  <!--nn2 rpc address-->

  <property>

      <name>dfs.namenode.http-address.itest-hadoop-cluster.nn2</name>

      <value>hadoopnode2:50070</value>

  </property>

  <!--hadoopnode2 http address-->

  <property>

      <name>dfs.ha.automatic-failover.enabled</name>

      <value>true</value>

    </property>

  <!--Enable automatic failover-->

  <property>

      <name>dfs.namenode.shared.edits.dir</name>

      <value>qjournal://hadoopnode1:8485;hadoopnode2:8485;hadoopnode3:8485/itest-hadoop-cluster</value>

  </property>

  <!--Specify the JournalNode cluster that stores the shared edit log-->

  <property>

      <name>dfs.client.failover.proxy.provider.itest-hadoop-cluster</name>

<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>

    </property>

  <!--Specify the implementation class responsible for failover when the active NameNode of itest-hadoop-cluster fails-->

    <property>

      <name>dfs.journalnode.edits.dir</name>

      <value>/home/hadoop/data/journaldata/jn</value>

    </property>

  <!--Local disk path where the JournalNode stores the edit logs-->

  <property>

      <name>dfs.ha.fencing.methods</name>

      <value>shell(/bin/true)</value>

    </property>

    <property>

        <name>dfs.ha.fencing.ssh.private-key-files</name>

        <value>/home/hadoop/.ssh/id_rsa</value>

    </property>

  <property>

        <name>dfs.ha.fencing.ssh.connect-timeout</name>

        <value>10000</value>

    </property>

    <property>

      <name>dfs.namenode.handler.count</name>

      <value>100</value>

    </property>

</configuration>

(4) Configure slaves file

The slaves file lists the hostnames of the DataNode nodes according to the cluster plan. The specific operation is as follows.

[hadoop@hadoopnode1 hadoop]$ vi slaves

hadoopnode1

hadoopnode2

hadoopnode3

hadoopnode4

(5) Remotely copy the Hadoop installation directory to all nodes

On the hadoopnode1 node, switch to the /home/hadoop/app/hadoop directory, and remotely copy the Hadoop installation directory to the hadoopnode2, hadoopnode3, and hadoopnode4 nodes. The specific operations are as follows.

[hadoop@hadoopnode1 app]$ scp -r hadoop-2.6.5 hadoop@hadoopnode2:/home/hadoop/app/hadoop

[hadoop@hadoopnode1 app]$ scp -r hadoop-2.6.5 hadoop@hadoopnode3:/home/hadoop/app/hadoop

[hadoop@hadoopnode1 app]$ scp -r hadoop-2.6.5 hadoop@hadoopnode4:/home/hadoop/app/hadoop

Then create symbolic links on the hadoopnode2, hadoopnode3, and hadoopnode4 nodes. The specific operations are as follows.

[hadoop@hadoopnode2 hadoop]$ ln -s hadoop-2.6.5 hadoop

[hadoop@hadoopnode3 hadoop]$ ln -s hadoop-2.6.5 hadoop

[hadoop@hadoopnode4 hadoop]$ ln -s hadoop-2.6.5 hadoop
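The commands later in this section call hdfs without the bin/ prefix, which assumes that the Hadoop bin and sbin directories are on the hadoop user's PATH. A minimal sketch of the corresponding ~/.bashrc entries on each node, assuming HADOOP_HOME points at the symbolic link created above:

export JAVA_HOME=/usr/java/jdk1.8.0_331
export HADOOP_HOME=/home/hadoop/app/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Run source ~/.bashrc (or log in again) for the change to take effect.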

2.1.3 Start HDFS cluster service

#Start the Zookeeper cluster

Start the ZooKeeper service on each of the ZooKeeper nodes (hadoopnode1 to hadoopnode3). The specific operations are as follows.

[hadoop@hadoopnode1 zookeeper-3.4.10]$ bin/zkServer.sh start

[hadoop@hadoopnode2 zookeeper-3.4.10]$ bin/zkServer.sh start

[hadoop@hadoopnode3 zookeeper-3.4.10]$ bin/zkServer.sh start
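Before moving on, it is worth confirming that the ensemble has elected a leader; zkServer.sh status prints the role of each node:

[hadoop@hadoopnode1 zookeeper-3.4.10]$ bin/zkServer.sh status
[hadoop@hadoopnode2 zookeeper-3.4.10]$ bin/zkServer.sh status
[hadoop@hadoopnode3 zookeeper-3.4.10]$ bin/zkServer.sh status

One node should report Mode: leader and the other two Mode: follower.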

#Start the Journalnode cluster

Start the JournalNode service on each of the JournalNode nodes (hadoopnode1 to hadoopnode3). The specific operations are as follows.

[hadoop@hadoopnode1 hadoop]$ sbin/hadoop-daemon.sh start journalnode

[hadoop@hadoopnode2 hadoop]$ sbin/hadoop-daemon.sh start journalnode

[hadoop@hadoopnode3 hadoop]$ sbin/hadoop-daemon.sh start journalnode

#Format the primary node NameNode

On the hadoopnode1 node (the primary NameNode), format the NameNode and the HA state in ZooKeeper, then start the NameNode in the foreground. The specific operations are as follows.

[hadoop@hadoopnode1 hadoop]$ bin/hdfs namenode -format //Format the NameNode

[hadoop@hadoopnode1 hadoop]$ bin/hdfs zkfc -formatZK //Format the HA state (ZKFC) in ZooKeeper

[hadoop@hadoopnode1 hadoop]$ bin/hdfs namenode //Start the NameNode in the foreground

#Standby NameNode synchronizes primary node metadata

With the NameNode still running in the foreground on the hadoopnode1 node, execute the following command on the hadoopnode2 node (the standby NameNode) to synchronize the primary node's metadata.

[hadoop@hadoopnode2 hadoop]$ bin/hdfs namenode -bootstrapStandby

#Close the Journalnode cluster

After the hadoopnode2 node has synchronized the primary node's metadata, go back to the hadoopnode1 node and press Ctrl+C to stop the foreground NameNode process, and then stop the JournalNode process on all JournalNode nodes. The specific operations are as follows.

[hadoop@hadoopnode1 hadoop]$ sbin/hadoop-daemon.sh stop journalnode

[hadoop@hadoopnode2 hadoop]$ sbin/hadoop-daemon.sh stop journalnode

[hadoop@hadoopnode3 hadoop]$ sbin/hadoop-daemon.sh stop journalnode

#One-click start HDFS cluster

If there is no problem with the above operations, on the hadoopnode1 node, you can use the script to start all related processes of the HDFS cluster with one click. The specific operations are as follows:

[hadoop@hadoopnode1 hadoop]$ sbin/start-dfs.sh

Note: The first time you install HDFS, you need to format the NameNode. After the HDFS cluster is successfully installed, you can use the start-dfs.sh script to start all the processes of the HDFS cluster with one click.
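A quick way to confirm that start-dfs.sh brought everything up is to run jps on each node and compare the running daemons with the host plan in Table 1. A minimal sketch from hadoopnode1, assuming the JDK bin directory is on the PATH for non-interactive shells:

for node in hadoopnode1 hadoopnode2 hadoopnode3 hadoopnode4; do
    echo "== $node =="
    ssh hadoop@$node jps
done

Per the host plan, you should see NameNode and DFSZKFailoverController on hadoopnode1 and hadoopnode2, DataNode on every DataNode node, and JournalNode and QuorumPeerMain on hadoopnode1 to hadoopnode3.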

2.2 HDFS cluster test

Enter the URL http://hadoopnode1:50070 in the browser, and check the status of the NameNode of the hadoopnode1 node through the Web interface. The result is shown in Figure 2. The status of this node is active, which means that HDFS can provide external services through the NameNode of hadoopnode1 node.

Figure 2 NameNode interface in active state

Enter the URL http://hadoopnode2:50070 in the browser, and check the status of the NameNode of the hadoopnode2 node through the web interface. The result is shown in Figure 3. The status of this node is standby, which means that the NameNode of the hadoopnode2 node cannot provide external services and can only be used as a standby node.

Figure 3 NameNode interface in standby state

Note: Only one NameNode node can be active at a time.
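The active/standby state can also be checked from the command line with hdfs haadmin, using the logical NameNode names nn1 and nn2 configured in hdfs-site.xml:

[hadoop@hadoopnode1 hadoop]$ bin/hdfs haadmin -getServiceState nn1
[hadoop@hadoopnode1 hadoop]$ bin/hdfs haadmin -getServiceState nn2

Each command prints active or standby for the corresponding NameNode.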

Create the words.log file in the /home/hadoop/app/hadoop directory of the hadoopnode1 node, and then upload it to the /test directory of the HDFS file system to check whether HDFS can be used normally. The specific operations are as follows:

#Create the words.log file locally

[hadoop@hadoopnode1 hadoop]$ vi words.log

hadoop hadoop hadoop

spark spark spark spark

flink flink flink

[hadoop@hadoopnode1 hadoop]$ hdfs dfs -mkdir /test 

#Upload the local file words.log

[hadoop@hadoopnode1 hadoop]$ hdfs dfs -put words.log /test

#Check if words.log is uploaded successfully

[hadoop@hadoopnode1 hadoop]$ hdfs dfs -ls /test

/test/words.log
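To double-check the content, the uploaded file can be read back directly from HDFS:

[hadoop@hadoopnode1 hadoop]$ hdfs dfs -cat /test/words.log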

If the above operations are normal, it means that the HDFS distributed cluster is built successfully.

3. Hadoop cluster operation and maintenance management

In a production environment, once the Hadoop cluster is running it is rarely shut down; the daily work is mostly about managing and maintaining the cluster.

3.1 Hadoop cluster process management

Managing Hadoop cluster processes mainly means taking processes such as the NameNode, DataNode, ResourceManager, and NodeManager offline and bringing them back online (the ResourceManager and NodeManager apply only once YARN is installed, which we have skipped for now). The operations for each process are explained below.

NameNode daemon process management

(1) Offline operation

Execute the sbin/hadoop-daemon.sh stop namenode command to shut down the NameNode process. If the NameNode in the Active state is shut down at this time, the standby NameNode will automatically switch to the Active state to provide external services. After the NameNode process is shut down, related maintenance operations can be performed on the node where it resides.

(2) Online operation

After completing the maintenance of the node where the NameNode is located, you can execute the sbin/hadoop-daemon.sh start namenode command to restart the NameNode process. If there is already a NameNode in the Active state in the HDFS cluster, the NameNode just started will run in the Standby state.

DataNode daemon process management

(1) Offline operation

Execute the sbin/hadoop-daemon.sh stop datanode command to shut down the DataNode process. The blocks stored on it will be re-replicated from the remaining copies to other DataNodes to preserve fault tolerance. After the DataNode process has shut down, you can perform related maintenance operations on that node.

(2) Online operation

After completing the maintenance of the node where the DataNode is located, you can execute the sbin/hadoop-daemon.sh start datanode command to restart the DataNode process. Then you can execute the load balancing command to migrate some data blocks in the cluster to the current DataNode node, thereby improving the data storage capacity of the cluster.

ResourceManager daemon process management

(1) Offline operation

Execute the sbin/yarn-daemon.sh stop resourcemanager command to shut down the ResourceManager process. If the Active state ResourceManager is closed at this time, the standby ResourceManager will automatically switch to the Active state to provide external services. After the ResourceManager process is shut down, related maintenance operations can be performed on the node where it resides.

(2) Online operation

After completing the maintenance on the node where the ResourceManager resides, you can run the sbin/yarn-daemon.sh start resourcemanager command to restart the ResourceManager process. If there is already a ResourceManager in the Active state in the YARN cluster, the just started ResourceManager will run in the Standby state.

NodeManager daemon process management

(1) Offline operation

Execute the sbin/yarn-daemon.sh stop nodemanager command to shut down the NodeManager process. At this time, if the node where the current NodeManager is located has task tasks running, the YARN cluster will automatically schedule the task tasks to run on other NodeManager nodes. After waiting for the NodeManager process to shut down, you can perform related maintenance operations on the node.

(2) Online operation

After completing the maintenance on the node where the NodeManager resides, you can run the sbin/yarn-daemon.sh start nodemanager command to restart the NodeManager process. Once the node rejoins the cluster, the YARN scheduler can assign new tasks to it again.

3.2 Hadoop cluster operation and maintenance skills

In actual work, the operation and maintenance of Hadoop clusters involves all aspects. Next, two common operation and maintenance techniques will be introduced.

View log

Whatever errors or exceptions are encountered while the Hadoop cluster is running, the logs are the most important basis for Hadoop operation and maintenance, so the first step is always to check them. The log paths of the main Hadoop processes are as follows.

$HADOOP_HOME/logs/hadoop-hadoop-namenode-hadoopnode1.log

$HADOOP_HOME/logs/hadoop-hadoop-datanode-hadoopnode1.log

Logs can be viewed with ordinary Linux commands such as vi and cat, or followed in real time with tail -f.
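For example, to follow the active NameNode log on hadoopnode1 and filter for warnings and errors while reproducing a problem:

#Follow the NameNode log and filter for warnings and errors
[hadoop@hadoopnode1 hadoop]$ tail -n 200 -f $HADOOP_HOME/logs/hadoop-hadoop-namenode-hadoopnode1.log | grep -iE "warn|error"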

Clean up temporary files

In most cases, frequent cluster operations or poorly configured log output cause log files and temporary files to occupy a large amount of disk space, which directly affects normal HDFS storage. These temporary files can be cleaned up regularly; their paths are as follows.

(1) Temporary file path of HDFS: ${hadoop.tmp.dir}/mapred/staging.

(2) Local temporary file path: ${mapred.local.dir}/mapred/local.
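Cleanup can be scheduled rather than done by hand. A minimal sketch, assuming a 30-day retention period (adjust to your own policy) and that HADOOP_HOME points at the installation directory; only clean the staging directory when no jobs are running:

#Delete rotated Hadoop log files older than 30 days
find $HADOOP_HOME/logs -type f -name "*.log.*" -mtime +30 -delete

#Remove stale staging data under hadoop.tmp.dir (/home/hadoop/data/tmp in this cluster)
find /home/hadoop/data/tmp/mapred/staging -mindepth 1 -mtime +30 -exec rm -rf {} +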

Execute the load balancing script regularly

There are many reasons for HDFS data imbalance, such as adding a DataNode, quickly deleting a large number of files on HDFS, and uneven distribution of computing tasks. Data imbalance will reduce the probability of MapReduce computing localization, thereby reducing the efficiency of job execution. When the Hadoop cluster data is found to be unbalanced, you can execute the Hadoop script sbin/start-balancer.sh to perform load balancing operations.
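The balancer accepts a threshold parameter, which is the allowed deviation (in percent) of each DataNode's utilization from the cluster average; a typical invocation looks like this:

[hadoop@hadoopnode1 hadoop]$ sbin/start-balancer.sh -threshold 10

The balancer writes its progress to a hadoop-hadoop-balancer-*.log file under $HADOOP_HOME/logs.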