Hadoop 3.3.4 distributed installation

Installation prerequisites: the Java environment has been configured on every machine, and passwordless SSH login has been set up between all machines.
Note: flinkv1, flinkv2, and flinkv3 below are aliases of the three servers.
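
Both prerequisites can be spot-checked from any one node, for example:

java -version
# each ssh should print the remote hostname without prompting for a password
for host in flinkv1 flinkv2 flinkv3; do ssh $host hostname; done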

1. Cluster deployment planning
Note: the NameNode and SecondaryNameNode should not be installed on the same server.
Note: the ResourceManager also consumes a lot of memory, so do not deploy it on the same machine as the NameNode or SecondaryNameNode.
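The layout used throughout this guide (it matches the configuration files in step 8) is:

flinkv1: NameNode, DataNode, NodeManager
flinkv2: ResourceManager, DataNode, NodeManager
flinkv3: SecondaryNameNode, DataNode, NodeManager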

2. Upload the installation package to the Linux system

3. Enter the Hadoop installation package path

[zhangflink@9wmwtivvjuibcd2e ~]$ cd /opt/package/

4. Extract the installation file to /opt/software

[zhangflink@9wmwtivvjuibcd2e package]$ tar -zxvf hadoop-3.3.4.tar.gz -C ../software/

5. Check whether decompression is successful

[zhangflink@9wmwtivvjuibcd2e package]$ cd ../software/
[zhangflink@9wmwtivvjuibcd2e software]$ ls

6. Rename

[zhangflink@9wmwtivvjuibcd2e software]$ mv hadoop-3.3.4/ hadoop
[zhangflink@9wmwtivvjuibcd2e software]$ ls


7. Add Hadoop to environment variables
(1) Obtain the Hadoop installation path

[zhangflink@9wmwtivvjuibcd2e software]$ cd hadoop/
[zhangflink@9wmwtivvjuibcd2e hadoop]$ pwd
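
The output should be /opt/software/hadoop, which is the value used for HADOOP_HOME in the next step.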


(2) Open the /etc/profile file

[zhangflink@9wmwtivvjuibcd2e hadoop]$ sudo vim /etc/profile

Add the Hadoop paths at the end of the profile file (press Shift + G to jump to the end):

> #HADOOP_HOME
> export HADOOP_HOME=/opt/software/hadoop
> export PATH=$PATH:$HADOOP_HOME/bin
> export PATH=$PATH:$HADOOP_HOME/sbin

(3) Exit after saving

:wq

(4) Distribute environment variable files

[zhangflink@9wmwtivvjuibcd2e hadoop]$ /home/zhangflink/bin/xsync /etc/profile

(5) Source the profile so it takes effect (on all 3 nodes)

[zhangflink@9wmwtivvjuibcd2e hadoop]$ source /etc/profile
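
To confirm the variables took effect, check the Hadoop version on each node; it should report Hadoop 3.3.4:

[zhangflink@9wmwtivvjuibcd2e hadoop]$ hadoop version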

8. Configure the cluster
(1) Core configuration file
Configure core-site.xml

[zhangflink@9wmwtivvjuibcd2e hadoop]$ cd etc/
[zhangflink@9wmwtivvjuibcd2e etc]$ cd hadoop/
[zhangflink@9wmwtivvjuibcd2e hadoop]$ vim core-site.xml

Add the following properties between the <configuration> tags at the bottom of the file:

<configuration>
    <!-- Address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://flinkv1:8020</value>
    </property>

    <!-- Storage directory for Hadoop data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/software/hadoop/data</value>
    </property>

    <!-- Static user for HDFS web UI login -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>zhangflink</value>
    </property>

    <!-- Hosts from which zhangflink (superuser) may act as a proxy -->
    <property>
        <name>hadoop.proxyuser.zhangflink.hosts</name>
        <value>*</value>
    </property>

    <!-- Groups whose members zhangflink (superuser) may impersonate -->
    <property>
        <name>hadoop.proxyuser.zhangflink.groups</name>
        <value>*</value>
    </property>

    <!-- Users whom zhangflink (superuser) may impersonate -->
    <property>
        <name>hadoop.proxyuser.zhangflink.users</name>
        <value>*</value>
    </property>
</configuration>

(2) HDFS configuration file

[zhangflink@9wmwtivvjuibcd2e hadoop]$ vim hdfs-site.xml
<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>flinkv1:9870</value>
    </property>

    <!-- SecondaryNameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>flinkv3:9868</value>
    </property>

    <!-- Test environment: set the HDFS replication factor to 1 -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>


(3) YARN configuration file

[zhangflink@9wmwtivvjuibcd2e hadoop]$ vim yarn-site.xml
<configuration>
<!-- Shuffle auxiliary service required by MapReduce -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Specify the address of ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>flinkv2</value>
    </property>

    <!-- Inheritance of environment variables -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>

    <!-- Minimum and maximum memory allocated per container -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>

    <!-- Physical memory available to containers on each NodeManager -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>

    <!-- Disable YARN's physical and virtual memory limit checks -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>

(4) MapReduce configuration file

[zhangflink@9wmwtivvjuibcd2e hadoop]$ vim mapred-site.xml
<configuration>
<!-- Run MapReduce programs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>


(5) Configure workers

[zhangflink@9wmwtivvjuibcd2e hadoop]$ vim workers
flinkv1
flinkv2
flinkv3
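
Note: the workers file must contain exactly one hostname per line, with no trailing spaces and no blank lines; stray whitespace will break the start scripts.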

9. Configure the history server
To view the historical running status of finished jobs, you need to configure the history server.
(1) Configure mapred-site.xml

[zhangflink@9wmwtivvjuibcd2e hadoop]$ vim mapred-site.xml
<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>flinkv1:10020</value>
</property>

<!-- History server web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>flinkv1:19888</value>
</property>
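
Note that start-dfs.sh and start-yarn.sh do not start the history server; once the cluster is up (step 12), it can be started on flinkv1 with:

[zhangflink@9wmwtivvjuibcd2e hadoop]$ mapred --daemon start historyserver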

10. Configure log aggregation
Log aggregation concept: after an application finishes, its run logs are uploaded to HDFS.
Benefits of log aggregation: you can conveniently view program running details, which helps with development and debugging.
Note: enabling log aggregation requires restarting the NodeManager, ResourceManager, and HistoryServer.
(1) Configure yarn-site.xml

<!-- Enable log aggregation function -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<!-- Set the log aggregation server address -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://flinkv1:19888/jobhistory/logs</value>
</property>

<!--Set the log retention time to 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
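
Once log aggregation is enabled and the services restarted, the aggregated logs of a finished application can be fetched from any node (the application ID below is a placeholder):

[zhangflink@9wmwtivvjuibcd2e hadoop]$ yarn logs -applicationId <application-id>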

11. Distribute Hadoop

[zhangflink@9wmwtivvjuibcd2e software]$ /home/zhangflink/bin/xsync hadoop/
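
A quick spot check that the distribution reached the other nodes:

[zhangflink@9wmwtivvjuibcd2e software]$ ssh flinkv2 ls /opt/software/hadoop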

12. Start the cluster
(1) Format the NameNode (first start only)
If the cluster is being started for the first time, format the NameNode on the flinkv1 node. Note: before formatting, stop any NameNode and DataNode processes left over from a previous start, and delete the data and logs directories on every node, as in the sketch below.
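
A minimal cleanup sketch, assuming the data directory from core-site.xml and the default logs directory under the Hadoop home:

[zhangflink@9wmwtivvjuibcd2e hadoop]$ sbin/stop-dfs.sh
[zhangflink@9wmwtivvjuibcd2e hadoop]$ for host in flinkv1 flinkv2 flinkv3; do ssh $host rm -rf /opt/software/hadoop/data /opt/software/hadoop/logs; done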

[zhangflink@9wmwtivvjuibcd2e hadoop]$ bin/hdfs namenode -format


(2) Start HDFS

If an error about the Java environment occurs when starting HDFS, the Java environment variable may not be configured. First check whether the system Java environment is set up correctly (for example, with java -version).

If the system Java environment is fine, the error is caused by hadoop-env.sh not setting the Java path. Configure it as follows.

Edit the hadoop-env.sh file

[zhangflink@9wmwtivvjuibcd2e hadoop]$ vim etc/hadoop/hadoop-env.sh

Find the JAVA_HOME setting and modify it:

export JAVA_HOME=/opt/software/jdk1.8.0_212


Distribute the Hadoop configuration files

[zhangflink@9wmwtivvjuibcd2e hadoop]$ /home/zhangflink/bin/xsync etc/

Start HDFS again

[zhangflink@9wmwtivvjuibcd2e hadoop]$ sbin/start-dfs.sh

Check the running processes

[zhangflink@9wmwtivvjuibcd2e hadoop]$ jps
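
Per the deployment plan, jps should show:

flinkv1: NameNode, DataNode
flinkv2: DataNode
flinkv3: SecondaryNameNode, DataNode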


(3) Start YARN on the node (flinkv2) configured with ResourceManager

[zhangflink@9wmwtivvjuibcd2e-0001 hadoop]$ sbin/start-yarn.sh

Check the running processes

[zhangflink@9wmwtivvjuibcd2e hadoop]$ jps
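
Per the deployment plan, after start-yarn.sh jps should additionally show:

flinkv1: NodeManager
flinkv2: ResourceManager, NodeManager
flinkv3: NodeManager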


(4) View the HDFS web UI at http://flinkv1:9870/ (for a cloud server, use the public IP address and make sure the port has been opened in the security group's inbound rules)

(5) View the SecondaryNameNode web UI at http://flinkv3:9868/ (the address configured in hdfs-site.xml)