1 System Configuration
Three CentOS virtual machines have been prepared: master, slave1, and slave2.
- Configure hosts resolution
```bash
vim /etc/hosts
```

```
192.168.10.11 master
192.168.10.12 slave1
192.168.10.13 slave2
```
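Optionally, verify that the names now resolve on each host:

```bash
# Each name should answer from the address configured above
ping -c 1 master
ping -c 1 slave1
ping -c 1 slave2
```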
- Modify hostname
```bash
# Execute the corresponding command on each server
hostnamectl set-hostname master
hostnamectl set-hostname slave1
hostnamectl set-hostname slave2
```
- Turn off firewall
```bash
# Stop the firewall
systemctl stop firewalld.service
# Disable it at boot
systemctl disable firewalld.service
# Check the firewall status
systemctl status firewalld.service
```

Note: the firewall must be turned off on all three servers. A production environment does not allow you to simply disable the firewall; there you must configure policies and open only the specific ports required.
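For reference, a minimal sketch of the production-style alternative: keep firewalld running and open individual ports (the examples use the web UI ports that appear later in this guide):

```bash
# Open individual ports instead of disabling the firewall
sudo firewall-cmd --permanent --add-port=9870/tcp   # NameNode web UI
sudo firewall-cmd --permanent --add-port=8088/tcp   # ResourceManager web UI
sudo firewall-cmd --reload
```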
- Disable SELinux
Modify /etc/selinux/config and set SELINUX=disabled:

```bash
vim /etc/selinux/config
# Set SELINUX=disabled
```
- Restart to make the host name and other configurations take effect.
```bash
shutdown -r now
```
- Create a hadoop user and set a password on every host (any password will do)
```bash
useradd hadoop
passwd hadoop
```
- Grant the hadoop user root (sudo) privileges
```bash
vim /etc/sudoers
```
Add the following line directly below the %wheel line:
```
hadoop ALL=(ALL) NOPASSWD:ALL
```
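To verify the sudo setup (optional), switch to the hadoop user and run a harmless command:

```bash
su - hadoop
sudo whoami   # should print "root" without asking for a password
```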
- Set up SSH password-free login
```bash
# Remove any old keys (the ~/.ssh directory may not exist yet)
rm -rf ~/.ssh
# Generate a key pair
ssh-keygen -t rsa
# Append the public key to the authorized_keys file used by ssh
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Fix the file permissions (skipping this step may cause failures)
chmod 600 ~/.ssh/authorized_keys
# Test password-free login
ssh -vvv master
# Exit the ssh session
exit
```
`-vvv` tells ssh to print verbose debugging output; if password-free login fails, the cause can usually be found there. In my testing, non-root users must fix the permissions of the authorized_keys file, or the password-free setup will not work.
Note: run the commands above on every server before proceeding with the next step.
- Append the master's public key to ~/.ssh/authorized_keys on the other hosts, so that master can log in to them without a password
```bash
# Execute on master
ssh-copy-id master
ssh-copy-id slave1
ssh-copy-id slave2
```
Do the same on the other hosts; a loop for verifying the result is sketched below.
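Once every host has distributed its key, a short loop (optional) confirms that all logins work without a password prompt:

```bash
# Each iteration should print the remote hostname with no password prompt
for host in master slave1 slave2; do
    ssh $host hostname
done
```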
2 JDK Installation
- Unzip the installation package to /opt/module
```bash
# Create the directory
sudo mkdir -p /opt/module
# Change ownership
sudo chown -R hadoop:hadoop /opt/module
# Extract the archive to /opt/module
tar -xvf jdk-8u341-linux-x64.tar.gz -C /opt/module/
# Rename the JDK directory
mv /opt/module/jdk1.8.0_341/ /opt/module/jdk
```
- Configure environment variables
```bash
vim /etc/profile
```

Add the following content:

```bash
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin
```

```bash
# Apply the changes
source /etc/profile
# Test the JDK
java -version
```
3 Hadoop Installation
- Download the Hadoop release archive (hadoop-3.3.6 in this guide)
- Unzip to /opt/module/hadoop
```bash
tar -xvf hadoop-3.3.6.tar.gz -C /opt/module
mv /opt/module/hadoop-3.3.6 /opt/module/hadoop
```
- Configure environment variables in /etc/profile
```bash
# Add the following content
export JAVA_HOME=/opt/module/jdk
export HADOOP_HOME=/opt/module/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
```
- Make environment variables effective
```bash
source /etc/profile
```
- Configure the JDK path in hadoop-env.sh
```bash
# Open the file
vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# Add the JAVA_HOME setting (the JDK was installed to /opt/module/jdk above)
export JAVA_HOME=/opt/module/jdk
# Verify the configuration
hadoop version
```
- Modify permissions
```bash
chown -R hadoop:hadoop /opt/module
```
Repeat the steps above on the other servers; a shortcut using scp is sketched below.
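As a shortcut (a sketch, assuming /opt/module already exists on the slaves with hadoop ownership), the unpacked directories can be copied instead of re-extracting the archives; the /etc/profile and hadoop-env.sh edits still have to be made on each host:

```bash
# Copy the installed JDK and Hadoop to the other hosts
scp -r /opt/module/jdk /opt/module/hadoop hadoop@slave1:/opt/module/
scp -r /opt/module/jdk /opt/module/hadoop hadoop@slave2:/opt/module/
```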
4 Hadoop Configuration Files
Create a new bin directory under /home/hadoop and add it to PATH.
- Create a directory
```bash
mkdir -p /home/hadoop/bin
```
- Modify environment variables
```bash
vim /etc/profile
```

```bash
export JAVA_HOME=/opt/module/jdk
export HADOOP_HOME=/opt/module/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:/home/hadoop/bin
```
- Write the cluster file distribution script xsync
Create a new xsync file
```bash
vim ~/bin/xsync
```
Add the following content
```bash
#!/bin/bash
# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi
# 2. Iterate over the other machines in the cluster
for host in slave1 slave2
do
    echo ==================== $host ====================
    # 3. Iterate over all files/directories and send them one by one
    for file in $@
    do
        # 4. Check whether the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the file name
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo "$file does not exist!"
        fi
    done
done
```
Add execution permissions
```bash
chmod +x ~/bin/xsync
```
Test copying files to other hosts
```bash
xsync /home/hadoop/bin
```
5 Cluster Planning and Deployment
| | master | slave1 | slave2 |
| --- | --- | --- | --- |
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
Planning principle: NameNode, SecondaryNameNode, and ResourceManager each consume a relatively large amount of memory, so they are deployed on different hosts.
- Configuration file description
The configuration files to edit are located in the Hadoop installation directory, $HADOOP_HOME/etc/hadoop.
| Configuration file | Configuration description |
| --- | --- |
| core-site.xml | 1. NameNode address; 2. Hadoop data storage directory; 3. static user for the HDFS web UI |
| hdfs-site.xml | 1. NameNode web access address; 2. SecondaryNameNode web access address |
| yarn-site.xml | 1. shuffle service for MapReduce; 2. ResourceManager address |
| mapred-site.xml | 1. run MapReduce programs on YARN; 2. JobHistoryServer addresses |
- Configure core-site.xml
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:8020</value>
    </property>
    <!-- Hadoop data storage directory; create this directory manually -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop/data</value>
    </property>
    <!-- Static user for the HDFS web UI -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>hadoop</value>
    </property>
</configuration>
```
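Since the comment above says the hadoop.tmp.dir directory has to be created manually, do that now on every host:

```bash
# Create the data directory referenced by hadoop.tmp.dir
mkdir -p /opt/module/hadoop/data
```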
- Configure hdfs-site.xml
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- NameNode web access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>master:9870</value>
    </property>
    <!-- SecondaryNameNode web access address (slave2 per the cluster plan) -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>slave2:9868</value>
    </property>
</configuration>
```
- Configure yarn-site.xml
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Auxiliary service so MapReduce can shuffle -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- ResourceManager address (slave1 per the cluster plan) -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>slave1</value>
    </property>
</configuration>
```
- Configure mapred-site.xml
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Run MapReduce programs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```
- Synchronize configuration files to other hosts
```bash
xsync $HADOOP_HOME/etc/hadoop/
```
Log in to the other hosts to check that the configuration files synchronized successfully.
- Start the cluster
Configure workers
```bash
vim $HADOOP_HOME/etc/hadoop/workers
```
Add content
```
master
slave1
slave2
```
The file must contain no blank lines and no trailing spaces; a quick check is shown below.
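`cat -A` makes the invisible characters explicit, so blank lines and trailing spaces are easy to spot (each line should end directly in `$`):

```bash
cat -A $HADOOP_HOME/etc/hadoop/workers
```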
Synchronize files to other hosts
```bash
xsync $HADOOP_HOME/etc
```
Start the cluster
If this is the first start, the NameNode must be formatted first.
```bash
# Execute on the NameNode node, i.e. master
hdfs namenode -format
```
Note: formatting the NameNode generates a new cluster ID. If the NameNode is reformatted while the DataNodes still hold the old ID, the cluster IDs no longer match and the cluster cannot find its past data. If the cluster breaks during operation and the NameNode really must be reformatted, first stop the namenode and datanode processes, then delete the $HADOOP_HOME/data and $HADOOP_HOME/logs directories on all machines before formatting.
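A sketch of that recovery sequence (destructive, only for a broken test cluster; myhadoop.sh is the start/stop script written later in this guide):

```bash
# 1. Stop all Hadoop processes (or stop them manually on each node)
myhadoop.sh stop
# 2. On EVERY machine, remove the data and log directories
rm -rf $HADOOP_HOME/data $HADOOP_HOME/logs
# 3. Reformat on the NameNode node (master) only
hdfs namenode -format
```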
Start HDFS on the master host
```bash
# Execute on the NameNode node, i.e. master
start-dfs.sh
```
Start YARN on the slave1 host
```bash
# Execute on the ResourceManager node, i.e. slave1
start-yarn.sh
```
View the NameNode and ResourceManager through their web UIs
```
# HDFS NameNode web UI
http://master:9870
# YARN ResourceManager web UI
http://slave1:8088
```
If both pages open normally, the cluster started successfully.
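An optional end-to-end smoke test for HDFS (the paths are only examples):

```bash
# Upload a small file and read it back
hdfs dfs -mkdir -p /tmp/smoke
hdfs dfs -put /etc/hosts /tmp/smoke
hdfs dfs -cat /tmp/smoke/hosts
```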
- View the processes of each node
```
# master host
[root@master hadoop]$ jps
21745 NodeManager
21860 Jps
20682 NameNode
20797 DataNode

# slave1 host
[root@slave1 hadoop]$ jps
9986 DataNode
10535 NodeManager
10410 ResourceManager
10891 Jps

# slave2 host
[root@slave2 hadoop]$ jps
17488 NodeManager
17584 Jps
17397 SecondaryNameNode
17289 DataNode
```
- Configure history server
To view the historical run status of programs, configure a history server.
Configure mapred-site.xml
Add the following content:

```xml
<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
</property>
<!-- History server web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
</property>
```
Synchronize configuration to other hosts
```bash
xsync $HADOOP_HOME/etc
```
Start the history server on master
```bash
mapred --daemon start historyserver
```
Check the processes
```
[root@master hadoop]$ jps
2176 NodeManager
1714 NameNode
2292 Jps
1865 DataNode
1533 JobHistoryServer
```
Visit the web UI
```
http://master:19888/jobhistory
```
- Configure log aggregation
Log aggregation concept: after an application finishes, its run logs are uploaded to HDFS.

Benefit of log aggregation: program run details can be viewed conveniently, which helps development and debugging.

Note: enabling log aggregation requires restarting NodeManager, ResourceManager, and HistoryServer.
Configure yarn-site.xml
Add content
```xml
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Log aggregation server address -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://master:19888/jobhistory/logs</value>
</property>
<!-- Keep logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
```
Sync configuration
```bash
xsync $HADOOP_HOME/etc
```
Stop NodeManager, ResourceManager, and HistoryServer
```bash
# Execute on slave1
stop-yarn.sh
# Execute on master
mapred --daemon stop historyserver
```
Start NodeManager, ResourceManager, and HistoryServer
```bash
# Execute on slave1
start-yarn.sh
# Execute on master
mapred --daemon start historyserver
```
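To produce something to look at in the history server and the aggregated logs, run the example job shipped with the release (the jar path follows the hadoop-3.3.6 layout; the pi arguments are arbitrary):

```bash
# Submit a small sample job, then browse http://master:19888/jobhistory
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 10
```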
- Write common scripts for the Hadoop cluster
- Hadoop cluster start/stop script
Create myhadoop.sh in the ~/bin directory
```bash
vim ~/bin/myhadoop.sh
```
Add content
```bash
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit
fi
case $1 in
"start")
    echo " =================== Starting hadoop cluster ==================="
    echo " --------------- starting hdfs ---------------"
    ssh master "/opt/module/hadoop/sbin/start-dfs.sh"
    echo " --------------- starting yarn ---------------"
    ssh slave1 "/opt/module/hadoop/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh master "/opt/module/hadoop/bin/mapred --daemon start historyserver"
    ;;
"stop")
    echo " =================== Stopping hadoop cluster ==================="
    echo " --------------- stopping historyserver ---------------"
    ssh master "/opt/module/hadoop/bin/mapred --daemon stop historyserver"
    echo " --------------- stopping yarn ---------------"
    ssh slave1 "/opt/module/hadoop/sbin/stop-yarn.sh"
    echo " --------------- stopping hdfs ---------------"
    ssh master "/opt/module/hadoop/sbin/stop-dfs.sh"
    ;;
*)
    echo "Input Args Error..."
    ;;
esac
```
Add execution permissions
```bash
chmod +x ~/bin/myhadoop.sh
```
Test the script
```bash
# Stop
myhadoop.sh stop
# Start
myhadoop.sh start
```
Running the scripts as root may produce the following error:

```
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
```
Solution: add the following parameters to start-dfs.sh, stop-dfs.sh, start-yarn.sh, and stop-yarn.sh.
Add the following parameters at the top of the start-dfs.sh and stop-dfs.sh files
```bash
#!/usr/bin/env bash
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
```
start-yarn.sh and stop-yarn.sh also need the following parameters at the top:
```bash
#!/usr/bin/env bash
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
```
```bash
# Then run the script again
myhadoop.sh stop
myhadoop.sh start
```
Sync files
```bash
xsync ~/bin
```
- Script to view the Java processes on all three hosts
Create jpsall in the ~/bin directory
```bash
vim ~/bin/jpsall
```
Add content
```bash
#!/bin/bash
for host in master slave1 slave2
do
    echo =============== $host ===============
    ssh $host 'jps'
done
```
Add execution permissions
```bash
chmod +x ~/bin/jpsall
```
Test the script
Output like the following indicates everything is running normally:
```
[hadoop@master ~]$ jpsall
=============== master ===============
4448 DataNode
4723 NodeManager
4889 JobHistoryServer
4985 Jps
4299 NameNode
=============== slave1 ===============
4372 DataNode
4550 ResourceManager
5053 Jps
4686 NodeManager
=============== slave2 ===============
3420 SecondaryNameNode
3677 Jps
3310 DataNode
3503 NodeManager
```
If you are the root user, `bash: jps: command not found` may appear. The JAVA environment variables were set in /etc/profile, which is read by interactive login shells; a non-interactive command such as `ssh master jps` reads ~/.bashrc instead, so the PATH configured in /etc/profile is never applied.
So the JAVA environment variables also need to be configured in the user's ~/.bashrc file.
The same approach may also fix the start/stop script issue described above; you can try it.
```bash
vim ~/.bashrc
# Add the following
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin
```
Sync to other hosts
```bash
xsync ~/bin/
```
- Commonly used default port numbers
| Port | Hadoop 2.x | Hadoop 3.x |
| --- | --- | --- |
| NameNode internal communication | 8020 / 9000 | 8020 / 9000 / 9820 |
| NameNode HTTP UI | 50070 | 9870 |
| MapReduce task view (YARN web UI) | 8088 | 8088 |
| History server communication | 19888 | 19888 |