Hadoop environment setup
- Write in front
  - Statement and pitfall summary
  - Software preparation
  - Introduction to Hadoop
- Formal construction
  - Prerequisite environment preparation
    - Install Ubuntu 20.04 Server
    - XShell remote connection
    - Install jdk
    - Install hadoop
    - SSH
    - Static IP configuration
  - Cluster setup
    - Cluster creation
    - Modify host name and IP address
    - SSH configuration
    - Hadoop configuration
      - workers
      - core-site.xml
      - hdfs-site.xml
      - yarn-site.xml
      - mapred-site.xml
    - Scripting
      - xsync
      - myhadoop.sh
      - jpsall
Write in front
Statement and pitfall summary
The software mentioned in this article is for learning purposes only. Where other authors' blogs are referenced, consider it a pointer toward their work; if you object, please contact me and it will be handled immediately. Some knowledge points and pictures come from Heima (Dark Horse) tutorials and similar sources, and will be removed on request. Beyond the basic environment installation, the most valuable part of this tutorial is the mistakes I made myself; I hope sharing them helps readers avoid the same detours.
1. Keep root and ordinary users clearly separated. Unless you must use root to modify or authorize a particular configuration file, do not use root at any other time.
2. SSH password-free login must be configured, and the public key must be present on every machine.
3. File permissions must be set correctly, otherwise inexplicable errors will occur.
4. Any extra space in a configuration file can break it, so be careful.
5. If the jdk or hadoop version is too high or too low, startup may fail.
6. After deleting a folder in the hadoop directory, I found there was no NameNode after startup. The data files had probably been deleted, so the NameNode needed to be re-initialized (reformatted).
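Pitfall 3 in practice: before starting Hadoop, confirm the install directory is owned by your ordinary user, not root. This sketch uses a throwaway mktemp directory as a stand-in for /usr/local/hadoop so it is safe to run anywhere.

```shell
dir=$(mktemp -d)               # stand-in for /usr/local/hadoop
owner=$(stat -c '%U' "$dir")   # GNU stat: print the owning user
me=$(whoami)
if [ "$owner" = "$me" ]; then
    echo "ownership OK"
else
    echo "run: sudo chown -R $me $dir"
fi
rm -rf "$dir"
```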
Software preparation
- VMware
Installed in the local Windows environment
- XShell and XFTP
Install from the XShell official website.
It is worth mentioning that XShell is now available for free – but do read the description on the official website carefully.
- hadoop3.3.3
- jdk8
Here are the installation packages for the software I use:
Link: Baidu Netdisk
Extraction code: 2023
I am using Ubuntu 20.04, JDK 8, and Hadoop 3.3.3 – all those 3s roll off the tongue.
Introduction to Hadoop
Here is an overview of the differences between hadoop versions
1. Common port numbers
hadoop3.x
HDFS NameNode internal RPC port: 8020 / 9000 / 9820
HDFS NameNode web UI for users: 9870
YARN web UI for task status: 8088
History server: 19888
hadoop2.x
HDFS NameNode internal RPC port: 8020 / 9000
HDFS NameNode web UI for users: 50070
YARN web UI for task status: 8088
History server: 19888
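The 3.x ports above can be kept handy as shell variables for building the web UI URLs. The hostnames are illustrative here (tingxue is the master node used later in this article):

```shell
NN_RPC_PORT=8020       # NameNode internal RPC (also 9000/9820 in 3.x)
NN_WEB_PORT=9870       # NameNode web UI (was 50070 in Hadoop 2.x)
YARN_WEB_PORT=8088     # YARN task status UI
HISTORY_PORT=19888     # History server UI
echo "NameNode UI: http://tingxue:${NN_WEB_PORT}"
echo "YARN UI:     http://tingxue:${YARN_WEB_PORT}"
echo "History UI:  http://tingxue:${HISTORY_PORT}"
```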
2. Commonly used configuration files
3.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, workers
2.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, slaves
Formal construction
Prerequisite environment preparation
Install Ubuntu20.04 Server
The username and password here are both test – choose whatever you like.
Note that this should also be changed
Just keep going to the next step. If there are any problems with the configuration, you can change it in the virtual machine settings.
Wait for automatic loading, and then the options are as shown in my picture.
To change the mirror, Alibaba Cloud is recommended:
http://mirrors.aliyun.com/ubuntu/
It is worth mentioning that if you need to make changes, use the up and down keys and press Enter to confirm.
Wait for it to download automatically and then restart
XShell remote connection
After restarting, obtain the IP address of the test virtual machine through the ifconfig command.
Note that the host location is the IP address of your own virtual machine
Connection completed
Install jdk
sudo apt install openjdk-8-jdk
Install via command
After installation, the default installation directory is under /usr/lib/jvm:
cd /usr/lib/jvm
Note: the apt package installs into a folder named java-8-openjdk-amd64. The rest of this article uses the path /usr/lib/jvm/jdk8, so rename that folder or create a symlink called jdk8 pointing to it.
Configure jdk environment
sudo vim ~/.bashrc
Add at the end of the file
export JAVA_HOME=/usr/lib/jvm/jdk8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
Reload the configuration file
source ~/.bashrc
The installation is complete
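A quick sketch of what the PATH line above does: since ${JAVA_HOME}/bin is prepended, its java wins over any other copy on the machine (the jdk8 path is the one assumed in this guide):

```shell
JAVA_HOME=/usr/lib/jvm/jdk8        # assumed install path from this guide
DEMO_PATH=${JAVA_HOME}/bin:$PATH   # same composition as in ~/.bashrc
first=${DEMO_PATH%%:*}             # extract the first PATH entry
echo "first PATH entry: $first"
```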
Install hadoop
After installing Xftp, you can click here
Unzip and rename – to make subsequent operations easier:
sudo tar -zxvf hadoop-3.3.3.tar.gz -C /usr/local
sudo mv /usr/local/hadoop-3.3.3 /usr/local/hadoop
Also add the configuration at the end of the ~/.bashrc configuration file
export HADOOP_HOME=/usr/local/hadoop export PATH=$HADOOP_HOME/bin:$PATH
Remember to reload the configuration and verify afterwards:
source ~/.bashrc
hadoop version
Refresh the configuration and then check the hadoop version number. As shown in the figure, the configuration is successful.
SSH
When the Server virtual machine was set up, the ssh service was already installed (it is selected during the Ubuntu Server installation).
sudo systemctl status ssh
Check the ssh service status. The normal situation is as shown in the figure.
If there is no ssh service – I removed ssh on my machine for demonstration:
sudo systemctl stop ssh
sudo apt-get remove openssh-client
sudo apt-get --purge remove openssh-server
Install ssh:
sudo apt-get install openssh-server
Static IP configuration
First check your own IP configuration, noting the assigned IP and subnet mask. In network-planning terms, an IPv4 address is four bytes; keep the first three octets as assigned and pick an unused value for the last one – anything in 2–127 is fine.
Also note the network interface name – the ensXX in front of the red box.
Edit the configuration file in the /etc/netplan/ directory.
Here is a template for static IP configuration
network:
  ethernets:
    ens33:
      dhcp4: no
      dhcp6: no
      addresses: [192.168.52.111/24]
      optional: true
      gateway4: 192.168.52.2
      nameservers:
        addresses: [114.114.114.114,8.8.8.8]
  version: 2
Restart the network after saving the changes. (Tip: sudo netplan try applies the configuration with an automatic rollback if you cannot confirm, which is safer over an SSH session.)
sudo netplan apply
After restarting, XShell will disconnect due to the change of IP address. Check it in VMware.
You can see that the IP configuration is successful. If you change the XShell connection configuration IP, you can also connect successfully.
At this point, the most basic configuration of a virtual machine is completed.
Cluster setup
Cluster creation
A special note: the test machine above was only used to demonstrate how to create and configure a single-node environment. My cluster has three machines with the same environment as above, named tingxue, moyao, and mingxi – tingxue is also the cluster's master node. tingxue and moyao are existing Ubuntu machines with a graphical interface, so I reused them – but it makes no difference; the commands to execute are the same.
You can create the three virtual machines by cloning the single machine configured above – cloning requires changing some settings such as the hostname and IP address – or you can create them one by one, which is good practice with Linux commands; try it yourself.
Modify host name and IP address
sudo vim /etc/hostname
Modify your host name here – for example, this is my tingxue host
sudo vim /etc/hosts
Add the mapping between host name and ip address here, the format is ip hostname
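For example, with this article's three hosts the added lines might look like this (the last two IPs are illustrative; only .111 was actually configured above):

```
192.168.52.111 tingxue
192.168.52.112 moyao
192.168.52.113 mingxi
```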
SSH configuration
SSH must be configured on every machine, including the password-free login below, which likewise needs the same steps on every machine – a pitfall I hit myself.
The following three commands can be completely copied and executed, and their meanings are:
Open the ssh folder
Create ssh secret key and public key
Add the public key to the authorized_keys file – this is for password-free login, and this file is the file specified by default in the configuration file
cd ~/.ssh
ssh-keygen -t rsa
cat ./id_rsa.pub >> ./authorized_keys
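The same three steps can be done non-interactively. This sketch uses a throwaway directory so it is safe to try anywhere; on a real node you would work in ~/.ssh directly:

```shell
demo=$(mktemp -d)                             # stand-in for ~/.ssh
ssh-keygen -t rsa -N "" -q -f "$demo/id_rsa"  # empty passphrase, quiet
cat "$demo/id_rsa.pub" >> "$demo/authorized_keys"
chmod 700 "$demo" && chmod 600 "$demo/authorized_keys"  # sshd insists on strict permissions
nfiles=$(ls "$demo" | wc -l)                  # id_rsa, id_rsa.pub, authorized_keys
echo "files generated: $nfiles"
rm -rf "$demo"
```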
Next change the ssh configuration:
cd /etc/ssh
sudo vim sshd_config
Configuration changes are shown in the figure
Then change the configuration of ssh on the connecting host
sudo vim ssh_config
This must be changed! ! !
I had read many blogs, but the video tutorials never covered this, so after configuring password-free login I still could not log in. By default, ssh connects as local-user@hostname: when connecting from tingxue to moyao it tried tingxue@moyao, but the user on that machine is moyao, so the user tingxue did not exist at all and the login kept failing. After being stuck for a long time, I suddenly understood while reading the ssh_config file – set Host and User to the target machine's hostname and user.
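Following that fix, the per-host entries in ssh_config (or ~/.ssh/config) look like this – the usernames are assumptions based on each machine's user matching its hostname, as in this article's setup:

```
Host moyao
    HostName moyao
    User moyao
Host mingxi
    HostName mingxi
    User mingxi
```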
Restart the ssh service after the configuration is complete
sudo service ssh restart
Configure ssh password-free login
Without configuring password-free operation, you need to enter the password of the corresponding host every time you connect to SSH, which is very troublesome.
Here is one example; similar operations are needed for mingxi. The ssh key pairs must also be generated on moyao and mingxi, and then each machine must set up password-free login toward the other two.
cd ~/.ssh
scp ~/.ssh/id_rsa.pub moyao:/home/moyao
This is the configuration on tingxue
Then go to moyao:
cd /home/moyao
cat id_rsa.pub >> ~/.ssh/authorized_keys
You can then choose to delete the id_rsa.pub file
Return to tingxue
ssh moyao
After the configuration is completed, the authorized_keys file of each host should contain the public keys of the three hosts
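A quick way to verify is to count the lines in authorized_keys: one public key per host, so three here. This sketch builds a stand-in file with dummy keys just to show the check:

```shell
auth=$(mktemp)                         # stand-in for ~/.ssh/authorized_keys
echo 'ssh-rsa AAAA... tingxue@tingxue' >> "$auth"
echo 'ssh-rsa BBBB... moyao@moyao'     >> "$auth"
echo 'ssh-rsa CCCC... mingxi@mingxi'   >> "$auth"
count=$(wc -l < "$auth")               # one line per public key
echo "public keys on file: $count"
rm -f "$auth"
```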
Hadoop configuration
Hadoop's configuration files are in ${HADOOP_HOME}/etc/hadoop. Before configuring, make sure you have write permission on the hadoop directory; also, change the configuration to your own hostnames – do not copy it verbatim.
This is the effect after we configured it on three hosts
As shown in the figure, if the user of the folder is root, use
sudo chown -R tingxue hadoop
Change the owner of the folder to tingxue – that is, an ordinary user
Among them, we need to modify the five files core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and workers
workers
This file lists the hostnames of the Hadoop cluster nodes – provided they have been mapped in /etc/hosts; effectively each name stands for a host IP address.
vim workers
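With the three hosts used in this article, the file ends up containing just the hostnames (delete the default localhost line first):

```
tingxue
moyao
mingxi
```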
core-site.xml
vim core-site.xml
Here is an explanation of the location of the configuration. Only the configuration code will be placed later.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://tingxue:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
    </property>
    <!-- Configure the static user used for HDFS web page login -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>tingxue</value>
    </property>
</configuration>
hdfs-site.xml
vim hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>tingxue:9870</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>mingxi:9868</value>
    </property>
</configuration>
yarn-site.xml
vim yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>moyao</value>
    </property>
    <!-- Inheritance of environment variables -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <!-- Turn on the log aggregation function -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Log aggregation server address -->
    <property>
        <name>yarn.log.server.url</name>
        <value>http://tingxue:19888/jobhistory/logs</value>
    </property>
    <!-- Remote application log directory -->
    <property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>/tmp/logs</value>
    </property>
    <!-- Set the log retention time to 7 days -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>
mapred-site.xml
vim mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>tingxue:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>tingxue:19888</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
    </property>
</configuration>
The configuration is done on one machine. After the configuration is completed, it needs to be distributed to the other two hosts – but don’t worry, write three scripts first, which will make things twice the result with half the effort!
Scripting
This script is what I learned while studying – from Dark Horse
After creation, be sure to make the script an executable file
sudo chmod +x filename
Just use this command, if not, just 777
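What +x actually adds, illustrated on a throwaway file (mktemp creates it with mode 600, so the execute bit is clearly the chmod's doing):

```shell
f=$(mktemp)
chmod +x "$f"
perm=$(stat -c '%A' "$f")   # GNU stat: symbolic mode string
echo "mode after chmod +x: $perm"
rm -f "$f"
```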
xsync
#!/bin/bash
# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Argument!
    exit;
fi
# 2. Traverse all machines in the cluster
for host in tingxue moyao mingxi
do
    echo ==================== $host ====================
    # 3. Traverse all files/directories and send them one by one
    for file in $@
    do
        # 4. Check whether the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the name of the current file
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo $file does not exist!
        fi
    done
done
This script synchronizes the selected files to the preset host through the rsync command
for host in tingxue moyao mingxi
This line lists the hosts to iterate over – remember to change it to your own.
Place the script file in the /usr/bin directory. Usage is the command followed by the file or directory to synchronize, e.g. xsync ~/.bashrc.
Note
Be sure to initialize the master node before starting Hadoop! ! !
cd /usr/local/hadoop
./bin/hdfs namenode -format
Start hadoop only after initialization is complete – otherwise the NameNode will not start.
If the NameNode still does not appear, the problem may be with creating files: on the other two hosts, first create the hadoop folder under /usr/local and grant it to the ordinary user – as described above – and then synchronize with the xsync command.
myhadoop.sh
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi
case $1 in
"start")
    echo " =================== Start hadoop cluster ==================="
    echo " --------------- Start hdfs ---------------"
    ssh tingxue "/usr/local/hadoop/sbin/start-dfs.sh"
    echo " --------------- Start yarn ---------------"
    ssh moyao "/usr/local/hadoop/sbin/start-yarn.sh"
    echo " --------------- Start historyserver ---------------"
    ssh tingxue "/usr/local/hadoop/bin/mapred --daemon start historyserver"
;;
"stop")
    echo " =================== Shut down hadoop cluster ==================="
    echo " --------------- Stop historyserver ---------------"
    ssh tingxue "/usr/local/hadoop/bin/mapred --daemon stop historyserver"
    echo " --------------- Stop yarn ---------------"
    ssh moyao "/usr/local/hadoop/sbin/stop-yarn.sh"
    echo " --------------- Stop hdfs ---------------"
    ssh tingxue "/usr/local/hadoop/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
This script starts and stops the hadoop cluster. Without it, given the configuration above, you would have to start hdfs on tingxue, then yarn on moyao, then the history server on tingxue again – very troublesome. The script solves it in one command.
# start
myhadoop.sh start
# stop
myhadoop.sh stop
jpsall
#!/bin/bash
for host in tingxue moyao mingxi
do
    echo =============== $host ===============
    ssh $host /usr/lib/jvm/jdk8/bin/jps
done
After starting the services, you need to check the daemon status on every host. Logging into each one to run jps is too troublesome – save the effort where you can.
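Given the configuration in this article (NameNode and history server on tingxue, ResourceManager on moyao, SecondaryNameNode on mingxi, DataNode and NodeManager on all three workers), jpsall should report roughly the following daemons (Jps itself also appears in each list):

```
tingxue: NameNode, DataNode, NodeManager, JobHistoryServer
moyao:   ResourceManager, DataNode, NodeManager
mingxi:  SecondaryNameNode, DataNode, NodeManager
```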
At this point, the Hadoop environment is set up. This blog may not be very innovative, and it may not be as detailed as some others, but it is a personal record of the build process and of the pitfalls I ran into along the way.