Hadoop cluster environment setup

Hadoop environment setup

  • Foreword
    • Statement and pitfall summary
    • Software preparation
  • Introduction to Hadoop
  • Formal setup
    • Prerequisite environment preparation
      • Install Ubuntu20.04 Server
      • XShell remote connection
      • Install jdk
      • Install hadoop
      • SSH
      • Static IP configuration
    • Cluster setup
      • Cluster creation
      • Modify hostname and IP address
      • SSH configuration
    • Hadoop configuration
      • workers
      • core-site.xml
      • hdfs-site.xml
      • yarn-site.xml
      • mapred-site.xml
    • Scripting
      • xsync
      • myhadoop.sh
      • jpsall

Foreword

Statement and pitfall summary

The software mentioned in this article is for learning only. Where other authors' blogs are referenced, consider it as helping to bring them traffic; if you object, please contact me and it will be handled immediately. Some knowledge points and pictures come from Dark Horse (Heima) and similar sources and will be removed on request in case of infringement. Beyond the basic environment installation, the most valuable part of this tutorial is the mistakes I made myself; I hope sharing them helps readers avoid the same detours.

1. Keep root and ordinary users clearly separated. Unless you must use root to modify or authorize certain configuration files, do not use the root user.
2. SSH password-free login must be configured, and the public key must be present on every machine.
3. File permissions must be set correctly, otherwise inexplicable errors will occur.
4. Any stray space in a configuration file can cause errors, so be careful.
5. If the JDK or Hadoop version is too new or too old, startup may fail.
6. If the NameNode does not appear after startup because a folder was deleted from the hadoop directory, the data files are probably gone and the NameNode needs to be re-initialized (see the sketch after this list).
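
For item 6, here is a minimal recovery sketch – assuming Hadoop is installed in /usr/local/hadoop and hadoop.tmp.dir points to /usr/local/hadoop/tmp, as configured later in this article; note that re-formatting wipes any existing HDFS data.

# Stop HDFS first (run on the master node)
/usr/local/hadoop/sbin/stop-dfs.sh
# Remove the old data and log directories on every node
rm -rf /usr/local/hadoop/tmp /usr/local/hadoop/logs
# Re-format the NameNode on the master node only, then start the cluster again
/usr/local/hadoop/bin/hdfs namenode -format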

Software preparation

  • VMware
    Installed in the local Windows environment
  • XShell and XFTP
    Install it from the XShell official website
    It is worth mentioning that XShell is now available for free – of course, read the description on the official website carefully
  • hadoop3.3.3
  • jdk8
    Here is the installation package for the software I use
    Link: Baidu Netdisk
    Extraction code: 2023
    I am using Ubuntu 20.04, JDK 8, and Hadoop 3.3.3 – the string of 3s is easy to remember

Introduction to Hadoop

Here is a brief overview of the differences between Hadoop versions.

1. Common port numbers
hadoop 3.x

HDFS NameNode internal communication port: 8020 / 9000 / 9820
HDFS NameNode web UI port (for users): 9870
YARN web UI for checking task status: 8088
History server: 19888

hadoop 2.x

HDFS NameNode internal communication port: 8020 / 9000
HDFS NameNode web UI port (for users): 50070
YARN web UI for checking task status: 8088
History server: 19888

2. Commonly used configuration files

3.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, workers
2.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, slaves

Formal setup

Prerequisite environment preparation

Install Ubuntu20.04 Server


The username and password are both test – set them however you like

Note that this should also be changed

Just keep clicking Next. If anything in the configuration is wrong, you can change it later in the virtual machine settings.
Wait for the automatic loading to finish; the options are then as shown in my screenshots.

To change the mirror source, the Alibaba Cloud mirror is recommended

http://mirrors.aliyun.com/ubuntu/
It is worth mentioning that if you need to make changes, use the up and down keys and press Enter to confirm.

Wait for it to download automatically and then restart

XShell remote connection

After restarting, obtain the IP address of the test virtual machine through the ifconfig command.


Note that the Host field is the IP address of your own virtual machine

Connection completed

Install jdk

sudo apt install openjdk-8-jdk

Install via command

After installation, the default installation directory is

cd /usr/lib/jvm

Configure jdk environment

sudo vim ~/.bashrc

Add at the end of the file

export JAVA_HOME=/usr/lib/jvm/jdk8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
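
A note on the path: apt installs OpenJDK 8 under /usr/lib/jvm/java-8-openjdk-amd64 (the exact directory name depends on the architecture), while this article uses /usr/lib/jvm/jdk8 throughout. A symlink is one way to make the two agree – a sketch, assuming the default amd64 package directory:

# Point the shorter jdk8 path used in this article at the apt-installed JDK
sudo ln -s /usr/lib/jvm/java-8-openjdk-amd64 /usr/lib/jvm/jdk8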


Reload the configuration file

source ~/.bashrc


The installation is complete

Install hadoop

After installing Xftp, you can click the Xftp button here in XShell to transfer the Hadoop archive to the virtual machine



Unzip and rename – to facilitate subsequent operations

sudo tar -zxvf hadoop-3.3.3.tar.gz
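
The rename itself is not shown above; a sketch of the remaining steps – assuming the archive was extracted in the current directory as above and that /usr/local/hadoop is the target path used in the rest of this article:

# Move the extracted directory to /usr/local under the shorter name, then hand it over to the ordinary user
sudo mv hadoop-3.3.3 /usr/local/hadoop
sudo chown -R $USER /usr/local/hadoop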


Also add the configuration at the end of the ~/.bashrc configuration file

export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

Afterwards, remember to run:

source ~/.bashrc
hadoop version

Refresh the configuration and then check the hadoop version number. As shown in the figure, the configuration is successful.

SSH

The ssh service is normally installed when the Ubuntu Server virtual machine is first set up.

sudo systemctl status ssh

Check the ssh service status. The normal situation is as shown in the figure.

If there is no ssh service – for demonstration, I removed mine with the following commands:

sudo systemctl stop ssh
sudo apt-get remove openssh-client
sudo apt-get --purge remove openssh-server


Install ssh:

sudo apt-get install openssh-server

Static IP configuration

First check your own IP configuration and note the currently assigned IP address and subnet mask – in network-planning terms, an IPv4 address is made up of four octets; keep the first three octets the same as the assigned address and pick a host number for the last octet, e.g. anything from 2 to 127 is fine.
Also remember the network interface name – the ensXX in front of the red box

Edit the configuration file in the /etc/netplan/ directory
Here is a template for static IP configuration

network:
  ethernets:
    ens33:
      dhcp4: no
      dhcp6: no
      addresses: [192.168.52.111/24]
      optional: true
      gateway4: 192.168.52.2
      nameservers:
        addresses: [114.114.114.114,8.8.8.8]
  version: 2


Restart the network after saving the changes
sudo netplan apply
After restarting, XShell will disconnect due to the change of IP address. Check it in VMware.

You can see that the IP configuration succeeded. Update the IP in the XShell connection settings and you can connect again.
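
As a quick check from the VMware console (interface name and addresses taken from the netplan template above – adjust them to your own):

ip addr show ens33      # should show 192.168.52.111/24
ping -c 3 192.168.52.2  # the gateway from the template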

At this point, the most basic configuration of a virtual machine is completed.

Cluster setup

Cluster creation

A special note here: the test machine above was only used to demonstrate how to create and configure a single machine. My cluster has three machines with the same environment as above, named tingxue, moyao, and mingxi – tingxue will also be the master node of my cluster. My tingxue and moyao are existing Ubuntu machines with a graphical interface, so I reused them – but it makes no difference, the commands to execute are the same.
The three virtual machines can be created by cloning the single machine that has just been configured – cloning of course requires changing some settings, such as the host name and IP address; you can also create them one by one to get more familiar with the Linux commands. Try it yourself.

Modify host name and IP address

sudo vim /etc/hostname

Modify your host name here – for example, this is my tingxue host

sudo vim /etc/hosts

Add the mapping between host names and IP addresses here; the format is ip hostname, one entry per line.
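
For example, on my three-node cluster the entries would look like this (the first address matches the netplan template above; the other two are placeholders – use the static IPs you actually assigned):

192.168.52.111 tingxue
192.168.52.112 moyao
192.168.52.113 mingxi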

SSH configuration

SSH needs to be configured on every machine, and the same goes for the password-free login below, which also has to be repeated on each machine – a pitfall I learned the hard way.
The following three commands can be copied and executed as they are; their meanings are:
Enter the .ssh folder
Generate the ssh private key and public key
Append the public key to the authorized_keys file – this enables password-free login, and authorized_keys is the file that the sshd configuration points to by default

cd ~/.ssh
ssh-keygen -t rsa
cat ./id_rsa.pub >> ./authorized_keys

Next change the ssh configuration:

cd /etc/ssh
sudo vim sshd_config

Configuration changes are shown in the figure
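
The screenshot is not reproduced here; as a reference only, the sshd_config settings usually involved in key-based login are the two below – uncomment them if they are commented out (this is an assumption about what the figure showed, based on common setups):

PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys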

Then change the configuration of ssh on the connecting host

sudo vim ssh_config


This must be changed!!!
I had read many blogs and the video tutorials never mentioned it, so my password-free login kept failing even after I had configured it. By default ssh connects as local-user@hostname, so when connecting from tingxue to moyao it tried to log in as tingxue@moyao – but the account on that machine is moyao@moyao, and the user tingxue does not exist there at all, so it kept failing. After being stuck for a long time, I suddenly realized it while looking at the ssh_config file – of course, Host and User must be followed by your own host name and user name.
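
On tingxue, for example, the relevant ssh_config entries would look like the following (assuming, as in my setup, that the user name on each machine is the same as its host name):

Host moyao
    User moyao
Host mingxi
    User mingxi
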
Restart the ssh service after the configuration is complete

sudo service ssh restart

Configure ssh password-free login
Without password-free login you have to enter the password of the corresponding host every time you connect over SSH, which is very troublesome.
Here is an example. Similar operations need to be done for mingxi: the ssh key pair must also be generated on moyao and mingxi, and then each public key must be distributed to the other two machines in the same way.

cd ~/.ssh
scp ~/.ssh/id_rsa.pub moyao:/home/moyao

This is the configuration on tingxue
Then go to moyao:

cd /home/moyao
cat id_rsa.pub >> ~/.ssh/authorized_keys

You can then choose to delete the id_rsa.pub file
Return to tingxue

ssh moyao


After the configuration is completed, the authorized_keys file of each host should contain the public keys of the three hosts
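
As a quick sanity check, each of the following should print the remote host name without prompting for a password (run it on each of the three machines):

for host in tingxue moyao mingxi
do
    ssh $host hostname
done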

Hadoop configuration

The Hadoop configuration files are in ${HADOOP_HOME}/etc/hadoop. Before configuring Hadoop you need write permission on the hadoop directory; also, the configuration must use your own host names – do not copy it verbatim
This is the effect after we configured it on three hosts

As shown in the figure, if the user of the folder is root, use

sudo chown -R tingxue hadoop

Change the owner of the folder to tingxue – that is, an ordinary user


Among them, we need to modify the five files core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and workers

workers

This file needs to list the host names of the Hadoop cluster – provided they have been configured in /etc/hosts; each name effectively stands for a host IP address.

vim workers
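
For this three-node cluster the file would simply contain the three host names, one per line, replacing the default localhost entry (this assumes all three machines act as workers, which is typical for a small cluster):

tingxue
moyao
mingxi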

core-site.xml

vim core-site.xml

Here is an explanation of where the configuration goes; for the later files, only the configuration code will be shown.

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://tingxue:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
<!-- Configure the static user used for HDFS web page login -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>tingxue</value>
</property>
</configuration>

hdfs-site.xml

vim hdfs-site.xml

<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>tingxue:9870</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>mingxi:9868</value>
</property>
</configuration>

yarn-site.xml

vim yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>moyao</value>
</property>
<!--Inheritance of environment variables-->
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<!-- Turn on the log aggregation function -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Enable log aggregation server address -->
<property>
<name>yarn.log.server.url</name>
<value>http://tingxue:19888/jobhistory/logs</value>
</property>
<!-- Directory where aggregated logs are stored -->
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/tmp/logs</value>
</property>
<!--Set the log retention time to 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>

</configuration>

mapred-site.xml

vim mapred-site.xml

<configuration>
      <property>
               <name>mapreduce.framework.name</name>
               <value>yarn</value>
       </property>
       <property>
               <name>mapreduce.jobhistory.address</name>
               <value>tingxue:10020</value>
       </property>
       <property>
               <name>mapreduce.jobhistory.webapp.address</name>
               <value>tingxue:19888</value>
       </property>
       <property>
               <name>yarn.app.mapreduce.am.env</name>
               <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
       </property>
       <property>
               <name>mapreduce.map.env</name>
               <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
       </property>
       <property>
               <name>mapreduce.reduce.env</name>
               <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
       </property>
</configuration>

The configuration is done on one machine. Once it is complete, it needs to be distributed to the other two hosts – but don't rush: write three scripts first, and they will save you a lot of effort!

Scripting

This script is what I learned while studying – from Dark Horse
After creation, be sure to make the script an executable file

sudo chmod +x filename

This command is usually enough; if it does not work, fall back to chmod 777

xsync

#!/bin/bash
#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi
#2. Traverse all machines in the cluster
for host in tingxue moyao mingxi
do
    echo ==================== $host ====================
    #3. Traverse all files and directories and send them one by one
    for file in $@
    do
        #4. Check whether the file exists
        if [ -e $file ]
        then
            #5. Get the parent directory
            pdir=$(cd -P $(dirname $file); pwd)
            #6. Get the name of the current file
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo $file does not exist!
        fi
    done
done

This script synchronizes the selected files to the preset host through the rsync command

for host in tingxue moyao mingxi

This line lists the hosts to be traversed – remember to change it to your own host names
Place the script file in the /usr/bin directory. The usage method is command + the file name to be synchronized.
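
For example, assuming the script was saved as /usr/bin/xsync and made executable as described above, distributing the Hadoop configuration directory to all three hosts looks like this:

xsync /usr/local/hadoop/etc/hadoop
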
Note
Be sure to format (initialize) the NameNode on the master node before starting Hadoop for the first time!!!

cd /usr/local/hadoop
./bin/hdfs namenode -format

Start Hadoop after the initialization is complete – otherwise the NameNode will not start.
If the NameNode still fails to appear, the problem may lie with file creation and permissions during distribution: on the other two hosts, first create the hadoop folder under /usr/local and give ownership to the ordinary user – as described above – and then synchronize with the xsync command, as sketched below.
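
A minimal sketch of that fix – assuming the same /usr/local/hadoop path on every host; run the first two commands on moyao and mingxi, substituting each host's own ordinary user:

# On moyao and mingxi: create the directory and hand it to the ordinary user
sudo mkdir -p /usr/local/hadoop
sudo chown -R moyao /usr/local/hadoop
# Then, back on tingxue, synchronize the whole installation
xsync /usr/local/hadoop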

myhadoop.sh

#!/bin/bash
if [ $# -lt 1 ]
then
echo "No Args Input..."
exit ;
fi

case $1 in
"start")
echo " =================== Start hadoop cluster ==================="
echo " --------------- Start hdfs ---------------"
ssh tingxue "/usr/local/hadoop/sbin/start-dfs.sh"
echo " --------------- start yarn ---------------"
ssh moyao "/usr/local/hadoop/sbin/start-yarn.sh"
echo " --------------- Start historyserver ---------------"
ssh tingxue "/usr/local/hadoop/bin/mapred --daemon start historyserver"
;;
"stop")
echo " =================== Shut down the hadoop cluster ==================="
echo " --------------- Close historyserver ---------------"
ssh tingxue "/usr/local/hadoop/bin/mapred --daemon stop historyserver"
echo " --------------- close yarn ---------------"
ssh moyao "/usr/local/hadoop/sbin/stop-yarn.sh"
echo " --------------- close hdfs ---------------"
ssh tingxue "/usr/local/hadoop/sbin/stop-dfs.sh"
;;
*)
echo "Input Args Error..."
;;
esac

This script starts and stops the hadoop cluster. Without it, given the configuration above, you would have to start HDFS on tingxue, then YARN on moyao, and then the history server on tingxue again – very troublesome. The script does it all with one command.

# Start
myhadoop.sh start
# Stop
myhadoop.sh stop

jpsall

#!/bin/bash
for host in tingxue moyao mingxi
do
echo =============== $host ===============
ssh $host /usr/lib/jvm/jdk8/bin/jps
done

After starting the services, you need to check on each host which daemons came up. Checking them one by one is too troublesome – save the effort where you can.

At this point, the Hadoop environment has been set up. This post may not contain much that is new, and it may not be as detailed as some other bloggers' work, but it is my personal summary of the setup process and of the pitfalls I ran into.
