LZO configuration of HDFS (1)

Table of Contents

1. Introduction to the lzo algorithm

2. Using the lzo algorithm in Hadoop

3. The lzo algorithm on HDFS

4. Configuring lzo compression on HDFS

(1) Compile

a) Environment preparation

1. Download the Linux version of Maven

2. Upload and decompress the Maven package

3. Configure the Maven environment

4. Install the required packages through yum

b) Download, install and compile lzo

c) Configure lzo


1. Introduction to the lzo algorithm

The Lempel-Ziv-Oberhumer algorithm, LZO for short, is one of the fastest lossless data compression and decompression algorithms. lzop is a command-line tool that implements the LZO algorithm; compared with the common gzip, lzop provides faster compression and decompression. The LZO library was originally written in ANSI C, and implementations or bindings now exist for languages such as Perl, Python and Java.

LZO is designed for processing speed. Its decompression is faster than its compression, and the compression level can be adjusted as needed without affecting decompression speed. Decompression is simple and requires no additional memory, and the compression is lossless.

Advantages:

  • Fast compression and decompression
  • Does not consume excessive CPU resources
  • Reasonable compression ratio
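
A rough illustration of lzop on the command line (a minimal sketch; the file names are placeholders):

# Compress file.txt -> file.txt.lzo (lzop keeps the original file by default)
lzop file.txt

# Trade speed for ratio: -1 is the fastest level, -9 compresses hardest
lzop -9 -o file.best.lzo file.txt

# Decompress
lzop -d file.txt.lzo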

2. Using the lzo algorithm in Hadoop

Using lzo compression in Hadoop reduces the data size and therefore the disk read and write time. Moreover, lzo files are block-based, which allows the data to be split into chunks and processed by Hadoop in parallel. These characteristics make lzo a very convenient compression format on Hadoop.

Lzo itself is not splittable, so when the data is in text format, an lzo-compressed file used as job input is handled by a single map. However, SequenceFile is itself stored in blocks, so a SequenceFile combined with lzo compression can be made splittable.
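
As a rough sketch of how a job can be asked to write lzo-compressed output once the hadoop-lzo setup described in section 4 is in place (the example jar, input and output paths here are placeholders):

# Run the bundled wordcount example and compress its output with LzopCodec
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.4.jar wordcount \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
  /input /output-lzo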

3. The lzo algorithm on HDFS

Compressed data is usually only about 1/4 the size of the original data. Storing compressed data on HDFS lets the cluster hold more data and extends the useful life of the cluster. Moreover, because the bottleneck of MapReduce jobs is usually IO, storing compressed data means fewer IO operations and jobs run more efficiently.

There are two troublesome aspects of using compression on Hadoop. First, some compression formats cannot be split into blocks and processed in parallel, such as gzip. Second, some other formats do support splitting, but their decompression is so slow that the job bottleneck shifts to the CPU, such as bzip2.

For example, suppose we have a 1.1 GB gzip file stored on HDFS with a 128 MB block size: it occupies 9 blocks (1.1 GB ≈ 1126 MB, and 1126 / 128 ≈ 8.8, which rounds up to 9). To process those blocks in parallel in MapReduce, each mapper would have to start reading at an arbitrary offset in the file; but a mapper starting at the second block begins at some random byte, where the context dictionary gzip needs for decompression is empty, so a gzip-compressed file cannot be decompressed correctly in parallel on Hadoop. A large gzip file on Hadoop can therefore only be processed by a single mapper, which is very inefficient and little better than not using MapReduce at all. As for bzip2, although it can even be split into blocks, its decompression is very slow and it cannot be read as a stream, so it cannot be used efficiently in Hadoop either; even if it is used, the slow decompression shifts the job bottleneck to the CPU.

Ideally we would like a compression format that can be split into blocks, processed in parallel, and is very fast. That format is lzo.

4. Configuring lzo compression on HDFS

(1) Compile

Hadoop itself does not support lzo compression, so you need the open-source hadoop-lzo component provided by Twitter. Compiling hadoop-lzo depends on both Hadoop and lzo.

a) Environment preparation

  • Development environment: CentOS 7, JDK 1.8, Hadoop 3.2.4, non-root user
  • Maven downloaded, installed, environment variables configured, and the settings.xml configuration file modified
  • Cluster layout

                hadoop102               hadoop103                        hadoop104
    HDFS        NameNode, DataNode      DataNode                         SecondaryNameNode, DataNode
    Yarn        NodeManager             ResourceManager, NodeManager     NodeManager

1. Download the Linux version of Maven

Download address for the latest version of Maven: https://maven.apache.org/download.cgi

Download address for previous versions of Maven: https://archive.apache.org/dist/maven/maven-3/
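
For example, to fetch the 3.9.5 binary package used below directly on the server (assuming that version is still available in the archive):

wget https://archive.apache.org/dist/maven/maven-3/3.9.5/binaries/apache-maven-3.9.5-bin.tar.gz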

2. Upload and decompress the Maven package

# Decompression command (the target directory must already exist):
tar -xf apache-maven-3.9.5-bin.tar.gz -C /usr/local/maven/
# I extract it into a maven directory that I created myself

3. Configure the Maven environment

cd /usr/local/maven/apache-maven-3.9.5/   # enter the Maven directory
mkdir repositories                        # create the directory used as the local repository

vim conf/settings.xml   # edit settings.xml

# Around line 55, set the local repository path
  <localRepository>/usr/local/maven/apache-maven-3.9.5/repositories/</localRepository>

# Around line 175, add the Aliyun mirror repository
  <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
  </mirror>

# Edit the environment variable file
vim /etc/profile.d/my_env.sh

# Add the Maven environment configuration
export MAVEN_HOME=/usr/local/maven/apache-maven-3.9.5
export PATH=$PATH:$MAVEN_HOME/bin

# Refresh the environment variables
source /etc/profile.d/my_env.sh

# Check the Maven environment
mvn -v

The xsync script I use here distributes a file to the whole cluster with one command, which is why the configuration file lives in /etc/profile.d/my_env.sh on every node. For details on configuring the xsync script, see: Configuring xsync (detailed explanation), Jiujiu@星's blog: https://blog.csdn.net/qq_58534786/article/details/133322121. In short, syncing a file with xsync copies it to the same path on the other servers (missing directories and files are created automatically); syncing a directory copies all the files and subdirectories under it; and when the same path is synced repeatedly, everything is transferred the first time, while later runs only transfer the parts that changed.
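
For reference, a minimal sketch of what such an xsync script typically looks like (the host names hadoop103 and hadoop104 are taken from the cluster layout above; see the linked post for the full version):

#!/bin/bash
# xsync: copy the given files/directories to the same path on the other nodes
if [ $# -lt 1 ]; then
    echo "usage: xsync <file_or_dir> ..."
    exit 1
fi
for host in hadoop103 hadoop104; do
    echo "==================== $host ===================="
    for path in "$@"; do
        if [ -e "$path" ]; then
            pdir=$(cd -P "$(dirname "$path")"; pwd)   # absolute path of the parent directory
            fname=$(basename "$path")
            ssh "$host" "mkdir -p $pdir"              # create missing directories on the target
            rsync -av "$pdir/$fname" "$host:$pdir"    # rsync only transfers the changed parts
        else
            echo "$path does not exist"
        fi
    done
done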

4. Install the required packages through yum

gcc-c++
lzo-devel
zlib-devel
autoconf
automake
libtool

yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool

b) Download, install and compile lzo

# Download directly with wget (I keep compressed packages in a dedicated directory)
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz

# Alternatively, download it on Windows and then upload it.
Download address: http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz

# Unzip
tar -xf lzo-2.10.tar.gz -C /opt/module/

# Enter lzo-2.10
cd /opt/module/lzo-2.10

export CFLAGS=-m64

# Configure the installation prefix for lzo
./configure --prefix=/usr/local/hadoop/lzo/

# Compile and install (root permissions required for the install step)
make
make install

# Check the compilation and installation result
cd /usr/local/hadoop/lzo
ls

# Download hadoop-lzo directly with the following command
wget https://github.com/twitter/hadoop-lzo/archive/master.zip

# Or download it elsewhere and upload it; download address:
https://github.com/twitter/hadoop-lzo/archive/master.zip

# If ssh refuses the connection when uploading, openssh-server may not be installed (install it with yum on CentOS)
sudo yum -y install openssh-server

# Unzip master.zip (it is a zip archive, so use unzip rather than tar)
unzip master.zip -d /opt/module

# If unzip is not installed, install it with yum (requires root permissions)
sudo yum -y install unzip

# Enter the unzipped source directory
cd /opt/module/hadoop-lzo-master/

# Modify the pom.xml file and set the hadoop version to match your cluster
# (tip: in vim normal mode, type / followed by the text you want to find to search quickly)
<hadoop.current.version>3.2.4</hadoop.current.version>

# Declare the compilation variables, pointing at the lzo installed above
export CFLAGS=-m64
export CXXFLAGS=-m64
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include   # the compiled lzo include files
export LIBRARY_PATH=/usr/local/hadoop/lzo/lib         # the compiled lzo lib files

# Start compiling. If it fails due to permissions, prepend sudo.
mvn clean package -Dmaven.test.skip=true

# View the compilation results (the hadoop-lzo jar is generated under target/)
cd target

c) Configure lzo

Move the compiled hadoop-lzo-0.4.21-SNAPSHOT.jar to hadoop-3.2.4/share/hadoop/common/

sudo mv hadoop-lzo-0.4.21-SNAPSHOT.jar /opt/module/hadoop-3.2.4/share/hadoop/common/

# Check the result of the move
cd /opt/module/hadoop-3.2.4/share/hadoop/common/

#Distribute to hadoop103, hadoop104
xsync hadoop-lzo-0.4.21-SNAPSHOT.jar

# Configure core-site.xml (add the following inside <configuration>)
<property>
    <name>io.compression.codecs</name>
    <value>
        org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        org.apache.hadoop.io.compress.SnappyCodec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec
    </value>
</property>

<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

#Synchronize to hadoop103/hadoop104
xsync core-site.xml

Start the cluster and check the status
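
A minimal sketch of starting and checking the cluster, assuming the layout above (NameNode on hadoop102, ResourceManager on hadoop103) and that Hadoop's sbin scripts are on the PATH:

# On hadoop102: start HDFS
start-dfs.sh

# On hadoop103: start Yarn
start-yarn.sh

# On every node: check that the expected daemons are running
jps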

Testing lzo, creating lzo indexes, and tuning Hadoop parameters will be covered in the next issue.