Table of Contents
1. Flume
1. Features of Flume
2. What can Flume do?
3. Flume collection and storage
4. Flume's three major components
5. Flume documentation (Chinese version)
2. Install Flume
(1) Upload and decompress the software package
(2) Configure environment variables
3. Test Flume
(1) Edit the Flume configuration file and start it
1. Create a new configuration file
2. Edit the configuration file
3. Start the configuration file
4. Install the Telnet service
5. Problem solving
(1) Problem
(2) Cause analysis
(3) Solution
4. Write data migration
1. Create a new folder
2. Create the configuration file
3. Start the cluster and view the processes
4. Create a data file directory and put the data files in it
5. Write the Python file
1) Requirements
2) Run the Python file
3) View the generated files
6. Run the configuration file and upload to the web page
1) Run the configuration file
2) Check whether the operation succeeded
3) Check whether there are files on the HDFS web page
4) Try downloading the file
5) Solve the problem
a) Problem
b) Cause analysis
c) Solution
6) Download successful
5. Document processing
1. Flume
1.Features of Flume:
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive volumes of log data. It also offers good custom extension capabilities for special scenarios, so it can be applied to most day-to-day data collection needs.
2. What can Flume do?
Flume is a tool/service that collects data such as logs and events, then centralizes and stores these huge volumes of data from various data sources.
3. Flume collection and storage
Flume can collect source data in many forms, such as files, folders, Kafka topics, and MySQL databases, and can deliver (sink) the collected data to many external storage systems, such as HDFS, HBase, Hive, and Kafka.
4. The three major components of Flume
- Source: reads data from the outside world into the agent
- Channel: buffers events between the source and the sink; common types: memory, file
- Sink: takes events from the channel and writes them out; common targets: printing to the console (logger), HDFS
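The three components form a pipeline: the source puts events into the channel, and the sink takes them out. The data flow can be illustrated with a small Python toy model (this is only a sketch of the concept, not Flume's actual API):

```python
from collections import deque

class MemoryChannel:
    """Buffers events between the source and the sink
    (conceptually like a1.channels.c1.type = memory)."""
    def __init__(self):
        self.queue = deque()

    def put(self, event):
        self.queue.append(event)

    def take(self):
        return self.queue.popleft() if self.queue else None

def source(lines, channel):
    """Reads raw data (here: a list of lines) and writes events into the channel."""
    for line in lines:
        channel.put({"body": line})

def logger_sink(channel):
    """Takes events from the channel, like a logger sink printing to the console."""
    out = []
    event = channel.take()
    while event is not None:
        out.append(event["body"])
        event = channel.take()
    return out

channel = MemoryChannel()
source(["hello", "flume"], channel)
print(logger_sink(channel))  # -> ['hello', 'flume']
```

The real Flume channel also gives the pipeline durability (a file channel survives agent restarts); this sketch only shows the put/take hand-off.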
5. Flume documentation (Chinese version)
https://flume.liyifeng.org/#exec-source
2. Install Flume
(1) Upload and decompress the software package
- Upload package:
cd /opt/softwares
- Unpack the package (pattern: tar <options> <archive> -C <destination>):
tar -xf apache-flume-1.9.0-bin.tar.gz -C /opt/modules/
- Create a symbolic link (so the environment variables can use a stable path):
ln -s apache-flume-1.9.0-bin/ flume
(2) Configure environment variables
- Path:
vi /etc/profile
- Configure environment variables:
export FLUME_HOME=/opt/modules/flume
export PATH=$FLUME_HOME/bin:$PATH
- Update environment variables:
source /etc/profile
- Copy the configuration template:
cp flume-env.sh.template flume-env.sh
Copy the template before editing it, so the original file is still available if a mistake is made.
- Add java environment variables:
export JAVA_HOME=/opt/modules/jdk1.8.0_241
- View version:
flume-ng version
3. Test Flume
(1) Edit Flume configuration file and start
1. Create a new configuration file
vi nc-flume.conf
2. Edit configuration file
Select the appropriate components source, channel, and sink respectively.
The test here selects netcat, memory, and logger as components.
# sink alias
a1.sinks = k1
# Configure source-related information: a netcat source reads data from one port of a host; specify the host and port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Configure channel - memory
a1.channels.c1.type = memory
# Configure sink - console printing
a1.sinks.k1.type = logger
# Bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3. Startup configuration file
flume-ng agent -n a1 --conf-file nc-flume.conf -Dflume.root.logger=INFO,console
4. Install Telnet service
Open a second terminal session on the node!
yum install telnet -y
Installation successful.
Try to connect:
telnet localhost 44444
Connection refused!
5. Problem Solving:
(1) Problem:
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
(2) Cause analysis:
The connection was refused: nothing is listening on port 44444, which means the Flume source never started.
(3) Solution:
Check the configuration file.
It turns out the configuration is wrong: the source and channel aliases (a1.sources = r1 and a1.channels = c1) are missing, so the agent has no source to bind to the port.
Correct version:
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# Configure source-related information: a netcat source reads data from one port of a host; specify the host and port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Configure channel - memory
a1.channels.c1.type = memory
# Configure sink - console printing
a1.sinks.k1.type = logger
# Bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
After modification:
Start the configuration file again:
Successfully connected! This step is easy to get wrong.
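If the telnet client is not available, the same test can be done with a few lines of Python; this is a hedged sketch (the host and port values below assume the netcat source configuration shown earlier):

```python
import socket

def send_line(host, port, message):
    """Open a TCP connection (like telnet does) and send one line,
    which a Flume netcat source would receive as one event."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall((message + "\n").encode("utf-8"))

# Example, assuming the agent is running and listening on localhost:44444:
# send_line("localhost", 44444, "hello flume")
```

Each line sent this way should appear in the agent's console output via the logger sink.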
4. Write data migration
Try to write data and upload it to the HDFS web page!
1. Create a new folder
2. Create configuration file
- Use the exec (source), file (channel), and hdfs (sink) components to upload the file to HDFS.
- Specify the name and path of the generated file
# Aliases
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Edit sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/students_info.txt
# Edit channels
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/modules/flume/channels_checkpoints_file/checkpoint
a1.channels.c1.dataDirs = /opt/modules/flume/channels_check_file/data
# Edit sinks
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d
a1.sinks.k1.hdfs.filePrefix = user-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
# Bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
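The exec source simply runs the configured command (here `tail -F`) and turns each output line into an event. Its core behavior can be sketched in Python (a simplified illustration only: real `tail -F` also handles file rotation, which this sketch does not):

```python
import time

def follow(path):
    """Yield lines as they are appended to a file, roughly what the
    exec source's 'tail -F' command does (no rotation handling)."""
    with open(path, "r") as f:
        f.seek(0, 2)              # jump to the end of the file, like tail
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(0.2)   # no new data yet; wait and retry
```

In the real pipeline each yielded line would be handed to the file channel and eventually written to HDFS by the sink.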
3. Start the cluster and view the process
4. Create a data file directory and put the data file in it
mkdir data
5. Write the Python file
1) Requirements:
- The name is 3 characters, chosen at random: one character each from nameArray1, nameArray2, and nameArray3 (20 candidate characters per field).
- Gender is male or female.
- The student number is 3 digits (minimum 001, maximum 100).
- Age is 15 to 22.
- The score consists of four subjects, each scored between 60 (minimum) and 100 (full marks); the four subject scores are also summed into a total.
- Write each record to a log file.
#!/usr/bin/python
# coding=UTF-8
import random
import time

# Name
nameArray1 = ["Zhao", "Qian", "Sun", "Li", "Zhou", "Wu", "Zheng", "Wang", "Deng", "Ma",
              "Yang", "Han", "Su", "Jiang", "Chiang", "Zhong", "Liu", "Chen", "Fang", "Zeng"]
nameArray2 = ["Ming", "He", "Jian", "Chao", "Hong", "Empty", "Zheng", "He", "Nine", "Ke",
              "Xiang", "Kai", "Hui", "Shu", "Jia", "Pu", "Peng", "Home", "Europe", "Fei"]
nameArray3 = ["Hui", "Ke", "Military", "Learning", "District", "Fei", "心", "新", "美", "cloth",
              "丽", "风", "flat", "high", "floor", "machine", "table", "Ni", "Ruo", "Ru"]
# Gender
sexArr = ["male", "female"]

# Append one student record per second to the students_info.txt file
def log():
    name1 = nameArray1[random.randint(0, 19)]
    name2 = nameArray2[random.randint(0, 19)]
    name3 = nameArray3[random.randint(0, 19)]
    name = name1 + name2 + name3
    # Generate a 3-digit student number, from 001 up to 100
    student_id = str(random.randint(1, 100)).zfill(3)
    # Age (15-22); note: generated but not included in the output record
    age = random.randint(15, 22)
    sex = sexArr[random.randint(0, 1)]
    # Scores
    chinese_score = random.randint(60, 100)
    math_score = random.randint(60, 100)
    english_score = random.randint(60, 100)
    computer_score = random.randint(60, 100)
    total_score = chinese_score + math_score + english_score + computer_score
    info = "{},{},{},{},{},{},{},{}".format(student_id, name, sex, chinese_score,
                                            math_score, english_score,
                                            computer_score, total_score)
    print(info)
    # Append the record to the log file
    with open('students_info.txt', 'a+') as student_log:
        student_log.write(info)
        student_log.write("\n")

while True:
    log()
    time.sleep(1)  # one record per second
2) Run python file
python student.py
3) View generated files
6. Run configuration file, upload web page
1) Run configuration file:
General form: flume-ng agent -n <agent name> --conf-file <config file> -Dflume.root.logger=INFO,console (the -D option prints the log to the console)
flume-ng agent -n a1 --conf-file student-score_exec-file-hdfs.conf -Dflume.root.logger=INFO,console
2) Check whether the operation is successful
3) Check whether the HDFS web page has files
http://<ip address>:9870
4) Try to download the file
- Check whether it is consistent with the file in the virtual machine
- Click to download
- Cannot download
5) Solve the problem
a) Problem:
The file cannot be downloaded.
b) Cause analysis:
It may be a firewall or hostname-mapping problem.
c) Solve the problem:
- Check the firewall status:
systemctl status firewalld.service
It is disabled (inactive), so the firewall is not the cause.
- Find the Windows hosts file and add the mapping:
C:\Windows\System32\drivers\etc
The hosts file in this directory cannot be edited in place without administrator rights, so copy it to another location to make the changes.
Open it with Notepad.
Add the mapping:
192.168.58.3 hadoop01
192.168.58.4 hadoop02
192.168.58.5 hadoop03
Then move the edited file back into the original directory.
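Once the hosts file is restored, the mapping can be verified from Python; a minimal sketch using the OS resolver (which consults the hosts file first):

```python
import socket

def check_mapping(hostname):
    """Resolve a hostname via the OS resolver and return its IP address,
    or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# With the mapping above in place, check_mapping("hadoop01") should
# return "192.168.58.3". A name that always resolves, for comparison:
print(check_mapping("localhost"))
```

If the name still returns None, the download links on the HDFS web page (which use the hostname) will keep failing.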
6) Download successful
5. File processing
Chart analysis and subsequent Hive operations can now be performed on this file.
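As a small sketch of such chart analysis, the generated records (student_id, name, sex, chinese, math, english, computer, total) can be read back and per-subject averages computed; the field layout below assumes the format produced by the generator script above:

```python
def average_scores(lines):
    """Compute the average score per subject from students_info.txt records."""
    subjects = ["chinese", "math", "english", "computer"]
    sums = {s: 0 for s in subjects}
    count = 0
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 8:
            continue  # skip malformed records
        # fields[3:7] are the four subject scores
        for subject, value in zip(subjects, fields[3:7]):
            sums[subject] += int(value)
        count += 1
    return {s: sums[s] / count for s in subjects} if count else {}

sample = ["001,ZhaoMingHui,male,80,90,70,100,340",
          "002,LiKeRu,female,60,70,90,80,300"]
print(average_scores(sample))  # -> {'chinese': 70.0, 'math': 80.0, 'english': 80.0, 'computer': 90.0}
```

The same aggregation maps directly onto a Hive query (AVG over each score column) once the file is loaded into a table.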