Hadoop Distributed File System (HDFS) is one of the core components of Hadoop. If Hadoop is already installed, it already contains the HDFS component and does not need to be installed separately.
To follow this guide, you need Hadoop installed on a Linux system. If Linux and Hadoop are not yet installed on your machine, please first follow Hadoop (02) Hadoop-3.3.6 Cluster Configuration Tutorial (Eufeo's Blog, CSDN) and install them according to that tutorial.
This section involves a lot of theoretical knowledge points. The main theoretical knowledge points include: distributed file system, introduction to HDFS, related concepts of HDFS, HDFS architecture, HDFS storage principle, and HDFS data reading and writing process.
Next, we will introduce the common Shell commands for HDFS file operations in the Linux operating system, use the Web interface to view and manage the Hadoop file system, and use the Java API provided by Hadoop to perform basic file operations.
Before starting the HDFS programming practice, we need to start Hadoop (version 3.3.6) on the master node. Execute the following commands:
cd /usr/local/hadoop-3.3.6
./sbin/start-dfs.sh    # Start Hadoop
1. Use shell commands to interact with HDFS
Hadoop supports many Shell commands, among which fs is the most commonly used command in HDFS. You can use fs to view the directory structure of the HDFS file system, upload and download data, create files, etc.
Note: The commands in the textbook “Big Data Technology Principles and Applications” use the Shell command form starting with “./bin/hadoop dfs”. There are actually three Shell command forms:
- hadoop fs
- hadoop dfs
- hdfs dfs
hadoop fs: works with any file system supported by Hadoop, such as the local file system and HDFS
hadoop dfs: applies only to the HDFS file system (now deprecated in favor of hdfs dfs)
hdfs dfs: equivalent to hadoop dfs, and likewise applies only to the HDFS file system
We can enter the following command in the terminal to see which commands are supported by fs:
./bin/hadoop fs
Enter the following command in the terminal to see the effect of a specific command
For example: to see how to use the put command, we can enter the following command:
./bin/hadoop fs -help put
1.1 Directory operations
HDFS (Hadoop Distributed File System) provides a set of commands for directory operations. First switch to the “/usr/local/hadoop-3.3.6” directory.
1.1.1 Create directory
./bin/hdfs dfs -mkdir -p /user/hadoop
Function: Create a directory named /user/hadoop in HDFS; the “-p” option automatically creates its parent directories if they do not exist.
1.1.2 List directory contents
./bin/hdfs dfs -ls .
Function: List the contents of the current directory in HDFS.
In this command, “-ls” means to list all the contents in a certain directory of HDFS, and “.” means the current user directory in HDFS, which is the “/user/hadoop” directory. Therefore, the above command and the following command are equivalent:
./bin/hdfs dfs -ls /user/hadoop
If you want to recursively list all directories and files in HDFS, you can use the following command:
./bin/hdfs dfs -ls -R /
Next, you can use the following command to create an input directory:
./bin/hdfs dfs -mkdir input22222
When creating this directory, a relative path is used. After the directory is successfully created, its full path in HDFS is “/user/hadoop/input22222”.
If you want to create a directory named input in the root directory of HDFS, you need to use the following command:
./bin/hdfs dfs -mkdir /input33333
Command to list files in the hdfs root directory:
./bin/hdfs dfs -ls /
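As the commands above show, HDFS resolves a relative path against the current user's home directory (“/user/&lt;username&gt;”), while an absolute path is taken as-is from the root. The resolution rule can be illustrated with plain Java path handling (a local sketch using java.nio, not an HDFS API call):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathResolution {
    public static void main(String[] args) {
        Path home = Paths.get("/user/hadoop");   // HDFS home directory of user "hadoop"

        // A relative name is resolved against the home directory...
        Path relative = home.resolve("input22222");
        System.out.println(relative);            // /user/hadoop/input22222

        // ...while an absolute name ignores the base entirely.
        Path absolute = home.resolve("/input33333");
        System.out.println(absolute);            // /input33333
    }
}
```

This is why “-mkdir input22222” creates “/user/hadoop/input22222”, while “-mkdir /input33333” creates a directory directly under the HDFS root.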
1.1.3 Delete directory
You can use the rm command to delete a directory. For example, you can use the following command to delete the “/input33333” directory just created in HDFS (not the “/user/hadoop/input22222” directory):
./bin/hdfs dfs -rm -r /input33333
In the above command, the “-r” option removes the “/input33333” directory together with all files and subdirectories beneath it. If a directory to be deleted is non-empty, the “-r” option is required; otherwise the command fails.
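The behavior of “-rm -r” mirrors a recursive delete on any file system: children must be removed before their parent. A plain-Java analogue on a local temporary directory (using java.nio, not the HDFS API) shows why a non-empty directory needs the recursive treatment:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

public class RecursiveDelete {
    public static void main(String[] args) throws IOException {
        // Build a small tree: tmpdir/sub/file.txt
        Path root = Files.createTempDirectory("rmdemo");
        Path sub = Files.createDirectories(root.resolve("sub"));
        Files.writeString(sub.resolve("file.txt"), "data");

        // A plain delete fails on a non-empty directory, just as
        // "hdfs dfs -rm" (without -r) refuses to remove one.
        try {
            Files.delete(root);
        } catch (DirectoryNotEmptyException e) {
            System.out.println("plain delete failed: directory not empty");
        }

        // The recursive variant removes the deepest entries first.
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
        System.out.println("deleted: " + !Files.exists(root));
    }
}
```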
1.2 File Operation
In practical applications, it is often necessary to upload files from the local file system to HDFS, or download files from HDFS to the local file system.
First, use the vim editor to create a file myLocalFile.txt in the “/home/hadoop/” directory of the local Linux file system. You can enter some words in it. For example, enter the following three lines:
Hadoop
Spark
JavaWeb
1.2.1 Upload files locally to HDFS
Then, you can use the following command to upload the local file “/home/hadoop/myLocalFile.txt” to the input22222 directory under the current user directory in HDFS, that is, to the HDFS directory “/user/hadoop/input22222/”:
hdfs dfs -put /home/hadoop/myLocalFile.txt input22222   # If the PATH variable is set, the hdfs command can be used directly
hdfs dfs -ls input22222                                 # Check whether the file was uploaded successfully
View the contents of the input22222 folder:
hdfs dfs -ls input22222
View the contents of the myLocalFile.txt file in HDFS:
hdfs dfs -cat input22222/myLocalFile.txt
If the PATH variable is not set, use the following commands instead:
cd /usr/local/hadoop-3.3.6
./bin/hdfs dfs -cat input22222/myLocalFile.txt
1.2.2 Download files from HDFS to local host
./bin/hdfs dfs -ls input22222
./bin/hdfs dfs -get input22222/myLocalFile.txt /home/hadoop/Downloads/
cd /home/hadoop/Downloads/
ls -l
cat myLocalFile.txt
1.2.3 File copy in HDFS
Next, learn how to copy a file from one HDFS directory to another. For example, to copy the HDFS file “/user/hadoop/input22222/myLocalFile.txt” to another HDFS directory “/input33333” (note that this input33333 directory is located in the HDFS root directory), you can use the following commands.
First create the input33333 folder in the root directory:
./bin/hdfs dfs -mkdir -p /input33333
./bin/hdfs dfs -ls /
Execute the copy command:
./bin/hdfs dfs -cp input22222/myLocalFile.txt /input33333
Check whether the file was copied successfully:
./bin/hdfs dfs -ls /input33333
./bin/hdfs dfs -cat /input33333/myLocalFile.txt
2. Use the web interface to manage HDFS
With Hadoop (version 3.3.6) started on the master node, you can open the HDFS web management interface in a browser at “http://&lt;NameNode IP&gt;:9870”. For a local installation, the access address is http://localhost:9870.
3. Use Java API to interact with HDFS
The different file systems in Hadoop are accessed by calling the Java API; the Shell commands introduced above are essentially applications of that API. The official Hadoop API documentation is linked below. If you want to study Hadoop in depth, you can consult it for the function of each API.
Hadoop API documentation
To use the Java API, we will write Java programs with the IDEA IDE.
3.1 Install IDEA software on Debian system
Detailed tutorials can be found here.
3.2 Use IDEA to develop and debug HDFS Java programs
Enter the IDEA installation directory and start the IDEA tool.
cd /opt/idea-IC-23.2/bin
ls -l
./idea.sh
Start the IDEA tool. When IDEA is started, the interface shown below will pop up, prompting you to set the workspace.
You can directly use the default setting “/home/hadoop/workspace” and click the “OK” button. It can be seen that since the hadoop user is currently used to log in to the Linux system, the default workspace directory is located under the hadoop user directory “/home/hadoop”.
After IDEA is started, the interface shown in the figure below will appear.
3.3 Add the JAR package needed to the project
In order to write a Java application that can interact with HDFS, you generally need to add the following JAR package to the Java project:
(1) All JAR packages in the “/usr/local/hadoop-3.3.6/share/hadoop/common” directory, including hadoop-common-3.3.6.jar, hadoop-common-3.3.6-tests.jar, hadoop-nfs-3.3.6.jar, hadoop-kms-3.3.6.jar and hadoop-registry-3.3.6.jar. Note that the jdiff, lib, sources and webapps directories are not included;
(2) All JAR packages in the “/usr/local/hadoop-3.3.6/share/hadoop/common/lib” directory;
(3) All JAR packages in the “/usr/local/hadoop-3.3.6/share/hadoop/hdfs” directory. Note that the directories jdiff, lib, sources and webapps are not included;
(4) All JAR packages in the “/usr/local/hadoop-3.3.6/share/hadoop/hdfs/lib” directory.
Step 1: Create the libs folder. If the libs folder already exists, you can proceed directly to the next step.
Step 2: Copy and paste the jar package that needs to be imported into the libs folder.
Copy the JARs into libs as archive files, keeping them in the state they were in when first copied; do not extract or open them.
For example: You can enter the following command and then enter the /usr/local/hadoop-3.3.6/share/hadoop/common folder to find the jar package that needs to be imported into the HDFS_example project:
nautilus /
Including lib and other jar packages:
Similarly, copy the jar package and lib under hdfs:
Then import the above package into the Java project in IDEA:
Step 3: Establish dependency on the libs folder. Right-click the project in IDEA, select ‘Open Module Settings’ (or ‘Project Structure’), select ‘Modules’ in the pop-up window, and then select the ‘Dependencies’ tab.
Click the “+” button, select “JARs or directories”, then select the libs folder where the jar package was placed in the pop-up window, and then click OK (I directly selected the folder here).
Set the dependency scope to “Compile”, then click “Apply” and “OK”.
If the panel shows the JARs in this state, the import was successful.
In this way, the operation of importing the jar package is completed. Now you can write or run the code directly. If you need to further optimize the import package settings, you can make corresponding adjustments according to IDEA’s environment configuration and plug-ins (such as Maven).
4. Writing Java applications
4.1 Create a new file MergeFile.java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;

/**
 * Filter out files whose names match a specific pattern
 */
class MyPathFilter implements PathFilter {
    String reg = null;

    MyPathFilter(String reg) {
        this.reg = reg;
    }

    public boolean accept(Path path) {
        return !(path.toString().matches(reg));
    }
}

/**
 * Merge the files in an HDFS directory into a single SequenceFile,
 * skipping files whose names match the filter pattern
 */
public class MergeFile {
    Path inputPath = null;   // Directory containing the files to be merged
    Path outputPath = null;  // Path of the output file

    public MergeFile(String input, String output) {
        this.inputPath = new Path(input);
        this.outputPath = new Path(output);
    }

    public void doMerge() throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        FileSystem fsSource = FileSystem.get(URI.create(inputPath.toString()), conf);
        FileSystem fsDst = FileSystem.get(URI.create(outputPath.toString()), conf);

        // Filter out files with the .abc suffix in the input directory
        FileStatus[] sourceStatus = fsSource.listStatus(inputPath, new MyPathFilter(".*\\.abc"));

        // Create a SequenceFile.Writer for the merged output
        SequenceFile.Writer writer = SequenceFile.createWriter(fsDst, conf, outputPath,
                Text.class, Text.class, SequenceFile.CompressionType.BLOCK, new DefaultCodec());

        // Read the content of each remaining file and append it to the SequenceFile
        for (FileStatus status : sourceStatus) {
            FSDataInputStream fsdis = fsSource.open(status.getPath());
            byte[] data = new byte[(int) status.getLen()];
            fsdis.readFully(data);
            fsdis.close();
            // File path as the key, file content as the value
            writer.append(new Text(status.getPath().toString()), new Text(data));
        }
        writer.close();
    }

    public static void main(String[] args) throws IOException {
        MergeFile merge = new MergeFile("hdfs://localhost:9000/user/hadoop/",
                "hdfs://localhost:9000/user/hadoop/merge.txt");
        merge.doMerge();
    }
}
Function: the program uses FSDataInputStream to read each file that passes the filter, and a SequenceFile.Writer to merge the files in HDFS: it loops over the input files and appends each one to the output, with the file path as the key and the file content as the value.
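The filtering in MyPathFilter is ordinary regular-expression matching: accept() returns true only for paths that do not match the pattern “.*\.abc”, so the .abc files are excluded from the merge. The logic can be checked with plain Java, independent of Hadoop (a standalone sketch, not part of the project):

```java
public class FilterSketch {
    // Same predicate as MyPathFilter.accept(): keep paths NOT matching the pattern
    static boolean accept(String path, String reg) {
        return !path.matches(reg);
    }

    public static void main(String[] args) {
        String reg = ".*\\.abc";   // note the escaped dot in a Java string literal
        String[] files = {"file1.txt", "file2.txt", "file3.txt", "file4.abc", "file5.abc"};
        for (String f : files) {
            System.out.println(f + " -> " + (accept(f, reg) ? "merged" : "skipped"));
        }
    }
}
```

Running this prints “merged” for the three .txt files and “skipped” for the two .abc files, matching the merge result shown later in this section.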
4.2 Compile and run the program
Before starting to compile and run the program, please make sure that Hadoop has been started. If it has not been started yet, you need to open a Linux terminal and enter the following command to start Hadoop:
cd /usr/local/hadoop-3.3.6
./sbin/start-dfs.sh
You can also check whether it has been started through the jps command:
Then, make sure that file1.txt, file2.txt, file3.txt, file4.abc and file5.abc already exist in the “/user/hadoop” directory of HDFS, and each file has content. Here, assume that the file content is as follows:
The content of file1.txt is: this is file1.txt
The content of file2.txt is: this is file2.txt
The content of file3.txt is: this is file3.txt
The content of file4.abc is: this is file4.abc
The content of file5.abc is: this is file5.abc
If it is not created, you can use a script command to quickly create it:
cd ~
pwd                         # Display the current path
vim create_files.sh         # Edit the script
chmod +x create_files.sh    # Make the script executable
./create_files.sh           # Execute the script
hdfs dfs -ls /user/hadoop   # Check whether the creation succeeded
#!/bin/bash
hdfs dfs -mkdir -p /user/hadoop
echo "this is file1.txt" | hdfs dfs -put - /user/hadoop/file1.txt
echo "this is file2.txt" | hdfs dfs -put - /user/hadoop/file2.txt
echo "this is file3.txt" | hdfs dfs -put - /user/hadoop/file3.txt
echo "this is file4.abc" | hdfs dfs -put - /user/hadoop/file4.abc
echo "this is file5.abc" | hdfs dfs -put - /user/hadoop/file5.abc
echo "Files created and content written successfully."
Execute this code:
Special note: before executing the code, please delete all directories under /user/hadoop (such as input, input22222, input33333, output and other files generated during the previous operations).
hdfs dfs -ls              # Check what directories are in /user/hadoop
hdfs dfs -rm -r <name>    # Delete a directory
Problems that may arise:
Exception in thread “main” java.lang.NoClassDefFoundError:com/ctc/wstx/io/InputBootstrap, refer to Hadoop(4-1) for solution.
Exception in thread “main” java.net.ConnectException: Call From hadoop01/192.168.30.134 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; Please refer to Hadoop(4-2) for solution.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell); refer to Hadoop(4-3) for a solution.
If the program runs successfully, you can view the generated merge.txt file in HDFS. For example, you can execute the following command in the Linux terminal:
hdfs dfs -ls
hdfs dfs -cat merge.txt
You can see the following results in merge.txt:
this is file1.txt
this is file2.txt
this is file3.txt
5. Deployment of Java applications
The following describes how to generate a JAR package from a Java application and deploy it to run on the Hadoop platform. First, create a new directory named myapp in the Hadoop installation directory to store the Hadoop application we wrote ourselves. You can execute the following command in the Linux terminal:
cd /usr/local/hadoop-3.3.6/
mkdir myapp
5.1 Use IDEA’s own packaging tool to package JAR
In the project panel on the left side of the IDEA interface, right-click the project name “HDFS_example” and select “Open Module Settings” in the pop-up menu, as shown in the figure below.
Then select the class file that needs to be run, here is MergeFile.java
Then click “Apply” and “OK”. After confirming, rebuild the artifact (the JAR package), as shown in the figure; the corresponding JAR then appears in the project's out output directory.
Check the directory /usr/local/hadoop-3.3.6/myapp; the JAR should now be present there.
Since the program has already been run once, merge.txt has been generated in HDFS. You therefore need to delete that file first with the following commands:
hdfs dfs -ls                # View files
hdfs dfs -rm -r merge.txt   # Delete the file
hdfs dfs -ls
5.2 hadoop jar command running program
Now, you can use the hadoop jar command to run the program in the Linux system. The command is as follows:
cd /usr/local/hadoop-3.3.6/
./bin/hadoop jar ./myapp/HDFS_example.jar
hdfs dfs -ls
hdfs dfs -cat merge.txt
Command explanation: hadoop jar <JAR file path> <application main class> [application parameters]
6. Practice
Several code files are given below for readers to practice on their own.
6.1 Writing files
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

public class Chapter3_1 {
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            byte[] buff = "Hello world".getBytes();  // Content to be written
            String filename = "test";                // Name of the file to be written
            FSDataOutputStream os = fs.create(new Path(filename));
            os.write(buff, 0, buff.length);
            System.out.println("Create:" + filename);
            os.close();
            fs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
6.2 Determine whether the file exists
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Chapter3_2 {
    public static void main(String[] args) {
        try {
            String filename = "test";
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(new Path(filename))) {
                System.out.println("File exists");
            } else {
                System.out.println("File does not exist");
            }
            fs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
6.3 Reading files
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;

public class Chapter3_3 {
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("test");
            FSDataInputStream getIt = fs.open(file);
            BufferedReader d = new BufferedReader(new InputStreamReader(getIt));
            String content = d.readLine();  // Read one line from the file
            System.out.println(content);
            d.close();   // Close the file
            fs.close();  // Close HDFS
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
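The three programs above follow a common write → check → read pattern. A local-filesystem analogue with java.nio (a sketch using a temporary directory and a hypothetical file name, no Hadoop cluster required) is handy for checking the flow before running against HDFS:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalAnalogue {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempDirectory("hdfsdemo").resolve("test");

        // Chapter3_1 analogue: create the file and write a line
        Files.writeString(file, "Hello world");
        System.out.println("Create:" + file.getFileName());

        // Chapter3_2 analogue: test for existence
        System.out.println(Files.exists(file) ? "File exists" : "File does not exist");

        // Chapter3_3 analogue: read the first line back
        String content = Files.readAllLines(file).get(0);
        System.out.println(content);
    }
}
```

The HDFS versions differ only in that the FileSystem object is obtained from a Configuration pointing at the NameNode, and streams are opened through fs.create() and fs.open() instead of local-file calls.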
Reference materials
Two simple ways to package IDEA MAVEN projects into jar packages_idea maven project packaging into jar packages-CSDN Blog
HDFS Programming Practice (Hadoop3.3.5)_Xiamen University Database Laboratory Blog (xmu.edu.cn)