Hadoop(04) HDFS programming practice

Hadoop Distributed File System (HDFS) is one of the core components of Hadoop. If Hadoop is already installed, it already includes the HDFS component, so HDFS does not need to be installed separately.

To follow this guide, you need Hadoop installed on a Linux system. If Linux and Hadoop are not yet installed on your machine, please return to Hadoop (02) Hadoop-3.3.6 Cluster Configuration Tutorial_Eufeo’s Blog-CSDN Blog and install them according to that tutorial.

This topic also involves a good deal of theory. The main knowledge points include: the distributed file system, an introduction to HDFS, related HDFS concepts, the HDFS architecture, the HDFS storage principle, and the HDFS data reading and writing process.

Next, we will introduce the common Shell commands for HDFS file operations in the Linux operating system, use the Web interface to view and manage the Hadoop file system, and use the Java API provided by Hadoop to perform basic file operations.

Before starting the HDFS programming practice, we need to start Hadoop (version 3.3.6) on the master node. Execute the following commands:

cd /usr/local/hadoop-3.3.6
./sbin/start-dfs.sh # Start HDFS

1. Use shell commands to interact with HDFS

Hadoop supports many Shell commands, among which fs is the most commonly used command in HDFS. You can use fs to view the directory structure of the HDFS file system, upload and download data, create files, etc.

Note: The commands in the textbook “Big Data Technology Principles and Applications” use the Shell command form starting with “./bin/hadoop dfs”. There are actually three Shell command forms:

  1. hadoop fs
  2. hadoop dfs
  3. hdfs dfs

hadoop fs: works with any file system, such as the local file system and the HDFS file system
hadoop dfs: only works with the HDFS file system (and is deprecated in favor of hdfs dfs)
hdfs dfs: the same as hadoop dfs, and likewise only applies to the HDFS file system
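
The difference is easy to see from the shell; a quick illustration, assuming you are in /usr/local/hadoop-3.3.6:

./bin/hadoop fs -ls file:///home/hadoop   # hadoop fs can also address the local file system via a file:// URI
./bin/hdfs dfs -ls /                      # hdfs dfs only operates on HDFS paths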

We can enter the following command in the terminal to see which commands are supported by fs:

./bin/hadoop fs

Enter the following command in the terminal to see the usage of a specific command.

For example, to see how to use the put command, we can enter the following command:

./bin/hadoop fs -help put

1.1 Directory operations

HDFS (Hadoop Distributed File System) provides a set of commands to perform directory operations. First switch to the “/usr/local/hadoop-3.3.6” directory:

1.1.1 Create directory

./bin/hdfs dfs -mkdir -p /user/hadoop

Function: Create a directory named /user/hadoop in HDFS, and automatically create its parent directory (if it does not exist).

1.1.2 List directory contents

./bin/hdfs dfs -ls .

Function: List the contents of the current directory in HDFS.

In this command, “-ls” means to list all the contents in a certain directory of HDFS, and “.” means the current user directory in HDFS, which is the “/user/hadoop” directory. Therefore, the above command and the following command are equivalent:

./bin/hdfs dfs -ls /user/hadoop

Running “-ls” without a path argument also lists the current user directory. If you want to list all directories and files on HDFS recursively, you can use the following command:

./bin/hdfs dfs -ls -R /

Next, you can use the following command to create a directory named input22222:

./bin/hdfs dfs -mkdir input22222

When creating the input22222 directory, a relative path is used, so after it is created successfully its full path in HDFS is “/user/hadoop/input22222”.

If you want to create a directory named input33333 in the root directory of HDFS, you need to use the following command:

./bin/hdfs dfs -mkdir /input33333

Command to list files in the hdfs root directory:

./bin/hdfs dfs -ls /

1.1.3 Delete directory

You can use the rm command to delete a directory. For example, you can use the following command to delete the “/input33333” directory just created in HDFS (not the “/user/hadoop/input22222” directory):

./bin/hdfs dfs -rm -r /input33333

In the above command, the “-r” parameter removes the “/input33333” directory together with all files and subdirectories under it. Whenever the target to be deleted is a directory, you must use the “-r” parameter, otherwise the command will fail.

1.2 File Operation

In practical applications, it is often necessary to upload files from the local file system to HDFS, or download files from HDFS to the local file system.
First, use the vim editor to create a file myLocalFile.txt in the “/home/hadoop/” directory of the local Linux file system. You can enter some words in it. For example, enter the following three lines:

Hadoop

Spark

JavaWeb
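
If you prefer not to use vim, the same file can also be created from the shell with a heredoc; a minimal sketch using the same path and content:

cat > /home/hadoop/myLocalFile.txt <<EOF
Hadoop
Spark
JavaWeb
EOF
cat /home/hadoop/myLocalFile.txt   # confirm the contents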

1.2.1 Upload files locally to HDFS

Then, you can use the following command to upload “/home/hadoop/myLocalFile.txt” from the local file system to the input22222 directory under the current user directory in HDFS, that is, to the “/user/hadoop/input22222/” directory of HDFS:

hdfs dfs -put /home/hadoop/myLocalFile.txt input22222 # If the PATH variable is set, you can use the hdfs command directly
hdfs dfs -ls input22222 # Check whether the file is uploaded successfully

View the contents of the input22222 folder:

hdfs dfs -ls input22222

View the contents of the myLocalFile.txt file in HDFS:

hdfs dfs -cat input22222/myLocalFile.txt

If the PATH variable is not set, use the following commands:
cd /usr/local/hadoop-3.3.6
./bin/hdfs dfs -cat input22222/myLocalFile.txt

1.2.2 Download files from HDFS to local host

./bin/hdfs dfs -ls input22222
./bin/hdfs dfs -get input22222/myLocalFile.txt /home/hadoop/Downloads/
cd /home/hadoop/Downloads/
ls -l
cat myLocalFile.txt

1.2.3 File copy in HDFS

Next, learn how to copy a file from one directory in HDFS to another directory in HDFS. For example, if you want to copy the HDFS file “/user/hadoop/input22222/myLocalFile.txt” to another HDFS directory “/input33333” (note that this input33333 directory is located in the root directory of HDFS), you can use the following command:

First create the input33333 folder in the root directory:

./bin/hdfs dfs -mkdir -p /input33333
./bin/hdfs dfs -ls /

Execute the copy command:

./bin/hdfs dfs -cp input22222/myLocalFile.txt /input33333

Check whether the file was copied successfully:

./bin/hdfs dfs -ls /input33333
./bin/hdfs dfs -cat /input33333/myLocalFile.txt

2. Use the web interface to manage HDFS

With Hadoop (version 3.3.6) started on the master node, you can open the HDFS web management interface at “IP address:9870”; when browsing on the NameNode itself, the access address is http://localhost:9870.
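
The same HTTP port also serves the WebHDFS REST API (enabled by default), so the file system can be queried over HTTP as well; a quick check, assuming the NameNode runs on localhost:

curl -s http://localhost:9870/ | head -n 5                             # confirm the web interface is reachable
curl -s "http://localhost:9870/webhdfs/v1/user/hadoop?op=LISTSTATUS"   # list /user/hadoop via WebHDFS (returns JSON)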

3. Use Java API to interact with HDFS

The different file systems in Hadoop are accessed through the Java API; the Shell commands introduced above are essentially applications of that API. The official Hadoop API documentation is linked below; if you want to study Hadoop in depth, you can visit it to look up what each API does.

Hadoop API documentation

To interact with Java API, you need to use the software IDEA to write Java programs.

3.1 Install IDEA software on Debian system

Detailed tutorials can be found here.

3.2 Use IDEA to develop and debug HDFS Java programs

Enter the IDEA installation directory and start the IDEA tool.

cd /opt/idea-IC-23.2/bin
ls -l
./idea.sh

When IDEA is started, the interface shown below will pop up, prompting you to set the workspace.

You can directly use the default setting “/home/hadoop/workspace” and click the “OK” button. It can be seen that since the hadoop user is currently used to log in to the Linux system, the default workspace directory is located under the hadoop user directory “/home/hadoop”.

After IDEA is started, the interface shown in the figure below will appear.

3.3 Add the JAR package needed to the project

In order to write a Java application that can interact with HDFS, you generally need to add the following JAR packages to the Java project (a command-line sketch for collecting them is given after this list):
(1) All JAR packages in the “/usr/local/hadoop-3.3.6/share/hadoop/common” directory, including hadoop-common-3.3.6.jar, hadoop-common-3.3.6-tests.jar, hadoop-nfs-3.3.6.jar, hadoop-kms-3.3.6.jar and hadoop-registry-3.3.6.jar. Note that the subdirectories jdiff, lib, sources and webapps are not included;
(2) All JAR packages in the “/usr/local/hadoop-3.3.6/share/hadoop/common/lib” directory;
(3) All JAR packages in the “/usr/local/hadoop-3.3.6/share/hadoop/hdfs” directory. Note that the directories jdiff, lib, sources and webapps are not included;
(4) All JAR packages in the “/usr/local/hadoop-3.3.6/share/hadoop/hdfs/lib” directory.
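
If you prefer the command line to the file manager, the four groups of JAR packages can be collected into the project's libs folder in one go; a sketch, where the project path ~/IdeaProjects/HDFS_example is only an assumption and should be adjusted to your own project location:

PROJECT=~/IdeaProjects/HDFS_example   # hypothetical project path, adjust as needed
mkdir -p "$PROJECT/libs"
cp /usr/local/hadoop-3.3.6/share/hadoop/common/*.jar     "$PROJECT/libs/"
cp /usr/local/hadoop-3.3.6/share/hadoop/common/lib/*.jar "$PROJECT/libs/"
cp /usr/local/hadoop-3.3.6/share/hadoop/hdfs/*.jar       "$PROJECT/libs/"
cp /usr/local/hadoop-3.3.6/share/hadoop/hdfs/lib/*.jar   "$PROJECT/libs/"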

Step 1: Create the libs folder. If the libs folder already exists, you can proceed directly to the next step.

Step 2: Copy and paste the jar package that needs to be imported into the libs folder.

Copy the JAR packages into libs as they are; do not double-click them to open (extract) them.

For example, you can enter the following command to open the file manager, then navigate to the /usr/local/hadoop-3.3.6/share/hadoop/common folder to find the JAR packages that need to be imported into the HDFS_example project:

nautilus /

Also copy the JAR packages in the lib subdirectory. Similarly, copy the JAR packages under hdfs and its lib subdirectory.

Then import the above packages into the Java project in IDEA:

Step 3: Establish dependency on the libs folder. Right-click the project in IDEA, select ‘Open Module Settings’ (or ‘Project Structure’), select ‘Modules’ in the pop-up window, and then select the ‘Dependencies’ tab.
Click the “+” button, select “JARs or directories”, then in the pop-up window select the libs folder where the JAR packages were placed, and click OK (here the folder itself is selected directly).

Select “Compile” for the scope, then click “Apply” and “OK”.

If the JAR packages now appear in the dependency list as shown, the import was successful.

This completes the import of the JAR packages, and you can now write and run code directly. If you want to refine the dependency setup further, you can adjust it according to IDEA’s environment configuration and build tooling (such as Maven).

4. Writing Java applications

4.1 Create a new file MergeFile.java

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.compress.DefaultCodec;

/**
 * Filter out files whose file names meet specific conditions
 */
class MyPathFilter implements PathFilter {
    String reg = null;
    MyPathFilter(String reg) {
        this.reg = reg;
    }
    public boolean accept(Path path) {
        // Accept only paths that do NOT match the regular expression;
        // files matching reg (here, the .abc files) are excluded.
        if (!(path.toString().matches(reg)))
            return true;
        return false;
    }
}

/***
 * Read each filtered file with FSDataInputStream and merge them into a single
 * output file in HDFS using SequenceFile.Writer
 */
public class MergeFile {
    Path inputPath = null; //The path of the directory where the files to be merged are located
    Path outputPath = null; //Path of output file

    public MergeFile(String input, String output) {
        this.inputPath = new Path(input);
        this.outputPath = new Path(output);
    }

    public void doMerge() throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");

        FileSystem fsSource = FileSystem.get(URI.create(inputPath.toString()), conf);
        FileSystem fsDst = FileSystem.get(URI.create(outputPath.toString()), conf);

        //The following filters out (excludes) files with the suffix .abc in the input directory.
        FileStatus[] sourceStatus = fsSource.listStatus(inputPath, new MyPathFilter(".*\\.abc"));

        //Create SequenceFile.Writer to write the merged file
        SequenceFile.Writer writer = SequenceFile.createWriter(fsDst, conf, outputPath, Text.class, Text.class,
                SequenceFile.CompressionType.BLOCK, new DefaultCodec());

        //The following reads the contents of each file after filtering and writes the contents of each file into SequenceFile.
        for (FileStatus status : sourceStatus) {
            FSDataInputStream fsdis = fsSource.open(status.getPath());
            byte[] data = new byte[(int) status.getLen()];

            fsdis.readFully(data);
            fsdis.close();

            // Write the file path as the key and the file content as the value into SequenceFile
            writer.append(new Text(status.getPath().toString()), new Text(data));
        }

        writer.close();
    }

    public static void main(String[] args) throws IOException {
        MergeFile merge = new MergeFile("hdfs://localhost:9000/user/hadoop/", "hdfs://localhost:9000/user/hadoop/merge.txt");
        merge.doMerge();
    }
}

Function: the program uses FSDataInputStream to read each file that passes the filter and a SequenceFile.Writer to write the file path (as the key) and the file content (as the value) into a single merged output file in HDFS.

4.2 Compile and run the program

Before starting to compile and run the program, please make sure that Hadoop has been started. If it has not been started yet, you need to open a Linux terminal and enter the following command to start Hadoop:

cd /usr/local/hadoop-3.3.6
./sbin/start-dfs.sh

You can also check whether it has been started through the jps command:
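
After start-dfs.sh has run, jps should list at least the NameNode, DataNode and SecondaryNameNode processes; the output looks roughly like this (process IDs will differ):

jps
# 12345 NameNode
# 12458 DataNode
# 12661 SecondaryNameNode
# 12890 Jps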

Then, make sure that file1.txt, file2.txt, file3.txt, file4.abc and file5.abc already exist in the “/user/hadoop” directory of HDFS, and each file has content. Here, assume that the file content is as follows:
The content of file1.txt is: this is file1.txt
The content of file2.txt is: this is file2.txt
The content of file3.txt is: this is file3.txt
The content of file4.abc is: this is file4.abc
The content of file5.abc is: this is file5.abc

If it is not created, you can use a script command to quickly create it:

cd ~
pwd # Display the current path
vim create_files.sh # Edit script
chmod +x create_files.sh # Modify script permissions
./create_files.sh #Execute script
hdfs dfs -ls /user/hadoop # Check whether the creation is successful

#!/bin/bash

hdfs dfs -mkdir -p /user/hadoop

echo "this is file1.txt" | hdfs dfs -put - /user/hadoop/file1.txt
echo "this is file2.txt" | hdfs dfs -put - /user/hadoop/file2.txt
echo "this is file3.txt" | hdfs dfs -put - /user/hadoop/file3.txt
echo "this is file4.abc" | hdfs dfs -put - /user/hadoop/file4.abc
echo "this is file5.abc" | hdfs dfs -put - /user/hadoop/file5.abc

echo "Files created and content written successfully."

Execute this code:

Special note: Before executing the code, please delete the directories generated during the previous operations (such as input22222 under /user/hadoop, and /input33333 under the HDFS root directory), because the program tries to read every entry directly under /user/hadoop as a file.

hdfs dfs -ls                     # Check which directories are under /user/hadoop
hdfs dfs -rm -r <directory name> # Delete a directory

Problems that may arise:

Exception in thread "main" java.lang.NoClassDefFoundError: com/ctc/wstx/io/InputBootstrap; refer to Hadoop(4-1) for the solution.

Exception in thread "main" java.net.ConnectException: Call From hadoop01/192.168.30.134 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; refer to Hadoop(4-2) for the solution.

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell); refer to Hadoop(4-3) for the solution.

If the program runs successfully, you can view the generated merge.txt file in HDFS. For example, you can execute the following command in the Linux terminal:

hdfs dfs -ls
hdfs dfs -cat merge.txt

You can see that merge.txt contains the contents of the three .txt files (the .abc files are filtered out):

this is file1.txt
this is file2.txt
this is file3.txt
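
Note that MergeFile writes its output as a block-compressed SequenceFile, so -cat may also print some binary header bytes around the text. hdfs dfs -text understands SequenceFiles and prints each record as key and value; a sketch using the same output path:

hdfs dfs -text merge.txt   # each record is the source file path followed by its content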

5. Deployment of Java applications

The following describes how to generate a JAR package from a Java application and deploy it to run on the Hadoop platform. First, create a new directory named myapp in the Hadoop installation directory to store the Hadoop application we wrote ourselves. You can execute the following command in the Linux terminal:

cd /usr/local/hadoop-3.3.6/
mkdir myapp

5.1 Use IDEA’s own packaging tool to package JAR

In the Project panel on the left side of the IDEA window, right-click the project name “HDFS_example” and select “Open Module Settings” (Project Structure) in the pop-up menu, as shown in the figure below.

Then select the class that needs to be run, here MergeFile.

Then click “Apply” and “OK”. After confirming, rebuild the artifact (Build → Build Artifacts → jar), as shown in the figure; the corresponding JAR package will then appear in the project’s out output directory.

View the directory /usr/local/hadoop-3.3.6/myapp; the JAR package is already there at this point.

Since the program has already been run once before and merge.txt has been generated, you first need to execute the following commands in Linux to delete that file:

hdfs dfs -ls # View files
hdfs dfs -rm -r merge.txt # Delete files
hdfs dfs -ls

5.2 hadoop jar command running program

Now, you can use the hadoop jar command to run the program in the Linux system. The command is as follows:

cd /usr/local/hadoop-3.3.6/
./bin/hadoop jar ./myapp/HDFS_example.jar
hdfs dfs -ls
hdfs dfs -cat merge.txt
Command explanation:
hadoop jar <JAR file path> <application main class> [application parameters]
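
If the main class was not recorded in the JAR’s manifest during packaging, it can be passed explicitly on the command line; a sketch, where MergeFile is the main class of the program above:

./bin/hadoop jar ./myapp/HDFS_example.jar MergeFile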

6. Practice

Several code files are given below for readers to practice on their own.

6.1 Writing files

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

public class Chapter3_1 {
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            byte[] buff = "Hello world".getBytes(); // Content to be written
            String filename = "test";               // Name of the file to be written
            FSDataOutputStream os = fs.create(new Path(filename));
            os.write(buff, 0, buff.length);
            System.out.println("Create:" + filename);
            os.close();
            fs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
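
After running Chapter3_1, the result can be checked from the shell; a quick verification, assuming the program runs as the hadoop user so that the relative name test resolves to /user/hadoop/test:

hdfs dfs -cat test   # should print: Hello world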

6.2 Determine whether the file exists

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Chapter3_2 {
    public static void main(String[] args) {
        try {
            String filename = "test";

            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(new Path(filename))) {
                System.out.println("File exists");
            } else {
                System.out.println("File does not exist");
            }
            fs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

6.3 Reading files

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;

public class Chapter3_3 {
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("test");
            FSDataInputStream getIt = fs.open(file);
            BufferedReader d = new BufferedReader(new InputStreamReader(getIt));
            String content = d.readLine(); // Read one line from the file
            System.out.println(content);
            d.close();  // Close the file
            fs.close(); // Close HDFS
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
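
When you are done practicing, the test file can be removed again; an optional cleanup step:

hdfs dfs -rm test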

Reference materials

Two simple ways to package IDEA MAVEN projects into jar packages_idea maven project packaging into jar packages-CSDN Blog

HDFS Programming Practice (Hadoop3.3.5)_Xiamen University Database Laboratory Blog (xmu.edu.cn)