WSL + VS Code: one-stop setup of a Hadoop pseudo-distributed + Spark environment

If you want to set up a Linux + Hadoop + Spark environment, the common practice is to install a virtual machine with software such as VMware or VirtualBox.
This article introduces how to build the environment on the Windows Subsystem for Linux (WSL) and use VS Code to develop Spark programs.

Wsl environment preparation

For detailed WSL installation documentation, see the official Microsoft docs.
Type ubuntu in PowerShell to enter the WSL environment.
Notice that Windows File Explorer now shows a Linux (little penguin) entry.
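If WSL is not installed yet, on a recent Windows 10/11 build the setup can usually be done with a single command in an administrator PowerShell (a minimal sketch; see the official documentation for details):

wsl --install -d Ubuntu   # install WSL together with the Ubuntu distribution
ubuntu                    # afterwards, launch the Ubuntu shell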

Build a Hadoop pseudo-distributed environment

Resource preparation

  • Modify /opt directory permissions:
    sudo chown -R xuxin /opt # replace xuxin with your own user name

  • Create two new folders, module and software, in the /opt directory (a command sketch follows the unzip step below)
    Prepare jdk-1.8 and hadoop-3.2.3 in the /opt/software directory
    jdk: Java Downloads | Oracle
    hadoop: Apache Hadoop downloads page

  • Copy and paste the downloaded files directly into the /opt/software directory
    ![[Pasted image 20231112164154.png]]

  • Unpack the archives

tar -zxvf /opt/software/jdk-8u212-linux-x64.tar.gz -C /opt/module/
tar -zxvf /opt/software/hadoop-3.2.3.tar.gz -C /opt/module/
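For the folder setup mentioned above, a minimal command sequence could look like this (assuming /opt is already owned by your user):

mkdir -p /opt/module /opt/software   # create the two directories in one go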

Configure ssh service

  • Install the ssh server
    sudo apt install openssh-server
  • Configure password-free login
cd ~/.ssh/ # If there is no such directory, execute ssh localhost first
ssh-keygen -t rsa # If a prompt appears, just press Enter
cat ./id_rsa.pub >> ./authorized_keys
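To verify that password-free login works, ssh to the local machine; it should no longer ask for a password:

ssh localhost   # should log in directly
exit            # leave the nested shell again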

Configure Java environment

  • Modify .bashrc file
    vim ~/.bashrc
    Add the following
export JAVA_HOME=/opt/module/jdk1.8.0_212
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
  • Make the configuration file effective
    source ~/.bashrc

  • If the following command prints the Java version information, the Java environment has been set up.
    java -version

Configure Hadoop environment

  • Modify the core-site.xml file
cd /opt/module/hadoop-3.2.3
vim ./etc/hadoop/core-site.xml
  • Change to the following configuration
<configuration>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>

    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>xuxin</value>
    </property>

</configuration>
  • Also modify the hdfs-site.xml file
cd /opt/module/hadoop-3.2.3
vim ./etc/hadoop/hdfs-site.xml
  • Change to the following configuration
<configuration>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp/dfs/name</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp/dfs/data</value>
    </property>

</configuration>
  • Initialize the NameNode
cd /opt/module/hadoop-3.2.3
./bin/hdfs namenode -format
  • Log output containing a line such as "has been successfully formatted" indicates that initialization succeeded.

  • Start HDFS

cd /opt/module/hadoop-3.2.3/
./sbin/start-dfs.sh
  • Open localhost:9870 in a browser to access the web UI.
    Utilities -> Browse the file system lets you view the HDFS file system

  • If you need to make changes from the web UI, you can turn off safe mode from the Hadoop directory:
    ./bin/hadoop dfsadmin -safemode leave

Run the wordcount sample code

  • Create a test.txt file inside an input folder
cd /opt/module/hadoop-3.2.3/
mkdir input
vim input/test.txt
  • Test file content
I learn C language
I like Java
I do not like Python
  • Create a user directory on HDFS
    ./bin/hadoop fs -mkdir -p /user/xuxin

  • Since a user directory has been created, relative paths such as input can be used in commands; the corresponding absolute path is /user/xuxin/input
    ./bin/hadoop fs -put ./input input

  • You can confirm on the web UI that the upload succeeded; a command-line check is shown below.
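As an alternative to the web UI, a quick check from the Hadoop directory lists the uploaded files:

./bin/hadoop fs -ls input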

  • Run the example wordcount code
    ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.3.jar wordcount input output

  • View the output
    ./bin/hadoop fs -cat output/*
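Given the three-line test file above, the counts should come out roughly like this (word and count, tab-separated; the exact ordering may differ):

C	1
I	3
Java	1
Python	1
do	1
language	1
learn	1
like	2
not	1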

OK, the Hadoop environment preparation is now complete.

A simple Spark example

Resource preparation

  • Download the spark-3.2.4 version (the spark-3.2.4-bin-hadoop3.2 build used below)

  • Copy and paste the file directly into the /opt/software directory in the same way

  • Unzip

cd /opt/software
tar -zxvf ./spark-3.2.4-bin-hadoop3.2.tgz -C ../module

Configure Spark environment

  • Modify the configuration file (in Spark's conf directory)
cd /opt/module/spark-3.2.4-bin-hadoop3.2/conf
mv spark-env.sh.template spark-env.sh
vim spark-env.sh

  • Add one line; this lets Spark pick up the locally installed Hadoop's jars and configuration
    export SPARK_DIST_CLASSPATH=$(/opt/module/hadoop-3.2.3/bin/hadoop classpath)

Run the sample code

cd /opt/module/spark-3.2.4-bin-hadoop3.2
./bin/run-example SparkPi
  • Run result

  • Observed output: Pi is roughly 3.1451557257786287

Configure PySpark environment

Spark on YARN

YARN preparation

  • Modify the mapred-site.xml configuration file
cd /opt/module/hadoop-3.2.3/etc/hadoop
vim mapred-site.xml

Add the following

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
  • Modify the yarn-site.xml configuration file
    vim yarn-site.xml
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
  • Start YARN (run from the Hadoop directory, /opt/module/hadoop-3.2.3)
    ./sbin/start-yarn.sh
    View the current processes
    jps
    Besides Jps itself, you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.

  • You can write a simple script to make starting and stopping Hadoop more convenient

    • A simple reference: my_hadoop.sh
#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit
fi

case $1 in
"start")
    echo "==================== Start hadoop cluster ===================="
    echo "------------- Start hdfs -------------"
    /opt/module/hadoop-3.2.3/sbin/start-dfs.sh
    echo "------------- Start yarn -------------"
    /opt/module/hadoop-3.2.3/sbin/start-yarn.sh
    echo "GO SEE http://localhost:9870/explorer.html#/ EXPLORE YOUR HDFS!"
;;
"stop")
    echo "==================== Shut down the hadoop cluster ===================="
    echo "------------- Stop yarn -------------"
    /opt/module/hadoop-3.2.3/sbin/stop-yarn.sh
    echo "------------- Stop hdfs -------------"
    /opt/module/hadoop-3.2.3/sbin/stop-dfs.sh
;;
*)
    echo "Input Args Error..."
;;
esac
my_hadoop.sh start # start hdfs & yarn
my_hadoop.sh stop # stop yarn & hdfs
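Note that the script must be executable before it can be called like this (and either placed on your PATH or invoked with its path):

chmod +x my_hadoop.sh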

Install Miniconda3

  • Resource preparation

    Select Linux version to download
  • Copy to the /opt/software directory in the same way and run
cd /opt/software
bash ./Miniconda3-latest-Linux-x86_64.sh
  • Just follow the instructions to install. The installation directory can be /opt/module/miniconda3.

  • The installation is complete

  • Restart the shell and you should see a (base) prefix in front of the prompt

  • Create conda virtual environment
    conda create -n pyspark python=3.10

Configure Spark files

  • Add the following to the spark-env.sh file, pointing at the Hadoop configuration directory:
HADOOP_CONF_DIR=/opt/module/hadoop-3.2.3/etc/hadoop
YARN_CONF_DIR=/opt/module/hadoop-3.2.3/etc/hadoop
  • Add configuration in .bashrc:
export PYSPARK_PYTHON=/opt/module/miniconda3/envs/pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/module/miniconda3/envs/pyspark/bin/python
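As with the Java setup earlier, reload .bashrc afterwards so the variables take effect:

source ~/.bashrc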

VS Code remote connection configuration

Plugin preparation and Python interpreter selection

  • Install the Remote Development plug-in

  • Install Python plugin

  • Open the Remote Explorer on the left, select Ubuntu, and connect in the current window

  • Create the pyspark-project folder in the home directory

cd ~
mkdir pyspark-project
  • Select this folder and open it in VS Code
    ![[Pasted image 20231112194429.png]]

  • Install the Python library pyspark
    Press ctrl + ` to open the terminal and install it

conda activate pyspark
pip install pyspark==3.2.0 # The pyspark version cannot be too high, otherwise there will be compatibility issues
  • In VS Code, select the pyspark virtual environment as the Python interpreter
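A quick way to confirm that the chosen environment can see pyspark is a one-liner in the activated pyspark environment (just a sanity check):

python -c "import pyspark; print(pyspark.__version__)"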

Test the WordCount program

  • Write word_count.py program
# coding:utf8

'''
word_count.py
word count
'''

from pyspark import SparkConf, SparkContext


if __name__ == '__main__':
    # The master is chosen at submit time (e.g. --master yarn)
    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    # Relative HDFS path, resolved against /user/<username>
    input_path = "input/test.txt"

    # Read the file as an RDD of lines
    file_rdd = sc.textFile(input_path)

    # Split each line into words
    words_rdd = file_rdd.flatMap(lambda line: line.split(" "))

    # Map each word to a (word, 1) pair
    words_with_one_rdd = words_rdd.map(lambda x: (x, 1))

    # Sum the counts for each word
    result_rdd = words_with_one_rdd.reduceByKey(lambda a, b: a + b)

    # Merge into a single partition and write the result to HDFS
    result_rdd.coalesce(1).saveAsTextFile("output")

    sc.stop()
  • Execute code
cd /opt/module/spark-3.2.4-bin-hadoop3.2
./bin/spark-submit --master yarn ~/pyspark-project/test/word_count.py
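Note: saveAsTextFile fails if the target directory already exists on HDFS. If output is still there from the earlier Hadoop wordcount run, remove it first from the Hadoop directory:

/opt/module/hadoop-3.2.3/bin/hadoop fs -rm -r output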
  • Run result
    ![[Pasted image 20231112200340.png]]

  • Output file:

    At this point, the PySpark environment has been set up.
    You can write programs in VS Code and submit them to YARN.
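To inspect the result written by the Spark job, the same HDFS cat command used earlier works (run from the Hadoop directory):

cd /opt/module/hadoop-3.2.3
./bin/hadoop fs -cat output/*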