WSL + VSCode: a one-stop Hadoop + Spark environment
If you want to set up a Linux, Hadoop, or Spark environment, the common practice is to install a virtual machine with software such as VMware or VirtualBox.
This article shows how to build the same environment on the Windows Subsystem for Linux (WSL) instead, and how to use VSCode to develop Spark programs.
Wsl environment preparation
For detailed wsl installation documentation, please see
Type ubuntu in PowerShell to enter the WSL environment.
Notice that Windows File Explorer now shows a little penguin entry for the Linux file system.
Build a Hadoop pseudo-distributed environment
Resource preparation
- Modify /opt directory permissions:

sudo chown -R xuxin /opt  # replace xuxin with your own user name
- Create two new folders, module and software, in the /opt directory.
- Prepare jdk-1.8 and hadoop-3.2.3 in the /opt/software directory:
jdk: Java Downloads | Oracle
hadoop: -
Copy and paste the downloaded files directly into the /opt/software directory.
- Unzip the files:

tar -zxvf /opt/software/jdk-8u212-linux-x64.tar.gz -C /opt/module/
tar -zxvf /opt/software/hadoop-3.2.3.tar.gz -C /opt/module/
Configure ssh service
- Install the ssh server
sudo apt install openssh-server
- Configure password-free login
cd ~/.ssh/                 # if this directory does not exist, run ssh localhost first
ssh-keygen -t rsa          # press Enter at every prompt
cat ./id_rsa.pub >> ./authorized_keys
Configure Java environment
- Modify .bashrc file
vim ~/.bashrc
Add the following
export JAVA_HOME=/opt/module/jdk1.8.0_212
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
- Make the configuration take effect:
source ~/.bashrc
- If the following command prints the Java version, the environment is set up correctly:
java -version
Configure the Hadoop environment
- Modify the core-site.xml file
cd /opt/module/hadoop-3.2.3
vim ./etc/hadoop/core-site.xml
- Change to the following configuration
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>xuxin</value>
    </property>
</configuration>
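As a quick sanity check of the XML edits, the property values can be parsed back with Python's standard library. This is just a local helper under the assumption that the snippet below mirrors the file contents; it is not part of the Hadoop setup itself:

```python
# Parse core-site.xml-style properties into a dict for verification.
import xml.etree.ElementTree as ET

xml_text = """
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp</value>
    </property>
</configuration>
"""

root = ET.fromstring(xml_text)
props = {p.findtext("name"): p.findtext("value") for p in root.findall("property")}
print(props["fs.defaultFS"])   # hdfs://localhost:9000
```

The same check works for any of the site files edited below, since they all share this property-list structure.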
- Also modify the hdfs-site.xml file

cd /opt/module/hadoop-3.2.3
vim ./etc/hadoop/hdfs-site.xml
- Change to the following configuration
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp/dfs/data</value>
    </property>
</configuration>
- Initialize the namenode:

cd /opt/module/hadoop-3.2.3
./bin/hdfs namenode -format
- Log output similar to the following indicates that initialization succeeded.
- Start hdfs:

cd /opt/module/hadoop-3.2.3/
./sbin/start-dfs.sh
- Open localhost:9870 in a browser to reach the web UI.
Utilities -> Browse the file system lets you view the HDFS file system.
- If you need to modify files from the web UI, you can turn off safe mode from the Hadoop directory:

./bin/hdfs dfsadmin -safemode leave
Run the wordcount sample code
- Create the test.txt file:

cd /opt/module/hadoop-3.2.3/
mkdir input
vim ./input/test.txt

- Test file content:

I learn C language
I like Java
I do not like Python
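Before running the job, the counts that wordcount should produce can be computed locally with a few lines of plain Python. This is only a sanity check on the test data, separate from the Hadoop workflow:

```python
# Count whitespace-separated words across the three test lines,
# mirroring what the wordcount example will compute on HDFS.
from collections import Counter

lines = [
    "I learn C language",
    "I like Java",
    "I do not like Python",
]
counts = Counter(word for line in lines for word in line.split())
print(counts["I"], counts["like"])   # 3 2
```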
- Create a user directory on HDFS:

./bin/hadoop fs -mkdir -p /user/xuxin
- Since a user directory has been specified, relative paths such as input can be used in commands; the corresponding absolute path is /user/xuxin/input.

./bin/hadoop fs -put ./input input
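This relative-to-absolute mapping is ordinary POSIX path joining against the user's home directory; a one-liner makes the rule explicit (xuxin stands in for your own user name):

```python
# HDFS resolves a relative path against /user/<username>,
# which is plain POSIX path joining.
import posixpath

user_home = "/user/xuxin"
absolute = posixpath.join(user_home, "input")
print(absolute)   # /user/xuxin/input
```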
- The web UI shows that the upload succeeded.
- Run the wordcount example:

./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.3.jar wordcount input output
- View the output:

./bin/hadoop fs -cat output/*
OK, the Hadoop environment is now ready.
Spark simple example
Resource preparation
- Download the spark-3.2.4 release.
- Copy the file into the /opt/software directory in the same way as before.
- Unzip:

cd /opt/software
tar -zxvf ./spark-3.2.4-bin-hadoop3.2.tgz -C ../module
Configure spark environment
- Modify the configuration file in the conf directory:

cd /opt/module/spark-3.2.4-bin-hadoop3.2/conf
mv spark-env.sh.template spark-env.sh
vim spark-env.sh
- Add one line:
export SPARK_DIST_CLASSPATH=$(/opt/module/hadoop-3.2.3/bin/hadoop classpath)
Run the sample code
cd /opt/module/spark-3.2.4-bin-hadoop3.2
./bin/run-example SparkPi
- Run result (observed output):

Pi is roughly 3.1451557257786287
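SparkPi estimates pi with the Monte Carlo method: sample random points in the unit square and count the fraction landing inside the quarter circle. The same idea in a few lines of plain local Python, as a sketch of the algorithm rather than the Spark job itself:

```python
# Monte Carlo pi: 4 * (points inside quarter circle) / (total points).
import random

random.seed(42)            # fixed seed so the run is reproducible
n = 100_000
inside = sum(
    1 for _ in range(n)
    if random.random() ** 2 + random.random() ** 2 < 1
)
pi_estimate = 4 * inside / n
print(pi_estimate)
```

With 100,000 samples the estimate lands close to 3.14, which is why the SparkPi output above is only "roughly" pi.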
Configure PySpark environment
Spark On Yarn
yarn preparation
- Modify the mapred-site.xml configuration file
cd /opt/module/hadoop-3.2.3/etc/hadoop
vim mapred-site.xml
Add the following
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
- Modify the yarn-site.xml configuration file
vim yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
- Start yarn:

cd /opt/module/hadoop-3.2.3
./sbin/start-yarn.sh

- View the current processes (for example with jps).
- You can write a simple script to make starting and shutting down Hadoop easier.
- A simple reference, my_hadoop.sh:
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi

case $1 in
"start")
    echo "==================== Start hadoop cluster ===================="
    echo "------------- Start hdfs -------------"
    "/opt/module/hadoop-3.2.3/sbin/start-dfs.sh"
    echo "------------- Start yarn -------------"
    "/opt/module/hadoop-3.2.3/sbin/start-yarn.sh"
    echo "GO SEE http://localhost:9870/explorer.html#/ EXPLORE YOUR HDFS!"
;;
"stop")
    echo "==================== Shut down the hadoop cluster ===================="
    echo "------------- Stop yarn -------------"
    "/opt/module/hadoop-3.2.3/sbin/stop-yarn.sh"
    echo "------------- Stop hdfs -------------"
    "/opt/module/hadoop-3.2.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
my_hadoop.sh start  # start hdfs & yarn
my_hadoop.sh stop   # stop yarn & hdfs
Install Miniconda3
- Resource preparation: select the Linux version to download.
- Copy it to the /opt/software directory in the same way and run:

cd /opt/software
bash ./Miniconda3-latest-Linux-x86_64.sh
- Follow the prompts to install. The installation directory can be /opt/module/miniconda3.
- The installation is complete.
- Restart the shell; a (base) prefix now appears in front of the prompt.
- Create a conda virtual environment:

conda create -n pyspark python=3.10
Configuring spark files
- Add the following to the spark-env.sh file (both variables must point to the directory that contains the Hadoop configuration files):

HADOOP_CONF_DIR=/opt/module/hadoop-3.2.3/etc/hadoop
YARN_CONF_DIR=/opt/module/hadoop-3.2.3/etc/hadoop
- Add configuration in .bashrc:
export PYSPARK_PYTHON=/opt/module/miniconda3/envs/pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/module/miniconda3/envs/pyspark/bin/python
Configure a VSCode remote connection
Plug-in preparation and python interpreter selection
- Install the Remote Development plug-in.
- Install the Python plug-in.
- Open the Remote Explorer on the left, select Ubuntu, and connect in the current window.
- Create the pyspark-project folder in the home directory:

cd ~
mkdir pyspark-project
- Open this folder in VSCode.
- Install the pyspark Python library.
Press Ctrl + ` to open the terminal and install:

conda activate pyspark
pip install pyspark==3.2.0  # keep the pyspark version close to the cluster's Spark version; a version that is too new causes compatibility issues
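The rule of thumb behind pinning pyspark==3.2.0 is that the pip-installed client should share the cluster's Spark major.minor version (3.2 here); a tiny check makes the rule concrete:

```python
# The pip client and the cluster should agree on major.minor.
pyspark_version = "3.2.0"
spark_version = "3.2.4"

same_minor = pyspark_version.split(".")[:2] == spark_version.split(".")[:2]
print(same_minor)   # True
```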
- Select the pyspark virtual environment as the Python interpreter.
Test the WordCount program
- Write word_count.py program
# coding:utf8
'''
word_count.py  word count
'''
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    input_path = "input/test.txt"
    file_rdd = sc.textFile(input_path)
    words_rdd = file_rdd.flatMap(lambda line: line.split(" "))
    words_with_one_rdd = words_rdd.map(lambda x: (x, 1))
    result_rdd = words_with_one_rdd.reduceByKey(lambda a, b: a + b)
    result_rdd.coalesce(1).saveAsTextFile("output")
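The flatMap -> map -> reduceByKey chain can be mimicked with ordinary Python data structures, which is handy for checking the transformation logic without a cluster (a local sketch; the PySpark calls themselves stay as written above):

```python
# Mimic the RDD pipeline with plain Python on the same test lines.
lines = ["I learn C language", "I like Java", "I do not like Python"]

words = [w for line in lines for w in line.split(" ")]   # flatMap
pairs = [(w, 1) for w in words]                          # map
result = {}
for key, value in pairs:                                 # reduceByKey(lambda a, b: a + b)
    result[key] = result.get(key, 0) + value
print(result["I"], result["like"])   # 3 2
```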
- Execute the code:

cd /opt/module/spark-3.2.4-bin-hadoop3.2
./bin/spark-submit --master yarn ~/pyspark-project/test/word_count.py
- Run result:
- Output file:
At this point, the PySpark environment has been set up.
You can write programs in VSCode and submit them to YARN.