WSL + VS Code: one-stop setup of a Hadoop pseudo-distributed + Spark environment

If you want to set up a Linux + Hadoop + Spark environment, the common practice is to install a virtual machine with software such as VMware or VirtualBox.
This article introduces how to build the environment on the Windows Subsystem for Linux (WSL) and use VS Code to develop Spark programs.

Wsl environment preparation

For detailed WSL installation documentation, see the official Microsoft docs.
Type ubuntu in PowerShell to enter the WSL environment.
Notice that Windows File Explorer now shows a Linux (little penguin) entry.
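If WSL is not installed yet, on a recent Windows 10/11 build the setup can usually be done with a single command in an administrator PowerShell (a minimal sketch; see the official documentation for details):

wsl --install -d Ubuntu   # install WSL together with the Ubuntu distribution
ubuntu                    # afterwards, launch the Ubuntu shell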

Build a Hadoop pseudo-distributed environment

Resource preparation

  • Modify /opt directory permissions:
    sudo chown -R xuxin /opt # replace xuxin with your own user name

  • Create two new folders, module and software, in the /opt directory (a command sketch follows the unzip step below)
    Prepare jdk-1.8 and hadoop-3.2.3 in the /opt/software directory
    jdk: Java Downloads | Oracle
    hadoop: Apache Hadoop downloads page

  • Copy and paste the downloaded files directly into the /opt/software directory
    ![[Pasted image 20231112164154.png]]

  • Unpack the archives

tar -zxvf /opt/software/jdk-8u212-linux-x64.tar.gz -C /opt/module/
tar -zxvf /opt/software/hadoop-3.2.3.tar.gz -C /opt/module/
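For the folder setup mentioned above, a minimal command sequence could look like this (assuming /opt is already owned by your user):

mkdir -p /opt/module /opt/software   # create the two directories in one go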

Configure ssh service

  • Install the ssh server
    sudo apt install openssh-server
  • Configure password-free login
cd ~/.ssh/ # If there is no such directory, execute ssh localhost first
ssh-keygen -t rsa # If a prompt appears, just press Enter
cat ./id_rsa.pub >> ./authorized_keys
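To verify that password-free login works, ssh to the local machine; it should no longer ask for a password:

ssh localhost   # should log in directly
exit            # leave the nested shell again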

Configure Java environment

  • Modify .bashrc file
    vim ~/.bashrc
    Add the following
export JAVA_HOME=/opt/module/jdk1.8.0_212
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
  • Make the configuration file effective
    source ~/.bashrc

  • If the following command prints the Java version information, the Java environment has been set up.
    java -version

Configure Hadoop environment

  • Modify the core-site.xml file
cd /opt/module/hadoop-3.2.3
vim ./etc/hadoop/core-site.xml
  • Change to the following configuration
<configuration>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>

    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>xuxin</value>
    </property>

</configuration>
  • Also modify the hdfs-site.xml file
cd /opt/module/hadoop-3.2.3
vim ./etc/hadoop/hdfs-site.xml
  • Change to the following configuration
<configuration>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp/dfs/name</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/module/hadoop-3.2.3/tmp/dfs/data</value>
    </property>

</configuration>
  • Initialize the NameNode
cd /opt/module/hadoop-3.2.3
./bin/hdfs namenode -format
  • Log output containing a line such as "has been successfully formatted" indicates that initialization succeeded.

  • Start HDFS

cd /opt/module/hadoop-3.2.3/
./sbin/start-dfs.sh
  • Open localhost:9870 in a browser to access the web UI.
    Utilities -> Browse the file system lets you view the HDFS file system

  • If you need to make changes from the web UI, you can turn off safe mode from the Hadoop directory:
    ./bin/hadoop dfsadmin -safemode leave

Run the wordcount sample code

  • Create a test.txt file inside an input folder
cd /opt/module/hadoop-3.2.3/
mkdir input
vim input/test.txt
  • Test file content
I learn C language
I like Java
I do not like Python
  • Create a user directory on HDFS
    ./bin/hadoop fs -mkdir -p /user/xuxin

  • Since a user directory has been created, relative paths such as input can be used in commands; the corresponding absolute path is /user/xuxin/input
    ./bin/hadoop fs -put ./input input

  • You can confirm on the web UI that the upload succeeded; a command-line check is shown below.
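As an alternative to the web UI, a quick check from the Hadoop directory lists the uploaded files:

./bin/hadoop fs -ls input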

  • Run the example wordcount code
    ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.3.jar wordcount input output

  • View the output
    ./bin/hadoop fs -cat output/*
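Given the three-line test file above, the counts should come out roughly like this (word and count, tab-separated; the exact ordering may differ):

C	1
I	3
Java	1
Python	1
do	1
language	1
learn	1
like	2
not	1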

OK, the Hadoop environment preparation is now complete.

A simple Spark example

Resource preparation

  • Download the spark-3.2.4 version (the spark-3.2.4-bin-hadoop3.2 build used below)

  • Copy and paste the file directly into the /opt/software directory in the same way

  • Unzip

cd /opt/software
tar -zxvf ./spark-3.2.4-bin-hadoop3.2.tgz -C ../module

Configure Spark environment

  • Modify the configuration file (in Spark's conf directory)
cd /opt/module/spark-3.2.4-bin-hadoop3.2/conf
mv spark-env.sh.template spark-env.sh
vim spark-env.sh

  • Add one line; this lets Spark pick up the locally installed Hadoop's jars and configuration
    export SPARK_DIST_CLASSPATH=$(/opt/module/hadoop-3.2.3/bin/hadoop classpath)

Run the sample code

cd /opt/module/spark-3.2.4-bin-hadoop3.2
./bin/run-example SparkPi
  • Run result

  • Observed output: Pi is roughly 3.1451557257786287

Configure PySpark environment

Spark on YARN

YARN preparation

  • Modify the mapred-site.xml configuration file
cd /opt/module/hadoop-3.2.3/etc/hadoop
vim mapred-site.xml

Add the following

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
  • Modify the yarn-site.xml configuration file
    vim yarn-site.xml
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
  • Start YARN (run from the Hadoop directory, /opt/module/hadoop-3.2.3)
    ./sbin/start-yarn.sh
    View the current processes
    jps
    Besides Jps itself, you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.

  • You can write a simple script to make starting and stopping Hadoop more convenient

    • A simple reference: my_hadoop.sh
#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit
fi

case $1 in
"start")
    echo "==================== Start hadoop cluster ===================="
    echo "------------- Start hdfs -------------"
    /opt/module/hadoop-3.2.3/sbin/start-dfs.sh
    echo "------------- Start yarn -------------"
    /opt/module/hadoop-3.2.3/sbin/start-yarn.sh
    echo "GO SEE http://localhost:9870/explorer.html#/ EXPLORE YOUR HDFS!"
;;
"stop")
    echo "==================== Shut down the hadoop cluster ===================="
    echo "------------- Stop yarn -------------"
    /opt/module/hadoop-3.2.3/sbin/stop-yarn.sh
    echo "------------- Stop hdfs -------------"
    /opt/module/hadoop-3.2.3/sbin/stop-dfs.sh
;;
*)
    echo "Input Args Error..."
;;
esac
my_hadoop.sh start # start hdfs & yarn
my_hadoop.sh stop # stop yarn & hdfs
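Note that the script must be executable before it can be called like this (and either placed on your PATH or invoked with its path):

chmod +x my_hadoop.sh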

Install Miniconda3

  • Resource preparation

    Select Linux version to download
  • Copy to the /opt/software directory in the same way and run
cd /opt/software
bash ./Miniconda3-latest-Linux-x86_64.sh
  • Just follow the instructions to install. The installation directory can be /opt/module/miniconda3.

  • The installation is complete

  • Restart the shell and you should see a (base) prefix in front of the prompt

  • Create conda virtual environment
    conda create -n pyspark python=3.10

Configure Spark files

  • Add the following to the spark-env.sh file, pointing at the Hadoop configuration directory:
HADOOP_CONF_DIR=/opt/module/hadoop-3.2.3/etc/hadoop
YARN_CONF_DIR=/opt/module/hadoop-3.2.3/etc/hadoop
  • Add configuration in .bashrc:
export PYSPARK_PYTHON=/opt/module/miniconda3/envs/pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/module/miniconda3/envs/pyspark/bin/python
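As with the Java setup earlier, reload .bashrc afterwards so the variables take effect:

source ~/.bashrc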

VS Code remote connection configuration

Plugin preparation and Python interpreter selection

  • Install the Remote Development plug-in

  • Install Python plugin

  • Open the Remote Explorer on the left, select Ubuntu, and connect in the current window

  • Create the pyspark-project folder in the home directory

cd ~
mkdir pyspark-project
  • Select this folder and open it in VS Code
    ![[Pasted image 20231112194429.png]]

  • Install the Python library pyspark
    Press ctrl + ` to open the terminal and install it

conda activate pyspark
pip install pyspark==3.2.0 # The pyspark version cannot be too high, otherwise there will be compatibility issues
  • In VS Code, select the pyspark virtual environment as the Python interpreter
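A quick way to confirm that the chosen environment can see pyspark is a one-liner in the activated pyspark environment (just a sanity check):

python -c "import pyspark; print(pyspark.__version__)"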

Test the WordCount program

  • Write word_count.py program
# coding:utf8

'''
word_count.py
word count
'''

from pyspark import SparkConf, SparkContext


if __name__ == '__main__':
    # The master is chosen at submit time (e.g. --master yarn)
    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    # Relative HDFS path, resolved against /user/<username>
    input_path = "input/test.txt"

    # Read the file as an RDD of lines
    file_rdd = sc.textFile(input_path)

    # Split each line into words
    words_rdd = file_rdd.flatMap(lambda line: line.split(" "))

    # Map each word to a (word, 1) pair
    words_with_one_rdd = words_rdd.map(lambda x: (x, 1))

    # Sum the counts for each word
    result_rdd = words_with_one_rdd.reduceByKey(lambda a, b: a + b)

    # Merge into a single partition and write the result to HDFS
    result_rdd.coalesce(1).saveAsTextFile("output")

    sc.stop()
  • Execute code
cd /opt/module/spark-3.2.4-bin-hadoop3.2
./bin/spark-submit --master yarn ~/pyspark-project/test/word_count.py
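Note: saveAsTextFile fails if the target directory already exists on HDFS. If output is still there from the earlier Hadoop wordcount run, remove it first from the Hadoop directory:

/opt/module/hadoop-3.2.3/bin/hadoop fs -rm -r output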
  • Run result
    ![[Pasted image 20231112200340.png]]

  • Output file:

    At this point, the PySpark environment has been set up.
    You can write programs in VS Code and submit them to YARN.
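To inspect the result written by the Spark job, the same HDFS cat command used earlier works (run from the Hadoop directory):

cd /opt/module/hadoop-3.2.3
./bin/hadoop fs -cat output/*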