Apache Airflow (2): Airflow Standalone Installation


Table of Contents

1. Install Anaconda and Python 3.7

2. Install Airflow on a single machine

3. Start Airflow


Airflow is based on Python and is distributed as a Python package. Installation requires Python 3.6 or above. The metadata database supports PostgreSQL 9.6+, MySQL 5.7+, and SQLite 3.15.0+.

1. Install Anaconda and Python 3.7

1) Download Anaconda from the official website, selecting the Linux version

Download official website address: https://www.anaconda.com/products/individual#macos

2) Upload the downloaded Anaconda installation package to the node4 node and install it

sh Anaconda3-2020.02-Linux-x86_64.sh [Just press Enter]

Do you accept the license terms? [yes|no]

yes [Type yes and press Enter to continue]

... ...

Anaconda3 will now be installed into this location:

/root/anaconda3



  - Press ENTER to confirm the location

  - Press CTRL-C to abort the installation

  - Or specify a different location below



[/root/anaconda3] >>> [Just press Enter and install to the /root/anaconda3 path]

... ...

Do you wish the installer to initialize Anaconda3

by running conda init? [yes|no]

[no] >>>yes [enter yes and press Enter]

... ...

[Installation completed]

3) Configure Anaconda’s environment variables

Add the following statements to /etc/profile:

export PATH=$PATH:/root/anaconda3/bin

#Make environment variables effective

source /etc/profile
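After sourcing the profile, the conda command should be available on the PATH; a quick check looks like this:

#Verify that conda is on the PATH
[root@node4 ~]# conda --version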

4) Create a Python 3.7 environment

 conda create -n python37 python=3.7

5) Activate the python37 environment

conda activate python37 [activates the python37 environment; "source activate" must be executed first]

The related commands are as follows:

source activate [initializes conda in the current shell; must be executed once, after which the conda command can be used to activate environments]

conda deactivate [exit the current base environment]

conda activate python37 [activate the python37 environment]

conda deactivate [exit the current python37 environment]

conda remove -n python37 --all [delete the python37 environment]
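To confirm that the python37 environment is active and using the expected interpreter, a quick check (assuming Anaconda was installed to the default /root/anaconda3 path above) looks like this:

#Verify the Python version inside the activated environment
(python37) [root@node4 ~]# python -V
Python 3.7.x

#Verify which interpreter is on the PATH
(python37) [root@node4 ~]# which python
/root/anaconda3/envs/python37/bin/python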

2. Install Airflow on a single machine

When Airflow is deployed on a single node, all Airflow processes run on one machine.

1) Install the system dependencies required for Airflow

Normal use of Airflow requires some system dependencies. Install the following on the node4 node:

yum -y install mysql-devel gcc gcc-devel python-devel gcc-c++ cyrus-sasl cyrus-sasl-devel cyrus-sasl-lib

2) Create the corresponding library in MySQL and set parameters

We use MySQL as Airflow's metadata database. Create the database and user that Airflow will use in the MySQL instance on the node2 node:

CREATE DATABASE airflow CHARACTER SET utf8;

create user 'airflow'@'%' identified by '123456';

grant all privileges on airflow.* to 'airflow'@'%';

flush privileges;
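Before moving on, it is worth confirming that the new account can reach MySQL from the Airflow node. A minimal check, assuming the mysql client is installed on node4 and MySQL listens on the default port 3306, might be:

#Log in to the airflow database on node2 as the airflow user
[root@node4 ~]# mysql -h node2 -P 3306 -u airflow -p123456 airflow
mysql> select database();
+------------+
| database() |
+------------+
| airflow    |
+------------+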

Modify “/etc/my.cnf” on node2, where MySQL is installed, and add the following content under [mysqld]:

[mysqld]

explicit_defaults_for_timestamp=1

Note: The explicit_defaults_for_timestamp system variable configured above controls how the MySQL server handles default and NULL values in TIMESTAMP columns. It was introduced in MySQL 5.6.6 and defaults to 0. With the default value, a TIMESTAMP column that does not explicitly declare the NULL attribute is automatically given the NOT NULL attribute, and inserting NULL into such a column sets it to the current timestamp. When the variable is set to 1, a TIMESTAMP column that does not explicitly declare NOT NULL may be NULL; inserting NULL then stores NULL directly instead of the current timestamp, while inserting NULL into a column declared NOT NULL raises an error.

Airflow requires this MySQL parameter to be set to 1. After modifying “my.cnf” as above, restart MySQL, and then check whether the parameter has taken effect:

#Restart mysql

[root@node2 ~]# service mysqld restart



#Re-login to mysql query

mysql> show variables like 'explicit_defaults_for_timestamp';
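If the setting has taken effect, the query should return something like the following (the value may be displayed as ON or 1 depending on the MySQL version):

+---------------------------------+-------+
| Variable_name                   | Value |
+---------------------------------+-------+
| explicit_defaults_for_timestamp | ON    |
+---------------------------------+-------+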

3) Install Airflow

Switch to the python37 environment on node4 and install Airflow, specifying version 2.1.3:

[root@node4 ~]# conda activate python37

(python37) [root@node4 ~]# pip install apache-airflow==2.1.3 -i https://pypi.tuna.tsinghua.edu.cn/simple

By default, Airflow is installed into the $ANACONDA_HOME/envs/python37/lib/python3.7/site-packages/airflow directory. Airflow's file storage directory defaults to /root/airflow; this directory is created automatically the first time an airflow command is run. Execute “airflow version” to check the installed Airflow version (and create the directory):

(python37) [root@node4 ~]# airflow version

2.1.3

Note: If you do not want to use the default “/root/airflow” directory as the file storage directory, you can set the AIRFLOW_HOME environment variable before running any airflow command:

(python37) [root@node4 ~]# vim /etc/profile

export AIRFLOW_HOME=/software/airflow



#Make the configured environment variables take effect

source /etc/profile

With the environment variable set, when you run an airflow command (for example, checking the version), the directory configured in “AIRFLOW_HOME” is used as Airflow's file storage directory.
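A quick way to confirm that the custom location is being used, assuming AIRFLOW_HOME was exported as above, is to run any airflow command and then list the directory (the exact contents may vary slightly by Airflow version):

#Running any airflow command creates the configured home directory
(python37) [root@node4 ~]# airflow version
2.1.3

#The custom directory should now exist and contain airflow.cfg
(python37) [root@node4 ~]# ls /software/airflow
airflow.cfg  webserver_config.py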

4) Configure the database used by Airflow as MySQL

Open the Airflow file storage directory ($AIRFLOW_HOME, by default “/root/airflow”); it contains the “airflow.cfg” configuration file. Modify the following settings:

[core]

dags_folder = /root/airflow/dags



#Modify time zone

default_timezone = Asia/Shanghai



# Configure database

sql_alchemy_conn = mysql+mysqldb://airflow:123456@node2:3306/airflow?use_unicode=true&charset=utf8



[webserver]

#Set time zone

default_ui_timezone = Asia/Shanghai



#Set DAG display mode

# Default DAG view. Valid values are: ``tree``, ``graph``, ``duration``, ``gantt``, ``landing_times``

dag_default_view = graph



[scheduler]

#Set how often the DAGs directory is scanned for new files; the default is 5 minutes

# How often (in seconds) to scan the DAGs directory for new files. Default to 5 minutes.

dag_dir_list_interval = 30
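To double-check that the edits were picked up from the right file, a simple grep (assuming the default $AIRFLOW_HOME of /root/airflow) can be used:

#Confirm the metadata database connection now points at MySQL
(python37) [root@node4 ~]# grep '^sql_alchemy_conn' /root/airflow/airflow.cfg
sql_alchemy_conn = mysql+mysqldb://airflow:123456@node2:3306/airflow?use_unicode=true&charset=utf8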

5) Install the required python dependency packages

Initializing the Airflow database requires a Python package that can connect to MySQL. Execute the following command to install it:

(python37) [root@node4 ~]# pip install mysqlclient -i https://pypi.tuna.tsinghua.edu.cn/simple
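A quick import test confirms that the driver built successfully against the MySQL headers installed earlier (the mysqlclient package exposes the MySQLdb module); no output and no error means the driver is usable:

#Verify that the MySQL driver can be imported
(python37) [root@node4 ~]# python -c "import MySQLdb"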

6) Initialize Airflow database

(python37) [root@node4 airflow]# airflow db init

After initialization, the corresponding tables will be generated in the MySQL airflow database.
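To verify the initialization, log in to MySQL on node2 and list the tables in the airflow database; if it succeeded, tables such as dag, dag_run, task_instance and ab_user should be present (the full list is considerably longer):

mysql> use airflow;
mysql> show tables;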

7) Create administrator user information

Execute the following command on the node4 node to create user information for operating Airflow:

airflow users create \

    --username airflow \

    --firstname airflow \

    --lastname airflow \

    --role Admin \

    --email [email protected]

When prompted, set the password to “123456” and confirm it to complete the creation of the Airflow administrator.
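The new account can be verified with the users subcommand; the output should list the airflow user with the Admin role:

#List Airflow users
(python37) [root@node4 airflow]# airflow users list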

3. Start Airflow

1) Start webserver

#Start webserver in foreground mode

(python37) [root@node4 airflow]# airflow webserver --port 8080



#Run the webserver in daemon mode; the default port is 8080. Use ps aux|grep webserver to view the background process

airflow webserver --port 8080 -D
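Once the webserver is up, a quick check from the shell, assuming it is listening on the default port 8080, is to request the health endpoint that the Airflow 2.x webserver exposes; it returns a small JSON document describing the metadatabase and scheduler status:

#Check that the webserver responds
(python37) [root@node4 ~]# curl http://node4:8080/health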

2) Start scheduler

Open a new window, switch to the python37 environment, and start the Scheduler:

#Start scheduler in foreground mode

(python37) [root@node4 ~]# airflow scheduler



#Run Scheduler in daemon mode, ps aux|grep scheduler to view background processes

 airflow scheduler -D
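When started with -D, Airflow writes pid files under $AIRFLOW_HOME (names such as airflow-scheduler.pid and airflow-webserver.pid; the exact set may vary by version), which is a convenient way to confirm that both daemons are running:

#Check the daemons' pid files and running processes
(python37) [root@node4 ~]# ls /root/airflow/*.pid
(python37) [root@node4 ~]# ps aux | grep airflow | grep -v grep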

3) Access Airflow webui

Browser access: http://node4:8080

Log in with the username created earlier (airflow) and the password (123456).