Table of Contents
1. Install Anaconda and Python 3.7
2. Install Airflow on a single machine
3. Start Airflow
Airflow is written in Python and is distributed as a Python package; installing it requires Python 3.6 or above. The metadata database supports PostgreSQL 9.6+, MySQL 5.7+, and SQLite 3.15.0+.
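Before installing, it is worth confirming that the interpreter meets Airflow's minimum Python requirement. A minimal sketch (the `python_ok` helper name is our own, not part of Airflow):

```python
import sys

def python_ok(version_info=sys.version_info, minimum=(3, 6)):
    """Return True when the running interpreter meets Airflow's
    minimum Python requirement (3.6+ for Airflow 2.x)."""
    return tuple(version_info[:2]) >= minimum

if __name__ == "__main__":
    print("Python version OK:", python_ok())
```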
1. Install Anaconda and Python 3.7
1) Download Anaconda from the official website, select the Linux version, and install it
Download official website address: https://www.anaconda.com/products/individual#macos
2) Upload the downloaded Anaconda installation package to the mynode4 node and install it
sh Anaconda3-2020.02-Linux-x86_64.sh
[Just press Enter]
Do you accept the license terms? [yes|no]
yes  [enter yes and continue]
... ...
Anaconda3 will now be installed into this location:
/root/anaconda3
  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below
[/root/anaconda3] >>>  [Just press Enter to install to the /root/anaconda3 path]
... ...
Do you wish the installer to initialize Anaconda3 by running conda init? [yes|no]
[no] >>> yes  [enter yes and press Enter]
... ...
[Installation completed]
3) Configure Anaconda's environment variables
Add the following line to /etc/profile:
export PATH=$PATH:/root/anaconda3/bin
#Make the environment variable take effect
source /etc/profile
4) Install a Python 3.7 environment
conda create -n python37 python=3.7
5) Activate the python37 environment
conda activate python37
[activates the python37 environment; "source activate" must be executed first]
The relevant commands are as follows:
source activate  [initializes conda; must be executed once, after which the conda command can be used to activate environments]
conda deactivate  [exit the current base environment]
conda activate python37  [activate the python37 environment]
conda deactivate  [exit the current python37 environment]
conda remove -n python37 --all  [delete the python37 environment]
2. Install Airflow on a single machine
When deploying Airflow on a single node, all Airflow processes run on one machine.
1) Install the system dependencies required for Airflow
Airflow requires some system libraries to work properly. Install the following dependencies on the mynode4 node:
yum -y install mysql-devel gcc gcc-devel python-devel gcc-c++ cyrus-sasl cyrus-sasl-devel cyrus-sasl-lib
2) Create the corresponding library in MySQL and set parameters
We use MySQL as Airflow's metadata database. Create the database and user that Airflow will use in the MySQL instance on the node2 node:
CREATE DATABASE airflow CHARACTER SET utf8;
create user 'airflow'@'%' identified by '123456';
grant all privileges on airflow.* to 'airflow'@'%';
flush privileges;
Modify “/etc/my.cnf” on the mysql installation node node2, and add the following content under [mysqld]:
[mysqld]
explicit_defaults_for_timestamp=1
Note: The explicit_defaults_for_timestamp system variable controls how the MySQL server handles default values and NULL values in TIMESTAMP columns. It was introduced in MySQL 5.6.6 and defaults to 0. With the default (0), a TIMESTAMP column that does not explicitly allow NULL is implicitly NOT NULL, and inserting NULL into it silently stores the current timestamp instead. When the variable is set to 1, a TIMESTAMP column that is not explicitly declared NOT NULL may hold NULL: inserting NULL stores NULL rather than the current timestamp, and if the column was declared NOT NULL, inserting NULL raises an error.
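The two behaviors can be illustrated with a small toy model of the rules described above (this is our own illustration, not MySQL code; `stored_timestamp` and its parameters are hypothetical names):

```python
from datetime import datetime

def stored_timestamp(value, explicit_defaults=1, not_null=False, now=None):
    """Toy model of what MySQL stores when `value` is inserted into a
    TIMESTAMP column, per the explicit_defaults_for_timestamp rules.
    Raises ValueError to mimic MySQL's NOT NULL insert error."""
    now = now or datetime(2024, 1, 1, 12, 0, 0)  # stand-in for CURRENT_TIMESTAMP
    if explicit_defaults == 0:
        # Legacy behavior: the column is implicitly NOT NULL, and an
        # inserted NULL is silently replaced by the current timestamp.
        return now if value is None else value
    # New behavior (=1): NULL is stored as NULL, unless NOT NULL was declared.
    if value is None and not_null:
        raise ValueError("Column cannot be null")
    return value
```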
For Airflow, this MySQL parameter must be set to 1. After modifying "my.cnf" as above, restart MySQL; after restarting, you can check whether the parameter took effect:
#Restart mysql
[root@node2 ~]# service mysqld restart
#Log back in to mysql and query
mysql> show variables like 'explicit_defaults_for_timestamp';
3) Install Airflow
Switch to the python37 environment on node4 and install Airflow, pinning the version to 2.1.3:
[root@node4 ~]# conda activate python37
(python37) [root@node4 ~]# pip install apache-airflow==2.1.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
By default, Airflow is installed in the $ANACONDA_HOME/envs/python37/lib/python3.7/site-packages/airflow directory. Airflow's file storage directory defaults to /root/airflow; this directory does not exist yet, but it is created automatically the first time an airflow command runs, e.g. "airflow version" to check the installed version:
(python37) [root@node4 ~]# airflow version
2.1.3
Note: If you do not want to use the default “/root/airflow” directory as the file storage directory, you can also set the environment variable before installing airflow:
(python37) [root@node4 ~]# vim /etc/profile
export AIRFLOW_HOME=/software/airflow
#Make the configured environment variable take effect
source /etc/profile
If Airflow is installed this way, checking the version afterwards will use the directory configured in "AIRFLOW_HOME" as Airflow's file storage directory.
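The lookup order above can be sketched in a few lines: the AIRFLOW_HOME environment variable wins, otherwise Airflow falls back to ~/airflow. A minimal sketch (the `resolve_airflow_home` helper is our own name, shown with an injectable `env` dict for illustration):

```python
import os

def resolve_airflow_home(env=None):
    """Mimic how Airflow picks its file storage directory:
    AIRFLOW_HOME from the environment if set, else ~/airflow."""
    env = os.environ if env is None else env
    return env.get("AIRFLOW_HOME",
                   os.path.join(os.path.expanduser("~"), "airflow"))
```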
4) Configure the database used by Airflow as MySQL
Open the configured Airflow file storage directory ($AIRFLOW_HOME, by default "/root/airflow"); it contains the "airflow.cfg" configuration file. Modify the configuration as follows:
[core]
dags_folder = /root/airflow/dags
#Modify the time zone
default_timezone = Asia/Shanghai
#Configure the database
sql_alchemy_conn = mysql+mysqldb://airflow:123456@node2:3306/airflow?use_unicode=true&charset=utf8

[webserver]
#Set the time zone
default_ui_timezone = Asia/Shanghai
#Set the DAG display mode
# Default DAG view. Valid values are: ``tree``, ``graph``, ``duration``, ``gantt``, ``landing_times``
dag_default_view = graph

[scheduler]
#Set how often new DAG files are discovered; the default is 5 minutes
# How often (in seconds) to scan the DAGs directory for new files. Default to 5 minutes.
dag_dir_list_interval = 30
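Since airflow.cfg is an INI file, the edits above can be sanity-checked with Python's standard configparser before restarting anything. A small sketch (the `check_cfg` helper and the embedded sample are our own, mirroring the settings changed in this guide):

```python
import configparser

SAMPLE_CFG = """
[core]
dags_folder = /root/airflow/dags
default_timezone = Asia/Shanghai
sql_alchemy_conn = mysql+mysqldb://airflow:123456@node2:3306/airflow?use_unicode=true&charset=utf8

[webserver]
default_ui_timezone = Asia/Shanghai
dag_default_view = graph

[scheduler]
dag_dir_list_interval = 30
"""

def check_cfg(text):
    """Parse an airflow.cfg fragment and return the settings this guide changes."""
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(text)
    return {
        "conn": cp.get("core", "sql_alchemy_conn"),
        "tz": cp.get("core", "default_timezone"),
        "view": cp.get("webserver", "dag_default_view"),
        "interval": cp.getint("scheduler", "dag_dir_list_interval"),
    }
```

In practice you would pass the contents of $AIRFLOW_HOME/airflow.cfg instead of the sample string.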
5) Install the required python dependency packages
Initializing the Airflow database requires a Python package that can connect to MySQL. Execute the following command to install it:
(python37) [root@node4 ~]# pip install mysqlclient -i https://pypi.tuna.tsinghua.edu.cn/simple
6) Initialize Airflow database
(python37) [root@node4 airflow]# airflow db init
After initialization, the corresponding table will be generated under the MySQL airflow library.
7) Create administrator user information
Execute the following command on the node4 node to create user information for operating Airflow:
airflow users create \
 --username airflow \
 --firstname airflow \
 --lastname airflow \
 --role Admin \
 --email [email protected]
After the command runs, you will be prompted for a password; set it to "123456" and confirm it to complete the creation of the Airflow administrator account.
3. Start Airflow
1) Start webserver
#Start the webserver in foreground mode
(python37) [root@node4 airflow]# airflow webserver --port 8080

#Run the webserver in daemon mode; the default port is 8080. Use ps aux|grep webserver to view the background process
airflow webserver --port 8080 -D
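Especially in daemon mode, it is handy to confirm that the webserver is actually listening before opening a browser. A minimal sketch using only the standard library (the `port_open` helper is our own name):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds --
    a quick way to confirm the webserver (default port 8080) is up."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_open("node4", 8080)` should return True once the webserver has started.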
2) Start scheduler
Open a new window, switch to the python37 environment, and start the Scheduler:
#Start the scheduler in foreground mode
(python37) [root@node4 ~]# airflow scheduler

#Run the scheduler in daemon mode; use ps aux|grep scheduler to view the background process
airflow scheduler -D
3) Access Airflow webui
Browser access: http://node4:8080
Log in with the username created earlier: airflow, password: 123456