Analysis of hazard factors based on air pollution data

1. Project background

In recent years, as economies have developed and living standards have risen, environmental pollution has attracted growing attention. Air pollution is one of the greatest environmental threats to human health, alongside climate change. Exposure to air pollution is estimated to cause 7 million premature deaths each year and the loss of millions of healthy life years. Identifying which air pollution factors and which air quality indicators are most harmful to human health has therefore become a difficult problem for the relevant authorities.

2. Project requirements

1. Analyze the number of deaths due to air pollution and identify which type of air pollution is most dangerous.

2. Taking national and regional political, economic, and other factors into account, explore the causes of deaths due to air pollution.

3. Predict the future total number of deaths due to air pollution and propose suggestions for improving air quality.

3. Project implementation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

1. Data Exploration

# Read data
air = pd.read_excel('death-rates-from-air-pollution.xlsx')
air.head()

# Basic data information
air.info()

# Rename columns: drop the "(deaths per 100,000)" unit suffix
air = air.rename(
    columns={
        'Air pollution (total) (deaths per 100,000)': 'Air pollution (total)',
        'Indoor air pollution (deaths per 100,000)': 'Indoor air pollution',
        'Outdoor particulate matter (deaths per 100,000)':
        'Outdoor particulate matter',
        'Outdoor ozone pollution (deaths per 100,000)':
        'Outdoor ozone pollution'
    })
air.head()

# Count missing values in each column
air.isna().sum()

# Check whether every country and region has a complete set of years (28 rows each)
counts = air['Entity'].value_counts()
counts[counts != 28]

ce = air[air['Code'].isna()]['Entity'].unique()
print('Number of countries and regions with a null Code:', len(ce))
print(ce)
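
Entities without a Code may be aggregate regions rather than individual countries; this is an assumption that should be checked against the list printed above. If it holds, leaving them in would count some populations more than once in later sums. A minimal sketch on a made-up frame (`toy`, `countries_only`, and all values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the data set; names and values are illustrative only.
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'Region X', 'Region X'],
    'Code': ['AAA', 'AAA', None, None],
    'Year': [1990, 1991, 1990, 1991],
    'Air pollution (total)': [100.0, 90.0, 150.0, 140.0],
})

# Keep only the rows that carry a country code.
countries_only = toy[toy['Code'].notna()]

# Entities removed by the filter.
dropped = sorted(set(toy['Entity']) - set(countries_only['Entity']))
print(dropped)
```

The same filter could be applied to `air` before grouping, if the entities without a Code do turn out to be aggregates.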

2. Analysis

2.1. Total number of deaths caused by various types of air pollution

air_pol = air.iloc[:, 3:].sum()

x = air_pol.index
y = air_pol

bar = plt.bar(x, y)
plt.bar_label(bar, label_type='edge')
plt.title('Total number of deaths caused by each type of air pollution')
plt.xticks(rotation=-20)
plt.show()

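The figure shows that Air pollution (total) has the largest value. This is expected if the total column aggregates the individual categories, so on its own it says little about which individual type is most dangerous. Note also that the source columns are rates (deaths per 100,000), so these sums are totals of rates rather than raw death counts.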
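
To compare the individual pollution types, one option is to drop the total column before ranking. The sketch below uses made-up numbers (`toy_totals` is hypothetical and does not reflect the real data); the same two steps could be applied to `air_pol`.

```python
import pandas as pd

# Made-up sums for illustration only; these are NOT results from the data set.
toy_totals = pd.Series({
    'Air pollution (total)': 300.0,
    'Indoor air pollution': 120.0,
    'Outdoor particulate matter': 150.0,
    'Outdoor ozone pollution': 30.0,
})

# Exclude the aggregate column, then rank the individual types.
by_type = toy_totals.drop('Air pollution (total)').sort_values(ascending=False)
largest_type = by_type.index[0]
print(by_type)
```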

2.2. Total number of deaths caused by each type of air pollution per year

air_year = air.groupby('Year')[[
    'Air pollution (total)', 'Indoor air pollution',
    'Outdoor particulate matter', 'Outdoor ozone pollution'
]].sum()


x = np.arange(len(air_year))
air_pollution_categories = ['Air pollution (total)', 'Indoor air pollution', 'Outdoor particulate matter', 'Outdoor ozone pollution']
colors = ['-or', '-Dc', '-^g', '-dy']

fig, axes = plt.subplots(2, 2, figsize=(18, 10))

for i, category in enumerate(air_pollution_categories):
    # Map the flat index onto the 2 x 2 grid of subplots
    row = i // 2
    col = i % 2

    y = air_year[category]
    axes[row, col].plot(x, y, colors[i])
    axes[row, col].set_title(f'{category}: deaths per year')
    axes[row, col].set_xticks(x, air_year.index, rotation=-50)

plt.tight_layout()
plt.show()

The four figures above show that the number of deaths in every category has generally declined over time.
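
A visual impression of a downward trend can be backed by a number. One simple way, sketched here on made-up values, is to fit a straight line and read off its slope; `years` and `values` are hypothetical stand-ins for `air_year.index` and one column of `air_year`.

```python
import numpy as np

# Illustrative yearly totals (made-up numbers, not taken from the data set).
years = np.arange(1990, 1996)
values = np.array([120.0, 115.0, 111.0, 104.0, 100.0, 95.0])

# Fit a degree-1 polynomial; the first coefficient is the average change per year.
slope, intercept = np.polyfit(years, values, 1)
is_declining = slope < 0
print(round(slope, 2), is_declining)
```

A negative slope indicates an average decline per year, although a single slope hides any year-to-year fluctuation.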

2.3. Number of deaths caused by air pollution in various regions

air_Entity = air.groupby('Entity')[[
    'Air pollution (total)', 'Indoor air pollution',
    'Outdoor particulate matter', 'Outdoor ozone pollution'
]].sum()

ls_air = [
    'Air pollution (total)', 'Indoor air pollution',
    'Outdoor particulate matter', 'Outdoor ozone pollution'
]
color = ['c', 'g', 'b', 'r']
x = np.arange(len(air_Entity))

fig, axes = plt.subplots(4, 1, figsize=(18, 38))

for i in range(4):
    y = air_Entity[ls_air[i]]
    axes[i].bar(x, y, color=color[i])
    axes[i].set_title(f'{ls_air[i]}: number of deaths in each region')
    axes[i].set_xticks(x, air_Entity.index, rotation=90)

plt.tight_layout()
plt.show()

2.4. Top 10 regions with the most deaths from ozone, particulate matter, and indoor pollution

ls = [
    'Outdoor ozone pollution', 'Outdoor particulate matter',
    'Indoor air pollution'
]
color = ['r', 'b', 'g']
pol = ['ozone', 'particulate matter', 'indoor']
xs = np.arange(10)

fig, axes = plt.subplots(3, 1, figsize=(12, 15))

for i in range(3):
    y = air_Entity[ls[i]].sort_values(ascending=False).head(10)
    bar = axes[i].bar(xs, y, color=color[i])
    axes[i].bar_label(bar, label_type='edge')
    axes[i].set_title(f'Top 10 regions by deaths from {pol[i]} pollution')
    axes[i].set_xticks(xs, y.index, rotation=-80)

plt.tight_layout()
plt.show()

The top 10 regions in all three cases are developing countries and less developed regions, which suggests that deaths from air pollution are related to a region's environment, economy, and politics.

3. Predict the total number of future deaths

3.1. Building features

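The sliding window method is used to construct features for predicting the future total number of deaths due to air pollution: each run of three consecutive yearly values (x1, x2, x3) is the input, and the value of the following year (y) is the target.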

ls_x1 = []
ls_x2 = []
ls_x3 = []
ls_y = []
for i in range(len(air_year) - 3):
    ls_x1.append(air_year.iloc[i, 0])
    ls_x2.append(air_year.iloc[i + 1, 0])
    ls_x3.append(air_year.iloc[i + 2, 0])
    ls_y.append(air_year.iloc[i + 3, 0])

data = {'x1': ls_x1, 'x2': ls_x2, 'x3': ls_x3, 'y': ls_y}
df = pd.DataFrame(data)
df
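
The loop above can also be written as a small reusable function. This is only a sketch: `make_windows` is a hypothetical helper name, and the input sequence is made up to make the row layout easy to check by eye.

```python
import pandas as pd

def make_windows(series, window=3):
    """Turn a 1-D sequence into lagged features x1..x{window} and a target y."""
    values = list(series)
    rows = []
    for start in range(len(values) - window):
        features = values[start:start + window]  # `window` consecutive values
        target = values[start + window]          # the value that follows them
        rows.append(features + [target])
    columns = [f'x{k + 1}' for k in range(window)] + ['y']
    return pd.DataFrame(rows, columns=columns)

# Made-up sequence for illustration.
demo = make_windows([10, 20, 30, 40, 50], window=3)
print(demo)
```

Called as `make_windows(air_year['Air pollution (total)'])`, it should produce the same four columns as `df`.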

3.2. Model training and testing

# Split the data in time order: the first 20 windows for training, the rest for testing
train = df.iloc[:20, :]
test_x, test_y = df.iloc[20:, 0:3], df.iloc[20:, 3]
train = train.sample(frac=1).reset_index(drop=True)  # Shuffle the training rows
train_x, train_y = train.iloc[:, 0:3], train.iloc[:, 3]

from sklearn.linear_model import LinearRegression  # Use a linear regression model

line = LinearRegression()
line.fit(train_x, train_y)
pred = line.predict(test_x)

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error # Model evaluation indicators

print('R2: ', r2_score(test_y, pred))
print('MAE:', mean_absolute_error(test_y, pred))
print('MSE:', mean_squared_error(test_y, pred))
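
The three scores printed above follow standard definitions. As a rough sketch of what they measure, the hypothetical helper `regression_metrics` below recomputes them with NumPy on made-up numbers; it is meant for illustration, not as a replacement for the scikit-learn functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R2, MAE and MSE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # mean absolute error
    mse = np.mean(errors ** 2)                       # mean squared error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return {'r2': r2, 'mae': mae, 'mse': mse}

# Made-up values for illustration.
scores = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 8.0])
print(scores)
```

If the test split contains only a handful of rows, as it would with 28 years of data, R2 in particular can swing widely, so it is worth reading it alongside the absolute error.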

3.3. Forecast

The same sliding window is used to predict values for future years: each predicted value is appended to the known values and becomes part of the feature window for the next prediction.

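# Forecast the number of deaths caused by air pollution from 2018 to 2027
# Start from the last three known yearly values
feature_x = air_year['Air pollution (total)'].iloc[-3:].tolist()
feature_y = []
for i in range(10):
    # Use the training column names so the input matches what the model was fitted on
    feature = pd.DataFrame([feature_x], columns=train_x.columns)
    pred_y = line.predict(feature)[0]
    feature_y.append(pred_y)
    # Slide the window: append the prediction and keep the latest three values
    feature_x.append(pred_y)
    feature_x = feature_x[-3:]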

pred_fea = {'year': np.arange(2018,2028), 'pred_value': feature_y }
pred_fea = pd.DataFrame(pred_fea)
pred_fea

The predicted values continue the trend of the known values: the number of deaths decreases year by year.
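
The forecasting loop in this section is an instance of recursive (iterated) one-step-ahead prediction. The sketch below isolates that idea with a stand-in predictor; `recursive_forecast` and `mean_of_window` are hypothetical names, and the window mean is used only to keep the example free of any fitted model.

```python
def recursive_forecast(predict_next, history, steps, window=3):
    """Repeatedly predict one step ahead, feeding each prediction back in."""
    recent = list(history[-window:])
    forecasts = []
    for _ in range(steps):
        next_value = predict_next(recent)
        forecasts.append(next_value)
        # Slide the window forward: drop the oldest value, add the prediction.
        recent = recent[1:] + [next_value]
    return forecasts

def mean_of_window(window_values):
    """Stand-in predictor (an assumption for this sketch): the window mean."""
    return sum(window_values) / len(window_values)

demo_forecast = recursive_forecast(mean_of_window, [9.0, 6.0, 3.0], steps=2)
print(demo_forecast)
```

Because every later step is built on earlier predictions rather than observed values, errors can accumulate, so forecasts further into the horizon deserve more caution than the first few.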