1. Project background
In recent years, as economies have developed and living standards have risen, environmental pollution has drawn increasing attention. Air pollution is one of the greatest environmental threats to human health, alongside climate change. Exposure to air pollution is estimated to cause 7 million premature deaths each year and the loss of millions of healthy life-years. Identifying which air pollution factors and which air quality indicators are most harmful to human health has therefore become a difficult problem for the relevant authorities.
2. Project requirements
1. Analyze the number of deaths due to air pollution and identify which type of air pollution is most dangerous.
2. Taking national and regional political, economic, and other factors into account, explore what drives deaths from air pollution.
3. Predict the total number of deaths due to air pollution in future years and suggest ways to improve air quality.
3. Project implementation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Use a font with CJK glyphs and render minus signs correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
1. Data Exploration
# Read data
air = pd.read_excel('death-rates-from-air-pollution.xlsx')
air.head()
# Basic data information
air.info()

# Shorten the column names
air = air.rename(
    columns={
        'Air pollution (total) (deaths per 100,000)': 'Air pollution (total)',
        'Indoor air pollution (deaths per 100,000)': 'Indoor air pollution',
        'Outdoor particulate matter (deaths per 100,000)': 'Outdoor particulate matter',
        'Outdoor ozone pollution (deaths per 100,000)': 'Outdoor ozone pollution'
    })
air.head()

# Check for null values
air.isna().sum()

# Is the yearly data complete for every country and region?
# Each entity should have 28 rows; list any that do not.
counts = air['Entity'].value_counts()
counts[counts != 28]

# Entities whose Code is null
ce = air[air['Code'].isna()]['Entity'].unique()
print('Number of countries and regions with a null Code:', len(ce))
print(ce)
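The completeness check above only reports which entities have an unexpected row count. A minimal sketch of a more explicit check follows, assuming a frame with `Entity` and `Year` columns like `air`; the helper name `missing_years` and the toy data are illustrative and not part of the original notebook.

```python
import pandas as pd

def missing_years(frame, entity_col="Entity", year_col="Year"):
    """Return {entity: sorted list of missing years} for entities with gaps.

    The expected range runs from the smallest to the largest year in the frame.
    """
    first, last = int(frame[year_col].min()), int(frame[year_col].max())
    expected = set(range(first, last + 1))
    gaps = {}
    for entity, years in frame.groupby(entity_col)[year_col]:
        missing = sorted(expected - set(years))
        if missing:
            gaps[entity] = missing
    return gaps

# Toy data (hypothetical values): entity "B" has no row for 1991
toy = pd.DataFrame({
    "Entity": ["A", "A", "A", "B", "B"],
    "Year": [1990, 1991, 1992, 1990, 1992],
})
gaps = missing_years(toy)
print(gaps)
```

On the real data, `missing_years(air)` would return an empty dictionary if every entity covers the full year range.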
2. Analysis
2.1. Total number of deaths caused by various types of air pollution
# Sum each pollution column over all entities and years
air_pol = air.iloc[:, 3:].sum()
x = air_pol.index
y = air_pol

bar = plt.bar(x, y)
plt.bar_label(bar, label_type='edge')
plt.title('Total number of deaths caused by various air pollution')
plt.xticks(rotation=-20)
plt.show()
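The figure shows that Air pollution (total) has the largest value, which is expected because it aggregates the other three categories. Note that the source columns are death rates per 100,000 people, so the bars are summed rates rather than absolute death counts.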
2.2. The total number of deaths caused by various types of air pollution every year
# Yearly totals for each pollution category
air_year = air.groupby('Year')[[
    'Air pollution (total)', 'Indoor air pollution',
    'Outdoor particulate matter', 'Outdoor ozone pollution'
]].sum()

x = np.arange(len(air_year))
air_pollution_categories = ['Air pollution (total)', 'Indoor air pollution',
                            'Outdoor particulate matter', 'Outdoor ozone pollution']
colors = ['-or', '-Dc', '-^g', '-dy']

# One line plot per category on a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(18, 10))
for i, category in enumerate(air_pollution_categories):
    row = i // 2
    col = i % 2
    y = air_year[category]
    axes[row, col].plot(x, y, colors[i])
    axes[row, col].set_title(f'{category}: number of deaths per year')
    axes[row, col].set_xticks(x, air_year.index, rotation=-50)
plt.tight_layout()
plt.show()
The four plots show that the values for all four categories have generally trended downward over time.
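The downward trend read off the plots can also be quantified with a least-squares slope. The sketch below is one way to do that, not part of the original analysis; the helper name `trend_slope` and the sample values are made up for illustration.

```python
import numpy as np

def trend_slope(years, values):
    """Least-squares slope of values against years (change per year)."""
    x = np.asarray(years, dtype=float)
    y = np.asarray(values, dtype=float)
    # Centering the years leaves the slope unchanged and keeps the fit well conditioned
    slope, _ = np.polyfit(x - x.mean(), y, deg=1)
    return float(slope)

# Made-up series that falls by 2 units per year
years = np.arange(1990, 1995)
values = [100.0, 98.0, 96.0, 94.0, 92.0]
slope = trend_slope(years, values)
print(slope)
```

A negative slope indicates a decline. With the notebook's data, the call would be `trend_slope(air_year.index, air_year['Air pollution (total)'])`, repeated for each category.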
2.3. Number of deaths caused by air pollution in various regions
# Totals per country/region for each pollution category
air_Entity = air.groupby('Entity')[[
    'Air pollution (total)', 'Indoor air pollution',
    'Outdoor particulate matter', 'Outdoor ozone pollution'
]].sum()

ls_air = [
    'Air pollution (total)', 'Indoor air pollution',
    'Outdoor particulate matter', 'Outdoor ozone pollution'
]
color = ['c', 'g', 'b', 'r']
x = np.arange(len(air_Entity))

# One bar chart per category, stacked vertically
fig, axes = plt.subplots(4, 1, figsize=(18, 38))
for i in range(4):
    y = air_Entity[ls_air[i]]
    axes[i].bar(x, y, color=color[i])
    axes[i].set_title(f'{ls_air[i]}: number of deaths in each region')
    axes[i].set_xticks(x, air_Entity.index, rotation=90)
plt.tight_layout()
plt.show()
2.4. Top 10 areas where deaths from ozone, particulate matter, and indoor pollution are mainly concentrated
ls = [
    'Outdoor ozone pollution', 'Outdoor particulate matter',
    'Indoor air pollution'
]
color = ['r', 'b', 'g']
pol = ['Ozone', 'Particulate matter', 'Indoor']
xs = np.arange(10)

# Top 10 regions for each pollution type
fig, axes = plt.subplots(3, 1, figsize=(12, 15))
for i in range(3):
    y = air_Entity[ls[i]].sort_values(ascending=False).head(10)
    bar = axes[i].bar(xs, y, color=color[i])
    axes[i].bar_label(bar, label_type='edge')
    axes[i].set_title(f'{pol[i]} pollution: top 10 areas by death toll')
    axes[i].set_xticks(xs, y.index, rotation=-80)
plt.tight_layout()
plt.show()
The top 10 areas in all three cases are developing countries and less developed regions, which suggests that deaths from air pollution are related to a region's environmental, economic, and political conditions.
3. Predict the total number of future deaths
3.1. Building features
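A sliding window is used to construct features for predicting the total number of deaths due to air pollution: the totals of three consecutive years (x1, x2, x3) are the features, and the following year's total (y) is the target.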
ls_x1 = []
ls_x2 = []
ls_x3 = []
ls_y = []

# Column 0 of air_year is 'Air pollution (total)'
for i in range(len(air_year) - 3):
    ls_x1.append(air_year.iloc[i, 0])
    ls_x2.append(air_year.iloc[i + 1, 0])
    ls_x3.append(air_year.iloc[i + 2, 0])
    ls_y.append(air_year.iloc[i + 3, 0])

data = {'x1': ls_x1, 'x2': ls_x2, 'x3': ls_x3, 'y': ls_y}
df = pd.DataFrame(data)
df
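The same sliding-window construction can be written for an arbitrary window length. This is a minimal sketch, assuming a one-dimensional series of yearly totals; the helper name `make_windows` and the toy values are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def make_windows(series, window=3):
    """Build a frame with `window` lagged features (x1..xN) and the next value as y."""
    values = np.asarray(series, dtype=float)
    # Each row holds `window` consecutive values followed by the value to predict
    rows = [values[i:i + window + 1] for i in range(len(values) - window)]
    cols = [f"x{k + 1}" for k in range(window)] + ["y"]
    return pd.DataFrame(rows, columns=cols)

# Toy series (hypothetical values)
toy = [10, 9, 8, 7, 6]
windows = make_windows(toy, window=3)
print(windows)
```

With the notebook's data, `make_windows(air_year['Air pollution (total)'])` should produce the same columns as the loop above.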
3.2. Model training and testing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Split the data: the first 20 windows for training, the rest for testing
train = df.iloc[:20, :]
test_x, test_y = df.iloc[20:, 0:3], df.iloc[20:, 3]

# Shuffle the training rows
train = train.sample(frac=1).reset_index(drop=True)
train_x, train_y = train.iloc[:, 0:3], train.iloc[:, 3]

# Fit a linear regression model
line = LinearRegression()
line.fit(train_x, train_y)
pred = line.predict(test_x)

# Model evaluation metrics: R^2, MAE, MSE
print(r2_score(test_y, pred))
print(mean_absolute_error(test_y, pred))
print(mean_squared_error(test_y, pred))
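Error metrics are easier to interpret next to a simple reference. The sketch below compares a linear regression with a naive "repeat last year's value" baseline on a synthetic declining series. The baseline comparison is an addition for illustration, not something the original notebook does, and the series values are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic series (hypothetical): 28 yearly values falling by 2 per year
series = 100.0 - 2.0 * np.arange(28)
window = 3

# Sliding-window features and targets
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

# Time-ordered split: the first 20 windows train, the remainder tests
split = 20
model = LinearRegression().fit(X[:split], y[:split])

mae_model = mean_absolute_error(y[split:], model.predict(X[split:]))
# Naive baseline: predict that next year equals the most recent year in the window
mae_naive = mean_absolute_error(y[split:], X[split:, -1])

print(mae_model, mae_naive)
```

If the fitted model does not beat the naive baseline on the held-out years, the regression adds little over simply carrying the last value forward.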
3.3. Forecast
The same sliding-window approach is used to forecast future years: each predicted value is appended to the known values and then used as a feature for the next prediction.
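# Forecast the number of deaths caused by air pollution from 2018 to 2027
# Start from the last three known yearly totals
feature_x = list(air_year['Air pollution (total)'].iloc[-3:].values)
feature_y = []

for i in range(10):
    # Use the same column names as in training to avoid a feature-name warning
    feature = pd.DataFrame([feature_x], columns=['x1', 'x2', 'x3'])
    pred_y = line.predict(feature)[0]
    feature_y.append(pred_y)
    # Slide the window: append the prediction and keep the latest three values
    feature_x.append(pred_y)
    feature_x = feature_x[-3:]

pred_fea = {'year': np.arange(2018, 2028), 'pred_value': feature_y}
pred_fea = pd.DataFrame(pred_fea)
pred_fea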
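The recursive forecasting loop can be wrapped in a reusable function. This is a minimal sketch under stated assumptions: the helper name `recursive_forecast` is hypothetical, the model is any fitted scikit-learn regressor trained on plain arrays, and the series is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def recursive_forecast(model, history, steps, window=3):
    """Forecast `steps` values, feeding each prediction back in as a feature."""
    recent = [float(v) for v in history[-window:]]
    preds = []
    for _ in range(steps):
        nxt = float(model.predict(np.array(recent).reshape(1, -1))[0])
        preds.append(nxt)
        # Drop the oldest value and append the new prediction
        recent = recent[1:] + [nxt]
    return preds

# Synthetic series (hypothetical): falls by 2 per year and ends at 46
series = 100.0 - 2.0 * np.arange(28)
X = np.array([series[i:i + 3] for i in range(len(series) - 3)])
y = series[3:]
model = LinearRegression().fit(X, y)

forecast = recursive_forecast(model, series, steps=5)
print(forecast)
```

Because each step consumes earlier predictions, errors can compound over the horizon, so forecasts far into the future deserve more caution than the first few.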
The predicted values follow the same trend as the known values, with the number of deaths decreasing as the years go by.