Python data analysis case 28 – traffic accident prediction in Seattle (unbalanced sample processing)

This case is suitable for students working in machine learning and data science.

Introduction

Traffic accidents are a serious public safety problem: every year, large numbers of people die in traffic accidents around the world. As transportation develops and urbanization accelerates, traffic accidents have become one of the main factors limiting urban development and residents' well-being. By analyzing the causes and severity of traffic accidents and developing effective prediction models, we can help prevent accidents and reduce the harm they cause to society.

The goal of this study is to predict the severity of traffic accidents with machine learning models, so as to protect urban residents from serious injuries and help prevent collisions. To achieve this goal, we analyze the factors that affect accident severity and build an accurate, reliable prediction model.

At present, accident severity is usually assessed from experience and aggregate statistics. The drawback of this approach is its subjectivity and limited scope: it cannot accurately predict the occurrence or severity of individual accidents. In contrast, a machine learning model can learn from data and be tuned to predict accident severity as accurately as possible. This study compares the existing approach with our proposed one to illustrate the advantages of machine learning models for traffic accident prediction.

Finally, we support the study by analyzing ten years of traffic accident data for the city of Seattle. The response variable y is SEVERITYCODE (the accident severity code), because the severity of an accident is the quantity we care about most and the target we want to predict. The remaining columns serve as candidate predictor variables x that help us predict accident severity. The study aims to provide a scientific basis for Seattle's transportation authorities when formulating more effective accident prevention measures and policies.

Import packages

#Import common packages for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
 
%matplotlib inline
plt.rcParams['axes.unicode_minus'] = False # display minus signs correctly on axes

Data Exploration

# read data
data=pd.read_csv('Collisions.csv', na_values='Unknown')
#View the first three rows of data
data.head(3)

#Automatically infer better dtypes for the columns
data = data.infer_objects()
#View data information
data.info()

# variable meanings
'''OBJECTID: identifier of the incident object
INCKEY: internal unique incident key
COLDETKEY: identifier for the collected collision data
REPORTNO: incident report number
STATUS: incident handling status
ADDRTYPE: address type
INTKEY: intersection identifier
LOCATION: location of the accident
EXCEPTRSNCODE: exception reason code
EXCEPTRSNDESC: exception reason description
SEVERITYCODE: accident severity code
SEVERITYDESC: accident severity description
COLLISIONTYPE: collision type
PERSONCOUNT: number of people involved
PEDCOUNT: number of pedestrians involved
PEDCYLCOUNT: number of bicycles involved
VEHCOUNT: number of vehicles involved
INJURIES: number of injuries
SERIOUSINJURIES: number of serious injuries
FATALITIES: number of fatalities
INCDATE: date of the incident
INCDTTM: date and time of the incident
JUNCTIONTYPE: junction type
SDOT_COLCODE: collision code assigned by SDOT
SDOT_COLDESC: collision description assigned by SDOT
INATTENTIONIND: inattention indicator
UNDERINFL: whether the driver was under the influence
WEATHER: weather conditions
ROADCOND: road condition
LIGHTCOND: light condition
PEDROWNOTGRNT: pedestrian right-of-way not granted
SDOTCOLNUM: collision number assigned by SDOT
SPEEDING: speeding indicator
ST_COLCODE: state collision code
ST_COLDESC: state collision description
SEGLANEKEY: lane segment key
CROSSWALKKEY: crosswalk key
HITPARKEDCAR: whether a parked vehicle was hit'''

#observe missing values
import missingno as msno
msno.matrix(data)

Data preprocessing

Feature Preprocessing

'''OBJECTID, INCKEY, COLDETKEY, REPORTNO'''
#These four are ID variables whose values are unique to each row; they carry no information for the model, so delete them
data.drop(columns=['OBJECTID','INCKEY','COLDETKEY','REPORTNO'],inplace=True)

#If there is a row with all null values, delete it

data.dropna(how='all',inplace=True)

#Delete feature variables that take only a single value: if a column has the same value for every sample, it cannot distinguish samples and is useless for the model

for col in data.columns:
    if len(data[col].value_counts())==1:
        print(col)
        data.drop(col,axis=1,inplace=True)

#Delete columns whose missing ratio is too high: if a feature is mostly missing, it is better to drop it. The threshold used here is 20%

miss_ratio=0.2
for col in data.columns:
    if data[col].isnull().sum()>data.shape[0]*miss_ratio:
        print(col)
        data.drop(col,axis=1,inplace=True)

Of the original 38 variables, 11 have now been deleted.

Next, the data needs further refinement: we split the columns into numeric and other types.

Look at the numeric variables first.

Numeric variable preprocessing

#View numeric data,
pd.set_option('display.max_columns', 20)
data.select_dtypes(exclude=['object']).head()

For machine learning, features with more variation are generally more informative: the more spread out a feature's values are, the better X can distinguish samples and support classification.

So variables whose values are heavily concentrated on a single value can be discarded.

Variance could be used to measure this, but because the variables are on very different scales, variances are hard to compare directly.

Instead we use the variation ratio to measure how dispersed each variable is; if the variation ratio is below 0.02, the variable is flagged for deletion.
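Concretely, for a variable with $n$ observations whose most frequent value appears $f_{\text{mode}}$ times, the variation ratio is

$$\text{variation ratio} = 1 - \frac{f_{\text{mode}}}{n},$$

so a value close to 0 means almost all samples take the same value; this is exactly what the code below computes.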

#Calculate the variation ratio of each numeric variable
variation_ratio_s=0.02
for col in data.select_dtypes(exclude=['object']).columns:
    df_count = data[col].value_counts()
    kind = df_count.index[0]
    variation_ratio = 1 - (df_count.iloc[0] / len(data[col]))
    if variation_ratio < variation_ratio_s:
        print(f'{col}: most frequent value is {kind}, variation ratio {round(variation_ratio,4)} is too small, candidate for deletion')
        #data.drop(col,axis=1,inplace=True)

We find that the variation ratios of SERIOUSINJURIES, FATALITIES, SEGLANEKEY, and CROSSWALKKEY are very small; most of their values are 0.

However, SERIOUSINJURIES and FATALITIES are closely related to accident severity, so we keep them.

SEGLANEKEY and CROSSWALKKEY are the lane-segment ID and crosswalk ID, which should contribute little to the prediction, so we delete them.

##Delete SEGLANEKEY and CROSSWALKKEY
data.drop(columns=['SEGLANEKEY','CROSSWALKKEY'],inplace=True)

Object variable preprocessing

View the first three rows of non-numeric variables

data.select_dtypes(include=['object']).head(3)

#'LOCATION','SDOT_COLDESC','ST_COLCODE','ST_COLDESC':
#these four variables are incident-specific descriptions, some of them free text, and cannot be used as feature variables, so delete them
#'SEVERITYDESC','STATUS','SDOT_COLCODE':
#these variables carry no explanatory value for accident severity, so delete them as well
data.drop(columns=['LOCATION','SDOT_COLDESC','ST_COLCODE','ST_COLDESC','SEVERITYDESC','STATUS','SDOT_COLCODE'], inplace=True)

Fill missing values

1. Delete missing values: for variables with only a few missing rows, the rows containing missing values can simply be dropped.

Our response variable SEVERITYCODE has missing values, but only one, so deleting that row will not noticeably affect the overall data (a quick check of the missing counts follows below).

Moreover, missing values can cause interference and errors in subsequent modeling and analysis, so in this case we delete the affected row directly.

2. Fill missing values for the other variables:

(1) Categorical variables (such as ADDRTYPE, COLLISIONTYPE, JUNCTIONTYPE, UNDERINFL, WEATHER, ROADCOND, LIGHTCOND) can be filled with their mode.
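Before dropping and filling, an optional look at the per-column missing counts shows where values are missing; this is just a sanity check and is not required for the rest of the pipeline:

#Optional sanity check: missing-value counts of the remaining columns
print(data.isnull().sum().sort_values(ascending=False).head(10))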

# remove missing values
data.dropna(subset=['SEVERITYCODE'], inplace=True)

# Fill categorical variables with their mode
for col in ['ADDRTYPE','COLLISIONTYPE','JUNCTIONTYPE','UNDERINFL','WEATHER','ROADCOND','LIGHTCOND']:
    data[col] = data[col].fillna(data[col].mode()[0])

Feature Engineering

Convert the two time variables to datetime format.

data['INCDTTM'] = pd.to_datetime(data['INCDTTM'])
data['INCDATE'] = pd.to_datetime(data['INCDATE'])

The INCDTTM variable includes the hour, so it is more detailed, and extracting time features from INCDTTM alone is sufficient. We extract the year, month, and hour of each accident: the day of the month has no obvious meaning, while the month captures seasonality and the hour captures the times of day when accidents are most likely, so year, month, and hour are chosen as new features.

data['year'] = data['INCDTTM'].dt.year
data['month'] = data['INCDTTM'].dt.month
data['hour'] = data['INCDTTM'].dt.hour

After building the new features, the original datetime columns are no longer needed and can be dropped.

data.drop(columns=['INCDTTM','INCDATE'],inplace=True)

UNDERINFL contains categories with duplicate meanings (Y/N mixed with 1/0), which need to be unified; a quick check follows.
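An optional look at the raw labels confirms the overlap:

#Optional check: UNDERINFL mixes 'Y'/'N' with '1'/'0' labels
print(data['UNDERINFL'].unique())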

# change 0 to N, 1 to Y
data['UNDERINFL']=data['UNDERINFL'].map({'N':'N','0':'N','Y':'Y','1':'Y'})

#View processed data information

print(data.shape)
data.info()

There are no missing values anymore.

#First separate the response variable y from the feature data
y=data['SEVERITYCODE']
data.drop(['SEVERITYCODE'],axis=1,inplace=True)

##Check how many categories each categorical variable has

for col in data.select_dtypes(include=['object']).columns:
    print(f"{col} has {len(data[col].unique())} categories")

Data Visualization

Numeric variable plots

#Box plots of the numeric feature variables
columns = data.select_dtypes(exclude=['object']).columns.tolist() # column names
dis_cols = 4            # number of plots per row
dis_rows = len(columns) # upper bound on the number of rows
plt.figure(figsize=(4 * dis_cols, 4 * dis_rows), dpi=128)
for i in range(len(columns)):
    plt.subplot(dis_rows, dis_cols, i + 1)
    sns.boxplot(data=data[columns[i]], orient="v", width=0.5)
    plt.xlabel(columns[i], fontsize=20)
plt.tight_layout()
#plt.savefig('Feature variable box plots', format='png', dpi=500)
plt.show()

#Kernel density plots of the numeric feature variables
dis_cols = 4            # number of plots per row
dis_rows = len(columns) # upper bound on the number of rows
plt.figure(figsize=(4 * dis_cols, 4 * dis_rows), dpi=128)

for i in range(len(columns)):
    ax = plt.subplot(dis_rows, dis_cols, i + 1)
    ax = sns.kdeplot(data[columns[i]], color="Red", fill=True)
    ax.set_xlabel(columns[i], fontsize=20)
    ax.set_ylabel("Frequency", fontsize=18)
plt.tight_layout()
#plt.savefig('Feature variable kernel density plots', format='png', dpi=500)
plt.show()

Categorical variable plots

#Bar charts of the value counts of each categorical feature
columns = data.select_dtypes(include=['object']).columns.tolist() # column names
dis_cols = 3            # number of plots per row
dis_rows = len(columns) # upper bound on the number of rows
plt.figure(figsize=(5 * dis_cols, 5 * dis_rows), dpi=128)
for i in range(len(columns)):
    plt.subplot(dis_rows, dis_cols, i + 1)
    sns.barplot(x=data[columns[i]].value_counts().index, y=data[columns[i]].value_counts().values)
    #sns.barplot(data=data[columns[i]].value_counts(), orient="v", width=0.5)
    plt.xlabel(columns[i], fontsize=15)
    plt.xticks(rotation=45)
plt.tight_layout()
#plt.savefig('Categorical feature bar charts', format='png', dpi=500)
plt.show()

Correlation Analysis

#Draw the Spearman correlation heatmap of the numeric variables
fig, ax = plt.subplots(figsize=(18, 16), dpi=128)
sns.heatmap(data.select_dtypes(exclude=['object']).corr(method='spearman'), annot=True, square=True, ax=ax)

One-hot encoding of categorical variables

#Many of the features are text-type categorical variables, so they need to be one-hot encoded

data=pd.get_dummies(data)
data.shape

View the distribution of the response variable

y.value_counts()

Visualization

# View the distribution of y
# classification problem
plt.figure(figsize=(8,3), dpi=128)
plt.subplot(1,2,1)
y.value_counts().plot.bar(title='Response variable bar chart')
plt.subplot(1,2,2)
y.value_counts().plot.pie(title='Response variable pie chart')
#plt.savefig('response variable.png')
plt.tight_layout()
plt.show()

##The distribution of the response variable is extremely unbalanced. We map severity codes 0, 1, 2, and 2b to 0, and code 3 (a very serious accident) to 1, which turns this into a binary classification problem

y=y.map({'0':0,'1':0,'2':0,'2b':0,'3':1})
y.value_counts()

Class balancing for y

When dealing with extremely unbalanced classes, the following methods can be considered:

Undersampling: randomly select a subset of the majority class so that its size is similar to that of the minority class. This method is simple and fast, but some useful information may be lost.

Oversampling: randomly duplicate samples from the minority class so that its size is similar to that of the majority class. This makes full use of the data set, but may lead to overfitting.

SMOTE (Synthetic Minority Over-sampling Technique): a commonly used oversampling method that generates new minority-class samples by interpolating between existing ones, which helps reduce the overfitting risk of plain duplication (a brief sketch follows after this list).

Mixed sampling: combine undersampling and oversampling to reduce the data size while still making good use of the data set; for example, oversample the minority class first and then undersample the majority class, which is what we do below.
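For reference, here is a minimal SMOTE sketch with imblearn. It is not used in the rest of this case, the 0.2 sampling ratio is only illustrative, and since the features here are one-hot encoded, interpolation produces fractional dummy values, so a variant such as SMOTENC may be preferable in practice:

#Reference only: SMOTE oversampling (not used below); the 0.2 ratio is illustrative
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=0.2, random_state=0)
X_sm, y_sm = smote.fit_resample(data, y)
print("before:", y.value_counts().to_dict())
print("after: ", y_sm.value_counts().to_dict())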

We first oversample the minority class.

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(sampling_strategy=0.0025)
X_train_ns, y_train_ns = ros.fit_resample(data, y)
print("The number of classes before fit {}".format(y.value_counts().to_dict()))
print("The number of classes after fit {}".format(y_train_ns.value_counts().to_dict()))

This increases the number of samples with y = 1 to 599.

Then we undersample the majority class with a sampling ratio of 0.2, i.e. class 1 will be about 20% as large as class 0 (roughly an 80/20 split). The classes are still unbalanced, but much less so than before, and the data is now usable for training.

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy=0.2)
X_train_ns2, y_train_ns2 = rus.fit_resample(X_train_ns, y_train_ns)
print("The number of classes before fit {}".format(y_train_ns.value_counts().to_dict()))
print("The number of classes after fit {}".format(y_train_ns2.value_counts().to_dict()))

View the current shape of the data

print(X_train_ns2.shape,y_train_ns2.shape)

There are 3,594 samples in total: 2,995 in class 0 and 599 in class 1, with 57 feature variables x.

Algorithm Implementation

#Divide the training set and test set, check their shape
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X_train_ns2, y_train_ns2, stratify=y_train_ns2, test_size=0.2, random_state=0)
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

#data standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print('Training data shape:')
print(X_train_s.shape, y_train.shape)
print('Test data shape:')
print(X_test_s.shape, y_test.shape)

The code above first splits the data into a training set and a test set, with 20% held out for testing and random seed 0, then standardizes the features and prints the shapes of the training and test arrays. With 3,594 samples in total, this gives roughly 2,875 training samples and 719 test samples, each with 57 feature variables.

#Use five models to compare the accuracy of the validation set

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
#Logistic regression
model1 = LogisticRegression(C=1e10,max_iter=10000)

#K-nearest neighbors
model2 = KNeighborsClassifier(n_neighbors=10)
 
#decision tree
model3 = DecisionTreeClassifier(random_state=77)
 
#random forest
model4= RandomForestClassifier(n_estimators=1000, max_features='sqrt', random_state=10)
 
#Support Vector Machines
model5 = SVC(kernel="rbf", random_state=77)
 
model_list=[model1,model2,model3,model4,model5]
model_name=['Logistic Regression', 'K-Nearest Neighbors', 'Decision Tree', 'Random Forest','Support Vector Machine']

Evaluation criteria

This is a classification problem, so four commonly used and reliable evaluation metrics for classification, accuracy, precision, recall, and F1 score, are used to evaluate the models comprehensively. Their formulas are as follows.
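In terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), the standard definitions are

$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}, \qquad \text{Precision} = \frac{TP}{TP+FP},$$

$$\text{Recall} = \frac{TP}{TP+FN}, \qquad F_1 = \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}.$$

In the code below, precision, recall, and F1 are computed per class and combined using the weighted average reported by scikit-learn's classification_report.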

Compute all evaluation metrics and define evaluation functions

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import KFold
def evaluation(y_test, y_predict):
    accuracy=classification_report(y_test, y_predict, output_dict=True)['accuracy']
    s=classification_report(y_test, y_predict, output_dict=True)['weighted avg']
    precision=s['precision']
    recall=s['recall']
    f1_score=s['f1-score']
    #kappa=cohen_kappa_score(y_test, y_predict)
    return accuracy, precision, recall, f1_score #, kappa
def evaluation2(lis):
    array = np.array(lis)
    return array.mean() , array.std()
df_eval=pd.DataFrame(columns=['Accuracy','Precision','Recall','F1_score'])
for i in range(5):
    model_C = model_list[i]
    name = model_name[i]
    model_C.fit(X_train_s, y_train)
    pred = model_C.predict(X_test_s)
    s = evaluation(y_test, pred)
    df_eval.loc[name, :] = list(s)
    print(f'{name} model fitting completed')

df_eval

bar_width = 0.4
colors=['c', 'b', 'g', 'tomato', 'm', 'y', 'lime', 'k', 'orange', 'pink', 'grey', 'tan']
fig, ax = plt.subplots(2, 2, figsize=(10,8), dpi=128)
for i, col in enumerate(df_eval.columns):
    n = int(str('22') + str(i + 1))
    plt.subplot(n)
    df_col = df_eval[col]
    m = np.arange(len(df_col))
    plt.bar(x=m, height=df_col.to_numpy(), width=bar_width, color=colors)

    #plt.xlabel('Methods', fontsize=12)
    names = df_col.index
    plt.xticks(range(len(df_col)), names, fontsize=10)
    plt.xticks(rotation=40)
    plt.ylabel(col, fontsize=14)

plt.tight_layout()
#plt.savefig('Histogram.jpg', dpi=512)
plt.show()

It can be seen that the random forest model performs best.

#Use K-fold cross-validation to search for optimal hyperparameters
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV

Pipeline and grid search

from sklearn.pipeline import Pipeline
steps = [
    ('standardize', StandardScaler()),
    ('model', RandomForestClassifier(max_features='sqrt', random_state=10))]

RF_pipe = Pipeline(steps=steps)

param_distributions = {'model__max_depth': range(2, 9),
                       'model__n_estimators': [250,500,750,1000,1500,2000]}

model= GridSearchCV( RF_pipe, param_distributions, n_jobs = 6, cv = KFold(n_splits=3, shuffle=True, random_state=0) )

View the parameters in this pipeline

RF_pipe.get_params()['model'].get_params().keys()

fit model

model.fit(X_train,y_train)

#View the best parameters and accuracy of the model
print(f"best param: {model.best_params_}, accuracy: {model.best_score_}")

#Use the best estimator found by the search as the model
model = model.best_estimator_
pred = model.predict(X_test)  # the pipeline standardizes internally, so pass the unscaled test features
s = evaluation(y_test, pred)
print(f'Acc: {s[0]}, Prec: {s[1]}, Recall: {s[2]}, F1: {s[3]}')

#Accuracy, precision, recall and F1 on the full data set

RF_s = evaluation(y, model.predict(data))
RF_s

Conclusion

Comparing the models on this classification problem, random forest performs best, followed by the support vector machine and logistic regression.

A limitation of this case is that the number of samples with y equal to 1 is very small, and heavy oversampling may lead to overfitting of the model.

Variable importance

# train on all the data
model = RandomForestClassifier(max_depth=8, n_estimators=2000, max_features='sqrt', random_state=10) #use the optimal parameters found above
model.fit(np.array(data), y)
pred = model.predict(np.array(data))
s = evaluation(y, pred)
print(f'Acc: {s[0]}, Prec: {s[1]}, Recall: {s[2]}, F1: {s[3]}')

n=10
sorted_index = model.feature_importances_.argsort()[::-1][:n]
mfs = model.feature_importances_[sorted_index][:n]
plt.figure(figsize=(5,4), dpi=128)
sns.barplot(y=np.array(range(len(mfs))), x=mfs, orient='h')
plt.yticks(np.arange(data.shape[1])[:n], data.columns[sorted_index][:n])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest')
plt.show()

It can be seen that variables such as PEDCOUNT and COLLISIONTYPE_Pedestrian have a high influence on the response variable y, indicating that these factors have a significant impact on accident severity.

Controlling these factors should help reduce the severity of traffic accidents.