This case is suitable for students in the direction of machine learning data science.
Introduction (nonsense collection)
Traffic accidents are a serious public safety problem, and thousands of people die in traffic accidents every year around the world. With the development of transportation and the acceleration of urbanization, traffic accidents have become one of the main factors restricting urban development and people’s happiness. Therefore, by analyzing the causes and severity of traffic accidents and developing effective prediction models, it is possible to effectively prevent traffic accidents and reduce the harm of traffic accidents to human society.
The goal of this research is to predict the severity of traffic accidents through machine learning models to protect urban residents from serious injuries in traffic accidents and prevent traffic collisions. To achieve this goal, we will analyze various factors that affect the severity of traffic accidents and build an accurate and reliable prediction model.
At present, people usually rely on experience and statistical data to evaluate the severity of traffic accidents. However, the disadvantage of this method is that there may be subjectivity and limitations, and it is impossible to accurately predict the occurrence and severity of traffic accidents. In contrast, machine learning models can automatically learn and optimize the model to predict the severity of traffic accidents with the greatest accuracy. This study aims to compare the advantages and disadvantages between the current solution and our proposed solution to demonstrate the superiority of machine learning models in traffic accident prediction.
Finally, we will provide evidence in support of our study by analyzing traffic accident data in the city of Seattle over a 10-year period, where our response variable y is SEVERITYCODE (accident severity code). Because in a traffic accident, the severity of the accident is our most concerned issue, and it is also the target variable we hope to be able to predict. The other columns may serve as predictor variables x that help us predict accident severity. This study will provide a scientific basis for the Seattle transportation government to formulate more effective traffic accident prevention measures and policies.
Import package
#Import common packages for data analysis import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns %matplotlib inline plt.rcParams['axes.unicode_minus'] = False #Negative sign
Data Exploration
# read data data=pd.read_csv('Collisions.csv', na_values='Unknown') #View the first three rows of data data.head(3)
#Automatically resolve data types data = data.infer_objects() #View data information data. info()
# variable meaning
”’OBJECTID: Identifier of the incident object
INCKEY: Internal Incident Unique Key
COLDETKEY: Identifier for collected data
REPORTNO: Incident report number
STATUS: Incident handling status
ADDRTYPE: address type
INTKEY: intersection identifier
LOCATION: The location of the accident
EXCEPTRSNCODE: exception reason code
EXCEPTRSNDESC: exception reason description
SEVERITYCODE: Incident severity code
SEVERITYDESC: Incident severity description
COLLISIONTYPE: collision type
PERSONCOUNT: Number of people involved
PEDCOUNT: Number of pedestrians
PEDCYLCOUNT: Number of bicycles
VEHCOUNT: Number of vehicles involved
INJURIES: number of injuries
SERIOUSINJURIES: Number of serious injuries
FATALITIES: Number of fatalities
INCDATE: The date of the incident
INCDTTM: Date and time of accident
JUNCTIONTYPE: intersection type
SDOT_COLCODE: Incident code collected by SDOT
SDOT_COLDESC: Incident description collected by SDOT
INATTENTIONIND: inattentive mark
UNDERINFL: Under the influence of the driver
WEATHER: Weather conditions
ROADCOND: road condition
LIGHTCOND: light condition
PEDROWNOTGRNT: Pedestrians not granted rights
SDOTCOLNUM: Incident number collected by SDOT
SPEEDING: Speeding
ST_COLCODE: State accident code
ST_COLDESC: State accident description
SEGLANEKEY: Segmented lane key
CROSSWALKKEY: Crosswalk keyword
HITPARKEDCAR: Whether to hit a parked vehicle ”’
#observe missing values import missing no as msno msno.matrix(data)
Data preprocessing
Feature Preprocessing
'''OBJECTID, INCKEY, COLDETKEY, REPORTNO''' #These four variables are numbered variables with unique values, which are not helpful to the model, delete them data.drop(columns=['OBJECTID','INCKEY','COLDETKEY','REPORTNO'],inplace=True)
#If there is a row with all null values, delete it
data.dropna(how='all',inplace=True)
#Delete the variable with the only value (if there is a column with the same value, that is, the characteristic variable with the only value can be deleted, because there is no difference between each sample, it is useless for the model)
for col in data.columns: if len(data[col].value_counts())==1: print(col) data.drop(col,axis=1,inplace=True)
#Delete when the missing ratio reaches a certain level, and delete it when the missing amount reaches a certain level. If there are too many missing, it is better not to have this feature. My ratio here is 20%
miss_ratio=0.2 for col in data.columns: if data[col].isnull().sum()>data.shape[0]*miss_ratio: print(col) data.drop(col,axis=1,inplace=True)
A total of 38 variables, 11 have now been deleted
Then the data needs to be further refined, and the data should be divided into numerical and other types.
Look at the numeric data first
Numeric variable preprocessing
#View numeric data, pd.set_option('display.max_columns', 20) data.select_dtypes(exclude=['object']).head()
To do machine learning, of course, the more dispersed the features are, the better, because in this way, X can be more differentiated and thus better classified.
So those variables whose data distribution is very concentrated can be thrown away.
Variance can be used, but due to the inconsistent size of the data, the variance is not easy to compare.
Here we use the difference ratio to measure the degree of dispersion of the data. If the difference ratio is lower than 0.02, delete it
#Calculate the difference ratio variation_ratio_s=0.02 for col in data.select_dtypes(exclude=['object']).columns: df_count = data[col].value_counts() kind=df_count. index[0] variation_ratio=1-(df_count.iloc[0]/len(data[col])) if variation_ratio<variation_ratio_s: print(f'{col} has the maximum value of {kind}, and the difference ratio is {round(variation_ratio,4)}, which is too small and needs to be deleted') #data.drop(col,axis=1,inplace=True)
We found that the outlier ratio of the four variables of SERIOUSINJURIES, FATALITIES, SEGLANEKEY, and CROSSWALKKEY is very small, most of which are 0.
But SERIOUSINJURIES and FATALITIES are variables that have a lot to do with traffic severity, and we choose to keep them.
And SEGLANEKEY and CROSSWALKKEY are road section ID and crosswalk ID, which should have little effect on prediction, so choose to delete them.
##Delete SEGLANEKEY and CROSSWALKKEY data.drop(columns=['SEGLANEKEY','CROSSWALKKEY'],inplace=True)
object variable preprocessing
View the first three rows of non-numeric variables
data.select_dtypes(include=['object']).head(3)
'LOCATION','SDOT_COLDESC','ST_COLCODE','ST_COLDESC', #These four variables are specific descriptions for this event, and some are text types, which cannot be used as feature variables, delete 'SEVERITYDESC','STATUS','SDOT_COLCODE' #These variables have no explanation for the severity of the traffic, and are also deleted data.drop(columns=['LOCATION','SDOT_COLDESC','ST_COLCODE','ST_COLDESC','SEVERITYDESC','STATUS','SDOT_COLCODE'], inplace=True)
Fill missing values
1. Delete missing values: For some variables with fewer rows with missing values, you can directly delete the row samples with missing values.
It can be seen that there are missing values in our response variable SEVERITYCODE, and there is only one missing value. Deleting this missing value will not have a great impact on the overall data.
Moreover, missing values may cause interference and errors in subsequent modeling and analysis, so in this case, we can directly delete rows with missing values.
2. Other variables fill in missing values:
(1) For categorical variables (such as ADDRTYPE, COLLISIONTYPE, JUNCTIONTYPE, UNDERINFL, WEATHER, ROADCOND, LIGHTCOND) can be filled with the mode.
# remove missing values data.dropna(subset=['SEVERITYCODE'], inplace=True) # Categorical variable fill mode data['ADDRTYPE'].fillna(data['ADDRTYPE'].mode()[0], inplace=True) data['COLLISIONTYPE'].fillna(data['COLLISIONTYPE'].mode()[0], inplace=True) data['JUNCTIONTYPE'].fillna(data['JUNCTIONTYPE'].mode()[0], inplace=True) data['UNDERINFL'].fillna(data['UNDERINFL'].mode()[0], inplace=True) data['WEATHER'].fillna(data['WEATHER'].mode()[0], inplace=True) data['ROADCOND'].fillna(data['ROADCOND'].mode()[0], inplace=True) data['LIGHTCOND'].fillna(data['LIGHTCOND'].mode()[0], inplace=True)
Feature Engineering
Two time variables are converted into time format
data['INCDTTM'] = pd.to_datetime(data['INCDTTM']) data['INCDATE'] = pd.to_datetime(data['INCDATE'])
The INCDTTM variable contains the hour, which is more detailed. It is enough to extract the time feature from the INCDTTM variable. We extract the year, month, and hour when the accident occurred. Because the day may not have an obvious meaning, and the month represents the seasonality, and the hour represents the most likely moment of daily traffic accidents, so choose the year, month, and hour as the new features
data['year']=data['INCDTTM'].dt.year data['month'] = data['INCDTTM'].dt.year data['hour']=data['INCDTTM'].dt.hour
After building a new feature, delete the original feature if it is not needed
data.drop(columns=['INCDTTM','INCDATE'],inplace=True)
There are duplicate meaning variables in UNDERINFL, which need to be processed
# change 0 to N, 1 to Y data['UNDERINFL']=data['UNDERINFL'].map({'N':'N','0':'N','Y' :'Y','1':'Y'})
#View processed data information
print(data. shape) data. info()
There are no missing values anymore.
#First take out y from the training set y=data['SEVERITYCODE'] data.drop(['SEVERITYCODE'],axis=1,inplace=True)
##View each variable of the categorical variable is divided into several categories
for col in data.select_dtypes(include=['object']).columns: print(f"{col} variable has {len(data[col].unique())} class")
Data Exploration
Numeric variable drawing
#View the boxplot distribution of feature variables columns = data.select_dtypes(exclude=['object']).columns.tolist() # list header dis_cols = 4 # a few lines dis_rows = len(columns) plt.figure(figsize=(4 * dis_cols, 4 * dis_rows),dpi=128) for i in range(len(columns)): plt.subplot(dis_rows,dis_cols,i+1) sns.boxplot(data=data[columns[i]], orient="v",width=0.5) plt.xlabel(columns[i], fontsize = 20) plt.tight_layout() #plt.savefig('Characteristic variable box plot', format='png', dpi=500) plt. show()
#Draw density map, compare training set and test set dis_cols = 4 # a few lines dis_rows = len(columns) plt.figure(figsize=(4 * dis_cols, 4 * dis_rows),dpi=128) for i in range(len(columns)): ax = plt.subplot(dis_rows, dis_cols, i + 1) ax = sns.kdeplot(data[columns[i]], color="Red" ,fill=True) ax.set_xlabel(columns[i], fontsize = 20) ax.set_ylabel("Frequency",fontsize = 18) plt.tight_layout() #plt.savefig('Training and testing feature variable kernel density map', format='png', dpi=500) plt. show()
Type variable drawing
#View the boxplot distribution of feature variables columns = data.select_dtypes(include=['object']).columns.tolist() # list header dis_cols = 3 # a few lines dis_rows = len(columns) plt.figure(figsize=(5 * dis_cols, 5 * dis_rows),dpi=128) for i in range(len(columns)): plt.subplot(dis_rows,dis_cols,i+1) sns.barplot(x=data[columns[i]].value_counts().index, y=data[columns[i]].value_counts().values) #sns.barplot(data=data[columns[i]].value_counts(), orient="v",width=0.5) plt.xlabel(columns[i], fontsize = 15) plt. xticks(rotation=45) plt.tight_layout() #plt.savefig('Characteristic variable box plot', format='png', dpi=500) plt. show()
Correlation Analysis
#Draw the correlation between numerical variables corr = plt.subplots(figsize = (18,16),dpi=128) corr= sns.heatmap(data.corr(method='spearman'),annot=True,square=True)
Independent hot encoding of categorical variables
#Many x are text-type categorical variables, so the independent hot encoding process to be performed
data=pd.get_dummies(data) data.shape
View the distribution of the response variable
y. value_counts()
visualization
# View the distribution of y # classification problem plt.figure(figsize=(8,3),dpi=128) plt.subplot(1,2,1) y.value_counts().plot.bar(title='Response variable histogram graph') plt.subplot(1,2,2) y.value_counts().plot.pie(title='Response Variable Pie Chart') #plt.savefig('response variable.png') plt.tight_layout() plt. show()
##The data distribution of the response variable is extremely unbalanced. We map the four groups of 0, 1, 2, and 2b to 0, 3 (indicating that the traffic accident is very serious) to 1, which means it is a two-category problem
y=y.map({'0':0,'1':0,'2':0,'2b':0,'3':1} ) y. value_counts()
Category balance processing for y below
When dealing with extremely unbalanced classification samples, the following methods can be considered: Undersampling: Randomly select a part of samples from the majority category, so that the number of samples in the majority category is similar to the number of samples in the minority category. The advantage of this method is that it is simple and fast, but some useful information may be lost.
Oversampling: Randomly copy some samples from the minority class so that the number of samples in the minority class is similar to the number of samples in the majority class. This approach has the advantage of fully utilizing the dataset, but may lead to overfitting.
SMOTE (Synthetic Minority Over-sampling Technique) algorithm: It is a commonly used oversampling method, which expands the data set by interpolating minority class samples to generate new samples. This method can effectively avoid the overfitting problem.
Mixed Sampling: Combining the advantages of undersampling and oversampling, it can reduce the amount of data and make full use of the data set. You can undersample first, and then oversample the undersampled data.
We first upsample the less sampled
from imblearn.over_sampling import RandomOverSampler os=RandomOverSampler(sampling_strategy=0.0025) X_train_ns,y_train_ns=os.fit_resample(data,y) print("The number of classes before fit {}". format(y. value_counts(). to_dict())) print("The number of classes after fit {}".format(y_train_ns.value_counts().to_dict()))
Increase the number of samples for y to 1 to 599
Then down-sample the samples with more samples, and the ratio is 0.2, that is, the number of class 0 is 80%, and the number of class 1 is 20%. Although it is not balanced, it is much better than the one just now. . I can also train.
from imblearn.under_sampling import RandomUnderSampler rus = RandomUnderSampler(sampling_strategy=0.2) X_resampled, y_resampled = rus.fit_resample(X_train_ns, y_train_ns) X_train_ns2, y_train_ns2 = rus.fit_resample(X_train_ns, y_train_ns) print("The number of classes before fit {}".format(y_train_ns.value_counts().to_dict())) print("The number of classes after fit {}".format(y_train_ns2.value_counts().to_dict()))
View now The data shape of
print(X_train_ns2.shape,y_train_ns2.shape)
A total of 3594 samples, including 2995 in category 0, 599 in category 1, and 57 x feature variables
Algorithm Implementation
#Divide the training set and test set, check their shape from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test=train_test_split(X_train_ns2, y_train_ns2, stratify=y_train_ns2, test_size=0.2, random_state=0) print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
#data standardization from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaler. fit(X_train) X_train_s = scaler. transform(X_train) X_test_s = scaler. transform(X_test) print('training data shape:') print(X_train_s. shape, y_train. shape) print('Test data shape:') print(X_test_s. shape, y_test. shape)
The above code first divides the training set and the test set, the proportion of the test set is 20%, and the random number seed is 0. Then standardize the data, print and view the shape of the training set data and test set data. It can be seen that we have a training set of 4598, a test set of 1150, and 74 feature variables.
#Use five models to compare the accuracy of the validation set
from sklearn.linear_model import LogisticRegression from sklearn.neighbors import KNeighborsClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.svm import SVC
#Logistic regression model1 = LogisticRegression(C=1e10,max_iter=10000) #Knearest neighbors model2 = KNeighborsClassifier(n_neighbors=10) #decision tree model3 = DecisionTreeClassifier(random_state=77) #random forest model4= RandomForestClassifier(n_estimators=1000, max_features='sqrt', random_state=10) #Support Vector Machines model5 = SVC(kernel="rbf", random_state=77) model_list=[model1,model2,model3,model4,model5] model_name=['Logistic Regression', 'K-Nearest Neighbors', 'Decision Tree', 'Random Forest','Support Vector Machine']
Evaluation criteria
This article is a classification problem. Four commonly used and reliable evaluation indicators for classification problems, accuracy, precision, recall and F1 value, are used to comprehensively evaluate the model. The calculation formulas of the four indicators are as follows
Compute all evaluation metrics and define evaluation functions
from sklearn.metrics import confusion_matrix from sklearn.metrics import classification_report from sklearn.metrics import cohen_kappa_score from sklearn.model_selection import KFold
def evaluation(y_test, y_predict): accuracy=classification_report(y_test, y_predict, output_dict=True)['accuracy'] s=classification_report(y_test, y_predict, output_dict=True)['weighted avg'] precision=s['precision'] recall=s['recall'] f1_score=s['f1-score'] #kappa=cohen_kappa_score(y_test, y_predict) return accuracy, precision, recall, f1_score #, kappa def evaluation2(lis): array = np.array(lis) return array.mean() , array.std()
df_eval=pd.DataFrame(columns=['Accuracy','Precision','Recall','F1_score']) for i in range(5): model_C = model_list[i] name=model_name[i] print(f'{name} model fitting completed') model_C.fit(X_train_s, y_train) pred = model_C.predict(X_test_s) s=classification_report(y_test, pred) s=evaluation(y_test,pred) df_eval.loc[name,:]=list(s)
df_eval
bar_width = 0.4 colors=['c', 'b', 'g', 'tomato', 'm', 'y', 'lime', 'k' ,'orange','pink','grey','tan'] fig, ax = plt.subplots(2,2,figsize=(10,8),dpi=128) for i, col in enumerate(df_eval.columns): n=int(str('22') + str(i + 1)) plt. subplot(n) df_col = df_eval[col] m = np.arange(len(df_col)) plt.bar(x=m,height=df_col.to_numpy(),width=bar_width,color=colors) #plt.xlabel('Methods', fontsize=12) names=df_col.index plt.xticks(range(len(df_col)),names,fontsize=10) plt. xticks(rotation=40) plt.ylabel(col, fontsize=14) plt.tight_layout() #plt.savefig('Histogram.jpg',dpi=512) plt. show()
It can be seen that the random forest model above works best
Further hyperparameter search for the random forest model below
#Use K-fold cross-validation to search for optimal hyperparameters from sklearn.model_selection import KFold, StratifiedKFold from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
pipeline flow, grid search
from sklearn.pipeline import Pipeline steps = [ ('standardize', StandardScaler()), ('model',RandomForestClassifier(max_features='sqrt',random_state=10) ] RF_pipe = Pipeline( steps = steps ) param_distributions = {'model__max_depth': range(2, 9), 'model__n_estimators': [250,500,750,1000,1500,2000]} model= GridSearchCV( RF_pipe, param_distributions, n_jobs = 6, cv = KFold(n_splits=3, shuffle=True, random_state=0) )
View the parameters in this pipeline
RF_pipe.get_params()['model'].get_params().keys()
fit model
model.fit(X_train,y_train)
#View the best parameters and accuracy of the model print(f" best param: {model.best_params_},accuary:{model.best_score_}" )
#Assign the best parameters to the model model = model.best_estimator_ pred = model. predict(X_test_s) s=evaluation(y_test,pred) print(f'Acc: {s[0]}, Prec: {s[1]}, Recall: {s[2]}, F1:{s[3]}')
#Accuracy, precision, recall and F1 value
RF_s=evaluation(model. predict(data),y) RF_s
Conclusion
Comparison of models, etc. On this classification problem, random forest works best, followed by support vector machine and logistic regression
The limitation of this case is that the data sample size with the value of y being 1 is too small, and too much upsampling may lead to overfitting of the model, etc.
Variable importance
# use all data training model=RandomForestClassifier(max_depth= 8,n_estimators=2000,max_features='sqrt',random_state=10) #Use the previous optimal parameters model. fit(np. array(data), y) pred = model. predict(np. array(data)) s=evaluation(y,pred) print(f'Acc: {s[0]}, Prec: {s[1]}, Recall: {s[2]}, F1:{s[3]}')
n=10 sorted_index = model.feature_importances_.argsort()[::-1][:n] mfs=model.feature_importances_[sorted_index][:n] plt.figure(figsize=(5,4),dpi=128) sns.barplot(y=np.array(range(len(mfs))),x=mfs,orient='h') plt.yticks(np.arange(data.shape[1])[:n], data.columns[sorted_index][:n]) plt.xlabel('Feature Importance') plt.ylabel('Feature') plt. title('LGBM') plt. show()
It can be seen that variables such as PEDCOUNT, COLLISIONTYPE_Pedesttain have a high degree of influence on the response variable y, indicating that these variables have a significant impact on traffic accidents
These variables should be controlled to reduce the severity of traffic accidents