Build a scorecard model using LR
Step 1: Obtain and read the data
The data set contains personal consumer loan records, and the goal is to build an application scorecard (A card). The fields are described below.
SeriousDlqin2yrs (whether the borrower was 90 days or more past due within two years; this is the label)
When SeriousDlqin2yrs = 1, the customer is a bad customer; when SeriousDlqin2yrs = 0, the customer is a good customer.
RevolvingUtilizationOfUnsecuredLines (ratio of the balance on credit cards and personal credit lines to the total credit limit)
age (age of the borrower)
NumberOfTime30-59DaysPastDueNotWorse (number of times 30-59 days past due, but no worse, in the past two years)
DebtRatio (monthly debt payments, alimony and living costs divided by gross monthly income)
MonthlyIncome (monthly income)
NumberOfOpenCreditLinesAndLoans (number of open loans and lines of credit)
NumberOfTimes90DaysLate (number of times 90 days or more past due in the past two years)
NumberRealEstateLoansOrLines (number of mortgage and real estate loans, including home equity lines of credit)
NumberOfTime60-89DaysPastDueNotWorse (number of times 60-89 days past due, but no worse, in the past two years)
NumberOfDependents (number of dependents in the family, such as spouse and children, excluding the borrower)
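%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used by the plots below
import scipy.stats               # used by the chi-square test in the binning function
from sklearn.linear_model import LogisticRegression as LR
data = pd.read_csv(r".\rankingcard.csv", index_col=0)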
Step 2: Data cleaning and feature engineering
Data exploration
# Inspect the first rows: the first column is the label and the remaining 10 columns are features
data.head()
# Inspect the data structure
data.shape  # (150000, 11)
data.info()

# Explore missing values: proportion of missing values per column
data.isnull().sum() / data.shape[0]
# data.isnull().mean()  # another way to write the previous line
Delete duplicates and fill missing values
Delete duplicate values
data.drop_duplicates(inplace=True)
# Be sure to restore the index after deleting rows
data.index = range(data.shape[0])
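For the feature with a small missing ratio (NumberOfDependents), fill with the mean value
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which does not reliably modify the DataFrame in recent pandas versions
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(int(data['NumberOfDependents'].mean()))
For the feature with a relatively large missing ratio (MonthlyIncome), fill with a random forest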
def fill_missing_rf(X, y, to_fill):
    """
    Fill the missing values of one feature using a random forest.

    Parameters:
    X: feature matrix that contains the column to fill
    y: complete label with no missing values
    to_fill: string, the name of the column to fill
    """
    from sklearn.ensemble import RandomForestRegressor

    # Build the new feature matrix (all other features plus the label) and the new target
    df = X.copy()
    fill = df.loc[:, to_fill]
    df = pd.concat([df.loc[:, df.columns != to_fill], pd.DataFrame(y)], axis=1)

    # Training set: rows where the target is present; test set: rows where it is missing
    Ytrain = fill[fill.notnull()]
    Ytest = fill[fill.isnull()]
    Xtrain = df.iloc[Ytrain.index, :]
    Xtest = df.iloc[Ytest.index, :]

    # Fill missing values with random forest regression
    rfr = RandomForestRegressor(n_estimators=100)
    rfr = rfr.fit(Xtrain, Ytrain)
    Ypredict = rfr.predict(Xtest)
    return Ypredict

X = data.iloc[:, 1:]
y = data["SeriousDlqin2yrs"]  # same as y = data.iloc[:, 0]
X.shape  # (149391, 10)

#=====[TIME WARNING: 1 min]=====#
y_pred = fill_missing_rf(X, y, "MonthlyIncome")

# The following line can be used to check that the shapes match
# y_pred.shape == data.loc[data.loc[:, "MonthlyIncome"].isnull(), "MonthlyIncome"].shape

# After confirming that the result is reasonable, overwrite the data
data.loc[data.loc[:, "MonthlyIncome"].isnull(), "MonthlyIncome"] = y_pred
data.info()
Outlier handling
# Outliers are also present. The minimum age is 0, which does not meet the bank's business rules:
# even a child account requires an age of at least 8.
# Check how many people have age 0
(data["age"] == 0).sum()
# Only one person has age 0, so this must be an entry error.
# Treat it as a missing value and delete the sample directly.
data = data[data["age"] != 0]
The problem of unbalanced samples
Solve the sample imbalance by oversampling the minority class (SMOTE), so that the ratio of positive to negative samples becomes 1:1
# Explore the label distribution
X = data.iloc[:, 1:]
y = data.iloc[:, 0]
y.value_counts()  # number of samples in each class, to check whether the classes are balanced

n_sample = X.shape[0]
n_1_sample = y.value_counts()[1]
n_0_sample = y.value_counts()[0]
print('Number of samples: {}; 1 accounts for {:.2%}; 0 accounts for {:.2%}'.format(n_sample, n_1_sample/n_sample, n_0_sample/n_sample))
# Number of samples: 149165; 1 accounts for 6.62%; 0 accounts for 93.38%

# If the import fails, install the package first: pip install imbalanced-learn
import imblearn
# imblearn is a library dedicated to imbalanced data sets.
# Its classes are instantiated and fitted in the same way as sklearn estimators.
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)   # instantiate
X, y = sm.fit_resample(X, y)  # named fit_sample in old imblearn versions

n_sample_ = X.shape[0]  # 278584
pd.Series(y).value_counts()
n_1_sample = pd.Series(y).value_counts()[1]
n_0_sample = pd.Series(y).value_counts()[0]
print('Number of samples: {}; 1 accounts for {:.2%}; 0 accounts for {:.2%}'.format(n_sample_, n_1_sample/n_sample_, n_0_sample/n_sample_))
# Number of samples: 278584; 1 accounts for 50.00%; 0 accounts for 50.00%
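To make the idea behind SMOTE more concrete, here is a minimal NumPy sketch of its interpolation step: a synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. The function name smote_like_samples and its parameters are hypothetical, and the sketch assumes Euclidean distance; the imblearn implementation is more involved.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples
    and their k nearest minority neighbours (illustration only)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)               # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour per base
    gap = rng.random((n_new, 1))                   # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four made-up minority samples on the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two existing minority samples, it always stays inside the region spanned by the minority class.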
Split into training set and validation set
from sklearn.model_selection import train_test_split
X = pd.DataFrame(X)
y = pd.DataFrame(y)
X_train, X_vali, Y_train, Y_vali = train_test_split(X, y, test_size=0.3, random_state=420)

model_data = pd.concat([Y_train, X_train], axis=1)  # training data used to build the model
model_data.index = range(model_data.shape[0])
model_data.columns = data.columns

vali_data = pd.concat([Y_vali, X_vali], axis=1)  # validation set
vali_data.index = range(vali_data.shape[0])
vali_data.columns = data.columns

model_data.to_csv(r".\model_data.csv")  # training data
vali_data.to_csv(r".\vali_data.csv")    # validation data
Binning variables
Goal of binning
The desired effect of binning:
We want people with different attributes to receive different scores, so the people inside one bin should be as similar as possible and the people in different bins as different as possible; in other words, small differences within groups and large differences between groups. For the scorecard, this means that the default probability of people within a bin should be similar, while the default probability across bins should differ widely, that is, the WOE gap between bins should be large. The chi-square test can be used to compare two adjacent bins: if the p-value of the chi-square test between the two bins is large, they are very similar and can be merged into one bin.
Based on this idea, we summarize the steps for binning a feature:
(1) First split the continuous variable into a large number of equal-frequency groups (q groups)
(2) Make sure that each group contains samples of both classes, otherwise the IV value cannot be calculated
(3) Run a chi-square test on adjacent groups and merge the pair with the largest p-value, repeating until no more than the set number of N bins remain
(4) Bin the feature into [2, 3, 4, ..., 20] bins, observe how the IV value changes with the number of bins, and find the most suitable number of bins
(5) After the binning is completed, calculate the WOE value of each bin and inspect the binning result
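Step (3) relies on the chi-square test to decide which adjacent bins are similar enough to merge. As a small illustration, the sketch below applies scipy.stats.chi2_contingency to two pairs of bins; the counts are made up for the example.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows are bins, columns are (count of good, count of bad); the counts are made up
similar = np.array([[900, 100], [880, 120]])     # bad rates 10% and 12%
different = np.array([[900, 100], [600, 400]])   # bad rates 10% and 40%

p_similar = chi2_contingency(similar)[1]       # index 1 of the result is the p-value
p_different = chi2_contingency(different)[1]
print(p_similar, p_different)
```

The pair with similar bad rates gets the larger p-value, so it would be the first candidate for merging.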
First, split each feature into equal-frequency bins; the initial number of bins is q. Count the number of 0s and 1s in each bin. From these counts the WOE and IV values can be computed, and adjacent bins can be merged with the chi-square test.
def graphforbestbin(DF, X, Y, n=5, q=20, graph=True):
    '''
    Automatic optimal binning function, based on the chi-square test

    Parameters:
    DF: the input data
    X: name of the column to bin
    Y: name of the label column that goes with the binned data
    n: number of bins to keep
    q: number of initial bins
    graph: whether to plot the IV curve

    Intervals are open on the left and closed on the right: (]
    '''
    DF = DF[[X, Y]].copy()
    DF["qcut"], bins = pd.qcut(DF[X], retbins=True, q=q, duplicates="drop")
    # observed=False keeps empty bins (count 0) so the counts stay aligned with the bin edges
    count_y0 = DF.loc[DF[Y]==0].groupby(by="qcut", observed=False).count()[Y]
    count_y1 = DF.loc[DF[Y]==1].groupby(by="qcut", observed=False).count()[Y]
    num_bins = [*zip(bins, bins[1:], count_y0, count_y1)]

    # Make sure there are both 0s and 1s in each bin
    for i in range(q):
        # If the first bin lacks a class, merge it with the next bin
        if 0 in num_bins[0][2:]:
            num_bins[0:2] = [(
                num_bins[0][0],
                num_bins[1][1],
                num_bins[0][2] + num_bins[1][2],
                num_bins[0][3] + num_bins[1][3])]
            continue
        # Otherwise merge the first deficient bin with the bin before it
        for i in range(len(num_bins)):
            if 0 in num_bins[i][2:]:
                num_bins[i-1:i+1] = [(
                    num_bins[i-1][0],
                    num_bins[i][1],
                    num_bins[i-1][2] + num_bins[i][2],
                    num_bins[i-1][3] + num_bins[i][3])]
                break
        else:
            break

    def get_woe(num_bins):
        columns = ["min", "max", "count_0", "count_1"]
        df = pd.DataFrame(num_bins, columns=columns)
        df["total"] = df.count_0 + df.count_1
        df["percentage"] = df.total / df.total.sum()
        df["bad_rate"] = df.count_1 / df.total
        df["good%"] = df.count_0 / df.count_0.sum()
        df["bad%"] = df.count_1 / df.count_1.sum()
        df["woe"] = np.log(df["good%"] / df["bad%"])
        return df

    def get_iv(df):
        rate = df["good%"] - df["bad%"]
        iv = np.sum(rate * df.woe)
        return iv

    IV = []
    axisx = []
    # Merge the most similar adjacent pair until only n bins remain
    while len(num_bins) > n:
        pvs = []
        for i in range(len(num_bins)-1):
            x1 = num_bins[i][2:]
            x2 = num_bins[i+1][2:]
            pv = scipy.stats.chi2_contingency([x1, x2])[1]
            pvs.append(pv)

        i = pvs.index(max(pvs))
        num_bins[i:i+2] = [(
            num_bins[i][0],
            num_bins[i+1][1],
            num_bins[i][2] + num_bins[i+1][2],
            num_bins[i][3] + num_bins[i+1][3])]

        bins_df = pd.DataFrame(get_woe(num_bins))
        axisx.append(len(num_bins))
        IV.append(get_iv(bins_df))

    if graph:
        plt.figure()
        plt.plot(axisx, IV)
        plt.xticks(axisx)
        plt.xlabel("number of box")
        plt.ylabel("IV")
        plt.show()
    return bins_df
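As a worked example of the WOE and IV definitions used in get_woe and get_iv, the sketch below computes both for three bins. The counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical counts of good (0) and bad (1) customers in three bins
toy = pd.DataFrame({"count_0": [400, 350, 250], "count_1": [20, 30, 50]})

good_pct = toy["count_0"] / toy["count_0"].sum()   # share of all good customers in each bin
bad_pct = toy["count_1"] / toy["count_1"].sum()    # share of all bad customers in each bin
toy["woe"] = np.log(good_pct / bad_pct)            # WOE = ln(good% / bad%)
iv = ((good_pct - bad_pct) * toy["woe"]).sum()     # IV = sum((good% - bad%) * WOE)

print(toy)
print("IV:", round(iv, 4))
```

A bin with relatively more good customers gets a positive WOE and a bin with relatively more bad customers gets a negative one; every bin contributes a non-negative amount to the IV.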
What a good binning result should look like
1. The bad_rate of the different groups should differ as much as possible
2. The larger the WOE difference between bins, the better. The WOE should be monotonic, either from positive to negative or from negative to positive, with at most one turning point
3. If the WOE changes direction two or more times, for example in a w-shape, there is a problem with the binning process
4. The more information num_bins retains, the better
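The monotonicity requirement in points 2 and 3 can be checked mechanically. The helper below is a hypothetical sketch (count_turning_points is not part of the original code) that counts how many times a sequence of WOE values changes direction.

```python
import numpy as np

def count_turning_points(woe):
    """Count how many times a WOE sequence changes direction."""
    diffs = np.diff(np.asarray(woe, dtype=float))
    signs = np.sign(diffs[diffs != 0])            # ignore flat steps
    return int(np.sum(signs[1:] != signs[:-1]))   # sign flips between consecutive steps

print(count_turning_points([0.9, 0.4, -0.1, -0.7]))        # monotonic sequence
print(count_turning_points([0.9, -0.2, 0.3, -0.4, 0.5]))   # w-shaped sequence
```

A result above 1 would suggest revisiting the number of bins or the bin edges for that feature.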
Next, use the function above to compute the IV value of each feature for different numbers of bins, and choose the final number of bins at the point where the IV value starts to drop sharply
model_data.columns
for i in model_data.columns[1:-1]:
    print(i)
    graphforbestbin(model_data, i, "SeriousDlqin2yrs", n=2, q=20)
Some features cannot be binned automatically and need manual binning; the others use the automatic binning result
Not every feature can be split into that many bins automatically. For example, the number of dependents cannot be divided into 20 groups. The features that can be binned automatically are therefore handled as one group, and the variables that cannot are inspected and binned by hand. The steps are the same for both groups:
1. For the automatically binned features, fix the number of bins and compute the bin edges through equal-frequency binning and the chi-square test. For the manually binned features, set the edges by hand (about 3 bins each here). In both cases, replace the maximum edge with np.inf and the minimum edge with -np.inf, so that a value outside the observed range still falls into a bin when it is scored later.
2. Count the number of 1s and 0s in each bin, compute the WOE value of each bin for each feature, and map the WOE values onto the feature matrix
3. Apply the same mapping to the training set and the validation set to prepare for model fitting
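The reason for replacing the outer bin edges with infinity can be seen in a small pd.cut experiment. The series below is made up for illustration; 30 stands for a value that never appeared in the training data.

```python
import numpy as np
import pandas as pd

values = pd.Series([0, 1, 2, 5, 30])   # 30 stands for a value never seen in training

closed = pd.cut(values, [0, 1, 2, 13])                     # finite edges taken from the data
open_ended = pd.cut(values, [-np.inf, 0, 1, 2, np.inf])    # outer edges replaced by infinity

print(closed.isna().sum())       # values outside (0, 13] are left without a bin
print(open_ended.isna().sum())   # every value falls into some bin
```

With finite edges both 0 (the left edge is open) and 30 end up as NaN, which would later produce a missing WOE value; with infinite outer edges no value is lost.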
auto_col_bins = {"RevolvingUtilizationOfUnsecuredLines": 6,
                 "age": 5,
                 "DebtRatio": 4,
                 "MonthlyIncome": 3,
                 "NumberOfOpenCreditLinesAndLoans": 5}

# Variables that cannot be binned automatically
hand_bins = {"NumberOfTime30-59DaysPastDueNotWorse": [0,1,2,13]
            ,"NumberOfTimes90DaysLate": [0,1,2,17]
            ,"NumberRealEstateLoansOrLines": [0,1,2,4,54]
            ,"NumberOfTime60-89DaysPastDueNotWorse": [0,1,2,8]
            ,"NumberOfDependents": [0,1,2,3]}

# To guarantee full coverage, replace the maximum edge with np.inf and the minimum edge with -np.inf.
# Reason: a new value may appear later, for example 30 dependents, which was never seen before.
# With the outer edges at infinity, such a value still falls into a bin.
hand_bins = {k: [-np.inf, *v[:-1], np.inf] for k, v in hand_bins.items()}

bins_of_col = {}
# Generate the bin edges of the automatically binned features
for col in auto_col_bins:
    bins_df = graphforbestbin(model_data, col
                              ,"SeriousDlqin2yrs"
                              ,n=auto_col_bins[col]  # look up the number of bins for this feature in the dict
                              ,q=20
                              ,graph=False)
    bins_list = sorted(set(bins_df["min"]).union(bins_df["max"]))
    # Guarantee full coverage: np.inf replaces the maximum edge, -np.inf the minimum edge
    bins_list[0], bins_list[-1] = -np.inf, np.inf
    bins_of_col[col] = bins_list

# Merge in the manual bins
bins_of_col.update(hand_bins)

data = model_data.copy()
# pd.cut bins the data according to known bin edges
# Usage: pd.cut(data, list of bin edges)
data = data[["age","SeriousDlqin2yrs"]].copy()
data["cut"] = pd.cut(data["age"], [-np.inf, 48.49986200790144, 58.757170160044694, 64.0, 74.0, np.inf])
data.head()

# Aggregate the data by bin and count the label values
data.groupby("cut")["SeriousDlqin2yrs"].value_counts()
# Use unstack() to turn the hierarchical result into a table
data.groupby("cut")["SeriousDlqin2yrs"].value_counts().unstack()
bins_df = data.groupby("cut")["SeriousDlqin2yrs"].value_counts().unstack()
bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))

def get_woe(df, col, y, bins):
    df = df[[col, y]].copy()
    df["cut"] = pd.cut(df[col], bins)
    bins_df = df.groupby("cut")[y].value_counts().unstack()
    woe = bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))
    return woe

# Store the WOE of all features in a dictionary
woeall = {}
for col in bins_of_col:
    woeall[col] = get_woe(model_data, col, "SeriousDlqin2yrs", bins_of_col[col])

# To avoid overwriting the original data, create a new DataFrame with the same index as model_data
model_woe = pd.DataFrame(index=model_data.index)
# Bin the original data, then map each bin to its WOE value with map
model_woe["age"] = pd.cut(model_data["age"], bins_of_col["age"]).map(woeall["age"])
# The same operation for all features:
for col in bins_of_col:
    model_woe[col] = pd.cut(model_data[col], bins_of_col[col]).map(woeall[col])
# Add the label
model_woe["SeriousDlqin2yrs"] = model_data["SeriousDlqin2yrs"]
# This is the modeling data
model_woe.head()

# Apply the same WOE mapping to the validation set
vali_woe = pd.DataFrame(index=vali_data.index)
for col in bins_of_col:
    vali_woe[col] = pd.cut(vali_data[col], bins_of_col[col]).map(woeall[col])
vali_woe["SeriousDlqin2yrs"] = vali_data["SeriousDlqin2yrs"]
vali_X = vali_woe.iloc[:, :-1]
vali_y = vali_woe.iloc[:, -1]
Step 3: Model Development
Fit the model with logistic regression, first with the default parameters
X = model_woe.iloc[:, :-1]
y = model_woe.iloc[:, -1]
from sklearn.linear_model import LogisticRegression as LR
lr = LR().fit(X, y)
lr.score(vali_X, vali_y)  # 0.8641356370249832
Tune the parameters by drawing learning curves
The main parameters of logistic regression are C, penalty, solver, max_iter and multi_class
C is the reciprocal of the regularization strength. The smaller C is, the heavier the penalty on the parameters and the stronger the effect of regularization, which helps prevent overfitting
penalty: enter l1 or l2 to specify which regularization method to use; if omitted, the default is l2
Note that if l1 regularization is selected, the solver parameter can only be 'liblinear' or 'saga'. If l2 regularization is used, all solvers are available.
solver: 'liblinear', which is designed for binary (one-vs-rest) problems, was the default up to scikit-learn 0.21; from version 0.22 the default is 'lbfgs'
multi_class: enter ovr, multinomial or auto to tell the model which type of classification problem to handle. The default was ovr in older versions and is auto from scikit-learn 0.22; the parameter is deprecated in recent releases
ovr: one-vs-rest, the problem is treated as binary classification
multinomial: multi-class classification
auto: chooses according to the label type and the solver
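To illustrate the statement that a smaller C means stronger regularization, the sketch below fits the same model with three values of C and compares the size of the coefficient vector. The data come from make_classification and are synthetic, used here only as an assumption-free stand-in for the scorecard data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

norms = {}
for C in [0.001, 0.1, 10]:
    clf = LogisticRegression(C=C, solver="liblinear").fit(X_demo, y_demo)
    norms[C] = float(np.linalg.norm(clf.coef_))   # L2 norm of the coefficient vector
    print(C, round(norms[C], 3))
```

The coefficient norm grows as C grows, which is the shrinkage effect of the penalty seen from the other side.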
L1 and L2 regularization
As L1 regularization is gradually strengthened, the parameters of features that carry little information and contribute little to the model shrink to 0 faster than the parameters of features that carry a lot of information and contribute greatly. L1 regularization is therefore in essence a feature selection process that controls the sparseness of the parameters. Generally speaking, if there are many features and the data is high-dimensional, we tend to use L1 regularization.
In contrast, as L2 regularization is strengthened, it tries to let every feature keep some small contribution to the model: the parameters of features that carry little information and contribute little become very close to 0, but not exactly 0. Generally speaking, if the main purpose is only to prevent overfitting, L2 regularization is enough.
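The difference in sparsity can be observed directly by counting exact zeros among the fitted coefficients. This is a small sketch on synthetic data (make_classification with only a few informative features); the value C=0.05 is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with only a few informative features (illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=3,
                                     n_redundant=0, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_demo, y_demo)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X_demo, y_demo)

n_zero_l1 = int((l1.coef_ == 0).sum())   # coefficients set exactly to 0 by L1
n_zero_l2 = int((l2.coef_ == 0).sum())   # L2 shrinks coefficients but rarely zeroes them
print(n_zero_l1, n_zero_l2)
```

With L1 many of the uninformative features are dropped entirely, while with L2 they typically keep small non-zero coefficients.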
c_1 = np.linspace(0.01, 1, 20)
c_2 = np.linspace(0.01, 0.2, 20)

# Learning curve for C
score = []
for i in c_1:
    lr = LR(solver='liblinear', C=i).fit(X, y)
    score.append(lr.score(vali_X, vali_y))
plt.figure()
plt.plot(c_1, score)
plt.show()

lr.n_iter_  # array([7], dtype=int32)

# Learning curve for max_iter
score = []
for i in [1,2,3,4,5,6]:
    lr = LR(solver='liblinear', C=0.025, max_iter=i).fit(X, y)
    score.append(lr.score(vali_X, vali_y))
plt.figure()
plt.plot([1,2,3,4,5,6], score)
plt.show()
Since the model has only 10 features and is not particularly high-dimensional, L2 regularization is sufficient. The two for loops draw learning curves over different values of C and max_iter. The best accuracy on these curves is close to the accuracy obtained without any tuning, so there is no need for further parameter tuning.
Step 4: Model testing and evaluation
ROC curve
The curve lies toward the upper left and the AUC is 0.94, which looks better than the accuracy alone suggests. For a scorecard model, where capturing the minority class matters more, this is a good result and the recall is relatively high.
# Install first if needed: pip install scikit-plot
import scikitplot as skplt
vali_proba_df = pd.DataFrame(lr.predict_proba(vali_X))
skplt.metrics.plot_roc(vali_y, vali_proba_df,
                       plot_micro=False, figsize=(6,6),
                       plot_macro=False)
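If scikit-plot is not available, the AUC and the points of the ROC curve can also be obtained from sklearn.metrics. The sketch below uses made-up labels and probabilities rather than the scorecard model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted probabilities of class 1 (made-up numbers)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                # area under that curve
print(round(auc, 3))
```

Here 14 of the 15 positive/negative pairs are ranked correctly, so the AUC is 14/15. For the model above, the equivalent call would presumably be roc_auc_score(vali_y, lr.predict_proba(vali_X)[:, 1]).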
Step 5: Put the model online
Create a scorecard model
After modeling, the predictive ability of the model was verified with the accuracy and the ROC curve. The next step is to convert the logistic regression into a standard scorecard. The score in the scorecard is calculated by the following formula:
$$Score = A - B \times \log(odds)$$
Here A and B are constants: A is called the offset (compensation) and B the scale. log(odds) is the log odds of a person defaulting. The linear output of logistic regression is exactly this log odds, $\theta^T x$, that is, the parameters multiplied by the feature matrix, so log(odds) can be taken directly from the fitted model. The two constants can be calculated by substituting two assumed scores into the formula. The two assumptions are:
1. The expected score at a specified odds of default
2. The specified points to double the odds (PDO)
For example, assume that the score is set to 600 when the odds are 1/60, and that PDO = 20. Since Score = A - B*log(odds), doubling the odds to 1/30 lowers the score by 20 points, to 580.
Substituting both points into the linear expression, we get:
$$600 = A - B \times \log(1/60)$$
$$580 = A - B \times \log(1/30)$$
The values of A and B can be easily calculated with numpy:
B = 20/np.log(2)
A = 600 + B*np.log(1/60)
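As a quick check of these two constants, the sketch below recomputes them and evaluates Score = A - B*log(odds) at the reference odds and at doubled odds. The names pdo, ref_score, ref_odds and score are introduced only for this illustration.

```python
import numpy as np

pdo = 20            # points by which the score changes when the odds double
ref_score = 600     # score assigned at the reference odds
ref_odds = 1 / 60   # reference odds

B = pdo / np.log(2)
A = ref_score + B * np.log(ref_odds)

def score(odds):
    """Scorecard score for given odds, Score = A - B * log(odds)."""
    return A - B * np.log(odds)

print(round(score(1 / 60), 2), round(score(1 / 30), 2))
```

With these values the score is 600 at odds of 1/60, and doubling the odds to 1/30 moves the score by exactly PDO points, down to 580.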
With A and B, the scores are easy to obtain: substitute the intercept for log(odds) in the formula to get the base score, and multiply each coefficient by the WOE of each bin to get the score of every bin of that feature
# Base score
base_score = A - B*lr.intercept_
# Score of each bin of one feature, here age (column index 1 in X)
score_age = woeall['age'] * (-B*lr.coef_[0][1])
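# Output path of the scorecard; the file name is an assumption, the original does not define file
file = "ScoreData.csv"

# Write the base score first, then append the score table of every feature
with open(file, "w") as fdata:
    fdata.write("base_score,{}\n".format(base_score))
for i, col in enumerate(X.columns):
    score = woeall[col] * (-B*lr.coef_[0][i])
    score.name = "Score"
    score.index.name = col
    score.to_csv(file, header=True, mode="a")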
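The scorecard works because the base score plus the per-feature scores add up to A - B*log(odds) for any applicant. The sketch below checks this identity with a hypothetical intercept, hypothetical coefficients and hypothetical WOE values; none of these numbers come from the fitted model.

```python
import numpy as np

B = 20 / np.log(2)
A = 600 + B * np.log(1 / 60)

# Hypothetical fitted model: intercept and two coefficients
intercept = -0.3
coef = np.array([0.8, 0.5])
# Hypothetical WOE values of the bins that one applicant falls into
woe_x = np.array([-0.4, 0.6])

base = A - B * intercept              # base points
feature_points = -B * coef * woe_x    # points contributed by each feature
total = base + feature_points.sum()   # what the scorecard adds up to

log_odds = intercept + coef @ woe_x   # what the logistic regression computes
print(round(total, 2), round(A - B * log_odds, 2))
```

Both printed numbers agree, so looking up and summing the points in the scorecard gives the same result as evaluating the model and converting its log odds to a score.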
Step 6: Monitoring and reporting
Quoted from Teacher Cai Cai’s course~