Build a scorecard model using LR

Step 1: Get data and read data

The data is personal consumer loan data, and the goal is to build an application scorecard (A card). The columns are as follows:
SeriousDlqin2yrs (whether the borrower was 90 days or more past due): this is the label. SeriousDlqin2yrs = 1 marks a bad customer; SeriousDlqin2yrs = 0 marks a good customer
RevolvingUtilizationOfUnsecuredLines (ratio of the available line on loans and credit cards to the total line)
age (age of the borrower)
NumberOfTime30-59DaysPastDueNotWorse (number of times 30-59 days past due, but no worse, in the past two years)
DebtRatio (monthly debt repayment, alimony and living expenses divided by gross monthly income)
MonthlyIncome (monthly income)
NumberOfOpenCreditLinesAndLoans (number of open loans and lines of credit)
NumberOfTimes90DaysLate (number of times 90 days or more past due in the past two years)
NumberRealEstateLoansOrLines (number of mortgages and real estate loans, including home equity lines of credit)
NumberOfTime60-89DaysPastDueNotWorse (number of times 60-89 days past due, but no worse, in the past two years)
NumberOfDependents (number of dependents in the family, such as spouse and children, excluding the borrower)

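%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt #used for the plots later on
import scipy.stats #used for the chi-square test during binning
from sklearn.linear_model import LogisticRegression as LR
data = pd.read_csv(r".\rankingcard.csv",index_col=0)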

Step 2: Data cleaning and feature engineering

Data exploration

#Look at the first rows
data.head()#The first column is the label; the remaining 10 columns are features
 
#Look at the data structure
data.shape#(150000, 11)
data.info()

#Explore missing values
data.isnull().sum()/data.shape[0]#Proportion of missing values in each column
#data.isnull().mean()#Equivalent to the previous line

Delete duplicates and fill missing values

Delete duplicate values

data.drop_duplicates(inplace=True)
#Be sure to reset the index after deleting rows
data.index = range(data.shape[0])

Fill features that have few missing values with the mean

data['NumberOfDependents'] = data['NumberOfDependents'].fillna(int(data['NumberOfDependents'].mean()))

For features with a large proportion of missing values, fill them with a random forest

def fill_missing_rf(X,y,to_fill):

    """
    Function to fill missing values of a feature using Random Forest

    Parameters:
    X: feature matrix that contains the column to fill
    y: complete labels with no missing values
    to_fill: string, the name of the column to be filled
    """

    #Build the new feature matrix and the new label
    df = X.copy()
    fill = df.loc[:,to_fill]
    df = pd.concat([df.loc[:,df.columns != to_fill],pd.DataFrame(y)],axis=1)

    # Find out our training and test sets
    Ytrain = fill[fill.notnull()]
    Ytest = fill[fill.isnull()]
    Xtrain = df.iloc[Ytrain.index,:]
    Xtest = df.iloc[Ytest.index,:]

    #Fill missing values with random forest regression
    from sklearn.ensemble import RandomForestRegressor
    rfr = RandomForestRegressor(n_estimators=100)
    rfr = rfr.fit(Xtrain, Ytrain)
    Ypredict = rfr.predict(Xtest)

    return Ypredict

X = data.iloc[:,1:]
y = data["SeriousDlqin2yrs"]#y = data.iloc[:,0]
X.shape#(149391, 10)

#=====[TIME WARNING: 1 min]=====#
y_pred = fill_missing_rf(X,y,"MonthlyIncome")

#The following line checks that the number of predictions matches the number of missing values
# y_pred.shape == data.loc[data.loc[:,"MonthlyIncome"].isnull(),"MonthlyIncome"].shape

#After confirming that the result is reasonable, overwrite the missing values
data.loc[data.loc[:,"MonthlyIncome"].isnull(),"MonthlyIncome"] = y_pred

data.info()

Outlier handling

#We have also observed abnormal values. The minimum age is 0, which does not fit the bank's business: even a child's account requires an age of at least 8.
#Check how many people have age 0
(data["age"] == 0).sum()
#Only one person has age 0, so this is almost certainly an entry error. Treat it as a missing value and delete the sample directly.
data = data[data["age"] != 0]

The problem of unbalanced samples

Address the sample imbalance by oversampling the minority class with SMOTE, so that the positive and negative samples become 1:1

#Explore the distribution of the labels
X = data.iloc[:,1:]
y = data.iloc[:,0]
 
y.value_counts()#Count the samples in each class to check whether they are balanced
 
n_sample = X.shape[0]
 
n_1_sample = y.value_counts()[1]
n_0_sample = y.value_counts()[0]
 
print('Number of samples: {}; 1 accounts for {:.2%}; 0 accounts for {:.2%}'.format(n_sample,n_1_sample/n_sample,n_0_sample/n_sample))
#Number of samples: 149165; 1 accounts for 6.62%; 0 accounts for 93.38%


#If the import fails, install the package at the prompt: pip install imbalanced-learn
import imblearn
#imblearn is a library dedicated to imbalanced data sets; for sample imbalance problems it works much better than sklearn alone
#imblearn also provides classes that must be instantiated and fitted, much like sklearn
 
from imblearn.over_sampling import SMOTE
 
sm = SMOTE(random_state=42) #Instantiation
X,y = sm.fit_resample(X,y) #older imblearn versions named this method fit_sample
 
n_sample_ = X.shape[0]#278584

pd.Series(y).value_counts()
 
n_1_sample = pd.Series(y).value_counts()[1]
n_0_sample = pd.Series(y).value_counts()[0]
 
print('Number of samples: {}; 1 accounts for {:.2%}; 0 accounts for {:.2%}'.format(n_sample_,n_1_sample/n_sample_,n_0_sample/n_sample_))
#Number of samples: 278584; 1 accounts for 50.00%; 0 accounts for 50.00%
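
SMOTE does not simply duplicate minority rows: it creates synthetic points on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below illustrates only that interpolation step in plain Python. The helper name `smote_like_sample`, the Euclidean distance and the toy points are assumptions made for illustration; this is not imblearn's implementation.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbors.
    Hypothetical helper, for illustration only."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Sort the other minority points by Euclidean distance to the base point
    others = sorted((p for p in minority if p is not base),
                    key=lambda p: math.dist(base, p))
    neighbor = rng.choice(others[:k])
    lam = rng.random()  # interpolation weight in [0, 1)
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbor))

# Toy minority class with two features per sample
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(42))
print(synthetic)  # a new point lying between two existing minority points
```

Because every synthetic row lies between two real minority rows, the oversampled data stays inside the region the minority class already occupies.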

Divide training set and test set

from sklearn.model_selection import train_test_split
X = pd.DataFrame(X)
y = pd.DataFrame(y)
 
X_train, X_vali, Y_train, Y_vali = train_test_split(X,y,test_size=0.3,random_state=420)
model_data = pd.concat([Y_train, X_train], axis=1)#Training data to build a model
model_data.index = range(model_data.shape[0])
model_data.columns = data.columns
 
vali_data = pd.concat([Y_vali, X_vali], axis=1)#validation set
vali_data.index = range(vali_data.shape[0])
vali_data.columns = data.columns
 
model_data.to_csv(r".\model_data.csv")#training data
vali_data.to_csv(r".\vali_data.csv")#validation data

Binning variables

Goal of binning

The desired effect of binning:
We want people with different attributes to receive different scores, so the people within one bin should be as similar as possible and the people in different bins as different as possible: "large differences between groups, small differences within groups". For the scorecard, this means that the default probability should be similar within a bin and very different between bins, that is, the WOE gap between bins should be large. We can use the chi-square test to compare two adjacent bins. If the p-value of the chi-square test between two bins is large, the bins are very similar and can be merged into one.
Based on this idea, the steps for binning a feature are:
(1) First split the continuous variable into a fairly large number of groups (q groups)
(2) Make sure every group contains samples of both classes, otherwise the IV value cannot be calculated
(3) Run a chi-square test on each pair of adjacent groups and merge the pair with the largest p-value, repeating until the number of groups drops to the target N
(4) Let the feature be divided into [2, 3, 4, ..., 20] bins, observe how the IV value changes with the number of bins, and pick the most suitable number
(5) After binning, calculate the WOE value of each bin and inspect the result
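
To make the merge criterion in step (3) concrete, here is a minimal sketch of the chi-square test on two adjacent bins in plain Python. The function name `chi2_pvalue_2x2` and the counts are made up for illustration. It computes the plain Pearson statistic, whereas `scipy.stats.chi2_contingency`, which the binning function in this article calls, applies a continuity correction to 2x2 tables by default, so the p-values will differ slightly.

```python
import math

def chi2_pvalue_2x2(bin_a, bin_b):
    """Pearson chi-square test (no continuity correction) on two bins,
    each given as (count_0, count_1). Returns (statistic, p_value)."""
    table = [list(bin_a), list(bin_b)]
    row_tot = [sum(row) for row in table]
    col_tot = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_tot)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

similar = chi2_pvalue_2x2((900, 100), (890, 110))    # bad rates 10% vs 11%
different = chi2_pvalue_2x2((900, 100), (600, 400))  # bad rates 10% vs 40%
print(similar[1] > different[1])  # the similar pair has the larger p-value
```

The pair with the larger p-value is the better candidate for merging, because the test finds no evidence that the two bins differ in bad rate.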

First, split each feature into equal-frequency bins; the initial number of bins is q. Count the 0s and 1s in each bin, from which the WOE and IV values can be calculated, and then merge bins using the chi-square test.

def graphforbestbin(DF, X, Y, n=5,q=20,graph=True):
    '''
    Automatic optimal binning function, binning based on chi-square test

    parameter:
    DF: the data that needs to be entered
    X: The column name that needs to be binned
    Y: The label Y column name corresponding to the binned data
    n: number of reserved bins
    q: the number of initial bins
    graph: Whether to draw the IV image

    The intervals are left-open and right-closed: (a, b]

    '''
    
    DF = DF[[X,Y]].copy()

    DF["qcut"],bins = pd.qcut(DF[X], retbins=True, q=q,duplicates="drop")
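    #observed=False keeps empty bins, so the counts stay aligned with the bin edges
    count_y0 = DF.loc[DF[Y]==0].groupby(by="qcut", observed=False).count()[Y]
    count_y1 = DF.loc[DF[Y]==1].groupby(by="qcut", observed=False).count()[Y]
    num_bins = [*zip(bins,bins[1:],count_y0,count_y1)]

    #Make sure every bin contains both 0 and 1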
    for i in range(q):
        if 0 in num_bins[0][2:]:
            num_bins[0:2] = [(
                num_bins[0][0],
                num_bins[1][1],
                num_bins[0][2] + num_bins[1][2],
                num_bins[0][3] + num_bins[1][3])]
            continue

        for i in range(len(num_bins)):
            if 0 in num_bins[i][2:]:
                num_bins[i-1:i+1] = [(
                    num_bins[i-1][0],
                    num_bins[i][1],
                    num_bins[i-1][2] + num_bins[i][2],
                    num_bins[i-1][3] + num_bins[i][3])]
                break
        else:
            break

    def get_woe(num_bins):
        columns = ["min","max","count_0","count_1"]
        df = pd.DataFrame(num_bins,columns=columns)
        df["total"] = df.count_0 + df.count_1
        df["percentage"] = df.total / df.total.sum()
        df["bad_rate"] = df.count_1 / df.total
        df["good%"] = df.count_0/df.count_0.sum()
        df["bad%"] = df.count_1/df.count_1.sum()
        df["woe"] = np.log(df["good%"] / df["bad%"])
        return df

    def get_iv(df):
        rate = df["good%"] - df["bad%"]
        iv = np.sum(rate * df.woe)
        return iv

    IV = []
    axisx = []
    while len(num_bins) > n:
        pvs = []
        for i in range(len(num_bins)-1):
            x1 = num_bins[i][2:]
            x2 = num_bins[i + 1][2:]
            pv = scipy.stats.chi2_contingency([x1,x2])[1]
            pvs.append(pv)

        i = pvs.index(max(pvs))
        num_bins[i:i + 2] = [(
            num_bins[i][0],
            num_bins[i+1][1],
            num_bins[i][2] + num_bins[i + 1][2],
            num_bins[i][3] + num_bins[i + 1][3])]

        bins_df = pd.DataFrame(get_woe(num_bins))
        axisx.append(len(num_bins))
        IV.append(get_iv(bins_df))
        
    if graph:
        plt.figure()
        plt.plot(axisx, IV)
        plt.xticks(axisx)
        plt.xlabel("number of box")
        plt.ylabel("IV")
        plt.show()
    return bins_df
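
As a small worked example of the two formulas used in get_woe and get_iv, the sketch below computes WOE = ln(good% / bad%) per bin and IV = sum((good% - bad%) * WOE) in plain Python. The helper name `woe_iv` and the counts are made up for illustration.

```python
import math

def woe_iv(bins):
    """bins: list of (count_0, count_1) per bin, where 0 = good and 1 = bad.
    Returns (list of WOE values, total IV)."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_pct = good / total_good   # share of all good customers in this bin
        bad_pct = bad / total_bad      # share of all bad customers in this bin
        woe = math.log(good_pct / bad_pct)
        woes.append(woe)
        iv += (good_pct - bad_pct) * woe
    return woes, iv

# Made-up counts in which the bad rate falls from bin to bin
woes, iv = woe_iv([(100, 100), (300, 80), (600, 20)])
print([round(w, 3) for w in woes], round(iv, 3))
```

A bin that holds a larger share of the good customers than of the bad customers gets a positive WOE. Every term of the IV sum is non-negative, because (good% - bad%) and ln(good% / bad%) always have the same sign.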

What good binning looks like
1. The bad_rate of the bins should differ as much as possible
2. The larger the WOE differences the better, and WOE should ideally be monotonic across the bins, either from positive to negative or from negative to positive; at most one turning point is acceptable
3. If the WOE turns twice or more, for example in a W shape, there is a problem with the binning
4. The more information num_bins retains, the better
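
The turning-point rule above can be checked mechanically. The following is a minimal sketch with a hypothetical helper `count_turns` and made-up WOE sequences; it counts how often the WOE values switch between rising and falling from one bin to the next.

```python
def count_turns(woe_values):
    """Count how many times the sequence switches between rising and falling."""
    # Direction of each step between neighboring bins, ignoring flat steps
    signs = []
    for prev, cur in zip(woe_values, woe_values[1:]):
        if cur != prev:
            signs.append(1 if cur > prev else -1)
    # A turn is a change of direction between consecutive steps
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

monotonic = count_turns([-1.2, -0.4, 0.3, 0.9])   # 0 turns
u_shape = count_turns([0.8, -0.1, -0.6, 0.2])     # 1 turn
zigzag = count_turns([0.5, -0.3, 0.4, -0.2])      # 2 turns
print(monotonic, u_shape, zigzag)
```

A feature whose bins give two or more turns is a candidate for fewer bins or for manual bin edges.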

Next, use the function above to compute the IV value of each feature for different numbers of bins, and choose the final number of bins at the point where the IV value starts to drop steeply

model_data.columns

for i in model_data.columns[1:-1]:
    print(i)
    graphforbestbin(model_data,i,"SeriousDlqin2yrs",n=2,q=20)

The loop plots the IV value against the number of bins for each feature

Some features cannot be binned automatically and need manual binning; the rest use the automatic bins

Not every feature can be split into that many bins automatically. For example, the number of dependents cannot be divided into 20 groups. So we bin the features that support it automatically and inspect the remaining variables by hand.
The steps are:
1. For the features that can be binned automatically, set the number of bins and compute the bin edges via equal-frequency binning and the chi-square test. For the manually binned features, set the edges by hand. In both cases, replace the largest edge with np.inf and the smallest with -np.inf, so that values outside the range seen in training still fall into a bin.
2. Count the 1s and 0s in each bin, compute the WOE value of each bin for each feature, and map the WOE values onto the feature matrix
3. Build the training set and test set in this form, ready for model fitting
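
To see why the outer edges are replaced with infinities, here is a small sketch in plain Python that assigns values to left-open, right-closed bins using `bisect`. The helper `assign_bin` and the edge values are illustrative assumptions; the article itself relies on pd.cut for this step.

```python
import bisect
import math

def assign_bin(value, edges):
    """Index of the left-open, right-closed bin (a, b] that holds value.
    edges must be sorted, start with -inf and end with +inf."""
    return bisect.bisect_left(edges, value) - 1

# Edges as observed in training; the last entry is the observed maximum
observed_edges = [0, 1, 2, 3]
# Keep all but the last edge and pad both ends with infinities
open_edges = [-math.inf, *observed_edges[:-1], math.inf]

for value in (0, 1, 3, 30):
    i = assign_bin(value, open_edges)
    print(value, "->", (open_edges[i], open_edges[i + 1]))
```

With the original edges, a new value such as 30 would fall outside every bin and receive no WOE value; with the open-ended edges it lands in the last bin.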

auto_col_bins = {"RevolvingUtilizationOfUnsecuredLines":6,
                "age": 5,
                "DebtRatio": 4,
                "MonthlyIncome": 3,
                "NumberOfOpenCreditLinesAndLoans":5}
 
#Variables that cannot be binned automatically
hand_bins = {"NumberOfTime30-59DaysPastDueNotWorse":[0,1,2,13]
            ,"NumberOfTimes90DaysLate":[0,1,2,17]
            ,"NumberRealEstateLoansOrLines":[0,1,2,4,54]
            ,"NumberOfTime60-89DaysPastDueNotWorse":[0,1,2,8]
            ,"NumberOfDependents":[0,1,2,3]}
 
#Ensure full coverage: replace the largest edge with np.inf and the smallest with -np.inf
#Reason: new values may appear, for example 30 dependents, which never occurred in training. With open-ended edges such values can still be assigned to a bin
hand_bins = {k:[-np.inf,*v[:-1],np.inf] for k,v in hand_bins.items()}

bins_of_col = {}
 
# Generate the binning interval of automatic binning and the IV value after binning
 
for col in auto_col_bins:
    bins_df = graphforbestbin(model_data, col
                             ,"SeriousDlqin2yrs"
                             ,n=auto_col_bins[col]
                             #Use the properties of the dictionary to get the number of boxes corresponding to each feature
                             ,q=20
                             ,graph=False)
    bins_list = sorted(set(bins_df["min"]).union(bins_df["max"]))
    #Ensure full coverage: replace the largest edge with np.inf and the smallest with -np.inf
    bins_list[0], bins_list[-1] = -np.inf, np.inf
    bins_of_col[col] = bins_list
    
#Merge in the manual bins
bins_of_col.update(hand_bins)
 
data = model_data.copy()
 
#pd.cut bins the data according to known bin edges
#Usage: pd.cut(data, list of bin edges)
data = data[["age","SeriousDlqin2yrs"]].copy()
 
data["cut"] = pd.cut(data["age"],[-np.inf, 48.49986200790144, 58.757170160044694, 64.0, 74.0, np.inf])
 
data.head()

#Aggregate the data according to the binning results, and take out the label value
data.groupby("cut")["SeriousDlqin2yrs"].value_counts()
 
#Use unstack() to turn the multi-level index into a table
data.groupby("cut")["SeriousDlqin2yrs"].value_counts().unstack()
 
bins_df = data.groupby("cut")["SeriousDlqin2yrs"].value_counts().unstack()
 
bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))

def get_woe(df, col, y, bins):
    df = df[[col,y]].copy()
    df["cut"] = pd.cut(df[col],bins)
    bins_df = df.groupby("cut")[y].value_counts().unstack()
    woe = bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))
    return woe
 
#Store the WOE of all features in a dictionary
woeall = {}
for col in bins_of_col:
    woeall[col] = get_woe(model_data,col,"SeriousDlqin2yrs",bins_of_col[col])
    
#To avoid overwriting the original data, create a new DataFrame whose index matches model_data exactly
model_woe = pd.DataFrame(index=model_data.index)
 
#Bin the original data, then use map to replace each bin with its WOE value
model_woe["age"] = pd.cut(model_data["age"],bins_of_col["age"]).map(woeall["age"])
 
#All feature operations can be written as:
for col in bins_of_col:
    model_woe[col] = pd.cut(model_data[col],bins_of_col[col]).map(woeall[col])
    
#Add labels to data
model_woe["SeriousDlqin2yrs"] = model_data["SeriousDlqin2yrs"]
 
#This is our modeling data
model_woe.head()

vali_woe = pd.DataFrame(index=vali_data.index)
 
for col in bins_of_col:
    vali_woe[col] = pd.cut(vali_data[col],bins_of_col[col]).map(woeall[col])
vali_woe["SeriousDlqin2yrs"] = vali_data["SeriousDlqin2yrs"]
 
vali_X = vali_woe.iloc[:,:-1]
vali_y = vali_woe.iloc[:,-1]

Step 3: Model Development

Fit the model with logistic regression, first training with the default parameters

X = model_woe.iloc[:,:-1]
y = model_woe.iloc[:,-1]
 
from sklearn.linear_model import LogisticRegression as LR
 
lr = LR().fit(X,y)
lr.score(vali_X,vali_y)#0.8641356370249832


Tune the parameters by drawing learning curves

The main parameters of logistic regression are C, penalty, solver, max_iter and multi_class
C is the inverse of the regularization strength. The smaller C is, the more heavily the model penalizes large coefficients and the stronger the regularization, which helps prevent overfitting
penalty accepts 'l1' or 'l2' to select the regularization method; the default is 'l2'
Note that with L1 regularization the solver can only be 'liblinear' or 'saga'. With L2 regularization, every solver can be used.
solver selects the optimization algorithm. 'liblinear' suits small data sets and binary classification and was the default in older scikit-learn versions; since version 0.22 the default is 'lbfgs'
multi_class accepts 'ovr', 'multinomial' or 'auto' and tells the model what type of classification problem it is dealing with; the default was 'ovr' in older versions and is 'auto' since version 0.22
ovr: one-vs-rest, each class is handled as a binary problem
multinomial: multi-class
auto: chooses according to the label type and the solver

l1 and l2 regularization

As L1 regularization is strengthened, the parameters of features that carry little information and contribute little to the model shrink to 0 faster than the parameters of informative features. L1 regularization therefore acts as feature selection and controls the sparsity of the parameters. Generally, if there are many features and the data is high-dimensional, we tend to use L1 regularization.
In contrast, as L2 regularization is strengthened, it tries to let every feature keep some small contribution to the model: the parameters of features that carry little information become very close to 0 but not exactly 0. Generally, if the main purpose is only to prevent overfitting, L2 regularization is enough
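
The difference between the two penalties can be seen in a deliberately simplified setting. The sketch below is not logistic regression: it assumes a single coefficient whose unpenalized estimate is w and minimizes a squared-error term plus the penalty, which has a closed-form solution in both cases. The function names and numbers are made up for illustration.

```python
def l1_shrink(w, lam):
    """Minimizer of 0.5*(b - w)**2 + lam*abs(b): soft thresholding."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small coefficients are set exactly to zero

def l2_shrink(w, lam):
    """Minimizer of 0.5*(b - w)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return w / (1.0 + lam)

weights = [2.0, 0.3, -0.1]           # unpenalized estimates
l1 = [l1_shrink(w, 0.5) for w in weights]
l2 = [l2_shrink(w, 0.5) for w in weights]
print(l1)  # [1.5, 0.0, 0.0]
print(l2)  # every entry shrinks, none becomes exactly zero
```

L1 removes the two weak coefficients entirely, which is the feature-selection effect described above, while L2 keeps all three and only scales them down.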

c_1 = np.linspace(0.01,1,20)
c_2 = np.linspace(0.01,0.2,20)
 
score = []
for i in c_1:
    lr = LR(solver='liblinear',C=i).fit(X,y)
    score.append(lr.score(vali_X,vali_y))
plt.figure()
plt.plot(c_1, score)
plt.show()
 
lr.n_iter_#array([7], dtype=int32)
 
score = []
for i in [1,2,3,4,5,6]:
    lr = LR(solver='liblinear', C=0.025, max_iter=i).fit(X,y)
    score.append(lr.score(vali_X,vali_y))
plt.figure()
plt.plot([1,2,3,4,5,6], score)
plt.show()

Since the model has only 10 features and is not particularly high-dimensional, L2 regularization is sufficient. The for loops draw learning curves over different values of C and max_iter. The best accuracy found is close to the accuracy obtained without any tuning, so there is no need for further parameter adjustment.

Step 4: Model testing and evaluation

ROC curve

The curve lies toward the upper left and the AUC is 0.94, which is a better result than the accuracy suggests. For a scorecard model, whose purpose is to capture the minority class, this is quite good, and the recall is relatively high.

import scikitplot as skplt
 
#%%cmd
#pip install scikit-plot
 
vali_proba_df = pd.DataFrame(lr.predict_proba(vali_X))
skplt.metrics.plot_roc(vali_y, vali_proba_df,
                        plot_micro=False, figsize=(6,6),
                        plot_macro=False)
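
To make the AUC figure concrete: AUC equals the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative one. The sketch below computes it directly from that definition in plain Python. The helper `auc_by_pairs` and the toy data are made up for illustration, and the double loop is only practical for small samples.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because AUC depends only on how the samples are ranked, it is not tied to the 0.5 decision threshold that accuracy uses.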

Step 5: Online model

Create a scorecard model

After modeling, the predictive ability of the model was verified by using the accuracy rate and ROC curve. The next step is to use logistic regression to convert to a standard scorecard. The score in the scorecard is calculated by the following formula:

$$\text{Score} = A - B \cdot \log(\text{odds})$$
Here A and B are constants: A is called the offset (compensation) and B is called the scale. log(odds) expresses how likely a person is to default. The log-odds form of logistic regression is exactly $\theta^T x$, that is, the parameters multiplied by the feature matrix, so log(odds) is what the fitted model gives us. The two constants can be obtained by substituting two assumed scores into the formula. The two assumptions are:
1. The expected score at a given odds of default
2. The number of points needed to double the odds (PDO)
For example, suppose the score at odds of 1/60 is set to 600 and PDO = 20. Doubling the odds of default to 1/30 then lowers the score by 20 points, to 580.
Substituting into the linear expression above gives:

$$600 = A - B \cdot \log(1/60)$$

$$580 = A - B \cdot \log(1/30)$$

Subtracting the second equation from the first gives $20 = B \cdot \log 2$. The values of A and B are then easy to calculate with numpy:

B = 20/np.log(2)
A = 600 + B*np.log(1/60)
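
As a quick check of the two constants, the sketch below recomputes A and B with the standard library only and evaluates the score formula at a few odds. The helper name `score` is made up for illustration. With Score = A - B*log(odds) and B = 20/ln(2), each doubling of the odds of default lowers the score by 20 points, so these values give 580 at odds of 1/30.

```python
import math

PDO = 20
B = PDO / math.log(2)
A = 600 + B * math.log(1 / 60)

def score(odds):
    """Scorecard score for the given odds of default."""
    return A - B * math.log(odds)

print(round(score(1 / 60), 6))   # 600.0, the anchor point
print(round(score(1 / 30), 6))   # 580.0, odds doubled
print(round(score(1 / 120), 6))  # 620.0, odds halved
```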

With A and B the scores are easy to compute: substitute the intercept for log(odds) to get the base score, and substitute each feature's coefficient multiplied by the WOE of each bin to get the score of that feature's bins.
The coefficients are those of each feature after modeling

#Base score
base_score = A -B*lr.intercept_
# Get the score corresponding to each feature
score_age = woeall['age'] *(-B*lr.coef_[0][1])
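
The reason the score can be split into a base score plus one term per feature is that log(odds) is linear in the WOE values. The sketch below checks this identity with made-up numbers: the intercept, coefficients and WOE values are hypothetical and do not come from the fitted model.

```python
import math

B = 20 / math.log(2)
A = 600 + B * math.log(1 / 60)

# Hypothetical fitted model: an intercept and one coefficient per feature
intercept = -0.02
coef = {"age": -0.6, "DebtRatio": -0.8}
# Hypothetical WOE values of the bins that one applicant falls into
applicant_woe = {"age": 0.35, "DebtRatio": -0.20}

# Route 1: compute log(odds) first, then convert it to a score
log_odds = intercept + sum(coef[f] * applicant_woe[f] for f in coef)
score_direct = A - B * log_odds

# Route 2: base score plus the points of each feature's bin
base_score = A - B * intercept
feature_points = {f: -B * coef[f] * applicant_woe[f] for f in coef}
score_from_card = base_score + sum(feature_points.values())

print(round(score_direct, 6) == round(score_from_card, 6))  # True
```

This is why the scorecard can be stored as a simple lookup table: scoring an applicant only requires adding the points of the bins they fall into to the base score.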

Score of each bin

file = r".\score_card.csv" #hypothetical output path; the original does not define file
with open(file,"w") as fdata:
    fdata.write("base_score,{}\n".format(base_score))
for i,col in enumerate(X.columns):
    score = woeall[col] * (-B*lr.coef_[0][i])
    score.name = "Score"
    score.index.name = col
    score.to_csv(file, header=True, mode="a")

Step 6: Monitoring and reporting

Quoted from Teacher Cai Cai’s course~