Logistic Regression and Evaluation Metrics for Logistic Regression

1. Introduction to logistic regression

Logistic Regression is a classification model in machine learning. Although its name contains "regression", and it is indeed closely related to linear regression, it is a classification algorithm. Because it is simple and efficient, it is widely used in practice.

1. Application scenarios of logistic regression
  • Ad click-through rate (CTR) prediction
  • Spam detection
  • Disease diagnosis
  • Financial fraud detection
  • Fake account detection

All of these are judgments between two categories. Simply put, logistic regression is a powerful tool for solving binary classification problems.

2. The principle of logistic regression
  1. Define the logistic function: Logistic regression uses a logistic function (sigmoid function) to convert the output of the linear regression model into a probability value. The definition of the logistic function is as follows:

    sigmoid(z) = 1 / (1 + exp(-z))
    where z is the output of the linear regression model and exp() is the natural exponential function.

  2. Construct a linear regression model: Suppose there is a sample x containing n features, the corresponding feature weight is w, and the intercept item is b. The output of the linear regression model is:

    z = w₀x₀ + w₁x₁ + … + wₙxₙ + b
    where x₀ is a constant term with a value of 1.

  3. Apply the logistic function: Bring the output z of the linear regression model into the logistic function to obtain the probability p that the sample belongs to a certain category. The value range of p is between 0 and 1.

    p = sigmoid(z) = 1 / (1 + exp(-z))

  4. Setting the threshold: In order to classify, the probability value p needs to be converted into the result of the binary classification. Usually, you can set a threshold such as 0.5. When p is greater than or equal to 0.5, the sample is predicted as a positive class (1), otherwise it is predicted as a negative class (0).

  5. Model training and parameter estimation: Use the training data set to estimate the parameters w and b of the model through optimization methods such as maximum likelihood estimation or gradient descent, so that the model’s prediction of the training data is as consistent as possible with the actual label.
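
To make these steps concrete, here is a minimal NumPy sketch of the forward computation; the weights, intercept, and sample values are made up for illustration, and the intercept b is kept separate rather than folded into x₀:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one sample with 3 features
w = np.array([0.4, -1.2, 0.7])   # feature weights
b = 0.1                          # intercept term
x = np.array([2.0, 0.5, 1.0])    # one sample

z = np.dot(w, x) + b             # output of the linear model
p = sigmoid(z)                   # probability of the positive class
label = 1 if p >= 0.5 else 0     # apply the 0.5 threshold
print(p, label)                  # ~0.73, predicted class 1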

2. Loss and optimization of logistic regression

1. Loss

In logistic regression, the commonly used loss function is the binary cross-entropy loss function (Binary Cross-Entropy Loss), also known as the logarithmic loss function (Log Loss).

For binary classification problems, given the true label y (0 or 1) of a sample and the predicted probability p of a logistic regression model, the binary cross-entropy loss function can be defined as:

L(y, p) = -[y * log(p) + (1 - y) * log(1 - p)]
Among them, y is the true label of the sample, and p is the probability that the logistic regression model predicts that the sample belongs to the positive example (category 1).

As with any loss function, the smaller its value, the better:

  • When y=1, we want the p-value to be as large as possible
  • When y=0, we want the p-value to be as small as possible
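
A quick numerical check of this intuition, with probability values chosen arbitrarily: when the true label is y = 1, the loss reduces to -log(p), so a confident correct prediction costs little while a confident wrong one costs a lot.

import numpy as np

# Loss -log(p) for a sample whose true label is y = 1
for p in (0.9, 0.5, 0.1):
    print(p, -np.log(p))   # 0.105, 0.693, 2.303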

For logistic regression problems, the complete loss function is the mean or sum of the binary cross-entropy loss functions over all training samples.

Suppose there are m training samples, the true label of each sample is yᵢ (0 or 1), and the predicted probability of the logistic regression model is pᵢ. Then the complete loss function can be defined as:

L(w, b) = (1/m) * Σ[-yᵢ * log(pᵢ) - (1 - yᵢ) * log(1 - pᵢ)]
where w is the feature weight vector and b is the intercept term.

2. Optimization

Gradient descent is used to minimize this loss function. By iteratively updating the weight parameters w and the intercept b, the predicted probability is pushed up for samples that truly belong to class 1 and pushed down for samples that truly belong to class 0.
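
As a rough sketch of this optimization (synthetic data; the learning rate and iteration count are arbitrary choices, not values from the text), batch gradient descent on the binary cross-entropy loss looks like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic training data: 4 samples, 2 features, binary labels
X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5], [2.5, 3.0]])
y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1])   # feature weights
b = 0.0                    # intercept
lr = 0.1                   # learning rate

for _ in range(1000):
    p = sigmoid(X @ w + b)                                     # predicted probabilities
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # binary cross-entropy
    grad_w = X.T @ (p - y) / len(y)                            # dL/dw
    grad_b = np.mean(p - y)                                    # dL/db
    w -= lr * grad_w                                           # gradient descent update
    b -= lr * grad_b

print(w, b, loss)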

3. Logistic regression API

  • sklearn.linear_model.LogisticRegression(solver="liblinear", penalty="l2", C=1.0)

    • solver optional parameters: ("liblinear", "sag", "saga", "newton-cg", "lbfgs")
      • Default: "liblinear"; the algorithm to use for the optimization problem.
      • "liblinear" is a good choice for small datasets, while "sag" and "saga" are faster for large datasets.
      • For multi-class problems, only "newton-cg", "sag", "saga" and "lbfgs" can handle the multinomial loss; "liblinear" is limited to binary classification.
    • penalty: regularization type
    • C: inverse of regularization strength; smaller values mean stronger regularization

    By default, the class with the smaller number of samples is treated as the positive class.

    LogisticRegression is essentially equivalent to SGDClassifier(loss="log", penalty="l2"): SGDClassifier implements plain stochastic gradient descent (SGD), while LogisticRegression can also use the SAG (stochastic average gradient) solver.
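
A minimal usage sketch with scikit-learn; the breast-cancer dataset and the train/test split are illustrative choices, not part of the original text:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load a built-in binary classification dataset and split it
X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

# Standardize the features, then fit the logistic regression model
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

estimator = LogisticRegression(solver="liblinear", penalty="l2", C=1.0)
estimator.fit(x_train, y_train)

y_pre = estimator.predict(x_test)                    # predicted labels
print("accuracy:", estimator.score(x_test, y_test))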

4. Evaluation metrics for logistic regression

  1. Accuracy: Accuracy is the most common evaluation indicator, which represents the ratio of the number of samples that the model predicts correctly to the total number of samples.
  2. Precision: Precision measures the proportion of true positives among all samples the model predicts as positive.
  3. Recall: Recall measures the proportion of true positive samples that the model correctly predicts as positive.
  4. F1 Score: The F1 score is the harmonic mean of precision and recall, taking both into account.

These evaluation metrics help us assess the performance of a logistic regression model from different angles. Accuracy measures how well the model predicts overall, while precision and recall focus on how well it handles the positive class. The F1 score combines precision and recall and is appropriate when the two need to be balanced.

In addition to the above indicators, indicators such as confusion matrix, ROC curve, and AUC (Area Under the Curve) can also be considered to evaluate the performance of the logistic regression model. The specific indicator to choose depends on the specific problem and needs.

5. Confusion Matrix

1. Confusion Matrix

In a classification task, there are four possible combinations of the predicted result and the true result, and together they form the confusion matrix (which also extends to multi-class problems): true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).

A prediction is true (T) when it matches the true result and false (F) when it does not; positive (P) and negative (N) refer to the class that was predicted.

2. Precision and Recall
  • Precision: Precision measures the proportion of true positives among all samples the model predicts as positive. It can be calculated as:

    Precision = TP / (TP + FP)

  • Recall: Recall measures the proportion of true positive samples that the model correctly predicts as positive. It can be calculated as:

    Recall = TP / (TP + FN)

3. Other classification evaluation methods
  • F1 Score: The F1 score is the harmonic mean of precision and recall, taking both into account. It can be calculated as:

    F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
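
A small sketch that applies these three formulas to hypothetical confusion-matrix counts:

# Hypothetical counts for the positive class
TP, FP, FN = 80, 10, 20

precision = TP / (TP + FP)                             # 80 / 90  ≈ 0.889
recall = TP / (TP + FN)                                # 80 / 100 = 0.800
f1 = 2 * (precision * recall) / (precision + recall)   # ≈ 0.842
print(precision, recall, f1)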

4. Classification Evaluation API
  • sklearn.metrics.classification_report(y_true, y_pred, labels=[], target_names=None)

    • y_true: true target value
    • y_pred: predicted target value
    • labels: the label values (numbers) of the classes to include in the report
    • target_names: target category names
    • return: precision and recall for each category
# Precision / recall evaluation report (assumes y_test and y_pre come from a fitted classifier,
# with class labels 2 = benign and 4 = malignant)
from sklearn.metrics import classification_report

ret = classification_report(y_true=y_test, y_pred=y_pre, labels=(2, 4), target_names=("benign", "malignant"))
print(ret)

5. How should models be evaluated under sample imbalance?

Consider the following situation: there are 99 cancer samples and 1 non-cancer sample, and the model simply predicts every sample as positive (cancer is the positive class). Its accuracy is 99%, yet the model is useless: the one healthy patient is misdiagnosed, which is a serious medical error. This is the evaluation problem under sample imbalance.

  • How to judge whether the sample is balanced?

    In a binary classification problem, if the ratio between the two classes exceeds 4:1, the samples are generally considered imbalanced.

    In the case of unbalanced samples, ROC curve and AUC indicators are needed for evaluation
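
A quick way to check the class ratio; the label array below is made up to mirror the 99:1 example above:

import numpy as np

y = np.array([1] * 99 + [0] * 1)      # hypothetical, highly imbalanced labels
counts = np.bincount(y)               # samples per class: [1, 99]
print(counts.max() / counts.min())    # 99.0 -> far beyond 4:1, so ROC/AUC is preferred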

6. ROC curve and AUC indicator

1. TPR and FPR
  • TPR = TP / (TP + FN)
    • The proportion of samples predicted as class 1 among all samples whose true class is 1 (this is the same as recall)
  • FPR = FP / (FP + TN)
    • The proportion of samples predicted as class 1 among all samples whose true class is 0
2. ROC curve

  • ROC (Receiver Operating Characteristic) curve is a common tool for evaluating the performance of binary classification models. It shows the trade-off of the model between positive and negative examples by plotting the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR) of the classifier at different thresholds.

  • The horizontal axis of the ROC curve is FPR, and the vertical axis is TPR. When drawing the ROC curve, we calculate the TPR and FPR in different cases by changing the threshold of the classifier.

  • Ideally, the ROC curve of a well-performing classifier is as close to the upper-left corner as possible, that is, TPR close to 1 and FPR close to 0. This means the classifier identifies as many true positives as possible while producing as few false positives as possible. Points on the diagonal (FPR = TPR) correspond to random guessing.

3. AUC indicator
  • AUC (Area Under the Curve) is the area under the ROC curve, which is a commonly used evaluation indicator for measuring the performance of the binary classification model. The range of AUC value is between 0 and 1. The closer the value is to 1, the better the performance of the classifier, and the closer the value is to 0.5, the closer the performance of the classifier is to random guessing.
  • When the AUC value is close to 1, the model ranks positive samples ahead of negative samples well; when the AUC is close to 0.5, the model performs no better than random guessing; and when the AUC is below 0.5, the model performs even worse than random guessing.
  • As a comprehensive evaluation index, AUC can reflect the performance of the classifier more comprehensively than the simple accuracy rate or precision rate, especially more effective in the case of imbalanced categories. It is not affected by the classification threshold, and can compare and evaluate the performance of classifiers under different thresholds.
4. AUC calculation API
  • from sklearn.metrics import roc_auc_score

    • sklearn.metrics.roc_auc_score(y_true, y_score)
      • Computes the area under the ROC curve, i.e., the AUC value
      • y_true: the true class of each sample, which must be marked 0 (negative) or 1 (positive)
      • y_score: the prediction score, which can be the estimated probability of the positive class, a confidence value, or the return value of the classifier's decision function
  • AUC can only be used to evaluate binary classifiers

  • AUC is well suited to evaluating classifier performance when the samples are imbalanced
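
A short sketch of this API, reusing the hypothetical estimator and test split from the LogisticRegression example above:

from sklearn.metrics import roc_auc_score

# y_test holds the true 0/1 labels; y_score is the predicted probability of the positive class
y_score = estimator.predict_proba(x_test)[:, 1]
print("AUC:", roc_auc_score(y_test, y_score))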