Performance evaluation indicators (precision, recall, ROC, AUC)

Empirical error, overfitting, underfitting

Empirical error, overfitting, and underfitting are common concepts in machine learning, and they are related to the performance and generalization ability of the model.

Empirical Error refers to the error or loss on the training data. It is obtained by applying the model to the training data and measuring the difference between the predicted results and the true results. Generally, our goal is to minimize the empirical error so that the model fits the training data well.

Overfitting means that the model performs too well on training data, but performs poorly on new data. Overfitting may be caused by the model being too complex, too little training data, or the presence of noisy data. When a model remembers subtle differences and noise in the training data, it overfits, resulting in poor predictive performance on unknown data.

Underfitting refers to the problem that the model performs poorly on the training data and cannot fit the data well. Underfitting is usually caused by insufficient model complexity or insufficient training data. Underfitting occurs when a model is too simple to capture complex relationships in the data, causing the model to perform poorly on both training data and new data.

To address overfitting and underfitting, common methods include increasing the amount of training data, adjusting model complexity (for example, increasing or decreasing the number of model parameters), using regularization techniques (such as L1 or L2 regularization) to control the complexity of the model, and using techniques such as cross-validation to evaluate the model's generalization ability.
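As a minimal sketch (not from the original text) of the last two remedies, the snippet below uses scikit-learn: the C parameter of LogisticRegression controls the strength of L2 regularization, and cross_val_score estimates generalization ability via 5-fold cross-validation. The synthetic dataset is purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# L2 regularization: smaller C means a stronger penalty, i.e. a simpler model
# that is less prone to overfitting.
model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)

# 5-fold cross-validation gives a less optimistic estimate of generalization
# than the empirical (training) error.
print(cross_val_score(model, X, y, cv=5).mean())
```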

Confusion matrix

Before introducing these concepts, let us first introduce the confusion matrix. For a k-class classification problem, it is a k × k table used to record the classifier's prediction results. For the common binary classification case, the confusion matrix is 2 × 2.

Suppose you want to predict whether 15 people are sick or not, using 1 to represent sick and 0 to represent normal. The prediction results are as follows:

Predicted value: 1 1 1 1 1 0 0 0 0 0 1 1 1 0 1
True value: 0 1 1 0 1 1 0 0 1 0 1 0 1 0 0

Converting the above prediction results into a confusion matrix gives:

|              | Predicted 1 | Predicted 0 |
|--------------|-------------|-------------|
| True value 1 | 5           | 2           |
| True value 0 | 4           | 4           |

From this confusion matrix for binary classification, the following information can be obtained:

  • There are 5 + 2 + 4 + 4 = 15 samples in total.

  • There are 5 samples with a true value of 1 and a predicted value of 1; 2 samples with a true value of 1 and a predicted value of 0; 4 samples with a true value of 0 and a predicted value of 1; and 4 samples with a true value of 0 and a predicted value of 0.

In a binary classification problem we obtain True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). These four values correspond to the four cells of the binary-classification confusion matrix.

Tips: the above four concepts are often confused (perhaps that is where the name "confusion matrix" comes from?), so here is a small mnemonic. In medicine, positive generally means diseased and negative means normal, so whenever the keyword "positive" appears, the result in question is "diseased". Positives are further divided into true positives and false positives, and the names say it all: a true positive is a genuine positive, i.e. the sample is actually positive (diseased) and is also predicted positive (diseased); a false positive is a spurious positive, i.e. the sample is actually negative (normal) but is predicted positive (diseased). True negatives and false negatives can be understood in the same simple way.

Obviously, for this example TP = 5, FP = 4, FN = 2, TN = 4.
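As a quick cross-check (a sketch, not part of the original example), the same counts can be reproduced with scikit-learn's confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

# labels=[1, 0] lists the positive class (1, "sick") first, so the matrix reads
# [[TP, FN],
#  [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[5 2]
#  [4 4]]  ->  TP = 5, FN = 2, FP = 4, TN = 4
```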

1. Precision P, recall R, F1 value

  • Precision: P = TP / (TP + FP). In plain terms, it is the proportion of the samples predicted as positive that are actually positive.
  • Recall: R = TP / (TP + FN). In plain terms, it is the proportion of the actually positive samples that are correctly predicted as positive.
  • F1 value (F score): the harmonic mean of precision and recall, F1 = 2PR / (P + R) (a worked example follows this list).

  • The value of F1 is affected by P and R simultaneously, so pursuing the improvement of only one of them at the expense of the other does not help much. In real projects, striking this balance under the actual positive/negative sample ratio is genuinely challenging.
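Here is the worked example referred to above: a small sketch (assuming the 15-sample data from the confusion matrix section) that computes P, R and F1 both from the formulas and with scikit-learn.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

TP, FP, FN = 5, 4, 2
P = TP / (TP + FP)          # 5/9  ~ 0.556
R = TP / (TP + FN)          # 5/7  ~ 0.714
F1 = 2 * P * R / (P + R)    # 0.625

print(P, R, F1)
# The library functions give the same values.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```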

What is AUC

According to Wikipedia, AUC (Area Under the Curve of ROC) is the area under the ROC curve and is a standard measure of the quality of a binary classification model. The ROC (receiver operating characteristic) curve was invented by electrical and radar engineers during World War II for detecting enemy objects (aircraft, ships) on the battlefield, and it belongs to signal detection theory. The abscissa of the ROC curve is the False Positive Rate (FPR) and the ordinate is the True Positive Rate (TPR); correspondingly, there are also the True Negative Rate and the False Negative Rate. AUC is used to measure the performance (generalization ability) of a machine learning algorithm on binary classification problems.

We know that ACC (accuracy) is often used to judge how well a classifier performs. Since we already have ACC, why do we also need ROC? One important reason is that real sample data sets are often skewed: either the negative samples far outnumber the positive samples, or the positive samples far outnumber the negative samples. The AUC value is used as the evaluation criterion because, in many cases, the ROC curve alone cannot clearly show which classifier performs better, whereas AUC is a single number: the classifier with the larger AUC performs better, and a number is easier to compare.

First, explain several concepts commonly used in binary classification problems: True Positive, False Positive, True Negative, False Negative

They are distinguished based on the combination of true and predicted categories.

Suppose there is a batch of test samples with only two categories, positive examples and negative examples. One can picture the samples arranged in a square: the machine learning algorithm predicts a category for each sample (the left half is predicted positive, the right half predicted negative), while the samples that are actually positive sit in the upper half and those that are actually negative in the lower half.
The total number of actual positive samples is TP + FN, so the True Positive Rate is TPR = TP / (TP + FN).
In the same way, the total number of actual negative samples is FP + TN, so the False Positive Rate is FPR = FP / (TN + FP).
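For instance, with the 15-sample example above (TP = 5, FN = 2, FP = 4, TN = 4): TPR = 5 / (5 + 2) ≈ 0.71 and FPR = 4 / (4 + 4) = 0.5.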

| Actual \ Predicted | 1                            | 0                            | Total                      |
|--------------------|------------------------------|------------------------------|----------------------------|
| 1 (P)              | True Positive (TP)           | False Negative (FN)          | Actual Positive (TP + FN)  |
| 0 (N)              | False Positive (FP)          | True Negative (TN)           | Actual Negative (FP + TN)  |
| Total              | Predicted Positive (TP + FP) | Predicted Negative (FN + TN) | TP + FP + FN + TN          |

There is also a concept called the "truncation point" (also known as the cutoff point or threshold). After a machine learning algorithm has processed the test samples, it can output, for each sample, the predicted probability that it belongs to a certain class.

For example, suppose the probability that t1 belongs to class P is 0.3, and we consider any sample whose probability is below 0.5 to belong to class N; then t1 is classified as N. The 0.5 here is the "cutoff point".
To summarize, the three most important concepts for calculating ROC are TPR, FPR, and cutoff point.
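A tiny sketch of what applying a cutoff point looks like in code (the probabilities below are made up for illustration):

```python
import numpy as np

probs = np.array([0.3, 0.6, 0.8, 0.45])   # hypothetical predicted probabilities
labels = (probs >= 0.5).astype(int)       # 0.5 is the cutoff point
print(labels)                             # [0 1 1 0]
```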

Different values of the cutoff point give different TPR and FPR results. Plotting the (FPR, TPR) pairs obtained at the different cutoff values in a two-dimensional coordinate system produces the ROC curve.

Both the x-axis and the y-axis have the value range [0, 1]. Connecting the resulting (x, y) points yields the ROC curve. An example is shown below:

  • The ordinate is the true positive rate, TPR = TP / (TP + FN), where the denominator TP + FN is the total number of actual positives (a row total in the table above). Intuitive reading: among the samples that are actually 1, how many are guessed correctly.
  • The abscissa is the false positive rate, FPR = FP / (FP + TN), where the denominator FP + TN is the total number of actual negatives. Intuitive reading: among the samples that are actually 0, how many are wrongly guessed as 1.

[Figure 1: an example ROC curve]

The dashed line in the figure corresponds to random prediction. It is not difficult to see that, as FPR increases, the ROC curve starts from the origin (0, 0) and eventually reaches the point (1, 1). The AUC is the area under the curve (the region below and to the right of it). The figure below shows three different AUC values:
[Figure 2: ROC curves with three different AUC values]

  • AUC = 1, which is a perfect classifier. When using this prediction model, no matter what threshold is set, perfect predictions can be obtained. In most prediction situations, there is no perfect classifier.

  • 0.5 < AUC < 1, better than random guessing. This classifier (model) can have predictive value if the threshold is properly set.

  • AUC = 0.5, the same as random guessing (for example, tossing a coin); the model has no predictive value.

  • AUC < 0.5 is worse than random guessing; however, as long as you always take the opposite of its predictions, it becomes better than random guessing, so in practice AUC < 0.5 does not occur.

Anyone who does machine learning is familiar with AUC. It is an evaluation metric that measures the quality of a binary classification model and represents the probability that a positive example is ranked ahead of a negative example. Other evaluation metrics include precision, accuracy, and recall, but AUC is used more often than these three, because classification models generally output predictions in the form of probabilities. To compute accuracy, a threshold must be set manually to convert each probability into a class, and this threshold greatly affects the resulting accuracy.

Consider an extreme example: a binary classification problem with 10 samples in total, 9 positive examples and 1 negative example. If the model predicts every sample as positive, the accuracy is as high as 90%, yet this is not the result we hope for, especially when the score of that single negative example is the highest of all. The model's performance should then be judged extremely poor, but accuracy suggests the opposite. AUC, in contrast, describes the overall ranking behaviour of the model well: in this case the AUC of the model would be 0 (of course, an AUC below 50% can be fixed by inverting the predictions, but that is another story).
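A sketch of this extreme case (the scores are invented so that the single negative example receives the highest score):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])     # 9 positives, 1 negative
scores = np.array([0.60, 0.65, 0.70, 0.75, 0.80,
                   0.85, 0.90, 0.92, 0.95, 0.99])     # the negative example scores highest

y_pred = (scores >= 0.5).astype(int)                  # threshold 0.5 -> everything predicted positive
print(accuracy_score(y_true, y_pred))                 # 0.9 -- looks good, but is misleading
print(roc_auc_score(y_true, scores))                  # 0.0 -- every positive ranks below the negative
```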

ROC calculation example
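The code that produces the numbers below is not shown in the original; they can be reproduced with scikit-learn's roc_curve as in the following sketch (note that recent scikit-learn versions prepend one extra point to anchor the curve at (0, 0), so the printed arrays may contain one more element than those listed):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y = np.array([1, 1, 2, 2])                  # true labels; 2 is treated as the positive class
scores = np.array([0.1, 0.4, 0.35, 0.8])    # predicted probability of being positive

fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
print(fpr, tpr, thresholds)
print("AUC =", auc(fpr, tpr))               # 0.75 for this example

plt.plot(fpr, tpr, "b-", label="ROC curve")
plt.fill_between(fpr, tpr, alpha=0.2)       # shaded area = AUC
plt.plot([0, 1], [0, 1], "k--", label="random guess")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
plt.show()
```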


Through calculation, the results (FPR, TPR, Truncation Point) are

[ 0. 0.5 0.5 1. ]
[ 0.5 0.5 1. 1. ]
[0.8 0.4 0.35 0.1]

Plotting the FPR and TPR values from these results in a two-dimensional coordinate system gives the ROC curve below (the blue line). The area under the ROC curve is the AUC (the light-yellow shaded region).

[Figure: ROC curve (blue line) with the AUC shown as the shaded area]

Detailed calculation process

The data given in the above example is as follows:


y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])

Using this data, what is the process of calculating TPR and FPR?

1. Analyze data

y is a one-dimensional array holding the true class of each sample. Its values represent the classes (there are two classes, 1 and 2). We take 1 in y to represent a negative example and 2 a positive example, i.e. y can be rewritten as:

y_true = [0, 0, 1, 1]

scores holds the predicted probability that each sample is a positive example.

2. Sort data according to score

| Sample | Predicted probability of P (score) | True category |
|--------|------------------------------------|---------------|
| y[0]   | 0.1                                | N             |
| y[2]   | 0.35                               | P             |
| y[1]   | 0.4                                | N             |
| y[3]   | 0.8                                | P             |

3. Use each score value as the truncation point

Calculate TPR and FPR with the truncation point set to 0.1, 0.35, 0.4, and 0.8 in turn.

3.1 Truncation point is 0.1

This means that any sample with score >= 0.1 is predicted as a positive example.
Since all four scores are greater than or equal to 0.1, every sample is predicted as P.


scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]
y_pred = [1, 1, 1, 1]


TPR = TP / (TP + FN) = 2 / (2 + 0) = 1
FPR = FP / (TN + FP) = 2 / (0 + 2) = 1

3.2 Truncation point is 0.35

This means that any sample with score >= 0.35 is predicted as P.
Since 3 of the 4 scores are greater than or equal to 0.35, three samples are predicted as P (2 correctly, 1 incorrectly) and one sample is predicted as N (correctly).


scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]


TPR = TP / (TP + FN) = 2 / (2 + 0) = 1
FPR = FP / (TN + FP) = 1 / (1 + 1) = 0.5

3.3 Truncation point is 0.4

This means that any sample with score >= 0.4 is predicted as P.
Since 2 of the 4 scores are greater than or equal to 0.4, two samples are predicted as P (1 correctly, 1 incorrectly) and two samples are predicted as N (1 correctly, 1 incorrectly).


scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 0, 1]


TPR = TP / (TP + FN) = 1 / (1 + 1) = 0.5
FPR = FP / (TN + FP) = 1 / (1 + 1) = 0.5

3.4 Truncation point is 0.8

This means that any sample with score >= 0.8 is predicted as P. Therefore only one sample is predicted as P (correctly), and three samples are predicted as N (2 correctly, 1 incorrectly).


scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 1]


TPR = TP / (TP + FN) = 1 / (1 + 1) = 0.5
FPR = FP / (TN + FP) = 0 / (2 + 0) = 0
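The four cases above can be reproduced in a short loop (a sketch that simply restates steps 3.1 to 3.4 in code):

```python
import numpy as np

scores = np.array([0.1, 0.4, 0.35, 0.8])
y_true = np.array([0, 0, 1, 1])                 # 1 = positive (P), 0 = negative (N)

for cutoff in [0.1, 0.35, 0.4, 0.8]:
    y_pred = (scores >= cutoff).astype(int)     # predict P when score >= cutoff
    TP = np.sum((y_pred == 1) & (y_true == 1))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    print(cutoff, "TPR =", TP / (TP + FN), "FPR =", FP / (TN + FP))
```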

The calculation of TPR and FPR can be summed up as follows, which makes it easier to remember:

  • TPR: among the samples that are actually positive, the proportion that are correctly predicted as positive
  • FPR: among the samples that are actually negative, the proportion that are wrongly predicted as positive

The most ideal classifier classifies every sample correctly, that is, FP = 0 and FN = 0. So for the ideal classifier, FPR = 0 and TPR = 1.

The first point, (0, 1), i.e. FPR = 0 and TPR = 1, means FN (false negatives) = 0 and FP (false positives) = 0. This is a perfect classifier: it classifies all samples correctly.

The second point, (1, 0), i.e. FPR = 1 and TPR = 0: a similar analysis shows that this is the worst classifier, because it successfully avoids every correct answer.

The third point, (0, 0), i.e. FPR = TPR = 0, means FP (false positives) = TP (true positives) = 0: the classifier predicts every sample as negative.

The fourth point, (1, 1), corresponds to a classifier that predicts every sample as positive. From the above analysis we can conclude that the closer the ROC curve is to the upper-left corner, the better the classifier's performance.
