PR curve and ROC curve

1. PR curve

1. What is the P-R curve:

P stands for precision and R stands for recall, so the P-R curve shows the relationship between precision and recall. By convention, recall is plotted on the x-axis and precision on the y-axis.

2. Function of P-R curve:

PR curves are often used in the field of information extraction and are a common evaluation tool in data mining. They are especially useful when the class distribution of the data set is imbalanced, where a PR curve can be used in place of a ROC curve.

3. How to calculate P value and R value:

Confusion matrix:

TP (true positives): the number of examples that are actually positive and are classified as positive by the classifier.

FP (false positives): the number of examples that are actually negative but are classified as positive by the classifier.

FN (false negatives): the number of examples that are actually positive but are classified as negative by the classifier.

TN (true negatives): the number of examples that are actually negative and are classified as negative by the classifier.

Calculation formula:

P (precision):

P=\frac{TP}{TP + FP}

R (recall):

R=\frac{TP}{TP + FN}
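As a small illustration of the two formulas, the sketch below counts TP, FP, FN and TN from a pair of label lists and then computes P and R. The labels are made up for the example, and returning 0.0 when a denominator is zero is an assumption of this sketch, not a universal convention.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 if a denominator is zero (assumption)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical labels, chosen only for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn)      # 3 1 1 3
print(precision, recall)   # 0.75 0.75
```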

4. How to understand the P-R curve (pictured):

5. How to compare curves:

① The closer the curve is to the upper-right corner, the better the performance (for example, the black curve in the picture above).

② When one curve is completely enclosed by another curve, the enclosing curve performs better (for example, orange encloses blue, so orange is better than blue).

③ If the curves cross (for example, the black and orange curves), compare them in one of two ways:

3.1: By the area under the curve: a larger area is better than a smaller area.

3.2: By the break-even point or the F measure: the break-even point is the point on the curve where precision and recall are equal, and F is calculated as F = 2 * P * R / (P + R). The larger the value, the better the performance.
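The F formula and the point where precision equals recall can be sketched on a handful of curve points. The (P, R) pairs below are hypothetical, and picking the point with the smallest |P - R| is just one simple way to approximate the balance point on a sampled curve.

```python
def f_measure(p, r):
    """F = 2 * P * R / (P + R); returns 0.0 when P + R is zero (assumption)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical (precision, recall) points sampled along one P-R curve.
curve = [(1.0, 0.2), (0.9, 0.5), (0.75, 0.75), (0.6, 0.9), (0.4, 1.0)]

# Point with the highest F value.
best_f_point = max(curve, key=lambda pr: f_measure(*pr))

# Point closest to precision == recall.
balance_point = min(curve, key=lambda pr: abs(pr[0] - pr[1]))

print(best_f_point, f_measure(*best_f_point))   # (0.75, 0.75) 0.75
print(balance_point)                            # (0.75, 0.75)
```

When precision and recall are equal, F takes that same value, which is why the two criteria coincide at this point in the example. On real curves the highest-F point and the balance point can differ.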

2. ROC curve

1. What is the ROC curve:

ROC stands for Receiver Operating Characteristic. The ROC curve is a graphical analysis tool drawn on a coordinate plane.

2. The role of the ROC curve:

In machine learning and data mining, the ROC curve is used to evaluate the generalization performance of a learner.

3. Calculation formula:

True positive rate (TPR):

TPR=\frac{TP}{TP + FN}

False positive rate (FPR):

FPR=\frac{FP}{TN + FP}
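A minimal sketch of the two rates, computed from confusion-matrix counts. The counts are hypothetical, and returning 0.0 for an empty denominator is an assumption made here for simplicity.

```python
def tpr_fpr(tp, fp, fn, tn):
    """Return (TPR, FPR): TPR = TP / (TP + FN), FPR = FP / (TN + FP)."""
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    fpr = fp / (tn + fp) if (tn + fp) > 0 else 0.0
    return tpr, fpr


# Hypothetical counts, chosen only for illustration.
tpr, fpr = tpr_fpr(tp=40, fp=10, fn=20, tn=130)
print(round(tpr, 4), round(fpr, 4))   # 0.6667 0.0714
```

Note that TPR is the same quantity as recall, while FPR depends only on the negative examples.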

How to understand the ROC curve (as shown in the picture):

The closer the ROC curve is to the point (0,1), that is, the upper-left corner where FPR = 0 and TPR = 1, the better the generalization performance of the model. When the curve is close to the diagonal line, the model's predictions are no better than random guessing.

4. How to compare curves:

1: When one curve is completely enclosed by another curve, the enclosing curve performs better.

2: The area under the ROC curve (AUC) can be used as an indicator to evaluate the performance of the model. For example, when the ROC curves of two models intersect, it is difficult to say which model is better. At this time, AUC can be used as a more reasonable criterion.
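To show where the AUC number comes from, the sketch below builds ROC points by lowering the decision threshold one distinct score at a time and then integrates with the trapezoidal rule. The labels and scores are hypothetical, and both helper functions are written for this example; in practice `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, starting at (0, 0), one point per distinct threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    for th in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x)))


# Hypothetical labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_points(y_true, scores)
auc = trapezoid_area(fpr, tpr)
print(round(auc, 4))   # 0.8889
```

The result equals 8/9, which is also the fraction of (positive, negative) pairs in which the positive example receives the higher score.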

3. Comparison of PR curve and ROC curve

1. Coordinate axis

The x-axis of the ROC curve is FPR and the y-axis is TPR.

The x-axis of the PR curve is Recall and the y-axis is Precision.

2. Application scenarios

In the case of balanced positive and negative samples, the ROC curve is a good performance measure.

In the case of imbalance between positive and negative samples, the PR curve is usually more informative.

3. Overall performance indicators

The ROC curve is summarized by AUC (area under the curve).

The PR curve is summarized by AP (average precision).
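As a rough sketch of what AP measures, the function below ranks examples by score and averages the precision at each rank where a positive example appears. This is equivalent to summing (R_n - R_{n-1}) * P_n over thresholds. It assumes no tied scores, and the data is hypothetical; `sklearn.metrics.average_precision_score` is the library routine for this metric.

```python
def average_precision(y_true, scores):
    """AP for binary labels (1 = positive), assuming all scores are distinct."""
    ranked = sorted(zip(scores, y_true), key=lambda sy: sy[0], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Recall rises by 1 / n_pos here; weight the current precision by that step.
            ap += (tp / rank) / n_pos
    return ap


# Hypothetical labels and scores, chosen only for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

ap = average_precision(y_true, scores)
print(round(ap, 4))   # 0.9167
```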

4. Others

The ROC curve takes into account both positive and negative examples, so it is suitable for evaluating the overall performance of the classifier. In contrast, the PR curve focuses entirely on positive examples.

5. Suppose there are multiple data sets with different class distributions. In the credit card fraud problem, for example, the proportion of positive and negative cases may differ from month to month. If you only want to compare classifier performance and eliminate the effect of the changing class distribution, the ROC curve is more suitable, because a change in class distribution can make the PR curve look better or worse, which makes models hard to compare. Conversely, if you want to test how different class distributions affect the performance of the classifier, the PR curve is more suitable.

6. If you want to evaluate the prediction of positive examples under the same class distribution, choose the PR curve.

In class imbalance problems, the ROC curve usually gives an overly optimistic estimate of performance, so the PR curve is usually the better choice.
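The effect of class distribution can be made concrete with a little arithmetic. The sketch below holds a classifier's TPR and FPR fixed (the values 0.8 and 0.1 are hypothetical) and changes only the number of positive and negative examples. The ROC operating point is identical in both cases, but precision is not.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by a fixed (TPR, FPR) operating point and given class sizes."""
    tp = tpr * n_pos   # expected true positives
    fp = fpr * n_neg   # expected false positives
    return tp / (tp + fp)


tpr, fpr = 0.8, 0.1   # hypothetical operating point, the same in both scenarios

balanced = precision_at(tpr, fpr, n_pos=1000, n_neg=1000)
skewed = precision_at(tpr, fpr, n_pos=100, n_neg=10000)

print(round(balanced, 4))   # 0.8889
print(round(skewed, 4))     # 0.0741
```

With 1 positive per 100 negatives, the same (FPR, TPR) point corresponds to a precision of about 7%, which is the kind of drop a ROC curve does not show and a PR curve does.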

7. Finally, depending on the specific application, you can find the best operating point on the curve, read off the corresponding precision, recall, F1 score and other indicators, and adjust the model's decision threshold to suit that application.
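One possible way to pick an operating point is to scan candidate thresholds and keep the one with the highest F1 score. The sketch below does this in plain Python on hypothetical data; using F1 as the selection criterion is an assumption of the example, and an application with different costs for false positives and false negatives may prefer another criterion.

```python
def best_f1_threshold(y_true, scores):
    """Return (best_f1, threshold); scores >= threshold are predicted positive."""
    best_f1, best_th = 0.0, None
    for th in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 0)
        fn = sum(1 for t, s in zip(y_true, scores) if s < th and t == 1)
        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        if f1 > best_f1:
            best_f1, best_th = f1, th
    return best_f1, best_th


# Hypothetical labels and scores, chosen only for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

best_f1, best_th = best_f1_threshold(y_true, scores)
print(round(best_f1, 4), best_th)   # 0.8571 0.6
```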

4. Code implementation of P-R curve and ROC curve drawing:

Import the required packages:

import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.linear_model import LogisticRegression

Load the data set and train the model. The data set used here is the breast cancer data set, a classic data set for binary classification tasks.

# Use the breast cancer data set provided by scikit-learn
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns the probability of each class.
# For a binary classification problem, its shape is always (n_samples, 2).
scores = model.predict_proba(X_test)

Drawing of P-R curve:

plt.figure("P-R Curve")
plt.title('Precision/Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')

# The inputs of precision_recall_curve are:
#   y_test: true binary labels
#   scores[:, -1]: estimated probability of the positive class (last column of predict_proba)
#   pos_label: the label of the positive class
# It computes precision and recall at different thresholds.
# This implementation is limited to binary classification tasks.
precision, recall, thresholds = precision_recall_curve(y_test, scores[:, -1], pos_label=1)
plt.plot(recall, precision)
plt.show()

The running results are as follows:

Drawing of ROC curve:

plt.figure("ROC Curve")
plt.title('TPR/FPR Curve')
plt.xlabel('FPR')
plt.ylabel('TPR')

# The inputs of roc_curve are:
#   y_test: true sample labels
#   scores[:, -1]: probability output by the model that the sample is a positive example
#   pos_label: the label treated as positive; in this example, label 1 is the positive class
fpr, tpr, thresholds = metrics.roc_curve(y_test, scores[:, -1], pos_label=1)

plt.plot(fpr, tpr)
plt.show()

The running results are as follows: