ROC curve and PR curve

Table of Contents

  • Article overview
  • Introduction
  • ROC curve
  • PR curve
  • Features
  • Draw ROC curve and PR curve
    • Environment setup, data preparation, and model training
    • Functions for calculating Precision and Recall and for calculating TPR and FPR
    • Draw ROC curve and PR curve
  • Results obtained


Article Overview:

In this blog, we will introduce how to plot receiver operating characteristic (ROC) curves and precision-recall (PR) curves using Python and the Scikit-learn library. These curves are very important in machine learning tasks for evaluating classifier performance.

Introduction:

For a binary classification task (assume 1 represents the positive class and 0 represents the negative class), each sample falls into one of four possible outcomes:

  • Actually 1 and classified as 0: FN (False Negative)
  • Actually 1 and classified as 1: TP (True Positive)
  • Actually 0 and classified as 1: FP (False Positive)
  • Actually 0 and classified as 0: TN (True Negative)

It can be represented by a Confusion Matrix:
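With the actual class as rows and the predicted class as columns, the four outcomes listed above are arranged as follows:

              Predicted 1    Predicted 0
  Actual 1    TP             FN
  Actual 0    FP             TN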

ROC Curve:

ROC stands for Receiver Operating Characteristic. Its main analysis tool is a curve drawn on a two-dimensional plane: the ROC curve. The abscissa of this plane is the false positive rate (FPR), and the ordinate is the true positive rate (TPR).

TPR (True Positive Rate):

TPR = TP / (TP + FN) = TP / P

FPR (False Positive Rate):

FPR = FP / (FP + TN) = FP / N = 1 - TN / N

The larger the TPR, the larger the TP, which means that most of the positive examples in the test set are predicted correctly by the learner. The smaller the FPR, the smaller the FP and the larger the TN, which means that most of the negative examples in the test set are predicted correctly. Therefore, a good model should have a large TPR and a small FPR.
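As a quick numeric illustration (the counts are hypothetical, not taken from the data set used later): suppose a test set contains 100 positives and 100 negatives, and a classifier produces TP = 90, FN = 10, FP = 20, TN = 80. Then TPR = 90 / (90 + 10) = 0.9 and FPR = 20 / (20 + 80) = 0.2, i.e., a point toward the upper-left corner of the ROC plane.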

AUC (Area Under Curve):

AUC is defined as the area under the ROC curve. This area is clearly at most 1, and because the ROC curve generally lies above the line y = x, the AUC usually falls between 0.5 and 1. The AUC is used as an evaluation criterion because in many cases the ROC curves alone cannot clearly show which classifier performs better; as a single number, a larger AUC indicates better performance.
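In practice the AUC does not have to be computed by hand: scikit-learn's roc_auc_score returns it directly from the true labels and the predicted scores. A minimal sketch with made-up labels and scores (only to illustrate the call, not part of the example further below):

from sklearn.metrics import roc_auc_score

# made-up labels and scores, only to show the call signature
y_true_example = [0, 0, 1, 1]
y_score_example = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true_example, y_score_example))  # prints 0.75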

PR curve:

The P-R curve is the precision vs. recall curve, with recall as the abscissa axis and precision as the ordinate axis.

precision=TP/(TP + FP)

recall=TP/(TP + FN)

Features:
  1. The ROC curve is relatively stable on imbalanced data sets (for example, when the numbers of positive and negative samples differ greatly); its shape is not strongly affected by the class ratio.
  2. When the ratio of positive to negative samples is extremely imbalanced, the PR curve often gives a more informative view of model performance than the ROC curve (a small worked example follows below).
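A small worked example with hypothetical counts: suppose a test set has 10 positives and 990 negatives, and a classifier returns TP = 9, FN = 1, FP = 90, TN = 900. Then TPR = 0.9 and FPR = 90 / 990 ≈ 0.09, which looks excellent on the ROC plane, yet precision = 9 / (9 + 90) ≈ 0.09, a weakness that the PR curve exposes immediately.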

Draw ROC curve and PR curve

Environment setup, data preparation, and model training

The data set used here is the breast cancer data set bundled with the Python Scikit-learn library. We import the relevant libraries (datasets, SVC, train_test_split, pyplot, and numpy), then split the data into a training set and a test set with the train_test_split function, setting the test set proportion to 30% and the random seed to 0. We then train an SVC (Support Vector Classification) model.

from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

# Load breast cancer data set
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

# Divide training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# SVC model
clf = SVC(kernel='linear', probability=True, random_state=0)
clf.fit(X_train, y_train.ravel())

Functions for calculating Precision and Recall and for calculating TPR and FPR

Define the tpr_fpr function to calculate TPR and FPR, and the precision_recall function to calculate Precision and Recall.

  • tpr_fpr function: This function calculates and returns TPR and FPR:

    1. Iterate through each value in the prediction score (y_score), and if the value is greater than the set threshold (threshold), count true positives (tp) or false positives (fp) according to the true value (y_true). Otherwise, count true negatives (tn) or false negatives (fn) based on the true value.
    2. Use formulas to calculate the true positive rate (TPR) and false positive rate (FPR), and then return these two values. TPR is equal to the ratio of the number of true positives to all actual positives (true positives + false negatives), and FPR is equal to the ratio of the number of false positives to all actual negatives (false positives + true negatives).
  • precision_recall function: This function calculates and returns precision and recall:

    1. The basic operation is the same as the tpr_fpr function, which also traverses the prediction scores, determines and counts tp, fp, tn, and fn.
    2. Use formulas to calculate Precision and Recall, then return these two values. Precision is equal to the ratio of the number of true positives to all predicted positives (true positives + false positives), and recall is equal to the ratio of the number of true positives to all actual positives (true positives + false negatives).
# Function to calculate TPR and FPR
def tpr_fpr(y_true, y_score, threshold):
    tp, fp, tn, fn = 0, 0, 0, 0
    for i in range(len(y_score)):
        if y_score[i] > threshold:
            # predicted positive: true positive or false positive
            if y_true[i] == 1:
                tp += 1
            else:
                fp += 1
        else:
            # predicted negative: true negative or false negative
            if y_true[i] == 0:
                tn += 1
            else:
                fn += 1
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

# Function to calculate Precision and Recall
def precision_recall(y_true, y_score, threshold):
    tp, fp, tn, fn = 0, 0, 0, 0
    for i in range(len(y_score)):
        if y_score[i] > threshold:
            if y_true[i] == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y_true[i] == 0:
                tn += 1
            else:
                fn += 1
    # when nothing is predicted positive, define precision as 1.0
    if tp + fp > 0:
        precision = tp / (tp + fp)
    else:
        precision = 1.0
    # when there are no actual positives, define recall as 1.0
    if tp + fn > 0:
        recall = tp / (tp + fn)
    else:
        recall = 1.0
    return precision, recall

Draw ROC curve and PR curve

Using the two functions defined above for calculating TPR/FPR and Precision/Recall, we use NumPy’s linspace function to generate a series of thresholds spanning the decision scores from their minimum to their maximum, for drawing the ROC and PR curves.

After that, we use these thresholds to calculate the TPR and FPR values corresponding to each threshold and draw the ROC curve. Similarly, we also calculate the Precision and Recall values corresponding to each threshold to draw the PR curve.

  1. ROC Curve:

    • We first create 100 thresholds ranging from the minimum to the maximum value in y_score.
    • Using the tpr_fpr function and these thresholds, we calculate the corresponding TPR and FPR values. The raw output is a list of tuples, which we convert into a NumPy array so that we can subsequently retrieve values by column.
    • Use plt.figure() to create a new figure.
    • Using the plt.plot() function, we draw the ROC curve with FPR as the x-axis and TPR as the y-axis. roc[:, 1] is the second column of all rows (i.e., FPR), and roc[:, 0] is the first column of all rows (i.e., TPR).
    • A diagonal line is drawn to represent a random classifier; it serves as the baseline. The ideal ROC curve should be as far from this line as possible (the closer to the upper-left corner, the better).
    • Set the x-axis and y-axis labels and the title, and use plt.show() to display the figure.
  2. PR Curve:

    • Similar to the ROC curve, we use the precision_recall function and the same thresholds to calculate the precision and recall values, and then convert them into a NumPy array.
    • Use plt.figure() to create a new figure.
    • Using the plt.plot() function, we draw the PR curve with Recall as the x-axis and Precision as the y-axis.
    • The x-axis and y-axis labels and the title are also set, and the figure is displayed using plt.show().
# Calculate the score of the decision function
y_score = clf.decision_function(X_test)

# Calculate and draw ROC curve
thresholds = np.linspace(min(y_score), max(y_score), 100)
roc = np.array([tpr_fpr(y_test, y_score, thres) for thres in thresholds])
plt.figure()
plt.plot(roc[:, 1], roc[:, 0])
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()

# Calculate and draw the PR curve
pr = np.array([precision_recall(y_test, y_score, thres) for thres in thresholds])
plt.figure()
plt.plot(pr[:, 1], pr[:, 0])
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve')
plt.show()
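As an optional sanity check (not part of the original walkthrough), the manually computed curves can be compared against scikit-learn's built-in helpers, which derive the same quantities directly from the true labels and decision scores:

from sklearn.metrics import roc_curve, precision_recall_curve, auc

# built-in ROC curve: returns FPR, TPR and the thresholds it used
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)
print("AUC:", auc(fpr, tpr))

# built-in PR curve: returns precision, recall and thresholds
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_score)

# overlay the built-in ROC curve and the manual one for comparison
plt.figure()
plt.plot(fpr, tpr, label='sklearn roc_curve')
plt.plot(roc[:, 1], roc[:, 0], '--', label='manual thresholds')
plt.legend()
plt.show()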

Results obtained:

Running the script displays the ROC curve (with the diagonal reference line) and the PR curve for the SVC model on the breast cancer test set.
