YOLOv8 Performance Evaluation Indicators: mAP, Precision, Recall, FPS, IoU

Before I start, I would like to recommend my column. It covers classification, detection, segmentation, tracking, and key point detection, and is currently available at a limited-time discount. Everyone is welcome to subscribe: the column is updated with 3-5 of the latest mechanisms every week, and subscribers also receive the files for all of my improvements and access to the discussion group.

Column directory: YOLOv8 improved and effective series directory | including convolution, backbone, detection head, attention mechanism, Neck and hundreds of innovative mechanisms

Column Review: YOLOv8 Improvement Series Column – This column continues to review the content of various top conferences – essential for scientific research

1. Introduction

This blog explains the meaning of each image and indicator in the results folder generated when training YOLOv8, to help you understand them more deeply and to clarify which parameters matter most when evaluating a model or publishing a paper. The explanations use the results from one of my own training runs as a concrete example. If you have any questions while reading, ask them in the comment area and I will answer them. First, let's look at how many files are generated after one training run completes, as shown in the figure below; the rest of this article walks through this results folder.

2. Dataset for evaluation

The training results above come from a dataset for detecting aircraft, which contains only one label: aircraft. A single-label dataset like this can effectively be treated as a binary classification task.

One case -> detected as an aircraft, another case -> not an aircraft.

3. Result analysis

The results folder contains 24 files in total. The last group of images shows sample detection results saved during training; they let us see which objects were detected and which were missed, and they are not evaluation indicators themselves.

Weights folder

Let's start with the first item, the weights folder. It contains two files, best.pt and last.pt: best.pt is the checkpoint that achieved the best validation performance during training, and last.pt is the checkpoint saved at the end of the final epoch.
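If you want to reuse these checkpoints, a minimal sketch is shown below. It assumes the default run directory runs/detect/train, a dataset file named aircraft.yaml, and a hypothetical test image; the Ultralytics metric attributes shown may vary slightly between versions.

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")     # load the best checkpoint
metrics = model.val(data="aircraft.yaml")             # re-run validation on our dataset
print(metrics.box.map50, metrics.box.map)             # mAP50 and mAP50-95

results = model.predict("some_image.jpg", conf=0.25)  # hypothetical test image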

args.yaml

The second file is args.yaml, which records the parameters used for this training run, including the ones we specified explicitly. Its contents are shown below.

ConfusionMatrix

The third file is the confusion matrix. You have probably heard the name: it is a table used to evaluate the performance of a classification model, summarizing the samples according to their actual category (ground truth) and the category predicted by the model.

For binary classification problems, the confusion matrix is usually a 2×2 matrix containing four elements: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

Let's analyze this picture; I have marked the meaning of each grid on it. An example will help make the confusion matrix concrete.

Assume that in our dataset an aircraft is labeled 0 and "not an aircraft" is labeled 1, and suppose the model makes 20 predictions in a certain batch during training. The real and predicted results are as follows.

True_Label    = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
Predict_Label = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

Here True_Label holds the real labels and Predict_Label holds the labels predicted by the model; in the analysis below, class 1 is treated as the positive class.

Comparing the two lists element by element gives the following tally:

  • 5 samples have both true label and predicted label 0 (True Negative).
  • 3 samples have true label 0 but predicted label 1 (False Positive).
  • 8 samples have true label 1 but predicted label 0 (False Negative).
  • 4 samples have both true label and predicted label 1 (True Positive).

Based on this tally, we can draw the confusion matrix of this batch of predictions. The confusion matrix in the final results file can then be understood as the aggregation of such counts over the whole validation run rather than a single batch.
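As a quick sanity check, here is a minimal Python sketch that tallies the four cells directly from the two lists above, treating class 1 as the positive class:

True_Label    = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
Predict_Label = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

pairs = list(zip(True_Label, Predict_Label))
TP = sum(1 for t, p in pairs if t == 1 and p == 1)   # predicted 1, actually 1
TN = sum(1 for t, p in pairs if t == 0 and p == 0)   # predicted 0, actually 0
FP = sum(1 for t, p in pairs if t == 0 and p == 1)   # predicted 1, actually 0
FN = sum(1 for t, p in pairs if t == 1 and p == 0)   # predicted 0, actually 1
print(TP, TN, FP, FN)                                # -> 4 5 3 8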

Confusion Matrix Normalized

This file is the normalized confusion matrix: each cell is divided by the actual number of samples in its category, turning the counts into proportions. This normalization makes it easy to visually compare classification accuracy across categories and to identify which categories the model handles better or worse.

We can see that each column is normalized: 0.9 + 0.1 = 1 and 1 + 0 = 1.

Calculate mAP, Precision, Recall

Before explaining the other plots, we need to compute a few important metrics that those plots are built on. The calculation continues to use the batch example analyzed above (TP = 4, TN = 5, FP = 3, FN = 8); a short code sketch reproducing these numbers follows the list.

  1. Precision: of the samples predicted as positive, how many are actually positive. Precision = TP / (TP + FP) = 4 / (4 + 3) ≈ 0.571

  2. Recall: of the samples that are actually positive, how many are correctly predicted as positive. Recall = TP / (TP + FN) = 4 / (4 + 8) ≈ 0.333

  3. F1-Score: a metric that balances precision and recall, F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.571 * 0.333) / (0.571 + 0.333) ≈ 0.421

  4. Accuracy: the proportion of all samples that the model predicts correctly, Accuracy = (TP + TN) / (TP + TN + FP + FN) = (4 + 5) / 20 = 0.45

  5. Average Precision (AP): the area under a class's precision-recall curve, obtained by sweeping the confidence threshold. Unlike the four values above, it cannot be read off a single confusion matrix.

  6. Mean Average Precision (mAP): the mean of the AP values over all classes. For a single-class dataset such as this one, mAP equals AP.
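The numbers above can be reproduced in a few lines of Python from the four counts:

TP, TN, FP, FN = 4, 5, 3, 8

precision = TP / (TP + FP)                                  # 4/7  ≈ 0.571
recall    = TP / (TP + FN)                                  # 4/12 ≈ 0.333
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.421
accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 9/20 = 0.45
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}  Acc={accuracy:.3f}")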

The main thing to explain here is how AP and mAP are calculated for multi-class problems. AP stands for Average Precision, and for one class it is the area under that class's precision-recall curve, AP = ∫₀¹ P(R) dR, where P(R) is precision as a function of recall.

mAP stands for Mean Average Precision and is calculated by computing the AP of every class and averaging them: mAP = (AP₁ + AP₂ + ... + AP_N) / N for N classes.
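The sketch below shows one common way to compute AP for a single class: sort the detections by confidence, accumulate TP/FP counts, build the precision-recall curve, and integrate the area under it. The detection list here is made up purely for illustration, and real implementations (for example COCO's 101-point interpolation) differ in how they interpolate the curve.

import numpy as np

def average_precision(confidences, is_tp, num_gt):
    # sort detections from most to least confident
    order = np.argsort(-np.asarray(confidences))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / num_gt                    # fraction of ground truths found so far
    precision = cum_tp / (cum_tp + cum_fp)      # fraction of kept detections that are correct

    # prepend the (recall=0, precision=1) point and integrate P(R) over recall
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([1.0], precision))
    return float(np.trapz(precision, recall))

# toy example: 5 detections of the "aircraft" class against 4 ground-truth boxes
ap = average_precision([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 0], num_gt=4)
print(f"AP ≈ {ap:.3f}")
# with several classes, mAP is simply the mean of the per-class APs:
# mAP = np.mean([ap_class_0, ap_class_1, ...])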

F1_Curve

The next file is F1_curve; the plot it contains is titled F1-Confidence Curve and shows how the F1 value changes under different confidence thresholds.

To read it, first look at the axes: the horizontal axis is the confidence threshold and the vertical axis is the F1-Score, which we explained earlier. So what is confidence?

Confidence -> during recognition the model is never 100% certain that an object belongs to a category; it assigns the object a probability. The confidence threshold is the cut-off we place on that probability. If the model says an object is an aircraft with probability 0.7 and the threshold we set is at or below 0.7, the object is output as an aircraft; if the threshold we set is above 0.7, it is not.

The F1-Confidence Curve simply plots how the F1-Score changes as this confidence threshold is gradually increased; a small sketch of the computation follows.
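Here is a minimal sketch with made-up detection confidences and match flags: for each candidate threshold we keep only the detections above it and recompute precision, recall, and F1.

import numpy as np

scores = np.array([0.95, 0.90, 0.75, 0.60, 0.55, 0.40, 0.30, 0.20])  # detection confidences
matched = np.array([1, 1, 1, 0, 1, 0, 1, 0])   # 1 = detection matches a real aircraft
num_gt = 5                                     # total ground-truth aircraft

for conf in np.arange(0.1, 1.0, 0.1):
    keep = scores >= conf                      # detections that survive this threshold
    tp = int(np.sum(matched[keep] == 1))
    fp = int(np.sum(matched[keep] == 0))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / num_gt
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    print(f"conf={conf:.1f}  F1={f1:.3f}")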

Labels

The labels image summarizes the category and bounding-box information of the labeled objects, where each object is represented by a rectangular bounding box and a class label. Reading the image counter-clockwise, it shows:

  1. Target category: the class of each labeled object, such as aircraft.
  2. Target position: where each object sits in the image, i.e. the coordinates of its bounding box.
  3. Target size: how large each object is, i.e. the area its bounding box covers.
  4. Other information: additional attributes such as the rotation angle of the target, where available.

labels_correlogram

labels_correlogram is a term from machine learning; it refers to a correlogram, a plot produced during object detection training that shows the correlations between label attributes.

Specifically, labels_correlogram is a matrix of colored plots showing how the labels of the training set data are distributed and correlated with one another (for example the positions and sizes of the boxes). It helps us understand the data the detector is trained on and, indirectly, the behavior the algorithm will exhibit during training.

By observing labels_correlogram we can see how the different categories and label attributes are distributed, and by comparing the correlograms of different runs or datasets we can assess the quality of the data feeding the algorithm.

In summary, labels_correlogram is a useful tool for understanding the training data, the behavior of the detector during training, and the quality of the dataset.

P_curve

This plot is read the same way as F1_curve, except that it shows the relationship between Precision and Confidence. As expected, as the confidence threshold increases, the precision of the detections that remain generally gets higher.

R_curve

This plot is also read the same way as F1_curve, except that it shows the relationship between Recall and Confidence. As the confidence threshold increases, the recall naturally gets lower, since more and more detections are filtered out.

PR_curve

It shows the relationship between the precision and recall of the model under different classification thresholds.

The closer the PR curve gets to the top-right corner of the plot, the better the model performs: it finds more of the positive samples (high recall) while keeping a high proportion of its positive predictions correct (high precision). A curve that sags toward the lower left indicates a weaker ability to identify positive samples.

A characteristic of the PR curve is that as the classification threshold changes, precision and recall change together. Typically, the curve sits high when the model can maintain both high precision and high recall at the same time; when the model trades one for the other, the curve moves toward low precision or low recall accordingly.

The PR curve can help us evaluate the performance of the model under different thresholds and choose the appropriate threshold to balance precision and recall. For model comparison or selection, we can make a quantitative assessment by comparing the area under the PR curve (called Average Precision, AP). The larger the AP value, the better the performance of the model.

Summary: the PR curve is a visual tool showing the relationship between a classifier's precision and recall. By drawing it we can evaluate and compare the model's performance under different classification thresholds, and the area under the curve, the Average Precision (AP), gives a single number that quantifies the quality of the model.
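For a plain binary classifier, scikit-learn can draw this curve and compute AP directly; a small sketch with made-up labels and scores:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]                         # ground-truth labels
y_score = [0.9, 0.4, 0.8, 0.65, 0.3, 0.7, 0.55, 0.85, 0.6, 0.2]  # model confidences

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)

plt.plot(recall, precision, label=f"AP = {ap:.3f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()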

results.csv

results.csv records values logged during training, including the losses, metrics, and learning rates for each epoch. There is nothing difficult to interpret here, but it is worth a look: the results plots described below are drawn from this file.
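If you prefer to plot your own curves instead of relying on the generated images, the file can be read with pandas. The column names below (e.g. "metrics/mAP50(B)") match recent Ultralytics versions, but check the header of your own file, since names and padding spaces have changed between releases.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("runs/detect/train/results.csv")
df.columns = df.columns.str.strip()            # older versions pad the header with spaces

plt.plot(df["epoch"], df["metrics/mAP50(B)"], label="mAP50")
plt.plot(df["epoch"], df["metrics/mAP50-95(B)"], label="mAP50-95")
plt.xlabel("epoch")
plt.legend()
plt.show()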

results

This image is the last of the generated result plots. It contains many small annotated sub-plots, including the various losses during training, but the four we mainly look at are mAP50, mAP50-95, metrics/precision, and metrics/recall.

  1. mAP50: mAP is short for mean Average Precision, the AP averaged over all classes; mAP50 is the mAP computed at an IoU threshold of 0.5.
  2. mAP50-95: a stricter metric that computes mAP at IoU thresholds from 0.5 to 0.95 and averages the results, giving a more complete picture of performance across IoU thresholds (see the sketch after this list).
  3. metrics/precision: the proportion of positive predictions that are correct. In object detection, a prediction counts as correct when its bounding box overlaps the ground-truth box sufficiently (IoU above the chosen threshold).
  4. metrics/recall: the proportion of real positive samples that the model finds. In object detection, a ground-truth object counts as recalled when some predicted bounding box overlaps it sufficiently.
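The averaging behind mAP50-95 is simple; the sketch below uses made-up AP values at each of the ten IoU thresholds purely to illustrate the step.

import numpy as np

iou_thresholds = np.arange(0.50, 1.00, 0.05)   # 0.50, 0.55, ..., 0.95
ap_per_threshold = [0.90, 0.88, 0.85, 0.82, 0.78, 0.72, 0.63, 0.51, 0.35, 0.18]  # made up

map50 = ap_per_threshold[0]                    # AP at IoU 0.50 only
map50_95 = float(np.mean(ap_per_threshold))    # average over all ten thresholds
print(f"mAP50={map50:.3f}  mAP50-95={map50_95:.3f}")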

Detection renderings

The remaining pictures are the detection result images; they simply show the model's predictions, so there is nothing further to explain here.

4. Other parameters

FPS and IoU are two more important indicators in object detection: FPS is the number of images processed per second, and IoU is the intersection-over-union ratio.

  1. FPS: short for Frames Per Second, the frame rate. It evaluates the processing speed of a model on given hardware, i.e. the number of images it can process per second. This indicator is critical for real-time detection, which is only possible when the processing speed is high enough.
  2. IoU: short for Intersection over Union. In object detection it measures the degree of overlap between a box produced by the model and the original labeled box: the larger the IoU, the more similar the two boxes are. Commonly, a target is considered detected when the IoU exceeds 0.5. This metric is usually used to evaluate the detection accuracy of a model on a specific dataset. A short sketch of both ideas follows this list.
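Below is a minimal sketch of both ideas: an IoU function for two boxes in (x1, y1, x2, y2) format, and a commented outline of how FPS is usually measured (the model and image list are assumed to exist and are omitted here).

def box_iou(box_a, box_b):
    # intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union = area A + area B - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

print(box_iou((50, 50, 150, 150), (100, 100, 200, 200)))   # ≈ 0.143

# Rough FPS measurement (outline only, `model` and `images` are assumed):
# import time
# start = time.time()
# for img in images:
#     model.predict(img, verbose=False)
# fps = len(images) / (time.time() - start)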

In the field of object detection, processing speed and accuracy are two important performance indicators. In practical applications, we need to balance these two indicators according to specific needs.

5. Summary

This blog ends here. If there is anything you do not understand, leave a message in the comment area and I will answer it when I see it. By considering these indicators together, you can evaluate a YOLOv8 model's performance on a detection task in terms of accuracy, recall, speed, and bounding-box quality, and then choose the model and parameter configuration best suited to your scenario.

Finally, I wish you all smooth study, successful scientific research, and many papers! !
