A Vision Research Career from Scratch (5): Model Training, Testing, Export, and Related Task Metrics

Section 4 was planned to cover image basics and useful traditional image algorithms, such as k-means and edge detection. That material will be used in a later project, so it will be integrated then.

The main goal of Section 5 is to teach newcomers how to train and test a visual deep learning model, and to explain the meaning and role of the various evaluation metrics.

1. Training and testing models

In a typical deep learning framework, the train file is the script that trains the model. To train a model we first need to feed in a dataset, so there is usually a file describing the dataset; we also need to define the model's specific structure, so there is usually a file describing the model architecture as well. Taking YOLOv8 as an example, the dataset description files are under the ultralytics/cfg/datasets path, where you can find description files for many datasets. Each dataset description file usually specifies the dataset path and its classes, and some description files will also download the dataset:

# Ultralytics YOLO, AGPL-3.0 license
# COCO8-seg dataset (first 8 images from COCO train2017) by Ultralytics
# Example usage: yolo train data=coco8-seg.yaml
# parent
# ├── ultralytics
# └── datasets
#     └── coco8-seg ← downloads here (1 MB)


# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, .. ]
path: ../datasets/coco8-seg # dataset root dir
train: images/train # train images (relative to 'path') 4 images
val: images/val # val images (relative to 'path') 4 images
test: # test images (optional)

# Classes
names:
  0: person
  1: bicycle
  2: car
  3: motorcycle
  4: airplane
  5: bus
  6: train
  7: truck
  8: boat
  9: traffic light
  10: fire hydrant
  11: stop sign
  12: parking meter
  13: bench
  14: bird
  15: cat
  16: dog
  17: horse
  18: sheep
  19: cow
  20: elephant
  21: bear
  22: zebra
  23: giraffe
  24: backpack
  25: umbrella
  26: handbag
  27: tie
  28: suitcase
  29: frisbee
  30: skis
  31: snowboard
  32: sports ball
  33: kite
  34: baseball bat
  35: baseball glove
  36: skateboard
  37: surfboard
  38: tennis racket
  39: bottle
  40: wine glass
  41: cup
  42: fork
  43: knife
  44: spoon
  45: bowl
  46: banana
  47: apple
  48: sandwich
  49: orange
  50: broccoli
  51: carrot
  52: hot dog
  53: pizza
  54: donut
  55: cake
  56: chair
  57: couch
  58: potted plant
  59: bed
  60: dining table
  61: toilet
  62: tv
  63: laptop
  64: mouse
  65: remote
  66: keyboard
  67: cell phone
  68: microwave
  69: oven
  70: toaster
  71: sink
  72: refrigerator
  73: book
  74: clock
  75: vase
  76: scissors
  77: teddy bear
  78: hair drier
  79: toothbrush


# Download script/URL (optional)
download: https://ultralytics.com/assets/coco8-seg.zip

Taking our self-built YOLO-format dataset as an example, the dataset is organized following the process described in the first article of this series:

A Vision Research Career from Scratch (1): Starting from the Dataset (Getting Started) – CSDN Blog

Then simply copy one of these files, create a new yaml, and replace the relevant paths and classes in it.

YOLOv5 has a ready-made train.py script in which we only need to replace the dataset path. YOLOv8 gives examples of model training in its documentation:

from ultralytics import YOLO

# Load a model
model = YOLO('yolov8n.yaml') # build a new model from YAML
model = YOLO('yolov8n.pt') # load a pretrained model (recommended for training)
model = YOLO('yolov8n.yaml').load('yolov8n.pt') # build from YAML and transfer weights

# Train the model
results = model.train(data='coco128.yaml', epochs=100, imgsz=640)

Of course, we can set many more training-related parameters; for details, see ultralytics/cfg/default.yaml.
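As a sketch of what that looks like in practice, here is a training call with a few extra arguments; the specific values (batch size, device index, run name) are illustrative only, and the full list of supported options is in ultralytics/cfg/default.yaml:

from ultralytics import YOLO

model = YOLO('yolov8n.pt')

# Illustrative values only; see ultralytics/cfg/default.yaml for all options
results = model.train(
    data='coco128.yaml',   # dataset description file
    epochs=100,
    imgsz=640,             # training image size
    batch=16,              # batch size
    device=0,              # GPU index, or 'cpu'
    workers=8,             # dataloader worker threads
    name='my_experiment',  # results are saved under runs/.../my_experiment
)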

During training, most frameworks periodically evaluate the model's performance by default. This process is called val (validation). What val measures is not whether the loss function has reached its optimum, but whether the task-specific evaluation metrics have.

Val is used in the same way as train: you only need to modify the paths in the relevant script. In principle the val process is simpler; there is no data augmentation, back-propagation, or gradient optimization. You only need to run the data through the model to get the results and then compute the relevant metrics.

The YOLOv8 val example is as follows:

from ultralytics import YOLO

# Load a model
model = YOLO('yolov8n.pt') # load an official model
model = YOLO('path/to/best.pt') # load a custom model

# Validate the model
metrics = model.val() # no arguments needed, dataset and settings remembered
metrics.box.map # map50-95
metrics.box.map50 # map50
metrics.box.map75 # map75
metrics.box.maps # a list containing mAP50-95 for each category

Models generally also have a predict or demo script, which runs the trained model on images so you can see the detection results and judge whether the model actually works; after all, looking only at the metrics can hide mistakes. The basic flow is to initialize the model, read an image, and feed the image into the model to get the prediction result. It is very simple. Taking YOLOv8 as an example, the YOLOv8 docs give many examples, including demos for images and videos, and the demo script usually also measures the inference speed of the PyTorch model.

Predict – Ultralytics YOLOv8 Docs

from ultralytics import YOLO

# Load a model
model = YOLO('yolov8n.pt') # pretrained YOLOv8n model

# Run batched inference on a list of images
results = model(['im1.jpg', 'im2.jpg']) # return a list of Results objects

# Process results list
for result in results:
    boxes = result.boxes # Boxes object for bbox outputs
    masks = result.masks # Masks object for segmentation masks outputs
    keypoints = result.keypoints # Keypoints object for pose outputs
    probs = result.probs # Probs object for classification outputs

In addition, there are benchmark and similar scripts to measure the model's inference speed, parameter count, and other information.
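As a rough sketch, recent versions of the ultralytics package expose a benchmark utility that exports the model to the supported formats and reports their size, accuracy, and speed; the import path and arguments below assume a recent release and may differ between versions:

from ultralytics.utils.benchmarks import benchmark

# Benchmark YOLOv8n: exports to the supported formats and reports size, mAP and speed
benchmark(model='yolov8n.pt', imgsz=640, device='cpu')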

2. Vision task metrics

In deep learning, three basic evaluation metrics are defined:

Precision: among the samples predicted as the positive class, what proportion actually is positive?
Recall: among all samples that truly belong to the positive class, what proportion is correctly identified as positive?
Accuracy: among all samples, what proportion is classified correctly?

Four categories of outcomes are also defined:

If an instance is a positive class and is predicted as positive, it is a true positive (TP).
If an instance is a negative class and is predicted as negative, it is a true negative (TN).
If an instance is a negative class and is predicted as positive, it is a false positive (FP).
If an instance is a positive class and is predicted as negative, it is a false negative (FN).


The more commonly used metrics for image classification tasks are accuracy and error rate.

Accuracy = (TP + TN) / (TP + FP + TN + FN): the overall judgment ability of the classifier, i.e. the proportion of correct predictions.

Error rate is the complement of accuracy and describes the proportion of samples the classifier misclassifies: error rate = (FP + FN) / (TP + FP + TN + FN). For any given instance, being classified correctly and incorrectly are mutually exclusive events, so accuracy = 1 − error rate.
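As a minimal sketch, these metrics can be computed directly from the TP/FP/TN/FN counts; the counts below are made-up numbers purely for illustration:

# Made-up counts, for illustration only
TP, FP, TN, FN = 80, 10, 95, 15

precision  = TP / (TP + FP)                   # 80 / 90  ≈ 0.889
recall     = TP / (TP + FN)                   # 80 / 95  ≈ 0.842
accuracy   = (TP + TN) / (TP + FP + TN + FN)  # 175 / 200 = 0.875
error_rate = 1 - accuracy                     # 0.125

print(precision, recall, accuracy, error_rate)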

Sometimes top-1 accuracy and top-5 accuracy are also mentioned, with the following meanings (a small sketch follows below):
top-1: the class with the highest probability in the final probability vector is taken as the prediction; the prediction is correct only if that class matches the true label, otherwise it is wrong.
top-5: the prediction is considered correct as long as the true label appears among the five classes with the highest predicted probabilities, otherwise it is wrong.
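A small sketch of how top-k accuracy can be computed from a matrix of predicted class probabilities; NumPy is assumed and the toy numbers are purely illustrative:

import numpy as np

def topk_accuracy(probs, labels, k=1):
    # fraction of samples whose true label is among the k highest-scoring classes
    topk = np.argsort(probs, axis=1)[:, -k:]
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

# toy example: 3 samples, 4 classes
probs = np.array([[0.10, 0.20, 0.60, 0.10],
                  [0.50, 0.25, 0.15, 0.10],
                  [0.05, 0.15, 0.30, 0.50]])
labels = np.array([2, 1, 3])
print(topk_accuracy(probs, labels, k=1))  # 2/3: the second sample's top-1 class is 0, not 1
print(topk_accuracy(probs, labels, k=2))  # 1.0: every true label is within the top 2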

Drawback of accuracy: although it is the most commonly used metric, accuracy cannot reasonably reflect a model's predictive ability when the classes are imbalanced. For example, if a test set contains 90% positive samples and 10% negative samples and the model predicts every sample as positive, the accuracy is 90%, yet the model has no ability to identify negative samples at all; the high accuracy does not reflect the model's real predictive ability.

The metrics for object detection are more complicated, because we must decide whether a predicted box counts as a positive or a negative sample. In image classification this is easy to understand: the correctly classified images are the TP and TN. In detection, besides whether a box detects a target at all, the position of the box also has to be correct, so the concept of IoU is introduced.

IoU (Intersection over Union) is the overlap ratio between the predicted box and the ground-truth box, i.e. the area of their intersection divided by the area of their union. From it we can judge whether the position of the predicted box is accurate.
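A minimal sketch of computing IoU for two axis-aligned boxes given in (x1, y1, x2, y2) format:

def iou(box_a, box_b):
    # coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143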

The most commonly used metric in object detection is mAP, which takes both precision and recall into account. Accuracy is rarely used in detection tasks because it does not reflect a detector's performance well: object detection has no true negatives (TN, an instance of the negative class predicted as negative). This is easy to understand; we do not draw boxes around the background or around things we are not interested in.

Set an IoU threshold, for example IoU = 0.5. If a predicted box has an IoU greater than 0.5 with a ground-truth box, the prediction is correct and counted as a true positive (TP); otherwise the prediction is wrong and counted as a false positive (FP). Ground-truth targets that are not detected by any box are false negatives (FN). In addition, several predicted boxes may cover the same positive target; in that case only the one with the largest IoU is counted as the TP and the others are counted as FP.

From this we can calculate the precision and recall of the model. Besides the IoU threshold there is also a confidence threshold, which reflects how credible a detection box is. Obviously, the higher the confidence threshold, the fewer wrong boxes there are and the higher the precision; conversely, some correct boxes may also be filtered out, lowering the recall. An appropriate confidence threshold is therefore needed to compute P and R reasonably. This is the role of the F1 score, also known as the balanced F-score, which is defined as the harmonic mean of precision and recall.

In addition, there are other F-scores that assign different weights to P and R. The F0.5 and F2 scores are also widely used in statistics: in the F2 score, recall is weighted more heavily than precision, while in the F0.5 score, precision is weighted more heavily than recall.
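As a sketch, the general F-beta score can be written directly from its definition; the precision and recall values below are illustrative:

def f_beta(p, r, beta=1.0):
    # beta = 1 gives the F1 score (harmonic mean of precision and recall)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = 0.8, 0.6                # illustrative precision and recall
print(f_beta(p, r))            # F1   ≈ 0.686
print(f_beta(p, r, beta=2.0))  # F2   ≈ 0.632, recall weighted more heavily
print(f_beta(p, r, beta=0.5))  # F0.5 = 0.750, precision weighted more heavily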

Another more comprehensive evaluation is to set a series of confidence thresholds, compute all the corresponding P and R values, and plot them to compute the mAP. Such a PR curve generally looks like the figure below. In YOLOv5, the confidence threshold takes 101 values from 0.01 to 1; the corresponding P and R values are computed and plotted, and the area under the curve is the AP. If the IoU threshold is set to 50%, the result is called AP@50, and the mean of the AP values over all categories is the mAP.
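A toy sketch of AP as the area under a precision-recall curve; the (recall, precision) pairs are made up here, whereas in practice they come from sweeping the confidence threshold as described above:

import numpy as np

recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.95, 0.9, 0.8, 0.6, 0.3])

ap = np.trapz(precision, recall)  # area under the PR curve ≈ 0.78
print(ap)
# AP@50 means this is computed at an IoU threshold of 0.5; averaging over all classes gives mAP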

The metrics for image segmentation are broadly similar to those for object detection; its task-specific metrics are as follows (a confusion-matrix code sketch follows the list):

PA: pixel accuracy
Corresponds to: Accuracy
Meaning: the proportion of pixels whose predicted class is correct, out of the total number of pixels
Confusion matrix calculation:
sum of diagonal elements / sum of all elements of the matrix
PA = (TP + TN) / (TP + TN + FP + FN)
CPA: class pixel accuracy
Corresponds to: Precision
Meaning: among the pixels predicted as class i, the proportion that actually belong to class i. In other words, of all the model's predictions for class i, some are right and some are wrong; CPA is the proportion of correct predictions among all predictions for that class.
Confusion matrix calculation:
Class 1: P1 = TP / (TP + FP)
Class 2: P2 = TN / (TN + FN)
Class 3: ...
MPA: mean class pixel accuracy
Meaning: compute the proportion of correctly classified pixels for each class (i.e. the CPA), then average over the classes.
Confusion matrix calculation:
pixel accuracy of each class: Pi (calculated as the diagonal value / total number of pixels in the corresponding column)
MPA = sum(Pi) / number of classes
IoU: Intersection over Union
Meaning: the ratio of the intersection to the union between the model's prediction for a given class and the ground truth
Confusion matrix calculation:
taking the IoU of the positive class (class 1) in a two-class problem as an example
intersection: TP; union: TP + FP + FN
IoU = TP / (TP + FP + FN)
MIoU: mean Intersection over Union
Meaning: compute the IoU between the prediction and the ground truth for each class, then average the results.
Confusion matrix calculation:
taking the MIoU of a two-class problem as an example
MIoU = (IoU of positive class p + IoU of negative class n) / 2 = [ TP / (TP + FP + FN) + TN / (TN + FN + FP) ] / 2
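As a sketch, all of these segmentation metrics can be read off a confusion matrix; the 3x3 matrix below is made up for illustration, with conf[i, j] counting pixels whose true class is i and predicted class is j (following the column-based convention used above for CPA):

import numpy as np

# conf[i, j]: number of pixels with true class i predicted as class j (made-up values)
conf = np.array([[50,  2,  3],
                 [ 4, 40,  1],
                 [ 2,  3, 45]])

diag = np.diag(conf)
pa   = diag.sum() / conf.sum()                              # pixel accuracy
cpa  = diag / conf.sum(axis=0)                              # per-class pixel accuracy
mpa  = cpa.mean()                                           # mean pixel accuracy
iou  = diag / (conf.sum(axis=0) + conf.sum(axis=1) - diag)  # per-class IoU
miou = iou.mean()                                           # mean IoU
print(pa, mpa, miou)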

Regarding the model's lightweight metrics: computational cost (FLOPs) and parameter count (Params).

The computational cost corresponds to the time complexity we discussed earlier, and the parameter count corresponds to the space complexity. That is, the computational cost determines how long the network takes to execute, and the parameter count determines how much video memory it occupies.

Computational cost: FLOPs stands for floating point operations, i.e. the number of floating-point operations, and is the standard measure of a network model's computational cost. (It should not be confused with FLOPS with a capital S, which means floating-point operations per second and measures hardware speed rather than model cost.)

Number of parameters: Params refers to the total number of parameters that need to be trained in the network model.

Taking a convolutional layer as an example, the parameter count is (kernel * kernel) * channel_input * channel_output, ignoring the bias term.

The computational cost is (kernel * kernel * map * map) * channel_input * channel_output, where map is the size of the output feature map.

Pooling layer has no parameters

For a fully connected layer, parameter count = computational cost = weight_in * weight_out (ignoring the bias).

import torch
import torchvision
from thop import profile  # thop computes FLOPs and parameter counts

print('==> Building model..')
model = torchvision.models.alexnet(pretrained=False)

input = torch.randn(1, 3, 224, 224)  # dummy input with the shape the model expects
flops, params = profile(model, (input,))
print('flops: %.2f M, params: %.2f M' % (flops / 1e6, params / 1e6))

Model inference speed is usually reported directly as the average time needed per inference, or as FPS (frames per second). Inference speed differs across hardware devices.
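A rough sketch of measuring average inference time and FPS for a PyTorch model; the tiny stand-in network and iteration counts are arbitrary, and on a GPU you would also call torch.cuda.synchronize() before reading the clock:

import time
import torch
import torch.nn as nn

# tiny stand-in model; replace with your own network
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1)).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(10):              # warm-up runs, excluded from timing
        model(x)
    n = 100
    start = time.time()
    for _ in range(n):
        model(x)
    avg = (time.time() - start) / n  # average seconds per inference

print('average inference time: %.2f ms, FPS: %.1f' % (avg * 1000, 1 / avg))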

3. Exporting the model

If you are just writing a thesis, the third article in this series plus the content above is basically enough. You can also combine the PyTorch model with PyQt or Gradio to build a visual interface, which is generally sufficient for an undergraduate or master's graduation project.

However, in industrial applications the efficiency of Python is too low, and all kinds of embedded devices are developed in C++. The task of model export is therefore to convert the PyTorch model into a form that can run from C++: just as PyTorch implements the computation of each network layer through Python-based APIs, there are corresponding APIs in C++.

The most basic export format is ONNX. Open Neural Network Exchange (ONNX) is an open format proposed by Microsoft and Facebook for representing deep learning models. "Open" means that ONNX defines a set of standard formats that are independent of environment and platform, to improve the interoperability of AI models. In other words, no matter which framework you train with (TensorFlow / PyTorch / OneFlow / Paddle), after training you can convert the model into the unified ONNX format for storage. Note that an ONNX file stores not only the weights of the neural network but also the model's structural information, the inputs and outputs of each layer, and other auxiliary information.
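For YOLOv8 specifically, the documentation describes an export interface on the model object; a minimal sketch is shown below (format strings for other targets, such as 'openvino', 'engine' for TensorRT, or 'coreml', are selected the same way):

from ultralytics import YOLO

# Export a (pre)trained YOLOv8 model to ONNX; writes yolov8n.onnx next to the weights
model = YOLO('yolov8n.pt')
model.export(format='onnx')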

onnx/onnx: Open standard for machine learning interoperability (github.com)

We can use Netron to view the structure of the exported ONNX model:

Netron

Using ONNX is also very simple: you can run it directly with the dnn module in OpenCV, with different post-processing for different vision tasks.
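A minimal sketch of running an exported ONNX model with OpenCV's dnn module; the model file name, 640x640 input size, and preprocessing are assumptions carried over from the export above, and the task-specific post-processing (decoding boxes, NMS, etc.) is omitted:

import cv2

# load the exported model and run one forward pass
net = cv2.dnn.readNetFromONNX('yolov8n.onnx')
img = cv2.imread('im1.jpg')
blob = cv2.dnn.blobFromImage(img, scalefactor=1 / 255.0, size=(640, 640), swapRB=True)
net.setInput(blob)
out = net.forward()   # raw output tensor; its shape and meaning depend on the model
print(out.shape)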

OpenVINO is a comprehensive tool suite launched by Intel (version 2023.0 has been released) for rapid deployment of applications and solutions, supporting more than 200 CNN network structures for computer vision and other tasks.

Intel® Distribution of OpenVINO™ Toolkit

TensorRT is NVIDIA's inference library for its own GPUs.

TensorRT SDK | NVIDIA Developer

Core ML is an inference library proposed by Apple.

Core ML – Simplified Chinese Documentation – Apple Developer

The idea is easy to grasp: if you are chasing speed, use whichever inference library matches the target device, since the library optimized by the corresponding vendor is bound to be the best and there is no need to compare; if you want simplicity, just use ONNX and you are done.

Of course, in industrial settings these devices may not be the best fit; ARM-based embedded boards, microcontrollers, and the like are often used instead. The corresponding chip manufacturers have also launched their own inference libraries, such as Rockchip's RKNN, which I personally recommend. If you are running experiments yourself, Rockchip chips are quite efficient and the tutorials are relatively detailed.

RKNN is the model format used by the Rockchip NPU platform, and the model file ends with the .rknn suffix. Rockchip provides a complete Python model-conversion tool to help users convert their own algorithm models into RKNN models, and it also provides C/C++ and Python API interfaces.

RKNN usage – Firefly Wiki (t-firefly.com)

There are also inference libraries for mobile phones and other chips. Android phones generally use ARM-based chips, and NCNN is recommended there. What, you're asking about your iPhone? Then just use Core ML.

Tencent/ncnn: ncnn is a high-performance neural network inference framework optimized for the mobile platform (github.com)

In addition, Huawei has MindSpore, and Alibaba has MNN: alibaba/MNN: MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba (github.com)
