Pruning + distillation enables lightweight processing of YOLOv5. Based on three models of different parameter scales (n/s/m), a tea bud detection model is developed and constructed to explore and analyze the impact of different pruning levels on model performance.

In my previous blog posts, there are relatively few records about model lightweighting; most of them focus on project development. In fact, many earlier projects have used lightweighting techniques such as pruning and distillation, but I have never written them up separately. Since I have some free time this week, I wanted to briefly sort out this part of the content. To keep things logical and organized, I take the classic YOLOv5 network as an example and select the three parameter scales used most frequently in daily projects. Using the tea bud scene as the dataset, I develop and construct a target detection model, then apply pruning to lightweight the model and complete fine-tuning training. On top of that, distillation training based on the original model can be used to try to recover the original accuracy.

Yesterday, I recorded in detail the complete practice process, from native model development to pruning and fine-tuning training. If you are interested, you can read the posts below:

“Develop and build a tea bud detection and recognition model based on YOLOv5n/s/m models with different parameter levels, use pruning technology to lightweight the model, and explore the impact of model performance under different pruning levels”

“Developing and constructing a tea bud detection and recognition model based on YOLOv5n/s/m models with different parameter levels, using pruning technology to lightweight the model, and exploring the impact of model performance under different pruning levels [continued]”

The main purpose of this article is to use distillation training to improve the accuracy of the pruned model after the pruning process is complete. It is a direct continuation of the two posts above, so if anything is unclear, please refer back to them; I will not repeat those details here.

Knowledge Distillation is a technique for model compression and transfer learning that aims to transfer the knowledge of a large, complex model to a smaller, simpler model. Its basic idea is to use a trained “teacher model” to guide the training of a “student model”.

In knowledge distillation, there are usually two models involved in the training process:

  1. Teacher Model: Usually a complex model with high accuracy. It can be a deep neural network or other models.
  2. Student Model: Usually a smaller, simpler model with fewer parameters, which is expected to achieve performance similar to, or even better than, the teacher model.

The training process of knowledge distillation is as follows:

  1. Use the labeled training data to train the teacher model to achieve a high accuracy rate.
  2. During training, the output of the teacher model (usually its soft targets, i.e., the class probability distribution) is used as an auxiliary label for the student model.
  3. The student model is trained on the labeled training data so that it fits the teacher model’s soft targets as closely as possible (see the loss sketch after this list).
  4. Ultimately, the student model can independently make predictions based on its own parameters and inputs.
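
As a minimal sketch of step 3, here is the classic soft-target (Hinton-style) distillation loss in PyTorch. The temperature T and blending weight alpha are illustrative hyperparameters, not values taken from this project; for a detector like YOLOv5, the same idea is applied to the detection head’s objectness/class/box outputs rather than plain classification logits.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label loss against the ground-truth annotations
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft-target loss: match the teacher's temperature-softened distribution;
    # the T*T factor keeps gradient magnitudes comparable across temperatures
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Blend the two terms; alpha controls how much the student trusts the teacher
    return alpha * soft_loss + (1.0 - alpha) * hard_loss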

The purpose of knowledge distillation is to transfer the teacher model’s knowledge so as to provide the student model with more information and guidance, allowing the student to maintain high performance with a smaller model size and lower computational complexity. Such a compressed model can be deployed on resource-constrained devices and typically has lower inference latency and memory consumption.

In addition to model compression, knowledge distillation can also be used for transfer learning, using the knowledge of the teacher model to help the student model learn and generalize faster when the training data for the target task is limited.

A fairly basic approach here is to use the weights obtained by training the original official model structure as the teacher network. In the previous articles, I compared the model performance under different pruning levels; here, the weights from those different pruning levels serve as the student networks. It is of course possible to fine-tune a pruned model directly without distillation, but the purpose here is to explore how much distillation can enhance the performance of a pruned, lightweight model.

Creating and initializing the two networks is also very convenient; a rough sketch is given below.
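
A minimal loading sketch, assuming the pruned checkpoint follows the same {"model": ...} layout as a standard YOLOv5 checkpoint; the paths mirror the training commands that follow, and the actual distillation script may load the weights differently.

import torch

# Teacher: weights from training the original, unpruned YOLOv5 structure
# (assumes the YOLOv5 repo is on PYTHONPATH so the pickled model class resolves)
teacher = torch.load("runs/train/yolov5n/weights/best.pt", map_location="cpu")["model"].float().eval()

# Student: weights produced by layer pruning at the 30% ratio
student = torch.load("weights/yolov5n_layer_pruning_0.3.pt", map_location="cpu")["model"].float().train()

# Freeze the teacher; only the student is updated during distillation training
for p in teacher.parameters():
    p.requires_grad = False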

First is the training of the corresponding models at a pruning ratio of 30%, as shown below:

#yolov5n
python train.py --teacher runs/train/yolov5n/weights/best.pt --student weights/yolov5n_layer_pruning_0.3.pt --data data/self.yaml --batch-size 4 --img-size 416 --name yolov5n_pruning_0.30_distillation

#yolov5s
python train.py --teacher runs/train/yolov5s/weights/best.pt --student weights/yolov5s_layer_pruning_0.3.pt --data data/self.yaml --batch-size 4 --img-size 416 --name yolov5s_pruning_0.30_distillation

#yolov5m
python train.py --teacher runs/train/yolov5m/weights/best.pt --student weights/yolov5m_layer_pruning_0.3.pt --data data/self.yaml --batch-size 4 --img-size 416 --name yolov5m_pruning_0.30_distillation

Again, note that the training parameters are kept exactly the same as before.

After the training is completed, let’s look at the actual training effect:

【yolov5n_pruning_0.30_distillation】

【yolov5s_pruning_0.30_distillation】

【yolov5m_pruning_0.30_distillation】

The details of the comparative evaluation results of the three models are as follows:

【yolov5n_pruning_0.30_distillation】
Validating runs/train/yolov5n_pruning_0.30_distillation/weights/best.pt...
Fusing layers...
YOLOv5n summary: 157 layers, 1283792 parameters, 0 gradients, 3.0 GFLOPs
                 Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 5/5 [00:03<00:00, 1.29it/s]
                   all 40 100 0.396 0.447 0.391 0.117
Results saved to runs/train/yolov5n_pruning_0.30_distillation


【yolov5s_pruning_0.30_distillation】
Validating runs/train/yolov5s_pruning_0.30_distillation/weights/best.pt...
Fusing layers...
YOLOv5s summary: 166 layers, 5685216 parameters, 0 gradients
                 Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 5/5 [00:03<00:00, 1.59it/s]
                   all 40 100 0.653 0.49 0.57 0.207
Results saved to runs/train/yolov5s_pruning_0.30_distillation


【yolov5m_pruning_0.30_distillation】
Validating runs/train/yolov5m_pruning_0.30_distillation/weights/best.pt...
Fusing layers...
YOLOv5m summary: 212 layers, 15704926 parameters, 0 gradients, 35.5 GFLOPs
                 Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 5/5 [00:02<00:00, 1.70it/s]
                   all 40 100 0.416 0.33 0.322 0.114
Results saved to runs/train/yolov5m_pruning_0.30_distillation

To compare the performance differences between the models intuitively, a visual comparative analysis is given below.

【Precision Curve】
The precision curve is a visualization tool for evaluating the precision of a binary classification model at different thresholds. By plotting precision against the corresponding recall at each threshold, it helps us understand how the model performs as the threshold changes.
Precision is the ratio of samples correctly predicted as positive to all samples predicted as positive. Recall is the proportion of samples correctly predicted as positive among all samples that are actually positive.
The steps to draw the precision curve are as follows:

  1. Convert predicted probabilities to binary class labels using different thresholds. Usually, when the predicted probability exceeds the threshold, the sample is classified as positive; otherwise it is classified as negative.
  2. For each threshold, calculate the corresponding precision and recall.
  3. Plot the precision and recall at each threshold on the same graph to form the precision curve.

Based on the shape and trend of the precision curve, an appropriate threshold can be selected to meet the required performance.
By observing the precision curve, we can choose a threshold that balances precision and recall: higher precision means fewer false positives, while higher recall means fewer false negatives. Depending on specific business needs and cost trade-offs, a suitable operating point can be chosen on the curve.
Precision curves are often used together with recall curves to provide a more comprehensive analysis of classifier performance and to help evaluate and compare different models.


【Recall Curve】
The recall curve is a visualization tool for evaluating the recall of a binary classification model at different thresholds. By plotting recall against the corresponding precision at each threshold, it helps us understand how the model performs as the threshold changes.
Recall is the proportion of samples correctly predicted as positive among all samples that are actually positive; it is also called sensitivity or the true positive rate.
The steps to draw the recall curve are as follows:

  1. Convert predicted probabilities to binary class labels using different thresholds. Usually, when the predicted probability exceeds the threshold, the sample is classified as positive; otherwise it is classified as negative.
  2. For each threshold, calculate the corresponding recall and precision.
  3. Plot the recall and precision at each threshold on the same graph to form the recall curve.

Based on the shape and trend of the recall curve, an appropriate threshold can be selected to meet the desired performance.
By observing the recall curve, we can choose a threshold that balances recall and precision: higher recall means fewer false negatives, while higher precision means fewer false positives. Depending on specific business needs and cost trade-offs, a suitable operating point can be chosen on the curve.
Recall curves are often used together with precision curves to provide a more comprehensive analysis of classifier performance and to help evaluate and compare different models.


【F1 value curve】
The F1-score curve is a visualization tool for evaluating the performance of a binary classification model at different thresholds. By plotting the relationship between precision, recall, and the F1 score at each threshold, it helps us understand the model’s overall performance.
The F1 score is the harmonic mean of precision and recall, taking both metrics into account. The F1 curve helps us find a balance point between precision and recall in order to choose the best threshold.
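Concretely, the harmonic mean works out to F1 = 2 × Precision × Recall / (Precision + Recall).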
The steps to draw the F1 curve are as follows:

  1. Convert predicted probabilities to binary class labels using different thresholds. Usually, when the predicted probability exceeds the threshold, the sample is classified as positive; otherwise it is classified as negative.
  2. For each threshold, calculate the corresponding precision, recall, and F1 score.
  3. Plot the precision, recall, and F1 score at each threshold on the same graph to form the F1 curve.

Based on the shape and trend of the F1 curve, an appropriate threshold can be selected to meet the required performance.
F1 curves are often used together with receiver operating characteristic (ROC) curves to help evaluate and compare different models; together they provide a more comprehensive analysis of classifier performance, and appropriate models and threshold settings can be chosen for specific application scenarios.
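
To make the three curves above concrete, here is a small NumPy sketch that sweeps a threshold and computes precision, recall, and F1 at each step; the scores and labels are random stand-ins, not data from this project.

import numpy as np

def pr_f1_curves(scores, labels, num_thresholds=100):
    # Sweep a confidence threshold and compute precision/recall/F1 at each step
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    precision, recall, f1 = [], [], []
    for t in thresholds:
        pred_pos = scores >= t                  # classify by threshold
        tp = np.sum(pred_pos & (labels == 1))   # true positives
        fp = np.sum(pred_pos & (labels == 0))   # false positives
        fn = np.sum(~pred_pos & (labels == 1))  # false negatives
        p = tp / (tp + fp) if tp + fp > 0 else 1.0
        r = tp / (tp + fn) if tp + fn > 0 else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r > 0 else 0.0)
    return thresholds, np.array(precision), np.array(recall), np.array(f1)

# Stand-in data just to exercise the function
rng = np.random.default_rng(0)
scores, labels = rng.random(200), (rng.random(200) > 0.5).astype(int)
ths, p, r, f1 = pr_f1_curves(scores, labels)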


It can be seen that there is still a big gap between the pruning + distillation model and the original network model.

Next, we conduct an overall comparative analysis of the native model, the pruned model, and the pruned + distilled model, as follows:

【F1 value】

【Precision】

【Recall】

Next is the second set of experiments: a pruning ratio of 60%, followed by distillation training.

【yolov5n_pruning_0.60_distillation】


【yolov5s_pruning_0.60_distillation】

【yolov5m_pruning_0.60_distillation】

The details of the comparative evaluation results of the three models are as follows:

【yolov5n_pruning_0.60_distillation】
Validating runs/train/yolov5n_pruning_0.60_distillation/weights/best.pt...
Fusing layers...
YOLOv5n summary: 157 layers, 932592 parameters, 0 gradients, 2.1 GFLOPs
                 Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 5/5 [00:03<00:00, 1.36it/s]
                   all 40 100 0.214 0.15 0.108 0.0302
Results saved to runs/train/yolov5n_pruning_0.60_distillation


【yolov5s_pruning_0.60_distillation】
Validating runs/train/yolov5s_pruning_0.60_distillation/weights/best.pt...
Fusing layers...
YOLOv5s summary: 166 layers, 4637807 parameters, 0 gradients
                 Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 5/5 [00:02<00:00, 1.78it/s]
                   all 40 100 0.0492 0.21 0.0219 0.00481
Results saved to runs/train/yolov5s_pruning_0.60_distillation


【yolov5m_pruning_0.60_distillation】
Validating runs/train/yolov5m_pruning_0.60_distillation/weights/best.pt...
Fusing layers...
YOLOv5m summary: 212 layers, 11711883 parameters, 0 gradients, 25.4 GFLOPs
                 Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s]
                   all 40 100 0.0879 0.07 0.0455 0.0152
Results saved to runs/train/yolov5m_pruning_0.60_distillation 

To compare the performance differences between the models intuitively, a visual comparative analysis is given below.

【F1】

【Precision】

【Recall】

The overall comparison results are as follows:
【F1】

【Precision】

【Recall】

Intuitively, the effect of distillation training here looks quite poor.

Finally, let’s look at the results at a pruning ratio of 90%.

【yolov5n_pruning_0.90_distillation】

【yolov5s_pruning_0.90_distillation】

【yolov5m_pruning_0.90_distillation】

The result details are as follows:

【yolov5n_pruning_0.90_distillation】
Validating runs/train/yolov5n_pruning_0.90_distillation/weights/best.pt...
Fusing layers...
YOLOv5n summary: 157 layers, 710530 parameters, 0 gradients, 1.4 GFLOPs
                 Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 5/5 [00:01<00:00, 2.84it/s]
                   all 40 100 0.00442 0.53 0.00368 0.000863
Results saved to runs/train/yolov5n_pruning_0.90_distillation


【yolov5s_pruning_0.90_distillation】
Validating runs/train/yolov5s_pruning_0.90_distillation/weights/best.pt...
Fusing layers...
YOLOv5s summary: 166 layers, 3920903 parameters, 0 gradients
                 Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 5/5 [00:01<00:00, 3.38it/s]
                   all 40 100 0.00525 0.63 0.00443 0.00111
Results saved to runs/train/yolov5s_pruning_0.90_distillation


【yolov5m_pruning_0.90_distillation】
Validating runs/train/yolov5m_pruning_0.90_distillation2/weights/best.pt...
Fusing layers...
YOLOv5m summary: 212 layers, 8908815 parameters, 0 gradients, 17.7 GFLOPs
                 Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 5/5 [00:01<00:00, 3.58it/s]
                   all 40 100 0.00133 0.16 0.000855 0.000268
Results saved to runs/train/yolov5m_pruning_0.90_distillation

The results here are even worse.

Not every method works out of the box; the results depend heavily on hyperparameter tuning, which I will sort out and study later. Today’s focus is mainly on practicing the overall process.

Finally, let’s compare the results of distillation training under three different pruning levels:

【F1 value】

【Precision】

【Recall】

The intuitive comparison shows that as the pruning ratio increases, performance degrades further, and distillation does not bring an effective improvement.

As far as the technique itself is concerned, there are many approaches to knowledge distillation, and different scenarios and models may call for different adaptation schemes. In actual projects, case-by-case analysis and experimentation are needed to find the most effective method.