Pedestrian and vehicle target detection in foggy weather based on YOLOv5

1. Project description

With the development of science and technology, image recognition and object detection have been widely applied in fields such as autonomous driving and intelligent transportation. However, under complex environmental conditions such as fog, existing detection methods suffer from reduced recognition rates. To address this, we propose a YOLOv5-based pedestrian and vehicle detection project for foggy weather, aiming to improve detection accuracy in complex environments. YOLOv5 is one of the most widely used object detection algorithms of recent years; it combines high speed with good accuracy and is suitable for real-time scenarios. Here we use the RTTS dataset as an example. The challenges of this project are:

  1. Complex targets

    • The environment is complex: the detector must adapt to daytime, cloudy, foggy, hazy and other conditions under various visibility levels;
    • The scenes are complex: urban roads, rural areas, highways and other scenes differ greatly;
  2. Unbalanced samples

    • There are many categories, including pedestrians, cyclists, cars, buses, motorcycles, and bicycles;
    • Each image contains multiple types of targets, with varying degrees of occlusion and truncation;

Figure 1 – RTTS dataset example

2. Environment description

This example is based on the YOLOv5 network and is trained on the RTTS dataset.

  • PaddlePaddle 2.2

  • OS: 64-bit operating system

  • Python 3(3.6/3.7/3.8/3.9), 64-bit version

  • pip/pip3 (9.0.1+), 64-bit version

  • CUDA >= 10.1

  • cuDNN >= 7.6
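
A typical environment setup matching the versions above might look like the following; this is a sketch, not part of the original project, and the paddlepaddle-gpu build should be adjusted to your CUDA version:

python -m pip install paddlepaddle-gpu==2.2.2
python -m pip install paddledet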

3. Data preparation

3.1 Data Introduction

RTTS: Pedestrian and vehicle target detection in foggy weather - Dataset, PaddlePaddle AI Studio Galaxy Community

The RTTS dataset is derived from the RESIDE dataset (RESIDE is a public dataset for hazy-image processing and related computer-vision research). RTTS contains 4322 real hazy images used as the training set, plus 100 real-scene images used as the validation set. The distribution of image counts is shown in the following table:

Dataset             train   val
Number of images    4322    100

Data preprocessing: clean and preprocess the collected image data, including image enhancement (such as contrast enhancement and image dehazing).
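
As an illustration of the contrast-enhancement step, here is a minimal sketch using OpenCV's CLAHE; the file paths are placeholders, and the project's actual preprocessing pipeline is not shown in the original:

import cv2

def enhance_contrast(src_path, dst_path):
    # Read the hazy image and convert to LAB so only lightness is adjusted
    img = cv2.imread(src_path)
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    # CLAHE boosts local contrast without over-amplifying noise
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = cv2.merge((clahe.apply(l), a, b))
    cv2.imwrite(dst_path, cv2.cvtColor(enhanced, cv2.COLOR_LAB2BGR))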

3.2 Data structure

The organization structure of the file is as follows (refer to COCO):
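
The original layout figure is not reproduced here; a typical COCO-style layout, consistent with the paths used later in this project (e.g. dataset/hazedet/val/HR/), would look like:

dataset/hazedet/
├── annotations/          # COCO-style JSON annotation files (assumed)
│   ├── train.json
│   └── val.json
├── train/                # 4322 hazy training images
└── val/
    └── HR/               # 100 validation images (path used by infer.sh below)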

The YOLO format annotation data files are as follows:
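
In the YOLO format, each image has a .txt file with one line per object: a class id followed by the normalized box center and size. The values below are illustrative:

# class_id x_center y_center width height   (all normalized to [0, 1])
0 0.512 0.634 0.082 0.210
2 0.318 0.701 0.150 0.134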

The VOC format annotation data files are as follows:
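
In the VOC format, each image has an XML file; a minimal example with illustrative values:

<annotation>
    <filename>59.png</filename>
    <size>
        <width>1280</width>
        <height>720</height>
        <depth>3</depth>
    </size>
    <object>
        <name>car</name>
        <difficult>0</difficult>
        <bndbox>
            <xmin>410</xmin>
            <ymin>356</ymin>
            <xmax>615</xmax>
            <ymax>480</ymax>
        </bndbox>
    </object>
</annotation>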

4. Model selection

Joseph Redmon et al. proposed the YOLO (You Only Look Once) algorithm in 2015, commonly known as YOLOv1; in 2016 they improved the algorithm and proposed YOLOv2; YOLOv3 followed in 2018; YOLOv5 was released in 2020 (by Ultralytics) and is currently the most widely used of these versions.

  1. YOLOv5 is a deep-learning-based object detection algorithm. Compared with traditional two-stage detectors, YOLOv5 uses a single-stage detection pipeline, which gives it a clear speed advantage [2]. In foggy conditions a detector typically has to process a continuous stream of frames in real time, so a faster algorithm improves detection efficiency.

  2. YOLOv5 divides the input image into a fixed-size grid and predicts object bounding boxes and categories for each grid cell. This grid-based design makes YOLOv5 comparatively strong at detecting small targets, so it can effectively detect pedestrian and vehicle targets in foggy environments (see the decoding sketch after this list).

  3. YOLOv5 predicts at multiple scales and fuses feature information across scales to improve detection accuracy. In fog, light attenuation and scattering can make targets appear blurred or noisy; multi-scale prediction strengthens the model's ability to distinguish such targets.
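
As a rough illustration of the grid-based prediction in point 2, the sketch below decodes one raw network output into an absolute bounding box. For clarity it uses the classic YOLOv2/v3-style decoding; YOLOv5's actual decoding differs in detail (e.g., scaled sigmoid offsets for width and height):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_cell(t, cell_x, cell_y, stride, anchor_w, anchor_h):
    """Decode one raw prediction (tx, ty, tw, th) from grid cell
    (cell_x, cell_y) into an absolute box (cx, cy, w, h) in pixels."""
    tx, ty, tw, th = t
    cx = (cell_x + sigmoid(tx)) * stride   # sigmoid keeps the center inside its cell
    cy = (cell_y + sigmoid(ty)) * stride
    w = anchor_w * np.exp(tw)              # width/height scale the anchor prior
    h = anchor_h * np.exp(th)
    return cx, cy, w, h

# Example: a 640x640 input with a 20x20 grid (stride 32)
print(decode_cell((0.2, -0.1, 0.3, 0.1), 5, 7, 32, 116, 90))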

In summary, as an algorithm for pedestrian and vehicle detection in foggy weather, YOLOv5 offers high speed, good performance on small targets, and the ability to exploit multi-scale information, so it can effectively detect pedestrian and vehicle targets in foggy environments.

5. Model training

The default configuration uses 8 GPUs. If you train on a single GPU in AI Studio, you need to modify train.sh as follows:

export CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch --gpus 0 tools/train.py -c configs-hazedet/ppyoloe/ppyoloe_crn_m_100e_hazedet.yml --eval


! bash train.sh

6. Model evaluation

Based on the PaddleDetection library, we provide several evaluation and prediction methods to choose from.

Model location: output/ppyoloe_crn_m_100e_hazedet
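
The contents of eval.sh are not shown in the original; assuming it wraps PaddleDetection's tools/eval.py with the config and weights used above, a minimal version would be:

export CUDA_VISIBLE_DEVICES=0
python tools/eval.py \
    -c configs-hazedet/ppyoloe/ppyoloe_crn_m_100e_hazedet.yml \
    -o weights=output/ppyoloe_crn_m_100e_hazedet/best_model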


!bash eval.sh

7. Model optimization

This section presents the ideas used to improve accuracy during model iteration:

  1. Baseline: the backbone loads CSPResNet_m parameters pre-trained on ImageNet; after 100 epochs of training, the evaluation results are:
 Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.260
 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.499
 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.237
 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.180
 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.319
 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.447
 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.208
 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.413
 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.428
 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.318
 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.550
 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.574
  2. COCO pre-trained model: load the COCO pre-trained ppyoloe_crn_m model and fine-tune it on the RTTS dataset. The final detection mAP improved by 14.7 percentage points (0.260 → 0.407):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.407
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.672
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.416
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.283
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.489
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.716
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.282
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.492
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.510
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.390
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.618
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.769
  3. Offline data augmentation: use a dehazing algorithm to augment the training set offline. Common dehazing models include MSBDN, Trident-Dehazing Network, and FFA-Net.

Here we chose the MSBDN model: the training set was dehazed offline and the enhanced images were trained together with the original training set. The goal was to enrich the training set with images of different fog densities, reduce the difficulty of recognizing dense-fog samples, and accelerate model convergence. The final detection mAP improved by a further 0.6 percentage points (0.407 → 0.413).
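The exact augmentation script is not included in the original; as a minimal sketch, the offline step might look like the following, where dehaze_fn stands in for an MSBDN forward pass (hypothetical placeholder) and each dehazed copy reuses its original annotations:

import os

def build_augmented_set(train_dir, dehazed_dir, merged_list_path, dehaze_fn):
    # Write a dehazed copy of every training image and record both the
    # original and the dehazed file in a merged training list.
    os.makedirs(dehazed_dir, exist_ok=True)
    with open(merged_list_path, "w") as f:
        for name in sorted(os.listdir(train_dir)):
            src = os.path.join(train_dir, name)
            dst = os.path.join(dehazed_dir, name)
            dehaze_fn(src, dst)       # hypothetical MSBDN inference call
            f.write(src + "\n")       # keep the original hazy sample
            f.write(dst + "\n")       # add the dehazed sample
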

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.413
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.672
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.415
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.277
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.519
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.726
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.282
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.499
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.517
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.387
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.641
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.780

8. Inference visualization

Refer to infer.sh; the final output files are written to the output directory.


! python tools/infer.py \
  -c configs-hazedet/ppyoloe/ppyoloe_crn_m_100e_hazedet.yml \
  --infer_img=dataset/hazedet/val/HR/59.png \
  -o weights=output/ppyoloe_crn_m_100e_hazedet/best_model


Figure 2 – Comparison before and after image detection (taken from the output directory)

By default, training runs for 100 epochs; the detailed results are as follows:

Label visualization:

F1-score curve:

PR curve:

Batch inference example:

An example of inference visualization is as follows:

9. Model export

Export inference model

The weight files saved by the PaddlePaddle framework come in two kinds: training models, which support both forward inference and backward gradients, and inference models, which support only forward inference. The difference is that the inference model is optimized for inference speed and GPU memory: tensors needed only during training are pruned to reduce memory usage, and optimizations such as layer fusion and kernel selection are applied. You can therefore execute the following command to export the inference model.

By default, it is exported to the inference_model directory.


# Model export
! bash export_model.sh
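
The contents of export_model.sh are not shown in the original; assuming it wraps PaddleDetection's tools/export_model.py with the config and weights used above, a minimal version would be:

python tools/export_model.py \
    -c configs-hazedet/ppyoloe/ppyoloe_crn_m_100e_hazedet.yml \
    -o weights=output/ppyoloe_crn_m_100e_hazedet/best_model \
    --output_dir=inference_model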

10. Model deployment

We use Paddle Inference, PaddlePaddle's native inference library, for server-side model deployment.

This generally takes three steps:

  1. Create a PaddlePredictor and set the path of the exported model
  2. Create input PaddleTensors and pass them to the PaddlePredictor
  3. Fetch the output PaddleTensors and extract the results

#include "paddle_inference_api.h"

// Create a config and set model-related options
paddle::NativeConfig config;
config.model_dir = "xxx";
config.use_gpu = false;
// Create a native PaddlePredictor
auto predictor =
    paddle::CreatePaddlePredictor<paddle::NativeConfig>(config);
// Create the input tensor
int64_t data[4] = {1, 2, 3, 4};
paddle::PaddleTensor tensor;
tensor.shape = std::vector<int>({4, 1});
tensor.data.Reset(data, sizeof(data));
tensor.dtype = paddle::PaddleDType::INT64;
std::vector<paddle::PaddleTensor> inputs = {tensor};
// Create the output tensors; their memory can be reused across runs
std::vector<paddle::PaddleTensor> outputs;
// Run prediction
CHECK(predictor->Run(inputs, &outputs));
// Read the results from outputs ...

For more details, see the C++ Prediction API Introduction.

Below, we take Paddle Inference's Python deployment as an example:

Use the deploy/python/infer.py script provided by PaddleDetection to run inference on images. In this project we use TensorRT FP16 for inference; the inference speed reaches 208 FPS on a single V100 GPU.

# Inference on a single image
CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=inference_model/ppyoloe_crn_m_100e_hazedet --image_file=dataset/hazedet/val/HR/0.png --device=gpu --run_mode=trt_fp16

# Inference on all images in the folder
CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=inference_model/ppyoloe_crn_m_100e_hazedet --image_dir=dataset/hazedet/val/ --device=gpu --run_mode=trt_fp16

# Inference on a single image
! python deploy/python/infer.py --model_dir=inference_model/ppyoloe_crn_m_100e_hazedet --image_file=dataset/hazedet/val/HR/0.png --device=gpu --run_mode=trt_fp16