[AI Master Special Training Camp Phase III] Realizing Small Target Detection in Remote Sensing Scenarios Based on PP-YOLOE-SOD


PPYOLOE: Small target detection in remote sensing scenarios based on PP-YOLOE-SOD

1. Project Background

Object detection has been a long-standing problem in remote sensing imagery and computer vision. It is usually defined as identifying the location of the target object in the input image as well as identifying the object category. Automatic object detection has been widely used in many practical applications, such as hazard detection, environment monitoring, change detection, urban planning, etc.

Over the past few decades, object detection has been extensively studied and a large number of methods have been developed to detect man-made objects (such as vehicles, buildings, roads, and bridges) and natural objects (such as lakes, coasts, and forests). Existing object detection methods for remote sensing image datasets can be roughly divided into four categories: (1) template-matching-based methods, (2) knowledge-based methods, (3) object-based image analysis methods, and (4) machine-learning-based methods. Among them, machine-learning-based methods are robust in feature extraction and object classification, and many recent approaches build on them to make significant progress on this problem.

In the past few years, few-shot learning has been extensively studied in computer vision for scene classification, image segmentation, and object detection. In remote sensing images, however, both object sizes and spatial resolutions can vary widely, which makes the problem even more challenging when only a small number of annotated samples are available.

Small target detection has broad application value and research significance in video surveillance, autonomous driving, UAV aerial photography, remote sensing image detection, and similar fields. There are currently two main ways to define a small target:

1.1 Definition based on relative scale

The ratio of the width and height of the target bounding box to the width and height of the image is less than a certain value
The square root of the ratio of the target bounding box area to the image area is less than a certain value

1.2 Definition based on absolute scale

Objects smaller than 32×32 pixels, as in the MS-COCO dataset
Objects whose size falls in the range of 10 to 50 pixels, as in the DOTA and WIDER FACE datasets

Paddle proposes the following definition at the level of the whole dataset:
When the median of the ratio of the target bounding box width and height to the image width and height is less than 0.04, the dataset is considered a small target dataset.
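The dataset-level rule can be checked directly on a COCO-format annotation dictionary. The following is only a minimal sketch, not PaddleDetection's own tool: the function name `is_small_target_dataset` is hypothetical, the toy annotations are made up, and the per-box ratio is assumed to be the square root of box area over image area (the second relative-scale definition above), since the text does not say how width and height are combined.

```python
import math
import statistics


def is_small_target_dataset(coco, threshold=0.04):
    """Return (median_ratio, is_small) for a COCO-style annotation dict.

    Assumption: the per-box ratio is sqrt(box_area / image_area).
    """
    # Map image id -> (width, height) so each box can be normalised.
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    ratios = []
    for ann in coco["annotations"]:
        img_w, img_h = sizes[ann["image_id"]]
        _, _, box_w, box_h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        ratios.append(math.sqrt((box_w * box_h) / (img_w * img_h)))
    median_ratio = statistics.median(ratios)
    return median_ratio, median_ratio < threshold


# Toy annotations (made up for illustration only).
toy = {
    "images": [{"id": 1, "width": 1000, "height": 1000}],
    "annotations": [
        {"image_id": 1, "bbox": [10, 10, 20, 20]},
        {"image_id": 1, "bbox": [50, 50, 30, 30]},
        {"image_id": 1, "bbox": [100, 100, 400, 400]},
    ],
}
median_ratio, is_small = is_small_target_dataset(toy)
print(median_ratio, is_small)
```

Using the median rather than the mean keeps a few very large objects (such as the 400×400 box in the toy data) from hiding the fact that most boxes are small.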


At present, small target detection mainly has the following difficulties:

Small targets cover a small area and provide few effective features
Features of small targets are lost after downsampling, which makes bounding-box regression difficult and hinders model convergence
Small targets of the same class are densely packed, so the NMS (non-maximum suppression) operation filters out many correctly predicted bounding boxes
There are few datasets for small target detection
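The NMS difficulty can be reproduced with two boxes. The sketch below is a plain greedy NMS written for illustration, not PaddleDetection's `MultiClassNMS`; the box coordinates and scores are made up. Two distinct 10×10 objects that sit 3 pixels apart already overlap with an IoU of about 0.54, so a threshold of 0.5 discards one correct box while a threshold of 0.7 keeps both.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Visit boxes by descending score; keep a box only if it does not
    overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two distinct 10x10 objects sitting 3 pixels apart (made-up coordinates).
boxes = [(0, 0, 10, 10), (3, 0, 13, 10)]
scores = [0.9, 0.8]
overlap = iou(boxes[0], boxes[1])
kept_strict = greedy_nms(boxes, scores, iou_threshold=0.5)
kept_loose = greedy_nms(boxes, scores, iou_threshold=0.7)
print(overlap, kept_strict, kept_loose)
```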

To address these problems, the PaddlePaddle team improved both the pipeline and the algorithm of the general-purpose PP-YOLOE+ detection model and proposed PP-YOLOE-SOD (Small Object Detection), a detector dedicated to small targets.

2. Model introduction

2.1 Model advantages

Compared with the PP-YOLOE model, the improvements of PP-YOLOE-SOD mainly include the introduction of the Transformer global attention mechanism in the neck and the use of vector-based DFL in the regression branch.

Introduce Transformer global attention mechanism

Applying the Transformer to computer vision is currently an active research direction. The original ViT splits the image into patches, adds a position embedding, feeds the result into a Transformer encoder, and attaches a classification or detection head, which achieves good results.

PP-YOLOE-SOD does something similar: it mainly adds two modules, a position embedding and an encoder. The difference is that the input is the last-layer feature map rather than the raw image.
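To make the idea concrete, the sketch below flattens a tiny feature map into tokens, adds a position embedding, and applies one round of global self-attention so that every location can attend to every other location. It is a simplified illustration in plain Python, not the PP-YOLOE-SOD implementation: the 1D sine/cosine embedding, the single head, and the use of the tokens themselves as queries, keys, and values (instead of learned projections) are all assumptions made to keep the example short.

```python
import math


def sincos_position_embedding(num_tokens, dim):
    """Fixed 1D sine/cosine embedding, one vector of length dim per token."""
    emb = []
    for pos in range(num_tokens):
        vec = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        emb.append(vec)
    return emb


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def global_self_attention(feature_map):
    """feature_map: H x W x C nested lists. Returns (tokens_out, weights)."""
    # Flatten the H x W grid into a sequence of H*W tokens of length C.
    tokens = [cell for row in feature_map for cell in row]
    dim = len(tokens[0])
    pos = sincos_position_embedding(len(tokens), dim)
    tokens = [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(tokens, pos)]
    # Simplification: queries, keys and values are the tokens themselves.
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted sum over ALL tokens (global context).
    out = [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(dim)]
        for row in weights
    ]
    return out, weights


# A made-up 2 x 2 feature map with 4 channels.
fmap = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
]
out, weights = global_self_attention(fmap)
print(len(out), len(out[0]))
```

Because the cost of attention grows with the square of the number of tokens, applying it to the last (smallest) feature map keeps the extra computation modest.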

PP-YOLOE+ structure diagram

PP-YOLOE-SOD structure diagram

2.2 PP-YOLOE-SOD model library (COCO model)

Model	mAP^val	AP^0.5	AP^0.75	AP^small	AP^medium	AP^large	AR^small	AR^medium	AR^large	Download link	Configuration file
PP-YOLOE+_SOD-l	53.0	70.4	57.7	37.1	57.5	69.0	56.5	77.5	86.7	Download link	Configuration file

Note:

The model in the table above is trained on original images, and evaluation and prediction also use original images. The network input scale is 640×640, the training set is COCO train2017, and the validation set is val2017. Training runs for 80 epochs with a total batch_size of 64.
SOD means that the model uses the vector-based DFL algorithm and a center-prior optimization strategy for small targets, and adds a Transformer to the neck of the model, which improves AP^small by 1.9.
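In DFL-style regression, the head does not predict a box-side distance directly. It predicts a score for each of a fixed set of discrete bins, and the distance is decoded as the expected value of the resulting distribution. The sketch below illustrates that decoding step only. It is not PaddleDetection's code: the function name is hypothetical, the logits are made up, and the assumption that the bins are the integers from `reg_range[0]` to `reg_range[1] - 1` (using the `reg_range: [-2, 17]` value from the training config in section 4) is mine.

```python
import math


def dfl_expected_distance(logits, reg_range=(-2, 17)):
    """Decode one box-side distance as the expectation over discrete bins.

    Assumption: bin values are the integers reg_range[0] .. reg_range[1] - 1.
    """
    bins = list(range(reg_range[0], reg_range[1]))
    if len(logits) != len(bins):
        raise ValueError("expected one logit per bin")
    # Softmax turns the logits into a probability distribution over bins.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The decoded distance is the probability-weighted average bin value.
    return sum(p * b for p, b in zip(probs, bins))


# 19 bins for reg_range (-2, 17); all logits below are made up.
uniform = dfl_expected_distance([0.0] * 19)
# Equal mass on bins 3 and 4, negligible mass elsewhere.
logits = [-50.0] * 19
logits[5] = logits[6] = 0.0  # index 5 -> bin 3, index 6 -> bin 4
between = dfl_expected_distance(logits)
print(uniform, between)
```

The second call shows why a distribution is useful: a distance that falls between two bins is represented by splitting the probability between them, which gives sub-bin precision.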

3. Data preprocessing

3.1 Dataset Introduction

The NWPU VHR-10 dataset contains 800 high-resolution satellite images cropped from Google Earth and the Vaihingen dataset and then manually annotated by experts. The dataset is divided into 10 categories (airplanes, ships, storage tanks, baseball diamonds, tennis courts, basketball courts, ground track fields, harbors, bridges, and vehicles).

It consists of 715 RGB images and 85 pan-sharpened color infrared images. The 715 RGB images were collected from Google Earth, with spatial resolutions ranging from 0.5 m to 2 m. The 85 pan-sharpened color infrared images come from the Vaihingen data and have a spatial resolution of 0.08 m.

The dataset contains a total of 3775 object instances: 757 airplanes, 390 baseball diamonds, 159 basketball courts, 124 bridges, 224 harbors, 163 ground track fields, 302 ships, 655 storage tanks, 524 tennis courts, and 477 vehicles. All instances are manually annotated with horizontal bounding boxes.

The original dataset contains the following files:

negative image set: Contains 150 images that do not contain any objects of a given object class
positive image set: 650 images, each image contains at least one target to be detected
Ground Truth: Contains 650 individual text files, each corresponding to an image in the “Positive Image Set” folder. Each line of these text files defines a ground truth bounding box in the following format:

(x1,y1),(x2,y2),a
where (x1, y1) are the coordinates of the upper-left corner of the bounding box, (x2, y2) are the coordinates of the lower-right corner,
and a is the object category (1-airplane, 2-ship, 3-storage tank, 4-baseball diamond, 5-tennis court, 6-basketball court, 7-ground track field, 8-harbor, 9-bridge, 10-vehicle).
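A line in this format can be turned into a COCO-style annotation entry, whose `bbox` field is `[x, y, width, height]` rather than two corners. The following is a minimal sketch of that conversion, not the script used to produce the converted dataset: the function name is hypothetical, the sample line is made up, and a real converter would also need to assign `id` and `image_id` fields.

```python
import re

# Matches "(x1,y1),(x2,y2),a" with optional spaces around the numbers.
LINE_PATTERN = re.compile(
    r"\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*,\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*,\s*(\d+)"
)


def parse_ground_truth_line(line):
    """Convert one ground-truth line into a partial COCO annotation dict."""
    match = LINE_PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognised ground-truth line: {!r}".format(line))
    x1, y1, x2, y2, category_id = (int(g) for g in match.groups())
    width, height = x2 - x1, y2 - y1
    return {
        "category_id": category_id,
        "bbox": [x1, y1, width, height],  # COCO uses top-left corner + size
        "area": width * height,
        "iscrowd": 0,
    }


# Made-up sample line in the format described above.
annotation = parse_ground_truth_line("(563,478),(630,573),1")
print(annotation)
```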

The original dataset is in VOC format; the copy used in this project has already been converted to COCO format.

3.2 Dataset decompression

# Decompress the dataset
%cd /home/aistudio/work
!mkdir dataset
!unzip /home/aistudio/data/data198756/dataset_coco.zip -d /home/aistudio/work/dataset

4. Model training

# Clone the PaddleDetection repository
# The gitee mirror downloads faster from within China
%cd /home/aistudio
!git clone https://gitee.com/paddlepaddle/PaddleDetection.git

# github
# !git clone https://github.com/PaddlePaddle/PaddleDetection.git

# If git clone is very slow, you can use the following command to directly decompress the PaddleDetection archive I uploaded
!unzip /home/aistudio/data/data199313/PaddleDetection.zip -d /home/aistudio

Before training, we need to go to the /home/aistudio/PaddleDetection/configs/datasets/coco_detection.yml file and modify the dataset path, as follows:

metric: COCO
num_classes: 10 # The data set category is 10

TrainDataset:
  name: COCODataSet
  image_dir: /home/aistudio/work/dataset/image
  anno_path: dataset/instances_train2017.json
  dataset_dir: /home/aistudio/work
  data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']

EvalDataset:
  name: COCODataSet
  image_dir: /home/aistudio/work/dataset/image
  anno_path: dataset/instances_val2017.json
  dataset_dir: /home/aistudio/work
  allow_empty: true

TestDataset:
  name: ImageFolder
  anno_path: dataset/instances_val2017.json # also support txt (like VOC's label_list.txt)
  dataset_dir: /home/aistudio/work # if set, anno_path will be 'dataset_dir/anno_path'

At the same time, we also need to modify the parameters in the /home/aistudio/PaddleDetection/configs/smalldet/ppyoloe_plus_sod_crn_l_80e_coco.yml file:

_BASE_: [
  '../datasets/coco_detection.yml',
  '../runtime.yml',
  '../ppyoloe/_base_/optimizer_80e.yml',
  '../ppyoloe/_base_/ppyoloe_plus_crn.yml',
  '../ppyoloe/_base_/ppyoloe_plus_reader.yml',
]
log_iter: 10 # print a log entry every 10 iterations
snapshot_epoch: 5 # save a snapshot and evaluate every 5 epochs
weights: output/ppyoloe_plus_sod_crn_l_80e_coco/model_final

pretrain_weights: https://bj.bcebos.com/v1/paddledet/models/pretrained/ppyoloe_crn_l_obj365_pretrained.pdparams
depth_mult: 1.0
width_mult: 1.0

CustomCSPPAN:
  num_layers: 4
  use_trans: True

PPYOLOEHead:
  reg_range: [-2, 17]
  static_assigner_epoch: -1
  assigner:
    name: TaskAlignedAssigner_CR
    center_radius: 1
  nms:
    name: MultiClassNMS
    nms_top_k: 1000
    keep_top_k: 300
    score_threshold: 0.01
    nms_threshold: 0.7

In addition, YOLOE is configured for 8-card training by default, while we train on a single card, so the learning rate in /home/aistudio/PaddleDetection/configs/ppyoloe/_base_/optimizer_80e.yml needs to be adjusted as follows:

epoch: 80

LearningRate:
  base_lr: 0.000125 # the original 0.001 divided by 8
  schedulers:
    - name: CosineDecay
      max_epochs: 96
    - name: LinearWarmup
      start_factor: 0.
      epochs: 5

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0005
    type: L2
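The `base_lr` value above follows the common linear scaling rule: the learning rate is kept proportional to the total batch size, so going from 8 cards to 1 card with the same per-card batch size divides it by 8. The helper below is a small sketch of that arithmetic; the function name and its parameters are hypothetical and not part of PaddleDetection.

```python
def scale_learning_rate(reference_lr, reference_gpus, gpus,
                        reference_batch_per_gpu=1, batch_per_gpu=1):
    """Linear scaling rule: the learning rate follows the total batch size."""
    reference_total = reference_gpus * reference_batch_per_gpu
    total = gpus * batch_per_gpu
    return reference_lr * total / reference_total


# Default config: 0.001 for 8 cards. Single-card training, same per-card batch.
single_card_lr = scale_learning_rate(0.001, reference_gpus=8, gpus=1)
print(single_card_lr)
```

If the per-card batch size is also changed, pass both batch sizes so that the ratio of total batch sizes is used instead of the ratio of card counts.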

# Install required dependencies
!pip install pycocotools
# import package
!pip install -r ~/PaddleDetection/requirements.txt

# training
%cd /home/aistudio/PaddleDetection
!python tools/train.py -c configs/smalldet/ppyoloe_plus_sod_crn_l_80e_coco.yml --amp --eval --use_vdl True --vdl_log_dir vdl_log_dir/scalar

We can visualize the training through the VisualDL service, as follows:

After clicking to enter VisualDL, we can see the visualized results as follows:

5. Model evaluation

# evaluation
%cd /home/aistudio/PaddleDetection
!python tools/eval.py -c configs/smalldet/ppyoloe_plus_sod_crn_l_80e_coco.yml -o weights=output/ppyoloe_plus_sod_crn_l_80e_coco/best_model.pdparams

/home/aistudio/PaddleDetection
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/__init__.py:107: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import MutableMapping
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/rcsetup.py:20: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Iterable, Mapping
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/colors.py:53: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Sized
Warning: Unable to use MOT metric, please install motmetrics, for example: `pip install motmetrics`, see https://github.com/longcw/py-motmetrics
Warning: Unable to use MCMOT metric, please install motmetrics, for example: `pip install motmetrics`, see https://github.com/longcw/py-motmetrics
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
W0315 17:09:25.167379 38757 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0315 17:09:25.170883 38757 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
[03/15 17:09:27] ppdet.data.source.coco INFO: Load [130 samples valid, 0 samples invalid] in file /home/aistudio/work/dataset/instances_val2017.json.
[03/15 17:09:29] ppdet.utils.checkpoint INFO: Finish loading model weights: output/ppyoloe_plus_sod_crn_l_80e_coco/best_model.pdparams
[03/15 17:09:29] ppdet.engine INFO: Eval iter: 0
[03/15 17:09:34] ppdet.metrics.metrics INFO: The bbox result is saved to bbox.json.
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
[03/15 17:09:34] ppdet.metrics.coco_utils INFO: Start evaluate...
Loading and preparing results... ?
DONE (t=0.46s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=2.33s).
Accumulating evaluation results...
DONE (t=0.31s).
 Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.776
 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.977
 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.882
 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.768
 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.759
 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.843
 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.288
 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.710
 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.831
 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.787
 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.813
 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.897
[03/15 17:09:37] ppdet.engine INFO: Total sample number: 130, average FPS: 26.371598818398887

Extract the validation set images from the image folder according to the instances_val2017.json file

import json
import os
import shutil

datasets_path = '/home/aistudio/work/dataset/'
img_dir = '/home/aistudio/work/dataset/image'
annotation_dir = '/home/aistudio/work/dataset/test'

# Create the output folder for the validation images if it does not exist yet
os.makedirs(annotation_dir, exist_ok=True)

# Load the COCO-format validation annotations
with open('{}instances_val2017.json'.format(datasets_path), encoding='utf-8') as f:
    gt = json.load(f)

# Collect the file names of all validation images
lst = [img_info['file_name'] for img_info in gt['images']]

# Copy each validation image into the test folder
for fileNum in lst:
    imgName = os.path.join(img_dir, fileNum)
    if not os.path.isdir(imgName):
        print(imgName)
        shutil.copy(imgName, annotation_dir)

6. Model prediction

# prediction
%cd /home/aistudio/PaddleDetection
!python tools/infer.py -c configs/smalldet/ppyoloe_plus_sod_crn_l_80e_coco.yml -o weights=output/ppyoloe_plus_sod_crn_l_80e_coco/best_model.pdparams --infer_dir=/home/aistudio/work/dataset/test --output_dir=infer_output

The inference results are as follows:

7. Model export

For deployment or GPU speed testing, the PP-YOLOE-SOD model must first be exported with tools/export_model.py.

%cd /home/aistudio/PaddleDetection
!python tools/export_model.py -c configs/smalldet/ppyoloe_plus_sod_crn_l_80e_coco.yml -o weights=output/ppyoloe_plus_sod_crn_l_80e_coco/best_model.pdparams

/home/aistudio/PaddleDetection
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/__init__.py:107: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import MutableMapping
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/rcsetup.py:20: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Iterable, Mapping
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/colors.py:53: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Sized
Warning: Unable to use MOT metric, please install motmetrics, for example: `pip install motmetrics`, see https://github.com/longcw/py-motmetrics
Warning: Unable to use MCMOT metric, please install motmetrics, for example: `pip install motmetrics`, see https://github.com/longcw/py-motmetrics
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
[03/15 18:17:47] ppdet.utils.checkpoint INFO: Finish loading model weights: output/ppyoloe_plus_sod_crn_l_80e_coco/best_model.pdparams
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
[03/15 18:17:48] ppdet.engine INFO: Export inference config file to output_inference/ppyoloe_plus_sod_crn_l_80e_coco/infer_cfg.yml
[03/15 18:18:02] ppdet.engine INFO: Export model and saved in output_inference/ppyoloe_plus_sod_crn_l_80e_coco

8. Model deployment

# Select a validation set image to test the deployment result
%cd /home/aistudio/PaddleDetection
!python deploy/python/infer.py --model_dir=/home/aistudio/PaddleDetection/output_inference/ppyoloe_plus_sod_crn_l_80e_coco --image_file=/home/aistudio/work/dataset/test/421.jpg --device=GPU --save_images=True --threshold=0.25 --slice_infer --slice_size 500 500 --overlap_ratio 0.25 0.25 --combine_method=nms --match_threshold=0.6 --match_metric=ios

/home/aistudio/PaddleDetection
----------- Running Arguments -----------
action_file: None
batch_size: 1
camera_id: -1
combine_method: nms
cpu_threads: 1
device: GPU
enable_mkldnn: False
enable_mkldnn_bfloat16: False
image_dir: None
image_file: /home/aistudio/work/dataset/test/421.jpg
match_metric: ios
match_threshold: 0.6
model_dir: /home/aistudio/PaddleDetection/output_inference/ppyoloe_plus_sod_crn_l_80e_coco
output_dir: output
overlap_ratio: [0.25, 0.25]
random_pad: False
reid_batch_size: 50
reid_model_dir: None
run_benchmark: False
run_mode: paddle
save_images: True
save_mot_txt_per_img: False
save_mot_txts: False
save_results: False
scaled: False
slice_infer: True
slice_size: [500, 500]
threshold: 0.25
tracker_config: None
trt_calib_mode: False
trt_max_shape: 1280
trt_min_shape: 1
trt_opt_shape: 640
use_coco_category: False
use_dark: True
use_gpu: False
video_file: None
window_size: 50
------------------------------------------
----------- Model Configuration -----------
Model Arch: YOLO
Transform Order:
--transform op: Resize
--transform op:NormalizeImage
--transform op: Permute
--------------------------------------------
slice to {} sub_samples.6
class_id: 9, confidence: 0.8728, left_top: [340.72, 103.24], right_bottom: [369.60, 158.67]
class_id: 9, confidence: 0.7929, left_top: [330.79, 163.38], right_bottom: [361.78, 210.25]
class_id: 9, confidence: 0.5966, left_top: [352.58, 53.78], right_bottom: [379.66, 105.11]
class_id: 9, confidence: 0.5286, left_top: [361.60, 4.02], right_bottom: [391.68, 48.03]
class_id: 9, confidence: 0.8936, left_top: [407.95, 233.15], right_bottom: [465.88, 262.83]
class_id: 9, confidence: 0.8747, left_top: [696.56, 390.71], right_bottom: [723.77, 438.41]
class_id: 9, confidence: 0.8253, left_top: [626.39, 434.95], right_bottom: [653.88, 482.35]
class_id: 9, confidence: 0.8880, left_top: [922.41, 258.08], right_bottom: [954.44, 307.52]
class_id: 9, confidence: 0.8653, left_top: [654.27, 256.37], right_bottom: [678.71, 303.64]
class_id: 9, confidence: 0.8627, left_top: [745.64, 64.88], right_bottom: [772.16, 110.10]
class_id: 9, confidence: 0.8569, left_top: [887.41, 241.04], right_bottom: [920.09, 294.39]
class_id: 9, confidence: 0.8382, left_top: [686.31, 25.25], right_bottom: [714.68, 78.60]
class_id: 9, confidence: 0.8245, left_top: [657.26, 187.57], right_bottom: [689.04, 244.98]
class_id: 9, confidence: 0.7201, left_top: [736.03, 115.66], right_bottom: [764.34, 166.88]
class_id: 9, confidence: 0.8704, left_top: [285.31, 409.72], right_bottom: [315.14, 465.77]
class_id: 9, confidence: 0.8475, left_top: [261.15, 549.25], right_bottom: [289.68, 595.72]
class_id: 9, confidence: 0.8131, left_top: [272.90, 482.58], right_bottom: [302.10, 531.71]
class_id: 9, confidence: 0.8013, left_top: [305.93, 293.37], right_bottom: [331.93, 343.54]
class_id: 9, confidence: 0.6527, left_top: [246.47, 612.18], right_bottom: [276.39, 671.80]
class_id: 9, confidence: 0.8807, left_top: [689.47, 232.56], right_bottom: [715.29, 278.37]
class_id: 9, confidence: 0.5627, left_top: [982.50, 276.36], right_bottom: [1008.19, 328.45]
save result to: output/421.jpg
Test iter 0
------------------ Inference Time Info -------------------
total_time(ms): 1583.2, img_num: 1
average latency time(ms): 1583.20, QPS: 0.631632
preprocess_time(ms): 68.30, inference_time(ms): 1514.90, postprocess_time(ms): 0.00
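The `--slice_infer`, `--slice_size`, and `--overlap_ratio` options in the command above run the detector on overlapping sub-images and then merge the per-slice detections. The sketch below shows one plausible way to lay out such windows, in the style of sliding-window slicing tools. It is not the code that deploy/python/infer.py runs: the function name is hypothetical, and the 1000×700 image size is made up for illustration rather than taken from 421.jpg.

```python
def slice_windows(image_width, image_height, slice_width=500, slice_height=500,
                  overlap_width_ratio=0.25, overlap_height_ratio=0.25):
    """Return overlapping (x1, y1, x2, y2) windows that cover the image."""
    # Neighbouring windows are shifted by the slice size minus the overlap.
    step_x = slice_width - int(slice_width * overlap_width_ratio)
    step_y = slice_height - int(slice_height * overlap_height_ratio)
    windows = []
    y = 0
    while True:
        # Clamp the last row so that it ends exactly at the bottom edge.
        y2 = min(y + slice_height, image_height)
        y1 = max(0, y2 - slice_height)
        x = 0
        while True:
            # Clamp the last column so that it ends exactly at the right edge.
            x2 = min(x + slice_width, image_width)
            x1 = max(0, x2 - slice_width)
            windows.append((x1, y1, x2, y2))
            if x2 >= image_width:
                break
            x += step_x
        if y2 >= image_height:
            break
        y += step_y
    return windows


# Hypothetical 1000 x 700 image with 500 x 500 slices and 25% overlap.
windows = slice_windows(1000, 700)
for window in windows:
    print(window)
```

The overlap matters for small targets near slice borders: an object cut by one window is usually fully visible in its neighbour, and the duplicate detections this creates are what the combine step (here `--combine_method=nms`) removes.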

The inference results are as follows:

Summary

PP-YOLOE-SOD is a detection model specialized for small targets, developed by the PaddleDetection team. It uses a vector-based DFL algorithm tied to the dataset distribution and a center-prior optimization strategy tuned for small targets, adds a Transformer module to the neck (FPN) of the model, and combines strategies such as adding a P2 layer and using a larger input size, reaching high accuracy on multiple small target datasets.
It is recommended to train, evaluate, and predict with the PP-YOLOE-SOD model directly on original images or sub-images, without slicing and re-stitching. For more details and ablation experiments, please refer to the COCO model and the VisDrone model.
Through this project I learned knowledge and skills that I had not mastered before. For example, I had never used a COCO-format dataset; in this project I worked with one and became familiar with it.

Student: Ji Kangyi

Mentor: Zhou Jun