[MMdetection] Environment configuration, config file analysis, and training a custom VOC dataset

MMDetection is an open-source project for object detection tasks. It implements a large number of object detection algorithms based on PyTorch and encapsulates dataset construction, model building, and training strategy into modules. By calling these modules, a new algorithm can be implemented with a small amount of code, which greatly improves code reuse. This article records how to use MMDetection in fairly plain language; if you are a professional, you may prefer the following tutorials:
MMDetection Framework Getting Started Tutorial
Official document – config file tutorial

1. Folder structure

Download the mmdetection code from GitHub; the directory obtained after decompression is as follows (only the main folders are shown here):

├─mmdetection-master
│ ├─build
│ ├─checkpoints # Store checkpoints (model weights)
│ ├─configs # store configuration files
│ ├─data # store data
│ ├─demo
│ ├─dist
│ ├─docker
│ ├─docs
│ ├─mmdet # The main source code of mmdetection, including model definition and the like
│ ├─requirements
│ ├─resources
│ ├─src
│ ├─tests
│ ├─tools # Training, testing, printing config files and other main tools
│ └─work_dirs # Store training logs and training results

2. Environment configuration

  • Create an environment and install pytorch:
    conda create --name envName python=3.7
    conda activate envName
    conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
  • Follow the tutorial on the official github to install mmcv:
    pip install -U openmim
    mim install mmcv-full
  • Install mmdet:
    pip install mmdet
    In the past, installing mmcv was very error-prone, but now, as long as you install PyTorch with matching versions and then install mmcv via openmim, basically no errors are reported. The commands above configure a Python 3.7 environment; other Python versions should work as well.
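
After installation, a quick way to sanity-check the environment (a minimal sketch) is to confirm that the three packages import cleanly and that PyTorch can see the GPU:

# Sanity check: all three packages should import without errors
import torch
import mmcv
import mmdet

print(torch.__version__, torch.cuda.is_available())  # e.g. 1.8.0 True
print(mmcv.__version__)
print(mmdet.__version__)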

3. Model training

The key to training models with MMDetection is understanding the config (configuration file). To train Faster R-CNN, for example, you only need to prepare the configuration file and then launch training with the following command:
python tools/train.py configs/faster_rcnn/faster_rcnn_r101_fpn_2x_towervoc.py
Here, configs/faster_rcnn/faster_rcnn_r101_fpn_2x_towervoc.py is the configuration file used during training; all parameter settings required for training are defined in it.
A few points to keep in mind when using it:

  • Try not to modify anything other than configuration files
  • Do not change the original configuration files; for a new task, create a new configuration file

Because the MMDetection project contains many files, if you train a network and change its original configuration file, or change parameters inside some .py module, you may forget about it after a while; the next time you use it, other networks that depend on those modules will be affected.

OK, next let's introduce the config file.

1. Config file naming convention

{model}_[model setting]_{backbone}_{neck}_[norm setting]_[misc]_[gpu x batch_per_gpu]_{schedule}_{dataset}
The meaning of each field:

{model}: model type, such as faster_rcnn, mask_rcnn, etc.

[model setting]: specific model, such as without_semantic in htc, moment in reppoints, etc.

{backbone}: the type of backbone network, such as r50 (ResNet-50), x101 (ResNeXt-101), etc.

{neck}: the type of neck, such as fpn, pafpn, nasfpn, c4, etc.

[norm_setting]: bn (Batch Normalization) is used by default, and other specifications can include gn (Group Normalization), syncbn (Synchronized Batch Normalization), etc. gn-head/gn-neck means that GN is only applied to the Head or Neck of the network, and gn-all means that GN is used on the entire model, such as the backbone network, Neck and Head.

[misc]: Various settings/plugins in the model, such as dconv, gcb, attention, albu, mstrain, etc.

[gpu x batch_per_gpu]: Number of GPUs and number of samples per GPU, 8x2 is used by default.

{schedule}: training schedule; the options are 1x, 2x, 20e, etc. 1x and 2x mean 12 epochs and 24 epochs respectively, and 20e (used in cascade models) means 20 epochs. For 1x/2x, the initial learning rate is decayed by a factor of 10 at the 8th/16th and 11th/22nd epochs; for 20e, it is decayed at the 16th and 19th epochs.

{dataset}: dataset, such as coco, cityscapes, voc_0712, wider_face, etc.
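
For example, the file faster_rcnn_r101_fpn_2x_towervoc.py from the training command above decodes as: model faster_rcnn, backbone r101 (ResNet-101), neck fpn, schedule 2x (24 epochs), and dataset towervoc.
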
2. Config file content analysis

The config file for each network consists of four parts:

  • model settings
  • dataset settings
  • schedules
  • runtime

The official tutorial linked at the beginning of this article contains detailed line-by-line comments, using the Mask R-CNN configuration file as an example. Here I only record a few things I initially misunderstood. First of all, you should learn to use the tool tools/misc/print_config.py: it prints the fully resolved parameters that are actually fed into the network for training. The syntax is:
python tools/misc/print_config.py configs/yolox/yolox_l_8x8_300e_coco.py

1. Inherit initial parameters from _base_

This means the configuration file inherits from these base configs when it is initialized; any parameter you do not redefine later keeps its value from _base_. Taking configs/yolox/yolox_l_8x8_300e_coco.py as an example, the learning-rate schedule lr_config in YOLOX is initially inherited from configs/_base_/schedules/schedule_1x.py, which would give:

lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500, # linear warm-up over the first 500 iterations
    warmup_ratio=0.001, # warm-up starts at 0.001 x the base lr defined in the optimizer
    step=[8, 11])

But the learning-rate schedule printed by print_config turns out to be different. This is because, after initially inheriting lr_config from the _base_ file, the configuration file overrides it later on:

lr_config = dict(
    _delete_=True,
    policy='YOLOX',
    warmup='exp',
    by_epoch=False,
    warmup_by_epoch=True,
    warmup_ratio=1,
    warmup_iters=5, # 5 epochs
    num_last_epochs=num_last_epochs,
    min_lr_ratio=0.05)

_delete_=True means: discard the lr_config inherited from _base_ and replace it entirely with the new set of key-value pairs defined here. If you only want to modify some fields, such as step, you don't need _delete_; just add this to the configuration file:

lr_config = dict(
    step=[7, 10])

Note that the key-value pairs in a config file are read in order: if the same parameter is defined several times, the later definition overrides the earlier one.
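
To inspect the merged result in code rather than through print_config, you can load a config directly. A minimal sketch, assuming mmcv-full 1.x (where the Config class lives in the mmcv package):

from mmcv import Config

# fromfile() resolves _base_ inheritance and all later overrides
cfg = Config.fromfile('configs/yolox/yolox_l_8x8_300e_coco.py')
print(cfg.lr_config)  # the final merged lr_config, same as print_config shows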

2. Automatic learning rate adjustment

At first I mistakenly thought that this parameter adjusted batch_size. In fact, it means the learning rates in this project are tuned for 8 GPUs x 8 batch_size (base_batch_size=64). If your setup is different, MMDetection can scale the initial learning rate in proportion to your actual batch size (in recent 2.x versions this is enabled with the --auto-scale-lr flag of tools/train.py), so don't change this value, and don't change the initial learning rate either.
The place to adjust batch_size is samples_per_gpu in the data section of the config, for example:
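
# excerpt from the data section of a config (see the full configs below)
data = dict(
    samples_per_gpu=4,  # batch size per GPU: this is the value to change
    workers_per_gpu=2,  # dataloader worker processes per GPU
    # train=..., val=..., test=... as usual
)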

4. Model training practice

Training on a COCO-format dataset with MMDetection is very simple, so how do you train on a VOC dataset of your own? Here I take the SSD model as an example. First, let me introduce my dataset: it is in VOC format with three categories in total, and the folder structure is as follows:

├─TowerVoc
│ └─VOC2012
│ ├─Annotations
│ ├─ImageSets
│ │ └─Main
│ └─JPEGImages

Here I only describe how to do it; for the exact parameters to change, compare the configuration files I give below with the originals (the changed places are also marked in the code).
Open the configuration file corresponding to SSD (configs/ssd/ssd512_coco.py) and you can see that the COCO dataset is used for training by default. As for the inheritance relationship: ssd512_coco.py inherits from ssd300_coco.py, which in turn inherits from the _base_ files for the model, dataset, schedule, and runtime (the same chain is visible in the _base_ lists of the configs below).

To train a custom voc dataset, three configuration files need to be created:

  • Copy ssd512_coco.py and name it ssd512_towervoc.py. Here "tower" is the name of my dataset, chosen arbitrarily.
  • Copy ssd300_coco.py and name it ssd300_voc.py.
  • Copy configs/_base_/datasets/voc0712.py and name it configs/_base_/datasets/voctower.py.

The codes of the three configuration files are as follows:
ssd512_towervoc.py

_base_ = 'ssd300_voc.py' # change 1
input_size = 512
model = dict(
    neck = dict(
        out_channels=(512, 1024, 512, 256, 256, 256, 256),
        level_strides=(2, 2, 2, 2, 1),
        level_paddings=(1, 1, 1, 1, 1),
        last_kernel_size=4),
    bbox_head = dict(
        in_channels=(512, 1024, 512, 256, 256, 256, 256),
        anchor_generator = dict(
            type='SSDAnchorGenerator',
            scale_major=False,
            input_size=input_size,
            basesize_ratio_range=(0.1, 0.9),
            strides=[8, 16, 32, 64, 128, 256, 512],
            ratios=[[2], [2, 3], [2, 3], [2, 3], [2, 3], [2], [2]])))
# dataset settings
dataset_type = 'VOCDataset' # change 3
data_root = 'data/TowerVoc/' # change 4
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Expand',
        mean=img_norm_cfg['mean'],
        to_rgb=img_norm_cfg['to_rgb'],
        ratio_range=(1, 4)),
    dict(
        type='MinIoURandomCrop',
        min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
        min_crop_size=0.3),
    dict(type='Resize', img_scale=(640, 640), keep_ratio=False),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='PhotoMetricDistortion',
        brightness_delta=32,
        contrast_range=(0.5, 1.5),
        saturation_range=(0.5, 1.5),
        hue_delta=18),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(512, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=4, # If necessary, you can change it to your own batchsize
    workers_per_gpu=2,
    train=dict(
        _delete_=True,
        type='RepeatDataset',
        times=5,
        dataset=dict(
            type=dataset_type,
            ann_file=data_root + 'VOC2012/ImageSets/Main/train.txt', # Change 5
            img_prefix=data_root + 'VOC2012/',
            pipeline=train_pipeline)),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))
# optimizer
optimizer = dict(type='SGD', lr=2e-3, momentum=0.9, weight_decay=5e-4)
optimizer_config = dict(_delete_=True)
custom_hooks = [
    dict(type='NumClassCheckHook'),
    dict(type='CheckInvalidLossHook', interval=50, priority='VERY_LOW')
]
  
# evaluation = dict(interval=1, metric='mAP')
  
# NOTE: `auto_scale_lr` is for automatically scaling LR,
# USER SHOULD NOT CHANGE ITS VALUES.
# base_batch_size = (8 GPUs) x (8 samples per GPU)
auto_scale_lr = dict(base_batch_size=64)

ssd300_voc.py

_base_ = [
    '../_base_/models/ssd300.py', '../_base_/datasets/voctower.py', # Change 1
    '../_base_/schedules/schedule_2x.py', '../_base_/default_runtime.py'
]
# model settings
input_size = 300
model = dict(
    type='SingleStageDetector',
    backbone=dict(
        type='SSDVGG',
        depth=16,
        with_last_pool=False,
        ceil_mode=True,
        out_indices=(3, 4),
        out_feature_indices=(22, 34),
        init_cfg=dict(
            type='Pretrained', checkpoint='open-mmlab://vgg16_caffe')),
    neck = dict(
        type='SSDNeck',
        in_channels=(512, 1024),
        out_channels=(512, 1024, 512, 256, 256, 256),
        level_strides=(2, 2, 1, 1),
        level_paddings=(1, 1, 0, 0),
        l2_norm_scale=20),
    bbox_head = dict(
        type='SSDHead',
        in_channels=(512, 1024, 512, 256, 256, 256),
        num_classes=3, # change 2
        anchor_generator = dict(
            type='SSDAnchorGenerator',
            scale_major=False,
            input_size=input_size,
            basesize_ratio_range=(0.15, 0.9),
            strides=[8, 16, 32, 64, 100, 300],
            ratios=[[2], [2, 3], [2, 3], [2, 3], [2], [2]]),
        bbox_coder = dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[.0, .0, .0, .0],
            target_stds=[0.1, 0.1, 0.2, 0.2])),
    # model training and testing settings
    train_cfg = dict(
        assigner = dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.,
            ignore_iof_thr=-1,
            gt_max_assign_all=False),
        smoothl1_beta=1.,
        allowed_border=-1,
        pos_weight=-1,
        neg_pos_ratio=3,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        nms=dict(type='nms', iou_threshold=0.45),
        min_bbox_size=0,
        score_thr=0.02,
        max_per_img=200))
cudnn_benchmark = True
  
# dataset settings
dataset_type = 'VOCDataset' # change 3
data_root = 'data/TowerVoc/'
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Expand',
        mean=img_norm_cfg['mean'],
        to_rgb=img_norm_cfg['to_rgb'],
        ratio_range=(1, 4)),
    dict(
        type='MinIoURandomCrop',
        min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
        min_crop_size=0.3),
    dict(type='Resize', img_scale=(300, 300), keep_ratio=False),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='PhotoMetricDistortion',
        brightness_delta=32,
        contrast_range=(0.5, 1.5),
        saturation_range=(0.5, 1.5),
        hue_delta=18),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(300, 300),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=3,
    train=dict(
        _delete_=True,
        type='RepeatDataset',
        times=5,
        dataset=dict(
            type=dataset_type,
            ann_file=data_root + 'VOC2012/ImageSets/Main/train.txt', # strictly speaking this need not be changed here,
            img_prefix=data_root + 'VOC2012/', # because ssd512_towervoc.py overrides these settings
            pipeline=train_pipeline)),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))
# optimizer
optimizer = dict(type='SGD', lr=2e-3, momentum=0.9, weight_decay=5e-4)
optimizer_config = dict(_delete_=True)
custom_hooks = [
    dict(type='NumClassCheckHook'),
    dict(type='CheckInvalidLossHook', interval=50, priority='VERY_LOW')
]
  
# NOTE: `auto_scale_lr` is for automatically scaling LR,
# USER SHOULD NOT CHANGE ITS VALUES.
# base_batch_size = (8 GPUs) x (8 samples per GPU)
auto_scale_lr = dict(base_batch_size=64)

voctower.py

# dataset settings
dataset_type = 'VOCDataset'
data_root = 'data/TowerVoc/' # Change to your own dataset folder
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(640, 640), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(640, 640),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=4, # set your own batch_size here; for the SSD configs above it gets overridden anyway,
    workers_per_gpu=2, # but some networks do not override this parameter, so it is safest to change it here too
    train=dict(
        type='RepeatDataset',
        times=3,
        dataset=dict(
            type=dataset_type,
            ann_file=data_root + 'VOC2012/ImageSets/Main/train.txt', # modify path
            img_prefix=data_root + 'VOC2012/',
            pipeline=train_pipeline)),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'VOC2012/ImageSets/Main/val.txt', # modify path
        img_prefix=data_root + 'VOC2012/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'VOC2012/ImageSets/Main/test.txt', # modify path
        img_prefix=data_root + 'VOC2012/',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric='mAP')

After making your own changes, run print_config to check whether the final parameters meet your expectations.
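For example (assuming you placed the new top-level config under configs/ssd/):
python tools/misc/print_config.py configs/ssd/ssd512_towervoc.py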

In addition to the above, the following two files also need to be modified:

  • anaconda3\envs\conda_env_name\lib\python3.7\site-packages\mmdet\core\evaluation\class_names.py
  • anaconda3\envs\conda_env_name\lib\python3.7\site-packages\mmdet\datasets\voc.py

Change the categories to your own. A sketch of each edit follows; the category names used below are hypothetical placeholders for my three classes, so substitute your own:
voc.py
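
# Sketch of the edit to the installed mmdet/datasets/voc.py: only the CLASSES
# tuple changes; the VOCDataset class and its XMLDataset base already exist
# in that file. The three names below are hypothetical placeholders for my
# tower dataset (the count must match num_classes in the model config).
class VOCDataset(XMLDataset):
    CLASSES = ('tower', 'insulator', 'damper')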

class_names.py
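
# Sketch of the edit to the installed class_names.py: make voc_classes()
# return the same (hypothetical) names used above.
def voc_classes():
    return ['tower', 'insulator', 'damper']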

Note that modifying the mmdet code inside the project directory is useless here. When configuring the environment above, we ran pip install mmdet, so the mmdet actually being used is the installed Python package, not the mmdet folder under the project. Therefore, if the categories of the data you want to train differ from the PASCAL VOC dataset, you need to modify the two files above. The best way, of course, would be to create a new dataset file for your own data, but that is more troublesome.
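
If you do want that cleaner route, a minimal sketch (assuming MMDetection 2.x, again with hypothetical class names) is to register your own dataset class instead of editing the installed files:

from mmdet.datasets.builder import DATASETS
from mmdet.datasets.xml_style import XMLDataset

@DATASETS.register_module()
class TowerDataset(XMLDataset):
    # VOC-style XML annotations; only the class names differ from VOCDataset
    CLASSES = ('tower', 'insulator', 'damper')

Then set type='TowerDataset' in the config and make sure the module gets imported, for example through the custom_imports field of the config.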

Writing this up was not easy; if it helped you, please give it a like~