YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection

This paper proposes YOLO-MS, a real-time object detector that enhances multi-scale feature representation. The core design is based on a series of studies of how convolutions with different kernel sizes at different scales affect object detection performance.

Improvements can be made from the following two perspectives:

From a local perspective, an MS-Block with a simple and effective hierarchical feature fusion strategy is designed. Inspired by Res2Net, multiple branches are introduced in the MS-Block to perform feature extraction. The difference is that an inverted bottleneck block with depth-wise convolution is used, which makes the effective use of large kernels affordable.

From a global perspective, the kernel size of the convolutions is gradually increased as the network goes deeper: small-kernel convolutions are used in shallow layers to process high-resolution features efficiently, while large-kernel convolutions are used in deep layers to capture a wide range of contextual information.

1 Related work

1.1 Real-time object detection

Most real-time object detectors adopt a one-stage framework, of which the YOLO series is the most representative. As a key factor affecting model performance, architecture design has been the focus of YOLO's development. The network architecture has undergone tremendous changes since YOLOv1: YOLOv4 improved DarkNet with cross-stage partial connections (CSPNet); YOLOv6 and PPYOLOE explored reparameterization techniques, which achieve higher accuracy without incurring additional inference cost; YOLOv7 proposed the Extended Efficient Layer Aggregation Network (E-ELAN), which learns and converges effectively by controlling the shortest and longest gradient paths; and RTMDet introduced large-kernel (5×5) convolutions to improve the feature extraction capability of its basic blocks, where the larger receptive field allows more comprehensive context modeling and significantly improves accuracy.

1.2 Multi-scale feature representation

Powerful multi-scale feature representation capabilities can effectively improve model performance, which has been proven in many tasks, including real-time object detection.

1.2.1 Multi-scale feature learning in real-time object detection

Many real-time object detectors extract multi-scale features by integrating features from different levels in the neck. For example, YOLOv3 and later YOLO series introduced FPN and PAFPN, respectively, to capture rich multi-scale semantics. The SPP module is also widely used to expand the receptive field, and multi-scale data augmentation is a common training trick. However, the mainstream basic building blocks, notably CSP blocks and ELAN blocks, focus on improving detection efficiency or introducing new training techniques while ignoring the importance of multi-scale feature representation. In contrast, our method focuses on learning richer multi-scale features.

1.2.2 Large Kernel Convolution

Recently, large-kernel convolutions have been revived, typically in depth-wise form. The wider receptive field provided by large-kernel convolutions is a powerful technique for building strong multi-scale feature representations. In the field of real-time object detection, RTMDet was the first attempt to introduce large-kernel convolutions into the network; however, due to speed constraints, its kernel size only reaches 5×5, and its uniform block design across stages limits how far large kernels can be pushed. Based on empirical studies, this paper proposes a Heterogeneous Kernel Selection (HKS) protocol that employs convolutional layers with different kernel sizes at different stages, achieving a good trade-off between speed and accuracy with the help of large-kernel convolutions.

2 Block proposed in this article

2.1 CSP Block and its variants

The CSP block is a stage-level gradient-path-based design that balances gradient combination and computational cost, and it is a widely used basic building block in the YOLO series. Several variants have been proposed, including the original version in YOLOv4 and YOLOv5, CSPVoVNet in Scaled-YOLOv4, ELAN in YOLOv7, and the large-kernel unit proposed in RTMDet. The following figures show the structures of the original CSP block and ELAN, respectively.

A key aspect ignored by the above real-time detectors is how to encode multi-scale features within the basic building block. Res2Net aggregates features from different hierarchies to enhance multi-scale representation, but it does not thoroughly explore the role of large-kernel convolutions: the main obstacle to incorporating large kernels into Res2Net is that the computational complexity becomes too large when the building block adopts standard convolutions. This article instead adopts the inverted residual structure from MobileNet to replace the standard 3×3 convolution, so that the benefits of large-kernel convolution are retained while the computational cost stays under control, as sketched below.
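To make this concrete, here is a minimal sketch of such an inverted-residual unit. It corresponds to the MSBlockLayer that the implementation in Section 3 references but does not show; the exact composition (1×1 expansion, depth-wise k×k convolution, 1×1 projection, each wrapped in mmcv's ConvModule with BN and SiLU) is our reading of the official YOLO-MS code, so treat it as an assumption rather than the authoritative definition.

import torch.nn as nn
from torch import Tensor
from mmcv.cnn import ConvModule
from mmdet.utils import OptConfigType


class MSBlockLayer(nn.Module):
    """Inverted-residual unit: 1x1 expand -> depth-wise kxk -> 1x1 project."""

    def __init__(self,
                 in_channel: int,
                 out_channel: int,
                 kernel_size: int,
                 conv_cfg: OptConfigType = None,
                 norm_cfg: OptConfigType = dict(type='BN'),
                 act_cfg: OptConfigType = dict(type='SiLU', inplace=True)) -> None:
        super().__init__()
        # 1x1 expansion to the wider out_channel width.
        self.in_conv = ConvModule(in_channel, out_channel, 1,
                                  conv_cfg=conv_cfg, norm_cfg=norm_cfg, act_cfg=act_cfg)
        # Depth-wise kxk convolution (groups == channels): this is what makes
        # large kernels affordable compared with standard convolutions.
        self.mid_conv = ConvModule(out_channel, out_channel, kernel_size,
                                   padding=kernel_size // 2, groups=out_channel,
                                   conv_cfg=conv_cfg, norm_cfg=norm_cfg, act_cfg=act_cfg)
        # 1x1 projection back to the original width.
        self.out_conv = ConvModule(out_channel, in_channel, 1,
                                   conv_cfg=conv_cfg, norm_cfg=norm_cfg, act_cfg=act_cfg)

    def forward(self, x: Tensor) -> Tensor:
        return self.out_conv(self.mid_conv(self.in_conv(x)))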

2.2 MS-block

In this article, n = 3; except for X_1, every split passes through an inverted residual structure. Concretely, the input is first expanded by a 1×1 convolution and split along the channel dimension into n parts X_1, ..., X_n, which are fused hierarchically:

Y_1 = X_1
Y_i = P_i(X_i + Y_{i-1}),  2 ≤ i ≤ n

where P_i denotes the i-th inverted residual branch (a stack of MSBlockLayers). The outputs Y_1, ..., Y_n are then concatenated and fused by another 1×1 convolution.

Previous real-time object detectors employ convolutions with the same kernel size at every encoder stage, but this is not the best choice for extracting multi-scale semantic information. In a pyramid structure, high-resolution features extracted from the shallow stages of the detector capture fine-grained semantics and are used to detect small objects, whereas low-resolution features from the deep stages capture high-level semantics and are used to detect large objects. If uniform small-kernel convolutions are adopted in all stages, the effective receptive field (ERF) of the deep stages is limited, hurting performance on large objects. Incorporating large-kernel convolutions at every stage can address this limitation, but large kernels with large ERFs encode wider regions, which increases the probability of including contaminating information around small objects and, as the experimental analysis shows, reduces inference speed.

HKS therefore utilizes heterogeneous convolutions at different stages to capture richer multi-scale features: the kernel size is gradually increased from shallow to deep stages, in step with the decrease of feature resolution. This strategy extracts fine-grained and coarse-grained semantic information simultaneously and enhances the encoder's multi-scale feature representation capability; a sketch of such a stage configuration is given below.
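As an illustration, this minimal sketch wires the MSBlock from Section 3 into four stages. The 3 → 5 → 7 → 9 kernel progression follows the paper's HKS protocol; the per-block kernel_sizes lists (an identity branch plus two k×k branches), the channel widths, and the omission of downsampling layers between stages are simplifying assumptions for this example.

import torch.nn as nn

# Hypothetical four-stage schedule: small kernels on high-resolution
# (shallow) features, large kernels on low-resolution (deep) features.
hks_stages = [
    # (in_channels, out_channels, per-branch kernel sizes)
    (64,   128, [1, 3, 3]),   # stage 1: highest resolution, smallest kernel
    (128,  256, [1, 5, 5]),   # stage 2
    (256,  512, [1, 7, 7]),   # stage 3
    (512, 1024, [1, 9, 9]),   # stage 4: lowest resolution, largest kernel
]

# Downsampling convolutions between stages are omitted for brevity.
stages = nn.ModuleList(
    MSBlock(c_in, c_out, kernel_sizes=ks) for c_in, c_out, ks in hks_stages
)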

As shown in the table below, applying large-kernel convolution to high-resolution features greatly increases the computational cost. The HKS protocol applies large-kernel convolution only to low-resolution features, so compared with using large kernels everywhere, the computational cost is significantly reduced, as the estimate below illustrates.
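The effect is easy to estimate for the depth-wise convolutions used here, since their multiply-accumulate (MAC) count is roughly H × W × C × k²: quadratic in the kernel size but linear in the number of pixels. The feature sizes and channel widths below are illustrative assumptions, not the paper's numbers.

def dw_conv_macs(h: int, w: int, c: int, k: int) -> int:
    # Depth-wise convolution: one k x k filter per channel at every pixel.
    return h * w * c * k * k

# A 9x9 kernel on stride-4 features vs. the same kernel on stride-32
# features, assuming a 640x640 input.
high_res = dw_conv_macs(160, 160, 128, 9)   # ~265M MACs
low_res = dw_conv_macs(20, 20, 1024, 9)     # ~33M MACs
print(f"9x9 at stride 4: {high_res / 1e6:.0f}M MACs")
print(f"9x9 at stride 32: {low_res / 1e6:.0f}M MACs")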

3 Implementation

The MS-Block is implemented as follows:

from typing import Sequence, Union

import torch
import torch.nn as nn
from torch import Tensor

# ConvModule, OptConfigType, and the MODELS registry come from the
# OpenMMLab stack (mmcv / mmdet / mmyolo) that this snippet builds on;
# MSBlockLayer is the inverted-residual unit sketched in Section 2.1.
from mmcv.cnn import ConvModule
from mmdet.utils import OptConfigType
from mmyolo.registry import MODELS


class MSBlock(nn.Module):
    def __init__(self,
                 in_channel: int,
                 out_channel: int,
                 kernel_sizes: Sequence[Union[int, Sequence[int]]],
                 in_expand_ratio: float = 3.,
                 mid_expand_ratio: float = 2.,
                 layers_num: int = 3,
                 in_down_ratio: float = 1.,
                 attention_cfg: OptConfigType = None,
                 conv_cfg: OptConfigType = None,
                 norm_cfg: OptConfigType = dict(type='BN'),
                 act_cfg: OptConfigType = dict(type='SiLU', inplace=True),
                 ) -> None:
        super().__init__()

        # Expand the input channels, then optionally scale them down.
        # (Cast in_down_ratio to int so the channel count stays an integer.)
        self.in_channel = int(in_channel * in_expand_ratio) // int(in_down_ratio)
        # Each branch processes an equal slice of the expanded channels.
        self.mid_channel = self.in_channel // len(kernel_sizes)
        self.mid_expand_ratio = mid_expand_ratio
        # Expanded width inside each inverted-residual branch.
        groups = int(self.mid_channel * self.mid_expand_ratio)
        self.layers_num = layers_num

        self.attention = None
        if attention_cfg is not None:
            attention_cfg["dim"] = out_channel
            self.attention = MODELS.build(attention_cfg)

        
        # 1x1 convolution expanding the input to self.in_channel.
        self.in_conv = ConvModule(in_channel,
                                  self.in_channel,
                                  1,
                                  conv_cfg=conv_cfg,
                                  act_cfg=act_cfg,
                                  norm_cfg=norm_cfg)
        
        # Build one branch per kernel size; a kernel size of 1 keeps the
        # identity mapping (the X_1 branch in the formulation of Section 2.2).
        self.mid_convs = []
        for kernel_size in kernel_sizes:
            if kernel_size == 1:
                self.mid_convs.append(nn.Identity())
                continue
            # Stack layers_num inverted-residual layers for this branch.
            mid_convs = [MSBlockLayer(self.mid_channel,
                                      groups,
                                      kernel_size=kernel_size,
                                      conv_cfg=conv_cfg,
                                      act_cfg=act_cfg,
                                      norm_cfg=norm_cfg) for _ in range(int(self.layers_num))]
            self.mid_convs.append(nn.Sequential(*mid_convs))
        self.mid_convs = nn.ModuleList(self.mid_convs)
        # 1x1 convolution fusing the concatenated branch outputs.
        self.out_conv = ConvModule(self.in_channel,
                                   out_channel,
                                   1,
                                   conv_cfg=conv_cfg,
                                   act_cfg=act_cfg,
                                   norm_cfg=norm_cfg)
    
    def forward(self, x: Tensor) -> Tensor:
        """Forward process
        Args:
            x (Tensor): The input tensor.
        """
        out = self.in_conv(x)
        channels = []
        for i,mid_conv in enumerate(self.mid_convs):
            channel = out[:,i*self.mid_channel:(i + 1)*self.mid_channel,...]
            if i >= 1:
                channel = channel + channels[i-1]
            channel = mid_conv(channel)
            channels.append(channel)
        out = torch.cat(channels, dim=1)
        out = self.out_conv(out)
        if self.attention is not None:
            out = self.attention(out)
        return out
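A quick smoke test of the block, assuming the MSBlockLayer sketch from Section 2.1 and the imports above; the shapes and the three-branch configuration are illustrative:

# Three branches (identity + two 3x3 branches), as in the shallowest
# HKS stage sketched in Section 2.
block = MSBlock(in_channel=64, out_channel=128, kernel_sizes=[1, 3, 3])
x = torch.randn(2, 64, 80, 80)
out = block(x)
print(out.shape)  # torch.Size([2, 128, 80, 80])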