YOLOv5, YOLOv8 improvements: ConvNeXt (backbone changed to ConvNextBlock)

Table of Contents

1. Introduction

2. YOLOv5 modifies the backbone to ConvNeXt

2.1 Modify common.py

2.2 Modify yolo.py

2.3 Modify yolov5.yaml configuration

1. Introduction

Paper address: https://arxiv.org/abs/2201.03545
Official source code address: https://github.com/facebookresearch/ConvNeXt.git

Since ViT (Vision Transformer) began to shine in computer vision, more and more researchers have embraced the Transformer. Looking back over the past year, most papers published in the CV field were Transformer-based, for example Swin Transformer, the best paper of ICCV 2021, while convolutional neural networks slowly faded from center stage. Will convolutional neural networks be replaced by Transformers? Perhaps, in the near future. In January 2022, Facebook AI Research and UC Berkeley published the paper A ConvNet for the 2020s, which proposes ConvNeXt, a pure convolutional neural network. It is benchmarked against Swin Transformer, which was very popular in 2021. A series of experimental comparisons shows that, at comparable FLOPs, ConvNeXt has faster inference speed and higher accuracy than Swin Transformer. ConvNeXt-XL reaches 87.8% top-1 accuracy on ImageNet-1K after pre-training on ImageNet-22K (see Table 12 of the original paper). It seems that ConvNeXt has given convolutional neural networks a new lease of life.

ConvNeXt is a convolutional neural network model jointly proposed by Facebook AI Research and UC Berkeley. It is a pure convolutional neural network, composed of standard convolutional modules, and is characterized by high accuracy, high efficiency, strong scalability and a very simple design. The paper, titled “A ConvNet for the 2020s”, was published at CVPR 2022. ConvNeXt has been trained on the ImageNet-1K and ImageNet-22K datasets and achieves excellent performance on multiple tasks. The training code and pre-trained models are publicly available on GitHub.
ConvNeXt is an improvement built on ResNet50. Like Swin Transformer, it has 4 stages; the difference from ResNet50 is that ConvNeXt changes the ratio of the number of blocks per stage from 3:4:6:3 to 1:1:3:1, the same as Swin Transformer. In addition, for the initial feature map downsampling, ConvNeXt uses a 4×4 convolution with a stride of 4, consistent with Swin Transformer.
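The two design changes above can be checked with a little arithmetic. The following is only an illustrative sketch: the helper names stage_ratio and conv_out_size are made up for this example, the block counts (3, 3, 9, 3) are those of the ConvNeXt-T variant, and the 224-pixel input is an assumed example size.

```python
from functools import reduce
from math import gcd


def stage_ratio(depths):
    """Reduce per-stage block counts to their smallest integer ratio."""
    g = reduce(gcd, depths)
    return tuple(d // g for d in depths)


def conv_out_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


resnet50_depths = (3, 4, 6, 3)    # blocks per stage in ResNet50
convnext_t_depths = (3, 3, 9, 3)  # blocks per stage in ConvNeXt-T

print(stage_ratio(resnet50_depths))    # (3, 4, 6, 3)
print(stage_ratio(convnext_t_depths))  # (1, 1, 3, 1)

# A 4x4 convolution with stride 4 and no padding shrinks each side by 4x.
print(conv_out_size(224, kernel=4, stride=4))  # 56
```

Note that the YOLOv5 configuration later in this article does not use this 4×4 stem; it keeps YOLOv5's own strided Conv layers and only borrows the ConvNeXt block.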
Advantages of ConvNeXt include:
ConvNeXt is a pure convolutional neural network, which is composed of standard convolutional neural network modules and has the characteristics of high accuracy, high efficiency, strong scalability and very simple design.
ConvNeXt is trained on ImageNet-1K and ImageNet-22K datasets and achieves excellent performance on multiple tasks.
ConvNeXt borrows some advanced ideas from Transformer networks to adjust and improve the classic ResNet50/200 network, introducing recent Transformer ideas and techniques into existing CNN modules so as to combine the advantages of both and improve the performance of the CNN.
Disadvantages of ConvNeXt include:
ConvNeXt has not made major innovations in the overall network framework and construction ideas. It only makes some adjustments and improvements to the existing classic ResNet50/200 network based on some advanced ideas of the Transformer network.
ConvNeXt requires more computing resources in some cases relative to other CNN models.

2. YOLOv5 modifies the backbone to ConvNeXt

2.1 Modify common.py

Add the following code to common.py.

############## ConvNext ##############
import torch.nn.functional as F
class LayerNorm_s(nn.Module):
    """LayerNorm for channels_last (N, H, W, C) or channels_first (N, C, H, W) inputs."""

    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise NotImplementedError
        self.normalized_shape = (normalized_shape,)

    def forward(self, x):
        if self.data_format == "channels_last":
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            u = x.mean(1, keepdim=True)
            s = (x - u).pow(2).mean(1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x


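class ConvNextBlock(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> Linear (4x expand) -> GELU -> Linear (project), plus a residual connection."""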

    def __init__(self, dim, drop_path=0., layer_scale_init_value=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim) # depthwise conv
        self.norm = LayerNorm_s(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((dim)),
                                  requires_grad=True) if layer_scale_init_value > 0 else None
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x):
        input = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1) # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        if self.gamma is not None:
            x = self.gamma * x
        x = x.permute(0, 3, 1, 2) # (N, H, W, C) -> (N, C, H, W)

        x = input + self.drop_path(x)
        return x


class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    """

    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path_f(x, self.drop_prob, self.training)


def drop_path_f(x, drop_prob: float = 0., training: bool = False):
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_() # binarize
    output = x.div(keep_prob) * random_tensor
    return output


class CNeB(nn.Module):
    # CSP ConvNextBlock with 3 convolutions by iscyy/yoloair
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5): # ch_in, ch_out, number, shortcut, groups, expansion
        # Note: shortcut and g are accepted but not used by this block.
        super().__init__()
        c_ = int(c2 * e) # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*(ConvNextBlock(c_) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
############## ConvNext ##############
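
LayerNorm_s above has two code paths, and ConvNextBlock only uses the default channels_last path, which is why its forward method permutes the tensor before and after the normalization. The sketch below is a standalone check, not part of common.py, that the hand-written channels_first formula gives the same result as permuting and calling F.layer_norm. The function name layer_norm_channels_first and the tensor sizes are chosen only for this example.

```python
import torch
import torch.nn.functional as F


def layer_norm_channels_first(x, weight, bias, eps=1e-6):
    """Normalize over the channel axis (dim 1) of an (N, C, H, W) tensor."""
    u = x.mean(1, keepdim=True)
    s = (x - u).pow(2).mean(1, keepdim=True)  # biased variance, as in LayerNorm
    x = (x - u) / torch.sqrt(s + eps)
    return weight[:, None, None] * x + bias[:, None, None]


torch.manual_seed(0)
x = torch.randn(2, 8, 5, 5)                  # (N, C, H, W)
weight, bias = torch.rand(8), torch.rand(8)  # per-channel scale and shift

out_first = layer_norm_channels_first(x, weight, bias)

# Same computation via the channels_last path: permute, normalize, permute back.
out_last = F.layer_norm(x.permute(0, 2, 3, 1), (8,), weight, bias, 1e-6)
out_last = out_last.permute(0, 3, 1, 2)

max_diff = (out_first - out_last).abs().max().item()
print(max_diff)  # expected to be at floating-point noise level
```

If the two outputs agree, either data_format can be used as long as the tensor layout matches it.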

2.2 Modify yolo.py

In the parse_model function of yolo.py, add CNeB to the end of the following module list.

        if m in [Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
                 BottleneckCSP, C3, C3TR, C3SPP, C3Ghost, CNeB]:
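
Why this registration is needed can be sketched as follows. This is a simplified stand-in, assuming the module list belongs to the branch of parse_model that prepends the input channel count to a layer's yaml arguments. The function resolve_args and the REGISTERED set are hypothetical names for this example, and the width scaling of the output channels is deliberately left out.

```python
# Hypothetical, simplified model of the argument handling in parse_model.
REGISTERED = {"Conv", "C3", "SPPF", "CNeB"}


def resolve_args(module_name, args, c_in, registered=REGISTERED):
    """Return the positional arguments a module would be constructed with."""
    if module_name in registered:
        c_out = args[0]
        return [c_in, c_out, *args[1:]]  # prepend the input channels
    return list(args)  # unregistered modules get the yaml args unchanged


# Registered: the yaml entry [128] becomes CNeB(128, 128).
print(resolve_args("CNeB", [128], c_in=128))                       # [128, 128]
# Head entry [256, False] after a Concat with 512 channels.
print(resolve_args("CNeB", [256, False], c_in=512))                # [512, 256, False]
# Not registered: CNeB would receive only [128] and miss its c2 argument.
print(resolve_args("CNeB", [128], c_in=128, registered={"Conv"}))  # [128]
```

In the YOLOv5 versions where a second, shorter list (BottleneckCSP, C3, ...) controls whether the repeat count is passed into the module, CNeB is not added there in this article. Under that assumption, a repeat count greater than 1 stacks several CNeB modules instead of setting CNeB's n argument. It is worth checking this against the yolo.py you actually have.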

2.3 Modify yolov5.yaml configuration

# YOLOv5 by Ultralytics, GPL-3.0 license

# Parameters
nc: 80 # number of classes
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.25 # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23] # P3/8
  - [30,61, 62,45, 59,119] # P4/16
  - [116,90, 156,198, 373,326] # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]], # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]], # 1-P2/4
   [-1, 3, CNeB, [128]],
   [-1, 1, Conv, [256, 3, 2]], # 3-P3/8
   [-1, 6, CNeB, [256]],
   [-1, 1, Conv, [512, 3, 2]], # 5-P4/16
   [-1, 9, CNeB, [512]],
   [-1, 1, Conv, [1024, 3, 2]], # 7-P5/32
   [-1, 3, CNeB, [1024]],
   [-1, 1, SPPF, [1024, 5]], # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]], # cat backbone P4
   [-1, 3, C3, [512, False]], # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]], # cat backbone P3
   [-1, 3, CNeB, [256, False]], # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]], # cat head P4
   [-1, 3, CNeB, [512, False]], # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]], # cat head P5
   [-1, 3, CNeB, [1024, False]], # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]], # Detect(P3, P4, P5)
  ]
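
The repeat counts and channel numbers written in the configuration are not the final ones: they are scaled by depth_multiple and width_multiple. The sketch below estimates the effective values for the four backbone CNeB layers. It assumes the rounding rules used by YOLOv5's parse_model (repeats rounded with a minimum of 1, channels rounded up to a multiple of 8). The helper scale_layer is a hypothetical name, and make_divisible is re-implemented here rather than imported.

```python
import math


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def scale_layer(number, channels, depth_multiple, width_multiple):
    """Apply depth and width multipliers to one yaml layer entry."""
    n = max(round(number * depth_multiple), 1) if number > 1 else number
    c = make_divisible(channels * width_multiple, 8)
    return n, c


# (number, channels) of the backbone CNeB layers in the configuration above
backbone_cneb = [(3, 128), (6, 256), (9, 512), (3, 1024)]

scaled = [scale_layer(n, c, depth_multiple=0.33, width_multiple=0.25)
          for n, c in backbone_cneb]
print(scaled)  # [(1, 32), (2, 64), (3, 128), (1, 256)]
```

With these multipliers every head layer written with a repeat count of 3 is built once, so raising depth_multiple changes how many CNeB modules are stacked there.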

The YOLOv7 and YOLOv8 versions will be added in a future update.