[YOLO Improvement] Adding the Efficient Decoupled Head

Contents

  • Theoretical background
    • Understanding the decoupled head in YOLOX
    • The decoupled head in YOLOv6
  • Adding YOLOv6's Efficient Decoupled Head to YOLOv5
  • Experiment
    • VisDrone dataset
    • Experimental results

Theoretical background

Understanding the decoupled head in YOLOX

The decoupled head was proposed in YOLOX. The original YOLO detection head is coupled: the output channels of a single convolution carry three kinds of predictions at once, namely objectness confidence, class scores, and predicted box coordinates. A decoupled head predicts these three tasks with separate branches instead of a single shared output, which can further improve detection accuracy.
In the CVPR 2020 paper "Revisiting the Sibling Head in Object Detector", the authors argue that localization and classification in object detection focus on different things. Classification cares about which of the known categories the extracted features most resemble, while localization cares about the coordinates of the ground-truth box and about refining the bounding-box parameters. Using the same feature map for both tasks (a coupled head) therefore gives suboptimal results.

A similar point was made in the paper "Rethinking Classification and Localization for Object Detection": an fc-head is better suited to classification, while a conv-head is better suited to localization.
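
As a minimal sketch of the difference (illustrative PyTorch only, not YOLOX's actual modules): a coupled YOLO head maps the feature map to all na × (nc + 5) outputs with a single 1×1 convolution, while a decoupled head uses separate branches for classification, box regression, and objectness:

import torch.nn as nn

nc, na, c = 80, 3, 256 # classes, anchors per level, input channels (assumed values)

# Coupled head: one 1x1 conv carries box (4) + objectness (1) + class (nc) channels per anchor
coupled_head = nn.Conv2d(c, na * (nc + 5), 1)

# Decoupled head (simplified): separate 3x3 + 1x1 branches per task
cls_branch = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU(), nn.Conv2d(c, na * nc, 1))
reg_branch = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU(), nn.Conv2d(c, na * 4, 1))
obj_branch = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU(), nn.Conv2d(c, na * 1, 1))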

The decoupled head in YOLOv6

YOLOv6 streamlines the decoupled head design. Balancing the representational capability of the relevant operators against the computational overhead on hardware, it uses a Hybrid Channels strategy to redesign a more efficient decoupled head structure, reducing latency while maintaining accuracy and mitigating the extra latency introduced by the 3×3 convolutions in the decoupled head. In an ablation study on a nano-size model, compared with a decoupled head using the same number of channels, accuracy improved by 0.2% AP and speed by 6.8%. Specifically, there are two changes:

  • Reduce the last two 3×3 convolutional layers in the decoupled head to a single layer;
  • Make the output channels of this 3×3 convolutional layer the same as its input channels (a rough parameter comparison follows below).
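
The following back-of-the-envelope calculation illustrates why dropping one 3×3 convolution helps; the channel width C = 128 and the assumption that both 3×3 convolutions keep C channels are illustrative only, not YOLOv6's exact configuration:

# Rough weight-parameter count for one classification branch (bias/BN terms ignored)
C, nc, na = 128, 80, 3

two_convs = 2 * (3 * 3 * C * C) + (1 * 1 * C * na * nc) # two 3x3 convs + 1x1 prediction conv
one_conv = 1 * (3 * 3 * C * C) + (1 * 1 * C * na * nc)  # single 3x3 conv + 1x1 prediction conv

print(f"two 3x3 convs: {two_convs:,} weights") # 325,632
print(f"one 3x3 conv:  {one_conv:,} weights")  # 178,176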

Adding YOLOv6's Efficient Decoupled Head to YOLOv5

Add the following Decoupled_Detect class to yolo.py:

class Decoupled_Detect(nn.Module):
    # YOLOv5 Detect head for detection models
    stride = None # strides computed during build
    dynamic = False # force grid reconstruction
    export = False # export mode

    def __init__(self, nc=80, anchors=(), ch=(), inplace=True): # detection layer
        super().__init__()
        self.nc = nc # number of classes
        self.no = nc + 5 # number of outputs per anchor
        self.nl = len(anchors) # number of detection layers
        self.na = len(anchors[0]) // 2 # number of anchors
        self.grid = [torch.empty(0) for _ in range(self.nl)] # init grid
        self.anchor_grid = [torch.empty(0) for _ in range(self.nl)] # init anchor grid
        self.register_buffer('anchors', torch.tensor(anchors).float().view(self.nl, -1, 2)) # shape(nl,na,2)
        
        self.m_stem = nn.ModuleList(Conv(x, x, 1) for x in ch) # stem conv
        self.m_cls = nn.ModuleList(nn.Sequential(Conv(x, x, 3), nn.Conv2d(x, self.na * self.nc, 1)) for x in ch) # cls conv
        self.m_reg_conf = nn.ModuleList(Conv(x, x, 3) for x in ch) # reg_conf stem conv
        self.m_reg = nn.ModuleList(nn.Conv2d(x, self.na * 4, 1) for x in ch) # reg conv
        self.m_conf = nn.ModuleList(nn.Conv2d(x, self.na * 1, 1) for x in ch) # conf conv
        
        self.inplace = inplace # use inplace ops (e.g. slice assignment)

    def forward(self, x):
        z = [] # inference output
        for i in range(self.nl):
            x[i] = self.m_stem[i](x[i]) # conv
            
            bs, _, ny, nx = x[i].shape
            x_cls = self.m_cls[i](x[i]).view(bs, self.na, self.nc, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
            x_reg_conf = self.m_reg_conf[i](x[i])
            x_reg = self.m_reg[i](x_reg_conf).view(bs, self.na, 4, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
            x_conf = self.m_conf[i](x_reg_conf).view(bs, self.na, 1, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
            x[i] = torch.cat([x_reg, x_conf, x_cls], dim=4)

            if not self.training: # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

                if isinstance(self, Segment): # (boxes + masks)
                    xy, wh, conf, mask = x[i].split((2, 2, self.nc + 1, self.no - self.nc - 5), 4)
                    xy = (xy.sigmoid() * 2 + self.grid[i]) * self.stride[i] # xy
                    wh = (wh.sigmoid() * 2) ** 2 * self.anchor_grid[i] # wh
                    y = torch.cat((xy, wh, conf.sigmoid(), mask), 4)
                else: # Detect (boxes only)
                    xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                    xy = (xy * 2 + self.grid[i]) * self.stride[i] # xy
                    wh = (wh * 2) ** 2 * self.anchor_grid[i] # wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))

        return x if self.training else (torch.cat(z, 1),) if self.export else (torch.cat(z, 1), x)

    def _make_grid(self, nx=20, ny=20, i=0, torch_1_10=check_version(torch.__version__, '1.10.0')):
        d = self.anchors[i].device
        t = self.anchors[i].dtype
        shape = 1, self.na, ny, nx, 2 # grid shape
        y, x = torch.arange(ny, device=d, dtype=t), torch.arange(nx, device=d, dtype=t)
        yv, xv = torch.meshgrid(y, x, indexing='ij') if torch_1_10 else torch.meshgrid(y, x) # torch>=0.7 compatibility
        grid = torch.stack((xv, yv), 2).expand(shape) - 0.5 # add grid offset, i.e. y = 2.0 * x - 0.5
        anchor_grid = (self.anchors[i] * self.stride[i]).view((1, self.na, 1, 1, 2)).expand(shape)
        return grid, anchor_grid
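
A quick sanity check of the training-mode output shapes (a sketch that assumes the class above has been added to models/yolo.py of a YOLOv5 checkout, so that Conv and check_version resolve, and that the script is run from the repository root):

import torch
from models.yolo import Decoupled_Detect

anchors = ([10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326])
head = Decoupled_Detect(nc=80, anchors=anchors, ch=(128, 256, 512)) # modules default to train mode

feats = [torch.randn(1, c, s, s) for c, s in zip((128, 256, 512), (80, 40, 20))] # P3/P4/P5 maps
for out in head(feats):
    print(out.shape) # (1, 3, ny, nx, 85) per detection layer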

In addition, add the following code to the BaseModel class and the DetectionModel class in yolo.py:

# -------------------------- Add to the BaseModel class --------------------------
def _apply(self, fn):
    # Apply to(), cpu(), cuda(), half() to model tensors that are not parameters or registered buffers
    self = super()._apply(fn)
    m = self.model[-1] # Detect()
    if isinstance(m, (Detect, Decoupled_Detect, Segment)):
        m.stride = fn(m.stride)
        m.grid = list(map(fn, m.grid))
        m.anchor_grid = list(map(fn, m.anchor_grid))
    return self

# -------------------------- Added to DetectionModel class --------------------------
if isinstance(m, (Detect, Decoupled_Detect, Segment)):
    s = 256 # 2x min stride
    m.inplace = self.inplace
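
For context, this fragment is the stride/anchor setup block in DetectionModel.__init__; only the isinstance tuple changes. The surrounding lines, reproduced below from the upstream v7.0 yolo.py for reference (verify against your checkout), stay as they are:

m = self.model[-1] # Detect()
if isinstance(m, (Detect, Decoupled_Detect, Segment)):
    s = 256 # 2x min stride
    m.inplace = self.inplace
    forward = lambda x: self.forward(x)[0] if isinstance(m, Segment) else self.forward(x)
    m.stride = torch.tensor([s / x.shape[-2] for x in forward(torch.zeros(1, ch, s, s))]) # forward
    check_anchor_order(m)
    m.anchors /= m.stride.view(-1, 1, 1)
    self.stride = m.stride
    self._initialize_biases() # only run once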

Finally, modify the _initialize_biases method of the DetectionModel class:

def _initialize_biases(self, cf=None): # initialize biases into Detect(), cf is class frequency
    # https://arxiv.org/abs/1708.02002 section 3.3
    # cf = torch.bincount(torch.tensor(np.concatenate(dataset.labels, 0)[:, 0]).long(), minlength=nc) + 1.
    m = self.model[-1] # Detect() module

    if isinstance(m, Detect):
        for mi, s in zip(m.m, m.stride): # from
            b = mi.bias.view(m.na, -1) # conv.bias(255) to (3,85)
            b.data[:, 4] += math.log(8 / (640 / s) ** 2) # obj (8 objects per 640 image)
            b.data[:, 5:5 + m.nc] += math.log(0.6 / (m.nc - 0.99999)) if cf is None else torch.log(
                cf / cf.sum()) # cls
            mi.bias = torch.nn.Parameter(b.view(-1), requires_grad=True)
    elif isinstance(m, Decoupled_Detect):
        for mi, s in zip(m.m_conf, m.stride): # objectness branch
            b = mi.bias.view(m.na, -1) # conv.bias(na*1) to (na,1)
            b.data += math.log(8 / (640 / s) ** 2) # obj (8 objects per 640 image)
            mi.bias = torch.nn.Parameter(b.view(-1), requires_grad=True)

        for mi, s in zip(m.m_cls, m.stride): # classification branch
            b = mi[-1].bias.view(m.na, -1) # conv.bias(na*nc) to (na,nc)
            b.data += math.log(0.6 / (m.nc - 0.99999)) if cf is None else torch.log(cf / cf.sum()) # cls
            mi[-1].bias = torch.nn.Parameter(b.view(-1), requires_grad=True)
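
For intuition, the short snippet below evaluates the priors that these bias initializations encode, using only the formulas already present in the code above (the interpretations in the comments paraphrase the inline comments there):

import math

# Objectness prior: roughly 8 objects expected per 640x640 image, spread over (640/s)^2 grid cells
for s in (8, 16, 32):
    print(f"stride {s}: obj bias = {math.log(8 / (640 / s) ** 2):.3f}")

# Class prior: about 0.6 positive mass spread across the remaining classes
nc = 80
print(f"cls bias = {math.log(0.6 / (nc - 0.99999)):.3f}")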

Modify the original yolov5s.yaml file, changing Detect to Decoupled_Detect:

# YOLOv5 by Ultralytics, AGPL-3.0 license

# Parameters
nc: 80 # number of classes
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.50 # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23] # P3/8
  - [30,61, 62,45, 59,119] # P4/16
  - [116,90, 156,198, 373,326] # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]], # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]], # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]], # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]], # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]], # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]], # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]], # cat backbone P4
   [-1, 3, C3, [512, False]], # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]], # cat backbone P3
   [-1, 3, C3, [256, False]], # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]], # cat head P4
   [-1, 3, C3, [512, False]], # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]], # cat head P5
   [-1, 3, C3, [1024, False]], # 23 (P5/32-large)

   [[17, 20, 23], 1, Decoupled_Detect, [nc, anchors]], # Detect(P3, P4, P5)
  ]
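
Depending on the YOLOv5 version, one more hook may be needed: parse_model() in yolo.py only appends the input-channel list and expands the anchors for the head modules it recognizes, so Decoupled_Detect likely has to be added to that check as well. A sketch against the v7.0 parse_model (adapt to your checkout):

# inside parse_model() in models/yolo.py
elif m in {Detect, Decoupled_Detect, Segment}:
    args.append([ch[x] for x in f])
    if isinstance(args[1], int): # number of anchors
        args[1] = [list(range(args[1] * 2))] * len(f)
    if m is Segment:
        args[3] = make_divisible(args[3] * gw, 8)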

Experiment

VisDrone Dataset

VisDrone dataset link
The VisDrone dataset was released in 2018 and expanded in 2019 by Tianjin University and collaborators, motivated by the wide range of applications for drones. It provides 10,209 images for object detection: 6,471 for training, 548 for validation, and 3,190 for testing. It also provides 96 video clips, of which 56 are for training (24,201 frames in total), 7 for validation (2,819 frames), and 33 for testing (12,968 frames). Compared with other datasets, VisDrone has complex scenes, large variation in object scale, many small and densely packed objects, and severe occlusion, which places higher demands on the detection algorithm.

Experimental results

The layer summary of the modified network is shown below, followed by the head and totals of the original YOLOv5s for comparison:

 from n params module arguments
  0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
  1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
  2 -1 1 18816 models.common.C3 [64, 64, 1]
  3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
  4 -1 2 115712 models.common.C3 [128, 128, 2]
  5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
  6 -1 3 625152 models.common.C3 [256, 256, 3]
  7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
  8 -1 1 1182720 models.common.C3 [512, 512, 1]
  9 -1 1 656896 models.common.SPPF [512, 512, 5]
 10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
 11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
 12 [-1, 6] 1 0 models.common.Concat [1]
 13 -1 1 361984 models.common.C3 [512, 256, 1, False]
 14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
 15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
 16 [-1, 4] 1 0 models.common.Concat [1]
 17 -1 1 90880 models.common.C3 [256, 128, 1, False]
 18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
 19 [-1, 14] 1 0 models.common.Concat [1]
 20 -1 1 296448 models.common.C3 [256, 256, 1, False]
 21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
 22 [-1, 10] 1 0 models.common.Concat [1]
 23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
 24 [17, 20, 23] 1 6771837 Decoupled_Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
YOLOv5sEDhead summary: 254 layers, 13777981 parameters, 13777981 gradients, 28.6 GFLOPs

 24 [17, 20, 23] 1 229245 Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
YOLOv5s summary: 214 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs

Compared with YOLOv5s, the parameter count and computation of the modified network increase significantly; this is the inherent cost of decoupling the head.

On the VisDrone dataset, YOLOv5s achieves 32.9% mAP. With the decoupled head, mAP rises noticeably to 34.7%, demonstrating the effectiveness of decoupling the head. However, the model size grows from the original 14 MB to 26.2 MB. The model becomes too large to strike a good balance between parameter count and accuracy, making it unsuitable for edge computing platforms (Jetson, etc.).

Trained model download address, extraction code: pwx6