Directory
- Theoretical knowledge
  - Understanding the decoupled head in YOLOX
  - Decoupled head in YOLOv6
  - Using the Efficient Decoupled Head from YOLOv6 in YOLOv5
- Experiment
  - VisDrone dataset
  - Experimental results
Theoretical knowledge
Understanding the decoupled head in YOLOX
The decoupled head was proposed in YOLOX. The original YOLO detection head is coupled: a single convolution produces every output, and its channels mix three things: objectness confidence, class scores, and predicted box coordinates. A decoupled head predicts these three tasks in separate branches instead of all at once, which further improves detection accuracy.
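As a minimal sketch (my own illustration, not YOLOX's code) of the channel arithmetic behind the two designs, assuming `na` anchors per location and `nc` classes:

```python
import torch.nn as nn

na, nc, c = 3, 80, 256  # anchors per location, classes, input channels

# Coupled head: one 1x1 conv emits box (4) + objectness (1) + classes (nc) together
coupled = nn.Conv2d(c, na * (5 + nc), 1)

# Decoupled head: separate branches for classification, regression, and objectness
cls_branch = nn.Conv2d(c, na * nc, 1)  # class scores
reg_branch = nn.Conv2d(c, na * 4, 1)   # box coordinates
obj_branch = nn.Conv2d(c, na * 1, 1)   # objectness confidence
```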
In the CVPR 2020 paper “Revisiting the Sibling Head in Object Detector”, the authors argue that the classification and localization tasks in object detection focus on different things: classification cares about which known category the extracted features most resemble, while localization cares about the coordinates of the ground-truth box so it can refine the bounding-box parameters. Using the same feature map (a coupled head) for both classification and localization therefore yields poor results.

A similar point was made in the paper “Rethinking Classification and Localization for Object Detection”: an fc-head is better suited to classification, while a conv-head is better suited to localization.
Decoupled head in YOLOv6
YOLOv6 streamlined the decoupled head design: weighing the representational capacity of the operators against hardware compute overhead, it redesigned a more efficient decoupled head structure using a Hybrid Channels strategy. This maintains accuracy while reducing latency, mitigating the extra latency brought by the 3×3 convolutions in the decoupled head. In ablation experiments on a nano-size model, compared with a decoupled head with the same number of channels, accuracy improved by 0.2% AP and speed by 6.8%. Specifically, there are two changes (see the sketch after this list):
- Reduce the last two 3×3 convolutional layers in the decoupled head to a single layer;
- Make the output channels of that 3×3 convolutional layer match its input channels.
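A minimal sketch of one scale level of this structure (my own illustration, mirroring the Decoupled_Detect module shown later): a shared 1×1 stem, a single 3×3 convolution per branch whose output channels equal its input channels, and 1×1 prediction convolutions:

```python
import torch.nn as nn

c, na, nc = 128, 3, 80  # input channels at this level, anchors, classes

stem = nn.Conv2d(c, c, 1)                 # shared 1x1 stem
cls_conv = nn.Conv2d(c, c, 3, padding=1)  # single 3x3, out channels == in channels
reg_conv = nn.Conv2d(c, c, 3, padding=1)  # single 3x3 shared by reg and obj
cls_pred = nn.Conv2d(c, na * nc, 1)       # class scores
reg_pred = nn.Conv2d(c, na * 4, 1)        # box coordinates
obj_pred = nn.Conv2d(c, na * 1, 1)        # objectness confidence
```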
Using the Efficient Decoupled Head from YOLOv6 in YOLOv5
Introduce the following Decoupled_Detect class into yolo.py:
```python
class Decoupled_Detect(nn.Module):
    # YOLOv5 Detect head for detection models
    stride = None  # strides computed during build
    dynamic = False  # force grid reconstruction
    export = False  # export mode

    def __init__(self, nc=80, anchors=(), ch=(), inplace=True):  # detection layer
        super().__init__()
        self.nc = nc  # number of classes
        self.no = nc + 5  # number of outputs per anchor
        self.nl = len(anchors)  # number of detection layers
        self.na = len(anchors[0]) // 2  # number of anchors
        self.grid = [torch.empty(0) for _ in range(self.nl)]  # init grid
        self.anchor_grid = [torch.empty(0) for _ in range(self.nl)]  # init anchor grid
        self.register_buffer('anchors', torch.tensor(anchors).float().view(self.nl, -1, 2))  # shape(nl,na,2)
        self.m_stem = nn.ModuleList(Conv(x, x, 1) for x in ch)  # stem conv
        self.m_cls = nn.ModuleList(nn.Sequential(Conv(x, x, 3), nn.Conv2d(x, self.na * self.nc, 1)) for x in ch)  # cls conv
        self.m_reg_conf = nn.ModuleList(Conv(x, x, 3) for x in ch)  # reg_conf stem conv
        self.m_reg = nn.ModuleList(nn.Conv2d(x, self.na * 4, 1) for x in ch)  # reg conv
        self.m_conf = nn.ModuleList(nn.Conv2d(x, self.na * 1, 1) for x in ch)  # conf conv
        self.inplace = inplace  # use inplace ops (e.g. slice assignment)

    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m_stem[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape
            x_cls = self.m_cls[i](x[i]).view(bs, self.na, self.nc, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
            x_reg_conf = self.m_reg_conf[i](x[i])
            x_reg = self.m_reg[i](x_reg_conf).view(bs, self.na, 4, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
            x_conf = self.m_conf[i](x_reg_conf).view(bs, self.na, 1, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
            x[i] = torch.cat([x_reg, x_conf, x_cls], dim=4)

            if not self.training:  # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

                if isinstance(self, Segment):  # (boxes + masks)
                    xy, wh, conf, mask = x[i].split((2, 2, self.nc + 1, self.no - self.nc - 5), 4)
                    xy = (xy.sigmoid() * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh.sigmoid() * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf.sigmoid(), mask), 4)
                else:  # Detect (boxes only)
                    xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                    xy = (xy * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))

        return x if self.training else (torch.cat(z, 1),) if self.export else (torch.cat(z, 1), x)

    def _make_grid(self, nx=20, ny=20, i=0, torch_1_10=check_version(torch.__version__, '1.10.0')):
        d = self.anchors[i].device
        t = self.anchors[i].dtype
        shape = 1, self.na, ny, nx, 2  # grid shape
        y, x = torch.arange(ny, device=d, dtype=t), torch.arange(nx, device=d, dtype=t)
        yv, xv = torch.meshgrid(y, x, indexing='ij') if torch_1_10 else torch.meshgrid(y, x)  # torch>=0.7 compatibility
        grid = torch.stack((xv, yv), 2).expand(shape) - 0.5  # add grid offset, i.e. y = 2.0 * x - 0.5
        anchor_grid = (self.anchors[i] * self.stride[i]).view((1, self.na, 1, 1, 2)).expand(shape)
        return grid, anchor_grid
```
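As a quick sanity check, the head can be exercised with dummy feature maps. This is a minimal sketch assuming the class above has been added to models/yolo.py; in training mode the head returns the raw per-level tensors, so no stride or grid setup is required:

```python
import torch
from models.yolo import Decoupled_Detect  # assumes the class above lives in models/yolo.py

anchors = ([10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326])
head = Decoupled_Detect(nc=80, anchors=anchors, ch=(128, 256, 512)).train()
feats = [torch.randn(1, c, s, s) for c, s in ((128, 80), (256, 40), (512, 20))]
for out in head(feats):
    print(out.shape)  # (1, 3, ny, nx, 85) at each level, i.e. (bs, na, ny, nx, 5 + nc)
```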
In addition, add the following code to the BaseModel and DetectionModel classes in yolo.py. The _apply override keeps non-parameter tensors such as the stride and grids in sync when the model is moved with to(), cpu(), cuda(), or half():
```python
# --- Add to the BaseModel class ---
def _apply(self, fn):
    # Apply to(), cpu(), cuda(), half() to model tensors that are not parameters or registered buffers
    self = super()._apply(fn)
    m = self.model[-1]  # Detect()
    if isinstance(m, (Detect, Decoupled_Detect, Segment)):
        m.stride = fn(m.stride)
        m.grid = list(map(fn, m.grid))
        if isinstance(m.anchor_grid, list):
            m.anchor_grid = list(map(fn, m.anchor_grid))
    return self


# --- In DetectionModel.__init__, extend the isinstance check to cover the new head ---
if isinstance(m, (Detect, Decoupled_Detect, Segment)):
    s = 256  # 2x min stride
    m.inplace = self.inplace
    ...  # the rest of the original stride/anchor setup is unchanged
```
Finally, modify the _initialize_biases function in the DetectionModel class:
```python
def _initialize_biases(self, cf=None):  # initialize biases into Detect(), cf is class frequency
    # https://arxiv.org/abs/1708.02002 section 3.3
    # cf = torch.bincount(torch.tensor(np.concatenate(dataset.labels, 0)[:, 0]).long(), minlength=nc) + 1.
    m = self.model[-1]  # Detect() module
    if isinstance(m, Detect):
        for mi, s in zip(m.m, m.stride):  # from
            b = mi.bias.view(m.na, -1)  # conv.bias(255) to (3,85)
            b.data[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
            b.data[:, 5:5 + m.nc] += math.log(0.6 / (m.nc - 0.99999)) if cf is None else torch.log(cf / cf.sum())  # cls
            mi.bias = torch.nn.Parameter(b.view(-1), requires_grad=True)
    elif isinstance(m, Decoupled_Detect):
        for mi, s in zip(m.m_conf, m.stride):  # objectness branch
            b = mi.bias.view(m.na, -1)  # conv.bias(255) to (3,85)
            b.data += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
            mi.bias = torch.nn.Parameter(b.view(-1), requires_grad=True)
        for mi, s in zip(m.m_cls, m.stride):  # classification branch
            b = mi[-1].bias.view(m.na, -1)  # conv.bias(255) to (3,85)
            b.data += math.log(0.6 / (m.nc - 0.99999)) if cf is None else torch.log(cf / cf.sum())  # cls
            mi[-1].bias = torch.nn.Parameter(b.view(-1), requires_grad=True)
```
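For intuition, the objectness prior assumes roughly 8 objects per 640×640 image spread over (640/s)² grid cells at stride s; a short worked check of the bias values added above:

```python
import math

for s in (8, 16, 32):  # strides of P3, P4, P5
    print(f'stride {s:2d}: obj bias prior = {math.log(8 / (640 / s) ** 2):.2f}')
# stride  8: obj bias prior = -6.68
# stride 16: obj bias prior = -5.30
# stride 32: obj bias prior = -3.91
```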
Modify the original yolov5s.yaml file, changing Detect to Decoupled_Detect:
```yaml
# YOLOv5 by Ultralytics, AGPL-3.0 license

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Decoupled_Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
```
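Depending on the YOLOv5 release, parse_model() in yolo.py must also recognize the new head so that the input channels of the source layers are appended to its arguments. A hedged sketch of the relevant branch (the surrounding code varies between versions):

```python
# Inside parse_model() in models/yolo.py
if m in {Detect, Decoupled_Detect, Segment}:
    args.append([ch[x] for x in f])  # append input channels of the source layers
    if isinstance(args[1], int):  # number of anchors given as an int
        args[1] = [list(range(args[1] * 2))] * len(f)
```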
Experiment
VisDrone dataset
VisDrone dataset link
The VisDrone dataset was released in 2018 and expanded in 2019 by Tianjin University and collaborators; drones are in wide demand across many application fields. The dataset provides 10,209 images for object detection, of which 6,471 are used for training, 548 for validation, and 3,190 for testing. It also provides 96 video clips for detection: 56 for training (24,201 frames in total), 7 for validation (2,819 frames in total) and 33 for testing (12,968 frames in total). Compared with other benchmarks, VisDrone has complex scenes, large variation in target scale, many small and densely packed targets, and severe occlusion, all of which raise the demands on the detection algorithm.
Experimental results
First, compute and compare the parameter counts of the original and modified networks:
```
                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1   6771837  Decoupled_Detect                        [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
YOLOv5sEDhead summary: 254 layers, 13777981 parameters, 13777981 gradients, 28.6 GFLOPs

For comparison, the original coupled head:
 24      [17, 20, 23]  1    229245  Detect                                  [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
YOLOv5s summary: 214 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs
```
Compared with YOLOv5s, the parameter count and computational cost of the modified network increase significantly; this is the inherent cost of decoupling.

YOLOv5s reaches 32.9% mAP on the VisDrone dataset. With the decoupled head, mAP rises noticeably to 34.7%, demonstrating the effectiveness of the decoupled head. However, the model size grows from the original 14 MB to 26.2 MB. The model has become too large to strike a good balance between parameter count and accuracy, making it unsuitable for edge computing platforms (Jetson, etc.).
Trained model download address, extraction code: pwx6