The FPN outputs several feature levels, and each of them goes through both a classification head and a regression head.
In the classification head, the features of each level pass through 4 convolution + ReLU layers and then through the head's output convolution:
self.output = nn.Conv2d(feature_size, num_anchors * num_classes, kernel_size=3, padding=1)
self.output_act = nn.Sigmoid()
The final prediction of each level then has the following shape:
torch.Size([1, 14400, 80])
torch.Size([1, 3600, 80])
torch.Size([1, 900, 80])
torch.Size([1, 225, 80])
torch.Size([1, 81, 80])
Here 14400 = 40 × 40 × 9, where 9 is the number of anchors per position. All the results are then concatenated into a single tensor of shape [1, 19206, 80]. In other words, 9 anchors are predicted at each feature map position, and each anchor has 80 class scores. The concatenation puts the predictions in the same layout as the anchors, which simplifies both the loss computation and forward prediction. Note that the activation function here is sigmoid(). If you want to use softmax() instead, you need to add one more category (background). According to the paper, sigmoid() works better than softmax() here.
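The per-level counts can be checked with a few lines of arithmetic. The sketch below assumes a 320 × 320 input (inferred from the 40 × 40 map at stride 8), strides of 8 to 128, and feature map sizes that are rounded up. `anchors_per_level` is a hypothetical helper written for illustration, not part of the reference code.

```python
import math


def anchors_per_level(image_size, strides=(8, 16, 32, 64, 128), num_anchors=9):
    """Number of anchors on each pyramid level for a square input image."""
    counts = []
    for stride in strides:
        # Assumption: the feature map side is ceil(image_size / stride).
        side = math.ceil(image_size / stride)
        counts.append(side * side * num_anchors)
    return counts


counts = anchors_per_level(320)  # assumed 320 x 320 input
total = sum(counts)
print(counts, total)
```

The total is the second dimension of the concatenated tensor, so it is the same for the classification and the regression outputs.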
The regression head is similar to the classification head: 4 convolution + ReLU layers followed by an output convolution. Since this is a regression problem, no activation is applied to the output.
self.output = nn.Conv2d(feature_size, num_anchors * 4, kernel_size=3, padding=1)
The output shapes become:
torch.Size([1, 14400, 4])
torch.Size([1, 3600, 4])
torch.Size([1, 900, 4])
torch.Size([1, 225, 4])
torch.Size([1, 81, 4])
Finally, all the results are concatenated into a tensor of shape [1, 19206, 4], where the 4 values describe the center point, width and height of the predicted box, expressed relative to its anchor.
The large feature maps predict small objects, and the small feature maps predict large objects. FPN has 5 outputs, so there are 5 base anchor sizes, and each of them comes in 9 shapes (3 aspect ratios × 3 scales).
First define the level of the feature map:
self.pyramid_levels = [3, 4, 5, 6, 7]
Get the corresponding stride as:
self.strides = [2 ** x for x in self.pyramid_levels] # [8,16,32,64,128]
Get the base size on each layer:
self.sizes = [2 ** (x + 2) for x in self.pyramid_levels] # [32,64,128,256,512]
Combine 3 aspect ratios with 3 scales to get 9 anchors:
ratios = np.array([0.5, 1, 2])
scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])  # [1, 1.26, 1.587]
First calculate the size:
anchors[:, 2:] = base_size * np.tile(scales, (2, len(ratios))).T
Get the width and height of the initial anchors (for example, on the level with the smallest base size, 32):
[[ 0.  0.  32.          32.        ]
 [ 0.  0.  40.3174736   40.3174736 ]
 [ 0.  0.  50.79683366  50.79683366]
 [ 0.  0.  32.          32.        ]
 [ 0.  0.  40.3174736   40.3174736 ]
 [ 0.  0.  50.79683366  50.79683366]
 [ 0.  0.  32.          32.        ]
 [ 0.  0.  40.3174736   40.3174736 ]
 [ 0.  0.  50.79683366  50.79683366]]
Get the area at each scale:
[1024. 1625. 2580. 1024. 1625. 2580. 1024. 1625. 2580.]
Then generate the anchor according to the aspect ratio:
[[ 0.  0.  45.254834    22.627417  ]
 [ 0.  0.  57.01751796  28.50875898]
 [ 0.  0.  71.83757109  35.91878555]
 [ 0.  0.  32.          32.        ]
 [ 0.  0.  40.3174736   40.3174736 ]
 [ 0.  0.  50.79683366  50.79683366]
 [ 0.  0.  22.627417    45.254834  ]
 [ 0.  0.  28.50875898  57.01751796]
 [ 0.  0.  35.91878555  71.83757109]]
Finally converted to xyxy form:
[[-22.627417   -11.3137085   22.627417    11.3137085 ]
 [-28.50875898 -14.25437949  28.50875898  14.25437949]
 [-35.91878555 -17.95939277  35.91878555  17.95939277]
 [-16.         -16.          16.          16.        ]
 [-20.1587368  -20.1587368   20.1587368   20.1587368 ]
 [-25.39841683 -25.39841683  25.39841683  25.39841683]
 [-11.3137085  -22.627417    11.3137085   22.627417  ]
 [-14.25437949 -28.50875898  14.25437949  28.50875898]
 [-17.95939277 -35.91878555  17.95939277  35.91878555]]
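The steps above (size, area, aspect ratio, xyxy conversion) can be reproduced without NumPy. This is a minimal sketch that assumes the width is sqrt(area / ratio) and the height is width × ratio, which matches the printed values. `generate_base_anchors` is an illustrative name, not necessarily the one used in the reference code.

```python
import math


def generate_base_anchors(base_size=32,
                          ratios=(0.5, 1.0, 2.0),
                          scales=(2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0))):
    """Return 9 anchors centered at the origin, as (x1, y1, x2, y2) tuples."""
    anchors = []
    for ratio in ratios:
        for scale in scales:
            # The area depends only on the scale, not on the aspect ratio.
            area = (base_size * scale) ** 2
            width = math.sqrt(area / ratio)
            height = width * ratio
            # Convert (center = 0, width, height) to xyxy form.
            anchors.append((-0.5 * width, -0.5 * height,
                            0.5 * width, 0.5 * height))
    return anchors


base_anchors = generate_base_anchors()
for anchor in base_anchors:
    print([round(v, 4) for v in anchor])
```

Because width × height always equals the area, changing the ratio reshapes the box without changing how much of the image it covers.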
This gives the base anchors of one level. This group of anchors corresponds to position (0, 0) on the feature map; to get all the anchors on the entire feature map, you only need to copy it and translate it to every other position. The feature maps of the other scales are handled the same way. Finally, the anchors of all feature maps are concatenated, and the result also has shape [1, 19206, 4].
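The copy-and-translate step can be sketched as a pair of nested loops. The half-cell offset used for the anchor centers is an assumption here (some implementations place the centers at multiples of the stride instead), and `shift_anchors` is a hypothetical helper.

```python
def shift_anchors(base_anchors, feat_h, feat_w, stride):
    """Copy the base anchors to every position of a feat_h x feat_w feature map."""
    all_anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Assumption: each anchor group is centered in its cell.
            center_x = (col + 0.5) * stride
            center_y = (row + 0.5) * stride
            for x1, y1, x2, y2 in base_anchors:
                all_anchors.append((x1 + center_x, y1 + center_y,
                                    x2 + center_x, y2 + center_y))
    return all_anchors


# A single square base anchor keeps the example easy to follow.
single_anchor = [(-16.0, -16.0, 16.0, 16.0)]
shifted = shift_anchors(single_anchor, feat_h=2, feat_w=2, stride=8)
print(shifted)
```

With 9 base anchors and a 40 × 40 feature map, the same function would return 40 × 40 × 9 = 14400 anchors.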
The code does not split the anchor encoding into a separate module. The steps are as follows.
First, the gt box is converted into the form of center point and width and height:
gt_widths = assigned_annotations[:, 2] - assigned_annotations[:, 0]
gt_heights = assigned_annotations[:, 3] - assigned_annotations[:, 1]
gt_ctr_x = assigned_annotations[:, 0] + 0.5 * gt_widths
gt_ctr_y = assigned_annotations[:, 1] + 0.5 * gt_heights
Similarly, anchor is also converted into the form of center point and width and height:
anchor_widths = anchor[:, 2] - anchor[:, 0]
anchor_heights = anchor[:, 3] - anchor[:, 1]
anchor_ctr_x = anchor[:, 0] + 0.5 * anchor_widths
anchor_ctr_y = anchor[:, 1] + 0.5 * anchor_heights
Then calculate the value of the gt box relative to the anchor:
# The _pi suffix presumably marks the values selected for the positive anchors (positive_indices)
targets_dx = (gt_ctr_x - anchor_ctr_x_pi) / anchor_widths_pi
targets_dy = (gt_ctr_y - anchor_ctr_y_pi) / anchor_heights_pi
targets_dw = torch.log(gt_widths / anchor_widths_pi)
targets_dh = torch.log(gt_heights / anchor_heights_pi)
The training goal is for the values predicted by the network to equal these four relative values.
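For a single anchor and a single gt box, the encoding reduces to a few scalar operations. The sketch below is a plain-Python illustration with made-up box coordinates. `encode_box` is a hypothetical helper, and it leaves out any normalization of the targets by a fixed mean and standard deviation.

```python
import math


def encode_box(anchor, gt):
    """Encode a gt box relative to an anchor; both are (x1, y1, x2, y2)."""
    anchor_w = anchor[2] - anchor[0]
    anchor_h = anchor[3] - anchor[1]
    anchor_cx = anchor[0] + 0.5 * anchor_w
    anchor_cy = anchor[1] + 0.5 * anchor_h

    gt_w = gt[2] - gt[0]
    gt_h = gt[3] - gt[1]
    gt_cx = gt[0] + 0.5 * gt_w
    gt_cy = gt[1] + 0.5 * gt_h

    # Center offsets are measured in units of the anchor size.
    dx = (gt_cx - anchor_cx) / anchor_w
    dy = (gt_cy - anchor_cy) / anchor_h
    # Size changes are measured on a log scale.
    dw = math.log(gt_w / anchor_w)
    dh = math.log(gt_h / anchor_h)
    return dx, dy, dw, dh


targets = encode_box(anchor=(0.0, 0.0, 10.0, 10.0), gt=(5.0, 5.0, 15.0, 25.0))
print(targets)
```

Dividing by the anchor size and taking the log of the size ratio makes the targets independent of the absolute scale of the box, so the same head can be shared across pyramid levels.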
This part divides the anchors into positive and negative samples according to their IoU with the gt boxes, that is, it picks out the anchors responsible for predicting each gt. The assignment strategy is very simple: it is based purely on IoU.
First compute the IoU and take, for each anchor, the maximum over all gt boxes:
IoU_max, IoU_argmax = torch.max(IoU, dim=1) # num_anchors x 1
Positive samples: anchors whose IoU with a gt is greater than 0.5
Negative samples: anchors whose IoU with every gt is less than 0.4
Ignored samples: all other anchors
One issue: the YOLO series assigns at least the anchor with the largest IoU to a gt when no anchor exceeds 0.5, whereas this strategy can leave a gt without any positive anchor. RetinaNet appears to accept this on the grounds that, on the COCO dataset, very few gt boxes end up unmatched under this strategy.
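The assignment rule can be sketched for a single anchor in plain Python. The label values 1, 0 and -1 and the helper names `box_iou` and `assign_label` are chosen for illustration. Treating an IoU of exactly 0.5 as positive is an assumption.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def assign_label(anchor, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for one anchor."""
    if not gt_boxes:
        return 0  # no targets in the image: every anchor is negative
    iou_max = max(box_iou(anchor, gt) for gt in gt_boxes)
    if iou_max >= pos_thr:  # assumption: the boundary counts as positive
        return 1
    if iou_max < neg_thr:
        return 0
    return -1


anchor = (0.0, 0.0, 10.0, 10.0)
labels = [
    assign_label(anchor, [(0.0, 0.0, 10.0, 10.0)]),      # IoU = 1.0
    assign_label(anchor, [(100.0, 100.0, 110.0, 110.0)]),  # IoU = 0.0
    assign_label(anchor, [(0.0, 0.0, 10.0, 22.0)]),      # IoU = 100 / 220
]
print(labels)
```

Note that nothing in this rule guarantees a positive anchor for every gt box, which is the issue described above.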
For focal loss, please refer to:
When the image contains no targets, only the classification loss is computed and the box regression loss is skipped. In this case all anchors are negative samples:
alpha_factor = torch.ones(classification.shape) * alpha
alpha_factor = 1. - alpha_factor
focal_weight = classification
focal_weight = alpha_factor * torch.pow(focal_weight, gamma)
bce = -(torch.log(1.0 - classification))
cls_loss = focal_weight * bce
classification_losses.append(cls_loss.sum())
# no gt boxes, so the regression loss is set to 0
regression_losses.append(torch.tensor(0).float())
# The output uses sigmoid, so alpha and 1 - alpha can be applied directly:
# each output is a binary classification of target vs. background
alpha_factor = torch.where(torch.eq(targets, 1.), alpha_factor, 1. - alpha_factor)
focal_weight = torch.where(torch.eq(targets, 1.), 1. - classification, classification)
focal_weight = alpha_factor * torch.pow(focal_weight, gamma)
bce = -(targets * torch.log(classification) + (1.0 - targets) * torch.log(1.0 - classification))
cls_loss = focal_weight * bce
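For a single sigmoid output, the classification loss above reduces to the scalar form below. This is a sketch for intuition only. The defaults alpha = 0.25 and gamma = 2.0 are commonly used values and are an assumption here, since the text does not state them.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Focal loss for one sigmoid probability p and a binary target (1 or 0)."""
    if target == 1:
        # Positive: weight alpha, down-weighted when p is already high.
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    # Negative: weight 1 - alpha, down-weighted when p is already low.
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


easy_negative = focal_loss(0.01, target=0)  # confident and correct
hard_negative = focal_loss(0.90, target=0)  # confident and wrong
print(easy_negative, hard_negative)
```

The factor `p ** gamma` is what keeps the huge number of easy negative anchors from dominating the sum.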
# Computed only on the anchors of the positive samples; abs gives the L1 loss
regression_diff = torch.abs(targets - regression[positive_indices, :])
# Smooth it, which gives the smooth L1 loss
regression_loss = torch.where(
    torch.le(regression_diff, 1.0 / 9.0),
    0.5 * 9.0 * torch.pow(regression_diff, 2),
    regression_diff - 0.5 / 9.0)
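The piecewise form of the regression loss is easier to see for a single value. The sketch below uses the same threshold of 1/9 as the code above. The parameter name `beta` is an illustrative choice.

```python
def smooth_l1(diff, beta=1.0 / 9.0):
    """Smooth L1 loss for one difference value."""
    diff = abs(diff)
    if diff <= beta:
        # Quadratic near zero: 0.5 * diff^2 / beta, i.e. 0.5 * 9 * diff^2.
        return 0.5 * diff * diff / beta
    # Linear elsewhere, shifted so the two pieces meet at diff == beta.
    return diff - 0.5 * beta


values = [smooth_l1(d) for d in (0.0, 1.0 / 9.0, 1.0)]
print(values)
```

Both branches give 0.5 / 9 at the threshold, so the loss is continuous there.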
The inference process is relatively simple. Part of the code is as follows:
def forward(self, boxes, deltas):
    widths = boxes[:, :, 2] - boxes[:, :, 0]
    heights = boxes[:, :, 3] - boxes[:, :, 1]
    ctr_x = boxes[:, :, 0] + 0.5 * widths
    ctr_y = boxes[:, :, 1] + 0.5 * heights

    # boxes are the anchors, and deltas is the output of the box regression branch.
    # Multiplying by self.std and adding self.mean reverses the standardization,
    # because the regression targets are standardized during training.
    # The mean and std are fixed values; their purpose is to enlarge the
    # relative values, which helps the network regress them.
    dx = deltas[:, :, 0] * self.std + self.mean
    dy = deltas[:, :, 1] * self.std + self.mean
    dw = deltas[:, :, 2] * self.std + self.mean
    dh = deltas[:, :, 3] * self.std + self.mean

    pred_ctr_x = ctr_x + dx * widths
    pred_ctr_y = ctr_y + dy * heights
    pred_w = torch.exp(dw) * widths
    pred_h = torch.exp(dh) * heights

    pred_boxes_x1 = pred_ctr_x - 0.5 * pred_w
    pred_boxes_y1 = pred_ctr_y - 0.5 * pred_h
    pred_boxes_x2 = pred_ctr_x + 0.5 * pred_w
    pred_boxes_y2 = pred_ctr_y + 0.5 * pred_h

    pred_boxes = torch.stack([pred_boxes_x1, pred_boxes_y1, pred_boxes_x2, pred_boxes_y2], dim=2)
    return pred_boxes
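The decoding can also be followed for a single anchor in plain Python. In this sketch the mean of 0 and the per-coordinate std of (0.1, 0.1, 0.2, 0.2) are assumptions chosen for illustration, since the text only says that fixed values are used. `decode_box` is a hypothetical helper.

```python
import math


def decode_box(anchor, deltas,
               mean=(0.0, 0.0, 0.0, 0.0),
               std=(0.1, 0.1, 0.2, 0.2)):  # assumed values, for illustration
    """Turn network deltas back into an (x1, y1, x2, y2) box."""
    width = anchor[2] - anchor[0]
    height = anchor[3] - anchor[1]
    ctr_x = anchor[0] + 0.5 * width
    ctr_y = anchor[1] + 0.5 * height

    # Undo the standardization applied to the regression targets.
    dx, dy, dw, dh = (d * s + m for d, s, m in zip(deltas, std, mean))

    pred_ctr_x = ctr_x + dx * width
    pred_ctr_y = ctr_y + dy * height
    pred_w = math.exp(dw) * width
    pred_h = math.exp(dh) * height
    return (pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
            pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h)


unchanged = decode_box((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 0.0, 0.0))
shifted_box = decode_box((0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 0.0, 0.0))
print(unchanged, shifted_box)
```

All-zero deltas return the anchor itself, and a raw delta of 5 in x moves the center by 5 × 0.1 = 0.5 anchor widths.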
After decoding, a clipBoxes operation is required to obtain the final predicted boxes; it ensures that no coordinate exceeds the bounds of the image. Then, for each category, take the scores of that category, keep the boxes whose score is above a threshold, and run NMS on them.
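The post-processing for one category can be sketched as follows. This is a plain-Python greedy NMS for illustration. The thresholds 0.05 and 0.5 are assumptions, and the helper names (`clip_box`, `box_iou`, `nms`) are not taken from the reference code, which would normally use a vectorized NMS.

```python
def clip_box(box, img_w, img_h):
    """Clamp an (x1, y1, x2, y2) box to the image bounds."""
    x1, y1, x2, y2 = box
    return (max(x1, 0), max(y1, 0), min(x2, img_w), min(y2, img_h))


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thr=0.05, iou_thr=0.5):
    """Greedy NMS for one category; returns the indices of the kept boxes."""
    # Keep only confident boxes, sorted by descending score.
    order = sorted((i for i, s in enumerate(scores) if s > score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box already kept.
        if all(box_iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0),
         (50.0, 50.0, 60.0, 60.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7, 0.01]
kept = nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of about 0.68 and is suppressed, and the last box is dropped by the score threshold.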
RetinaNet is an object detection framework with a very clear structure. The backbone and the FPN neck are easy to replace, and the head is simple to define. Together with focal loss, it has become the baseline for many algorithms, for example arbitrary-angle (rotated) object detection.
Reference code link: