RetinaNet code analysis

Code Analysis

head

Each branch output by the FPN is fed into the head, which performs the classification and regression operations.

classification output

Each feature level passes through 4 convolution + ReLU layers, followed by the output convolution of the head:

self.output = nn.Conv2d(feature_size, num_anchors * num_classes, kernel_size=3, padding=1)
self.output_act = nn.Sigmoid()

The shapes of the final per-level predictions are:

torch.Size([1, 14400, 80])
torch.Size([1, 3600, 80])
torch.Size([1, 900, 80])
torch.Size([1, 225, 80])
torch.Size([1, 81, 80])

Here 14400 = 40 × 40 × 9, where 9 is the number of anchors per position. Concatenating all levels gives a tensor of shape [1, 19206, 80]: 9 anchors are predicted at each feature-map position, and each anchor gets 80 class scores. The concatenation matches the layout of the anchors, which makes the loss computation and forward prediction convenient. Note that the activation here is sigmoid(); to use a softmax() output you would need to add a background class, but the paper shows that the sigmoid formulation works better than softmax.
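
As a minimal sketch (not the reference implementation verbatim; class and variable names are illustrative), the classification sub-network and the reshape that produces the shapes above could look like this:

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # 4 x (conv + ReLU), then an output conv with num_anchors * num_classes channels
    def __init__(self, in_channels, feature_size=256, num_anchors=9, num_classes=80):
        super().__init__()
        self.num_classes = num_classes
        layers = []
        channels = in_channels
        for _ in range(4):
            layers += [nn.Conv2d(channels, feature_size, kernel_size=3, padding=1), nn.ReLU()]
            channels = feature_size
        self.convs = nn.Sequential(*layers)
        self.output = nn.Conv2d(feature_size, num_anchors * num_classes, kernel_size=3, padding=1)
        self.output_act = nn.Sigmoid()

    def forward(self, x):
        out = self.output_act(self.output(self.convs(x)))   # [B, A*C, H, W]
        out = out.permute(0, 2, 3, 1).contiguous()           # [B, H, W, A*C]
        return out.view(x.shape[0], -1, self.num_classes)    # [B, H*W*A, C]

# e.g. a 40x40 P3 feature map gives [1, 40*40*9, 80] = [1, 14400, 80]
p3 = torch.randn(1, 256, 40, 40)
print(ClassificationHead(256)(p3).shape)  # torch.Size([1, 14400, 80])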

regression output

Similar to the classification head, it is 4 convolution + ReLU layers followed by an output convolution. Since this is a regression problem, no activation is applied.

self.output = nn.Conv2d(feature_size, num_anchors * 4, kernel_size=3, padding=1)

The per-level output shapes are:

torch.Size([1, 14400, 4])
torch.Size([1, 3600, 4])
torch.Size([1, 900, 4])
torch.Size([1, 225, 4])
torch.Size([1, 81, 4])

Finally, all levels are concatenated into [1, 19206, 4], where the 4 values encode the center point and width/height of the predicted box as offsets relative to each anchor (see the encoding section below).
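
The reshape-and-concatenate step that produces the [1, 19206, 4] tensor can be sketched as follows (the feature-map sizes 40, 20, 10, 5, 3 are assumed from the shapes listed above; the same pattern applies to the classification branch):

import torch

def flatten_regression(outputs):
    # outputs: list of per-level tensors of shape [B, num_anchors*4, H, W]
    flat = []
    for out in outputs:
        b = out.shape[0]
        out = out.permute(0, 2, 3, 1).contiguous()  # [B, H, W, A*4]
        flat.append(out.view(b, -1, 4))             # [B, H*W*A, 4]
    return torch.cat(flat, dim=1)                   # [B, total_anchors, 4]

levels = [torch.randn(1, 9 * 4, s, s) for s in (40, 20, 10, 5, 3)]
print(flatten_regression(levels).shape)  # torch.Size([1, 19206, 4])
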
Anchor generation

Large feature maps predict small objects, and small feature maps predict large objects. The FPN has 5 outputs, so anchors are generated at 5 scales, and at each scale there are 9 anchors (3 aspect ratios × 3 sizes) per position.

First, define the pyramid levels of the feature maps:

self.pyramid_levels = [3, 4, 5, 6, 7]

The corresponding strides are:

self.strides = [2 ** x for x in self.pyramid_levels]
# [8,16,32,64,128]

The base anchor size at each level is:

self.sizes = [2 ** (x + 2) for x in self.pyramid_levels]
# [32,64,128,256,512]
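
With these strides, the per-level anchor counts from the head section can be checked with a quick calculation (assuming a 320x320 input, which is what the 40x40 P3 map above implies):

pyramid_levels = [3, 4, 5, 6, 7]
image_size = 320          # assumed input size, consistent with the 40x40 P3 map
num_anchors = 9

counts = []
for level in pyramid_levels:
    stride = 2 ** level
    feat = (image_size + stride - 1) // stride   # ceil(image_size / stride)
    counts.append(feat * feat * num_anchors)

print(counts)        # [14400, 3600, 900, 225, 81]
print(sum(counts))   # 19206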

Combining 3 aspect ratios with 3 scales gives 9 anchors per position:

ratios = np.array([0.5, 1, 2])
scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])  # [1, 1.26, 1.587]

First compute the anchor sizes:

anchors[:, 2:] = base_size * np.tile(scales, (2, len(ratios))).T

This gives the width and height of the initial anchors (for the smallest level, base_size = 32):

[[ 0. 0. 32. 32. ]
 [ 0. 0. 40.3174736 40.3174736 ]
 [ 0. 0. 50.79683366 50.79683366]
 [ 0. 0. 32. 32. ]
 [ 0. 0. 40.3174736 40.3174736 ]
 [ 0. 0. 50.79683366 50.79683366]
 [ 0. 0. 32. 32. ]
 [ 0. 0. 40.3174736 40.3174736 ]
 [ 0. 0. 50.79683366 50.79683366]]

Get the area at each scale:

[1024. 1625. 2580. 1024. 1625. 2580. 1024. 1625. 2580.]

Then adjust the widths and heights according to the aspect ratios, keeping the area constant (see the sketch after the matrix below):

[[ 0. 0. 45.254834 22.627417 ]
 [ 0. 0. 57.01751796 28.50875898]
 [ 0. 0. 71.83757109 35.91878555]
 [ 0. 0. 32. 32. ]
 [ 0. 0. 40.3174736 40.3174736 ]
 [ 0. 0. 50.79683366 50.79683366]
 [ 0. 0. 22.627417 45.254834 ]
 [ 0. 0. 28.50875898 57.01751796]
 [ 0. 0. 35.91878555 71.83757109]]
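
The matrix above comes from an area-preserving ratio correction; a sketch of this step, following the approach of the reference implementation (variable names are illustrative):

import numpy as np

base_size = 32
ratios = np.array([0.5, 1, 2])
scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])

anchors = np.zeros((len(ratios) * len(scales), 4))
anchors[:, 2:] = base_size * np.tile(scales, (2, len(ratios))).T    # initial widths and heights

areas = anchors[:, 2] * anchors[:, 3]                               # [1024, 1625, 2580, ...]
anchors[:, 2] = np.sqrt(areas / np.repeat(ratios, len(scales)))     # w = sqrt(area / ratio)
anchors[:, 3] = anchors[:, 2] * np.repeat(ratios, len(scales))      # h = w * ratio
print(anchors)   # reproduces the 9 x 4 matrix above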

Finally, convert to (x1, y1, x2, y2) form, centered on the origin:

[[-22.627417 -11.3137085 22.627417 11.3137085 ]
 [-28.50875898 -14.25437949 28.50875898 14.25437949]
 [-35.91878555 -17.95939277 35.91878555 17.95939277]
 [-16. -16. 16. 16. ]
 [-20.1587368 -20.1587368 20.1587368 20.1587368 ]
 [-25.39841683 -25.39841683 25.39841683 25.39841683]
 [-11.3137085 -22.627417 11.3137085 22.627417 ]
 [-14.25437949 -28.50875898 14.25437949 28.50875898]
 [-17.95939277 -35.91878555 17.95939277 35.91878555]]

This yields the base anchors of one level, i.e. the anchor set at position (0, 0) of the feature map. Copying and translating this set to every other position gives all the anchors on that feature map. The other levels are handled the same way, and the anchors of all levels are concatenated, giving a tensor of size [1, 19206, 4] that lines up with the network outputs.
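
A sketch of the copy-and-translate step (the reference code implements this with a shift-style helper; the function name and the +0.5 center offset here are assumptions):

import numpy as np

def shift_anchors(feature_shape, stride, base_anchors):
    # place the base anchors at every feature-map position
    shift_x = (np.arange(feature_shape[1]) + 0.5) * stride
    shift_y = (np.arange(feature_shape[0]) + 0.5) * stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([shift_x.ravel(), shift_y.ravel(),
                       shift_x.ravel(), shift_y.ravel()], axis=1)   # [H*W, 4]
    # broadcast: [H*W, 1, 4] + [1, A, 4] -> [H*W, A, 4]
    all_anchors = shifts[:, None, :] + base_anchors[None, :, :]
    return all_anchors.reshape(-1, 4)                               # [H*W*A, 4]

# e.g. the P3 level: a 40x40 map with stride 8 and 9 base anchors gives 14400 anchors
base = np.zeros((9, 4))   # stand-in for the base anchors computed above
print(shift_anchors((40, 40), 8, base).shape)   # (14400, 4)
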
anchor encoding

The reference code does not split anchor encoding into a separate module; it happens inside the loss computation.

First, the gt box is converted into the form of center point and width and height:

gt_widths = assigned_annotations[:, 2] - assigned_annotations[:, 0]
gt_heights = assigned_annotations[:, 3] - assigned_annotations[:, 1]
gt_ctr_x = assigned_annotations[:, 0] + 0.5 * gt_widths
gt_ctr_y = assigned_annotations[:, 1] + 0.5 * gt_heights

Similarly, the anchors are converted to center point plus width/height form:

anchor_widths = anchor[:, 2] - anchor[:, 0]
anchor_heights = anchor[:, 3] - anchor[:, 1]
anchor_ctr_x = anchor[:, 0] + 0.5 * anchor_widths
anchor_ctr_y = anchor[:, 1] + 0.5 * anchor_heights

Then compute the relative offsets between the two (the _pi suffix in the reference code denotes values indexed by the positive anchors):

targets_dx = (gt_ctr_x - anchor_ctr_x_pi) / anchor_widths_pi
targets_dy = (gt_ctr_y - anchor_ctr_y_pi) / anchor_heights_pi
targets_dw = torch.log(gt_widths / anchor_widths_pi)
targets_dh = torch.log(gt_heights / anchor_heights_pi)

The training objective is for the network's regression output to equal these four relative values.
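
In the reference code these targets are additionally divided by fixed values, which is the "standardization" that the decoding step below undoes with self.mean and self.std. A sketch with dummy inputs; the constants [0.1, 0.1, 0.2, 0.2] are the commonly used defaults and should be treated as an assumption here:

import torch

# dummy offsets standing in for the targets_dx/dy/dw/dh computed above
targets_dx = torch.tensor([0.05]); targets_dy = torch.tensor([-0.02])
targets_dw = torch.tensor([0.10]); targets_dh = torch.tensor([0.07])

targets = torch.stack((targets_dx, targets_dy, targets_dw, targets_dh)).t()
# dividing by a fixed "std" scales up the relative offsets before they are used as supervision
targets = targets / torch.tensor([[0.1, 0.1, 0.2, 0.2]])
print(targets)
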
anchor allocation

This part divides anchors into positive and negative samples according to IoU, i.e. it picks out the anchors responsible for predicting each gt box. The assignment strategy is simply an IoU threshold rule.

First the IoU between every anchor and every gt box is needed; each anchor keeps its maximum IoU and the index of the corresponding gt:

IoU_max, IoU_argmax = torch.max(IoU, dim=1) # num_anchors x 1

Positive samples: anchors whose IoU with a gt box is greater than 0.5
Negative samples: anchors whose maximum IoU with all gt boxes is less than 0.4
Ignored samples: all other anchors (IoU between 0.4 and 0.5); a sketch of turning this rule into training targets follows
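
A sketch of how this rule can be turned into classification targets, following the usual pattern (-1 marks ignored anchors, 0 background, one-hot for positives); the function and variable names here are illustrative:

import torch

def build_cls_targets(IoU_max, IoU_argmax, annotations, num_classes=80):
    # IoU_max / IoU_argmax: [num_anchors]; annotations: [num_gt, 5] with the class id in column 4
    num_anchors = IoU_max.shape[0]
    targets = torch.ones(num_anchors, num_classes) * -1      # ignored by default
    targets[IoU_max < 0.4, :] = 0                             # negatives: background everywhere
    positive_indices = IoU_max >= 0.5                         # positives
    assigned = annotations[IoU_argmax[positive_indices], :]   # gt box assigned to each positive anchor
    targets[positive_indices, :] = 0
    targets[positive_indices, assigned[:, 4].long()] = 1      # one-hot class label
    return targets, positive_indices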

Potential issue: unlike the YOLO series, there is no rule that forces at least the anchor with the largest IoU to be assigned when no anchor exceeds 0.5. RetinaNet assumes that, with this strategy on the COCO dataset, the cases where a gt box cannot be matched are very rare.

loss calculation

For the details of focal loss, refer to the original paper.
When an image contains no targets, only the classification loss is computed and the box regression loss is skipped; all anchors are treated as negative samples:

alpha_factor = torch.ones(classification.shape) * alpha
alpha_factor = 1. - alpha_factor
focal_weight = classification
focal_weight = alpha_factor * torch.pow(focal_weight, gamma)

bce = -(torch.log(1.0 - classification))

cls_loss = focal_weight * bce
classification_losses.append(cls_loss.sum())
# the regression loss is returned as 0
regression_losses.append(torch.tensor(0).float())

Classification loss:

# Since the output is sigmoid, alpha and 1 - alpha can be used directly; each channel does a binary classification of target vs. background
alpha_factor = torch.where(torch.eq(targets, 1.), alpha_factor, 1. - alpha_factor)
focal_weight = torch.where(torch.eq(targets, 1.), 1. - classification, classification)
focal_weight = alpha_factor * torch.pow(focal_weight, gamma)
bce = -(targets * torch.log(classification) + (1.0 - targets) * torch.log(1.0 - classification))
cls_loss = focal_weight * bce
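
In the reference implementation the loss on ignored anchors is then zeroed out and the sum is normalized by the number of positive anchors; roughly (continuing the snippet above, so targets, cls_loss, positive_indices and classification_losses are assumed to already exist):

# anchors marked -1 (ignored) contribute nothing; normalize by the positive count
cls_loss = torch.where(torch.ne(targets, -1.0), cls_loss, torch.zeros_like(cls_loss))
num_positive_anchors = positive_indices.sum()
classification_losses.append(cls_loss.sum() / torch.clamp(num_positive_anchors.float(), min=1.0))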

Regression loss:

# Computed only on positive anchors; the absolute difference is the L1 loss
regression_diff = torch.abs(targets - regression[positive_indices, :])
# Smooth it, i.e. smooth L1 loss: quadratic for |diff| < 1/9, linear beyond
regression_loss = torch.where(
    torch.le(regression_diff, 1.0 / 9.0),
    0.5 * 9.0 * torch.pow(regression_diff, 2),
    regression_diff - 0.5 / 9.0)

Inference

The inference process is relatively simple; the box decoding code is as follows:

def forward(self, boxes, deltas):
    # boxes are the anchors; deltas are the outputs of the regression branch
    widths = boxes[:, :, 2] - boxes[:, :, 0]
    heights = boxes[:, :, 3] - boxes[:, :, 1]
    ctr_x = boxes[:, :, 0] + 0.5 * widths
    ctr_y = boxes[:, :, 1] + 0.5 * heights

    # deltas * std + mean undoes the standardization applied to the regression
    # targets during training (the mean and std are fixed values whose purpose
    # is to scale up the relative offsets and help the network regress)
    dx = deltas[:, :, 0] * self.std[0] + self.mean[0]
    dy = deltas[:, :, 1] * self.std[1] + self.mean[1]
    dw = deltas[:, :, 2] * self.std[2] + self.mean[2]
    dh = deltas[:, :, 3] * self.std[3] + self.mean[3]

    pred_ctr_x = ctr_x + dx * widths
    pred_ctr_y = ctr_y + dy * heights
    pred_w = torch.exp(dw) * widths
    pred_h = torch.exp(dh) * heights

    pred_boxes_x1 = pred_ctr_x - 0.5 * pred_w
    pred_boxes_y1 = pred_ctr_y - 0.5 * pred_h
    pred_boxes_x2 = pred_ctr_x + 0.5 * pred_w
    pred_boxes_y2 = pred_ctr_y + 0.5 * pred_h

    pred_boxes = torch.stack([pred_boxes_x1, pred_boxes_y1, pred_boxes_x2, pred_boxes_y2], dim=2)

    return pred_boxes

After decoding, the clipBoxes operation clamps the predicted boxes so that they do not exceed the image boundaries. Then each category is traversed: take the scores for that category, keep the boxes whose score exceeds a threshold, and run NMS on them to obtain the final detections.
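
A sketch of this post-processing for a single image, using torchvision's NMS (the threshold values are illustrative, not the reference defaults):

import torch
from torchvision.ops import nms

def postprocess(pred_boxes, scores, img_w, img_h, score_thresh=0.05, iou_thresh=0.5):
    # pred_boxes: [N, 4] decoded boxes; scores: [N, num_classes] sigmoid outputs
    # clipBoxes: clamp the boxes to the image boundaries
    pred_boxes[:, 0::2] = pred_boxes[:, 0::2].clamp(0, img_w)
    pred_boxes[:, 1::2] = pred_boxes[:, 1::2].clamp(0, img_h)

    results = []
    for cls in range(scores.shape[1]):             # traverse each category
        cls_scores = scores[:, cls]
        keep = cls_scores > score_thresh           # keep boxes above the score threshold
        if keep.sum() == 0:
            continue
        boxes_c, scores_c = pred_boxes[keep], cls_scores[keep]
        kept = nms(boxes_c, scores_c, iou_thresh)  # per-class NMS
        for i in kept:
            results.append((cls, scores_c[i].item(), boxes_c[i].tolist()))
    return results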

Conclusion

RetinaNet is an object detection framework with a very clear structure. The backbone and the FPN neck are easy to swap out, and the head definition is also very simple. Together with focal loss, it has become the baseline for many algorithms, such as arbitrary-oriented object detection.

Reference links:

https://github.com/yhenon/pytorch-retinanet
https://blog.51cto.com/u_15671528/5890929