Bounding Box Regression Strategies in OSTrack

Contents

1. Crop and label settings

2. Bounding box regression of the model’s predicted output


1. Crop and label settings

1. Add a random offset (jitter) to the annotation to obtain the offset bounding box (a sketch of a typical jitter function is shown below)

jittered_anno = [self._get_jittered_box(a, s) for a in data[s + '_anno']]
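_get_jittered_box itself is not shown in this excerpt; below is a minimal sketch of what such a jitter function typically does. The factor values and the standalone function name are illustrative, not OSTrack's exact settings:

import torch

def get_jittered_box(box, scale_jitter_factor=0.25, center_jitter_factor=3.0):
    """Sketch only: randomly perturb the scale and center of box = [x, y, w, h]."""
    box = torch.as_tensor(box, dtype=torch.float)
    # Scale jitter: multiply w, h by a log-normal factor
    jittered_size = box[2:4] * torch.exp(torch.randn(2) * scale_jitter_factor)
    # Center jitter: shift the center by a random offset proportional to the box size
    max_offset = jittered_size.prod().sqrt() * center_jitter_factor
    jittered_center = box[0:2] + 0.5 * box[2:4] + max_offset * (torch.rand(2) - 0.5)
    # Back to (x, y, w, h)
    return torch.cat((jittered_center - 0.5 * jittered_size, jittered_size))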

2. Crop the search region centered on the offset bounding box

First compute the side of the square crop: its area is search_area_factor^2 (here 4^2) times the area of the offset bounding box,

crop_sz = torch.ceil(torch.sqrt(w * h) * self.search_area_factor[s])

i.e. crop_sz = \sqrt{w*h} * 4
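For concreteness, a quick numeric check with made-up box dimensions (w = 80, h = 60):

import math

w, h, search_area_factor = 80, 60, 4
crop_sz = math.ceil(math.sqrt(w * h) * search_area_factor)
print(crop_sz)  # ceil(69.28 * 4) = 278: the side of the square crop, in original-image pixels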

Then crop and pad:

def sample_target(im, target_bb, search_area_factor, output_sz=None, mask=None):
    """ Extracts a square crop centered at target_bb box, of area search_area_factor^2 times target_bb area

    args:
        im - cv image
        target_bb - target box [x, y, w, h]
        search_area_factor - Ratio of crop size to target size
        output_sz - (float) Size to which the extracted crop is resized (always square). If None, no resizing is done.

    returns:
        cv image - extracted crop
        float - the factor by which the crop has been resized to make the crop size equal output_size
    """
    if not isinstance(target_bb, list):
        x, y, w, h = target_bb.tolist()
    else:
        x, y, w, h = target_bb
    # Crop image
    crop_sz = math.ceil(math.sqrt(w * h) * search_area_factor) # 466

    if crop_sz < 1:
        raise Exception('Too small bounding box.')

    x1 = round(x + 0.5 * w - crop_sz * 0.5)
    x2 = x1 + crop_sz

    y1 = round(y + 0.5 * h - crop_sz * 0.5)
    y2 = y1 + crop_sz

    x1_pad = max(0, -x1)
    x2_pad = max(x2 - im.shape[1] + 1, 0)

    y1_pad = max(0, -y1)
    y2_pad = max(y2 - im.shape[0] + 1, 0)

    # Crop target
    im_crop = im[y1 + y1_pad:y2 - y2_pad, x1 + x1_pad:x2 - x2_pad, :] # ndarray:(466,466,3)
    if mask is not None:
        mask_crop = mask[y1 + y1_pad:y2 - y2_pad, x1 + x1_pad:x2 - x2_pad] # Tensor:(466,466)

    #Pad
    im_crop_padded = cv.copyMakeBorder(im_crop, y1_pad, y2_pad, x1_pad, x2_pad, cv.BORDER_CONSTANT) # ndarray:(466,466,3) pad with zeros where the crop extends beyond the image border
    # deal with attention mask
    H, W, _ = im_crop_padded.shape # 466, 466, 3
    att_mask = np.ones((H,W)) # ndarray:(466,466)
    end_x, end_y = -x2_pad, -y2_pad # 0, 0
    if y2_pad == 0:
        end_y = None
    if x2_pad == 0:
        end_x = None
    att_mask[y1_pad:end_y, x1_pad:end_x] = 0
    if mask is not None: # True
        mask_crop_padded = F.pad(mask_crop, pad=(x1_pad, x2_pad, y1_pad, y2_pad), mode='constant', value=0)

3. Resize

    if output_sz is not None: # True
        resize_factor = output_sz / crop_sz
        im_crop_padded = cv.resize(im_crop_padded, (output_sz, output_sz)) # ndarray:(128,128,3)
        att_mask = cv.resize(att_mask, (output_sz, output_sz)).astype(np.bool_) # ndarray:(128,128) bool type
        if mask is None:
            return im_crop_padded, resize_factor, att_mask
        mask_crop_padded = \
        F.interpolate(mask_crop_padded[None, None], (output_sz, output_sz), mode='bilinear', align_corners=False)[0, 0] # Tensor:(128,128)
        return im_crop_padded, resize_factor, att_mask, mask_crop_padded

The crop is resized to the network input size, and the ratio output_sz / crop_sz is recorded as resize_factor for later use. At this point the cropped input image is determined, but the labels have not yet been aligned to it.
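A quick usage sketch with a made-up frame and box (sample_target, as excerpted above, also needs cv2 imported as cv, numpy as np, math, and torch.nn.functional as F):

import numpy as np

im = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)  # stand-in for a video frame
target_bb = [200, 150, 80, 60]                                  # offset box (x, y, w, h)

crop, resize_factor, att_mask = sample_target(im, target_bb,
                                              search_area_factor=4, output_sz=256)
print(crop.shape, resize_factor, att_mask.shape)  # (256, 256, 3), 256/278 ~ 0.92, (256, 256)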

4. Align labels

def transform_image_to_crop(box_in: torch.Tensor, box_extract: torch.Tensor, resize_factor: float,
                            crop_sz: torch.Tensor, normalize=False) -> torch.Tensor:
    """ Transform the box co-ordinates from the original image co-ordinates to the co-ordinates of the cropped image
    args:
        box_in - the box for which the co-ordinates are to be transformed
        box_extract - the box about which the image crop has been extracted.
        resize_factor - the ratio between the original image scale and the scale of the image crop
        crop_sz - size of the cropped image

    returns:
        torch.Tensor - transformed co-ordinates of box_in
    """
    box_extract_center = box_extract[0:2] + 0.5 * box_extract[2:4]

    box_in_center = box_in[0:2] + 0.5 * box_in[2:4]

    box_out_center = (crop_sz - 1) / 2 + (box_in_center - box_extract_center) * resize_factor
    box_out_wh = box_in[2:4] * resize_factor

    box_out = torch.cat((box_out_center - 0.5 * box_out_wh, box_out_wh))
    if normalize:
        return box_out / crop_sz[0]
    else:
        return box_out

First compute the center of the offset (extract) bounding box and the center of the ground-truth bounding box:

(x_1, y_1) = (x + 0.5*w, y + 0.5*h) for the offset box,

(x_0, y_0) = (x + 0.5*w, y + 0.5*h) for the ground-truth box,

where (x, y) are the top-left corner coordinates of each box.

Next align the labels:

gt_center = (output_sz - 1)/2 + (x_0 - x_1, y_0 - y_1) * resize_factor

where output_sz is the network input size.

The center form is then converted back to the top-left form (x, y, w, h) and normalized:

return box_out / crop_sz[0]

i.e. divided by the input size (e.g. 384 or 256).
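A quick usage sketch with made-up boxes (continuing the crop_sz = 278 example from above):

import math
import torch

box_extract = torch.tensor([200., 150., 80., 60.])   # offset (jittered) box in the original image
box_gt = torch.tensor([205., 148., 78., 62.])         # ground-truth box in the original image

output_sz = 256
resize_factor = output_sz / math.ceil(math.sqrt(80 * 60) * 4)   # 256 / 278
crop_sz = torch.tensor([output_sz, output_sz], dtype=torch.float)

box_crop = transform_image_to_crop(box_gt, box_extract, resize_factor, crop_sz, normalize=True)
print(box_crop)  # gt box expressed in the 256x256 crop, normalized to [0, 1]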

5. Generate the labels that the head needs to predict

The operations above only align the gt bbox with the cropped input; the targets that the model's head is trained to predict still need to be generated.

1) Classification label

Generate a Gaussian map from the center coordinates of the gt bbox

def generate_heatmap(bboxes, patch_size=320, stride=16): # Tensor:(1,4,4), 256, 16
    """
    Generate ground truth heatmap same as CenterNet
    Args:
        bboxes (torch.Tensor): shape of [num_search, bs, 4]

    Returns:
        gaussian_maps: list of generated heatmaps

    """
    gaussian_maps = []
    heatmap_size = patch_size // stride # 16
    for single_patch_bboxes in bboxes: # Tensor:(4,4)
        bs = single_patch_bboxes.shape[0] # 4
        gt_scoremap = torch.zeros(bs, heatmap_size, heatmap_size) # Tensor:(4,16,16)
        classes = torch.arange(bs).to(torch.long) # tensor:([0,1,2,3])
        bbox = single_patch_bboxes * heatmap_size # Tensor:(4,4)
        wh = bbox[:, 2:] # Tensor:(4,2)
        centers_int = (bbox[:, :2] + wh / 2).round() # Tensor:(4,2) center point
        CenterNetHeatMap.generate_score_map(gt_scoremap, classes, wh, centers_int, 0.7)
        gaussian_maps.append(gt_scoremap.to(bbox.device))

    return gaussian_maps
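CenterNetHeatMap.generate_score_map is not shown here; conceptually it splats a 2D Gaussian with peak 1 at each integer center, with a radius derived from the box size. A simplified sketch of that idea (not the exact CenterNet radius computation):

import torch

def draw_gaussian(heatmap, center, sigma):
    """Splat an unnormalized 2D Gaussian with peak 1 at `center` onto `heatmap` (in place)."""
    h, w = heatmap.shape
    ys = torch.arange(h, dtype=torch.float).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float).view(1, -1)
    cx, cy = center
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    heatmap.copy_(torch.maximum(heatmap, g))  # keep the max where Gaussians overlap
    return heatmap

gt_scoremap = torch.zeros(16, 16)
draw_gaussian(gt_scoremap, center=(7.0, 9.0), sigma=1.5)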

2) Regression label

It is the gt bbox itself, noting that the gt bbox here has already been normalized.

Since the network outputs a score map, a size map and an offset map rather than a box directly, the regression label is not compared against the raw outputs but against the box decoded from them.

2. Bounding box regression of the model’s predicted output

The prediction head outputs three maps:

score_map_ctr, size_map, offset_map = self.get_score_map(x) # Tensor:(4,1,16,16), Tensor:(4,2,16,16), Tensor:(4,2,16,16)

The bounding box is then decoded from them:

 def cal_bbox(self, score_map_ctr, size_map, offset_map, return_score=False):
        max_score, idx = torch.max(score_map_ctr.flatten(1), dim=1, keepdim=True) # The shapes are all Tensor: (4,1) According to the batch, take out the largest score and the corresponding index
        idx_y = idx // self.feat_sz # Tensor:(4,1)
        idx_x = idx % self.feat_sz # Tensor:(4,1)

        idx = idx.unsqueeze(1).expand(idx.shape[0], 2, 1) # Tensor:(4,2,1)
        size = size_map.flatten(2).gather(dim=2, index=idx) # Tensor:(4,2,1)
        offset = offset_map.flatten(2).gather(dim=2, index=idx).squeeze(-1) # Tensor:(4,2)

        # bbox = torch.cat([idx_x - size[:, 0] / 2, idx_y - size[:, 1] / 2,
        #                   idx_x + size[:, 0] / 2, idx_y + size[:, 1] / 2], dim=1) / self.feat_sz
        # cx, cy, w, h
        bbox = torch.cat([(idx_x.to(torch.float) + offset[:, :1]) / self.feat_sz,
                          (idx_y.to(torch.float) + offset[:, 1:]) / self.feat_sz,
                          size.squeeze(-1)], dim=1) # Tensor:(4,4)

        if return_score:
            return bbox, max_score
        return bbox

The result is in center form (cx, cy, w, h), normalized to [0, 1]. During training it is used directly to compute the loss.
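During training OSTrack combines a GIoU loss and an L1 loss on the decoded box, plus a focal loss on the score map (not shown). A minimal sketch of the box part, assuming a recent torchvision; the boxes are random stand-ins and the 2.0 / 5.0 weights are illustrative:

import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou_loss

# Decoded predictions and aligned gt boxes, both (B, 4) in normalized (cx, cy, w, h) form
pred_boxes = torch.rand(4, 4) * 0.4 + 0.3
gt_boxes = torch.rand(4, 4) * 0.4 + 0.3

pred_xyxy = box_convert(pred_boxes, in_fmt='cxcywh', out_fmt='xyxy')
gt_xyxy = box_convert(gt_boxes, in_fmt='cxcywh', out_fmt='xyxy')

giou = generalized_box_iou_loss(pred_xyxy, gt_xyxy, reduction='mean')
l1 = F.l1_loss(pred_boxes, gt_boxes)
loss = 2.0 * giou + 5.0 * l1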

At inference time,

pred_box = (pred_boxes.mean(
            dim=0) * self.params.search_size / resize_factor).tolist() # (cx, cy, w, h) in [0,1], scaled back to original-image scale

The predicted box is normalized to [0, 1] of the search region, so it is first multiplied by self.params.search_size to get crop-image coordinates and then divided by resize_factor, which maps those coordinates back to the scale of the original image.

 def map_box_back(self, pred_box: list, resize_factor: float):
        cx_prev, cy_prev = self.state[0] + 0.5 * self.state[2], self.state[1] + 0.5 * self.state[3]
        cx, cy, w, h = pred_box
        half_side = 0.5 * self.params.search_size / resize_factor
        cx_real = cx + (cx_prev - half_side)
        cy_real = cy + (cy_prev - half_side)
        return [cx_real - 0.5 * w, cy_real - 0.5 * h, w, h]

Here self.state is the predicted bbox of the previous frame, in original-image coordinates. The current prediction is expressed in the coordinate system of the cropped search region, so mapping it back to the original image requires the offset between the two coordinate systems. Since the crop is centered on the previous frame's box, subtracting half the crop side length (half_side, already in original-image scale) from the previous center gives the crop's origin in the original image; adding that offset to the predicted center yields the prediction in original-image coordinates.
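A small numeric walk-through of map_box_back, with all values made up for illustration:

import math

# Previous-frame prediction (x, y, w, h) in original-image coordinates
state = [300.0, 200.0, 80.0, 60.0]
search_size = 256
crop_sz = math.ceil(math.sqrt(80 * 60) * 4)                 # 278
resize_factor = search_size / crop_sz                        # ~0.921

# Assume the head's output has already been scaled by search_size / resize_factor,
# so (cx, cy, w, h) below are in original-image scale but relative to the crop
cx, cy, w, h = 145.0, 160.0, 82.0, 58.0

cx_prev, cy_prev = state[0] + 0.5 * state[2], state[1] + 0.5 * state[3]   # (340, 230)
half_side = 0.5 * search_size / resize_factor                             # = crop_sz / 2 = 139
cx_real = cx + (cx_prev - half_side)                                      # 145 + 201 = 346
cy_real = cy + (cy_prev - half_side)                                      # 160 +  91 = 251
print([cx_real - 0.5 * w, cy_real - 0.5 * h, w, h])                       # [305.0, 222.0, 82.0, 58.0]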