Contents
1. Crop and label settings
2. Bounding box regression of the model’s predicted output
1. Crop and label settings
1. Add jitter to the ground-truth box to get the jittered bounding box
jittered_anno = [self._get_jittered_box(a, s) for a in data[s + '_anno']]
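The body of _get_jittered_box is not shown here. As a minimal sketch of what such a jitter step could look like, the following assumes the size is scaled by a random log-normal factor and the center is shifted by a bounded random amount. The function name jitter_box and the two factors are hypothetical, not taken from the source.

```python
import math
import random


def jitter_box(box, scale_jitter=0.25, center_jitter=3.0, rng=random):
    """Return a randomly perturbed copy of box = [x, y, w, h].

    Hypothetical stand-in for _get_jittered_box; both factors are assumptions.
    """
    x, y, w, h = box
    # Scale width and height by independent log-normal factors.
    jw = w * math.exp(rng.gauss(0.0, 1.0) * scale_jitter)
    jh = h * math.exp(rng.gauss(0.0, 1.0) * scale_jitter)
    # Shift the center by at most half of max_offset in each direction.
    max_offset = math.sqrt(jw * jh) * center_jitter
    cx = x + 0.5 * w + max_offset * (rng.random() - 0.5)
    cy = y + 0.5 * h + max_offset * (rng.random() - 0.5)
    # Convert back from center form to top-left form.
    return [cx - 0.5 * jw, cy - 0.5 * jh, jw, jh]


jittered = jitter_box([100.0, 80.0, 40.0, 20.0], rng=random.Random(0))
```

With both factors set to 0 the box is returned unchanged, which is a convenient sanity check.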
2. Crop around the center of the jittered bounding box
First compute the side length of the search crop: the square root of the jittered bounding box area multiplied by the search area factor,
crop_sz = torch.ceil(torch.sqrt(w * h) * self.search_area_factor[s])
Then crop and pad:
import math

import cv2 as cv
import numpy as np
import torch.nn.functional as F


def sample_target(im, target_bb, search_area_factor, output_sz=None, mask=None):
    """ Extracts a square crop centered at target_bb box, of area search_area_factor^2 times target_bb area

    args:
        im - cv image
        target_bb - target box [x, y, w, h]
        search_area_factor - Ratio of crop size to target size
        output_sz - (float) Size to which the extracted crop is resized (always square). If None, no resizing is done.

    returns:
        cv image - extracted crop
        float - the factor by which the crop has been resized to make the crop size equal output_size
    """
    if not isinstance(target_bb, list):
        x, y, w, h = target_bb.tolist()
    else:
        x, y, w, h = target_bb
    # Crop image
    crop_sz = math.ceil(math.sqrt(w * h) * search_area_factor)  # 466

    if crop_sz < 1:
        raise Exception('Too small bounding box.')

    x1 = round(x + 0.5 * w - crop_sz * 0.5)
    x2 = x1 + crop_sz

    y1 = round(y + 0.5 * h - crop_sz * 0.5)
    y2 = y1 + crop_sz

    x1_pad = max(0, -x1)
    x2_pad = max(x2 - im.shape[1] + 1, 0)

    y1_pad = max(0, -y1)
    y2_pad = max(y2 - im.shape[0] + 1, 0)

    # Crop target
    im_crop = im[y1 + y1_pad:y2 - y2_pad, x1 + x1_pad:x2 - x2_pad, :]  # ndarray:(466,466,3)
    if mask is not None:
        mask_crop = mask[y1 + y1_pad:y2 - y2_pad, x1 + x1_pad:x2 - x2_pad]  # Tensor:(466,466)

    # Pad: fill with a constant border if the crop extends past the image border
    im_crop_padded = cv.copyMakeBorder(im_crop, y1_pad, y2_pad, x1_pad, x2_pad, cv.BORDER_CONSTANT)  # ndarray:(466,466,3)
    # deal with attention mask
    H, W, _ = im_crop_padded.shape  # 466, 466, 3
    att_mask = np.ones((H, W))  # ndarray:(466,466)
    end_x, end_y = -x2_pad, -y2_pad  # 0, 0
    if y2_pad == 0:
        end_y = None
    if x2_pad == 0:
        end_x = None
    att_mask[y1_pad:end_y, x1_pad:end_x] = 0
    if mask is not None:  # True
        mask_crop_padded = F.pad(mask_crop, pad=(x1_pad, x2_pad, y1_pad, y2_pad), mode='constant', value=0)
3. Resize
    if output_sz is not None:  # True
        resize_factor = output_sz / crop_sz
        im_crop_padded = cv.resize(im_crop_padded, (output_sz, output_sz))  # ndarray:(128,128,3)
        att_mask = cv.resize(att_mask, (output_sz, output_sz)).astype(np.bool_)  # ndarray:(128,128), bool type
        if mask is None:
            return im_crop_padded, resize_factor, att_mask
        mask_crop_padded = \
            F.interpolate(mask_crop_padded[None, None], (output_sz, output_sz), mode='bilinear', align_corners=False)[0, 0]  # Tensor:(128,128)
        return im_crop_padded, resize_factor, att_mask, mask_crop_padded
The crop is resized to the network input size. The ratio output_sz / crop_sz is recorded here as resize_factor and is used later. At this point the cropped input image is fixed, but the labels are not yet aligned with it.
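To make the crop arithmetic concrete, here is a small standalone sketch that repeats only the geometry of sample_target (no image data) for a box near the top-left corner of an image. The helper name crop_geometry, the 640x480 image size and the box values are made up for illustration.

```python
import math


def crop_geometry(box, search_area_factor, im_h, im_w, output_sz):
    """Return crop corners, per-side padding and resize factor for box = [x, y, w, h]."""
    x, y, w, h = box
    # Side length of the square crop, as in sample_target.
    crop_sz = math.ceil(math.sqrt(w * h) * search_area_factor)
    x1 = round(x + 0.5 * w - crop_sz * 0.5)
    y1 = round(y + 0.5 * h - crop_sz * 0.5)
    x2, y2 = x1 + crop_sz, y1 + crop_sz
    # Padding needed on each side where the crop leaves the image.
    pads = {
        "left": max(0, -x1),
        "right": max(x2 - im_w + 1, 0),
        "top": max(0, -y1),
        "bottom": max(y2 - im_h + 1, 0),
    }
    return (x1, y1, x2, y2), pads, output_sz / crop_sz


corners, pads, resize_factor = crop_geometry(
    [10, 20, 100, 64], search_area_factor=4.0, im_h=480, im_w=640, output_sz=256)
print(corners)        # (-100, -108, 220, 212)
print(pads)           # {'left': 100, 'right': 0, 'top': 108, 'bottom': 0}
print(resize_factor)  # 0.8
```

The crop side is sqrt(100 * 64) * 4 = 320 pixels, so the crop sticks out 100 pixels to the left and 108 pixels above the image, and those regions are the ones marked in att_mask.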
4. Align labels
def transform_image_to_crop(box_in: torch.Tensor, box_extract: torch.Tensor, resize_factor: float,
                            crop_sz: torch.Tensor, normalize=False) -> torch.Tensor:
    """ Transform the box co-ordinates from the original image co-ordinates to the co-ordinates of the cropped image

    args:
        box_in - the box for which the co-ordinates are to be transformed
        box_extract - the box about which the image crop has been extracted.
        resize_factor - the ratio between the original image scale and the scale of the image crop
        crop_sz - size of the cropped image

    returns:
        torch.Tensor - transformed co-ordinates of box_in
    """
    box_extract_center = box_extract[0:2] + 0.5 * box_extract[2:4]

    box_in_center = box_in[0:2] + 0.5 * box_in[2:4]

    box_out_center = (crop_sz - 1) / 2 + (box_in_center - box_extract_center) * resize_factor
    box_out_wh = box_in[2:4] * resize_factor

    box_out = torch.cat((box_out_center - 0.5 * box_out_wh, box_out_wh))
    if normalize:
        return box_out / crop_sz[0]
    else:
        return box_out
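First compute the center of the jittered bounding box (box_extract) and the center of the ground-truth bounding box (box_in), each as (x + 0.5 * w, y + 0.5 * h),
where x and y are the coordinates of the top-left corner of the bounding box.
Next align the label: the ground-truth center is expressed relative to the crop center, (crop_sz - 1) / 2 + (box_in_center - box_extract_center) * resize_factor, and the width and height are scaled by resize_factor.
Here crop_sz is output_sz, the input size of the network (not the pre-resize crop_sz used inside sample_target).
Then the center form is converted back to the top-left form (x, y, w, h) and normalized:
return box_out / crop_sz[0]
That is, all four values are divided by the input size, e.g. 384 or 256.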
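The alignment can be checked by hand with a plain-Python version of the same arithmetic. This is only a sketch: box_to_crop is a hypothetical scalar rewrite of transform_image_to_crop, and the boxes, the resize factor 0.8 and the input size 256 are example values.

```python
def box_to_crop(box_in, box_extract, resize_factor, crop_sz, normalize=False):
    """Map box_in = [x, y, w, h] from image coordinates into the resized crop."""
    # Centers of the box used for cropping and of the box to transform.
    ex_cx = box_extract[0] + 0.5 * box_extract[2]
    ex_cy = box_extract[1] + 0.5 * box_extract[3]
    in_cx = box_in[0] + 0.5 * box_in[2]
    in_cy = box_in[1] + 0.5 * box_in[3]
    # Offset from the crop center, scaled into the resized crop.
    out_cx = (crop_sz - 1) / 2 + (in_cx - ex_cx) * resize_factor
    out_cy = (crop_sz - 1) / 2 + (in_cy - ex_cy) * resize_factor
    out_w = box_in[2] * resize_factor
    out_h = box_in[3] * resize_factor
    # Back to top-left form, optionally normalized by the crop size.
    out = [out_cx - 0.5 * out_w, out_cy - 0.5 * out_h, out_w, out_h]
    if normalize:
        return [v / crop_sz for v in out]
    return out


gt_box = [10, 20, 100, 64]        # ground-truth box, center (60, 52)
jittered_box = [14, 18, 100, 64]  # box used for cropping, center (64, 50)
aligned = box_to_crop(gt_box, jittered_box, resize_factor=0.8, crop_sz=256)
# aligned is approximately [84.3, 103.5, 80.0, 51.2]
normalized = box_to_crop(gt_box, jittered_box, 0.8, 256, normalize=True)
```

The ground-truth center lies 4 pixels left of and 2 pixels below the jittered center, so in the crop it lands at 127.5 - 3.2 = 124.3 and 127.5 + 1.6 = 129.1.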
5. Generate the label that the head needs to predict
The steps above are not the end: they only crop the input and align the gt bbox. We still need to generate the labels that the head is trained to predict.
1) Classification label
Generate a Gaussian map centered on the center coordinates of the gt bbox
def generate_heatmap(bboxes, patch_size=320, stride=16):  # Tensor:(1,4,4), 256, 16
    """
    Generate ground truth heatmap same as CenterNet
    Args:
        bboxes (torch.Tensor): shape of [num_search, bs, 4]

    Returns:
        gaussian_maps: list of generated heatmaps
    """
    gaussian_maps = []
    heatmap_size = patch_size // stride  # 16
    for single_patch_bboxes in bboxes:  # Tensor:(4,4)
        bs = single_patch_bboxes.shape[0]  # 4
        gt_scoremap = torch.zeros(bs, heatmap_size, heatmap_size)  # Tensor:(4,16,16)
        classes = torch.arange(bs).to(torch.long)  # tensor:([0,1,2,3])
        bbox = single_patch_bboxes * heatmap_size  # Tensor:(4,4)
        wh = bbox[:, 2:]  # Tensor:(4,2)
        centers_int = (bbox[:, :2] + wh / 2).round()  # Tensor:(4,2) center point
        CenterNetHeatMap.generate_score_map(gt_scoremap, classes, wh, centers_int, 0.7)
        gaussian_maps.append(gt_scoremap.to(bbox.device))
    return gaussian_maps
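The body of CenterNetHeatMap.generate_score_map is not shown, so the following is only a simplified sketch of the idea: place a Gaussian peak of value 1 at the rounded box center on the heatmap grid. It assumes a fixed sigma, whereas the source call also passes the box sizes and a 0.7 parameter; the function name gaussian_heatmap and the example box are hypothetical.

```python
import math


def gaussian_heatmap(bbox_norm, heatmap_size=16, sigma=1.0):
    """Build a heatmap_size x heatmap_size map for a normalized box [x, y, w, h].

    Simplification: sigma is fixed instead of being derived from the box size.
    """
    x, y, w, h = bbox_norm
    # Box center on the heatmap grid, rounded to the nearest cell.
    cx = round((x + w / 2) * heatmap_size)
    cy = round((y + h / 2) * heatmap_size)
    # Rows index y, columns index x.
    return [
        [math.exp(-((col - cx) ** 2 + (row - cy) ** 2) / (2 * sigma ** 2))
         for col in range(heatmap_size)]
        for row in range(heatmap_size)
    ]


heatmap = gaussian_heatmap([0.3, 0.4, 0.2, 0.2])
# The box center (0.4, 0.5) maps to column 6, row 8, where the map equals 1.
```

Cells next to the peak get values below 1 that fall off with distance, which gives the classification branch a soft target instead of a single positive cell.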
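2) Regression label
The regression label is the gt bbox itself; note that the gt bbox here has already been normalized.
Because the network outputs a score map, a size map and an offset map rather than a box, the regression label supervises these outputs indirectly: they are first decoded into a bounding box, which is then compared with the gt bbox.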
2. Bounding box regression of the model’s predicted output
The output head produces three maps:
score_map_ctr, size_map, offset_map = self.get_score_map(x) # Tensor:(4,1,16,16), Tensor:(4,2,16,16), Tensor:(4,2,16,16)
These are decoded into a bounding box:
def cal_bbox(self, score_map_ctr, size_map, offset_map, return_score=False):
    # For each sample in the batch, take the largest score and its index; both are Tensor:(4,1)
    max_score, idx = torch.max(score_map_ctr.flatten(1), dim=1, keepdim=True)
    idx_y = idx // self.feat_sz  # Tensor:(4,1)
    idx_x = idx % self.feat_sz  # Tensor:(4,1)

    idx = idx.unsqueeze(1).expand(idx.shape[0], 2, 1)  # Tensor:(4,2,1)
    size = size_map.flatten(2).gather(dim=2, index=idx)  # Tensor:(4,2,1)
    offset = offset_map.flatten(2).gather(dim=2, index=idx).squeeze(-1)  # Tensor:(4,2)

    # bbox = torch.cat([idx_x - size[:, 0] / 2, idx_y - size[:, 1] / 2,
    #                   idx_x + size[:, 0] / 2, idx_y + size[:, 1] / 2], dim=1) / self.feat_sz
    # cx, cy, w, h
    bbox = torch.cat([(idx_x.to(torch.float) + offset[:, :1]) / self.feat_sz,
                      (idx_y.to(torch.float) + offset[:, 1:]) / self.feat_sz,
                      size.squeeze(-1)], dim=1)  # Tensor:(4,4)

    if return_score:
        return bbox, max_score
    return bbox
The result is in center form (cx, cy, w, h), normalized to [0, 1]. In the training phase it is used directly to compute the loss.
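The decoding in cal_bbox can be traced on a tiny example. The sketch below handles a single sample with nested lists instead of batched tensors; decode_box is a hypothetical name, and the 4x4 maps are made-up values chosen so that the result is easy to verify.

```python
def decode_box(score_map, size_map, offset_map, feat_sz):
    """Decode one sample into ([cx, cy, w, h], score), all normalized.

    score_map is feat_sz x feat_sz; size_map and offset_map are 2 x feat_sz x feat_sz.
    """
    # Flatten row by row and find the position of the highest score.
    flat = [v for row in score_map for v in row]
    idx = max(range(len(flat)), key=flat.__getitem__)
    idx_y, idx_x = divmod(idx, feat_sz)
    # Read size and sub-cell offset at that position.
    w = size_map[0][idx_y][idx_x]
    h = size_map[1][idx_y][idx_x]
    off_x = offset_map[0][idx_y][idx_x]
    off_y = offset_map[1][idx_y][idx_x]
    # Cell index plus offset, normalized by the feature map size.
    return [(idx_x + off_x) / feat_sz, (idx_y + off_y) / feat_sz, w, h], flat[idx]


feat_sz = 4
score = [[0.0] * feat_sz for _ in range(feat_sz)]
score[2][1] = 0.9  # peak at row 2, column 1
size = [[[0.5] * feat_sz for _ in range(feat_sz)],
        [[0.25] * feat_sz for _ in range(feat_sz)]]
offset = [[[0.2] * feat_sz for _ in range(feat_sz)],
          [[0.6] * feat_sz for _ in range(feat_sz)]]
box, best = decode_box(score, size, offset, feat_sz)
# box is approximately [0.3, 0.65, 0.5, 0.25], best is 0.9
```

The peak cell only locates the center to the nearest grid cell; the offset map supplies the fractional part that was lost when the center was quantized to the grid.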
In the inference stage,
pred_box = (pred_boxes.mean(dim=0) * self.params.search_size / resize_factor).tolist() # (cx, cy, w, h) in [0, 1], multiplied by the search size to undo the normalization
This undoes the normalization: multiplying by search_size converts the predicted bbox to pixels in the resized crop, and dividing by resize_factor brings it to the scale of the crop before resizing, which is the same scale as the original image.
def map_box_back(self, pred_box: list, resize_factor: float):
    cx_prev, cy_prev = self.state[0] + 0.5 * self.state[2], self.state[1] + 0.5 * self.state[3]
    cx, cy, w, h = pred_box
    half_side = 0.5 * self.params.search_size / resize_factor
    cx_real = cx + (cx_prev - half_side)
    cy_real = cy + (cy_prev - half_side)
    return [cx_real - 0.5 * w, cy_real - 0.5 * h, w, h]
Here self.state is the predicted bbox of the previous frame. At this point the predicted bbox is expressed in the coordinates of the cropped image, so to return it to the coordinates of the original image we need the translation between the coordinate system of the cropped image and that of the original image. The crop was centered on the previous frame's predicted bbox, so subtracting the crop's center coordinates (half_side) from the previous frame's predicted center gives the position of the crop origin in the original image. Adding this translation to the predicted center gives the prediction in original image coordinates.
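A quick way to check the mapping is to feed in a prediction that sits exactly at the crop center with the same size as the previous box: the result should then be the previous box again. The sketch below is a standalone version of map_box_back that takes the previous state and search size as arguments instead of reading them from self; all numbers are example values.

```python
def map_box_back(pred_box, prev_state, search_size, resize_factor):
    """Map pred_box = (cx, cy, w, h), given in crop coordinates at original-image
    scale, back to an [x, y, w, h] box in the original image."""
    cx_prev = prev_state[0] + 0.5 * prev_state[2]
    cy_prev = prev_state[1] + 0.5 * prev_state[3]
    cx, cy, w, h = pred_box
    # Half the crop side at original-image scale, i.e. the crop center.
    half_side = 0.5 * search_size / resize_factor
    cx_real = cx + (cx_prev - half_side)
    cy_real = cy + (cy_prev - half_side)
    return [cx_real - 0.5 * w, cy_real - 0.5 * h, w, h]


search_size, resize_factor = 256, 0.8
prev_state = [10, 20, 100, 64]       # previous box, center (60, 52)
pred_norm = [0.5, 0.5, 0.3125, 0.2]  # normalized (cx, cy, w, h) from the head
# Undo the normalization, then undo the resize.
pred_box = [v * search_size / resize_factor for v in pred_norm]
new_state = map_box_back(pred_box, prev_state, search_size, resize_factor)
# new_state is approximately [10, 20, 100, 64]: the target did not move.
```

Any shift of the predicted center away from 0.5 translates directly into the same shift, divided by resize_factor, in the original image.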