SSD model code walkthrough: the predict (image detection) part

Training of the SSD model builds on the prediction pipeline, so the predict part should be understood before studying training.

 if mode == "predict":
        '''
        1. If you want to save the detected image, use r_image.save("img.jpg") to save it, and modify it directly in predict.py.
        2. If you want to get the coordinates of the prediction frame, you can enter the ssd.detect_image function and read the four values of top, left, bottom, and right in the drawing part.
        3. If you want to use the prediction frame to intercept the target, you can enter the ssd.detect_image function, and use the obtained four values of top, left, bottom, and right in the drawing part
        Use the matrix method to intercept on the original image.
        4. If you want to write additional words on the prediction map, such as the number of specific targets detected, you can enter the ssd.detect_image function and judge the predicted_class in the drawing part.
        For example, judge if predicted_class == 'car': to judge whether the current target is a car, and then record the quantity. Use draw.text to write.
        '''
        while True:
            # img = input('Input image filename:')
            img="E:\deeplearning-project\ssd-pytorch-master\img\street.jpg"
            try:
                image = Image. open(img)
            except:
                print('Open Error! Try again!')
                continue
            else:
                r_image = ssd. detect_image(image, crop = crop, count=count)
                r_image. show()

As shown above, the image is read and passed to the ssd.detect_image function, which returns the image with the predicted boxes drawn on it.

Expanding the ssd.detect_image function gives the following code:

 def detect_image(self, image, crop = False, count = False):
        #---------------------------------------------------#
        #   Compute the height and width of the input image
        #---------------------------------------------------#
        image_shape = np.array(np.shape(image)[0:2])
        #---------------------------------------------------------#
        #   Convert the image to RGB here to prevent grayscale images from raising errors during prediction.
        #   The code only supports prediction on RGB images; all other image types are converted to RGB.
        #---------------------------------------------------------#
        image = cvtColor(image)
        #---------------------------------------------------------#
        #   Add grey bars to the image to achieve a distortion-free resize.
        #   You can also resize directly for recognition.
        #---------------------------------------------------------#
        image_data = resize_image(image, (self.input_shape[1], self.input_shape[0]), self.letterbox_image)
        #---------------------------------------------------------#
        #   Add the batch_size dimension, preprocess the image, and normalize it.
        #---------------------------------------------------------#
        image_data = np.expand_dims(np.transpose(preprocess_input(np.array(image_data, dtype='float32')), (2, 0, 1)), 0)

        with torch.no_grad():
            #---------------------------------------------------#
            #   Convert to a torch tensor
            #---------------------------------------------------#
            images = torch.from_numpy(image_data).type(torch.FloatTensor)
            if self.cuda:
                images = images.cuda()
            #---------------------------------------------------------#
            #   Feed the image into the network for prediction!
            #   The output contains the box-regression parameters for the anchors
            #   at every grid point of each feature layer, plus the class confidences.
            #---------------------------------------------------------#
            outputs = self.net(images)
            #-----------------------------------------------------------#
            #   Decode the prediction result
            #-----------------------------------------------------------#
            results = self.bbox_util.decode_box(outputs, self.anchors, image_shape, self.input_shape, self.letterbox_image,
                                                    nms_iou = self.nms_iou, confidence = self.confidence)

We know that passing the image through outputs = self.net(images) gives the model's raw predicted box values and confidences, and that this output is then sent to the decode_box function for processing. Note that the anchors are passed in at the same time.
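To make the shapes concrete, here is a minimal sketch, assuming outputs is a (loc, conf) pair as in the usual SSD300/VOC setup with 21 classes; the tensor names and random values below are only illustrative, not the repository's actual code.

    import torch

    batch_size, num_anchors, num_classes = 1, 8732, 21

    # loc: per-anchor box regression values (offsets relative to the anchors)
    loc  = torch.randn(batch_size, num_anchors, 4)
    # conf: per-anchor class scores, one column per class (background included)
    conf = torch.randn(batch_size, num_anchors, num_classes)

    # decode_box combines loc with the fixed anchors to recover box coordinates
    # and uses conf to score and filter the boxes.
    print(loc.shape, conf.shape)   # torch.Size([1, 8732, 4]) torch.Size([1, 8732, 21])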

Here, first of all, we need to understand the relationship between the anchor box (anchor), the prior box (prior bounding box), and the ground-truth box.

First, let's look at the coordinate formats of these three kinds of boxes.

The conclusion first: the anchor, the decoded prior box, and the ground-truth box are all in the [x1, y1, x2, y2] format, because this format lets us compute intersections and unions (IoU) directly.
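As a quick illustration of why the corner format is convenient, here is a minimal IoU sketch (not part of the repository's code): the intersection rectangle is obtained directly with max/min of the corner coordinates.

    def iou_xyxy(box_a, box_b):
        # Both boxes are in [x1, y1, x2, y2] format
        inter_x1 = max(box_a[0], box_b[0])
        inter_y1 = max(box_a[1], box_b[1])
        inter_x2 = min(box_a[2], box_b[2])
        inter_y2 = min(box_a[3], box_b[3])
        inter = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    print(iou_xyxy([0, 0, 2, 2], [1, 1, 3, 3]))   # 1 / 7 ≈ 0.1429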

Let's talk about the anchor box (anchor) first. The anchor format is [x1, y1, x2, y2], i.e. the coordinates of the upper-left and lower-right corners. SSD outputs 6 feature layers, and the anchors are generated on the grid of each layer. Take one feature layer as an example: a 3*3 layer has 9 cells, and each cell produces several anchors (4 or 6, depending on the layer) as initial candidate boxes; the other feature layers work the same way. Finally, the anchors produced by all feature layers are concatenated with np.concatenate(anchors, axis=0), giving an anchor array of shape (8732, 4). In other words, the SSD feature layers always produce a fixed total of 8732 anchors (38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4 = 8732). Note that the anchors generated here have nothing to do with the image being predicted: no matter what image is input, these anchors are fixed. Their coordinate format is [x1, y1, x2, y2]. See other articles for the generation details.
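A rough sketch of this concatenation is shown below (illustrative only, not the repository's generation code); the boxes are placeholders, and the real code derives them from each cell's center plus per-layer scales and aspect ratios.

    import numpy as np

    feature_sizes    = [38, 19, 10, 5, 3, 1]   # grid size of each of the 6 feature layers
    anchors_per_cell = [4,  6,  6,  6, 4, 4]   # anchors generated per grid cell

    anchors = []
    for fsize, per_cell in zip(feature_sizes, anchors_per_cell):
        # Placeholder boxes in [x1, y1, x2, y2]; the real generation uses
        # the cell centers, scales and aspect ratios of each layer.
        anchors.append(np.zeros((fsize * fsize * per_cell, 4)))

    anchors = np.concatenate(anchors, axis=0)
    print(anchors.shape)   # (8732, 4) = 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4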

The prior box corresponds to the outputs mentioned above, i.e. the numerical result directly produced by the model. It is called a numerical result here because it still needs to be decoded to obtain actual coordinates, and the anchors mentioned above are used in this decoding process. Pay close attention to the coordinate transformation that follows.

The prior box also ends up as the coordinates of a rectangle, but it has to be decoded first. In the decoding step, the anchor format is first transformed into [x, y, w, h], where (x, y) is the center point of the rectangle. Looking at the code, we can see that the four values of the prior box are actually offsets relative to the anchor's [x, y, w, h]. After applying the offsets we get the prior box's [x, y, w, h]; before returning, this is converted back to the [x1, y1, x2, y2] format, yielding a (8732, 4) array.

 def decode_boxes(self, mbox_loc, anchors, variances):
        # Width and height of the anchors
        anchor_width = anchors[:, 2] - anchors[:, 0]
        anchor_height = anchors[:, 3] - anchors[:, 1]
        # Center point of the anchors
        anchor_center_x = 0.5 * (anchors[:, 2] + anchors[:, 0])
        anchor_center_y = 0.5 * (anchors[:, 3] + anchors[:, 1])

        # Offset of the decoded box center from the anchor center along x and y
        decode_bbox_center_x = mbox_loc[:, 0] * anchor_width * variances[0]
        decode_bbox_center_x += anchor_center_x
        decode_bbox_center_y = mbox_loc[:, 1] * anchor_height * variances[0]
        decode_bbox_center_y += anchor_center_y

        # Width and height of the decoded box
        decode_bbox_width = torch.exp(mbox_loc[:, 2] * variances[1])
        decode_bbox_width *= anchor_width
        decode_bbox_height = torch.exp(mbox_loc[:, 3] * variances[1])
        decode_bbox_height *= anchor_height

        # Upper-left and lower-right corners of the decoded box
        decode_bbox_xmin = decode_bbox_center_x - 0.5 * decode_bbox_width
        decode_bbox_ymin = decode_bbox_center_y - 0.5 * decode_bbox_height
        decode_bbox_xmax = decode_bbox_center_x + 0.5 * decode_bbox_width
        decode_bbox_ymax = decode_bbox_center_y + 0.5 * decode_bbox_height

        # Stack the upper-left and lower-right corners of the decoded box
        decode_bbox = torch.cat((decode_bbox_xmin[:, None],
                                      decode_bbox_ymin[:, None],
                                      decode_bbox_xmax[:, None],
                                      decode_bbox_ymax[:, None]), dim=-1)
        # Clamp the result to the [0, 1] range
        # torch.max keeps the element-wise larger value, torch.min keeps the element-wise smaller value
        decode_bbox = torch.min(torch.max(decode_bbox, torch.zeros_like(decode_bbox)), torch.ones_like(decode_bbox))
        return decode_bbox
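A quick sanity check of this decoding logic (illustrative values, not the repository's code): with a zero regression output the decoded box is exactly the anchor, which shows that mbox_loc only encodes offsets relative to the anchors. The variances [0.1, 0.2] are the values commonly used in SSD.

    import torch

    anchors   = torch.tensor([[0.2, 0.2, 0.6, 0.6]])   # one anchor, [x1, y1, x2, y2]
    mbox_loc  = torch.zeros(1, 4)                       # network regression output, all zeros
    variances = [0.1, 0.2]

    # Reproduce the decoding by hand for this single box
    w,  h  = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    cx, cy = 0.5 * (anchors[:, 0] + anchors[:, 2]), 0.5 * (anchors[:, 1] + anchors[:, 3])
    cx = mbox_loc[:, 0] * w * variances[0] + cx         # stays 0.4
    cy = mbox_loc[:, 1] * h * variances[0] + cy         # stays 0.4
    w  = torch.exp(mbox_loc[:, 2] * variances[1]) * w   # stays 0.4
    h  = torch.exp(mbox_loc[:, 3] * variances[1]) * h   # stays 0.4
    print(cx - w/2, cy - h/2, cx + w/2, cy + h/2)       # 0.2, 0.2, 0.6, 0.6, i.e. the anchor itself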

Finally, the ground-truth box: it is the manually annotated box, used only in the training stage and not in the prediction stage, so it is not covered here.

Processing the prior boxes with NMS

As shown above, the prior boxes have now been obtained, with coordinates in [x1, y1, x2, y2] format. The network output also contains the class confidences for each prior box, with shape (8732, 21).

For the NMS processing, see https://blog.csdn.net/qq_38316300/article/details/120174900?spm=1001.2014.3001.5506
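For orientation, here is a minimal per-class post-processing sketch, assuming a (8732, 4) tensor of decoded boxes and a (8732, 21) tensor of class confidences; it uses torchvision.ops.nms rather than the repository's own NMS code, and the thresholds are illustrative.

    import torch
    from torchvision.ops import nms

    def filter_and_nms(decode_bbox, conf, confidence=0.5, nms_iou=0.45):
        # decode_bbox: (8732, 4) decoded boxes in [x1, y1, x2, y2]
        # conf:        (8732, 21) class confidences, column 0 = background
        results = []
        for c in range(1, conf.shape[1]):             # skip the background class
            scores = conf[:, c]
            mask = scores > confidence                # confidence threshold
            if mask.sum() == 0:
                continue
            boxes_c, scores_c = decode_bbox[mask], scores[mask]
            keep = nms(boxes_c, scores_c, nms_iou)    # suppress overlapping boxes of this class
            for i in keep:
                results.append((boxes_c[i], scores_c[i], c))
        return results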

The prior boxes obtained after this processing, in other words the prediction boxes, are still normalized values, so they need to be restored to the real image size. After restoration we get the real coordinates of the prediction result in [x1, y1, x2, y2] format; drawing them on the original image gives the result image shown at the top.

 if len(results[-1]) > 0:
                results[-1] = np.array(results[-1])
                # ssd_correct_boxes expects [x, y, w, h] input, so convert the boxes from [x1, y1, x2, y2] to [x, y, w, h]
                box_xy, box_wh = (results[-1][:, 0:2] + results[-1][:, 2:4])/2, results[-1][:, 2:4] - results[-1][:, 0:2]
                # The results above are still normalized values
                # ssd_correct_boxes below converts them to actual pixel sizes on the original image
                results[-1][:, :4] = self.ssd_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)
                # The corrected boxes come back in [x1, y1, x2, y2] format
The ssd_correct_boxes function, which performs this restoration, is shown below:

    def ssd_correct_boxes(self, box_xy, box_wh, input_shape, image_shape, letterbox_image):
        #-----------------------------------------------------------------#
        #   Put the y axis first, because it is convenient for multiplying
        #   the boxes by the height and width of the image
        #-----------------------------------------------------------------#
        box_yx = box_xy[..., ::-1]
        box_hw = box_wh[..., ::-1]
        input_shape = np.array(input_shape)
        image_shape = np.array(image_shape)

        if letterbox_image:
            #-----------------------------------------------------------------#
            #   The offset here is the offset of the valid image area relative
            #   to the upper-left corner of the padded (letterboxed) image
            #   new_shape is the scaled height and width of the image
            #-----------------------------------------------------------------#
            new_shape = np.round(image_shape * np.min(input_shape/image_shape))
            offset = (input_shape - new_shape)/2./input_shape
            scale = input_shape/new_shape

            box_yx = (box_yx - offset) * scale
            box_hw *= scale

        box_mins = box_yx - (box_hw / 2.)
        box_maxes = box_yx + (box_hw / 2.)
        boxes = np.concatenate([box_mins[..., 0:1], box_mins[..., 1:2], box_maxes[..., 0:1], box_maxes[..., 1:2]], axis=-1)
        boxes *= np.concatenate([image_shape, image_shape], axis=-1)
        return boxes
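A worked example of the letterbox correction, with assumed sizes (not taken from the repository): a 375x500 (h, w) photo resized into a 300x300 network input.

    import numpy as np

    input_shape = np.array([300, 300])   # network input (h, w)
    image_shape = np.array([375, 500])   # original image (h, w)

    new_shape = np.round(image_shape * np.min(input_shape / image_shape))   # [225. 300.]
    offset    = (input_shape - new_shape) / 2. / input_shape                # [0.125 0.   ]
    scale     = input_shape / new_shape                                     # [1.333 1.   ]

    print(new_shape, offset, scale)
    # The grey bars take up 12.5% of the height on each side, so normalized y
    # coordinates are shifted by 0.125 and stretched by 4/3 to undo the letterbox.

After this correction the boxes are multiplied by the original image size, so the returned coordinates are in pixels and can be drawn on the original picture directly.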