Detectron2 Mask R-CNN: video inference/prediction

Foreword

I want to use an existing framework and a pretrained model to run inference on the videos we recorded and get the predicted boxes in COCO format. Not long ago I tried the mmdetection framework: Faster R-CNN worked fine for inference, but Mask R-CNN reported an error along the lines of:
test_cfg specified both in outer and xxx (I forget exactly which) field
After searching for a long time I couldn't find the cause. I asked GPT and it mentioned another framework called Detectron2, so I grabbed my little stool and came over here.

Some of my configuration (for reference only)

Ubuntu (x86_64)
CUDA 11.1
Python 3.8
torch 1.10.0+cu111
torchvision 0.11.0+cu111
detectron2 0.6 (built against cu111)
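
For reference, a quick way to confirm your own environment matches (a minimal sketch; the commented values are what the setup above would print):

# Optional sanity check of the environment (versions shown are the ones assumed above)
import torch
import torchvision
import detectron2

print(torch.__version__)         # e.g. 1.10.0+cu111
print(torchvision.__version__)   # e.g. 0.11.0+cu111
print(detectron2.__version__)    # e.g. 0.6
print(torch.cuda.is_available()) # should be True if CUDA 11.1 is set up correctly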

Inference

The complete code comes first (with plenty of comments); explanations of the code and the download link for the model weights follow below.

import numpy as np
import os
import json
import cv2
import time

from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from pycocotools import mask as maskUtils
from tqdm import tqdm  # progress bar used during inference and the COCO conversion

workdir = "/home/xxxxxx/detectron2-main/configs/COCO-InstanceSegmentation/"

model_cfg = "mask_rcnn_R_50_FPN_3x.yaml"
weights_cfg = "model_final_fpn50.pkl"

image_dir = "/home/xxxxxx/inference/image_video/"
image = "oip.jpg"
video = "6.mp4"

coco_before_partion_dir = "/home/xxxxxx/inference/coco_before_partion/"
coco_json = "video.json"

# Video frame size; filled in by inference_video() and read later in make_coco_video()
frame_width = 0
frame_height = 0

def main():
    # 0. Set paths, file names, and the confidence threshold
    # image_path = os.path.join(image_dir, image) # image address/path
    video_path = os.path.join(image_dir, video) # video address = video directory + video file name
    model_cfg_path = os.path.join(workdir, model_cfg) # model configuration parameter address/path
    weights_cfg_path = os.path.join(workdir, weights_cfg) # weight configuration address/path
    threshold = 0.8 # confidence threshold
    # The address (directory) of the generated coco dataset, without dividing the training set and test set
    # image_save_path = os.path.join(coco_before_partion_dir, coco_json)
    video_save_path = os.path.join(coco_before_partion_dir, coco_json)

    # 1. Load the model used for inference
    predictor, metadata = load_model(
        model_cfg_path = model_cfg_path,
        weights_cfg_path = weights_cfg_path,
        threshold = threshold
    )

    # 2. Run inference on the video; the per-frame results are collected into one list
    results = inference_video(video_path, predictor)

    # 3. Build the COCO-format dataset and save it as a JSON file
    make_coco_video(results, video_save_path, metadata)



def load_model(model_cfg_path, weights_cfg_path, threshold):
    print("-------------------------------------")
    print("Start loading model...")
    
    # record start time
    t = time.time()
    
    # Initialize configuration variables
    cfg = get_cfg()

    # Load the model configuration file
    cfg.merge_from_file(model_cfg_path)

    # Load the pretrained weights file
    cfg.MODEL.WEIGHTS = weights_cfg_path

    # Set the confidence threshold: boxes with a score below it are discarded,
    # only those with a score greater than or equal to it are kept
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = threshold

    # Generate a predictor, which will be used for prediction/inference for each frame of the video
    predictor = DefaultPredictor(cfg)

    # A metadata object containing model information. This metadata provides various information about the loaded model
    metadata = MetadataCatalog.get(cfg.DATASETS.TEST[0])

    # calculation time
    print("Loading model completed!\\
Time-consuming {} s".format(time.time() - t))
    print("-------------------------------------")

    return predictor, metadata


def inference_video(video_path, predictor):
    # read the video to infer
    cap = cv2.VideoCapture(video_path)
    # Get the width, height, and total number of frames of the video
    global frame_width, frame_height
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    print("-------------------------------------")
    print("total frames:", total_frames)
    print("frame width:", frame_width)
    print("frame height:", frame_height)
    print("-------------------------------------")

    # start inference/prediction
    print("Start reasoning...")
    # Define an empty list to store the results
    results = []
    # Set the initial image image and the id of the annotations
    image_id = 1
    annotations_id = 1
    
    # Create progress bar object
    with tqdm(total = total_frames, desc="prediction/inference video frame", unit=" frames") as pbar:
        while cap.isOpened():
            # read video frames
            ret, frame = cap.read()
            if not ret:
                break

            # NOTE: DefaultPredictor expects the frame in BGR order (OpenCV's native format),
            # so the frame read by cv2 is passed in as-is, without color conversion
            # The prediction result of a frame of pictures
            outputs = predictor(frame)

            # Get prediction results.
            """
            Used to transfer Instances objects from GPU to CPU. The purpose of this is to be able to process and visualize the results using conventional CPU operations and libraries such as numpy and OpenCV.
            """
            instances = outputs["instances"].to("cpu")

            # Loop over each detected instance.
            # Convert the predicted boxes to a plain list first, so elements can be indexed directly
            bbox_list = instances.pred_boxes.tensor.tolist()
            for i in range(len(instances)):
                # Get the class of the instance
                class_id = instances.pred_classes[i]

                # Get the confidence of the instance
                score = instances.scores[i]

                # Get the mask of the instance
                mask = instances.pred_masks[i]

                # Get the bounding box of the instance
                # bbox = instances.pred_boxes[i]

                # Convert mask image to segmented region representation using RLE encoding format
                segmentation = maskUtils.encode(np.asfortranarray(mask, dtype=np.uint8))
                segmentation["counts"] = segmentation["counts"].decode("utf-8")
        

                # Add the result to the list
                result = {
                    "id": annotations_id,
                    "image_id": image_id,
                    "category_id": class_id.item(),
                    "score": score.item(),
                    "segmentation": segmentation,
                    "bbox": bbox_list[i],  # one box per instance, in [x1, y1, x2, y2] order
                    "area": int(maskUtils.area(segmentation))
                }
                results.append(result)
                annotations_id += 1
            image_id += 1 # increment image id counter
            # progress bar
            pbar.update(1)

    # release video resources
    cap.release()
    cv2.destroyAllWindows()

    print("Inference completed!")
    print("The video contains the number of instances:{}".format(len(results)))
    print("-------------------------------------")
    return results



def make_coco_video(results, save_path, metadata):
    print("-------------------------------------")
    print("Generating coco format dataset")
    print("-------------------------------------")
    # Define a dictionary to store the final result
    coco_output = {
        "images": [],
        "annotations": [],
        "categories": []
    }

    # A dictionary for recording which image IDs have already been added.
    '''
    The prediction results from the previous step form a flat list in which every element is a
    single instance, not the set of all instances for one frame. A frame may contribute several
    instances, or just one; each instance occupies its own element in the list.
        For the COCO "images" field we need every image ID exactly once, but we build it while
    iterating over the instances and reading their "image_id" field, which by itself would produce
    duplicates. So we keep a dictionary: before adding an image entry we check whether its ID is
    already present, add it if not, and skip it otherwise.
    '''
    # Set up the image_id dictionary
    image_id_dict = {}

    # Create a progress bar object
    with tqdm(total=len(results), desc="convert to coco format", unit=" instances") as pbar:
        # Iterate over each instance, grouping the results by image
        for i, result in enumerate(results):
            image_id = result["image_id"]

            # Check if the image id already exists in the dictionary
            if image_id not in image_id_dict:
                # Add image information
                image_info = {
                    "id": image_id,
                    "file_name": "frame{}.jpg".format(image_id),
                    "width": frame_width,
                    "height": frame_height
                }
                # 1. Add image information
                coco_output["images"].append(image_info)
                # Add the image ID to the dictionary
                image_id_dict[image_id] = len(coco_output["images"]) - 1
        
            # 2. Add comment information
            annotation_info = {
                "id": result["id"],
                "image_id": result["image_id"],
                "category_id": result["category_id"],
                "segmentation": result["segmentation"],
                "area": result["area"],
                "bbox": result["bbox"],
                "iscrowd": 0,
                "score": result["score"]
            }
            coco_output["annotations"].append(annotation_info)
            pbar.update(1)

    # 3. Update the "categories" field
    for class_id, class_name in enumerate(metadata.get("thing_classes", [])):
        category_info = {
            "id": class_id,
            "name": class_name,
            "supercategory": ""
        }

        coco_output["categories"].append(category_info)

    print("-------------------------------------")
    print("The coco data set was successfully generated!")
    print("-------------------------------------")
    # Save the final result as a JSON file
    with open(save_path, "w") as f:
        json.dump(coco_output, f)



if __name__ == '__main__':
    main()
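
One thing to keep in mind about the listing above: the bbox field is filled straight from instances.pred_boxes, which Detectron2 stores as absolute [x1, y1, x2, y2] (XYXY) coordinates, whereas the official COCO annotation format uses [x, y, width, height]. If strictly COCO-style boxes are needed, a small conversion with detectron2's BoxMode helper could be applied to bbox_list[i] inside the instance loop; a self-contained sketch:

# Sketch: convert a Detectron2 XYXY_ABS box into the COCO XYWH_ABS format
from detectron2.structures import BoxMode

xyxy_box = [100.0, 50.0, 250.0, 300.0]  # example [x1, y1, x2, y2], as stored in bbox above
xywh_box = BoxMode.convert(xyxy_box, BoxMode.XYXY_ABS, BoxMode.XYWH_ABS)
print(xywh_box)  # [100.0, 50.0, 150.0, 250.0] -> [x, y, width, height]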

0. Set some global variables first:

workdir = "/home/yhy/jzh/detectron2-main/configs/COCO-InstanceSegmentation/"
model_cfg = "mask_rcnn_R_50_FPN_3x.yaml"
weights_cfg = "model_final_fpn50.pkl"

image_dir = "/home/yhy/jzh/inference/image_video/"
image = "oip.jpg"
video = "6.mp4"

coco_before_partion_dir = "/home/yhy/jzh/inference/coco_before_partion/"
coco_json = "video.json"
  1. workdir is the workspace directory.
  2. model_cfg is the model configuration file. The configs ship with detectron2: its configs directory contains many yaml files, each corresponding to a different model.
  3. weights_cfg is the pretrained weight file, which needs to be downloaded from the official MODEL_ZOO page: https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md

The download links are in the rightmost column of the model tables on that page (screenshot omitted). Alternatively, the config and weights can be fetched programmatically, as sketched below.
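
A minimal sketch using detectron2's model_zoo helper with the same Mask R-CNN R50-FPN 3x entry; this resolves the bundled config file and downloads the matching pretrained weights for you:

# Sketch: resolve the config and pretrained weights via detectron2's model_zoo helper
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)
# Downloads the official weights on first use and caches them locally
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)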

  1. image_dir is the directory that holds the images/videos; image and video are an image file and a video file respectively. If you only run inference on videos, image does not need to be set.
  2. frame_width and frame_height are the width and height of the video frames. They are needed in several functions later, so they are set as global variables (this does not feel like a great approach, but I have not thought of a better one yet; one possible alternative is sketched below).
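
One possible way to avoid the globals (just a sketch, not what the full listing above does) is to have inference_video return the frame size alongside the results and pass it on to make_coco_video:

import cv2

# Sketch: return the frame size from inference_video instead of using module-level globals
def inference_video(video_path, predictor):
    cap = cv2.VideoCapture(video_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    results = []
    # ... same per-frame prediction loop as in the full listing above ...
    cap.release()
    return results, width, height

# and in main():
#   results, width, height = inference_video(video_path, predictor)
#   make_coco_video(results, video_save_path, metadata, width, height)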

main() function

def main():
    # 0. Set paths, file names, and the confidence threshold
    # image_path = os.path.join(image_dir, image) # image address/path
    video_path = os.path.join(image_dir, video) # video address = video directory + video file name
    model_cfg_path = os.path.join(workdir, model_cfg) # model configuration parameter address/path
    weights_cfg_path = os.path.join(workdir, weights_cfg) # weight configuration address/path
    threshold = 0.8 # confidence threshold
    # The address (directory) of the generated coco dataset, without dividing the training set and test set
    # image_save_path = os.path.join(coco_before_partion_dir, coco_json)
    video_save_path = os.path.join(coco_before_partion_dir, coco_json)

    # 1. Load the model used for inference
    predictor, metadata = load_model(
        model_cfg_path = model_cfg_path,
        weights_cfg_path = weights_cfg_path,
        threshold = threshold
    )

    # 2. Run inference on the video; the per-frame results are collected into one list
    results = inference_video(video_path, predictor)

    # 3. Build the COCO-format dataset and save it as a JSON file
    make_coco_video(results, video_save_path, metadata)

  1. For the load_model function:

Our load_model function loads the trained model and, in addition to the predictor, returns a metadata object describing the dataset the model was trained on. This metadata provides various information that helps you understand and work with the model's outputs.
The returned metadata is obtained from the MetadataCatalog and commonly carries attributes such as:
model_type: a string describing the model type; for a Mask R-CNN model this would be "mask_rcnn".
thing_classes: a list of the class names used in the detection/segmentation task; each category is a string.
stuff_classes: a list of class names for "stuff" (background) categories; used in semantic/panoptic segmentation, as opposed to thing_classes.
keypoint_names: a list of keypoint names, if the model supports keypoint detection.
keypoint_flip_map: a mapping used to swap keypoints when an image is mirrored horizontally.
thing_dataset_id_to_contiguous_id: a dictionary mapping category IDs in the dataset to the contiguous category IDs used in the model output.
stuff_dataset_id_to_contiguous_id: a dictionary mapping stuff (background) category IDs in the dataset to contiguous category IDs in the model output.
thing_colors: a list of colors, one per instance class, used to visualize instance segmentation results.
stuff_colors: a list of colors, one per stuff (background) class, used to visualize segmentation results.
In the code that follows we mainly use its thing_classes attribute to fill in the categories field of the COCO output.
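
To see concretely what this metadata holds for the COCO-trained model used here, you can inspect it directly. A minimal sketch (for the bundled COCO configs, cfg.DATASETS.TEST[0] resolves to "coco_2017_val"):

# Sketch: inspect the metadata returned by load_model
from detectron2.data import MetadataCatalog

meta = MetadataCatalog.get("coco_2017_val")  # the same name as cfg.DATASETS.TEST[0] here
print(len(meta.thing_classes))               # 80 COCO "thing" classes
print(meta.thing_classes[:5])                # ['person', 'bicycle', 'car', 'motorcycle', 'airplane']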
COCO data format, as produced by this script:

coco_output = {"images": [...], "annotations": [...], "categories": [...]}
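
To make these fields concrete, here is roughly what one entry of each list looks like in the generated video.json (a sketch with made-up values; the keys match what the code above writes):

# Illustrative entries from the generated video.json (values are made up for the example)
example_image = {"id": 1, "file_name": "frame1.jpg", "width": 1920, "height": 1080}
example_annotation = {
    "id": 1, "image_id": 1, "category_id": 0, "score": 0.97,
    "segmentation": {"size": [1080, 1920], "counts": "<RLE string>"},
    "bbox": [412.3, 230.8, 655.1, 720.4],  # taken straight from pred_boxes, i.e. [x1, y1, x2, y2]
    "area": 52731,
    "iscrowd": 0
}
example_category = {"id": 0, "name": "person", "supercategory": ""}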
  2. Inference/Prediction
results = inference_video(video_path, predictor)

For the video at video_path, use the loaded predictor to run inference frame by frame; a sketch of what the raw per-frame output looks like follows below.
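
If you want to poke at the raw predictor output for a single frame before it is flattened into the results list, a minimal sketch (assuming predictor was built by load_model and frame is one BGR frame read with OpenCV):

# Sketch: inspect the predictor output for one frame
outputs = predictor(frame)                  # frame: one BGR image from cv2.VideoCapture
instances = outputs["instances"].to("cpu")  # move the results to the CPU for numpy/OpenCV use

print(len(instances))                       # number of detected instances in this frame
print(instances.pred_classes)               # tensor of contiguous class ids
print(instances.scores)                     # tensor of confidence scores
print(instances.pred_boxes.tensor)          # N x 4 tensor of [x1, y1, x2, y2] boxes
print(instances.pred_masks.shape)           # N x H x W boolean masks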

  3. Convert the results into COCO-format data
make_coco_video(results, video_save_path, metadata)

results is the prediction output from the previous step, video_save_path is the path of the JSON file to be saved, and metadata supplies the class names used to fill in the categories field of the COCO output.
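
Once the JSON has been written, a quick sanity check is to load it back with pycocotools (a minimal sketch, assuming the file was saved at the video_save_path used above):

# Sketch: reload the generated file with pycocotools to sanity-check it
from pycocotools.coco import COCO

coco = COCO("/home/xxxxxx/inference/coco_before_partion/video.json")
print(len(coco.getImgIds()), "images")
print(len(coco.getAnnIds()), "annotations")
print(len(coco.getCatIds()), "categories")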