One-click target removal and watermark removal from videos, based on Segment-and-Track Anything and ProPainter

1. ProPainter

1. Introduction to the algorithm

ProPainter is a video inpainting tool developed by the S-Lab team at Nanyang Technological University in Singapore. It combines image- and feature-level propagation with an efficient Transformer, aiming to deliver high-quality video restoration while keeping computation efficient.
ProPainter includes the following features:

  1. Object Removal: Easily remove unwanted objects from videos.
  2. Watermark removal: Can be used to remove watermarks from videos and improve visual quality.
  3. Video completion: repairs damaged or missing video content so that it looks complete and coherent.

2. Project deployment

If you want to learn more about ProPainter or deploy the ProPainter project yourself, see my previous blog post:
One-click intelligent video editing and video repair algorithm-ProPainter source code analysis and deployment

3. Project limitations

The currently open-sourced ProPainter code only covers the object-removal step itself. Before an object can be removed, a mask image must be generated for every frame, and ProPainter does not provide code for that. Generating the masks relies on target segmentation and target tracking.
For example, if I want to remove the projector in the middle of the table, I first need to use Segment-and-Track Anything to segment and track the target and generate a mask image for each frame:

Generated mask image:
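For reference, ProPainter simply consumes a folder of such per-frame mask images (passed later via its --mask argument), where the white region marks what to remove. Below is a minimal sketch of writing one such mask, assuming a single object and the zero-padded PNG naming that the tracking code later in this post also uses; the function name and paths are only illustrative:

import os

import numpy as np
from PIL import Image


def save_binary_mask(mask: np.ndarray, mask_dir: str, frame_idx: int) -> str:
    # One PNG per video frame; non-zero pixels mark the object to remove.
    os.makedirs(mask_dir, exist_ok=True)
    out_path = os.path.join(mask_dir, f"{frame_idx:05d}.png")
    Image.fromarray((mask > 0).astype(np.uint8) * 255).save(out_path)
    return out_path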

2. Segment-and-Track Anything

1. Introduction to the algorithm

"Segment-and-Track Anything" (SAM-Track) is a versatile video segmentation and target tracking model developed by the ReLER Lab of Zhejiang University. It deeply integrates SAM (Segment Anything Model) with video segmentation technology, enabling it to track targets in videos efficiently, and it supports multiple interaction methods (such as point, brush, and text input).

On this basis, SAM-Track unifies multiple traditional video segmentation tasks, enabling it to segment and track any target in any video with one click, pushing traditional video segmentation technology into the field of general video segmentation. SAM-Track exhibits excellent performance in complex scenes and can stably track hundreds of targets with high quality even on a single GPU card.

SAM-Track is built on DeAOT, the solution that won four tracks of the VOT 2022 Challenge workshop at ECCV 2022. DeAOT is an efficient multi-object video object segmentation model: given object annotations in the first frame, it tracks and segments those objects in the remaining frames of the video. DeAOT uses an identification mechanism to embed multiple targets into the same high-dimensional space, enabling simultaneous tracking of multiple objects, and its speed in multi-object tracking is comparable to VOS methods that focus on a single object. In addition, through a hierarchical Transformer-based propagation mechanism, DeAOT better integrates long-term and short-term information and shows excellent tracking performance. However, DeAOT needs reference-frame annotations to initialize. For convenience, SAM-Track uses SAM, a flagship model in image segmentation, to obtain high-quality reference-frame annotations: SAM's strong zero-shot transfer ability and multiple interaction modes allow SAM-Track to provide DeAOT with high-quality reference-frame annotations efficiently.
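In short, SAM annotates the first frame, DeAOT propagates that annotation through the rest of the video, and SAM is re-run periodically to pick up objects that enter the scene later. A rough sketch of this flow, reusing the SegTracker method names that appear in the integration code further below (treat it as illustrative pseudocode, not the exact SAM-Track API):

# Illustrative pseudocode of the SAM-Track flow; seg_tracker, frames and sam_gap
# are assumed to come from the surrounding project code.
pred_mask = seg_tracker.first_frame_mask              # reference annotation obtained via SAM
for frame_idx, frame in enumerate(frames[1:], start=1):
    if frame_idx % sam_gap == 0:
        seg_mask = seg_tracker.seg(frame)             # re-run SAM on this frame
        track_mask = seg_tracker.track(frame)         # DeAOT propagation
        new_objs = seg_tracker.find_new_objs(track_mask, seg_mask)
        pred_mask = track_mask + new_objs
        seg_tracker.add_reference(frame, pred_mask)   # feed new objects back to DeAOT
    else:
        pred_mask = seg_tracker.track(frame, update_memory=True)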

Although the SAM model performs well in image segmentation, it cannot output semantic labels, and text prompts alone cannot effectively support referring object segmentation and other tasks that rely on deep semantic understanding. Therefore, SAM-Track further integrates Grounding DINO to achieve high-precision language-guided video segmentation. Grounding DINO is an open-set object detection model with excellent language understanding capabilities.

2. Project deployment

Please refer to my previous blog post:
Segment-and-Track Anything – general intelligent video segmentation, target tracking and editing: algorithm interpretation and source code deployment

3. Project integration

1. Target segmentation and tracking

The integration of Segment-and-Track Anything and ProPainter starts with target segmentation and tracking.
Target segmentation and target tracking:

import gc
import os

import cv2
import numpy as np
import torch
from PIL import Image

# create_dir, save_prediction and draw_mask are helper functions from the
# Segment-and-Track Anything project.


def tracking_objects_in_video(SegTracker, input_video, input_img_seq=None, frame_num=0):

    if input_video is not None:
        video_name = os.path.basename(input_video).split('.')[0]
    else:
        return None, None

    # create a directory to save the results
    tracking_result_dir = os.path.join(os.path.dirname(__file__), "output", video_name)
    create_dir(tracking_result_dir)

    io_args = {
        'tracking_result_dir': tracking_result_dir,
        'output_mask_dir': f'{tracking_result_dir}/{video_name}_masks',
        'output_masked_frame_dir': f'{tracking_result_dir}/{video_name}_masked_frames',
        'output_video': f'{tracking_result_dir}/{video_name}_seg.mp4',  # keep the same format as the input video
        # 'output_gif': f'{tracking_result_dir}/{video_name}_seg.gif',
    }

    return video_type_input_tracking(SegTracker, input_video, io_args, video_name, frame_num)


def video_type_input_tracking(SegTracker, input_video, io_args, video_name, frame_num=0):

    pred_list = []
    masked_pred_list = []

    # source video to segment
    cap = cv2.VideoCapture(input_video)
    fps = cap.get(cv2.CAP_PROP_FPS)

    if frame_num > 0:
        output_mask_name = sorted([img_name for img_name in os.listdir(io_args['output_mask_dir'])])
        output_masked_frame_name = sorted([img_name for img_name in os.listdir(io_args['output_masked_frame_dir'])])

        for i in range(0, frame_num):
            cap.read()
            pred_list.append(
                np.array(Image.open(os.path.join(io_args['output_mask_dir'], output_mask_name[i])).convert('P')))
            masked_pred_list.append(
                cv2.imread(os.path.join(io_args['output_masked_frame_dir'], output_masked_frame_name[i])))

    # create dir to save predicted mask and masked frame
    if frame_num == 0:
        if os.path.isdir(io_args['output_mask_dir']):
            # os.system(f"rm -r {io_args['output_mask_dir']}")
            pass
        if os.path.isdir(io_args['output_masked_frame_dir']):
            # os.system(f"rm -r {io_args['output_masked_frame_dir']}")
            pass
    output_mask_dir = io_args['output_mask_dir']
    create_dir(io_args['output_mask_dir'])
    create_dir(io_args['output_masked_frame_dir'])

    torch.cuda.empty_cache()
    gc.collect()
    sam_gap = SegTracker.sam_gap
    frame_idx = 0

    with torch.cuda.amp.autocast():
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

            if frame_idx == 0:
                pred_mask = SegTracker.first_frame_mask
                torch.cuda.empty_cache()
                gc.collect()
            elif (frame_idx % sam_gap) == 0:
                seg_mask = SegTracker.seg(frame)
                torch.cuda.empty_cache()
                gc.collect()
                track_mask = SegTracker.track(frame)
                # find new objects, and update tracker with new objects
                new_obj_mask = SegTracker.find_new_objs(track_mask, seg_mask)
                save_prediction(new_obj_mask, output_mask_dir, str(frame_idx + frame_num).zfill(5) + '_new.png')
                pred_mask = track_mask + new_obj_mask
                # segtracker.restart_tracker()
                SegTracker.add_reference(frame, pred_mask)
            else:
                pred_mask = SegTracker.track(frame, update_memory=True)
            torch.cuda.empty_cache()
            gc.collect()

            save_prediction(pred_mask, output_mask_dir, str(frame_idx + frame_num).zfill(5) + '.png')
            pred_list.append(pred_mask)

            print("processed frame {}, obj_num {}".format(frame_idx + frame_num, SegTracker.get_obj_num()), end='\r')
            frame_idx += 1
        cap.release()
        print('\nfinished')

    ##################
    # Visualization
    ##################

    # draw pred mask on frame and save as a video
    cap = cv2.VideoCapture(input_video)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    out = cv2.VideoWriter(io_args['output_video'], fourcc, fps, (width, height))

    frame_idx = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        pred_mask = pred_list[frame_idx]
        masked_frame = draw_mask(frame, pred_mask)
        cv2.imwrite(f"{<!-- -->io_args['output_masked_frame_dir']}/{<!-- -->str(frame_idx).zfill(5)}.png", masked_frame[:, :, ::-1])

        masked_pred_list.append(masked_frame)
        masked_frame = cv2.cvtColor(masked_frame, cv2.COLOR_RGB2BGR)
        out.write(masked_frame)

        print('frame {} written'.format(frame_idx), end='\r')
        frame_idx += 1
    out.release()
    cap.release()
    print("\
{} saved".format(io_args['output_video']))
    print('\
finished')

    # manually release memory (after cuda out of memory)
    del SegTracker
    torch.cuda.empty_cache()
    gc.collect()

    return io_args['output_video']


After execution, the mask images are generated in the output directory under the project root:
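A rough usage sketch (the video path is only an example; the SegTracker instance is assumed to have been created and given its first-frame mask through the Segment-and-Track Anything WebUI or helper code):

# Illustrative call; './input/table.mp4' is an example path.
output_video = tracking_objects_in_video(seg_tracker, './input/table.mp4', frame_num=0)
print('segmented video:', output_video)
# Per-frame masks:          ./output/table/table_masks/
# Frames with mask overlay: ./output/table/table_masked_frames/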

2. Target removal

After getting the mask images, you can use ProPainter to remove the target from the video:

import os
import subprocess

# resolve_relative_path is a small project helper that converts a relative path
# into an absolute one.


def remove_watermark(input_video):
    print("Start removing target")
    # print('cwd', os.getcwd())
    root_path = os.getcwd()

    os.chdir(os.path.join(root_path,'ProPainter'))

    python_exe = resolve_relative_path(os.path.join(root_path,'env/python.exe'))
    inference = resolve_relative_path(os.path.join(root_path,'ProPainter/inference_propainter.py'))

    video_name = os.path.basename(input_video).split('.')[0].split('_')[0]
    output_base_path = resolve_relative_path('./output/')
    output_path = f'{output_base_path}/{video_name}/'
    mask = f'{output_path}/{video_name}_masks/'

    command = f'{python_exe} {inference} --video {input_video} --mask {mask} --output {output_path} --fp16 --subvideo_length 50'
    print(command)
    result = subprocess.run(command, shell=True, capture_output=True)

    if result.returncode != 0:
        error_message = result.stderr.decode('utf-8', 'ignore')
        print(f"Error: {error_message}")
    else:
        print("success")
    file_name = os.path.basename(input_video).split('.')[0]
    print(file_name)
    os.chdir(resolve_relative_path('./'))
    print('cwd', os.getcwd())
    return output_path + '/' + file_name + '/' + 'inpaint_out' + '.mp4'
    # return input_video
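A usage sketch continuing the example above (the path is illustrative; it is the *_seg.mp4 produced by the tracking step, so the mask folder derived inside remove_watermark matches the one written earlier):

# Illustrative call; the function builds and runs an inference_propainter.py command
# with --video, --mask, --output, --fp16 and --subvideo_length 50.
# Pass an absolute path, since the function changes the working directory to ProPainter/.
seg_video = os.path.abspath('./output/table/table_seg.mp4')
inpainted = remove_watermark(seg_video)
print('inpainted video:', inpainted)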

4. Project source code

1. Project configuration

My hardware environment is a 3080 GPU. The current project can only handle short videos, and there are also limits on the size of the input: if the frame size is too large, GPU memory runs out, and if the video is longer than about one minute, processing may freeze.
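If your source video exceeds these limits, it helps to downscale and trim it before processing. A minimal sketch with OpenCV (the width cap, duration cap, and file names are assumptions; adjust them to your hardware):

import cv2


def shrink_video(src, dst, max_width=640, max_seconds=60):
    # Downscale and trim a video so it fits the memory and length limits described above.
    cap = cv2.VideoCapture(src)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    scale = min(1.0, max_width / width)
    size = (int(width * scale), int(height * scale))
    out = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)

    max_frames = int(fps * max_seconds)
    written = 0
    while written < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        out.write(cv2.resize(frame, size))
        written += 1

    cap.release()
    out.release()
    return dst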

2. Project source code

For convenience, the project is packaged so that it can be run directly after downloading, with no environment to install, but it requires a GPU. Download address: https://download.csdn.net/download/matt45m/88460539

3. Source code execution

After downloading, click start.bat to start the project, then open http://127.0.0.1:7860 in a browser to bring up the operation interface. The processed results are saved in the output directory under the project root.