FFmpeg 5.0 source code reading – VideoToolbox hardware decoding

Abstract: This article describes how the videotoolbox decoder in FFmpeg performs decoding, and how an encoded bitstream is turned into the final raw stream.
Keywords: videotoolbox, decoder, ffmpeg
VideoToolbox is a low-level framework that provides direct access to hardware encoders and decoders. It offers video compression and decompression services, as well as conversion between raster image formats stored in CoreVideo pixel buffers. These services are exposed as session objects (compression, decompression, and pixel transfer) and exported as Core Foundation (CF) types. VideoToolbox supports decoding H.263, H.264, HEVC, MPEG-1, MPEG-2, MPEG-4 Part 2, and ProRes, and encoding H.264, HEVC, and ProRes. Recent versions also appear to support VP9 decoding.
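As a quick aside, hardware decode support for a given codec can be probed with VideoToolbox's VTIsHardwareDecodeSupported (available since macOS 10.13). A minimal sketch, not part of FFmpeg, that must be compiled on macOS against the VideoToolbox framework:

```c
// Sketch: probe hardware decode support on macOS.
// Requires linking against VideoToolbox; not part of FFmpeg itself.
#include <VideoToolbox/VideoToolbox.h>
#include <stdio.h>

int main(void)
{
    // VTIsHardwareDecodeSupported is available on macOS 10.13+.
    Boolean h264 = VTIsHardwareDecodeSupported(kCMVideoCodecType_H264);
    Boolean hevc = VTIsHardwareDecodeSupported(kCMVideoCodecType_HEVC);
    printf("h264: %d, hevc: %d\n", (int)h264, (int)hevc);
    return 0;
}
```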

1 Main process

1.1 Context involved

Each decoder in FFmpeg has its own Context description, which specifies, in an agreed-upon format, the decoder's parameters and its processing function pointers. The main implementation of the VideoToolbox decoder in FFmpeg is in libavcodec/videotoolbox.{h,c}, which defines an independent Context for each supported decoding format, such as ff_h263_videotoolbox_hwaccel, ff_h264_videotoolbox_hwaccel, etc.; they differ only in implementation details, so we can focus on one of them. Here we mainly look at ff_h264_videotoolbox_hwaccel.

const AVHWAccel ff_h264_videotoolbox_hwaccel = {
    .name = "h264_videotoolbox",
    .type = AVMEDIA_TYPE_VIDEO,
    .id = AV_CODEC_ID_H264,
    .pix_fmt = AV_PIX_FMT_VIDEOTOOLBOX,
    .alloc_frame = ff_videotoolbox_alloc_frame,
    .start_frame = ff_videotoolbox_h264_start_frame,
    .decode_slice = ff_videotoolbox_h264_decode_slice,
    .decode_params = videotoolbox_h264_decode_params,
    .end_frame = videotoolbox_h264_end_frame,
    .frame_params = ff_videotoolbox_frame_params,
    .init = ff_videotoolbox_common_init,
    .uninit = ff_videotoolbox_uninit,
    .priv_data_size = sizeof(VTContext),
};

This structure defines:

  • the decoder's name;
  • the type of data it decodes;
  • the codec ID;
  • the hardware pixel format;
  • alloc_frame, which allocates the hardware-specific frame structure;
  • start_frame, which performs memory copies and other setup before decoding of a frame begins;
  • decode_slice, which consumes the slice data;
  • decode_params, which parses parameters the decoder needs, such as the SPS;
  • end_frame, post-processing after the frame has been submitted;
  • frame_params, which fills in the parameters of the hardware frames context;
  • init, which initializes the hardware decoder;
  • uninit, which destroys the hardware decoder;
  • the size of the private context (VTContext) describing the current hardware decoder.

ff_h264_videotoolbox_hwaccel is stored in hw_configs; at runtime FFmpeg traverses this list to find the desired hardware decoder. Decoding therefore first goes through FFmpeg's ff_h264_decoder and then enters the hardware decoder.
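The lookup itself is a plain traversal of the NULL-terminated hw_configs array. The following standalone sketch mimics that logic with simplified stand-in types (HWConfig and find_hw_config are hypothetical names for illustration, not FFmpeg API):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for AVCodecHWConfigInternal; the real lookup
 * lives inside libavcodec and matches on pixel format / device type. */
typedef struct HWConfig {
    const char *name;   /* e.g. "videotoolbox" */
    int pix_fmt;        /* stand-in for an AV_PIX_FMT_* value */
} HWConfig;

/* Walk a NULL-terminated config list, as the decoder does at runtime,
 * returning the first entry that offers the requested pixel format. */
static const HWConfig *find_hw_config(const HWConfig *const *configs,
                                      int pix_fmt)
{
    for (size_t i = 0; configs && configs[i]; i++)
        if (configs[i]->pix_fmt == pix_fmt)
            return configs[i];
    return NULL;
}
```

This is why ff_h264_decoder lists one HWACCEL_* entry per enabled backend: the negotiated pixel format (here AV_PIX_FMT_VIDEOTOOLBOX) selects which hwaccel actually runs.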

const AVCodec ff_h264_decoder = {
    .name = "h264",
    .long_name = NULL_IF_CONFIG_SMALL("H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10"),
    .type = AVMEDIA_TYPE_VIDEO,
    .id = AV_CODEC_ID_H264,
    .priv_data_size = sizeof(H264Context),
    .init = h264_decode_init,
    .close = h264_decode_end,
    .decode = h264_decode_frame,
    .capabilities = /*AV_CODEC_CAP_DRAW_HORIZ_BAND |*/ AV_CODEC_CAP_DR1 |
                             AV_CODEC_CAP_DELAY | AV_CODEC_CAP_SLICE_THREADS |
                             AV_CODEC_CAP_FRAME_THREADS,
    .hw_configs = (const AVCodecHWConfigInternal *const []) {
#if CONFIG_H264_DXVA2_HWACCEL
                               HWACCEL_DXVA2(h264),
#endif
#if CONFIG_H264_D3D11VA_HWACCEL
                               HWACCEL_D3D11VA(h264),
#endif
#if CONFIG_H264_D3D11VA2_HWACCEL
                               HWACCEL_D3D11VA2(h264),
#endif
#if CONFIG_H264_NVDEC_HWACCEL
                               HWACCEL_NVDEC(h264),
#endif
#if CONFIG_H264_VAAPI_HWACCEL
                               HWACCEL_VAAPI(h264),
#endif
#if CONFIG_H264_VDPAU_HWACCEL
                               HWACCEL_VDPAU(h264),
#endif
#if CONFIG_H264_VIDEOTOOLBOX_HWACCEL
                               HWACCEL_VIDEOTOOLBOX(h264),
#endif
                               NULL
                           },
    .caps_internal = FF_CODEC_CAP_INIT_THREADSAFE | FF_CODEC_CAP_EXPORTS_CROPPING |
                             FF_CODEC_CAP_ALLOCATE_PROGRESS | FF_CODEC_CAP_INIT_CLEANUP,
    .flush = h264_decode_flush,
    .update_thread_context = ONLY_IF_THREADS_ENABLED(ff_h264_update_thread_context),
    .update_thread_context_for_user = ONLY_IF_THREADS_ENABLED(ff_h264_update_thread_context_for_user),
    .profiles = NULL_IF_CONFIG_SMALL(ff_h264_profiles),
    .priv_class = &h264_class,
};

VTContext describes the context used during VideoToolbox decoding.

typedef struct VTContext {
    // The current bitstream buffer.
    uint8_t *bitstream;
    // The current size of the bitstream.
    int bitstream_size;
    // The reference size used for fast reallocation.
    int allocated_size;
    // The core video buffer
    CVImageBufferRef frame;
    // Current dummy frames context (depends on exact CVImageBufferRef params).
    struct AVBufferRef *cached_hw_frames_ctx;
    // Non-NULL if the new hwaccel API is used. This is only a separate struct
    // to ease compatibility with the old API.
    struct AVVideotoolboxContext *vt_ctx;

    // Current H264 parameters (used to trigger decoder restart on SPS changes).
    uint8_t sps[3];
    bool reconfig_needed;
    void *logctx;
} VTContext;

1.2 Main process

2 Specific implementation of each step

2.1 ff_videotoolbox_common_init

ff_videotoolbox_common_init is called when the decoder is initialized, usually when avcodec_open2 sets up the hardware decoder. Note that, in order to detect the media information of the current video more accurately, FFmpeg may also initialize the decoder and decode a few frames during avformat_find_stream_info.
During initialization, memory for the VT context is allocated first and a few parameters are set; in practice only VT's callback function and pixel format are configured. After that, the AVHWFramesContext is initialized as needed, mainly to allocate memory and set frame properties such as width, height, and format.
The last step calls videotoolbox_start to create a VT session. Creation itself is straightforward: it directly calls Apple's API. What deserves attention is how the session is configured; the implementation is videotoolbox_decoder_config_create, in which the hardware-acceleration settings are hard-coded and not user-configurable. It also extracts the SPS and other extradata from the current AVCodecContext and hands them to the decoder; without this information the decoder cannot correctly interpret the stream. Parsing of the SPS and PPS themselves is done by FFmpeg.
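For orientation, the session creation in videotoolbox_start boils down to a single call to Apple's VTDecompressionSessionCreate. A simplified sketch (error handling and the CFDictionary construction for decoder_spec and buf_attr are omitted; those two variables are assumed to be built beforehand):

```c
// Sketch of VT session creation, simplified from videotoolbox_start.
// cm_fmt_desc carries the avcC/hvcC extradata extracted from the codec context.
VTDecompressionOutputCallbackRecord decoder_cb = {
    .decompressionOutputCallback = videotoolbox_decoder_callback,
    .decompressionOutputRefCon   = avctx->internal->hwaccel_priv_data,
};

OSStatus status = VTDecompressionSessionCreate(
    kCFAllocatorDefault,
    videotoolbox->cm_fmt_desc,  // CMVideoFormatDescriptionRef
    decoder_spec,               // hardware-acceleration keys (hard-coded)
    buf_attr,                   // destination CVPixelBuffer attributes
    &decoder_cb,
    &videotoolbox->session);
```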

 switch (codec_type) {
    case kCMVideoCodecType_MPEG4Video:
        if (avctx->extradata_size)
            data = videotoolbox_esds_extradata_create(avctx);
        if (data)
            CFDictionarySetValue(avc_info, CFSTR("esds"), data);
        break;
    case kCMVideoCodecType_H264:
        data = ff_videotoolbox_avcc_extradata_create(avctx);
        if (data)
            CFDictionarySetValue(avc_info, CFSTR("avcC"), data);
        break;
    case kCMVideoCodecType_HEVC:
        data = ff_videotoolbox_hvcc_extradata_create(avctx);
        if (data)
            CFDictionarySetValue(avc_info, CFSTR("hvcC"), data);
        break;
#if CONFIG_VP9_VIDEOTOOLBOX_HWACCEL
    case kCMVideoCodecType_VP9:
        data = ff_videotoolbox_vpcc_extradata_create(avctx);
        if (data)
            CFDictionarySetValue(avc_info, CFSTR("vpcC"), data);
        break;
#endif
    default:
        break;
    }

The decoding callback's implementation is simple: it just retains the CVPixelBuffer.

static void videotoolbox_decoder_callback(void *opaque,
                                          void *sourceFrameRefCon,
                                          OSStatus status,
                                          VTDecodeInfoFlags flags,
                                          CVImageBufferRef image_buffer,
                                          CMTime pts,
                                          CMTime duration)
{
    VTContext *vtctx = opaque;

    if (vtctx->frame) {
        CVPixelBufferRelease(vtctx->frame);
        vtctx->frame = NULL;
    }

    if (!image_buffer) {
        av_log(vtctx->logctx, AV_LOG_DEBUG,
               "vt decoder cb: output image buffer is null: %i\n", status);
        return;
    }

    vtctx->frame = CVPixelBufferRetain(image_buffer);
}

2.2 videotoolbox_h264_decode_params and ff_videotoolbox_frame_params

videotoolbox_h264_decode_params mainly copies the SPS and PPS information parsed by the upper layer into VTContext.

case H264_NAL_SPS: {
    GetBitContext tmp_gb = nal->gb;
    if (avctx->hwaccel && avctx->hwaccel->decode_params) {
        ret = avctx->hwaccel->decode_params(avctx,
                                            nal->type,
                                            nal->raw_data,
                                            nal->raw_size);
        if (ret < 0)
            goto end;
    }
    if (ff_h264_decode_seq_parameter_set(&tmp_gb, avctx, &h->ps, 0) >= 0)
        break;
    av_log(h->avctx, AV_LOG_DEBUG,
           "SPS decoding failure, trying again with the complete NAL\n");
    init_get_bits8(&tmp_gb, nal->raw_data + 1, nal->raw_size - 1);
    if (ff_h264_decode_seq_parameter_set(&tmp_gb, avctx, &h->ps, 0) >= 0)
        break;
    ff_h264_decode_seq_parameter_set(&nal->gb, avctx, &h->ps, 1);
    break;
}

ff_videotoolbox_frame_params is straightforward: it passes parameters from the AVCodecContext to the AVHWFramesContext.

ff_videotoolbox_alloc_frame, ff_videotoolbox_h264_start_frame, ff_videotoolbox_h264_decode_slice, videotoolbox_h264_end_frame

These functions are called for every frame, in the order alloc_frame -> start_frame -> decode_slice -> end_frame.
ff_videotoolbox_alloc_frame allocates a frame. At this point it is just bare memory; the function only sets the frame's release function pointer to VT's release routine. The frame is not yet bound to a CVPixelBuffer; that binding happens in the decoder's callback.
ff_videotoolbox_h264_start_frame mainly copies the bitstream data passed down from the upper layer into VTContext.
videotoolbox_common_decode_slice likewise copies the slice data into the same buffer.
videotoolbox_h264_end_frame is where the data is actually sent to the decoder; its core is videotoolbox_session_decode_frame, which submits the bitstream accumulated above. Note that the callback registered at initialization only retains the output buffer and does nothing else. This works because end_frame calls VTDecompressionSessionWaitForAsynchronousFrames to wait for the asynchronous decode to complete, which guarantees the previous frame has finished decoding before the next frame is submitted.
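The submission in videotoolbox_session_decode_frame reduces to wrapping the copied bitstream in a CMSampleBuffer, decoding it, and waiting. A simplified sketch (error handling omitted):

```c
// Sketch of videotoolbox_session_decode_frame (simplified):
// wrap the accumulated bitstream in a CMSampleBuffer, submit it
// asynchronously, then block until the output callback has run.
CMSampleBufferRef sample_buf =
    videotoolbox_sample_buffer_create(videotoolbox->cm_fmt_desc,
                                      vtctx->bitstream,
                                      vtctx->bitstream_size);

OSStatus status = VTDecompressionSessionDecodeFrame(videotoolbox->session,
                                                    sample_buf,
                                                    0,     // decodeFlags
                                                    NULL,  // sourceFrameRefCon
                                                    0);    // infoFlagsOut
if (status == noErr)
    status = VTDecompressionSessionWaitForAsynchronousFrames(videotoolbox->session);

CFRelease(sample_buf);
```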

2.3 ff_videotoolbox_uninit

ff_videotoolbox_uninit is straightforward: it releases the decoder's context and any cached memory.

  • Apple Documentation–Video Toolbox