FFmpeg hardware encoding with VideoToolbox

Introduction

FFmpeg provides codec support for VideoToolbox; the main source files involved are videotoolbox.c, videotoolbox.h, videotoolboxenc.c and ffmpeg_videotoolbox.c.
To build FFmpeg with VideoToolbox support, pass the --enable-videotoolbox option to configure.
Run ffmpeg -hwaccels to list the hardware acceleration methods available in your build.
FFmpeg supports VideoToolbox encoding for H.264 and H.265 (HEVC) through the h264_videotoolbox and hevc_videotoolbox encoders.
Run ffmpeg -h encoder=hevc_videotoolbox to view the options of the HEVC hardware encoder (for H.264, use ffmpeg -h encoder=h264_videotoolbox).

FFmpeg

FFmpeg is a software suite for processing audio and video. Its main capabilities include transcoding between codecs, converting between container formats, and applying filters.
FFmpeg supports a range of network protocols: it can push and pull streams over higher-level protocols such as RTMP, RTSP and HLS, as well as over lower-level TCP/UDP.
FFmpeg runs on Windows, Linux, macOS, iOS, Android and other operating systems.
The name FFmpeg stands for "Fast Forward MPEG".
Functionally, FFmpeg is divided into several libraries: core utilities (libavutil), media formats (libavformat), codecs (libavcodec), devices (libavdevice) and post-processing (libavfilter, libswscale, libpostproc). Respectively, they provide common functions, read and write multimedia files, encode and decode audio and video, manage audio and video devices, and perform audio and video post-processing.

VideoToolbox

VideoToolbox is a low-level video codec framework developed by Apple for its own platforms, including iOS and macOS. It gives applications direct access to hardware encoders and decoders and is commonly used for H.264 and HEVC encoding and decoding; which codecs are available depends on the device and OS version. It exchanges data with the rest of the system through Core Media and Core Video types such as CMSampleBuffer and CVPixelBuffer.
The main advantage of VideoToolbox is efficiency: because it can use dedicated hardware on iOS and macOS devices, encoding and decoding are typically faster and use less CPU than software codecs. It also lets developers control encoding properties such as the expected frame rate and the encoding format.

FFmpeg hardware encoding with VideoToolbox: the process

FFmpeg interacts with VideoToolbox mainly through three function pointers of the encoder: init, encode2 and close.
Viewed as a whole, the VideoToolbox encoding workflow is:
1. Create a compression session.
2. Set the session properties.
3. Encode video frames and receive the encoded output through a callback.
4. Force completion of some or all pending video frames.
5. Release the compression session and free its memory.
The core function of the init module is vtenc_configure_encoder().
The core function of the encode2 module is vtenc_send_frame().
The core function of the close module is VTCompressionSessionCompleteFrames().
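
The dispatch through three function pointers can be illustrated with a small, self-contained sketch. Everything here (DemoCodec, DemoContext, demo_run and the other demo_ names) is hypothetical and only mimics the shape of the real AVCodec callbacks; it does not call FFmpeg or VideoToolbox. The context simply records the order in which the callbacks are invoked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for AVCodec: only the three callbacks. */
typedef struct DemoCodec {
    const char *name;
    int (*init)(void *ctx);
    int (*encode2)(void *ctx, const int *frame, int *got_packet);
    int (*close)(void *ctx);
} DemoCodec;

/* Hypothetical private context that records the order of calls. */
typedef struct DemoContext {
    char   trace[16];  /* 'I' = init, 'E' = encode, 'F' = flush, 'C' = close */
    size_t len;
} DemoContext;

static void demo_trace(DemoContext *c, char tag)
{
    assert(c->len + 1 < sizeof(c->trace));
    c->trace[c->len++] = tag;
    c->trace[c->len]   = '\0';
}

static int demo_init(void *ctx)  { demo_trace(ctx, 'I'); return 0; }
static int demo_close(void *ctx) { demo_trace(ctx, 'C'); return 0; }

/* A NULL frame means "flush", the same convention encode2 uses. */
static int demo_encode2(void *ctx, const int *frame, int *got_packet)
{
    demo_trace(ctx, frame ? 'E' : 'F');
    *got_packet = frame != NULL;
    return 0;
}

static const DemoCodec demo_codec = {
    .name    = "demo",
    .init    = demo_init,
    .encode2 = demo_encode2,
    .close   = demo_close,
};

/* Drive the codec the way a caller would: init, N frames, flush, close.
 * Returns the number of packets produced, or -1 if init fails. */
static int demo_run(const DemoCodec *codec, DemoContext *ctx, int nb_frames)
{
    int got_packet = 0, packets = 0;

    if (codec->init(ctx) < 0)
        return -1;
    for (int i = 0; i < nb_frames; i++) {
        codec->encode2(ctx, &i, &got_packet);
        packets += got_packet;
    }
    codec->encode2(ctx, NULL, &got_packet);  /* flush pending frames */
    codec->close(ctx);
    return packets;
}
```

Calling demo_run(&demo_codec, &ctx, 2) on a zero-initialized context leaves the trace "IEEFC": one init, two encodes, one flush and one close, which is the same lifecycle as the five workflow steps listed above.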

h264_videotoolbox

The VideoToolbox H.264 encoder uses three structures, h264_options, h264_videotoolbox_class and ff_h264_videotoolbox_encoder, to integrate with FFmpeg.
h264_options describes the encoder's private options, such as profile, level and entropy coding mode.
h264_videotoolbox_class defines the private AVClass of the H.264 encoder; it names the class and points to its option table.
ff_h264_videotoolbox_encoder is the AVCodec structure exposed to FFmpeg; it registers the callbacks that carry out H.264 hardware encoding.

static const AVOption h264_options[] = {
    { "profile", "Profile", OFFSET(profile), AV_OPT_TYPE_INT, { .i64 = H264_PROF_AUTO }, H264_PROF_AUTO, H264_PROF_COUNT, VE, "profile" },
    { "baseline", "Baseline Profile", 0, AV_OPT_TYPE_CONST, { .i64 = H264_PROF_BASELINE }, INT_MIN, INT_MAX, VE, "profile" },
    { "main", "Main Profile", 0, AV_OPT_TYPE_CONST, { .i64 = H264_PROF_MAIN }, INT_MIN, INT_MAX, VE, "profile" },
    { "high", "High Profile", 0, AV_OPT_TYPE_CONST, { .i64 = H264_PROF_HIGH }, INT_MIN, INT_MAX, VE, "profile" },
    { "extended", "Extend Profile", 0, AV_OPT_TYPE_CONST, { .i64 = H264_PROF_EXTENDED }, INT_MIN, INT_MAX, VE, "profile" },

    { "level", "Level", OFFSET(level), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, 52, VE, "level" },
    { "1.3", "Level 1.3, only available with Baseline Profile", 0, AV_OPT_TYPE_CONST, { .i64 = 13 }, INT_MIN, INT_MAX, VE, "level" },
    { "3.0", "Level 3.0", 0, AV_OPT_TYPE_CONST, { .i64 = 30 }, INT_MIN, INT_MAX, VE, "level" },
    { "3.1", "Level 3.1", 0, AV_OPT_TYPE_CONST, { .i64 = 31 }, INT_MIN, INT_MAX, VE, "level" },
    { "3.2", "Level 3.2", 0, AV_OPT_TYPE_CONST, { .i64 = 32 }, INT_MIN, INT_MAX, VE, "level" },
    { "4.0", "Level 4.0", 0, AV_OPT_TYPE_CONST, { .i64 = 40 }, INT_MIN, INT_MAX, VE, "level" },
    { "4.1", "Level 4.1", 0, AV_OPT_TYPE_CONST, { .i64 = 41 }, INT_MIN, INT_MAX, VE, "level" },
    { "4.2", "Level 4.2", 0, AV_OPT_TYPE_CONST, { .i64 = 42 }, INT_MIN, INT_MAX, VE, "level" },
    { "5.0", "Level 5.0", 0, AV_OPT_TYPE_CONST, { .i64 = 50 }, INT_MIN, INT_MAX, VE, "level" },
    { "5.1", "Level 5.1", 0, AV_OPT_TYPE_CONST, { .i64 = 51 }, INT_MIN, INT_MAX, VE, "level" },
    { "5.2", "Level 5.2", 0, AV_OPT_TYPE_CONST, { .i64 = 52 }, INT_MIN, INT_MAX, VE, "level" },

    { "coder", "Entropy coding", OFFSET(entropy), AV_OPT_TYPE_INT, { .i64 = VT_ENTROPY_NOT_SET }, VT_ENTROPY_NOT_SET, VT_CABAC, VE, "coder" },
    { "cavlc", "CAVLC entropy coding", 0, AV_OPT_TYPE_CONST, { .i64 = VT_CAVLC }, INT_MIN, INT_MAX, VE, "coder" },
    { "vlc", "CAVLC entropy coding", 0, AV_OPT_TYPE_CONST, { .i64 = VT_CAVLC }, INT_MIN, INT_MAX, VE, "coder" },
    { "cabac", "CABAC entropy coding", 0, AV_OPT_TYPE_CONST, { .i64 = VT_CABAC }, INT_MIN, INT_MAX, VE, "coder" },
    { "ac", "CABAC entropy coding", 0, AV_OPT_TYPE_CONST, { .i64 = VT_CABAC }, INT_MIN, INT_MAX, VE, "coder" },

    { "a53cc", "Use A53 Closed Captions (if available)", OFFSET(a53_cc), AV_OPT_TYPE_BOOL, { .i64 = 1 }, 0, 1, VE },

    COMMON_OPTIONS
    { NULL },
};

static const AVClass h264_videotoolbox_class = {
    .class_name = "h264_videotoolbox",
    .item_name = av_default_item_name,
    .option = h264_options,
    .version = LIBAVUTIL_VERSION_INT,
};

AVCodec ff_h264_videotoolbox_encoder = {
    .name = "h264_videotoolbox",
    .long_name = NULL_IF_CONFIG_SMALL("VideoToolbox H.264 Encoder"),
    .type = AVMEDIA_TYPE_VIDEO,
    .id = AV_CODEC_ID_H264,
    .priv_data_size = sizeof(VTEncContext),
    .pix_fmts = avc_pix_fmts,
    .init = vtenc_init,
    .encode2 = vtenc_frame,
    .close = vtenc_close,
    .capabilities = AV_CODEC_CAP_DELAY,
    .priv_class = &h264_videotoolbox_class,
    .caps_internal = FF_CODEC_CAP_INIT_THREADSAFE |
                        FF_CODEC_CAP_INIT_CLEANUP,
};
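
To make the role of the option table more concrete, here is a minimal sketch of how a table of this shape can map a user string such as profile=high to a numeric value. DemoOption, demo_options and demo_resolve are hypothetical names; this is not the real AVOption machinery (which also accepts numeric values and range-checks them), only an illustration of how a settable option and its named constants are tied together by a shared unit string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of an AVOption entry (not the real struct). */
typedef struct DemoOption {
    const char *name;
    int         is_const;  /* 1 = named constant, 0 = settable option */
    long long   value;     /* default for options, value for constants */
    const char *unit;      /* groups an option with its named constants */
} DemoOption;

enum { DEMO_PROF_AUTO, DEMO_PROF_BASELINE, DEMO_PROF_MAIN, DEMO_PROF_HIGH };

static const DemoOption demo_options[] = {
    { "profile",  0, DEMO_PROF_AUTO,     "profile" },
    { "baseline", 1, DEMO_PROF_BASELINE, "profile" },
    { "main",     1, DEMO_PROF_MAIN,     "profile" },
    { "high",     1, DEMO_PROF_HIGH,     "profile" },

    { "level",    0, 0,                  "level" },
    { "4.1",      1, 41,                 "level" },
    { "5.2",      1, 52,                 "level" },

    { NULL, 0, 0, NULL },
};

/* Resolve "key=val" against the table. Returns 0 and stores the numeric
 * value in *out on success, -1 if the option or constant is unknown. */
static int demo_resolve(const DemoOption *opts, const char *key,
                        const char *val, long long *out)
{
    const DemoOption *opt = NULL;

    assert(opts && key && val && out);

    /* Step 1: find the settable option named by key. */
    for (const DemoOption *o = opts; o->name; o++) {
        if (!o->is_const && strcmp(o->name, key) == 0) {
            opt = o;
            break;
        }
    }
    if (!opt)
        return -1;

    /* Step 2: find a named constant in the same unit. */
    for (const DemoOption *o = opts; o->name; o++) {
        if (o->is_const && strcmp(o->unit, opt->unit) == 0 &&
            strcmp(o->name, val) == 0) {
            *out = o->value;
            return 0;
        }
    }
    return -1;
}
```

With this table, demo_resolve(demo_options, "profile", "high", &v) stores DEMO_PROF_HIGH in v, while "profile" combined with "4.1" is rejected because that constant belongs to the "level" unit.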

hevc_videotoolbox

The VideoToolbox HEVC encoder uses three structures, hevc_options, hevc_videotoolbox_class and ff_hevc_videotoolbox_encoder, to integrate with FFmpeg.
hevc_options describes the encoder's private options, such as the profile.
hevc_videotoolbox_class defines the private AVClass of the HEVC encoder; it names the class and points to its option table.
ff_hevc_videotoolbox_encoder is the AVCodec structure exposed to FFmpeg; it registers the callbacks that carry out HEVC hardware encoding.

static const AVOption hevc_options[] = {
    { "profile", "Profile", OFFSET(profile), AV_OPT_TYPE_INT, { .i64 = HEVC_PROF_AUTO }, HEVC_PROF_AUTO, HEVC_PROF_COUNT, VE, "profile" },
    { "main", "Main Profile", 0, AV_OPT_TYPE_CONST, { .i64 = HEVC_PROF_MAIN }, INT_MIN, INT_MAX, VE, "profile" },
    { "main10", "Main10 Profile", 0, AV_OPT_TYPE_CONST, { .i64 = HEVC_PROF_MAIN10 }, INT_MIN, INT_MAX, VE, "profile" },

    COMMON_OPTIONS
    { NULL },
};

static const AVClass hevc_videotoolbox_class = {
    .class_name = "hevc_videotoolbox",
    .item_name = av_default_item_name,
    .option = hevc_options,
    .version = LIBAVUTIL_VERSION_INT,
};

AVCodec ff_hevc_videotoolbox_encoder = {
    .name = "hevc_videotoolbox",
    .long_name = NULL_IF_CONFIG_SMALL("VideoToolbox H.265 Encoder"),
    .type = AVMEDIA_TYPE_VIDEO,
    .id = AV_CODEC_ID_HEVC,
    .priv_data_size = sizeof(VTEncContext),
    .pix_fmts = hevc_pix_fmts,
    .init = vtenc_init,
    .encode2 = vtenc_frame,
    .close = vtenc_close,
    .capabilities = AV_CODEC_CAP_DELAY | AV_CODEC_CAP_HARDWARE,
    .priv_class = &hevc_videotoolbox_class,
    .caps_internal = FF_CODEC_CAP_INIT_THREADSAFE |
                        FF_CODEC_CAP_INIT_CLEANUP,
    .wrapper_name = "videotoolbox",
};

Introduction to core modules

.init

The .init module performs initialization; the corresponding function is vtenc_init(). It initializes the mutex and condition variable used for thread synchronization, configures the encoder, reads back session properties and handles B-frame related settings.

static av_cold int vtenc_init(AVCodecContext *avctx)
{
    VTEncContext *vtctx = avctx->priv_data;
    CFBooleanRef has_b_frames_cfbool;
    int status;

    pthread_once(&once_ctrl, loadVTEncSymbols);

    pthread_mutex_init(&vtctx->lock, NULL);
    pthread_cond_init(&vtctx->cv_sample_sent, NULL);

    vtctx->session = NULL;
    status = vtenc_configure_encoder(avctx);
    if (status) return status;

    status = VTSessionCopyProperty(vtctx->session,
                                   kVTCompressionPropertyKey_AllowFrameReordering,
                                   kCFAllocatorDefault,
                                   &has_b_frames_cfbool);

    if (!status && has_b_frames_cfbool) {
        //Some devices don't output B-frames for main profile, even if requested.
        vtctx->has_b_frames = CFBooleanGetValue(has_b_frames_cfbool);
        CFRelease(has_b_frames_cfbool);
    }
    avctx->has_b_frames = vtctx->has_b_frames;

    return 0;
}

vtenc_configure_encoder() is the core function of the init module and configures the encoder. Depending on the codec (H.264 or HEVC), it sets up the profile, level, entropy coding and related information; it also chooses between hardware and software encoding and determines the pixel buffer attributes, transfer function, YCbCr matrix and color primaries. Finally, it calls vtenc_create_encoder() to create the encoder.

static int vtenc_configure_encoder(AVCodecContext *avctx)
{
    CFMutableDictionaryRef enc_info;
    CFMutableDictionaryRef pixel_buffer_info;
    CMVideoCodecType codec_type;
    VTEncContext *vtctx = avctx->priv_data;
    CFStringRef profile_level;
    CFNumberRef gamma_level = NULL;
    int status;

    codec_type = get_cm_codec_type(avctx->codec_id);
    if (!codec_type) {
        av_log(avctx, AV_LOG_ERROR, "Error: no mapping for AVCodecID %d\n", avctx->codec_id);
        return AVERROR(EINVAL);
    }

    vtctx->codec_id = avctx->codec_id;

    if (vtctx->codec_id == AV_CODEC_ID_H264) {
        vtctx->get_param_set_func = CMVideoFormatDescriptionGetH264ParameterSetAtIndex;

        vtctx->has_b_frames = avctx->max_b_frames > 0;
        if (vtctx->has_b_frames && vtctx->profile == H264_PROF_BASELINE) {
            av_log(avctx, AV_LOG_WARNING, "Cannot use B-frames with baseline profile. Output will not contain B-frames.\n");
            vtctx->has_b_frames = false;
        }

        if (vtctx->entropy == VT_CABAC && vtctx->profile == H264_PROF_BASELINE) {
            av_log(avctx, AV_LOG_WARNING, "CABAC entropy requires 'main' or 'high' profile, but baseline was requested. Encode will not use CABAC entropy.\n");
            vtctx->entropy = VT_ENTROPY_NOT_SET;
        }

        if (!get_vt_h264_profile_level(avctx, &profile_level)) return AVERROR(EINVAL);
    } else {
        vtctx->get_param_set_func = compat_keys.CMVideoFormatDescriptionGetHEVCParameterSetAtIndex;
        if (!vtctx->get_param_set_func) return AVERROR(EINVAL);
        if (!get_vt_hevc_profile_level(avctx, &profile_level)) return AVERROR(EINVAL);
    }

    enc_info = CFDictionaryCreateMutable(
        kCFAllocatorDefault,
        20,
        &kCFCopyStringDictionaryKeyCallBacks,
        &kCFTypeDictionaryValueCallBacks
    );

    if (!enc_info) return AVERROR(ENOMEM);

#if !TARGET_OS_IPHONE
    if (vtctx->require_sw) {
        CFDictionarySetValue(enc_info,
                             compat_keys.kVTVideoEncoderSpecification_EnableHardwareAcceleratedVideoEncoder,
                             kCFBooleanFalse);
    } else if (!vtctx->allow_sw) {
        CFDictionarySetValue(enc_info,
                             compat_keys.kVTVideoEncoderSpecification_RequireHardwareAcceleratedVideoEncoder,
                             kCFBooleanTrue);
    } else {
        CFDictionarySetValue(enc_info,
                             compat_keys.kVTVideoEncoderSpecification_EnableHardwareAcceleratedVideoEncoder,
                             kCFBooleanTrue);
    }
#endif

    if (avctx->pix_fmt != AV_PIX_FMT_VIDEOTOOLBOX) {
        status = create_cv_pixel_buffer_info(avctx, &pixel_buffer_info);
        if (status)
            goto init_cleanup;
    } else {
        pixel_buffer_info = NULL;
    }

    vtctx->dts_delta = vtctx->has_b_frames ? -1 : 0;

    get_cv_transfer_function(avctx, &vtctx->transfer_function, &gamma_level);
    get_cv_ycbcr_matrix(avctx, &vtctx->ycbcr_matrix);
    get_cv_color_primaries(avctx, &vtctx->color_primaries);

    if (avctx->flags & AV_CODEC_FLAG_GLOBAL_HEADER) {
        status = vtenc_populate_extradata(avctx,
                                          codec_type,
                                          profile_level,
                                          gamma_level,
                                          enc_info,
                                          pixel_buffer_info);
        if (status)
            goto init_cleanup;
    }

    status = vtenc_create_encoder(avctx,
                                  codec_type,
                                  profile_level,
                                  gamma_level,
                                  enc_info,
                                  pixel_buffer_info,
                                  &vtctx->session);

init_cleanup:
    if (gamma_level)
        CFRelease(gamma_level);

    if (pixel_buffer_info)
        CFRelease(pixel_buffer_info);

    CFRelease(enc_info);

    return status;
}
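
Two decisions made inside vtenc_configure_encoder() are easy to isolate: the choice between hardware and software encoding based on require_sw and allow_sw, and the way a baseline profile overrides requests for B-frames and CABAC. The sketch below restates both as pure functions. All demo_ and Demo names are hypothetical, and the settings struct is a simplification (the original resets the entropy field to VT_ENTROPY_NOT_SET rather than to a boolean).

```c
#include <assert.h>
#include <stdbool.h>

/* Which encoder-specification key the original code would set. */
typedef enum DemoHwPolicy {
    DEMO_HW_DISABLED,   /* Enable...HardwareAcceleratedVideoEncoder = false */
    DEMO_HW_REQUIRED,   /* Require...HardwareAcceleratedVideoEncoder = true */
    DEMO_HW_PREFERRED   /* Enable...HardwareAcceleratedVideoEncoder = true  */
} DemoHwPolicy;

/* Same branch order as the source: require_sw wins, then !allow_sw. */
static DemoHwPolicy demo_hw_policy(bool require_sw, bool allow_sw)
{
    if (require_sw)
        return DEMO_HW_DISABLED;
    if (!allow_sw)
        return DEMO_HW_REQUIRED;
    return DEMO_HW_PREFERRED;
}

/* Hypothetical, simplified view of the H.264 settings being validated. */
typedef struct DemoH264Settings {
    bool baseline;       /* profile == baseline */
    bool want_b_frames;  /* max_b_frames > 0 */
    bool want_cabac;     /* entropy == CABAC */
} DemoH264Settings;

/* Baseline allows neither B-frames nor CABAC, so both requests are dropped.
 * Returns the number of settings that had to be adjusted. */
static int demo_apply_baseline_limits(DemoH264Settings *s)
{
    int adjusted = 0;

    assert(s);
    if (s->baseline && s->want_b_frames) {
        s->want_b_frames = false;
        adjusted++;
    }
    if (s->baseline && s->want_cabac) {
        s->want_cabac = false;
        adjusted++;
    }
    return adjusted;
}
```

Note that the default path (neither flag set) requires hardware encoding, so on a device without a suitable hardware encoder the session would be expected to fail unless software encoding is explicitly allowed.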

vtenc_create_encoder() creates the encoder. It calls VTCompressionSessionCreate() to create a compression session, then builds the bit rate and rate-control values and sets the corresponding session properties; finally, it optionally calls VTCompressionSessionPrepareToEncodeFrames() so that the session can allocate the resources it needs before encoding starts.

.encode2

The .encode2 module performs the actual encoding; the corresponding function is vtenc_frame(). It first checks whether an AVFrame was supplied: if so, it calls vtenc_send_frame() to submit the frame for encoding; if not, it flushes the session. It then calls vtenc_q_pop() to take an encoded sample from the output queue, which synchronizes with the encoder's output callback, and finally uses vtenc_cm_to_avpacket() to fill the output packet with the encoded data and related information such as SEI, pts and dts.

static av_cold int vtenc_frame(
    AVCodecContext *avctx,
    AVPacket *pkt,
    const AVFrame *frame,
    int *got_packet)
{
    VTEncContext *vtctx = avctx->priv_data;
    bool get_frame;
    int status;
    CMSampleBufferRef buf = NULL;
    ExtraSEI *sei = NULL;

    if (frame) {
        status = vtenc_send_frame(avctx, vtctx, frame);

        if (status) {
            status = AVERROR_EXTERNAL;
            goto end_nopkt;
        }

        if (vtctx->frame_ct_in == 0) {
            vtctx->first_pts = frame->pts;
        } else if (vtctx->frame_ct_in == 1 && vtctx->has_b_frames) {
            vtctx->dts_delta = frame->pts - vtctx->first_pts;
        }

        vtctx->frame_ct_in++;
    } else if (!vtctx->flushing) {
        vtctx->flushing = true;

        status = VTCompressionSessionCompleteFrames(vtctx->session,
                                                    kCMTimeIndefinite);

        if (status) {
            av_log(avctx, AV_LOG_ERROR, "Error flushing frames: %d\n", status);
            status = AVERROR_EXTERNAL;
            goto end_nopkt;
        }
    }

    *got_packet = 0;
    get_frame = vtctx->dts_delta >= 0 || !frame;
    if (!get_frame) {
        status = 0;
        goto end_nopkt;
    }

    status = vtenc_q_pop(vtctx, !frame, &buf, &sei);
    if (status) goto end_nopkt;
    if (!buf) goto end_nopkt;

    status = vtenc_cm_to_avpacket(avctx, buf, pkt, sei);
    if (sei) {
        if (sei->data) av_free(sei->data);
        av_free(sei);
    }
    CFRelease(buf);
    if (status) goto end_nopkt;

    *got_packet = 1;
    return 0;

end_nopkt:
    av_packet_unref(pkt);
    return status;
}
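
The handling of dts_delta in vtenc_frame() is subtle: when B-frames are enabled, the encoder does not try to output a packet until it has seen two input frames, because it needs the pts gap between them to derive dts values. The following sketch reproduces only that bookkeeping. DemoEncState, demo_state_init and demo_submit are hypothetical names; no encoding takes place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the encoder context used for output gating. */
typedef struct DemoEncState {
    bool    has_b_frames;
    int64_t first_pts;
    int64_t dts_delta;    /* -1 until known when B-frames are enabled */
    int64_t frame_ct_in;  /* number of frames submitted so far */
} DemoEncState;

static void demo_state_init(DemoEncState *s, bool has_b_frames)
{
    s->has_b_frames = has_b_frames;
    s->first_pts    = 0;
    s->dts_delta    = has_b_frames ? -1 : 0;  /* as in the configure step */
    s->frame_ct_in  = 0;
}

/* Feed one frame, or flush with have_frame == false.
 * Returns true when the encoder would go on to pop a packet. */
static bool demo_submit(DemoEncState *s, bool have_frame, int64_t pts)
{
    assert(s);

    if (have_frame) {
        if (s->frame_ct_in == 0)
            s->first_pts = pts;
        else if (s->frame_ct_in == 1 && s->has_b_frames)
            s->dts_delta = pts - s->first_pts;
        s->frame_ct_in++;
    }

    /* Same condition as get_frame in the source. */
    return s->dts_delta >= 0 || !have_frame;
}
```

With B-frames enabled and input pts values 100 and 140, the first call returns false (no output attempted), and the second call sets dts_delta to 40 and returns true. Without B-frames, dts_delta starts at 0 and output is attempted from the first frame.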

vtenc_send_frame() does the core encoding work: it wraps the input frame in a CVPixelBuffer and calls the VideoToolbox API VTCompressionSessionEncodeFrame() to encode it.

static int vtenc_send_frame(AVCodecContext *avctx,
                            VTEncContext *vtctx,
                            const AVFrame *frame)
{
    CMTime time;
    CFDictionaryRef frame_dict;
    CVPixelBufferRef cv_img = NULL;
    AVFrameSideData *side_data = NULL;
    ExtraSEI *sei = NULL;
    int status = create_cv_pixel_buffer(avctx, frame, &cv_img);

    if (status) return status;

    status = create_encoder_dict_h264(frame, &frame_dict);
    if (status) {
        CFRelease(cv_img);
        return status;
    }

    side_data = av_frame_get_side_data(frame, AV_FRAME_DATA_A53_CC);
    if (vtctx->a53_cc && side_data && side_data->size) {
        sei = av_mallocz(sizeof(*sei));
        if (!sei) {
            av_log(avctx, AV_LOG_ERROR, "Not enough memory for closed captions, skipping\n");
        } else {
            int ret = ff_alloc_a53_sei(frame, 0, &sei->data, &sei->size);
            if (ret < 0) {
                av_log(avctx, AV_LOG_ERROR, "Not enough memory for closed captions, skipping\n");
                av_free(sei);
                sei = NULL;
            }
        }
    }

    time = CMTimeMake(frame->pts * avctx->time_base.num, avctx->time_base.den);
    status = VTCompressionSessionEncodeFrame(
        vtctx->session,
        cv_img,
        time,
        kCMTimeInvalid,
        frame_dict,
        sei,
        NULL
    );

    if (frame_dict) CFRelease(frame_dict);
    CFRelease(cv_img);

    if (status) {
        av_log(avctx, AV_LOG_ERROR, "Error: cannot encode frame: %d\n", status);
        return AVERROR_EXTERNAL;
    }

    return 0;
}
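
The timestamp passed to VTCompressionSessionEncodeFrame() is built with CMTimeMake(frame->pts * avctx->time_base.num, avctx->time_base.den), that is, the pts is scaled by the time base numerator and the denominator becomes the timescale. The sketch below shows the same arithmetic with plain structs. DemoRational and DemoTime are hypothetical stand-ins for AVRational and CMTime so that the example compiles without FFmpeg or Core Media.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for AVRational and CMTime (value / timescale). */
typedef struct DemoRational { int num, den; } DemoRational;
typedef struct DemoTime { int64_t value; int32_t timescale; } DemoTime;

/* Same arithmetic as CMTimeMake(pts * time_base.num, time_base.den). */
static DemoTime demo_pts_to_time(int64_t pts, DemoRational time_base)
{
    DemoTime t;

    assert(time_base.den > 0);
    t.value     = pts * time_base.num;
    t.timescale = time_base.den;
    return t;
}

/* Convert the rational time to seconds, for inspection only. */
static double demo_time_seconds(DemoTime t)
{
    return (double)t.value / t.timescale;
}
```

For example, with a time base of 1001/30000 (NTSC-style frame timing), pts 30 becomes value 30030 at timescale 30000, which is about 1.001 seconds. With a millisecond time base of 1/1000, pts 2500 is exactly 2.5 seconds.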

.close

The .close module handles shutdown and cleanup; the corresponding function is vtenc_close(). It destroys the condition variable and mutex, forces completion of any pending video frames, clears the frame queue and releases resources.

static av_cold int vtenc_close(AVCodecContext *avctx)
{
    VTEncContext *vtctx = avctx->priv_data;

    pthread_cond_destroy(&vtctx->cv_sample_sent);
    pthread_mutex_destroy(&vtctx->lock);

    if(!vtctx->session) return 0;

    VTCompressionSessionCompleteFrames(vtctx->session,
                                       kCMTimeIndefinite);
    clear_frame_queue(vtctx);
    CFRelease(vtctx->session);
    vtctx->session = NULL;

    if (vtctx->color_primaries) {
        CFRelease(vtctx->color_primaries);
        vtctx->color_primaries = NULL;
    }

    if (vtctx->transfer_function) {
        CFRelease(vtctx->transfer_function);
        vtctx->transfer_function = NULL;
    }

    if (vtctx->ycbcr_matrix) {
        CFRelease(vtctx->ycbcr_matrix);
        vtctx->ycbcr_matrix = NULL;
    }

    return 0;
}

Reference

https://developer.apple.com/documentation/videotoolbox
http://ffmpeg.org/