Abstract: This article describes how the VideoToolbox decoder in FFmpeg performs decoding, and how an encoded bitstream is decoded into the final raw stream.
Keywords: VideoToolbox, decoder, FFmpeg
VideoToolbox is a low-level framework that provides direct access to hardware encoders and decoders. It provides video compression and decompression services, as well as conversion between raster image formats stored in CoreVideo pixel buffers. These services are exposed as session objects (compression, decompression, and pixel transfer) and exported as Core Foundation (CF) types. VideoToolbox supports decoding of H.263, H.264, HEVC, MPEG-1, MPEG-2, MPEG-4 Part 2 and ProRes, and encoding of H.264, HEVC and ProRes. The latest version also appears to support VP9 decoding.
1 Main process
1.1 Context involved
Each decoder in FFmpeg has its own context description, which specifies the decoder's parameters and its processing function pointers in an agreed format. The main implementation of the VideoToolbox decoder in FFmpeg lives in libavcodec/videotoolbox.{h,c}, which defines an independent context for each supported decoding format, such as ff_h263_videotoolbox_hwaccel, ff_h264_videotoolbox_hwaccel, ff_hevc_videotoolbox_hwaccel, etc. They differ only in implementation details, so we can focus on one of them; here we mainly look at ff_h264_videotoolbox_hwaccel.
```c
const AVHWAccel ff_h264_videotoolbox_hwaccel = {
    .name           = "h264_videotoolbox",
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_H264,
    .pix_fmt        = AV_PIX_FMT_VIDEOTOOLBOX,
    .alloc_frame    = ff_videotoolbox_alloc_frame,
    .start_frame    = ff_videotoolbox_h264_start_frame,
    .decode_slice   = ff_videotoolbox_h264_decode_slice,
    .decode_params  = videotoolbox_h264_decode_params,
    .end_frame      = videotoolbox_h264_end_frame,
    .frame_params   = ff_videotoolbox_frame_params,
    .init           = ff_videotoolbox_common_init,
    .uninit         = ff_videotoolbox_uninit,
    .priv_data_size = sizeof(VTContext),
};
```
This structure defines:
- `.name`: the name of the decoder;
- `.type`: the type of data being decoded;
- `.id`: the codec ID;
- `.pix_fmt`: the hardware pixel format;
- `.alloc_frame`: allocates a hardware-related frame structure;
- `.start_frame`: performs memory copies and other operations on the frame before decoding starts;
- `.decode_slice`: decodes the slice data;
- `.decode_params`: parses the parameters required by the decoder, such as the SPS;
- `.end_frame`: post-processing after the frame has been submitted;
- `.frame_params`: fills in the hardware frame parameters;
- `.init`: initializes the hardware decoder;
- `.uninit`: destroys the hardware decoder;
- `.priv_data_size`: the size of the private context describing the current hardware decoder.
ff_h264_videotoolbox_hwaccel is stored in hw_configs, and at runtime FFmpeg traverses that list to find the desired hardware decoder. Decoding therefore first goes through FFmpeg's ff_h264_decoder and then enters the hardware decoder.
```c
const AVCodec ff_h264_decoder = {
    .name           = "h264",
    .long_name      = NULL_IF_CONFIG_SMALL("H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_H264,
    .priv_data_size = sizeof(H264Context),
    .init           = h264_decode_init,
    .close          = h264_decode_end,
    .decode         = h264_decode_frame,
    .capabilities   = /*AV_CODEC_CAP_DRAW_HORIZ_BAND |*/ AV_CODEC_CAP_DR1 |
                      AV_CODEC_CAP_DELAY | AV_CODEC_CAP_SLICE_THREADS |
                      AV_CODEC_CAP_FRAME_THREADS,
    .hw_configs     = (const AVCodecHWConfigInternal *const []) {
#if CONFIG_H264_DXVA2_HWACCEL
                          HWACCEL_DXVA2(h264),
#endif
#if CONFIG_H264_D3D11VA_HWACCEL
                          HWACCEL_D3D11VA(h264),
#endif
#if CONFIG_H264_D3D11VA2_HWACCEL
                          HWACCEL_D3D11VA2(h264),
#endif
#if CONFIG_H264_NVDEC_HWACCEL
                          HWACCEL_NVDEC(h264),
#endif
#if CONFIG_H264_VAAPI_HWACCEL
                          HWACCEL_VAAPI(h264),
#endif
#if CONFIG_H264_VDPAU_HWACCEL
                          HWACCEL_VDPAU(h264),
#endif
#if CONFIG_H264_VIDEOTOOLBOX_HWACCEL
                          HWACCEL_VIDEOTOOLBOX(h264),
#endif
                          NULL
                      },
    .caps_internal  = FF_CODEC_CAP_INIT_THREADSAFE | FF_CODEC_CAP_EXPORTS_CROPPING |
                      FF_CODEC_CAP_ALLOCATE_PROGRESS | FF_CODEC_CAP_INIT_CLEANUP,
    .flush          = h264_decode_flush,
    .update_thread_context          = ONLY_IF_THREADS_ENABLED(ff_h264_update_thread_context),
    .update_thread_context_for_user = ONLY_IF_THREADS_ENABLED(ff_h264_update_thread_context_for_user),
    .profiles       = NULL_IF_CONFIG_SMALL(ff_h264_profiles),
    .priv_class     = &h264_class,
};
```
VTContext describes the context used during VideoToolbox decoding.
```c
typedef struct VTContext {
    // The current bitstream buffer.
    uint8_t                      *bitstream;

    // The current size of the bitstream.
    int                           bitstream_size;

    // The reference size used for fast reallocation.
    int                           allocated_size;

    // The core video buffer.
    CVImageBufferRef              frame;

    // Current dummy frames context (depends on exact CVImageBufferRef params).
    struct AVBufferRef           *cached_hw_frames_ctx;

    // Non-NULL if the new hwaccel API is used. This is only a separate struct
    // to ease compatibility with the old API.
    struct AVVideotoolboxContext *vt_ctx;

    // Current H264 parameters (used to trigger decoder restart on SPS changes).
    uint8_t                       sps[3];
    bool                          reconfig_needed;

    void                         *logctx;
} VTContext;
```
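The `bitstream`, `bitstream_size` and `allocated_size` fields implement a grow-only scratch buffer: the allocation is enlarged only when a frame needs more space than any previous one, so most frames reuse the existing memory. The sketch below is a self-contained illustration of that reallocation pattern with hypothetical names; it is not FFmpeg's actual code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scratch buffer mirroring the bitstream/allocated_size idea. */
typedef struct ScratchBuffer {
    uint8_t *data;      /* current bitstream buffer */
    int      size;      /* bytes of valid data */
    int      allocated; /* capacity, kept across frames for fast reuse */
} ScratchBuffer;

/* Append bytes, growing the allocation only when capacity is exceeded. */
static int scratch_append(ScratchBuffer *b, const uint8_t *src, int len)
{
    if (b->size + len > b->allocated) {
        uint8_t *p = realloc(b->data, b->size + len);
        if (!p)
            return -1;
        b->data      = p;
        b->allocated = b->size + len;
    }
    memcpy(b->data + b->size, src, len);
    b->size += len;
    return 0;
}
```

A new frame simply resets `size` to 0 and reuses the existing allocation, which is why `allocated_size` is kept separately from `bitstream_size`.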
1.2 Main process
2 Specific implementation of each step
2.1 ff_videotoolbox_common_init
ff_videotoolbox_common_init is called when the decoder is initialized, usually when avcodec_open2 sets up the hardware decoder. Note that, in order to detect the media information of the current video more accurately, FFmpeg also initializes the decoder and decodes a small number of frames during avformat_find_stream_info to probe the stream information.
During initialization, the VT context memory is allocated first and a few parameters are set; in fact, only VT's callback function and pixel format are configured. After that, AVHWFramesContext is initialized on demand, mainly to allocate memory and set frame properties such as width, height and format.
The last step is to call videotoolbox_start to create a VT session. The creation itself is straightforward: it simply calls Apple's API. What deserves attention is how the session is configured; the specific implementation is videotoolbox_decoder_config_create, in which the hardware-acceleration options are hard-coded and cannot be configured. It also takes the SPS and related information out of the current AVCodecContext and passes it to the decoder; without this information, the decoder cannot correctly interpret the bitstream. The parsing of the SPS and PPS themselves is done by FFmpeg.
```c
switch (codec_type) {
case kCMVideoCodecType_MPEG4Video:
    if (avctx->extradata_size)
        data = videotoolbox_esds_extradata_create(avctx);
    if (data)
        CFDictionarySetValue(avc_info, CFSTR("esds"), data);
    break;
case kCMVideoCodecType_H264:
    data = ff_videotoolbox_avcc_extradata_create(avctx);
    if (data)
        CFDictionarySetValue(avc_info, CFSTR("avcC"), data);
    break;
case kCMVideoCodecType_HEVC:
    data = ff_videotoolbox_hvcc_extradata_create(avctx);
    if (data)
        CFDictionarySetValue(avc_info, CFSTR("hvcC"), data);
    break;
#if CONFIG_VP9_VIDEOTOOLBOX_HWACCEL
case kCMVideoCodecType_VP9:
    data = ff_videotoolbox_vpcc_extradata_create(avctx);
    if (data)
        CFDictionarySetValue(avc_info, CFSTR("vpcC"), data);
    break;
#endif
default:
    break;
}
```
The decoding callback is simple: it just retains the CVPixelBuffer.
```c
static void videotoolbox_decoder_callback(void *opaque,
                                          void *sourceFrameRefCon,
                                          OSStatus status,
                                          VTDecodeInfoFlags flags,
                                          CVImageBufferRef image_buffer,
                                          CMTime pts,
                                          CMTime duration)
{
    VTContext *vtctx = opaque;

    if (vtctx->frame) {
        CVPixelBufferRelease(vtctx->frame);
        vtctx->frame = NULL;
    }

    if (!image_buffer) {
        av_log(vtctx->logctx, AV_LOG_DEBUG,
               "vt decoder cb: output image buffer is null: %i\n", status);
        return;
    }

    vtctx->frame = CVPixelBufferRetain(image_buffer);
}
```
2.2 videotoolbox_h264_decode_params and ff_videotoolbox_frame_params
The main job of videotoolbox_h264_decode_params is to copy the SPS and PPS information decoded by the upper layer into VTContext.
```c
case H264_NAL_SPS: {
    GetBitContext tmp_gb = nal->gb;
    if (avctx->hwaccel && avctx->hwaccel->decode_params) {
        ret = avctx->hwaccel->decode_params(avctx,
                                            nal->type,
                                            nal->raw_data,
                                            nal->raw_size);
        if (ret < 0)
            goto end;
    }
    if (ff_h264_decode_seq_parameter_set(&tmp_gb, avctx, &h->ps, 0) >= 0)
        break;
    av_log(h->avctx, AV_LOG_DEBUG,
           "SPS decoding failure, trying again with the complete NAL\n");
    init_get_bits8(&tmp_gb, nal->raw_data + 1, nal->raw_size - 1);
    if (ff_h264_decode_seq_parameter_set(&tmp_gb, avctx, &h->ps, 0) >= 0)
        break;
    ff_h264_decode_seq_parameter_set(&nal->gb, avctx, &h->ps, 1);
    break;
}
```
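The struct comment on VTContext's `sps[3]` says it is used to trigger a decoder restart on SPS changes; the three cached bytes correspond to profile_idc, the constraint_set flags and level_idc, which immediately follow the NAL header in an SPS. The sketch below is a hypothetical illustration of such a check, not FFmpeg's exact logic (in particular, FFmpeg only restarts when a session already exists, while this sketch also flags the very first SPS).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cache of the three SPS bytes that matter for reconfiguration:
 * profile_idc, constraint_set flags, level_idc. */
typedef struct SpsCache {
    uint8_t sps[3];
    bool    reconfig_needed;
} SpsCache;

static void on_sps(SpsCache *c, const uint8_t *nal, int size)
{
    /* nal[0] is the NAL header; nal[1..3] hold profile_idc,
     * constraint_set flags and level_idc. */
    if (size < 4)
        return;
    if (memcmp(c->sps, nal + 1, 3) != 0) {
        memcpy(c->sps, nal + 1, 3);
        c->reconfig_needed = true;
    }
}
```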
ff_videotoolbox_frame_params is relatively simple: it passes the parameters from the AVCodecContext to the AVHWFramesContext.
ff_videotoolbox_alloc_frame, ff_videotoolbox_h264_start_frame, ff_videotoolbox_h264_decode_slice and videotoolbox_h264_end_frame
These functions are called for every frame, in the order alloc_frame -> start_frame -> decode_slice -> end_frame.
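That per-frame sequence can be modeled with function pointers, in the spirit of the AVHWAccel callbacks. The types below are simplified and hypothetical (the real callbacks take FFmpeg contexts and more arguments); the demo callbacks just record the call order.

```c
#include <string.h>

/* Simplified, hypothetical model of the per-frame hwaccel callback sequence. */
typedef struct HWAccelModel {
    int (*alloc_frame)(void *ctx);
    int (*start_frame)(void *ctx, const unsigned char *buf, int size);
    int (*decode_slice)(void *ctx, const unsigned char *buf, int size);
    int (*end_frame)(void *ctx);
} HWAccelModel;

/* Drive one frame through the four callbacks, stopping on the first error. */
static int decode_one_frame(const HWAccelModel *hw, void *ctx,
                            const unsigned char *buf, int size)
{
    int ret;
    if ((ret = hw->alloc_frame(ctx)) < 0)             return ret;
    if ((ret = hw->start_frame(ctx, buf, size)) < 0)  return ret;
    if ((ret = hw->decode_slice(ctx, buf, size)) < 0) return ret;
    return hw->end_frame(ctx);
}

/* Demo callbacks that append their name to a string passed as ctx. */
static int m_alloc(void *ctx) { strcat(ctx, "alloc "); return 0; }
static int m_start(void *ctx, const unsigned char *b, int s)
{ (void)b; (void)s; strcat(ctx, "start "); return 0; }
static int m_slice(void *ctx, const unsigned char *b, int s)
{ (void)b; (void)s; strcat(ctx, "slice "); return 0; }
static int m_end(void *ctx) { strcat(ctx, "end"); return 0; }
```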
ff_videotoolbox_alloc_frame allocates a frame. At this point it is just bare memory: only the release function pointer is set to VT's release function, and no CVPixelBuffer has been bound yet. The binding happens in the decoder's callback.
ff_videotoolbox_h264_start_frame mainly copies the bitstream data from the upper layer into VTContext.
videotoolbox_common_decode_slice also copies the slice data into the same buffer.
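VideoToolbox consumes length-prefixed (avcC-style) samples rather than Annex B start codes, so copying a slice amounts to writing a 4-byte big-endian length before the NAL payload. The helper below is a minimal hypothetical sketch of that conversion, not the exact FFmpeg code.

```c
#include <stdint.h>
#include <string.h>

/* Write one NAL unit as a 4-byte big-endian length followed by the payload,
 * as expected by avcC-style (length-prefixed) bitstreams.
 * Returns the number of bytes written, or -1 if dst is too small. */
static int write_avcc_nal(uint8_t *dst, int dst_size,
                          const uint8_t *nal, int nal_size)
{
    if (dst_size < 4 + nal_size)
        return -1;
    dst[0] = (uint8_t)(nal_size >> 24);
    dst[1] = (uint8_t)(nal_size >> 16);
    dst[2] = (uint8_t)(nal_size >> 8);
    dst[3] = (uint8_t)(nal_size);
    memcpy(dst + 4, nal, nal_size);
    return 4 + nal_size;
}
```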
videotoolbox_h264_end_frame is where the data is actually sent to the decoder; the core of it is videotoolbox_session_decode_frame. The bitstream sent to the decoder here is the data copied in the steps above. Note that the callback set at initialization only retains the output buffer and does nothing else: this works because VTDecompressionSessionWaitForAsynchronousFrames is called here to wait for the asynchronous decode to complete, which ensures that the previous frame has finished decoding before the next frame's data is sent.
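That "submit, then block until the callback fires" pattern can be modeled with a condition variable. The sketch below is a self-contained pthreads analogue (hypothetical names, not the VideoToolbox API): the submitter blocks in the wait function until the asynchronous callback signals completion.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of waiting for an asynchronous decode to finish. */
typedef struct AsyncDecodeModel {
    pthread_mutex_t lock;
    pthread_cond_t  done;
    bool            frame_ready;
} AsyncDecodeModel;

/* Called from the "decoder" side when a frame has been delivered,
 * like the decompression output callback. */
static void decode_callback(AsyncDecodeModel *m)
{
    pthread_mutex_lock(&m->lock);
    m->frame_ready = true;
    pthread_cond_signal(&m->done);
    pthread_mutex_unlock(&m->lock);
}

/* Analogue of VTDecompressionSessionWaitForAsynchronousFrames: block
 * until the outstanding frame has been delivered, then reset the flag. */
static void wait_for_async_frames(AsyncDecodeModel *m)
{
    pthread_mutex_lock(&m->lock);
    while (!m->frame_ready)
        pthread_cond_wait(&m->done, &m->lock);
    m->frame_ready = false;
    pthread_mutex_unlock(&m->lock);
}
```

Because the submitter always waits before pushing the next frame, the callback itself can stay minimal, just as the retain-only callback above does.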
2.3 ff_videotoolbox_uninit
ff_videotoolbox_uninit is relatively simple: it releases the decoder context and any cached memory.
- Apple Documentation–Video Toolbox