6. Audio and video synchronization algorithm


Digression

In acoustics, pitch refers to how high or low a sound is, loudness refers to the intensity of the sound, and timbre refers to the character of the sound; the three should not be confused.

The corresponding attributes for video are color depth, color gamut, and brightness.

The hearing range of the human ear is 20 Hz-20 kHz. Signals within this range are called audible sound, and the human ear is most sensitive to the mid-frequency band of 1-4 kHz.

Subjective feelings about sound: timbre – Zhihu (zhihu.com)

Among the many factors that shape timbre, the spectral distribution (spectral envelope) and the time envelope of the sound have the greatest influence. In addition, the mean frequencies, the noise in the sound, the spectral centroid, random variations in the partials (irregularity parameters), and the spectral flux also affect timbre. [1] [2]

The main difference between sound sources lies in their shape and material. Shape and material determine how an object vibrates, and different vibration modes produce different vibration spectra. Spectral distribution is therefore an important feature for distinguishing sound sources.

Audio and video playback process

(Figure: the audio and video playback process; the original image is unavailable.)

Because video playback and audio playback follow separate paths, the two must be synchronized; otherwise the picture and the sound will drift apart.

Why audio and video synchronization is needed

Both the sound card and the graphics card use one frame of data as the playback unit. If playback relied solely on the frame rate and the sampling rate, then under ideal conditions the two would stay synchronized with no deviation.

In practice, video is driven by its frame rate and audio by its sampling rate, and there is no synchronization mechanism between the two. Even if the audio and video start out synchronized, they gradually drift apart as time goes by, and the desynchronization becomes more and more serious. This is because:

  1. Playback timing is difficult to control precisely. Audio and video decoding and rendering take different amounts of time, so the output of each frame may deviate slightly, and over time the desynchronization becomes more and more obvious. (For example, due to performance limitations, outputting one frame may take 42 ms rather than the nominal frame interval.)
  2. Audio output is linear, while video output can be non-linear, which introduces a bias.
  3. There may be an offset between the audio and the video in the media stream itself. (This is especially true for real-time TS streams, where the first playable audio frame and the first playable video frame do not start at the same point.)

Therefore, a certain synchronization strategy must be adopted to continuously correct the time difference between audio and video so that the image display and sound playback are generally consistent.

Therefore, to solve the audio and video synchronization problem, timestamps are introduced: first, choose a reference clock (whose time must increase linearly); when encoding, stamp each audio and video data block with a time taken from this reference clock; during playback, adjust playback according to the audio/video timestamps and the reference clock. Synchronizing video and audio is thus a dynamic process: being in sync is temporary, and being out of sync is the norm. With the reference clock as the standard, a stream that is playing too fast is slowed down, and a stream that is playing too slowly is sped up.

Timestamps in audio and video

Because of the way video is encoded, the picture frames in a video stream fall into different categories: I-frames, P-frames, and B-frames.

I-frames (intra-coded frames) use intra-frame compression and no motion compensation. Since I-frames do not depend on other frames, they can be decoded independently. The compression ratio of I-frames is relatively low, and they appear periodically in the image sequence, at a frequency chosen by the encoder.

P-frames (predicted frames) use inter-frame coding, exploiting both spatial and temporal correlation. P-frames use only forward temporal prediction, which improves compression efficiency and image quality. A P-frame can also contain intra-coded content: each macroblock in a P-frame may be either forward-predicted or intra-coded.

B-frames (bi-directionally predicted frames) use inter-frame coding with bi-directional temporal prediction, which greatly increases the compression ratio. In other words, their temporal prediction also depends on subsequent video frames. Precisely because B-frames use subsequent frames as references, the transmission (decoding) order of video frames differs from their display order.

Timestamps: DTS and PTS

  • DTS (Decoding Time Stamp): the decoding timestamp, which tells the player when to decode this frame's data.
  • PTS (Presentation Time Stamp): the presentation (display) timestamp, which tells the player when to display this frame's data.

When there are no B-frames in the video stream, DTS and PTS usually follow the same order. But if there are B-frames, we return to the problem mentioned earlier: the decoding order and the playback order differ, that is, the video output is non-linear.
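For example, for a short group of pictures displayed in the order I1 B2 B3 P4, the stream is transmitted and decoded in the order I1 P4 B2 B3, because both B-frames need P4 as a backward reference:

Display order (by PTS): I1  B2  B3  P4
Decode order (by DTS):  I1  P4  B2  B3

P4 is decoded second but shown fourth, so the player has to buffer and reorder decoded frames before presenting them.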

Sync strategy

The timestamp used for synchronization is the PTS, and there are generally three choices of reference clock:

  1. Synchronize video to audio: synchronize video based on the playback speed of audio.
  2. Synchronize audio to video: synchronize audio based on the playback speed of the video.
  3. Synchronize video and audio to an external clock: Select an external clock as the reference, and the playback speed of video and audio will be based on this clock.

When a playback source runs slower than the reference clock, its playback is sped up or frames are dropped; when it runs faster, its playback is delayed.

Considering that people are more sensitive to sound than to video, that frequently adjusting the audio gives a poor viewing experience, and that the audio playback clock increases linearly, the audio clock is generally used as the reference clock and the video is synchronized to the audio.

Threshold

  1. Undetectable: the audio/video timestamp difference is between -100 ms and +25 ms.
  2. Detectable: audio lags by more than 100 ms, or leads by more than 25 ms.
  3. Unacceptable: audio lags by more than 185 ms, or leads by more than 90 ms.
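A minimal sketch of how these thresholds might be checked in code, assuming diffMs is the audio timestamp minus the video timestamp in milliseconds (negative meaning the audio lags behind the picture); the method name and sign convention are assumptions, the numbers simply mirror the list above:

// Hypothetical helper: classify the current audio/video offset.
// diffMs = audio PTS - video PTS, in milliseconds (negative: audio lags).
static String classifySync(long diffMs) {
    if (diffMs >= -100 && diffMs <= 25) {
        return "undetectable";   // within -100 ms .. +25 ms
    } else if (diffMs < -185 || diffMs > 90) {
        return "unacceptable";   // audio lags > 185 ms or leads > 90 ms
    } else {
        return "detectable";     // noticeable but still tolerable
    }
}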

Synchronization algorithm

MediaSync | Android Developers (google.cn)

This API should be able to achieve the required synchronization, but adapting it would mean rewriting a relatively large amount of our own code, so for now we do not rely on it to implement this feature.
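For reference, a rough skeleton of how MediaSync is wired up, following the usage pattern shown in the Android documentation; the decoder, AudioTrack, and surface setup are omitted here, and the names (setUpMediaSync, videoDecoder, audioDecoder) are only illustrative:

import java.nio.ByteBuffer;
import android.media.AudioTrack;
import android.media.MediaCodec;
import android.media.MediaFormat;
import android.media.MediaSync;
import android.media.PlaybackParams;
import android.view.Surface;

// Illustrative skeleton: wire an existing video decoder, audio decoder and
// AudioTrack to a MediaSync instance. Extractor/decoder creation is omitted.
static MediaSync setUpMediaSync(Surface displaySurface, MediaCodec videoDecoder,
                                MediaFormat videoFormat, MediaCodec audioDecoder,
                                AudioTrack audioTrack) {
    MediaSync sync = new MediaSync();
    sync.setSurface(displaySurface);                  // surface that shows the frames
    Surface inputSurface = sync.createInputSurface(); // video decoder renders into this
    videoDecoder.configure(videoFormat, inputSurface, null, 0);

    sync.setAudioTrack(audioTrack);                   // MediaSync drives the AudioTrack
    sync.setCallback(new MediaSync.Callback() {
        @Override
        public void onAudioBufferConsumed(MediaSync s, ByteBuffer buffer, int bufferId) {
            // return the consumed buffer to the audio decoder
            audioDecoder.releaseOutputBuffer(bufferId, false);
        }
    }, null);

    sync.setPlaybackParams(new PlaybackParams().setSpeed(1.0f)); // start playback
    return sync;
}

// Afterwards, each decoded audio buffer is handed over with its PTS in microseconds:
//     sync.queueAudio(audioBuffer, bufferId, ptsUs);
// and each decoded video frame is released with its PTS in nanoseconds, so that
// MediaSync can render it at the right moment:
//     videoDecoder.releaseOutputBuffer(videoBufferId, ptsUs * 1000);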

Audio and video development-Audio and video synchronization algorithm_In-stream synchronization algorithm based on playback time limit-CSDN Blog

Determine the delay time of the next frame directly based on the value of diff.
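As a rough illustration of that idea (the names and the clamping of negative values are assumptions, not the blog's exact code): compute diff between the video PTS and the reference clock and fold it directly into the wait before the next frame.

// Illustrative only: derive the wait before showing the next frame directly from diff.
// frameIntervalMs is the nominal frame duration (e.g. 40 ms at 25 fps).
static long nextFrameDelayMs(long videoPtsMs, long refClockMs, long frameIntervalMs) {
    long diff = videoPtsMs - refClockMs;   // > 0: video is ahead, < 0: video is behind
    long delay = frameIntervalMs + diff;   // stretch or shrink the nominal interval
    return Math.max(0, delay);             // a negative result means render immediately
}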

Audio and video synchronization principle and implementation_Audio and video synchronization strategy-CSDN Blog

Using audio as the reference clock, the sample code synchronizes video to audio in the following steps (a condensed sketch in code follows the list):

  1. Take the PTS of the video frame about to be displayed and subtract the PTS of the previous frame; this gives delay, the time the previous frame should remain on screen. Compare the current video PTS with the current audio PTS of the reference clock to obtain the audio/video gap diff.
  2. Compute the synchronization threshold sync_threshold, roughly the duration of one video frame, in the range 10 ms-100 ms. If |diff| is smaller than sync_threshold, no correction is needed; otherwise delay is corrected using delay + diff.
  3. If sync_threshold is exceeded and the video lags behind the audio, the delay must be reduced to FFMAX(0, delay + diff) so that the current frame is displayed as soon as possible.
  4. If the video lags by more than 1 second and video frames have already been output quickly for the last 10 frames, then feedback must be sent to the audio source to slow down, and at the same time the video source must be told to drop frames, so that the video can catch up with the audio as soon as possible. In this case the video decoder probably cannot keep up, and adjusting the delay alone is useless.
  5. If sync_threshold is exceeded and the video is ahead of the audio, the delay must be increased to postpone the display of the current frame. Setting delay*2 closes the gap gradually; this smooths the adjustment, because jumping straight to delay + diff would make the picture stutter.
  6. If the display duration of the previous frame was itself already very long, then delay + diff can be applied in one step, because in that case a gradual adjustment makes little sense.
  7. Rendering time must also be taken into account. frame_timer is the system time at which the frame should be displayed; frame_timer + delay - curr_time gives the time we still need to wait before displaying the current frame.
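A condensed Java sketch of steps 1-7, modeled on FFplay's compute_target_delay; the constants are illustrative and the frame-drop/audio feedback of step 4 is left out:

// delayS: display duration of the previous frame (difference of video PTS), in seconds
// diffS : current video PTS minus the audio clock, in seconds (> 0: video ahead)
static double computeTargetDelay(double delayS, double diffS) {
    final double SYNC_MIN = 0.010, SYNC_MAX = 0.100;   // 10 ms .. 100 ms
    final double FRAME_DUP_THRESHOLD = 0.100;          // "previous frame was long" limit
    double syncThreshold = Math.max(SYNC_MIN, Math.min(SYNC_MAX, delayS));

    if (diffS <= -syncThreshold) {
        // video lags behind audio: shrink the delay so the frame shows sooner (step 3)
        delayS = Math.max(0, delayS + diffS);
    } else if (diffS >= syncThreshold && delayS > FRAME_DUP_THRESHOLD) {
        // previous frame was already long: correct the whole gap in one step (step 6)
        delayS = delayS + diffS;
    } else if (diffS >= syncThreshold) {
        // video is ahead of audio: double the delay to close the gap smoothly (step 5)
        delayS = 2 * delayS;
    }
    // the caller adds delayS to frame_timer and waits frame_timer - curr_time (step 7)
    return delayS;
}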

In-depth understanding of Android audio and video synchronization mechanism (1) Overview-CSDN Blog

This blog gives an overview of the synchronization mechanisms used in a series of video players.

Code

Assume that the audio is synchronized with the system clock, and make the video converge toward the system clock.

Because there is currently no way to obtain the audio playback timeline, we can only obtain the video playback timeline. However, audio is played through AudioTrack, which keeps pace with the system clock, so we adjust the video.

// Delayed rendering: decide whether to show the current video frame now,
// wait before showing it, or drop it.
// videoTimeUs: the frame's presentation time (PTS) in microseconds
// sysTimeMs  : the system time (ms) recorded when playback started
private boolean adjustPlay(long videoTimeUs, long sysTimeMs) {
    // diff > 0: the frame's PTS is ahead of the elapsed playback time (video is early)
    // diff < 0: the frame is already late
    long diff = videoTimeUs / 1000 - (System.currentTimeMillis() - sysTimeMs);
    Log.i(TAG, "The diff is " + diff);
    if (diff >= 10 && diff <= 100) {
        // within tolerance: render immediately
        return true;
    }
    if (diff > 100) {
        // too early: sleep until the frame is due, then render
        try {
            DecodeVideoThread.sleep(diff);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return true;
    }
    if (diff < 0) {
        // too late: drop the frame
        return false;
    }
    return true;
}
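Presumably the video decoding loop uses the return value to decide whether to render or drop the decoded frame, roughly like this (bufferId and bufferInfo come from the MediaCodec dequeue call; this caller is an assumption, not part of the original code):

// Hypothetical caller: render the frame only if adjustPlay() says it is on time.
int bufferId = videoDecoder.dequeueOutputBuffer(bufferInfo, TIMEOUT_US);
if (bufferId >= 0) {
    boolean render = adjustPlay(bufferInfo.presentationTimeUs, startSysTimeMs);
    // true: render to the surface; false: the frame is late, drop it
    videoDecoder.releaseOutputBuffer(bufferId, render);
}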