OpenAI Whisper + FFmpeg + TTS: Dynamic cross-language video and audio translation

The author of this article is a front-end development engineer at 360 Qi Wu Troupe

Summary:

This article describes how to combine OpenAI Whisper, FFmpeg, and TTS (Text-to-Speech) to translate a video into another language and replace its voice. We will explore how to use OpenAI Whisper for speech recognition and translation, use FFmpeg to extract the video's audio track and process the video, and finally use TTS to generate new speech and replace the original audio track. In this way, we can add new language versions to a video while keeping its original visual content.

Introduction:

Today, video content is growing rapidly around the world, making cross-language distribution and multilingual support an important requirement. However, manually subtitling or dubbing videos in different languages is time-consuming and expensive. This article introduces a method that leverages OpenAI Whisper, FFmpeg, and TTS to translate videos into other languages and replace their voices, meeting multilingual needs while reducing cost and time.

  1. OpenAI Whisper: speech recognition and translation. Whisper is a powerful speech recognition model that converts speech to text and supports multiple languages. We use Whisper to extract the original speech from the video as text, then use a translation service to convert it into text in the target language.

  2. FFmpeg: video processing and audio track extraction. Next, we use FFmpeg to process the video and extract its audio track. FFmpeg is a powerful multimedia processing tool that supports a wide range of audio and video operations. We use it to extract the original video's audio track so that it can later be replaced with the newly generated speech.

  3. TTS: generating new speech. To replace the original video's audio track, we need to generate new speech. Here we use TTS (Text-to-Speech) to convert the previously translated target-language text into speech in that language. Built on deep learning models, TTS can generate natural, fluent speech that matches the content of the original video.

  4. Combining Whisper, FFmpeg, and TTS: video translation and voice replacement. Finally, we pair the target-language text produced by Whisper with the new voice generated by TTS, and use FFmpeg to write the new voice into the original video's audio track. FFmpeg's audio replacement keeps the new voice in sync with the video content.
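The four steps above can be sketched as a short command sequence built in Python. This is a minimal sketch under our own assumptions: the intermediate file names and the `-map` muxing flags are illustrative, not taken from the article.

```python
def build_pipeline(video="test.mp4", tts_audio="speech.wav", out="output.mp4"):
    """Return the shell commands for the transcribe -> strip-audio -> mux steps.

    File names are placeholders; step 3 (TTS synthesis) happens in Python
    between the commands, so it appears here only as a comment.
    """
    return [
        # 1. Whisper: transcribe the original speech and translate it to English
        f"whisper {video} --language Chinese --task translate",
        # 2. FFmpeg: strip the original audio track (-an = no audio)
        f"ffmpeg -i {video} -an -y video_noaudio.mp4",
        # 3. TTS converts the translated text into speech.wav (done in Python)
        # 4. FFmpeg: mux the generated speech onto the silent video
        f"ffmpeg -i video_noaudio.mp4 -i {tts_audio} -c:v copy "
        f"-map 0:v:0 -map 1:a:0 -y {out}",
    ]
```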

Result display

  • Original video: https://caining0.github.io/statichtml.github.io/test.mp4

  • Converted video: https://caining0.github.io/statichtml.github.io/output.mp4

Prerequisites and dependencies

pip3 install openai-whisper
pip3 install ffmpeg-python
brew install ffmpeg
pip3 install TTS  # https://github.com/coqui-ai/TTS

openai-whisper usage

Command line usage

The following command will transcribe speech from an audio file using the medium model:

whisper audio.flac audio.mp3 audio.wav --model medium

The default setting (which selects the small model) works well for transcribing English. To transcribe an audio file containing non-English speech, specify the language with the --language option:

whisper japanese.wav --language Japanese

Adding --task translate will translate the speech to English:

whisper japanese.wav --language Japanese --task translate

Run the following command to see all available options:

whisper --help

Python usage

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
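Besides the full text, `transcribe()` also returns per-segment timestamps in `result["segments"]` (each segment has `start`, `end`, and `text` keys). A small helper — the function name is ours — can render them in the same `[MM:SS.mmm --> MM:SS.mmm]` form the CLI prints:

```python
def format_segments(segments):
    """Render Whisper segments as "[MM:SS.mmm --> MM:SS.mmm] text" lines."""
    def ts(seconds):
        minutes, secs = divmod(seconds, 60)
        return f"{int(minutes):02d}:{secs:06.3f}"
    return [f"[{ts(s['start'])} --> {ts(s['end'])}] {s['text'].strip()}"
            for s in segments]

# e.g. format_segments(result["segments"]) after model.transcribe(...)
```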

Example

whisper test.mp4 --language Chinese --task translate
[00:00.000 --> 00:03.400] If the Chinese people come to design a new building, it will be like this
[00:03.400 --> 00:06.360] A new building that has been rebuilt by a Chinese city
[00:06.360 --> 00:09.480] This is a real city, maybe it's your hometown
[00:09.480 --> 00:12.640] Let's take a short film with us and show its real face
[00:12.640 --> 00:14.480] The opening is a one-minute long lens
[00:14.480 --> 00:16.520] First, the time has changed, the new season has no shadow
[00:16.520 --> 00:18.680] A sense of depression is born
[00:18.680 --> 00:20.400] We randomly saw the red tail of it
[00:20.400 --> 00:22.120] This is the new building in the hundreds of square kilometers
[00:22.120 --> 00:24.480] The blue protective tent inside the blue sky city in the front
[00:24.480 --> 00:26.080] As in the front of the crystal ball
[00:26.080 --> 00:28.360] The back is a larger environmental structure
[00:28.360 --> 00:29.800] This is the shadow of the new building
[00:29.800 --> 00:30.600] The lens is far away
[00:30.600 --> 00:32.040] We see that there is a bandage
[00:32.040 --> 00:33.560] It is passing through a huge star
[00:33.560 --> 00:35.240] Those are the stars of the stars
[00:35.240 --> 00:37.280] The stars do not affect the shape of the bandage
[00:37.280 --> 00:39.240] This means that their motivation is super
[00:39.240 --> 00:42.040] At this time, the lens enters the blue protective tent inside the first crystal ball

TTS

from TTS.api import TTS
model_name = TTS.list_models()[0]
tts = TTS(model_name)
tts.tts_to_file(text="Hello world!", speaker=tts.speakers[0], language=tts.languages[0], file_path="output.wav")
# In practice, replace the text argument with the content extracted by Whisper
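To dub a longer video, each translated segment can be synthesized to its own file. This is a sketch under stated assumptions: it reuses the tts object created above and expects segment dicts shaped like whisper.transcribe() output; the helper names and the file-naming scheme are ours.

```python
def segment_wav_paths(segments, prefix="seg"):
    """One output file per segment: seg_000.wav, seg_001.wav, ..."""
    return [f"{prefix}_{i:03d}.wav" for i in range(len(segments))]

def synthesize_segments(tts, segments):
    """Synthesize each segment's text with the coqui TTS model from above."""
    paths = segment_wav_paths(segments)
    for seg, path in zip(segments, paths):
        tts.tts_to_file(text=seg["text"], speaker=tts.speakers[0],
                        language=tts.languages[0], file_path=path)
    return paths
```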

ffmpeg

  • Extract video without audio

ffmpeg -i /Users/cnn/Downloads/test.mp4 -an -y output_new.mp4
  • Denoising

ffmpeg -y -i output_new.wav -af "anlmdn=ns=20" output_clean.wav
  • Merge and cut

ffmpeg -i merge1.wav -i a_p1.wav -filter_complex "[0:0] [1:0] concat=n=2:v=0:a=1 [a]" -map [a] -y merge0.wav
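The concat filter above generalizes to any number of per-segment files. A small builder — the function name is ours — assembles the same command for N inputs:

```python
def concat_command(inputs, out="merge0.wav"):
    """Build an ffmpeg command that concatenates N audio files into one."""
    cmd = ["ffmpeg"]
    for path in inputs:
        cmd += ["-i", path]
    # One input pad per file, e.g. [0:0][1:0], fed into the concat filter
    pads = "".join(f"[{i}:0]" for i in range(len(inputs)))
    cmd += ["-filter_complex", f"{pads}concat=n={len(inputs)}:v=0:a=1[a]",
            "-map", "[a]", "-y", out]
    return cmd
```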
  • Duration mismatch: the speech generated by TTS usually differs in duration from the original video, so the speech rate needs to be adjusted dynamically

# The idea is to compute the ratio of the TTS audio duration to the original video duration and set the speech rate accordingly
ffmpeg -y -i output.wav -filter:a "atempo=0.8" output_new.wav
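The ratio can be computed and clamped automatically. Older FFmpeg builds limit atempo to the range [0.5, 2.0], so larger corrections have to be chained; a sketch under our own assumptions (the durations would come from ffprobe, and the helper names are ours):

```python
def atempo_chain(ratio):
    """Split a tempo ratio into chained atempo filters, each in [0.5, 2.0]."""
    filters = []
    while ratio > 2.0:
        filters.append("atempo=2.0")
        ratio /= 2.0
    while ratio < 0.5:
        filters.append("atempo=0.5")
        ratio /= 0.5
    filters.append(f"atempo={ratio:g}")
    return ",".join(filters)

def fit_command(tts_duration, video_duration,
                src="output.wav", dst="output_new.wav"):
    """ffmpeg command that stretches the TTS audio to the video's length.

    atempo > 1 speeds audio up, so playing tts_duration seconds at a
    ratio of tts_duration / video_duration yields video_duration seconds.
    """
    return ["ffmpeg", "-y", "-i", src,
            "-filter:a", atempo_chain(tts_duration / video_duration), dst]
```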

Outlook

Cross-language video translation and voice localization built on OpenAI Whisper, FFmpeg, and TTS has broad prospects and market potential. As globalization advances, demand for multilingual video content keeps growing, and fields such as education, media, entertainment, and business all need multilingual support. This approach helps content creators localize videos quickly to reach global audiences while reducing cost and time.

In education, multilingual support can promote global learning, exchange, and cooperation; the media and entertainment industries can attract a wider audience through localized video content; and enterprises can use voice localization in cross-border business and cross-cultural communication to support global teamwork. In the future, such applications are expected to become part of standard video creation tools and services, providing efficient, automated cross-language translation and voice localization. In short, this application meets the need for multilingual video, creates business opportunities across industries, and promotes global communication and cooperation.

Limitations

  • The TTS output is slightly noisy; we plan to optimize this later, or consider a paid service such as Amazon Polly: https://aws.amazon.com/cn/polly/

References

  • https://github.com/openai/whisper

  • https://github.com/coqui-ai/TTS

  • https://ffmpeg.org/

-END-

About Qi Wu Troupe

Qi Wu Troupe is the largest front-end team at 360 Group and participates in the work of W3C and ECMA (TC39) on behalf of the group. Qi Wu Troupe attaches great importance to talent development, offering development tracks such as engineer, lecturer, translator, business liaison, and team leader, along with corresponding technical, professional, general, and leadership training courses. Qi Wu Troupe welcomes outstanding talents of all kinds to follow and join us with an open, talent-seeking attitude.
