[whisper] Calling whisper in Python to extract subtitles or translate them into text

Recently I have been working on video processing, and one requirement is to extract subtitles from video. Our implementation is divided into two steps: first separate the audio, then use whisper for speech recognition or translation. This article introduces the basic use of whisper in detail, along with two ways to call whisper from Python.

1. Introduction to whisper

whisper is an open-source speech recognition model from OpenAI that supports multiple languages, including Chinese. In this article, we will introduce how to install whisper and how to use it to recognize Chinese subtitles.

2. Install whisper

First, we need to install whisper. On every platform, whisper itself is installed through pip (the PyPI package is named openai-whisper; its source and documentation are on the GitHub page, https://github.com/openai/whisper), but it depends on the ffmpeg command-line tool, which is installed differently depending on your operating system:

  • For Windows users, install a recent Python from python.org, then install ffmpeg, for example by downloading a build from ffmpeg.org and adding it to your PATH.

  • For macOS users, you can install ffmpeg using Homebrew (https://brew.sh/). Run the following command in the terminal: brew install ffmpeg.

  • For Linux users, use a package manager such as apt or yum. For example, Ubuntu users run the following command in the terminal: sudo apt update && sudo apt install ffmpeg.

Of course, the environment also needs to be configured; in particular, ffmpeg must be reachable from the PATH. A console-only workflow, driving whisper entirely from the command line to translate subtitles, is more suitable for non-developers.

3. Use Whisper to extract video subtitles and generate files

3.1 Install Whisper library

First, we need to install the Whisper library. Note that the PyPI package is named openai-whisper (the package called whisper on PyPI is an unrelated project). It can be installed from the command line using:

pip install openai-whisper

3.2 Import required libraries and modules

import whisper
import arrow
import time
from datetime import datetime, timedelta
import subprocess
import re

If you need to pin dependencies, generate a requirements.txt (there are two common methods: pip freeze and pipreqs). The versions used for this article are:

arrow==1.3.0
asposestorage==1.0.2
numpy==1.25.0
openai_whisper==20230918
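The two common ways to generate a requirements.txt can be sketched as follows (pipreqs is a third-party tool and must be installed first, so it is left commented out here):

```shell
# Method 1: freeze everything installed in the current environment
pip freeze > requirements.txt

# Method 2: scan only the imports actually used by the project
# pip install pipreqs
# pipreqs ./ --encoding=utf8 --force
```

pip freeze pins every installed package, while pipreqs produces a shorter list limited to what the code imports.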

3.3 Extract subtitles and generate files

Below is a function that extracts subtitles from the target video and generates them to a specified file:

1. Calling the library directly in Python

def extract_subtitles(video_file, output_file, actual_start_time=None):
    # Load the whisper model
    model = whisper.load_model("medium")  # Choose the appropriate model according to your needs
    subtitles = []
    # Extract subtitles
    result = model.transcribe(video_file)
    start_time = arrow.get(actual_start_time, 'HH:mm:ss.SSS') if actual_start_time is not None else arrow.get(0)

    for segment in result["segments"]:
        # Calculate the start and end time of each segment
        start = format_time(start_time.shift(seconds=segment["start"]))
        end = format_time(start_time.shift(seconds=segment["end"]))
        # Build the subtitle text
        subtitle_text = f"[{start} -> {end}]: {segment['text']}"
        print(subtitle_text)
        subtitles.append(subtitle_text)
    # Write the subtitle text to the specified file
    with open(output_file, "w", encoding="utf-8") as f:
        for subtitle in subtitles:
            f.write(subtitle + "\n")

2. Calling the console command from Python

"""
Extract subtitles from the target video and generate them to the specified file
parameter:

video_file (str): path to the target video file
output_file (str): path to output file
actual_start_time (str): The actual start time of the audio, in the format of 'hour:minute:second.millisecond' (optional)
target_lang (str): target language code, for example, 'en' means English, 'zh-CN' means Simplified Chinese, etc. (optional)
"""


def extract_subtitles_translate(video_file, output_file, actual_start_time=None, target_lang=None):
#Specify the path of whisper
    whisper_path = r"D:\soft46\AncondaSelfInfo\envs\py39\Scripts\whisper"
    subtitles = []
    # Build command line parameters
    command = [whisper_path, video_file, "--task", "translate", "--language", target_lang, "--model", "large"]

    if actual_start_time is not None:
        command.extend(["--start-time", actual_start_time])

    print(command)

    try:
        # Run the command and capture its byte-stream output
        output = subprocess.check_output(command)
        output = output.decode('utf-8')  # Decode to string
        subtitle_lines = output.split('\n')  # Split the subtitle text by line

        start_time = time_to_milliseconds(actual_start_time) if actual_start_time is not None else 0
        for line in subtitle_lines:
            line = line.strip()
            if line:  # Skip empty lines
                # Parse each line of subtitle text, e.g. "[00:00.000 --> 00:07.000]  some text"
                match = re.match(r'\[(\d{2}:\d{2}\.\d{3})\s*-->\s*(\d{2}:\d{2}\.\d{3})\]\s*(.+)', line)
                if match:
                    # Offsets are kept in milliseconds here
                    # start = seconds_to_time(start_time + time_to_seconds(match.group(1)))
                    # end = seconds_to_time(start_time + time_to_seconds(match.group(2)))
                    start = start_time + time_to_milliseconds(match.group(1))
                    end = start_time + time_to_milliseconds(match.group(2))
                    text = match.group(3)
                    # Build the subtitle text in a custom output format
                    subtitle_text = f"[{start} -> {end}]: {text}"
                    print(subtitle_text)
                    subtitles.append(subtitle_text)
        # Write the subtitle text to the specified file
        with open(output_file, "w", encoding="utf-8") as f:
            for subtitle in subtitles:
                f.write(subtitle + "\n")

    except subprocess.CalledProcessError as e:
        print(f"Command execution failed: {e}")

3.4 Auxiliary functions

The code above also relies on a few auxiliary functions for converting and formatting timestamps:

def time_to_milliseconds(time_str):
    # Accept both 'HH:MM:SS.mmm' and whisper's CLI format 'MM:SS.mmm'
    parts = time_str.split(":")
    if len(parts) == 2:
        parts.insert(0, "0")
    h, m, s = parts
    seconds = int(h) * 3600 + int(m) * 60 + float(s)
    return int(seconds * 1000)

def format_time(time):
    return time.format('HH:mm:ss.SSS')

def format_time_dot(time):
    return str(timedelta(seconds=time.total_seconds())).replace(".", ",")[:-3]
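A quick sanity check of the conversion logic (the helper is restated here so the snippet is self-contained):

```python
def time_to_milliseconds(time_str):
    # Accept both 'HH:MM:SS.mmm' and 'MM:SS.mmm'
    parts = time_str.split(":")
    if len(parts) == 2:
        parts.insert(0, "0")
    h, m, s = parts
    return int((int(h) * 3600 + int(m) * 60 + float(s)) * 1000)

print(time_to_milliseconds("00:00:01.500"))  # → 1500
print(time_to_milliseconds("01:03.000"))     # → 63000
```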
    
# Encapsulate a function that calculates the running time of a method
def time_it(func, *args, **kwargs):
    start_time = time.time()
    print("Start time:", time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time)))

    result = func(*args, **kwargs)

    end_time = time.time()
    total_time = end_time - start_time

    minutes = int(total_time // 60)
    seconds = total_time % 60

    print("End time:", time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end_time)))
    print("Total execution time: {} minutes {:.2f} seconds".format(minutes, seconds))

    return result

3.5 Call function to extract subtitles

The following code can be used to call the above function and extract the subtitles to the specified output file:

if __name__ == '__main__':
    video_file = "C:/path/to/video.mp4" # Replace with the path of the target video file
    output_file = "C:/path/to/output.txt" # Replace with the path of the output txt file
    actual_start_time = '00:00:00.000' # Replace with the actual audio start time, the format is 'hour:minute:second.millisecond', if not provided, the default is 00:00:00.000 time
    # Option 1: call the library directly from the main method
    # extract_subtitles(video_file, output_file, actual_start_time)
    # Option 2: call the whisper console command, timed with time_it
    time_it(extract_subtitles_translate, video_file, output_file, None, 'en')

Remember to replace video_file and output_file with the actual video file path and output file path. The actual_start_time parameter can also be set if there is a known audio start time.

In the above code, we first import the whisper library and then define two functions: extract_subtitles, which loads a whisper model and transcribes the video directly, and extract_subtitles_translate, which shells out to the whisper command-line tool. In both cases, the recognition result is a set of timestamped text segments that are printed and written to the output file.

In the if __name__ == "__main__" block, we call extract_subtitles_translate through the time_it wrapper, passing in the video file path, the output file path, and the language, and the recognized subtitles are printed as they are produced.

3.6 Model selection, please refer to the following

_MODELS = {
    "tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
    "tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt",
    "base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt",
    "base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt",
    "small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt",
    "small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt",
    "medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt",
    "medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt",
    "large-v1": "https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt",
    "large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
    "large": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
}

# 1. tiny.en / tiny:
#    - Advantages: smallest model, suitable for resource-constrained environments, fastest inference.
#    - Disadvantages: being the smallest model, it may perform worse than the larger models on complex or long audio. ------- Many errors
#
# 2. base.en / base:
#    - Advantages: larger capacity than tiny, can handle more complex dialogue and text tasks.
#    - Disadvantages: inference may be slightly slower relative to the smaller models.
#
# 3. small.en / small:
#    - Advantages: moderate size, with a reasonable balance of accuracy and inference speed.
#    - Disadvantages: may not perform as well as the larger models on more complex dialogue and text tasks.
#
# 4. medium.en / medium:
#    - Advantages: large capacity, can handle more complex dialogue and text tasks.
#    - Disadvantages: inference may be slower relative to the smaller models. --- Sentences can come out very long, e.g.
#      [00:00:52.000 -> 00:01:03.000]: Well, there is a small box, can you see it? There is that white foam on it, and it is covered with white plastic paper. Pick up the white plastic paper, it's underneath.
#
# 5. large-v1 / large-v2 / large:
#    - Advantages: the largest capacity and strongest accuracy; handles complex dialogue and text tasks best.
#    - Disadvantages: the slowest inference and highest memory usage of all the models.
#
# (Download URLs for each model are listed in the _MODELS dict above.)


# whisper C:/Users/Lenovo/Desktop/whisper/luyin.aac --language Chinese --task translate
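As a rough guide, the whisper README lists the approximate model sizes as ~39M (tiny), ~74M (base), ~244M (small), ~769M (medium), and ~1550M (large) parameters, with approximate VRAM requirements of about 1, 1, 2, 5, and 10 GB respectively. A small sketch for picking a model name by a VRAM budget (the thresholds are rough figures from the README, not hard requirements):

```python
# (name, parameters in millions, approximate VRAM in GB) per the whisper README
MODELS = [
    ("tiny",   39,   1),
    ("base",   74,   1),
    ("small",  244,  2),
    ("medium", 769,  5),
    ("large",  1550, 10),
]

def pick_model(vram_gb: float) -> str:
    """Return the largest model whose rough VRAM requirement fits the budget."""
    best = "tiny"
    for name, _, need in MODELS:
        if need <= vram_gb:
            best = name
    return best

print(pick_model(6))   # → medium
print(pick_model(16))  # → large
```

The chosen name can then be passed straight to whisper.load_model() or the CLI's --model flag.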

4. Conclusion

Through the above steps, whisper has been successfully installed and the function of recognizing Chinese subtitles has been implemented. In actual applications, some adjustments to the code may be needed, such as handling audio file paths and post-processing the recognition results. Hope this article helps you!