WebRTC Voice Activity Detection (VAD) Algorithm
webrtc learning and practice · October 25, 2022 2:49 pm · WebRTC
Voice activity detection was first used in telephone transmission systems to allocate time on communication channels and improve the utilization of transmission lines. It is a front-end operation of the speech processing pipeline and is of great significance in the field of speech detection.
However, voice activity detection, and in particular detecting the endpoints where human speech starts and ends, remains a technical difficulty: systems can make a judgment, but cannot guarantee its accuracy.
Building a chat-robot system usually involves the following three components:
- Speech to text (ASR/STT)
- Semantic content (NLU/NLP)
- Text-to-speech (TTS)
Speech to text mainly includes the following aspects:
- Microphone noise reduction
- Sound source localization
- Echo cancellation
- Wake word / voice activity detection
- Audio format compression
The main functions of voice activity detection include:
- Automatic barge-in (interrupting playback when the user speaks)
- Removing silent segments from speech
- Extracting the valid speech from the input audio
- Removing noise and enhancing speech
Detection principle
WebRTC's VAD uses a Gaussian mixture model (GMM), an approach that is very widely used.
The detection principle is to divide the input spectrum into six sub-bands (80 Hz–250 Hz, 250 Hz–500 Hz, 500 Hz–1 kHz, 1 kHz–2 kHz, 2 kHz–3 kHz, 3 kHz–4 kHz), based on the spectral range of the human voice, and compute the energy of each sub-band. These energies are then run through the Gaussian probability density functions to obtain a log-likelihood ratio. The log-likelihood score has a global and a local form: the global score is a weighted sum over the six sub-bands, while the local score is computed per sub-band. The speech decision first checks each sub-band; if no sub-band triggers, the global score is checked. If either test passes, the frame is judged to contain voice.
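As a rough illustration of the sub-band split, the sketch below computes the energy of each of the six bands with an FFT. Note this is only illustrative: the real WebRTC VAD computes sub-band energies with a fixed-point filter bank, not an FFT, and the function names here are made up for the example.

```python
import numpy as np

# The six sub-bands used by WebRTC VAD (Hz)
SUB_BANDS = [(80, 250), (250, 500), (500, 1000),
             (1000, 2000), (2000, 3000), (3000, 4000)]

def sub_band_energies(frame, rate=8000):
    """Split a frame's power spectrum into the six sub-bands and
    return the total energy in each band."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
    return [spectrum[(freqs >= lo) & (freqs < hi)].sum()
            for lo, hi in SUB_BANDS]

# Example: a 30 ms frame of a 440 Hz tone sampled at 8 kHz
rate = 8000
t = np.arange(int(0.03 * rate)) / rate
frame = np.sin(2 * np.pi * 440 * t)
energies = sub_band_energies(frame, rate)
# Most energy falls in the second band (250-500 Hz)
assert max(range(6), key=lambda i: energies[i]) == 1
```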
The advantage of this algorithm is that it is unsupervised and does not require rigorous training. The GMM models for noise and speech are as follows:
p(x_k | z, r_k) = (1 / √(2πσ_z²)) · exp( −(x_k − μ_z)² / (2σ_z²) )
Here x_k is the selected feature, which in WebRTC's VAD is specifically the sub-band energy, and r_k is the parameter set containing the mean μ_z and variance σ_z². z = 0 denotes noise and z = 1 denotes speech.
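With a Gaussian for each class, the decision statistic is the log-likelihood ratio between the speech and noise models. The toy sketch below uses single Gaussians with made-up means and variances purely for illustration; the actual WebRTC implementation uses two-component mixtures per sub-band in fixed-point arithmetic.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """p(x | z) for a single Gaussian component."""
    return (math.exp(-(x - mean) ** 2 / (2 * sigma ** 2))
            / math.sqrt(2 * math.pi * sigma ** 2))

def log_likelihood_ratio(x, noise=(2.0, 1.0), speech=(8.0, 2.0)):
    """log p(x|speech) - log p(x|noise); positive values favour speech.
    The (mean, sigma) defaults are invented for this example."""
    return (math.log(gaussian_pdf(x, *speech))
            - math.log(gaussian_pdf(x, *noise)))

# A low sub-band energy looks like noise, a high one like speech
assert log_likelihood_ratio(2.0) < 0
assert log_likelihood_ratio(8.0) > 0
```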
Voice activity detection in Python
In practice, it is difficult to determine the starting point of human speech using energy or feature detection alone, so most voice products on the market use a wake word to mark the start of speech. Combined with a voice loop, this can also implement voice barge-in. This kind of interaction can feel a bit clumsy: you have to say the wake word every time you want to continue the conversation. The open-source Snowboy wake-word library is available on GitHub, and you can train your own wake-word model on the official Snowboy website.
- Kitt-AI : Snowboy
- Sensory: Sensory
Since repeating a wake word gets tiresome, VAD can be used to wake the system automatically, although this approach is easily disturbed by strong noise. In far-field voice interaction scenarios, VAD faces two problems:
1. Whether the lowest-energy speech can still be detected (sensitivity).
2. Whether detection succeeds in a noisy environment (missed-detection rate and false-detection rate).
The missed-detection rate reflects speech that was present but not detected, while the false-detection rate reflects the probability that non-speech is detected as speech. Relatively speaking, missed detections are unacceptable, whereas false detections can be filtered out further by the back-end ASR and NLP algorithms. However, false detections increase system resource usage, and with it power consumption and heat, which becomes a real problem for portable devices.
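Given frame-level ground-truth labels, these two rates are straightforward to measure. A minimal sketch; the function name and the toy data below are invented for illustration:

```python
def vad_error_rates(labels, predictions):
    """Missed-detection rate: fraction of speech frames classified as silence.
    False-detection rate: fraction of silence frames classified as speech."""
    speech = [p for l, p in zip(labels, predictions) if l == 1]
    silence = [p for l, p in zip(labels, predictions) if l == 0]
    miss_rate = sum(1 for p in speech if p == 0) / len(speech)
    false_rate = sum(1 for p in silence if p == 1) / len(silence)
    return miss_rate, false_rate

# Toy example: 4 speech frames then 4 silence frames
labels      = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 1, 0, 0]
miss_rate, false_rate = vad_error_rates(labels, predictions)
assert miss_rate == 0.25 and false_rate == 0.25
```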
A setup suitable for experimenting on your own at home:
- pyaudio (`pip install pyaudio`): reads the raw audio stream from the device node; the audio is PCM-encoded;
- webrtcvad (`pip install webrtcvad`): detects whether a chunk of audio data is silence or voice;
When VAD detects voice activity continuously for a duration T1, the onset of speech can be declared.
When VAD detects no voice activity for a duration T2, the end of speech can be declared.
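The T1/T2 rule can be sketched as a small state machine over per-frame VAD flags. This is a simplified illustration: the full program below uses fractional thresholds over a ring buffer rather than requiring every frame in the window to agree, and all names and durations here are made up for the example.

```python
from collections import deque

def detect_endpoints(flags, frame_ms=30, t1_ms=300, t2_ms=600):
    """Given per-frame VAD flags (1 = voice, 0 = silence), return
    (start_frame, end_frame): speech starts after T1 ms of voiced
    frames and ends after T2 ms of silent frames."""
    n1 = t1_ms // frame_ms          # frames in the T1 window
    n2 = t2_ms // frame_ms          # frames in the T2 window
    start = end = None
    window = deque(maxlen=max(n1, n2))
    for i, flag in enumerate(flags):
        window.append(flag)
        recent = list(window)
        if start is None:
            # onset: the last n1 frames are all voiced
            if len(recent) >= n1 and all(recent[-n1:]):
                start = i - n1 + 1
        else:
            # offset: the last n2 frames are all silent
            if len(recent) >= n2 and not any(recent[-n2:]):
                end = i - n2 + 1
                break
    return start, end

# 5 silent frames, 20 voiced frames, then 25 silent frames
flags = [0] * 5 + [1] * 20 + [0] * 25
start, end = detect_endpoints(flags)
assert start == 5 and end == 25
```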
The program is very simple; I believe you will understand it quickly.
```python
'''
Requirements:
+ pyaudio - `pip install pyaudio`
+ py-webrtcvad - `pip install webrtcvad`
'''
import collections
import signal
import sys
import time
import wave
from array import array
from struct import pack

import pyaudio
import webrtcvad

FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK_DURATION_MS = 30       # supports 10, 20 and 30 (ms)
PADDING_DURATION_MS = 1500   # 1.5 s judgement window
CHUNK_SIZE = int(RATE * CHUNK_DURATION_MS / 1000)  # samples per chunk
CHUNK_BYTES = CHUNK_SIZE * 2                       # 16 bit = 2 bytes, PCM
NUM_PADDING_CHUNKS = int(PADDING_DURATION_MS / CHUNK_DURATION_MS)
# NUM_WINDOW_CHUNKS = int(240 / CHUNK_DURATION_MS)
NUM_WINDOW_CHUNKS = int(400 / CHUNK_DURATION_MS)   # 400 ms / 30 ms per chunk
NUM_WINDOW_CHUNKS_END = NUM_WINDOW_CHUNKS * 2
START_OFFSET = int(NUM_WINDOW_CHUNKS * CHUNK_DURATION_MS * 0.5 * RATE)

vad = webrtcvad.Vad(1)

pa = pyaudio.PyAudio()
stream = pa.open(format=FORMAT,
                 channels=CHANNELS,
                 rate=RATE,
                 input=True,
                 start=False,
                 # input_device_index=2,
                 frames_per_buffer=CHUNK_SIZE)

got_a_sentence = False
leave = False


def handle_int(sig, chunk):
    global leave, got_a_sentence
    leave = True
    got_a_sentence = True


def record_to_file(path, data, sample_width):
    "Saves the recorded audio data to 'path'"
    data = pack('<' + ('h' * len(data)), *data)
    wf = wave.open(path, 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(sample_width)
    wf.setframerate(RATE)
    wf.writeframes(data)
    wf.close()


def normalize(snd_data):
    "Average the volume out"
    MAXIMUM = 32767  # 16384
    times = float(MAXIMUM) / max(abs(i) for i in snd_data)
    r = array('h')
    for i in snd_data:
        r.append(int(i * times))
    return r


signal.signal(signal.SIGINT, handle_int)

while not leave:
    ring_buffer = collections.deque(maxlen=NUM_PADDING_CHUNKS)
    triggered = False
    voiced_frames = []
    ring_buffer_flags = [0] * NUM_WINDOW_CHUNKS
    ring_buffer_index = 0
    ring_buffer_flags_end = [0] * NUM_WINDOW_CHUNKS_END
    ring_buffer_index_end = 0
    buffer_in = ''
    # WangS
    raw_data = array('h')
    index = 0
    start_point = 0
    StartTime = time.time()
    print("recording: ")
    stream.start_stream()

    while not got_a_sentence and not leave:
        chunk = stream.read(CHUNK_SIZE)
        # add WangS
        raw_data.extend(array('h', chunk))
        index += CHUNK_SIZE
        TimeUse = time.time() - StartTime

        active = vad.is_speech(chunk, RATE)

        sys.stdout.write('1' if active else '_')
        ring_buffer_flags[ring_buffer_index] = 1 if active else 0
        ring_buffer_index += 1
        ring_buffer_index %= NUM_WINDOW_CHUNKS

        ring_buffer_flags_end[ring_buffer_index_end] = 1 if active else 0
        ring_buffer_index_end += 1
        ring_buffer_index_end %= NUM_WINDOW_CHUNKS_END

        # start point detection
        if not triggered:
            ring_buffer.append(chunk)
            num_voiced = sum(ring_buffer_flags)
            if num_voiced > 0.8 * NUM_WINDOW_CHUNKS:
                sys.stdout.write(' Open ')
                triggered = True
                start_point = index - CHUNK_SIZE * 20  # start point
                # voiced_frames.extend(ring_buffer)
                ring_buffer.clear()
        # end point detection
        else:
            # voiced_frames.append(chunk)
            ring_buffer.append(chunk)
            num_unvoiced = NUM_WINDOW_CHUNKS_END - sum(ring_buffer_flags_end)
            if num_unvoiced > 0.90 * NUM_WINDOW_CHUNKS_END or TimeUse > 10:
                sys.stdout.write(' Close ')
                triggered = False
                got_a_sentence = True

        sys.stdout.flush()

    sys.stdout.write('\n')
    # data = b''.join(voiced_frames)
    stream.stop_stream()
    print("done recording")
    got_a_sentence = False

    # trim everything before the detected start point, then save
    raw_data.reverse()
    for i in range(start_point):
        raw_data.pop()
    raw_data.reverse()
    raw_data = normalize(raw_data)
    record_to_file("recording.wav", raw_data, 2)
    leave = True

stream.close()
```
Run the program with: `sudo python vad.py`