WebRTC Voice Activity Detection (VAD) Algorithm

Voice activity detection (VAD) was first used in telephone transmission systems to allocate channel time and improve the utilization of transmission lines. It is a front-end stage of a speech processing system and is of great significance to the speech detection field.

However, voice activity detection, and in particular detecting the start and end points of human speech, remains technically difficult: systems can make a decision, but cannot guarantee that the decision is accurate.

Building a chatbot system usually involves the following three parts:

  1. Speech to text (ASR/STT)
  2. Semantic understanding (NLU/NLP)
  3. Text-to-speech (TTS)

Speech to text mainly includes the following aspects:

  1. Microphone noise reduction
  2. Sound source localization
  3. Echo cancellation
  4. Wake word / voice activity detection
  5. Audio format compression

The main functions of voice activity detection include:

  1. Automatically interrupt (barge-in)
  2. Remove silent segments from speech
  3. Extract the valid speech from the input audio
  4. Remove noise and enhance speech

Detection principle

WebRTC’s VAD uses a Gaussian mixture model (GMM), an approach that is very widely used.

The detection principle is to divide the input spectrum into six sub-bands (80 Hz–250 Hz, 250 Hz–500 Hz, 500 Hz–1 kHz, 1 kHz–2 kHz, 2 kHz–3 kHz, 3 kHz–4 kHz) that cover the frequency range of the human voice, and to compute the energy of each sub-band. The probability density functions of the Gaussian models are then used to compute a log-likelihood ratio. The log-likelihood score has a global and a local form: the global score is a weighted sum over the six sub-bands, while the local score is per sub-band. The speech decision first tests each sub-band; if no sub-band passes, the global score is tested. If either test passes, the frame is declared speech.

The advantage of this algorithm is that it is unsupervised and does not require extensive training. The GMM noise and speech models are as follows:

 p(x_k | z, r_k) = 1/√(2πσ_z²) · exp( −(x_k − μ_z)² / (2σ_z²) )

Here x_k is the selected feature, which in WebRTC's VAD is the sub-band energy, and r_k is the parameter set containing the mean μ_z and variance σ_z². z = 0 denotes noise and z = 1 denotes speech.
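To make the decision concrete, here is a minimal floating-point sketch of the idea: compute an energy per sub-band, evaluate the Gaussian likelihoods under the noise and speech models, and combine local and global log-likelihood-ratio tests. The sub-band boundaries follow the list above; the model parameters, weights, and thresholds are placeholder values for illustration only, not the fixed-point parameters used inside WebRTC (the real implementation also adapts its model parameters online).

# Illustrative sketch of the per-frame GMM decision (floating point).
import math

# Sub-band boundaries in Hz, as listed above.
SUB_BANDS = [(80, 250), (250, 500), (500, 1000),
             (1000, 2000), (2000, 3000), (3000, 4000)]

def gaussian_pdf(x, mean, var):
    # p(x | z, r) = 1/sqrt(2*pi*var) * exp(-(x - mean)^2 / (2*var))
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood_ratio(energy, noise, speech):
    # log p(x | speech) - log p(x | noise) for one sub-band;
    # 'noise' and 'speech' are (mean, variance) pairs.
    p_speech = gaussian_pdf(energy, *speech)
    p_noise = gaussian_pdf(energy, *noise)
    return math.log(p_speech + 1e-12) - math.log(p_noise + 1e-12)

def frame_is_speech(subband_energies, noise_models, speech_models, weights,
                    local_threshold, global_threshold):
    # Local test first: any single sub-band can declare speech on its own.
    llrs = [log_likelihood_ratio(e, n, s)
            for e, n, s in zip(subband_energies, noise_models, speech_models)]
    if any(llr > local_threshold for llr in llrs):
        return True
    # Otherwise fall back to the global test: a weighted sum over the six sub-bands.
    return sum(w * llr for w, llr in zip(weights, llrs)) > global_threshold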

Python voice activity detection

In practical applications it is difficult to determine the start of human speech with energy detection or feature detection alone, so most voice products on the market use a wake word to mark the start of speech. In addition, with a voice loopback added, voice interruption (barge-in) is also possible. This kind of interaction can feel a bit silly, though: you have to shout the wake word every time you want to continue the conversation. There is an open-source wake-word library, Snowboy, on GitHub, and you can log in to the official Snowboy website to train your own wake-word model.

  1. Kitt-AI: Snowboy
  2. Sensory: Sensory

Since repeating a wake word gets tiring, VAD can be used to trigger automatically instead, but this approach is easily disturbed by strong noise. In far-field voice interaction scenarios, VAD faces two problems:

1. How to successfully detect the lowest-energy speech (sensitivity).

2. How to detect successfully in a noisy environment (missed-detection rate and false-detection rate).

The missed-detection rate is the proportion of genuine speech that goes undetected, while the false-detection rate is the probability that non-speech is detected as speech. Relatively speaking, missed detections are unacceptable, whereas false detections can be filtered out further by the back-end ASR and NLP algorithms. False detections do, however, increase system resource usage, and with it power consumption and heat, which becomes a problem for portable devices.

A simple setup you can play with at home needs only two packages:

  1. pyaudio: `pip install pyaudio` — reads the raw audio stream from the device node; the audio is PCM-encoded;
  2. webrtcvad: `pip install webrtcvad` — detects whether a chunk of audio contains speech (a minimal usage sketch follows this list);
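As a minimal sketch of how the two packages fit together (the frame length, sample rate, and aggressiveness mode below are just example choices), the following reads 30 ms frames of 16 kHz mono 16-bit PCM from the default microphone and prints one flag per frame:

# Minimal pyaudio + webrtcvad sketch: classify each 30 ms frame as speech or silence.
import pyaudio
import webrtcvad

RATE = 16000
FRAME_MS = 30                             # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_SAMPLES = RATE * FRAME_MS // 1000

vad = webrtcvad.Vad(2)                    # aggressiveness: 0 (lenient) .. 3 (strict)
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=FRAME_SAMPLES)

for _ in range(100):                      # roughly three seconds of audio
    frame = stream.read(FRAME_SAMPLES)    # FRAME_SAMPLES * 2 bytes of 16-bit PCM
    print('speech' if vad.is_speech(frame, RATE) else 'silence')

stream.stop_stream()
stream.close()
pa.terminate()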

When the VAD detects voice activity for a duration of T1, the onset of speech can be declared.

When the VAD detects no voice activity for a duration of T2, the end of speech can be declared.
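A small sketch of that rule, assuming the frame flags come from webrtcvad.is_speech() and treating T1 and T2 simply as counts of consecutive voiced/unvoiced frames (the durations below are arbitrary examples; the full program further down uses ring-buffer percentages instead of strict runs):

# Illustrative T1/T2 onset/end rule over a stream of per-frame VAD flags.
def segment(frame_flags, frame_ms=30, t1_ms=300, t2_ms=900):
    # frame_flags: iterable of booleans, one per frame (True = voiced).
    t1_frames = t1_ms // frame_ms         # voiced frames needed to declare onset
    t2_frames = t2_ms // frame_ms         # unvoiced frames needed to declare the end
    in_speech = False
    voiced_run = 0
    unvoiced_run = 0
    for i, voiced in enumerate(frame_flags):
        voiced_run = voiced_run + 1 if voiced else 0
        unvoiced_run = 0 if voiced else unvoiced_run + 1
        if not in_speech and voiced_run >= t1_frames:
            in_speech = True
            print('speech onset near frame', i - t1_frames + 1)
        elif in_speech and unvoiced_run >= t2_frames:
            in_speech = False
            print('speech end near frame', i - t2_frames + 1)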

The program is very simple; you should be able to follow it quickly.

'''

Requirements:

 + pyaudio - `pip install pyaudio`

 + py-webrtcvad - `pip install webrtcvad`

'''

import webrtcvad
import collections
import sys
import signal
import pyaudio

from array import array
from struct import pack
import wave
import time

FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK_DURATION_MS = 30        # supports 10, 20 and 30 (ms)
PADDING_DURATION_MS = 1500    # 1.5 s padding window for the end-point judgement
CHUNK_SIZE = int(RATE * CHUNK_DURATION_MS / 1000)   # samples per chunk
CHUNK_BYTES = CHUNK_SIZE * 2                        # 16 bit = 2 bytes, PCM
NUM_PADDING_CHUNKS = int(PADDING_DURATION_MS / CHUNK_DURATION_MS)
# NUM_WINDOW_CHUNKS = int(240 / CHUNK_DURATION_MS)
NUM_WINDOW_CHUNKS = int(400 / CHUNK_DURATION_MS)    # 400 ms window for start-point detection
NUM_WINDOW_CHUNKS_END = NUM_WINDOW_CHUNKS * 2       # longer window for end-point detection
START_OFFSET = int(NUM_WINDOW_CHUNKS * CHUNK_DURATION_MS * 0.5 * RATE)

vad = webrtcvad.Vad(1)        # aggressiveness mode 1 (0 = least, 3 = most aggressive)

pa = pyaudio.PyAudio()
stream = pa.open(format=FORMAT,
                 channels=CHANNELS,
                 rate=RATE,
                 input=True,
                 start=False,
                 # input_device_index=2,
                 frames_per_buffer=CHUNK_SIZE)

got_a_sentence = False
leave = False


def handle_int(sig, chunk):
    # Ctrl-C: stop the current sentence and exit the outer loop
    global leave, got_a_sentence
    leave = True
    got_a_sentence = True


def record_to_file(path, data, sample_width):
    "Writes the recorded samples to a 16-bit mono WAV file at 'path'"
    data = pack('<' + ('h' * len(data)), *data)
    wf = wave.open(path, 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(sample_width)
    wf.setframerate(RATE)
    wf.writeframes(data)
    wf.close()


def normalize(snd_data):
    "Average the volume out"
    MAXIMUM = 32767  # 16384
    times = float(MAXIMUM) / max(abs(i) for i in snd_data)
    r = array('h')
    for i in snd_data:
        r.append(int(i * times))
    return r


signal.signal(signal.SIGINT, handle_int)

while not leave:
    ring_buffer = collections.deque(maxlen=NUM_PADDING_CHUNKS)
    triggered = False
    voiced_frames = []
    ring_buffer_flags = [0] * NUM_WINDOW_CHUNKS
    ring_buffer_index = 0

    ring_buffer_flags_end = [0] * NUM_WINDOW_CHUNKS_END
    ring_buffer_index_end = 0
    buffer_in = ''
    # WangS
    raw_data = array('h')
    index = 0
    start_point = 0
    StartTime = time.time()
    print("recording: ")
    stream.start_stream()

    while not got_a_sentence and not leave:
        chunk = stream.read(CHUNK_SIZE)
        # add WangS
        raw_data.extend(array('h', chunk))
        index += CHUNK_SIZE
        TimeUse = time.time() - StartTime

        active = vad.is_speech(chunk, RATE)

        sys.stdout.write('1' if active else '_')
        ring_buffer_flags[ring_buffer_index] = 1 if active else 0
        ring_buffer_index += 1
        ring_buffer_index %= NUM_WINDOW_CHUNKS

        ring_buffer_flags_end[ring_buffer_index_end] = 1 if active else 0
        ring_buffer_index_end += 1
        ring_buffer_index_end %= NUM_WINDOW_CHUNKS_END

        # start point detection: trigger when >80% of the 400 ms window is voiced
        if not triggered:
            ring_buffer.append(chunk)
            num_voiced = sum(ring_buffer_flags)
            if num_voiced > 0.8 * NUM_WINDOW_CHUNKS:
                sys.stdout.write(' Open ')
                triggered = True
                start_point = index - CHUNK_SIZE * 20  # start point
                # voiced_frames.extend(ring_buffer)
                ring_buffer.clear()
        # end point detection: close when >90% of the end window is unvoiced, or after 10 s
        else:
            # voiced_frames.append(chunk)
            ring_buffer.append(chunk)
            num_unvoiced = NUM_WINDOW_CHUNKS_END - sum(ring_buffer_flags_end)
            if num_unvoiced > 0.90 * NUM_WINDOW_CHUNKS_END or TimeUse > 10:
                sys.stdout.write(' Close ')
                triggered = False
                got_a_sentence = True

        sys.stdout.flush()

    sys.stdout.write('\n')
    # data = b''.join(voiced_frames)

    stream.stop_stream()
    print("done recording")
    got_a_sentence = False

    # write to file, dropping everything before the detected start point
    raw_data.reverse()
    for index in range(start_point):
        raw_data.pop()
    raw_data.reverse()
    raw_data = normalize(raw_data)
    record_to_file("recording.wav", raw_data, 2)
    leave = True

stream.close()

Run the program with: sudo python vad.py