vad : add initial Voice Activity Detection (VAD) support #3065

Merged
merged 49 commits into from
May 12, 2025

Changes from all commits (49 commits)
871da0b
vad : add initial Voice Activity Detection (VAD) support
danbev Apr 7, 2025
2490168
examples : add VAD parameters to CLI [no ci]
danbev Apr 22, 2025
eb23253
ci : add job to test VAD
danbev Apr 28, 2025
59252c2
vad : map timestamps to original audio
danbev May 2, 2025
37a36a3
squash! vad : add initial Voice Activity Detection (VAD) support [no ci]
danbev May 3, 2025
033c0ce
vad : extract VAD processing to a separate function
danbev May 3, 2025
028481e
vad : add TODOs to optimize segment access [no ci]
danbev May 4, 2025
fc7ebf2
vad : only use CPU backend for VAD processing [no ci]
danbev May 4, 2025
3276232
tests : fix strcmp assert and use beam search
danbev May 4, 2025
abc05c5
vad : dont reshape stft_forward_basis tensor
danbev May 7, 2025
0e18ceb
vad : use ggml_row_size() and rename hdim_bytes to hdim_size
danbev May 7, 2025
9bf1b4b
vad : remove unnecessary ggml_cont
danbev May 7, 2025
dc52995
vad : fix typo in log message
danbev May 7, 2025
2b05773
vad : don't use left leaning ref for segment
danbev May 7, 2025
44bdef1
vad : use std::vector<float> instead float pointers
danbev May 7, 2025
27eb59b
vad : enable GPU support for VAD but default to false
danbev May 8, 2025
643a91b
vad : use kebab-case and not snake_case for VAD options
danbev May 8, 2025
e4d4307
vad : add h_state and c_state to whisper_vad_state
danbev May 8, 2025
94c3aba
vad : always initialize filtered_n_samples to 0
danbev May 8, 2025
e70e486
vad : use orig timestamp for first segment
danbev May 8, 2025
436baeb
vad : fix buffers and enable GPU support by default
ggerganov May 9, 2025
eb2c83e
vad : fix use_gpu assert in test-vad.cpp
danbev May 9, 2025
47c8f02
vad : remove unnecessary reserve [no ci]
danbev May 9, 2025
327cdae
vad : add probs to whisper_vad_state
danbev May 9, 2025
bf2b0df
vad : add timing of vad processing [no ci]
danbev May 9, 2025
243e0db
vad : force GPU off for now
ggerganov May 10, 2025
65c421d
vad : minor style and naming changes
ggerganov May 10, 2025
cae38fd
vad : minor style
ggerganov May 10, 2025
cd953eb
vad : remove obsolete whisper_vad_free_speech
ggerganov May 10, 2025
f42e6e4
vad : refactor whiser_vad_params API
ggerganov May 10, 2025
4ff858b
vad : simplify whisper_vad_timestamps_from_probs()
ggerganov May 10, 2025
13a7517
vad : refactor whisper_vad_timestamps_from_probs to use C++
ggerganov May 10, 2025
3bcc44c
vad : make whisper_vad_timestamps oblique in API
danbev May 10, 2025
5543c80
vad : rename whisper_vad_speech to whisper_vad_probs
danbev May 10, 2025
8b6f19c
vad : move whisper_vad_segment to whisper.cpp
danbev May 10, 2025
7625ba1
vad : make segments vector a std::vector
danbev May 10, 2025
b0b2f9b
vad : use std::vector for segments in whisper_vad_timestamps_from_probs
danbev May 10, 2025
20fe0b3
vad : rename pcmf32 parameters to samples [no ci]
danbev May 10, 2025
f212310
vad : remove n_segments from struct whisper_vad_timestamps
danbev May 10, 2025
4c7fe00
vad : rename whisper_vad_timestamps to whisper_vad_segments [no ci]
danbev May 10, 2025
dc541f9
vad : remove whisper_vad_probs struct [no ci]
danbev May 11, 2025
163ad53
vad : remove whisper_vad_state struct
danbev May 11, 2025
810981f
vad : remove window_size_samples from VAD params
danbev May 12, 2025
050038c
vad : clarify VAD CLI options [no ci]
danbev May 12, 2025
3cff658
docs : add VAD section to README.md [no ci]
danbev May 12, 2025
acc8747
squash! docs : add VAD section to README.md [no ci]
danbev May 12, 2025
7aac6ec
vad : minor rename
ggerganov May 12, 2025
41c2010
squash! docs : add VAD section to README.md [no ci]
danbev May 12, 2025
67f0fd4
vad : fix cli option names [no ci]
danbev May 12, 2025
20 changes: 20 additions & 0 deletions .github/workflows/build.yml
@@ -1253,3 +1253,23 @@ jobs:
source venv/bin/activate
pip install ane_transformers openai-whisper coremltools
./models/generate-coreml-model.sh ${{ env.MODEL_NAME }}

vad:
if: ${{ github.event_name == 'push' || github.event_name == 'pull_request' ||
github.event.inputs.run_type == 'full-ci' }}
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v4

- name: Build
shell: bash
run: |
cmake -B build
cmake --build build --config Release

- name: Test
shell: bash
run: |
ctest -R ^test-vad$ --test-dir build --output-on-failure -VV
59 changes: 59 additions & 0 deletions README.md
@@ -25,6 +25,7 @@ High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisp
- [Ascend NPU Support](#ascend-npu-support)
- [Moore Threads GPU Support](#moore-threads-gpu-support)
- [C-style API](https://github.com/ggml-org/whisper.cpp/blob/master/include/whisper.h)
- [Voice Activity Detection (VAD)](#voice-activity-detection-vad)

Supported platforms:

@@ -732,6 +733,64 @@ let package = Package(
)
```

### Voice Activity Detection (VAD)
Support for Voice Activity Detection (VAD) can be enabled using the `--vad`
argument to `whisper-cli`. In addition to this option, a VAD model is also
required.

The way this works is that the audio samples are first passed through the
VAD model, which detects speech segments. Using this information, only the
detected speech segments are extracted from the original audio input and
passed to whisper for processing. This reduces the amount of audio data that
whisper has to process and can significantly speed up the transcription
process.
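
The CLI wires these settings into `whisper_full_params` (see `examples/cli/cli.cpp`
below), but the VAD step can also be driven directly through the new C API in
`include/whisper.h`. The following is a minimal sketch, not taken from the PR,
assuming a converted Silero-VAD model at `models/silero-v5.1.2-ggml.bin` and audio
already loaded as 16 kHz mono float samples:

```cpp
#include "whisper.h"

#include <cstdio>
#include <vector>

int main() {
    // Assumed: 16 kHz mono float samples already loaded (loading code omitted).
    std::vector<float> samples;

    // Initialize a VAD context from a converted Silero-VAD model (hypothetical path).
    whisper_vad_context_params cparams = whisper_vad_default_context_params();
    whisper_vad_context * vctx =
        whisper_vad_init_from_file_with_params("models/silero-v5.1.2-ggml.bin", cparams);
    if (vctx == nullptr) {
        return 1;
    }

    // Run VAD and turn the per-frame probabilities into speech segments.
    whisper_vad_params vparams = whisper_vad_default_params();
    whisper_vad_segments * segments =
        whisper_vad_segments_from_samples(vctx, vparams, samples.data(), (int) samples.size());

    // Print the detected segments; only these ranges would be passed on to whisper.
    const int n = whisper_vad_segments_n_segments(segments);
    for (int i = 0; i < n; i++) {
        printf("segment %d: %.2f -> %.2f\n", i,
               whisper_vad_segments_get_segment_t0(segments, i),
               whisper_vad_segments_get_segment_t1(segments, i));
    }

    whisper_vad_free_segments(segments);
    whisper_vad_free(vctx);

    return 0;
}
```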

The following VAD models are currently supported:

#### Silero-VAD
[Silero-vad](https://github.com/snakers4/silero-vad) is a lightweight VAD model
written in Python that is fast and accurate.

This model can be converted to ggml using the following command:
```console
$ python3 -m venv venv && source venv/bin/activate
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```
And it can then be used with whisper as follows:
```console
$ ./build/bin/whisper-cli \
--file ./samples/jfk.wav \
--model ./models/ggml-base.en.bin \
--vad \
--vad-model ./models/silero-v5.1.2-ggml.bin
```

#### VAD Options

The following options control how speech segments are detected and extracted
(an example invocation combining them is shown after this list):

* --vad-threshold: Threshold probability for speech detection. A speech
segment/frame with a probability above this threshold is considered speech.

* --vad-min-speech-duration-ms: Minimum speech duration in milliseconds. Speech
segments shorter than this value will be discarded to filter out brief noise or
false positives.

* --vad-min-silence-duration-ms: Minimum silence duration in milliseconds. Silence
periods must be at least this long to end a speech segment. Shorter silence
periods will be ignored and included as part of the speech.

* --vad-max-speech-duration-s: Maximum speech duration in seconds. Speech segments
longer than this will be automatically split into multiple segments at silence
points exceeding 98ms to prevent excessively long segments.

* --vad-speech-pad-ms: Speech padding in milliseconds. Adds this amount of padding
before and after each detected speech segment to avoid cutting off speech edges.

* --vad-samples-overlap: Amount of audio to extend from each speech segment into
the next one, in seconds (e.g., 0.10 = 100ms overlap). This ensures speech isn't
cut off abruptly between segments when they're concatenated together.
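
As a hypothetical illustration (not part of the PR), these options can be combined
in a single invocation; the values below simply restate the CLI defaults from
`examples/cli/cli.cpp`:

```console
$ ./build/bin/whisper-cli \
   --file ./samples/jfk.wav \
   --model ./models/ggml-base.en.bin \
   --vad \
   --vad-model ./models/silero-v5.1.2-ggml.bin \
   --vad-threshold 0.5 \
   --vad-min-speech-duration-ms 250 \
   --vad-min-silence-duration-ms 100 \
   --vad-speech-pad-ms 30 \
   --vad-samples-overlap 0.1
```

Note that `--vad-max-speech-duration-s` is left at its default here (effectively
unlimited, `FLT_MAX`).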

## Examples

There are various examples of using the library for different projects in the [examples](examples) folder.
42 changes: 42 additions & 0 deletions examples/cli/cli.cpp
@@ -11,6 +11,7 @@
#include <thread>
#include <vector>
#include <cstring>
#include <cfloat>

#if defined(_WIN32)
#ifndef NOMINMAX
@@ -97,6 +98,16 @@ struct whisper_params {
std::vector<std::string> fname_out = {};

grammar_parser::parse_state grammar_parsed;

// Voice Activity Detection (VAD) parameters
bool vad = false;
std::string vad_model = "";
float vad_threshold = 0.5f;
int vad_min_speech_duration_ms = 250;
int vad_min_silence_duration_ms = 100;
float vad_max_speech_duration_s = FLT_MAX;
int vad_speech_pad_ms = 30;
float vad_samples_overlap = 0.1f;
};

static void whisper_print_usage(int argc, char ** argv, const whisper_params & params);
@@ -185,6 +196,15 @@ static bool whisper_params_parse(int argc, char ** argv, whisper_params & params
else if ( arg == "--grammar") { params.grammar = ARGV_NEXT; }
else if ( arg == "--grammar-rule") { params.grammar_rule = ARGV_NEXT; }
else if ( arg == "--grammar-penalty") { params.grammar_penalty = std::stof(ARGV_NEXT); }
// Voice Activity Detection (VAD)
else if (arg == "-v" || arg == "--vad") { params.vad = true; }
else if (arg == "-vm" || arg == "--vad-model") { params.vad_model = ARGV_NEXT; }
else if (arg == "-vt" || arg == "--vad-threshold") { params.vad_threshold = std::stof(ARGV_NEXT); }
else if (arg == "-vsd" || arg == "--vad-min-speech-duration-ms") { params.vad_min_speech_duration_ms = std::stoi(ARGV_NEXT); }
else if (arg == "-vsd" || arg == "--vad-min-silence-duration-ms") { params.vad_min_speech_duration_ms = std::stoi(ARGV_NEXT); }
else if (arg == "-vmsd" || arg == "--vad-max-speech-duration-s") { params.vad_max_speech_duration_s = std::stof(ARGV_NEXT); }
else if (arg == "-vp" || arg == "--vad-speech-pad-ms") { params.vad_speech_pad_ms = std::stoi(ARGV_NEXT); }
else if (arg == "-vo" || arg == "--vad-samples-overlap") { params.vad_samples_overlap = std::stof(ARGV_NEXT); }
else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
whisper_print_usage(argc, argv, params);
@@ -254,6 +274,18 @@ static void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params
fprintf(stderr, " --grammar GRAMMAR [%-7s] GBNF grammar to guide decoding\n", params.grammar.c_str());
fprintf(stderr, " --grammar-rule RULE [%-7s] top-level GBNF grammar rule name\n", params.grammar_rule.c_str());
fprintf(stderr, " --grammar-penalty N [%-7.1f] scales down logits of nongrammar tokens\n", params.grammar_penalty);
// Voice Activity Detection (VAD) parameters
fprintf(stderr, "\nVoice Activity Detection (VAD) options:\n");
fprintf(stderr, " -v, --vad [%-7s] enable Voice Activity Detection (VAD)\n", params.vad ? "true" : "false");
fprintf(stderr, " -vm FNAME, --vad-model FNAME [%-7s] VAD model path\n", params.vad_model.c_str());
fprintf(stderr, " -vt N, --vad-threshold N [%-7.2f] VAD threshold for speech recognition\n", params.vad_threshold);
fprintf(stderr, " -vspd N, --vad-min-speech-duration-ms N [%-7d] VAD min speech duration (0.0-1.0)\n", params.vad_min_speech_duration_ms);
fprintf(stderr, " -vsd N, --vad-min-silence-duration-ms N [%-7d] VAD min silence duration (to split segments)\n", params.vad_min_silence_duration_ms);
fprintf(stderr, " -vmsd N, --vad-max-speech-duration-s N [%-7s] VAD max speech duration (auto-split longer)\n", params.vad_max_speech_duration_s == FLT_MAX ?
std::string("FLT_MAX").c_str() :
std::to_string(params.vad_max_speech_duration_s).c_str());
fprintf(stderr, " -vp N, --vad-speech-pad-ms N [%-7d] VAD speech padding (extend segments)\n", params.vad_speech_pad_ms);
fprintf(stderr, " -vo N, --vad-samples-overlap N [%-7.2f] VAD samples overlap (seconds between segments)\n", params.vad_samples_overlap);
fprintf(stderr, "\n");
}

@@ -1131,6 +1163,16 @@ int main(int argc, char ** argv) {

wparams.suppress_nst = params.suppress_nst;

wparams.vad = params.vad;
wparams.vad_model_path = params.vad_model.c_str();

wparams.vad_params.threshold = params.vad_threshold;
wparams.vad_params.min_speech_duration_ms = params.vad_min_speech_duration_ms;
wparams.vad_params.min_silence_duration_ms = params.vad_min_silence_duration_ms;
wparams.vad_params.max_speech_duration_s = params.vad_max_speech_duration_s;
wparams.vad_params.speech_pad_ms = params.vad_speech_pad_ms;
wparams.vad_params.samples_overlap = params.vad_samples_overlap;

whisper_print_user_data user_data = { &params, &pcmf32s, 0 };

const auto & grammar_parsed = params.grammar_parsed;
63 changes: 63 additions & 0 deletions include/whisper.h
@@ -189,6 +189,15 @@ extern "C" {
uint32_t value; // Unicode code point or rule ID
} whisper_grammar_element;

typedef struct whisper_vad_params {
float threshold; // Probability threshold to consider as speech.
int min_speech_duration_ms; // Min duration for a valid speech segment.
int min_silence_duration_ms; // Min silence duration to consider speech as ended.
float max_speech_duration_s; // Max duration of a speech segment before forcing a new segment.
int speech_pad_ms; // Padding added before and after speech segments.
float samples_overlap; // Overlap in seconds when copying audio samples from speech segment.
} whisper_vad_params;

// Various functions for loading a ggml whisper model.
// Allocate (almost) all memory needed for the model.
// Return NULL on failure
Expand Down Expand Up @@ -570,11 +579,18 @@ extern "C" {
size_t n_grammar_rules;
size_t i_start_rule;
float grammar_penalty;

// Voice Activity Detection (VAD) params
bool vad; // Enable VAD
const char * vad_model_path; // Path to VAD model

whisper_vad_params vad_params;
};

// NOTE: this function allocates memory, and it is the responsibility of the caller to free the pointer - see whisper_free_context_params & whisper_free_params()
WHISPER_API struct whisper_context_params * whisper_context_default_params_by_ref(void);
WHISPER_API struct whisper_context_params whisper_context_default_params (void);

WHISPER_API struct whisper_full_params * whisper_full_default_params_by_ref(enum whisper_sampling_strategy strategy);
WHISPER_API struct whisper_full_params whisper_full_default_params (enum whisper_sampling_strategy strategy);

@@ -652,6 +668,53 @@ extern "C" {
WHISPER_API float whisper_full_get_token_p (struct whisper_context * ctx, int i_segment, int i_token);
WHISPER_API float whisper_full_get_token_p_from_state(struct whisper_state * state, int i_segment, int i_token);

//
// Voice Activity Detection (VAD)
//

struct whisper_vad_context;

WHISPER_API struct whisper_vad_params whisper_vad_default_params(void);

struct whisper_vad_context_params {
int n_threads; // The number of threads to use for processing.
bool use_gpu;
int gpu_device; // CUDA device
};

WHISPER_API struct whisper_vad_context_params whisper_vad_default_context_params(void);

WHISPER_API struct whisper_vad_context * whisper_vad_init_from_file_with_params(const char * path_model, struct whisper_vad_context_params params);
WHISPER_API struct whisper_vad_context * whisper_vad_init_with_params (struct whisper_model_loader * loader, struct whisper_vad_context_params params);

WHISPER_API bool whisper_vad_detect_speech(
struct whisper_vad_context * vctx,
const float * samples,
int n_samples);

WHISPER_API int whisper_vad_n_probs(struct whisper_vad_context * vctx);
WHISPER_API float * whisper_vad_probs (struct whisper_vad_context * vctx);

struct whisper_vad_segments;

WHISPER_API struct whisper_vad_segments * whisper_vad_segments_from_probs(
struct whisper_vad_context * vctx,
struct whisper_vad_params params);

WHISPER_API struct whisper_vad_segments * whisper_vad_segments_from_samples(
struct whisper_vad_context * vctx,
struct whisper_vad_params params,
const float * samples,
int n_samples);

WHISPER_API int whisper_vad_segments_n_segments(struct whisper_vad_segments * segments);

WHISPER_API float whisper_vad_segments_get_segment_t0(struct whisper_vad_segments * segments, int i_segment);
WHISPER_API float whisper_vad_segments_get_segment_t1(struct whisper_vad_segments * segments, int i_segment);

WHISPER_API void whisper_vad_free_segments(struct whisper_vad_segments * segments);
WHISPER_API void whisper_vad_free (struct whisper_vad_context * ctx);

////////////////////////////////////////////////////////////////////////////

// Temporary helpers needed for exposing ggml interface