Commit e41bc5c

danbev and ggerganov authored

vad : add initial Voice Activity Detection (VAD) support (#3065)

* vad : add initial Voice Activity Detection (VAD) support

  This commit adds support for Voice Activity Detection (VAD). When enabled,
  this feature processes the audio input and detects speech segments. This
  information is then used to reduce the number of samples that need to be
  processed by whisper_full.

  Resolves: #3003

  ---------

  Co-authored-by: Georgi Gerganov <[email protected]>

1 parent e39ba75 commit e41bc5c

File tree

11 files changed: +2154 −193 lines

.github/workflows/build.yml — 20 additions, 0 deletions

```diff
@@ -1253,3 +1253,23 @@ jobs:
           source venv/bin/activate
           pip install ane_transformers openai-whisper coremltools
           ./models/generate-coreml-model.sh ${{ env.MODEL_NAME }}
+
+  vad:
+    if: ${{ github.event_name == 'push' || github.event_name == 'pull_request' ||
+            github.event.inputs.run_type == 'full-ci' }}
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Build
+        shell: bash
+        run: |
+          cmake -B build
+          cmake --build build --config Release
+
+      - name: Test
+        shell: bash
+        run: |
+          ctest -R ^test-vad$ --test-dir build --output-on-failure -VV
```

README.md — 59 additions, 0 deletions

The table of contents gains an entry:

```diff
@@ -25,6 +25,7 @@ High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisp
 - [Ascend NPU Support](#ascend-npu-support)
 - [Moore Threads GPU Support](#moore-threads-gpu-support)
 - [C-style API](https://github.com/ggml-org/whisper.cpp/blob/master/include/whisper.h)
+- [Voice Activity Detection (VAD)](#voice-activity-detection-vad)
```

And a new section is added just before the existing "## Examples" section ("There are various examples of using the library for different projects in the [examples](examples) folder."):

### Voice Activity Detection (VAD)

Support for Voice Activity Detection (VAD) can be enabled using the `--vad`
argument to `whisper-cli`. In addition to this option, a VAD model is also
required.

The audio samples are first passed through the VAD model, which detects speech
segments. Only the detected speech segments are then extracted from the
original audio input and passed to whisper for processing. This reduces the
amount of audio data that whisper needs to process and can significantly speed
up transcription.

The following VAD models are currently supported:

#### Silero-VAD

[Silero-vad](https://github.com/snakers4/silero-vad) is a lightweight VAD model
written in Python that is fast and accurate.

This model can be converted to ggml using the following command:
```console
$ python3 -m venv venv && source venv/bin/activate
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

It can then be used with whisper as follows:
```console
$ ./build/bin/whisper-cli \
   --file ./samples/jfk.wav \
   --model ./models/ggml-base.en.bin \
   --vad \
   --vad-model ./models/silero-v5.1.2-ggml.bin
```

#### VAD Options

* `--vad-threshold`: Threshold probability for speech detection. A speech
segment/frame with a probability above this threshold will be considered as
speech.

* `--vad-min-speech-duration-ms`: Minimum speech duration in milliseconds.
Speech segments shorter than this value will be discarded to filter out brief
noise or false positives.

* `--vad-min-silence-duration-ms`: Minimum silence duration in milliseconds.
Silence periods must be at least this long to end a speech segment. Shorter
silence periods will be ignored and included as part of the speech.

* `--vad-max-speech-duration-s`: Maximum speech duration in seconds. Speech
segments longer than this will be automatically split into multiple segments
at silence points exceeding 98 ms to prevent excessively long segments.

* `--vad-speech-pad-ms`: Speech padding in milliseconds. Adds this amount of
padding before and after each detected speech segment to avoid cutting off
speech edges.

* `--vad-samples-overlap`: Amount of audio to extend from each speech segment
into the next one, in seconds (e.g., 0.10 = 100 ms overlap). This ensures
speech isn't cut off abruptly between segments when they're concatenated
together.

examples/cli/cli.cpp — 42 additions, 0 deletions

```diff
@@ -11,6 +11,7 @@
 #include <thread>
 #include <vector>
 #include <cstring>
+#include <cfloat>

 #if defined(_WIN32)
 #ifndef NOMINMAX
```

New VAD parameters and their defaults in `whisper_params`:

```diff
@@ -97,6 +98,16 @@ struct whisper_params {
     std::vector<std::string> fname_out = {};

     grammar_parser::parse_state grammar_parsed;
+
+    // Voice Activity Detection (VAD) parameters
+    bool vad = false;
+    std::string vad_model = "";
+    float vad_threshold = 0.5f;
+    int vad_min_speech_duration_ms = 250;
+    int vad_min_silence_duration_ms = 100;
+    float vad_max_speech_duration_s = FLT_MAX;
+    int vad_speech_pad_ms = 30;
+    float vad_samples_overlap = 0.1f;
 };

 static void whisper_print_usage(int argc, char ** argv, const whisper_params & params);
```

Argument parsing:

```diff
@@ -185,6 +196,15 @@ static bool whisper_params_parse(int argc, char ** argv, whisper_params & params
     else if (                  arg == "--grammar")                     { params.grammar = ARGV_NEXT; }
     else if (                  arg == "--grammar-rule")                { params.grammar_rule = ARGV_NEXT; }
     else if (                  arg == "--grammar-penalty")             { params.grammar_penalty = std::stof(ARGV_NEXT); }
+    // Voice Activity Detection (VAD)
+    else if (arg == "-v"    || arg == "--vad")                         { params.vad = true; }
+    else if (arg == "-vm"   || arg == "--vad-model")                   { params.vad_model = ARGV_NEXT; }
+    else if (arg == "-vt"   || arg == "--vad-threshold")               { params.vad_threshold = std::stof(ARGV_NEXT); }
+    else if (arg == "-vspd" || arg == "--vad-min-speech-duration-ms")  { params.vad_min_speech_duration_ms = std::stoi(ARGV_NEXT); }
+    else if (arg == "-vsd"  || arg == "--vad-min-silence-duration-ms") { params.vad_min_silence_duration_ms = std::stoi(ARGV_NEXT); }
+    else if (arg == "-vmsd" || arg == "--vad-max-speech-duration-s")   { params.vad_max_speech_duration_s = std::stof(ARGV_NEXT); }
+    else if (arg == "-vp"   || arg == "--vad-speech-pad-ms")           { params.vad_speech_pad_ms = std::stoi(ARGV_NEXT); }
+    else if (arg == "-vo"   || arg == "--vad-samples-overlap")         { params.vad_samples_overlap = std::stof(ARGV_NEXT); }
     else {
         fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
         whisper_print_usage(argc, argv, params);
```

Usage output:

```diff
@@ -254,6 +274,18 @@ static void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params
     fprintf(stderr, "  --grammar GRAMMAR          [%-7s] GBNF grammar to guide decoding\n",                       params.grammar.c_str());
     fprintf(stderr, "  --grammar-rule RULE        [%-7s] top-level GBNF grammar rule name\n",                     params.grammar_rule.c_str());
     fprintf(stderr, "  --grammar-penalty N        [%-7.1f] scales down logits of nongrammar tokens\n",            params.grammar_penalty);
+    // Voice Activity Detection (VAD) parameters
+    fprintf(stderr, "\nVoice Activity Detection (VAD) options:\n");
+    fprintf(stderr, "  -v,    --vad                           [%-7s] enable Voice Activity Detection (VAD)\n",          params.vad ? "true" : "false");
+    fprintf(stderr, "  -vm FNAME, --vad-model FNAME           [%-7s] VAD model path\n",                                 params.vad_model.c_str());
+    fprintf(stderr, "  -vt N, --vad-threshold N               [%-7.2f] VAD threshold for speech recognition\n",         params.vad_threshold);
+    fprintf(stderr, "  -vspd N, --vad-min-speech-duration-ms N [%-7d] VAD min speech duration in ms\n",                 params.vad_min_speech_duration_ms);
+    fprintf(stderr, "  -vsd N, --vad-min-silence-duration-ms N [%-7d] VAD min silence duration (to split segments)\n",  params.vad_min_silence_duration_ms);
+    fprintf(stderr, "  -vmsd N, --vad-max-speech-duration-s N [%-7s] VAD max speech duration (auto-split longer)\n",    params.vad_max_speech_duration_s == FLT_MAX ?
+                        std::string("FLT_MAX").c_str() :
+                        std::to_string(params.vad_max_speech_duration_s).c_str());
+    fprintf(stderr, "  -vp N, --vad-speech-pad-ms N           [%-7d] VAD speech padding (extend segments)\n",           params.vad_speech_pad_ms);
+    fprintf(stderr, "  -vo N, --vad-samples-overlap N         [%-7.2f] VAD samples overlap (seconds between segments)\n", params.vad_samples_overlap);
     fprintf(stderr, "\n");
 }
```

And the parameters are forwarded to `whisper_full_params` in `main`:

```diff
@@ -1134,6 +1166,16 @@ int main(int argc, char ** argv) {

     wparams.suppress_nst     = params.suppress_nst;

+    wparams.vad            = params.vad;
+    wparams.vad_model_path = params.vad_model.c_str();
+
+    wparams.vad_params.threshold               = params.vad_threshold;
+    wparams.vad_params.min_speech_duration_ms  = params.vad_min_speech_duration_ms;
+    wparams.vad_params.min_silence_duration_ms = params.vad_min_silence_duration_ms;
+    wparams.vad_params.max_speech_duration_s   = params.vad_max_speech_duration_s;
+    wparams.vad_params.speech_pad_ms           = params.vad_speech_pad_ms;
+    wparams.vad_params.samples_overlap         = params.vad_samples_overlap;
+
     whisper_print_user_data user_data = { &params, &pcmf32s, 0 };

     const auto & grammar_parsed = params.grammar_parsed;
```

include/whisper.h — 63 additions, 0 deletions

A new `whisper_vad_params` struct:

```diff
@@ -189,6 +189,15 @@ extern "C" {
         uint32_t value; // Unicode code point or rule ID
     } whisper_grammar_element;

+    typedef struct whisper_vad_params {
+        float threshold;               // Probability threshold to consider as speech.
+        int   min_speech_duration_ms;  // Min duration for a valid speech segment.
+        int   min_silence_duration_ms; // Min silence duration to consider speech as ended.
+        float max_speech_duration_s;   // Max duration of a speech segment before forcing a new segment.
+        int   speech_pad_ms;           // Padding added before and after speech segments.
+        float samples_overlap;         // Overlap in seconds when copying audio samples from speech segment.
+    } whisper_vad_params;
+
     // Various functions for loading a ggml whisper model.
     // Allocate (almost) all memory needed for the model.
     // Return NULL on failure
```

New fields in `whisper_full_params`:

```diff
@@ -570,11 +579,18 @@ extern "C" {
         size_t n_grammar_rules;
         size_t i_start_rule;
         float  grammar_penalty;
+
+        // Voice Activity Detection (VAD) params
+        bool         vad;            // Enable VAD
+        const char * vad_model_path; // Path to VAD model
+
+        whisper_vad_params vad_params;
     };

     // NOTE: this function allocates memory, and it is the responsibility of the caller to free the pointer - see whisper_free_context_params & whisper_free_params()
     WHISPER_API struct whisper_context_params * whisper_context_default_params_by_ref(void);
     WHISPER_API struct whisper_context_params   whisper_context_default_params (void);
+
     WHISPER_API struct whisper_full_params * whisper_full_default_params_by_ref(enum whisper_sampling_strategy strategy);
     WHISPER_API struct whisper_full_params   whisper_full_default_params (enum whisper_sampling_strategy strategy);
```

And a standalone VAD API:

```diff
@@ -652,6 +668,53 @@ extern "C" {
     WHISPER_API float whisper_full_get_token_p           (struct whisper_context * ctx, int i_segment, int i_token);
     WHISPER_API float whisper_full_get_token_p_from_state(struct whisper_state * state, int i_segment, int i_token);

+    //
+    // Voice Activity Detection (VAD)
+    //
+
+    struct whisper_vad_context;
+
+    WHISPER_API struct whisper_vad_params whisper_vad_default_params(void);
+
+    struct whisper_vad_context_params {
+        int  n_threads;  // The number of threads to use for processing.
+        bool use_gpu;
+        int  gpu_device; // CUDA device
+    };
+
+    WHISPER_API struct whisper_vad_context_params whisper_vad_default_context_params(void);
+
+    WHISPER_API struct whisper_vad_context * whisper_vad_init_from_file_with_params(const char * path_model, struct whisper_vad_context_params params);
+    WHISPER_API struct whisper_vad_context * whisper_vad_init_with_params          (struct whisper_model_loader * loader, struct whisper_vad_context_params params);
+
+    WHISPER_API bool whisper_vad_detect_speech(
+            struct whisper_vad_context * vctx,
+            const float * samples,
+            int n_samples);
+
+    WHISPER_API int     whisper_vad_n_probs(struct whisper_vad_context * vctx);
+    WHISPER_API float * whisper_vad_probs  (struct whisper_vad_context * vctx);
+
+    struct whisper_vad_segments;
+
+    WHISPER_API struct whisper_vad_segments * whisper_vad_segments_from_probs(
+            struct whisper_vad_context * vctx,
+            struct whisper_vad_params    params);
+
+    WHISPER_API struct whisper_vad_segments * whisper_vad_segments_from_samples(
+            struct whisper_vad_context * vctx,
+            struct whisper_vad_params    params,
+            const float * samples,
+            int n_samples);
+
+    WHISPER_API int whisper_vad_segments_n_segments(struct whisper_vad_segments * segments);
+
+    WHISPER_API float whisper_vad_segments_get_segment_t0(struct whisper_vad_segments * segments, int i_segment);
+    WHISPER_API float whisper_vad_segments_get_segment_t1(struct whisper_vad_segments * segments, int i_segment);
+
+    WHISPER_API void whisper_vad_free_segments(struct whisper_vad_segments * segments);
+    WHISPER_API void whisper_vad_free         (struct whisper_vad_context * ctx);
+
     ////////////////////////////////////////////////////////////////////////////

     // Temporary helpers needed for exposing ggml interface
```
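The header above declares a complete standalone VAD API. A minimal, hedged sketch of the intended call sequence, using only functions declared in this diff; it must be linked against whisper.cpp and needs a converted VAD model on disk, so it is illustrative rather than runnable as-is:

```cpp
#include "whisper.h"

#include <cstdio>

// Illustrative only: requires linking against whisper.cpp and a converted
// Silero-VAD model (see the README section above for the conversion step).
static void print_speech_segments(const float * samples, int n_samples) {
    whisper_vad_context_params cparams = whisper_vad_default_context_params();
    whisper_vad_context * vctx =
        whisper_vad_init_from_file_with_params("models/silero-v5.1.2-ggml.bin", cparams);
    if (vctx == nullptr) {
        return;
    }

    // run detection and segmentation in one call
    whisper_vad_params vparams = whisper_vad_default_params();
    whisper_vad_segments * segs =
        whisper_vad_segments_from_samples(vctx, vparams, samples, n_samples);

    for (int i = 0; i < whisper_vad_segments_n_segments(segs); ++i) {
        printf("speech segment %d: %.2f - %.2f\n", i,
               whisper_vad_segments_get_segment_t0(segs, i),
               whisper_vad_segments_get_segment_t1(segs, i));
    }

    // the caller owns both the segments and the context
    whisper_vad_free_segments(segs);
    whisper_vad_free(vctx);
}
```

Alternatively, `whisper_vad_detect_speech` plus `whisper_vad_n_probs`/`whisper_vad_probs` exposes the raw per-frame probabilities, which can then be turned into segments with `whisper_vad_segments_from_probs`.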

0 commit comments