vad : revisit timestamp alignment/mapping #3173
Merged
+195
−144
This commit improves the timestamp alignment by introducing a mapping table, adding intermediate reference points for longer segments, and using binary search for lookups. The motivation for these changes is to address issues with the current solution, where zero-length segments are possible, and to improve the precision of the VAD timestamps.
Refs: ggml-org#3162
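To illustrate the idea of a mapping table with binary-search lookups, here is a minimal sketch. It is not the code from this PR: the struct name `time_mapping`, the function `map_to_original`, the centisecond unit, and the linear interpolation between neighbouring reference points are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mapping entry: one reference point that ties a position in the
// VAD-filtered audio to the matching position in the original audio.
// Units are assumed to be centiseconds (10 ms).
struct time_mapping {
    uint64_t processed_time; // time in the filtered (speech-only) audio
    uint64_t original_time;  // time in the original audio
};

// Map a timestamp from the filtered timeline back to the original timeline.
// The table must be sorted by processed_time, with original_time non-decreasing.
static uint64_t map_to_original(const std::vector<time_mapping> & table, uint64_t t) {
    assert(std::is_sorted(table.begin(), table.end(),
        [](const time_mapping & a, const time_mapping & b) {
            return a.processed_time < b.processed_time;
        }));

    if (table.empty()) return t;
    if (t <= table.front().processed_time) return table.front().original_time;
    if (t >= table.back().processed_time)  return table.back().original_time;

    // Binary search for the first entry whose processed_time is greater than t.
    auto upper = std::upper_bound(table.begin(), table.end(), t,
        [](uint64_t value, const time_mapping & m) {
            return value < m.processed_time;
        });
    auto lower = upper - 1; // last entry with processed_time <= t

    // lower->processed_time <= t < upper->processed_time, so span > 0.
    const uint64_t span_processed = upper->processed_time - lower->processed_time;
    const uint64_t span_original  = upper->original_time  - lower->original_time;

    // Linear interpolation between the two surrounding reference points.
    return lower->original_time + (t - lower->processed_time) * span_original / span_processed;
}
```

In this sketch, a removed silence appears as two entries with the same `processed_time` but different `original_time`, and extra entries inside a long speech segment act as the intermediate reference points.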
This commit changes the type of the `processed_time` and `original_time` fields in the `vad_time_mapping` struct from `double` to `uint64_t`. The motivation for this change is to improve precision, avoid floating-point inaccuracies, and be consistent with other parts of the code base that use `uint64_t` for time representation. This is part of a refactoring in which I'm also going to change the `vad_segment_info` struct to use `uint64_t` for the start and end times, which is the reason for the not so pleasant conversions and casts in the code at the moment.
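As a rough illustration of integer time representation, the sketch below converts between sample indices and integer timestamps without going through `double`. The helper names, the 10 ms (centisecond) unit, and the fixed 16 kHz sample rate are assumptions of this sketch, not something taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 16 kHz mono audio, timestamps counted in centiseconds (10 ms).
constexpr uint64_t SAMPLE_RATE = 16000;

// Sample index -> centiseconds, truncating toward zero.
static uint64_t samples_to_cs(uint64_t n_samples) {
    assert(n_samples <= UINT64_MAX / 100); // guard against overflow
    return n_samples * 100 / SAMPLE_RATE;
}

// Centiseconds -> sample index.
static uint64_t cs_to_samples(uint64_t cs) {
    assert(cs <= UINT64_MAX / SAMPLE_RATE); // guard against overflow
    return cs * SAMPLE_RATE / 100;
}
```

With integers, two timestamps that refer to the same 10 ms slot compare exactly equal, which is harder to guarantee after repeated arithmetic on `double` seconds.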
82d6980 to 4c5ca93
To be consistent with other timestamps in the codebase.
This change does not seem to be compatible with the
./bin/whisper-cli -m ../models/ggml-large-v3-turbo.bin -f ../samples/gb0.wav --vad --vad-model ../models/silero-v5.1.2-ggml.bin -fa -p 2
The second half of the transcription has the same repeating timestamp for all segments.
@ggerganov I had not tried this with
This commit extracts the VAD processing from the `whisper_full_with_state` function into the `whisper_full` and `whisper_full_parallel` functions. The motivation is that I did not take into account that when `whisper_full_parallel` is called with `n_processors > 1`, the VAD processing would not be applied correctly. In the case of `whisper_full_parallel`, the VAD processing should instead be done prior to processing.
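The ordering issue can be sketched as follows: filter once, then divide the filtered buffer between processors. Everything here is hypothetical. `filter_speech` is a simple amplitude threshold standing in for the real VAD model, and the even split in `split_chunks` is an assumption about how work is divided, not a copy of the library code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for the VAD step: keep only samples louder than the threshold.
static std::vector<float> filter_speech(const std::vector<float> & samples, float threshold) {
    std::vector<float> out;
    for (float s : samples) {
        if (std::fabs(s) > threshold) out.push_back(s);
    }
    return out;
}

// Split [0, n_samples) into n_processors contiguous (offset, length) ranges.
// The last range absorbs any remainder.
static std::vector<std::pair<size_t, size_t>> split_chunks(size_t n_samples, int n_processors) {
    assert(n_processors > 0);
    std::vector<std::pair<size_t, size_t>> chunks;
    const size_t per_processor = n_samples / n_processors;
    for (int i = 0; i < n_processors; ++i) {
        const size_t offset = i * per_processor;
        const size_t length = (i == n_processors - 1) ? n_samples - offset : per_processor;
        chunks.emplace_back(offset, length);
    }
    return chunks;
}

// Run the filter once up front, then plan the parallel chunks on the filtered
// audio, so every chunk refers to the same filtered timeline.
static std::vector<std::pair<size_t, size_t>> plan_parallel(const std::vector<float> & samples,
                                                            int n_processors) {
    const std::vector<float> filtered = filter_speech(samples, 0.01f);
    return split_chunks(filtered.size(), n_processors);
}
```

If the filter ran inside each chunk instead, every chunk would build its own filtered timeline, and timestamps from later chunks could no longer be mapped back consistently.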
This commit removes the `filtered_n_samples` parameter from the `whisper_vad` function signature and its usages. It is no longer needed because the filtered samples are now a vector (previously a `float*`). The motivation is to simplify the usage of this function.
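A small sketch of why a separate count parameter becomes redundant once the output is a vector. The function `keep_loud_samples` and its threshold filter are hypothetical and only illustrate the calling convention; they do not reproduce the `whisper_vad` signature.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical filter with a vector output parameter. The caller reads the
// number of filtered samples from filtered.size(), so no extra count
// parameter is needed.
static bool keep_loud_samples(const float * samples, int n_samples, float threshold,
                              std::vector<float> & filtered) {
    assert(threshold >= 0.0f);
    filtered.clear();
    if (samples == nullptr || n_samples < 0) {
        return false;
    }
    filtered.reserve(n_samples);
    for (int i = 0; i < n_samples; ++i) {
        if (std::fabs(samples[i]) > threshold) {
            filtered.push_back(samples[i]);
        }
    }
    return true;
}
```

The vector also owns its memory, so the caller does not have to free a raw buffer or keep a pointer and a length in sync.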
Example using
$ ./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f samples/gb0.wav --vad --vad-model models/silero-v5.1.2-ggml.bin -fa -p 2
output
[00:00:00.000 --> 00:00:03.250] - Good morning, this Tuesday is election day.
[00:00:03.250 --> 00:00:05.950] After months of spirited debate and vigorous campaigning,
[00:00:05.950 --> 00:00:08.570] the time has come for Americans to make important decisions
[00:00:08.570 --> 00:00:10.150] about our nation's future.
[00:00:10.150 --> 00:00:13.750] I encourage all Americans to go to the polls and vote.
[00:00:13.750 --> 00:00:16.100] Election season brings out the spirit of competition
[00:00:16.100 --> 00:00:18.020] between our political parties.
[00:00:18.020 --> 00:00:20.210] And that competition is an essential part
[00:00:20.210 --> 00:00:21.760] of a healthy democracy.
[00:00:21.760 --> 00:00:23.520] But as the campaigns come to a close,
[00:00:23.520 --> 00:00:25.920] Republicans, Democrats, and Independents
[00:00:25.920 --> 00:00:29.120] can find common ground on at least one point.
[00:00:29.120 --> 00:00:31.510] Our system of representative democracy
[00:00:31.510 --> 00:00:34.440] is one of America's greatest strengths.
[00:00:34.440 --> 00:00:36.220] The United States was founded on the belief
[00:00:36.220 --> 00:00:38.280] that all men are created equal.
[00:00:38.280 --> 00:00:41.440] Every election day, millions of Americans of all races,
[00:00:41.440 --> 00:00:43.810] religions, and backgrounds step into voting booths
[00:00:43.810 --> 00:00:45.300] throughout the nation.
[00:00:45.300 --> 00:00:47.730] Whether they are rich or poor, old or young,
[00:00:47.730 --> 00:00:50.640] each of them has an equal share in choosing the path
[00:00:50.640 --> 00:00:52.450] that our country will take.
[00:00:52.450 --> 00:00:54.870] And every ballot they cast is a reminder
[00:00:54.870 --> 00:00:58.290] that our founding principles are alive and well.
[00:00:58.290 --> 00:00:59.720] Voting is one of the great privileges
[00:00:59.720 --> 00:01:01.770] of American citizenship.
[00:01:01.770 --> 00:01:03.460] And it has always required brave defenders.
[00:01:03.460 --> 00:01:09.140] As you head to the polls next week, remember the sacrifices that have been made by generations
[00:01:09.140 --> 00:01:13.100] of Americans in uniform to preserve our way of life.
[00:01:13.100 --> 00:01:17.390] From Bunker Hill to Baghdad, the men and women of American armed forces have been devoted
[00:01:17.390 --> 00:01:20.060] guardians of our democracy.
[00:01:20.060 --> 00:01:25.590] All of us owe them and their families a special debt of gratitude on Election Day.
[00:01:25.590 --> 00:01:28.670] Americans should also remember the important example that our elections set throughout
[00:01:28.670 --> 00:01:30.260] the world.
[00:01:30.260 --> 00:01:34.140] Young democracies from Georgia and Ukraine to Afghanistan and Iraq can look to the United
[00:01:34.140 --> 00:01:39.190] States for proof that self-government can endure, and nations that still live under tyranny
[00:01:39.190 --> 00:01:44.160] and oppression can find hope and inspiration in our commitment to liberty.
[00:01:44.160 --> 00:01:48.220] For more than two centuries, Americans have demonstrated the ability of free people to choose their
[00:01:48.220 --> 00:01:49.680] own leaders.
[00:01:49.680 --> 00:01:54.720] Our nation has flourished because of its commitment to trusting the wisdom of our citizenry.
[00:01:54.720 --> 00:02:00.200] In this year's election, we will see this tradition continue, and we will be reminded once again
[00:02:00.200 --> 00:02:05.510] that we are blessed to live in a free nation guided by the will of the people.
[00:02:05.510 --> 00:02:06.230] Thank you for listening.
whisper_full_parallel: the audio has been split into 2 chunks at the following times:
whisper_full_parallel: split 1 - 00:00:59.210
whisper_full_parallel: the transcription quality may be degraded near these boundaries
ggerganov approved these changes May 29, 2025
Notes regarding the changes can be found here.