For videos over two hours long, the output timeline will be confusingly intersected (Large-v3 + VAD) #3162

Open
Makememo opened this issue May 16, 2025 · 7 comments
@Makememo

Makememo commented May 16, 2025

Using code built from the master branch, with the Large-v3 model and VAD enabled, I found that the timestamps get messed up toward the end of the output.

./build/bin/whisper-cli -vm /Users/xxx/Developer/whisper.cpp/models/ggml-silero-v5.1.2.bin --vad -f samples/hd.wav -m models/ggml-large-v3.bin -osrt

If the original audio is needed, I can send it to your email.

Image

@danbev
Collaborator

danbev commented May 16, 2025

Thanks for the report! It would be great if you could send this to me (email is in my profile) and I'll take a closer look.

@Makememo
Author

Done. Thanks for your excellent work.

danbev self-assigned this May 19, 2025
@danbev
Collaborator

danbev commented May 19, 2025

@Makememo I'm currently running this, but I notice that I'm getting some repeating transcriptions earlier than the output you reported above. I'm just wondering if you see them too, or if perhaps the .wav got corrupted in some way for me:

[00:38:00.540 --> 00:38:16.570]   so yeah absolutely yeah yeah so that is that is in my opinion what i would do the ultrasound
[00:38:16.570 --> 00:38:22.310]   um i don't know have you ever have you ever used the icg guys for
[00:38:22.310 --> 00:38:25.290]   to identify the thrombus this is interesting i've never used it
[00:38:34.230 --> 00:38:36.160]   um i don't know if i've ever used the icg guys for
[00:38:36.160 --> 00:38:42.060]   um i don't know if i've ever used the icg guys for
[00:38:42.060 --> 00:38:44.300]   um i don't know if i've ever used the icg guys for
[00:38:44.300 --> 00:39:04.440]   um i don't know if i've ever used the icg guys for
[00:39:04.440 --> 00:39:06.590]   um i don't know if i've ever used the icg guys for
[00:39:06.590 --> 00:39:08.580]   um i don't know if i've ever used the icg guys for
[00:39:08.580 --> 00:39:11.500]   um i don't know if i've ever used the icg guys for
[00:39:11.500 --> 00:39:22.690]   um i don't know if i've ever used the icg guys for
[00:39:22.690 --> 00:39:24.690]   um i don't know if i've ever used the icg guys for
[00:39:24.690 --> 00:39:27.470]   um i don't know if i've ever used the icg guys for
[00:39:27.470 --> 00:39:34.880]   um i don't know if i've ever used the icg guys for
[00:39:34.880 --> 00:39:37.410]   um i don't know if i've ever used the icg guys for
[00:39:37.410 --> 00:39:39.510]   um i don't know if i've ever used the icg guys for
[00:39:39.510 --> 00:39:41.710]   um i don't know if i've ever used the icg guys for
[00:39:41.710 --> 00:39:53.040]   um i don't know if i've ever used the icg guys for
[00:39:53.040 --> 00:39:55.310]   um i don't know if i've ever used the icg guys for
[00:39:55.310 --> 00:39:57.270]   um i don't know if i've ever used the icg guys for
...

@Makememo
Author

Yes, the whisper hallucination is very severe, especially when the large-v3 model transcribes audio to text.

@danbev
Collaborator

danbev commented May 19, 2025

> Yes, the whisper hallucination is very severe, especially when the large-v3 model transcribes audio to text.

Ah I see, I've not used large-v3 much before so I did not know what to expect.
Now, if I run this without VAD enabled I also see repeats so I'm thinking it might not be specifically related to VAD.

@ggerganov
Member

Yes, the repetitions are likely not related - it's something about the V3 model. Adding -mc 0 usually seems to reduce them. But the timestamp misalignment issue reported here can be investigated even with the base or small models - no need to run V3.

Btw, I think I also noticed some misalignment of the timestamps when VAD is enabled with long audio. I didn't specifically observe intersecting segments, but I did observe significantly different time positions for the same phrase with VAD on/off. I can try to find a repro later if you don't reproduce.

@danbev
Collaborator

danbev commented May 19, 2025

I'm able to reproduce this now. I think I need to revisit the alignment/mapping of timestamps and use a different approach. Looking into this now.

danbev added a commit to danbev/whisper.cpp that referenced this issue May 20, 2025
This commit improves the timestamp alignment by introducing a mapping
table, adding intermediate reference points for longer segments, and
using binary search for lookups.

The motivation for this change is to address issues with the current
solution, where zero-length segments are possible, and also to improve
the precision of the VAD timestamps.

Refs: ggml-org#3162
danbev added a commit to danbev/whisper.cpp that referenced this issue May 22, 2025
This commit improves the timestamp alignment by introducing a mapping
table, adding intermediate reference points for longer segments, and
using binary search for lookups.

The motivation for this change is to address issues with the current
solution, where zero-length segments are possible, and also to improve
the precision of the VAD timestamps.

Refs: ggml-org#3162
danbev added a commit that referenced this issue May 30, 2025
* vad : revisit timestamp alignment/mapping

This commit improves the timestamp alignment by introducing a mapping
table, adding intermediate reference points for longer segments, and
using binary search for lookups.

The motivation for this change is to address issues with the current
solution, where zero-length segments are possible, and also to improve
the precision of the VAD timestamps.

Refs: #3162

* vad : use uint64_t for time mapping

This commit changes the type of the `processed_time` and `original_time`
fields in the `vad_time_mapping` struct from `double` to `uint64_t`.

The motivation for this change is to improve precision and avoid
floating-point inaccuracies, and also to be consistent with other parts
of the code base that use `uint64_t` for time representation.

This is a part of a refactoring where I'm also going to change the
vad_segment_info struct to use `uint64_t` for the start and end times.
This is the reason for the not so pleasant conversion and casts in the
code at the moment.

* vad : change vad_segment_info and whisper_vad_segment to use uint64_t

* vad : use int64_t instead of uint64_t for timestamps

To be consistent with other timestamps in the codebase.

* vad : add centisecond conversion functions

* vad : extract vad processing from whisper_full_with_state

This commit extracts the VAD processing from the
`whisper_full_with_state` function into the `whisper_full` and
`whisper_full_parallel` functions.

The motivation for this is that I did not take into account that when
`whisper_full_parallel` is called with `n_processors > 1`, then the
vad processing would not be applied correctly. Instead the VAD
processing should be done prior to processing in the case of
`whisper_full_parallel`.

* vad : remove filtered_n_samples from whisper_vad

This commit removes the parameter `filtered_n_samples` from the
`whisper_vad` function signature and its usage, as it is no longer
needed now that the filtered samples are stored in a vector (previously
a `float*`).

The motivation for this is to simplify the usage of this function.

* vad : remove vad_mapping_table_initialized flag

* vad : fix leaning (none) of pointer/references
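
For readers following along, the approach described in the first commit message above (a mapping table, intermediate reference points for longer segments, and binary search for lookups) could look roughly like the sketch below. This is not the code from the commit. The field names `processed_time` and `original_time` come from the commit message, but `time_mapping`, `speech_segment`, `build_mapping_table`, `map_to_original`, and the `step` parameter are hypothetical names chosen for illustration. It also assumes that times are `int64_t` centiseconds and that speech segments are concatenated back to back in the filtered audio, with no padding between them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch; all times are in centiseconds (1 cs = 10 ms).
struct time_mapping {
    int64_t processed_time; // position in the VAD-filtered (silence-removed) audio
    int64_t original_time;  // corresponding position in the original audio
};

// A speech segment as detected by VAD, in original-audio time (assumed layout).
struct speech_segment {
    int64_t orig_start;
    int64_t orig_end;
};

// Build a table sorted by processed_time: one point at each segment start,
// intermediate points every `step` centiseconds, and one point at each segment end.
std::vector<time_mapping> build_mapping_table(const std::vector<speech_segment> & segments,
                                              int64_t step) {
    assert(step > 0);
    std::vector<time_mapping> table;
    int64_t processed = 0; // running length of the filtered audio so far
    for (const auto & seg : segments) {
        const int64_t len = seg.orig_end - seg.orig_start;
        for (int64_t off = 0; off < len; off += step) {
            table.push_back({processed + off, seg.orig_start + off});
        }
        table.push_back({processed + len, seg.orig_end});
        processed += len;
    }
    return table;
}

// Map a timestamp in the filtered audio back to the original audio.
int64_t map_to_original(const std::vector<time_mapping> & table, int64_t t) {
    if (table.empty()) {
        return t; // no mapping available, pass the timestamp through
    }
    if (t <= table.front().processed_time) {
        return table.front().original_time;
    }
    if (t >= table.back().processed_time) {
        return table.back().original_time;
    }
    // Binary search for the first entry whose processed_time is greater than t.
    auto hi = std::upper_bound(table.begin(), table.end(), t,
        [](int64_t value, const time_mapping & m) { return value < m.processed_time; });
    auto lo = hi - 1; // last entry with processed_time <= t
    const int64_t dp    = hi->processed_time - lo->processed_time; // > 0 by construction
    const int64_t dorig = hi->original_time  - lo->original_time;
    // Linear interpolation between the two surrounding reference points.
    return lo->original_time + (t - lo->processed_time) * dorig / dp;
}
```

With two speech segments at 1.00-3.00 s and 10.00-15.00 s of the original audio, a filtered timestamp of 2.50 s falls half a second into the second segment and maps back to 10.50 s. Because the lookup picks the last table entry at or before the timestamp, a time that lands exactly on a segment boundary resolves to the start of the following segment, so mapped timestamps never go backwards.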