For videos over two hours long, the output timeline will be confusingly intersected （Large-v3 + VAD） #3162

Makememo · 2025-05-16T02:43:34Z

Using code pulled from the master branch, the Large-v3 model, and VAD enabled, I found that the timestamps get messed up toward the end of the audio.

./build/bin/whisper-cli -vm /Users/xxx/Developer/whisper.cpp/models/ggml-silero-v5.1.2.bin --vad -f samples/hd.wav -m models/ggml-large-v3.bin -osrt

If the original audio is needed, I can send it to your email.


danbev · 2025-05-16T08:21:29Z

Thanks for the report! It would be great if you could send this to me (email is in my profile) and I'll take a closer look.

Makememo · 2025-05-16T10:30:22Z

Done. Thanks for your excellent work.

danbev · 2025-05-19T06:35:06Z

@Makememo I'm currently running this, but I notice that I'm getting some repeating transcriptions earlier than the output you reported above. I'm just wondering if you see them too, or if perhaps the .wav got corrupted in some way for me:

[00:38:00.540 --> 00:38:16.570]   so yeah absolutely yeah yeah so that is that is in my opinion what i would do the ultrasound
[00:38:16.570 --> 00:38:22.310]   um i don't know have you ever have you ever used the icg guys for
[00:38:22.310 --> 00:38:25.290]   to identify the thrombus this is interesting i've never used it
[00:38:34.230 --> 00:38:36.160]   um i don't know if i've ever used the icg guys for
[00:38:36.160 --> 00:38:42.060]   um i don't know if i've ever used the icg guys for
[00:38:42.060 --> 00:38:44.300]   um i don't know if i've ever used the icg guys for
[00:38:44.300 --> 00:39:04.440]   um i don't know if i've ever used the icg guys for
[00:39:04.440 --> 00:39:06.590]   um i don't know if i've ever used the icg guys for
[00:39:06.590 --> 00:39:08.580]   um i don't know if i've ever used the icg guys for
[00:39:08.580 --> 00:39:11.500]   um i don't know if i've ever used the icg guys for
[00:39:11.500 --> 00:39:22.690]   um i don't know if i've ever used the icg guys for
[00:39:22.690 --> 00:39:24.690]   um i don't know if i've ever used the icg guys for
[00:39:24.690 --> 00:39:27.470]   um i don't know if i've ever used the icg guys for
[00:39:27.470 --> 00:39:34.880]   um i don't know if i've ever used the icg guys for
[00:39:34.880 --> 00:39:37.410]   um i don't know if i've ever used the icg guys for
[00:39:37.410 --> 00:39:39.510]   um i don't know if i've ever used the icg guys for
[00:39:39.510 --> 00:39:41.710]   um i don't know if i've ever used the icg guys for
[00:39:41.710 --> 00:39:53.040]   um i don't know if i've ever used the icg guys for
[00:39:53.040 --> 00:39:55.310]   um i don't know if i've ever used the icg guys for
[00:39:55.310 --> 00:39:57.270]   um i don't know if i've ever used the icg guys for
...

Makememo · 2025-05-19T07:14:56Z

Yes, the whisper hallucination is very severe, especially when the large-v3 model transcribes audio to text.

danbev · 2025-05-19T08:02:41Z

> Yes, the whisper hallucination is very severe, especially when the large-v3 model transcribes audio to text.

Ah I see, I've not used large-v3 much before so I did not know what to expect.
Now, if I run this without VAD enabled I also see repeats so I'm thinking it might not be specifically related to VAD.

ggerganov · 2025-05-19T09:43:59Z

Yes, the repetitions are likely not related - it's something about the V3 model. Adding -mc 0 usually seems to reduce them. But the timestamp misalignment issue reported here can be investigated even with the base or small models - no need to run V3.

Btw, I think I also noticed some misalignment of the timestamps when VAD is enabled and the audio is long. I didn't specifically observe intersected segments, but I did observe significantly different time positions for the same phrase with VAD on/off. I can try to find a repro later if you can't reproduce it.
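A quick way to confirm "intersected" segments in a transcript is to scan the segment start/end times for overlaps. The following is only a sketch: the `segment_times` struct and `find_intersections` function are hypothetical names, not part of whisper.cpp, and the times are assumed to be integer centiseconds.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one transcript segment.
// t0/t1 are assumed to be start/end times in centiseconds.
struct segment_times {
    int64_t t0;
    int64_t t1;
};

// Returns the indices of segments that either end before they start
// or start before the previous segment has ended.
std::vector<size_t> find_intersections(const std::vector<segment_times> & segs) {
    std::vector<size_t> bad;
    for (size_t i = 0; i < segs.size(); ++i) {
        const bool inverted      = segs[i].t1 < segs[i].t0;
        const bool overlaps_prev = i > 0 && segs[i].t0 < segs[i - 1].t1;
        if (inverted || overlaps_prev) {
            bad.push_back(i);
        }
    }
    return bad;
}
```

Running a check like this on the output with VAD on and off should show whether the overlaps appear only when timestamps are mapped back from the VAD-filtered audio.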

danbev · 2025-05-19T15:15:32Z

I'm able to reproduce this now. I think I need to revisit the alignment/mapping of timestamps and use a different approach. Looking into this now.

* vad : revisit timestamp alignment/mapping

  This commit improves the timestamp alignment by introducing a mapping table, adding intermediate reference points for longer segments, and using binary search for lookups.

  The motivation for this change is to address issues with the current solution, where zero-length segments are possible, and also to improve the precision of the VAD timestamps.

  Refs: #3162

* vad : use uint64_t for time mapping

  This commit changes the type of the `processed_time` and `original_time` fields in the `vad_time_mapping` struct from `double` to `uint64_t`.

  The motivation for this change is to improve precision, avoid floating-point inaccuracies, and be consistent with other parts of the code base that use `uint64_t` for time representation.

  This is part of a refactoring where I'm also going to change the `vad_segment_info` struct to use `uint64_t` for the start and end times. This is the reason for the not so pleasant conversions and casts in the code at the moment.

* vad : change vad_segment_info and whisper_vad_segment to use uint64_t

* vad : use int64_t instead of uint64_t for timestamps

  To be consistent with other timestamps in the codebase.

* vad : add centisecond conversion functions

* vad : extract vad processing from whisper_full_with_state

  This commit extracts the VAD processing from the `whisper_full_with_state` function into the `whisper_full` and `whisper_full_parallel` functions.

  The motivation for this is that I did not take into account that when `whisper_full_parallel` is called with `n_processors > 1`, the VAD processing would not be applied correctly. Instead, the VAD processing should be done prior to processing in the case of `whisper_full_parallel`.

* vad : remove filtered_n_samples from whisper_vad

  This commit removes the parameter `filtered_n_samples` from the `whisper_vad` function signature and its usage, as it is no longer needed now that the filtered samples are a vector (previously a `float*`).

  The motivation for this is to simplify the usage of this function.

* vad : remove vad_mapping_table_initialized flag

* vad : fix leaning (none) of pointer/references
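To illustrate the mapping-table idea from the commit messages above, here is a minimal sketch of a lookup that maps a timestamp in the VAD-filtered audio back to the original audio using binary search and linear interpolation. It is not the code from the PR: the names `time_mapping` and `map_to_original` are hypothetical, the centisecond unit is an assumption, and the handling of boundaries is one possible choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mapping entry; both times are assumed to be in centiseconds.
struct time_mapping {
    int64_t processed_time; // position in the VAD-filtered (speech-only) audio
    int64_t original_time;  // matching position in the original audio
};

// Maps a filtered-audio timestamp `t` back to the original timeline.
// `table` must be non-empty and sorted by processed_time. Two entries may
// share a processed_time where removed silence makes the original time jump;
// at such a point the later entry (start of the next speech segment) wins.
int64_t map_to_original(const std::vector<time_mapping> & table, int64_t t) {
    assert(!table.empty());

    // Binary search: first entry whose processed_time is strictly greater than t.
    auto upper = std::upper_bound(
        table.begin(), table.end(), t,
        [](int64_t value, const time_mapping & m) { return value < m.processed_time; });

    // Clamp timestamps that fall outside the table.
    if (upper == table.begin()) return table.front().original_time;
    if (upper == table.end())   return table.back().original_time;

    const time_mapping & lo = *(upper - 1);
    const time_mapping & hi = *upper;

    // lo.processed_time <= t < hi.processed_time, so span is always positive.
    const int64_t span = hi.processed_time - lo.processed_time;
    return lo.original_time +
           (t - lo.processed_time) * (hi.original_time - lo.original_time) / span;
}
```

For example, with the table `{0,100}, {500,600}, {500,1000}, {800,1300}` (two speech segments separated by four seconds of removed silence), a filtered timestamp of 650 maps to 1150 in the original audio. Keeping everything in integers avoids the floating-point drift that the commit messages mention as a motivation.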

danbev self-assigned this May 19, 2025

danbev mentioned this issue May 20, 2025

vad : revisit timestamp alignment/mapping #3173

Merged
