For videos over two hours long, the output timeline will be confusingly intersected (Large-v3 + VAD) #3162
Comments
Thanks for the report! It would be great if you could send this to me (email is in my profile) and I'll take a closer look.
Done. Thanks for your excellent work.
@Makememo I'm currently running this, but I notice that I'm getting some repeating transcriptions earlier than the output you reported above. I'm just wondering if you see them too, or if perhaps the .wav got corrupted in some way for me:
[00:38:00.540 --> 00:38:16.570] so yeah absolutely yeah yeah so that is that is in my opinion what i would do the ultrasound
[00:38:16.570 --> 00:38:22.310] um i don't know have you ever have you ever used the icg guys for
[00:38:22.310 --> 00:38:25.290] to identify the thrombus this is interesting i've never used it
[00:38:34.230 --> 00:38:36.160] um i don't know if i've ever used the icg guys for
[00:38:36.160 --> 00:38:42.060] um i don't know if i've ever used the icg guys for
[00:38:42.060 --> 00:38:44.300] um i don't know if i've ever used the icg guys for
[00:38:44.300 --> 00:39:04.440] um i don't know if i've ever used the icg guys for
[00:39:04.440 --> 00:39:06.590] um i don't know if i've ever used the icg guys for
[00:39:06.590 --> 00:39:08.580] um i don't know if i've ever used the icg guys for
[00:39:08.580 --> 00:39:11.500] um i don't know if i've ever used the icg guys for
[00:39:11.500 --> 00:39:22.690] um i don't know if i've ever used the icg guys for
[00:39:22.690 --> 00:39:24.690] um i don't know if i've ever used the icg guys for
[00:39:24.690 --> 00:39:27.470] um i don't know if i've ever used the icg guys for
[00:39:27.470 --> 00:39:34.880] um i don't know if i've ever used the icg guys for
[00:39:34.880 --> 00:39:37.410] um i don't know if i've ever used the icg guys for
[00:39:37.410 --> 00:39:39.510] um i don't know if i've ever used the icg guys for
[00:39:39.510 --> 00:39:41.710] um i don't know if i've ever used the icg guys for
[00:39:41.710 --> 00:39:53.040] um i don't know if i've ever used the icg guys for
[00:39:53.040 --> 00:39:55.310] um i don't know if i've ever used the icg guys for
[00:39:55.310 --> 00:39:57.270] um i don't know if i've ever used the icg guys for
...
Yes, the whisper hallucination is very severe, especially when the large-v3 model transcribes audio to text.
Ah I see, I've not used
Yes, the repetitions are likely not related - it's something about the V3 model. Adding Btw, I think I also noticed some misalignment of the timestamps when VAD is enabled and the audio is long. I didn't specifically observe intersected segments, but I did observe significantly different time positions for the same phrase with VAD on/off. I can try to find a repro later if you can't reproduce it.
I'm able to reproduce this now. I think I need to revisit the alignment/mapping of timestamps and use a different approach. Looking into this now.
This commit improves the timestamp alignment by introducing a mapping table, adding intermediate reference points for longer segments, and using binary search for lookups. The motivation for this change is to address issues with the current solution, where zero-length segments are possible, and also to improve the precision of the VAD timestamps. Refs: ggml-org#3162
* vad : revisit timestamp alignment/mapping
  This commit improves the timestamp alignment by introducing a mapping table, adding intermediate reference points for longer segments, and using binary search for lookups. The motivation for this change is to address issues with the current solution, where zero-length segments are possible, and also to improve the precision of the VAD timestamps. Refs: #3162
* vad : use uint64_t for time mapping
  This commit changes the type of the `processed_time` and `original_time` fields in the `vad_time_mapping` struct from `double` to `uint64_t`. The motivation for this change is to improve precision, avoid floating-point inaccuracies, and be consistent with other parts of the code base that use `uint64_t` for time representation. This is part of a refactoring where I'm also going to change the vad_segment_info struct to use `uint64_t` for the start and end times. This is the reason for the not so pleasant conversions and casts in the code at the moment.
* vad : change vad_segment_info and whisper_vad_segment to use uint64_t
* vad : use int64_t instead of uint64_t for timestamps
  To be consistent with other timestamps in the codebase.
* vad : add centisecond conversion functions
* vad : extract vad processing from whisper_full_with_state
  This commit extracts the VAD processing from the `whisper_full_with_state` function into the `whisper_full` and `whisper_full_parallel` functions. The motivation for this is that I did not take into account that when `whisper_full_parallel` is called with `n_processors > 1`, the VAD processing would not be applied correctly. Instead, the VAD processing should be done prior to processing in the case of `whisper_full_parallel`.
* vad : remove filtered_n_samples from whisper_vad
  This commit removes the parameter `filtered_n_samples` from the `whisper_vad` function signature and its usage, as it is no longer needed now that the filtered samples are stored in a vector (previously a float*). The motivation for this is to simplify the usage of this function.
* vad : remove vad_mapping_table_initialized flag
* vad : fix leaning (none) of pointer/references
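For readers following along, the mapping-table idea described in the commit message can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the commit: the struct name `time_mapping`, the function `map_processed_to_original`, and the linear interpolation between reference points are all hypothetical. Only the field names `processed_time` and `original_time` and the use of `int64_t` timestamps come from the commit message, and the units are assumed to be centiseconds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical table entry: a time in the VAD-filtered (silence-removed) audio
// and the matching time in the original audio. Units assumed to be centiseconds.
struct time_mapping {
    int64_t processed_time;
    int64_t original_time;
};

// Map a timestamp on the filtered timeline back to the original timeline.
// Assumption: `table` is sorted by processed_time (ties are allowed where two
// speech segments touch on the filtered timeline).
int64_t map_processed_to_original(const std::vector<time_mapping> & table, int64_t t) {
    if (table.empty()) {
        return t; // no VAD mapping available, timelines are identical
    }
    assert(std::is_sorted(table.begin(), table.end(),
        [](const time_mapping & a, const time_mapping & b) {
            return a.processed_time < b.processed_time;
        }));

    // Clamp to the ends of the table.
    if (t <= table.front().processed_time) return table.front().original_time;
    if (t >= table.back().processed_time)  return table.back().original_time;

    // Binary search: first entry whose processed_time is strictly greater than t.
    auto upper = std::upper_bound(table.begin(), table.end(), t,
        [](int64_t value, const time_mapping & e) { return value < e.processed_time; });
    auto lower = upper - 1; // safe: t is strictly inside the table range

    // Interpolate between the two surrounding reference points.
    const int64_t dp = upper->processed_time - lower->processed_time; // > 0 here
    const int64_t dorig = upper->original_time - lower->original_time;
    return lower->original_time + (t - lower->processed_time) * dorig / dp;
}
```

With a table such as `{{0, 50}, {100, 150}, {100, 400}, {300, 600}}` (two speech segments separated by removed silence), a processed time of 200 falls in the second segment and maps to 500 on the original timeline. Adding intermediate reference points inside long segments, as the commit message mentions, would simply mean more entries in the same table.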
Using code from the master branch with the Large-v3 model and VAD enabled, I found that the timestamps get messed up toward the end of the output.
./build/bin/whisper-cli -vm /Users/xxx/Developer/whisper.cpp/models/ggml-silero-v5.1.2.bin --vad -f samples/hd.wav -m models/ggml-large-v3.bin -osrt
If the original audio is needed, I can send it to your email.