tests : add WER benchmarks #2454
Hi Grigory, perhaps we can use LibriSpeech for measuring long audio (approx. 1000 hours, but we could trim it to fit the requirements). For short audio, we can use Libri-Light. Alternatively, there are other audio datasets for measuring WER: https://github.com/jim-schwoebel/voice_datasets. I could start making small sample scripts to see how whisper.cpp fares among these datasets |
Thanks. Yes, I'm not sure what is typically used. But in general, I think any dataset would work. The main goal here is not to compare with other implementations, but to catch regressions over time. Ideally, we can have scripts that perform heavier benchmarks that developers would use locally. But we also need a mode where the scripts run just a few fast benchmarks that can be added to the CI without overloading it, so that these would be computed on every commit. |
@harvestingmoon are you working on this? |
@foldl hi, yes I'm looking at it; more or less likely to start after the 12th as it's currently the Chinese New Year period... |
I think we need a tiny dataset (~10MB) just contained in this repo. WER can then be measured on-the-fly. |
Sorry, please ignore the WER calculation above. I will develop another script since the calculations are completely off from what they should be. I will also look for a smaller, lightweight dataset so that audio can be measured on the fly |
I have created a better and more robust lightweight script that meets the requirements @foldl, @ggerganov. WER is measured at 0.3. It uses a small, lightweight dataset. My script calculates the WER for each individual audio file as well as the overall average; here is the pull request: #2824. Link for reference: https://huggingface.co/learn/audio-course/en/chapter5/evaluation |
The pull request contains the script as well as the full ~10MB dataset, making it fairly lightweight for measuring on the fly as well |
Hi @harvestingmoon, thank you for the effort, but I expect more attention to detail. Will close the PR for now and let someone else give this a try. |
Shouldn't the very first step be to add minimalist edit-distance source code (header-only?), used to compute WER/TER, so that it can be measured? |
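For reference, computing WER needs nothing more than a word-level Levenshtein (edit) distance. Below is a minimal Python sketch of that idea; it is an illustrative helper, not code from any of the PRs in this thread:

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word sequences
    (substitutions, insertions and deletions all cost 1)."""
    m, n = len(ref_words), len(hyp_words)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[n]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


if __name__ == "__main__":
    # one substitution (fox -> box) + one insertion (jumps) over 4 reference words
    print(wer("the quick brown fox", "the quick brown box jumps"))  # 0.5
```

WER is then just the edit distance divided by the number of reference words, so 2 word errors on a 4-word reference gives 0.5.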
@ggerganov I'm not sure there is a reasonably sized dataset containing short audio, long audio, English, and non-English content. What do you think about an approach where the CI downloads a handful of representative audio files and computes the WER on those? This would be lightweight for each commit. We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration. |
Yes, sounds good. The CI should download a few audio files covering those cases.
Yes, a much bigger dataset for local testing would be useful. |
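As a rough illustration of the CI idea above, here is a hedged Python sketch: transcribe a few sample files with the whisper-cli binary and fail the job if the average WER exceeds a threshold. The sample file names, reference transcripts, model path, threshold, and the helper import are all placeholders/assumptions, not existing project files:

```python
# Sketch of a lightweight per-commit WER check. Everything here is a
# placeholder and would have to be chosen for the actual CI job.
import subprocess
import sys

from wer_sketch import wer  # word-level WER helper as sketched above (hypothetical module name)

SAMPLES = [
    # (audio file, reference transcript) placeholder pairs covering
    # short/long and English/non-English audio
    ("samples/short_en.wav", "samples/short_en.txt"),
    ("samples/long_en.wav",  "samples/long_en.txt"),
    ("samples/short_fr.wav", "samples/short_fr.txt"),
]
WER_THRESHOLD = 0.15  # arbitrary example threshold


def transcribe(audio_path: str) -> str:
    # Assumes a whisper-cli binary built under build/bin and a tiny model;
    # --no-timestamps keeps the output as plain text.
    result = subprocess.run(
        ["./build/bin/whisper-cli", "-m", "models/ggml-tiny.bin",
         "--no-timestamps", "-f", audio_path],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()


def main() -> int:
    scores = []
    for audio, ref_path in SAMPLES:
        with open(ref_path) as f:
            reference = f.read()
        scores.append(wer(reference.lower(), transcribe(audio).lower()))
    avg = sum(scores) / len(scores)
    print(f"average WER: {avg:.2%}")
    return 0 if avg <= WER_THRESHOLD else 1


if __name__ == "__main__":
    sys.exit(main())
```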
@ggerganov Hi. I was working on this ticket for a while, and spent the last several days running WER benchmarks on LibriSpeech. Now, here is the summary of my measurement results:
Comparison with OpenAI whisper

To illustrate the result shown above, the following table compares whisper.cpp's WER with that of the original OpenAI Whisper implementation.

To put it very short, the performance was pretty much comparable!
How I performed the benchmark test

I submitted the code I wrote for the benchmark test in PR #2999. The testing process is fairly automated (using the power of Makefile).

Please tell me if anything is unclear! I hope it's interesting for you. |
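For readers unfamiliar with how such scores are usually aggregated, a common approach is corpus-level WER: total word errors divided by total reference words across all utterances, after a simple text normalization. The sketch below is only illustrative and is not the code from PR #2999 (which may normalize and aggregate differently); edit_distance is the helper sketched earlier:

```python
import re

from wer_sketch import edit_distance  # word-level edit distance sketched earlier (hypothetical module name)


def normalize(text: str) -> list[str]:
    """Crude normalizer: lowercase and keep only letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def corpus_wer(pairs) -> float:
    """pairs: iterable of (reference, hypothesis) transcript strings.
    Returns total word errors / total reference words."""
    errors, words = 0, 0
    for ref, hyp in pairs:
        r, h = normalize(ref), normalize(hyp)
        errors += edit_distance(r, h)
        words += len(r)
    return errors / max(words, 1)
```

Corpus-level WER weights long utterances more heavily than a plain average of per-file WERs, which is worth keeping in mind when comparing numbers across reports.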
@fujimotos Thank you, this is very interesting! Will be taking a look in the next few days. |
@ggerganov Thank you!

Technical Note: how long it took to perform the full benchmark

This time, I rented an EC2 c8g.xlarge instance from AWS to perform the measurement. It took roughly 80 hours to benchmark all eight model sizes.
Observation: Tradeoff between speed and accuracy

Looking from a different angle, I think this confirms the existence of a tradeoff between speed and accuracy across the model sizes. The following graph should illustrate the relationship:
(graph: speed vs. accuracy across model sizes) |
It would be interesting to perform these benchmarks with Flash Attention enabled and with quantized models. |
Here are some results on M2 Ultra with Flash Attention enabled:
Though the timings might be a bit off because I was using the computer while the computations were running. But overall, there is no degradation of the quality when going to Q8 models, which is expected, but good to confirm. |
The WER tests in #2999 are very useful, but all samples from the LibriSpeech dataset are relatively short and don't include non-speech segments. We should add some additional tests with longer audio samples, preferably with silence intervals, which is what usually trips up Whisper Large v3. When we add the VAD support (#3065) we will be able to measure quantitatively how much it improves the quality in such cases. |
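One way to quantify this, once VAD support lands, would be a small with/without comparison on the same long recording. The sketch below assumes a whisper-cli build and a Silero VAD model file; the --vad/--vad-model flags and the model filename are assumptions based on the VAD work referenced above (#3065), not a confirmed CLI surface:

```python
# Sketch: compare WER on one long recording with VAD on and off.
import subprocess

from wer_sketch import wer  # word-level WER helper sketched earlier (hypothetical module name)


def transcribe(audio: str, use_vad: bool) -> str:
    cmd = ["./build/bin/whisper-cli", "-m", "models/ggml-large-v3.bin",
           "--no-timestamps", "-f", audio]
    if use_vad:
        # assumed flags/model name, to be adjusted to the actual VAD CLI
        cmd += ["--vad", "--vad-model", "models/ggml-silero-v5.1.2.bin"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def compare(audio: str, reference_path: str) -> None:
    with open(reference_path) as f:
        reference = f.read().lower()
    for use_vad in (False, True):
        hyp = transcribe(audio, use_vad).lower()
        print(f"VAD={use_vad}: WER = {wer(reference, hyp):.2%}")
```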
I guess this dataset has what you need; I'm using it for long-form evaluation in faster-whisper |
Actually, I know a couple of public benchmark datasets that can be used for long-form evaluation.
If you don't mind, I think I'm able to post another PR next week that adds such a long-form benchmark. |
@ggerganov @danbev I have just created a pull request #3185 that adds a long-form benchmark.

Benchmark dataset

This time I used the Earnings-21 dataset.
Here is an audio example: earnings21_Kuehne_Nagel_90s.mp4

I think there are two benefits in using Earnings-21:
Benchmark Result

I ran the benchmark test using two models. The following table summarizes the benchmark result:
Some notes on this table:
Some Analysis and Insights

So, wondering why VAD did not necessarily improve the recognition accuracy, I looked into the results in more detail. First, the following graph shows the detailed WER score for each audio recording:

As you can see, the effect of enabling VAD is hit-and-miss. It improves the performance on some recordings but degrades it on others.

Looking at the transcription produced by whisper.cpp, it seems that the VAD support mostly changes the behavior around non-speech segments where the model tends to hallucinate. So its effectiveness in improving recognition accuracy was limited (I attached some hallucination examples in the appendix below).

Appendix: Hallucination Examples

4359732.mp3 (Kuehne Nagel International)

Whisper.cpp (tiny)
With VAD enabled
4320211.mp3 (Monroe Inc)

Whisper.cpp (tiny)
With VAD enabled
|
I'd be able to do a more detailed analysis if I had access to more computing power (my current resources are limited). @ggerganov @danbev If anything is not clear, please just ask me! Hope it's interesting to you. |
@fujimotos Thank you - very interesting and useful! Will be taking a look in the next few days - curious to understand why the VAD can lead to degraded quality. Somehow my expectation is that it should either improve or keep the quality the same in all cases. Maybe some parameters need adjustment. Also, it would be interesting to run VAD/no-VAD with the Large V3 and see how it compares there. |
Very interesting indeed! I saw something similar this week when using the Large V3 model, where without VAD it would hallucinate, but when VAD was enabled it did not and seemed to produce valid output. I've tried running this recording through Large V3:

[00:25:12.680 --> 00:25:29.990] At the midpoint of our guidance range, we expect an operating margin of approximately 10.2% interest expense to be approximately $29 million depreciation and amortization to be approximately $65 million in EBITDA to be approximately $196 million.
[00:25:29.990 --> 00:25:43.880] We expect capital expenditures to be approximately $60 million this year. This guidance reflects an effective tax rate of approximately 23.5% and is based on 34 million diluted weighted average shares outstanding.
[00:25:43.880 --> 00:25:49.160] As always, our guidance does not assume any future acquisitions or greenfield store opening.
[00:25:49.160 --> 00:25:54.690] I'll now turn the call over to brought, provide some closing remarks before we move to Q&A.
[00:25:54.690 --> 00:26:03.580] Thanks, Brian. We are making solid strides in the execution of our Monroe forward strategy, in particular, our store rebrand and reimage initiative.

And I also tried with the tiny model:

[00:25:13.160 --> 00:25:19.050] At the midpoint of our guidance range, we expect an operating margin of approximately $10.2%
[00:25:19.050 --> 00:25:25.480] interest expense to be approximately $29 million, depreciation and amortization to be approximately
[00:25:25.480 --> 00:25:33.720] $65 million and EBITDA to be approximately $196 million. We expect capital expenditures to be approximately
[00:25:33.720 --> 00:25:40.180] $60 million this year. This guidance reflects an effective tax rate of approximately 23.5%
[00:25:40.180 --> 00:25:45.890] and is based on $34 million diluted weighted average shares outstanding. As always, our guidance
[00:25:45.890 --> 00:25:51.440] does not assume any future acquisitions or greenfield store opening. I'll now turn the call over
[00:25:51.440 --> 00:25:57.560] to brought some closing remarks before we move to Q&A. Thanks Brian. We are making solid
[00:25:57.560 --> 00:26:03.960] strides in the execution of our Monroe Forward Strategy in particular our store rebrand and remaged initiative.
One thing to note is that I'm using the version of whisper.cpp from #3173, which I've been working on this week. The changes were mostly related to how VAD timestamps are aligned to the original audio timestamps, but I also changed from using floats/doubles to int64_t for the timestamps, and perhaps this has an impact on the audio samples that are passed to whisper_full. I need to look into this a bit further, but it would be interesting to run the benchmarks using #3173 to see if this has an impact (and also to check that I'm not missing something, it being the end of the week).

I've run the benchmarks on my mac (macOS 15.4.1 on Mac15,3 with Apple M3 and 24GB RAM) using the tiny model with #3173 applied, with the following result with VAD enabled:

(venv) $ cat tiny.txt
WER: 16.78%

And without VAD:

(venv) $ cat tiny.txt
WER: 18.70%

I'm not seeing the repeats in the output:

At the midpoint of our guidance range, we expect an operating
margin of approximately 10.2% interest expense to be approximately $29 million dollars depreciation
and amortization to be approximately $65 million dollars and EBITDA to be approximately $196 million.
We expect capital expenditures to be approximately $60 million this year.
This guidance reflects an effective tax rate of approximately 23.5% and is based on 34 million diluted
weighted average shares outstanding.

I'll try using the base model as well. With VAD enabled:

(venv) $ cat base.txt
WER: 13.40%

And without VAD:

$ cat base.txt
WER: 12.57% |
Thanks @danbev |
It would be nice to start measuring the word error rate (WER) of whisper.cpp across some representative dataset. This will help us catch regressions in the future. I'm not familiar with what is typically used for speech-to-text WER benchmarks, so looking for help from the community.