
tests : add WER benchmarks #2454

Open

ggerganov opened this issue Oct 5, 2024 · 25 comments
Labels: help wanted (Extra attention is needed), high priority (Very important issue), research🔬, roadmap (Part of a roadmap project)

Comments

@ggerganov
Member

It would be nice to start measuring the word error rate (WER) of whisper.cpp across some representative dataset:

  • short audio
  • long audio
  • english
  • non-english
  • etc.

This will help us catch regressions in the future. I'm not familiar with what is typically used for ASR WER benchmarks, so I'm looking for help from the community.

@ggerganov ggerganov added help wanted Extra attention is needed research🔬 labels Oct 5, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Oct 5, 2024
@ggerganov ggerganov changed the title whisper : add WER tests tests : add WER benchmarks Feb 4, 2025
@ggerganov ggerganov added roadmap Part of a roadmap project high priority Very important issue labels Feb 4, 2025
@harvestingmoon

harvestingmoon commented Feb 5, 2025

Hi Georgi, perhaps we can use LibriSpeech for measuring long audio (approx. ~1,000 hours, but we could trim it to fit the requirements). For short audio, we can use Libri-Light.

Alternatively, there are other audio datasets for measuring WER: https://github.com/jim-schwoebel/voice_datasets

I could start making small sample scripts to see how whisper.cpp fares on these datasets.

@ggerganov
Member Author

Thanks. Yes, I'm not sure what is typically used. But in general, I think any dataset would work. The main goal here is not to compare whisper.cpp numbers with other numbers, but to create a reference set of WER numbers that we track as the development continues. This would allow us to catch regressions when they appear, because the WER scores would get worse in such cases.

Ideally, we can have scripts that perform heavier benchmarks that developers would use locally. But we also need a mode where the scripts run just a few fast benchmarks that can be added to the CI without overloading it, so that these would be computed on every commit.
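For illustration, here is a minimal sketch of what such a regression check could look like in CI, comparing freshly measured WER scores against a committed set of reference numbers (the file layout, benchmark names, and tolerance below are made up for this example; no such script exists in the repo yet):

```python
# wer_regression_check.py -- hypothetical sketch, not an existing whisper.cpp script.
# Compares freshly measured WER scores against committed reference values and
# exits non-zero if any benchmark regressed beyond a small tolerance.
import json
import sys

TOLERANCE = 0.005  # absolute WER increase allowed before we call it a regression


def main(reference_path: str, current_path: str) -> int:
    with open(reference_path) as f:
        reference = json.load(f)  # e.g. {"librispeech-clean/tiny": 0.069, ...}
    with open(current_path) as f:
        current = json.load(f)

    failed = False
    for name, ref_wer in reference.items():
        cur_wer = current.get(name)
        if cur_wer is None:
            print(f"[warn] benchmark '{name}' missing from current run")
            continue
        delta = cur_wer - ref_wer
        status = "OK" if delta <= TOLERANCE else "REGRESSION"
        failed = failed or status == "REGRESSION"
        print(f"{name:35s} ref={ref_wer:.4f} cur={cur_wer:.4f} delta={delta:+.4f} {status}")
    return 1 if failed else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```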

@foldl
Collaborator

foldl commented Feb 7, 2025

@harvestingmoon are you working on this?

@harvestingmoon

@foldl Hi, yes, I'm looking at it. Most likely I'll start after the 12th, as it's currently the Chinese New Year period...

@foldl
Collaborator

foldl commented Feb 17, 2025

I think we need a tiny dataset (~10 MB) included directly in this repo. WER can then be measured on the fly.

@harvestingmoon

harvestingmoon commented Feb 17, 2025

Sorry, please ignore the WER calculation above; I will develop another script, since the calculations are completely off from what they should be. I will also look for a smaller, lightweight dataset so that audio can be measured on the fly.

@harvestingmoon

harvestingmoon commented Feb 17, 2025

I have created a better and more robust lightweight script that meets the requirements, @foldl @ggerganov.

WER is measured at 0.3.

It uses this lightweight dataset: https://arxiv.org/abs/2104.01497 and is based on NVIDIA's tutorial for calculating WER:
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-evaluate.html


My script calculates the WER for each individual audio file as well as the overall average; here is the pull request: #2824
For context, WER is measured between 0 and 1. A WER of around 0.33 means transcription accuracy is about 67%. The current measurement corresponds to roughly 70% accuracy, which is fairly good for a lightweight model.

Link for reference: https://huggingface.co/learn/audio-course/en/chapter5/evaluation
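For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A tiny worked example, here using the jiwer package as one common choice of implementation (just an assumption; it is not necessarily what the script or the NVIDIA tutorial uses):

```python
# Worked WER example; jiwer is one commonly used Python package for this.
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"       # 1 substitution + 1 deletion, 6 reference words

wer = jiwer.wer(reference, hypothesis)  # (S + D + I) / N = 2 / 6
print(f"WER = {wer:.2f}, accuracy ~ {1 - wer:.0%}")  # WER = 0.33, accuracy ~ 67%
```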

@harvestingmoon

The pull request contains the script as well as the full ~10 MB dataset, making it fairly lightweight for measuring on the fly as well.

@ggerganov
Member Author

Hi @harvestingmoon, thank you for the effort, but I expect more attention to detail. Will close the PR for now and let someone else give this a try.

@WilliamTambellini
Contributor

Shouldn't the very first step be to add a minimalist edit-distance source (header-only?) used to compute WER/TER?
e.g.
https://github.com/flashlight/flashlight/blob/f59d770b52ea678b039d9ba44693341ba80cf7c5/flashlight/fl/meter/EditDistanceMeter.h
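A rough Python analogue of such an edit-distance meter, purely for illustration (this is neither the flashlight code nor anything currently in whisper.cpp): it accumulates word-level substitutions, insertions, and deletions across utterances and reports the corpus-level error rate.

```python
from dataclasses import dataclass


@dataclass
class EditDistanceMeter:
    """Accumulates word-level substitutions, insertions and deletions over a dataset."""
    substitutions: int = 0
    insertions: int = 0
    deletions: int = 0
    ref_words: int = 0

    def add(self, reference: str, hypothesis: str) -> None:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = (S, I, D) counts for aligning ref[:i] with hyp[:j]
        dp = [[(0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(1, len(ref) + 1):
            dp[i][0] = (0, 0, i)                 # all reference words deleted
        for j in range(1, len(hyp) + 1):
            dp[0][j] = (0, j, 0)                 # all hypothesis words inserted
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                if ref[i - 1] == hyp[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]  # exact match, no edit
                    continue
                sub, ins, dele = dp[i - 1][j - 1], dp[i][j - 1], dp[i - 1][j]
                # take the predecessor with the fewest total edits
                if sum(sub) <= sum(ins) and sum(sub) <= sum(dele):
                    dp[i][j] = (sub[0] + 1, sub[1], sub[2])      # substitution
                elif sum(ins) <= sum(dele):
                    dp[i][j] = (ins[0], ins[1] + 1, ins[2])      # insertion
                else:
                    dp[i][j] = (dele[0], dele[1], dele[2] + 1)   # deletion
        s, i_, d = dp[len(ref)][len(hyp)]
        self.substitutions += s
        self.insertions += i_
        self.deletions += d
        self.ref_words += len(ref)

    def error_rate(self) -> float:
        """Corpus-level WER = (S + I + D) / total reference words."""
        return (self.substitutions + self.insertions + self.deletions) / max(self.ref_words, 1)


meter = EditDistanceMeter()
meter.add("the cat sat on the mat", "the cat sit on mat")
print(meter.error_rate())  # 2 edits over 6 reference words -> ~0.33
```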

@redraskal
Collaborator

@ggerganov I'm not sure there is a reasonably sized dataset containing short audio, long audio, English, and non-English content.

What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.

This would be lightweight for each commit.

We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.

@ggerganov
Member Author

> What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.

Yes, sounds good. The CI should download audio files with wget or curl and run WER tests on them. We can combine different sources at the start. Later on, we can use the more powerful nodes, such as the CUDA and M1 ones, to run larger datasets.

> We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.

Yes, a much bigger dataset for local testing would be useful.
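For what it's worth, a rough sketch of what such a CI smoke test could look like (the binary path, the assumption that --output-txt writes "<audio>.txt", the sample URLs, and the use of jiwer are all placeholders for illustration, not an existing script):

```python
# ci_wer_smoke_test.py -- hypothetical sketch of a fast per-commit WER check.
import subprocess
import urllib.request
from pathlib import Path

from jiwer import wer  # or any other WER implementation

WHISPER_CLI = "./build/bin/whisper-cli"
MODEL = "models/ggml-tiny.bin"

# (category, audio URL, reference transcript URL) -- placeholder URLs
SAMPLES = [
    ("short/english", "https://example.com/short_en.wav", "https://example.com/short_en.txt"),
    ("long/english",  "https://example.com/long_en.wav",  "https://example.com/long_en.txt"),
]


def fetch(url: str) -> Path:
    dest = Path(url.rsplit("/", 1)[-1])
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


results = []
for category, audio_url, ref_url in SAMPLES:
    audio = fetch(audio_url)
    reference = fetch(ref_url).read_text().strip()
    # transcribe; with --output-txt the hypothesis is assumed to land in "<audio>.txt"
    subprocess.run([WHISPER_CLI, "-m", MODEL, "-f", str(audio),
                    "--no-prints", "--language", "en", "--output-txt"], check=True)
    hypothesis = Path(f"{audio}.txt").read_text().strip()
    results.append((category, wer(reference.lower(), hypothesis.lower())))

print(f"{'category':<16} WER")
for category, score in results:
    print(f"{category:<16} {score:.2%}")
```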

@fujimotos
Contributor

fujimotos commented Apr 3, 2025

@ggerganov Hi. I have been working on this ticket for a while, and spent the last few days benchmarking whisper.cpp on the LibriSpeech corpus.

Now, here is the summary of my measurement results:

  • The following graph shows the recognition accuracy (measured in Word Error Rate) on the LibriSpeech test-clean dataset.

[graph: WER per model on the LibriSpeech test-clean dataset]

Comparison with OpenAI whisper

To illustrate the result shown above, the following table compares whisper.cpp's performance with OpenAI's official WER scores.

In short, the performance is pretty much comparable!

| Model | WER [whisper.cpp] | WER [openai-whisper] * |
|---|---|---|
| tiny | 6.90 | 6.7 |
| base | 4.81 | 4.9 |
| small | 3.37 | 3.3 |
| medium | 2.70 | 2.7 |
| large-v1 | 2.67 | 2.8 |
| large-v2 | 2.58 | 2.5 |
| large-v3 | 1.85 | Not published |
| large-v3-turbo | 1.92 | Not published |

How I performed the benchmark test

I submitted the code I wrote for the benchmark test in PR #2999. The code should be basically the same as how OpenAI evaluates their models.

The testing process is fairly automated (using the power of Makefile), and I also attached some documentation on how to use it.

Please tell me if anything is unclear! I hope it's interesting for you.

@ggerganov
Member Author

@fujimotos Thank you, this is very interesting! Will be taking a look in the next few days.

@fujimotos
Contributor

@ggerganov Thank you!


Technical Note: how long it took to perform the full benchmark

This time, I rented an EC2 c8g.xlarge instance from AWS to perform the benchmark test.

It took roughly 80 hours to benchmark all eight model sizes. Here is the breakdown of the running time:

| MODEL | WER | TIME [REAL] | Real Time Factor |
|---|---|---|---|
| tiny | 6.90 | 28m | 0.08 |
| base | 4.81 | 56m | 0.17 |
| small | 3.37 | 3h2m | 0.56 |
| medium | 2.70 | 9h20m | 1.72 |
| large-v1 | 2.67 | 17h52m | 3.30 |
| large-v2 | 2.58 | 17h55m | 3.31 |
| large-v3 | 1.85 | 17h46m | 3.29 |
| large-v3-turbo | 1.92 | 14h28m | 2.67 |

Observation: trade-off between speed and accuracy

Looking at it from a different angle, I think this confirms the existence of a trade-off between speed and accuracy in whisper.cpp models.

The following graph should illustrate the relationship:

  • The X-axis ("Real time factor") is computed as (inference time) / (audio length), so lower is better (a quick recomputation is sketched below).
  • Note that LibriSpeech test-clean contains 5 hours 24 minutes of speech.
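A quick recomputation of those real-time factors, assuming nothing beyond the numbers in the table above (the reported times are rounded to the minute, so the last digit can differ slightly from the table):

```python
# Real-time factor: RTF = (inference time) / (audio length); lower is better.
AUDIO_MIN = 5 * 60 + 24  # LibriSpeech test-clean: 5h24m of speech

times_min = {
    "tiny": 28, "base": 56, "small": 3 * 60 + 2, "medium": 9 * 60 + 20,
    "large-v1": 17 * 60 + 52, "large-v2": 17 * 60 + 55,
    "large-v3": 17 * 60 + 46, "large-v3-turbo": 14 * 60 + 28,
}

for model, t in times_min.items():
    print(f"{model:<15} RTF ~ {t / AUDIO_MIN:.2f}")  # e.g. base: 56 / 324 ~ 0.17
```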

@ggerganov
Member Author

It would be interesting to perform these benchmarks with Q8_0 quantized models and see how the WER changes. But I think it would be better to run this on a GPU in order to reduce the processing time. Will see how this performs on my M2 Ultra - I think it would be much faster than the AWS instance.

@ggerganov ggerganov moved this from Todo to In Progress in whisper.cpp : roadmap Apr 4, 2025
@ggerganov
Member Author

Here are some results on M2 Ultra with Flash Attention enabled:

| MODEL | WER | TIME [REAL] |
|---|---|---|
| base | 4.90 | 13m28s |
| base-q8_0 | 4.89 | 12m32s |
| small | 3.39 | 24m4s |
| small-q8_0 | 3.36 | 20m33s |

The timings might be a bit off because I was using the computer while the computations were running. But overall, there is no degradation in quality when going to the Q8 models, which is expected, but good to confirm.

@ggerganov
Member Author

The WER tests in #2999 are very useful, but all samples from the LibriSpeech dataset are relatively short and don't include non-speech segments. We should add some additional tests with longer audio samples, preferably with silence intervals, which is what usually trips up Whisper Large v3. When we add the VAD support (#3065), we will be able to measure quantitatively how much it improves the quality in such cases.

@MahmoudAshraf97

I guess this dataset has what you need; I'm using it for long-form evaluation in faster-whisper:
SYSTRAN/faster-whisper#1101

@fujimotos
Contributor

Actually, I know a couple of public benchmark datasets that can be used for this purpose.

> When we add the VAD support (#3065) we will be able to measure quantitatively how much it improves the quality in such cases.

If you don't mind, I think I can post another PR next week that adds a long-form WER benchmark test.

@fujimotos
Contributor

@ggerganov @danbev I have just created a pull request, #3185, that adds a long-form transcription benchmark test.

Benchmark dataset

This time I used the Earnings-21 dataset by Del Rio et al. (2021), which provides 49 hours of English speech sourced from corporate earnings calls.

Earnings-21: A Practical Benchmark for ASR in the Wild (2021)
https://arxiv.org/abs/2104.11348

Here is an audio example:

earnings21_Kuehne_Nagel_90s.mp4

I think there are two benefits to using Earnings-21:

  1. It makes the benchmark result comparable. OpenAI used this dataset in their
    paper, so we can compare our WER score against OpenAI's official number.

  2. It is easy to access. The full dataset is distributed as a Git repo, and
    the total file size is relatively small (just 49 files in mp3 format).

Benchmark Result

I ran the benchmark test using two models: tiny and base. I also tested the
VAD support (introduced by #3065) to see if it improves the general accuracy.

The following table summarizes the benchmark result:

| Speech Recognition | WER (Tiny) | WER (Base) |
|---|---|---|
| Whisper.cpp | 17.37% | 12.53% |
| Whisper.cpp (w. VAD) | 18.91% | 15.70% |
| OpenAI Whisper | 18.7% | 13.5% |
| OpenAI Whisper (.en model) | 17.0% | 12.5% |

Some notes on this table:

  • The version of whisper.cpp I used was 2c4b904, and I enabled the VAD support by adding the following inference parameters:

    WHISPER_FLAGS = --no-prints --language en --output-txt --vad --vad-model ../../models/silero-v5.1.2-ggml.bin
  • OpenAI's scores are retrieved from their original paper (see Appendix D.4).

Some Analysis and Insights

Wondering why VAD did not necessarily improve the recognition accuracy, I looked a bit deeper at the benchmark results.

First, the following graph shows the detailed WER score for each audio recording:

[graph: per-recording WER with and without VAD]

As you can see, the effect of enabling VAD is hit-and-miss. It improves the performance on some audio files, but degrades the accuracy on others.

Looking at the transcriptions produced by whisper.cpp, it seems that the VAD support does prevent hallucinations in some cases, but introduces new hallucinations in others.

So its effectiveness at improving recognition accuracy was limited (I attached some hallucination examples below).

Appendix: Hallucination Examples

4359732.mp3 (Kuehne Nagel International)

Whisper.cpp (tiny)

we have a second we have a second really state portfolio is I at for sale I do not expect material impact on the PNL.
That is going to be the last of 10 million.
Thank you very much.
Thank you.
Thank you very much.
I want to thank you very much.
Thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
...

With VAD enabled

We have a second. We have a second real estate portfolio is for sale,
I do not expect material impact on the P&L other state. So it's going to be less than 10 million.
Thank you very much. Thank you.
Thank you for coming from Manivakaya, and from Bank of America, please go ahead.
  • VAD prevents the hallucination successfully.

4320211.mp3 (Monroe Inc)

Whisper.cpp (tiny)

At the mid-point of our guidance range, we expect an operating margin of approximately $10.2% interest expense to be approximately $29 million,
depreciation and amortization to be approximately $65 million, and even to be approximately $196 million.
We expect capital expenditures to be approximately $60 million this year.
This guidance reflects an effective tax rate of approximately 23.5% and is based on $34 million diluted weighted average shares outstanding.

With VAD enabled

The appreciation and amortization to be approximately $65 million, and even to be approximately $196 million. We expect capital expenditures to be approximately $60 million this year.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
  • VAD introduces hallucination where it did not occur originally.

@fujimotos
Contributor

fujimotos commented May 23, 2025

I'd be able to do a more detailed analysis if I had access to more computing power (currently I run all the benchmark tests on my personal AWS account), but anyway, this is the current state of the analysis so far.

@ggerganov @danbev If anything is not clear, please just ask me! Hope it's interesting to you.

@ggerganov
Member Author

ggerganov commented May 23, 2025

@fujimotos Thank you - very interesting and useful! Will be taking a look in the next few days - curious to understand why the VAD can lead to degraded quality. Somehow my expectation is that it should either improve the quality or keep it the same in all cases. Maybe some parameters need adjustment. Also, it would be interesting to run VAD/no-VAD with Large V3 and see how it compares there.

@danbev
Collaborator

danbev commented May 23, 2025

Very interesting indeed! I saw something similar this week when using the Large V3 model, where without VAD it would hallucinate, but when VAD was enabled it did not and seemed to produce valid output.

I've tried running 4320211.mp3 (Monroe Inc) with Large V3, and it does not show these hallucinations:

[00:25:12.680 --> 00:25:29.990]   At the midpoint of our guidance range, we expect an operating margin of approximately 10.2% interest expense to be approximately $29 million depreciation and amortization to be approximately $65 million in EBITDA to be approximately $196 million.
[00:25:29.990 --> 00:25:43.880]   We expect capital expenditures to be approximately $60 million this year. This guidance reflects an effective tax rate of approximately 23.5% and is based on 34 million diluted weighted average shares outstanding.
[00:25:43.880 --> 00:25:49.160]   As always, our guidance does not assume any future acquisitions or greenfield store opening.
[00:25:49.160 --> 00:25:54.690]   I'll now turn the call over to brought, provide some closing remarks before we move to Q&A.
[00:25:54.690 --> 00:26:03.580]   Thanks, Brian. We are making solid strides in the execution of our Monroe forward strategy, in particular, our store rebrand and reimage initiative.

And I also tried with the tiny model:

[00:25:13.160 --> 00:25:19.050]   At the midpoint of our guidance range, we expect an operating margin of approximately $10.2%
[00:25:19.050 --> 00:25:25.480]   interest expense to be approximately $29 million, depreciation and amortization to be approximately
[00:25:25.480 --> 00:25:33.720]   $65 million and EBITDA to be approximately $196 million. We expect capital expenditures to be approximately
[00:25:33.720 --> 00:25:40.180]   $60 million this year. This guidance reflects an effective tax rate of approximately 23.5%
[00:25:40.180 --> 00:25:45.890]   and is based on $34 million diluted weighted average shares outstanding. As always, our guidance
[00:25:45.890 --> 00:25:51.440]   does not assume any future acquisitions or greenfield store opening. I'll now turn the call over
[00:25:51.440 --> 00:25:57.560]   to brought some closing remarks before we move to Q&A. Thanks Brian. We are making solid
[00:25:57.560 --> 00:26:03.960]   strides in the execution of our Monroe Forward Strategy in particular our store rebrand and remaged initiative.

One thing to note is that I'm using the version of whisper.cpp from #3173, which I've been working on this week. The changes were mostly related to how VAD timestamps are aligned to the original audio timestamps, but I also changed from using floats/doubles to using int64_t for the timestamps, and perhaps this has an impact on the audio samples that are passed to whisper_full. I need to look into this a bit further, but it would be interesting to run the benchmarks using #3173 to see if this has an impact (and also to check that I'm not missing something, it being the end of the week).

I've run the benchmarks on my Mac (macOS 15.4.1 on Mac15,3 with an Apple M3 and 24 GB RAM) using the tiny model, with #3173 applied, and got the following result with VAD enabled:

(venv) $ cat tiny.txt
WER: 16.78%

And without VAD:

(venv) $ cat tiny.txt
WER: 18.70%

I'm not seeing the repeats in speech-datasets/earnings21/media/4320211.mp3.txt:

At the midpoint of our guidance range, we expect an operating
margin of approximately 10.2% interest expense to be approximately $29 million dollars depreciation
and amortization to be approximately $65 million dollars and EBITDA to be approximately $196 million.
We expect capital expenditures to be approximately $60 million this year.
This guidance reflects an effective tax rate of approximately 23.5% and is based on 34 million diluted
weighted average shares outstanding.

I'll try using the base model with VAD:

(venv) $ cat base.txt
WER: 13.40%

And without VAD:

$ cat base.txt
WER: 12.57%

@WilliamTambellini
Contributor

Thanks @danbev.
Indeed, it would be better to (re)test with Whisper V3.
