tests : add a new benchmark test for long-form audio #3185

Merged: 4 commits into ggml-org:master on May 28, 2025

Conversation

fujimotos
Contributor

Based on Earnings-21 corpus by Del Rio et al.

Earnings-21: A Practical Benchmark for ASR in the Wild (2021)
https://arxiv.org/abs/2104.11348

This dataset contains 39 hours of long-form speech, sourced from public
earnings calls. Each recording contains roughly 50 minutes of English
dialogue among multiple speakers (2-20 people).

This benchmark suite should allow us to evaluate the performance of
whisper.cpp on long-form audio data.


Signed-off-by: Fujimoto Seiji <[email protected]>
Based on feedback from Daniel Bevenius.

 - Simplify how to download & prepare a Silero VAD model.
 - Fix typo: inferece -> inference

Signed-off-by: Fujimoto Seiji <[email protected]>
@danbev
Collaborator

danbev commented May 27, 2025

@fujimotos I ran into this issue yesterday when testing:

output_txt: saving output to 'speech-datasets/earnings21/media/4397829.mp3.tmp.txt'
mv speech-datasets/earnings21/media/4397829.mp3.tmp.txt speech-datasets/earnings21/media/4397829.mp3.txt
python eval.py speech-datasets/earnings21/earnings21-file-metadata.csv > tiny.txt.tmp
Traceback (most recent call last):
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 59, in <module>
    main()
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 46, in main
    hyp_orig = get_hypothesis()
               ^^^^^^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 24, in get_hypothesis
    text = fp.read().strip()
           ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 31672: invalid start byte

I was able to get around this using the following change:

diff --git a/tests/earnings21/eval.py b/tests/earnings21/eval.py
index a8e57bbe..a4dc7afa 100644
--- a/tests/earnings21/eval.py
+++ b/tests/earnings21/eval.py
@@ -20,7 +20,7 @@ def get_reference():
 def get_hypothesis():
     hyp = {}
     for path in glob.glob("speech-datasets/earnings21/media/*.mp3.txt"):
-        with open(path) as fp:
+        with open(path, errors='ignore') as fp:
             text = fp.read().strip()
         code = os.path.basename(path).replace(".mp3.txt", "")
         hyp[code] = text
         

So instead of crashing, this just skips the invalid characters. I think this happened when testing with VAD and the tiny model.
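For illustration (not part of the PR), here is a standalone sketch of how `errors='ignore'` changes the behavior of `open()`. The file path and contents below are made up; the payload reuses the 0x93 byte from the traceback:

```python
import os
import tempfile

# Write a file containing a byte (0x93) that is not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as fp:
    fp.write(b"example.\n\x93ing the")

# The default strict decoder raises UnicodeDecodeError on that byte.
try:
    with open(path, encoding="utf-8") as fp:
        fp.read()
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# errors='ignore' silently drops the undecodable byte instead.
with open(path, encoding="utf-8", errors="ignore") as fp:
    text = fp.read()
print(repr(text))  # the 0x93 byte is simply gone
```

Note that `errors='ignore'` discards data rather than recovering it; the characters around the bad byte survive, but the byte itself is lost from the hypothesis text.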

Based on feedback from Daniel Bevenius.

Add 'errors' parameter to open() in order to avoid unhandled
exception on invalid UTF-8 bytes.

Signed-off-by: Fujimoto Seiji <[email protected]>
@fujimotos
Contributor Author

fujimotos commented May 27, 2025

So instead of crashing this should just skip invalid characters.

@danbev OK. I just pushed 57c15c5 that modifies the open() call based on your suggestion.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 31672: invalid start byte

Side note: I'm mildly curious why this happened in the first place. I did not see this error when I
ran the benchmark, and I have never seen a Whisper model emit tokens that cannot
be interpreted as UTF-8.
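One clue, consistent with where the thread ends up: the offending byte 0x93 is not a valid UTF-8 start byte, but in Windows-1252 it encodes a left curly quotation mark. A quick illustrative check (not from the PR):

```python
raw = b"\x93"

# As UTF-8, 0x93 is a continuation byte and cannot start a character.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8:", e.reason)  # 'invalid start byte'

# In Windows-1252 the same byte is a left curly quote (U+201C).
ch = raw.decode("cp1252")
print(repr(ch), hex(ord(ch)))
```

So a single Windows-1252-encoded curly quote in an otherwise-UTF-8 file is enough to make a strict `fp.read()` fail.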

@danbev
Collaborator

danbev commented May 27, 2025

I'm mildly curious why this happened in the first place.

I'm re-running this now to reproduce it and hopefully provide some more information.

(venv) $ cat eval.conf
WHISPER_MODEL = tiny
WHISPER_FLAGS = --no-prints --language en --output-txt --vad --vad-model ../../models/for-tests-silero-v5.1.2-ggml.bin

$ make
...
[01:14:27.400 --> 01:14:31.130]   gentlemen that concludes today's call thank you for participating and you may now disconnect.
output_txt: saving output to 'speech-datasets/earnings21/media/4397829.mp3.tmp.txt'
mv speech-datasets/earnings21/media/4397829.mp3.tmp.txt speech-datasets/earnings21/media/4397829.mp3.txt
python eval.py speech-datasets/earnings21/earnings21-file-metadata.csv > tiny.txt.tmp
Traceback (most recent call last):
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 59, in <module>
    main()
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 46, in main
    hyp_orig = get_hypothesis()
               ^^^^^^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 24, in get_hypothesis
    text = fp.read().strip()
           ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 31672: invalid start byte
make[1]: *** [tiny.txt] Error 1
make: *** [eval] Error 2

Inspecting the file in question (after adding a try/catch block to get it):

(venv) $ hexdump -C speech-datasets/earnings21/media/4375653.mp3.txt | grep " 93 "
00007bb0  78 61 6d 70 6c 65 2e 0a  93 69 6e 67 20 74 68 65  |xample...ing the|

(venv) $ file -I speech-datasets/earnings21/media/4375653.mp3.txt
speech-datasets/earnings21/media/4375653.mp3.txt: text/plain; charset=unknown-8bit

(venv) $ find speech-datasets/earnings21/media/ -name "*.txt" -exec file -I {} \; | grep -v utf-8
speech-datasets/earnings21/media/4341191.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384198.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4387383.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4394084.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384683.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4364366.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4367318.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4385388.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4374910.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4360674.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4387332.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4330115.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4360717.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4387865.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4392809.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366522.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4397800.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4367535.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4385072.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366893.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366429.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4385939.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4346818.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4320211.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4386541.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4375653.mp3.txt: text/plain; charset=unknown-8bit
speech-datasets/earnings21/media/4382825.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4383161.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4359971.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4360366.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384744.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4344338.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4365948.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366302.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4397829.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4368670.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384964.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4344866.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4389907.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4346923.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4365024.mp3.txt: text/plain; charset=us-ascii

It seems to be just this one file, so perhaps we can convert it. (This won't work, however, as the file is generated.)

(venv) $ iconv -f windows-1252 -t utf-8 speech-datasets/earnings21/media/4375653.mp3.txt > temp.txt
(venv) $ mv temp.txt speech-datasets/earnings21/media/4375653.mp3.txt

After doing that I'm able to run without getting this error:

(venv) $ python eval.py speech-datasets/earnings21/earnings21-file-metadata.csv > tiny.txt.tmp

Member

@ggerganov ggerganov left a comment


Awesome! Haven't had time yet to play with these, but the change looks good. We should utilize these tests to improve the VAD and to start tracking the WER across new builds of whisper.cpp.
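Since the plan is to start tracking WER across builds, here is a minimal sketch of how word error rate can be computed: the word-level Levenshtein distance between hypothesis and reference, divided by the reference length. This is illustrative only; the `wer` helper below is hypothetical, and `eval.py` in this PR may compute it differently or via a dedicated library:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic program over the edit-distance matrix.
    d = list(range(len(hyp) + 1))  # distance from empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (0 if words match)
            prev = cur
    return d[-1] / len(ref)

print(wer("a b c", "a x c"))          # one substitution out of three words
print(wer("hello world", "hello world"))
```

In practice the reference and hypothesis are normalized first (casing, punctuation, number formatting), which matters a lot for long-form earnings-call transcripts.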

Based on the discussion in PR#3185.

Evidently whisper.cpp can emit a quotation mark as '0x93', which
implies Windows-1252 (Microsoft's extension of ASCII) and cannot be
decoded as UTF-8.

Add an explicit decoding loop to address the issue.

Signed-off-by: Fujimoto Seiji <[email protected]>
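The commit message above mentions an explicit decoding loop; commit 4681e55 itself is not shown in this thread, so the following is only a plausible sketch of what such a loop could look like (hypothetical; the real implementation may differ). It tries UTF-8 first and falls back to Windows-1252 only for the bytes that fail, so valid multi-byte UTF-8 elsewhere in the file is preserved:

```python
def decode_transcript(raw: bytes) -> str:
    """Decode mostly-UTF-8 bytes, recovering stray Windows-1252 bytes."""
    try:
        return raw.decode("utf-8")  # fast path: the file is clean UTF-8
    except UnicodeDecodeError:
        pass
    out = []
    i = 0
    while i < len(raw):
        # Try progressively shorter slices starting at i
        # (a UTF-8 character is at most 4 bytes long).
        for end in range(min(i + 4, len(raw)), i, -1):
            try:
                out.append(raw[i:end].decode("utf-8"))
                i = end
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid UTF-8 here; treat this byte as Windows-1252
            # (e.g. 0x93 becomes a left curly quote).
            out.append(raw[i:i + 1].decode("cp1252", errors="replace"))
            i += 1
    return "".join(out)

print(repr(decode_transcript(b"example.\n\x93ing")))
```

Unlike `errors='ignore'`, this recovers the curly quote instead of dropping it, at the cost of guessing the legacy encoding for the bad bytes.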
@fujimotos
Contributor Author

@ggerganov Thank you!

@danbev Based on your investigation, I pushed 4681e55, which updates
the decoding routine. Just let me know if you notice any other issues
before merging this PR.

@danbev danbev merged commit b9d27b1 into ggml-org:master May 28, 2025
53 checks passed
@danbev
Collaborator

danbev commented May 28, 2025

@fujimotos Thanks for this, with your latest change I was able to run the tests without error 👍
