tests : add a new benchmark test for long-form audio #3185

Merged: 4 commits into ggml-org:master on May 28, 2025

Conversation

fujimotos
Contributor

Based on Earnings-21 corpus by Del Rio et al.

Earnings-21: A Practical Benchmark for ASR in the Wild (2021)
https://arxiv.org/abs/2104.11348

This dataset contains 39 hours of long-form speech, sourced from public
earnings calls. Each recording contains roughly 50 minutes of English
dialogue among multiple speakers (2-20 people).

This benchmark suite should allow us to evaluate the performance of
whisper.cpp on long-form audio data.


Signed-off-by: Fujimoto Seiji <[email protected]>
Based on feedback from Daniel Bevenius.

 - Simplify how to download & prepare a Silero VAD model.
 - Fix typo: inferece -> inference

Signed-off-by: Fujimoto Seiji <[email protected]>
@danbev
Collaborator

danbev commented May 27, 2025

@fujimotos I ran into this issue yesterday when testing:

output_txt: saving output to 'speech-datasets/earnings21/media/4397829.mp3.tmp.txt'
mv speech-datasets/earnings21/media/4397829.mp3.tmp.txt speech-datasets/earnings21/media/4397829.mp3.txt
python eval.py speech-datasets/earnings21/earnings21-file-metadata.csv > tiny.txt.tmp
Traceback (most recent call last):
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 59, in <module>
    main()
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 46, in main
    hyp_orig = get_hypothesis()
               ^^^^^^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 24, in get_hypothesis
    text = fp.read().strip()
           ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 31672: invalid start byte

I was able to get around this using the following change:

diff --git a/tests/earnings21/eval.py b/tests/earnings21/eval.py
index a8e57bbe..a4dc7afa 100644
--- a/tests/earnings21/eval.py
+++ b/tests/earnings21/eval.py
@@ -20,7 +20,7 @@ def get_reference():
 def get_hypothesis():
     hyp = {}
     for path in glob.glob("speech-datasets/earnings21/media/*.mp3.txt"):
-        with open(path) as fp:
+        with open(path, errors='ignore') as fp:
             text = fp.read().strip()
         code = os.path.basename(path).replace(".mp3.txt", "")
         hyp[code] = text
         

So instead of crashing, this just skips the invalid characters. I think this happened when testing with VAD and the tiny model.
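For illustration (not part of the PR), here is a standalone sketch of how `errors='ignore'` changes the behavior of `open()`. The file path and contents below are made up; the payload reuses the 0x93 byte from the traceback:

```python
import os
import tempfile

# Write a file containing a byte (0x93) that is not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as fp:
    fp.write(b"example.\n\x93ing the")

# The default strict decoder raises UnicodeDecodeError on that byte.
try:
    with open(path, encoding="utf-8") as fp:
        fp.read()
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# errors='ignore' silently drops the undecodable byte instead.
with open(path, encoding="utf-8", errors="ignore") as fp:
    text = fp.read()
print(repr(text))  # the 0x93 byte is simply gone
```

Note that `errors='ignore'` discards data rather than recovering it; the characters around the bad byte survive, but the byte itself is lost from the hypothesis text.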

Based on feedback from Daniel Bevenius.

Add 'errors' parameter to open() in order to avoid unhandled
exception on invalid UTF-8 bytes.

Signed-off-by: Fujimoto Seiji <[email protected]>
@fujimotos
Contributor Author

fujimotos commented May 27, 2025

So instead of crashing this should just skip invalid characters.

@danbev OK. I just pushed 57c15c5 that modifies the open() call based on your suggestion.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 31672: invalid start byte

Side note: I'm mildly curious why this happened in the first place. I did not see this error when I
ran the benchmark, and I have never seen a Whisper model emit tokens that cannot
be interpreted as UTF-8.
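One clue, consistent with where the thread ends up: the offending byte 0x93 is not a valid UTF-8 start byte, but in Windows-1252 it encodes a left curly quotation mark. A quick illustrative check (not from the PR):

```python
raw = b"\x93"

# As UTF-8, 0x93 is a continuation byte and cannot start a character.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8:", e.reason)  # 'invalid start byte'

# In Windows-1252 the same byte is a left curly quote (U+201C).
ch = raw.decode("cp1252")
print(repr(ch), hex(ord(ch)))
```

So a single Windows-1252-encoded curly quote in an otherwise-UTF-8 file is enough to make a strict `fp.read()` fail.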

@danbev
Collaborator

danbev commented May 27, 2025

I'm mildly curious why this happened in the first place.

I'm re-running this now to reproduce it and hopefully provide some more information.

(venv) $ cat eval.conf
WHISPER_MODEL = tiny
WHISPER_FLAGS = --no-prints --language en --output-txt --vad --vad-model ../../models/for-tests-silero-v5.1.2-ggml.bin

$ make
...
[01:14:27.400 --> 01:14:31.130]   gentlemen that concludes today's call thank you for participating and you may now disconnect.
output_txt: saving output to 'speech-datasets/earnings21/media/4397829.mp3.tmp.txt'
mv speech-datasets/earnings21/media/4397829.mp3.tmp.txt speech-datasets/earnings21/media/4397829.mp3.txt
python eval.py speech-datasets/earnings21/earnings21-file-metadata.csv > tiny.txt.tmp
Traceback (most recent call last):
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 59, in <module>
    main()
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 46, in main
    hyp_orig = get_hypothesis()
               ^^^^^^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 24, in get_hypothesis
    text = fp.read().strip()
           ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 31672: invalid start byte
make[1]: *** [tiny.txt] Error 1
make: *** [eval] Error 2

Inspecting the file in question (after adding a try/catch block to get it):

(venv) $ hexdump -C speech-datasets/earnings21/media/4375653.mp3.txt | grep " 93 "
00007bb0  78 61 6d 70 6c 65 2e 0a  93 69 6e 67 20 74 68 65  |xample...ing the|

(venv) $ file -I speech-datasets/earnings21/media/4375653.mp3.txt
speech-datasets/earnings21/media/4375653.mp3.txt: text/plain; charset=unknown-8bit

(venv) $ find speech-datasets/earnings21/media/ -name "*.txt" -exec file -I {} \; | grep -v utf-8
speech-datasets/earnings21/media/4341191.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384198.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4387383.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4394084.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384683.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4364366.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4367318.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4385388.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4374910.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4360674.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4387332.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4330115.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4360717.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4387865.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4392809.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366522.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4397800.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4367535.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4385072.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366893.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366429.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4385939.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4346818.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4320211.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4386541.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4375653.mp3.txt: text/plain; charset=unknown-8bit
speech-datasets/earnings21/media/4382825.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4383161.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4359971.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4360366.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384744.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4344338.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4365948.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366302.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4397829.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4368670.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384964.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4344866.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4389907.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4346923.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4365024.mp3.txt: text/plain; charset=us-ascii

It seems to be just this one file, so perhaps we can convert it. (This won't work, however, as the file is generated.)

(venv) $ iconv -f windows-1252 -t utf-8 speech-datasets/earnings21/media/4375653.mp3.txt > temp.txt
(venv) $ mv temp.txt speech-datasets/earnings21/media/4375653.mp3.txt

After doing that I'm able to run without getting this error:

(venv) $ python eval.py speech-datasets/earnings21/earnings21-file-metadata.csv > tiny.txt.tmp

Member

@ggerganov ggerganov left a comment


Awesome! Haven't had time yet to play with these, but the change looks good. We should utilize these tests to improve the VAD and to start tracking the WER across new builds of whisper.cpp.
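Since the plan is to start tracking WER across builds, here is a minimal sketch of how word error rate can be computed: the word-level Levenshtein distance between hypothesis and reference, divided by the reference length. This is illustrative only; the `wer` helper below is hypothetical, and `eval.py` in this PR may compute it differently or via a dedicated library:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic program over the edit-distance matrix.
    d = list(range(len(hyp) + 1))  # distance from empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (0 if words match)
            prev = cur
    return d[-1] / len(ref)

print(wer("a b c", "a x c"))          # one substitution out of three words
print(wer("hello world", "hello world"))
```

In practice the reference and hypothesis are normalized first (casing, punctuation, number formatting), which matters a lot for long-form earnings-call transcripts.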

Based on the discussion in PR#3185.

Evidently whisper.cpp can emit a quotation mark as '0x93', which
implies Windows-1252 (Microsoft's extension of ASCII) and cannot be
decoded as UTF-8.

Add an explicit decoding loop to address the issue.

Signed-off-by: Fujimoto Seiji <[email protected]>
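The commit message above mentions an explicit decoding loop; commit 4681e55 itself is not shown in this thread, so the following is only a plausible sketch of what such a loop could look like (hypothetical; the real implementation may differ). It tries UTF-8 first and falls back to Windows-1252 only for the bytes that fail, so valid multi-byte UTF-8 elsewhere in the file is preserved:

```python
def decode_transcript(raw: bytes) -> str:
    """Decode mostly-UTF-8 bytes, recovering stray Windows-1252 bytes."""
    try:
        return raw.decode("utf-8")  # fast path: the file is clean UTF-8
    except UnicodeDecodeError:
        pass
    out = []
    i = 0
    while i < len(raw):
        # Try progressively shorter slices starting at i
        # (a UTF-8 character is at most 4 bytes long).
        for end in range(min(i + 4, len(raw)), i, -1):
            try:
                out.append(raw[i:end].decode("utf-8"))
                i = end
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid UTF-8 here; treat this byte as Windows-1252
            # (e.g. 0x93 becomes a left curly quote).
            out.append(raw[i:i + 1].decode("cp1252", errors="replace"))
            i += 1
    return "".join(out)

print(repr(decode_transcript(b"example.\n\x93ing")))
```

Unlike `errors='ignore'`, this recovers the curly quote instead of dropping it, at the cost of guessing the legacy encoding for the bad bytes.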
@fujimotos
Contributor Author

@ggerganov Thank you!

@danbev Based on your investigation, I pushed 4681e55, which updates
the decoding routine. Just let me know if you notice any other issues
before merging this PR.

@danbev danbev merged commit b9d27b1 into ggml-org:master May 28, 2025
53 checks passed
@danbev
Collaborator

danbev commented May 28, 2025

@fujimotos Thanks for this, with your latest change I was able to run the tests without error 👍
