tests : add a new benchmark test for long-form audio #3185
Based on the "Earnings-21" corpus by Del Rio et al.
Earnings-21: A Practical Benchmark for ASR in the Wild (2021)
https://arxiv.org/abs/2104.11348
This dataset contains 39 hours of long-form speech, sourced from public earnings calls. Each recording contains roughly 50 minutes of English dialogue between multiple speakers (2-20 persons).
This benchmark suite should allow us to evaluate the performance of whisper.cpp on long-form audio data.
Signed-off-by: Fujimoto Seiji <[email protected]>
Based on feedback from Daniel Bevenius.
- Simplify how to download & prepare a Silero VAD model.
- Fix typo: inferece -> inference
Signed-off-by: Fujimoto Seiji <[email protected]>
@fujimotos I ran into this issue yesterday when testing:
output_txt: saving output to 'speech-datasets/earnings21/media/4397829.mp3.tmp.txt'
mv speech-datasets/earnings21/media/4397829.mp3.tmp.txt speech-datasets/earnings21/media/4397829.mp3.txt
python eval.py speech-datasets/earnings21/earnings21-file-metadata.csv > tiny.txt.tmp
Traceback (most recent call last):
File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 59, in <module>
main()
File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 46, in main
hyp_orig = get_hypothesis()
^^^^^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 24, in get_hypothesis
text = fp.read().strip()
^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 31672: invalid start byte
I was able to get around this using the following change:
diff --git a/tests/earnings21/eval.py b/tests/earnings21/eval.py
index a8e57bbe..a4dc7afa 100644
--- a/tests/earnings21/eval.py
+++ b/tests/earnings21/eval.py
@@ -20,7 +20,7 @@ def get_reference():
 def get_hypothesis():
     hyp = {}
     for path in glob.glob("speech-datasets/earnings21/media/*.mp3.txt"):
-        with open(path) as fp:
+        with open(path, errors='ignore') as fp:
             text = fp.read().strip()
         code = os.path.basename(path).replace(".mp3.txt", "")
         hyp[code] = text
So instead of crashing this should just skip invalid characters. I think this was when testing using VAD and the |
Based on feedback from Daniel Bevenius.
Add 'errors' parameter to open() in order to avoid an unhandled exception on invalid UTF-8 bytes.
Signed-off-by: Fujimoto Seiji <[email protected]>
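For readers following the thread, here is a small sketch of what the `errors` handlers do with the kind of byte sequence seen in the failing transcript. The sample bytes mirror the hexdump posted in this conversation; the variable names are only for illustration, and this is not code from `eval.py`.

```python
# Bytes resembling the hexdump in this thread: 0x93 is not a valid UTF-8 start byte.
raw = b"example.\n\x93ing the"

# Strict decoding (what open() does when no errors= handler is given)
# raises UnicodeDecodeError on the 0x93 byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops the offending byte.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible U+FFFD marker where the byte was.
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed, repr(ignored), repr(replaced))
```

Note that `ignore` discards the character entirely, so a quotation mark encoded as 0x93 simply vanishes from the hypothesis text. That is probably harmless if punctuation is normalized away before scoring, but that is an assumption about the evaluation pipeline.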
@danbev OK. I just pushed 57c15c5 that modifies the
Side note: I'm mildly curious why this happened in the first place. I did not see this error when I |
I'm re-running this now to reproduce it and hopefully provide some more information.
(venv) $ cat eval.conf
WHISPER_MODEL = tiny
WHISPER_FLAGS = --no-prints --language en --output-txt --vad --vad-model ../../models/for-tests-silero-v5.1.2-ggml.bin
$ make
...
[01:14:27.400 --> 01:14:31.130] gentlemen that concludes today's call thank you for participating and you may now disconnect.
output_txt: saving output to 'speech-datasets/earnings21/media/4397829.mp3.tmp.txt'
mv speech-datasets/earnings21/media/4397829.mp3.tmp.txt speech-datasets/earnings21/media/4397829.mp3.txt
python eval.py speech-datasets/earnings21/earnings21-file-metadata.csv > tiny.txt.tmp
Traceback (most recent call last):
File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 59, in <module>
main()
File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 46, in main
hyp_orig = get_hypothesis()
^^^^^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/tests/earnings21/eval.py", line 24, in get_hypothesis
text = fp.read().strip()
^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 31672: invalid start byte
make[1]: *** [tiny.txt] Error 1
make: *** [eval] Error 2
Inspecting the file in question (after adding a try/except block to identify it):
(venv) $ hexdump -C speech-datasets/earnings21/media/4375653.mp3.txt | grep " 93 "
00007bb0 78 61 6d 70 6c 65 2e 0a 93 69 6e 67 20 74 68 65 |xample...ing the|
(venv) $ file -I speech-datasets/earnings21/media/4375653.mp3.txt
speech-datasets/earnings21/media/4375653.mp3.txt: text/plain; charset=unknown-8bit
(venv) $ find speech-datasets/earnings21/media/ -name "*.txt" -exec file -I {} \; | grep -v utf-8
speech-datasets/earnings21/media/4341191.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384198.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4387383.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4394084.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384683.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4364366.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4367318.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4385388.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4374910.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4360674.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4387332.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4330115.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4360717.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4387865.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4392809.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366522.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4397800.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4367535.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4385072.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366893.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366429.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4385939.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4346818.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4320211.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4386541.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4375653.mp3.txt: text/plain; charset=unknown-8bit
speech-datasets/earnings21/media/4382825.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4383161.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4359971.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4360366.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384744.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4344338.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4365948.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4366302.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4397829.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4368670.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4384964.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4344866.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4389907.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4346923.mp3.txt: text/plain; charset=us-ascii
speech-datasets/earnings21/media/4365024.mp3.txt: text/plain; charset=us-ascii
(venv) $ iconv -f windows-1252 -t utf-8 speech-datasets/earnings21/media/4375653.mp3.txt > temp.txt
(venv) $ mv temp.txt speech-datasets/earnings21/media/4375653.mp3.txt
After doing that I'm able to run without getting this error:
(venv) $ python eval.py speech-datasets/earnings21/earnings21-file-metadata.csv > tiny.txt.tmp
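The `find ... file -I` scan above can also be done from Python, which has the advantage of reporting the exact byte offset of the first invalid sequence. The following is a rough sketch; `find_non_utf8` is a hypothetical helper name, not something that exists in `eval.py`.

```python
import glob


def find_non_utf8(pattern):
    """Return (path, byte offset) for each matching file that is not valid UTF-8."""
    bad = []
    for path in sorted(glob.glob(pattern)):
        # Read raw bytes so that decoding errors can be inspected explicitly.
        with open(path, "rb") as fp:
            data = fp.read()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte.
            bad.append((path, exc.start))
    return bad


if __name__ == "__main__":
    for path, offset in find_non_utf8("speech-datasets/earnings21/media/*.mp3.txt"):
        print(f"{path}: invalid byte at offset {offset}")
```

Pure ASCII files decode cleanly as UTF-8, so the `us-ascii` entries in the listing above would not be reported by this check.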
Awesome! Haven't had time yet to play with these, but the change looks good. We should utilize these tests to improve the VAD and to start tracking the WER across new builds of whisper.cpp.
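As a reminder of what the tracked metric measures, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words. Below is a minimal sketch; `word_error_rate` is an illustrative name, it does no text normalization, and it is not necessarily how `eval.py` computes its score.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and hyp[:j]; start with the empty reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("you may now disconnect", "you may disconnect"))
```

In practice a benchmark would normalize casing and punctuation on both sides before scoring, so raw numbers from this sketch are not comparable with published WER figures.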
Based on the discussion in PR#3185.
Evidently whisper.cpp can emit a quotation mark as the byte 0x93, which implies Windows-1252 (Microsoft's ASCII extension) and cannot be decoded as UTF-8.
Add an explicit decoding loop to address the issue.
Signed-off-by: Fujimoto Seiji <[email protected]>
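One possible shape for such a decoding loop is sketched below. This is an illustration of the idea only, not the code from the commit; the helper name `read_text` and the fallback order are assumptions.

```python
def read_text(path, encodings=("utf-8", "windows-1252")):
    """Decode a file with the first encoding in `encodings` that succeeds."""
    # Read raw bytes once, then try each candidate encoding in order.
    with open(path, "rb") as fp:
        data = fp.read()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort (an assumption of this sketch): keep going, but mark
    # undecodable bytes with U+FFFD instead of raising.
    return data.decode("utf-8", errors="replace")
```

Trying UTF-8 first matters because almost any byte string also decodes under Windows-1252, so placing it earlier would mis-read genuine UTF-8 multi-byte characters. With this order, a file containing 0x93 fails the UTF-8 attempt and is then decoded as Windows-1252, where 0x93 maps to the left double quotation mark (U+201C).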
@ggerganov Thank you! @danbev Based on your investigation, I pushed 4681e55 that updates |
@fujimotos Thanks for this, with your latest change I was able to run the tests without error 👍