Commit b9d27b1

tests : add a new benchmark test for long-form audio (#3185)
* tests : add a new benchmark test for long-form audio

  Based on the "Earnings-21" corpus by Del Rio et al.:
  "Earnings-21: A Practical Benchmark for ASR in the Wild" (2021),
  https://arxiv.org/abs/2104.11348

  This dataset contains 39 hours of long-form speech sourced from public
  earnings calls. Each recording contains roughly 50 minutes of English
  dialogue between multiple speakers (2-20 persons). This benchmark suite
  should allow us to evaluate the performance of whisper.cpp on long-form
  audio data.

* tests : apply PR feedback to 'earnings21/README.md'

  Based on feedback from Daniel Bevenius:
  - Simplify how to download & prepare a Silero VAD model.
  - Fix typo: inferece -> inference

* tests : avoid crashing on non-UTF-8 characters

  Based on feedback from Daniel Bevenius: add an 'errors' parameter to
  open() in order to avoid an unhandled exception on invalid UTF-8 bytes.

* tests : try to interpret the hypothesis as Windows-1252

  Based on the discussion in PR #3185. Evidently whisper.cpp can represent
  a quotation mark as 0x93, which implies Windows-1252 (Microsoft's
  extension to ASCII) and cannot be decoded as UTF-8. Add an explicit
  decoding loop to address the issue.

Signed-off-by: Fujimoto Seiji <[email protected]>
1 parent 0ed00d9 commit b9d27b1

File tree

11 files changed: +2639 −0 lines changed

tests/earnings21/.gitignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
__pycache__
*.tar.gz
*.txt
eval.conf
venv
speech-datasets

tests/earnings21/Makefile

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
GIT_URL = https://github.com/revdotcom/speech-datasets

all: eval

eval:
	$(MAKE) -f eval.mk

clean:
	$(MAKE) -f eval.mk clean

get-audio:
	git clone --depth 1 --filter=blob:none --sparse $(GIT_URL)
	git -C speech-datasets sparse-checkout init --cone
	git -C speech-datasets sparse-checkout set earnings21

.PHONY: all eval clean get-audio

tests/earnings21/README.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# whisper.cpp/tests/earnings21

[Earnings-21](https://arxiv.org/abs/2104.11348) is a real-world benchmark
dataset that contains 39 hours of long-form English speech, sourced from
public earnings calls.

This directory contains a set of scripts to evaluate the performance of
whisper.cpp on the Earnings-21 corpus.

## Quick Start

1. (Prerequisite) Compile `whisper-cli` and prepare the Whisper
   model in `ggml` format.

   ```
   $ # Execute the commands below in the project root dir.
   $ cmake -B build
   $ cmake --build build --config Release
   $ ./models/download-ggml-model.sh tiny
   ```

   Consult [whisper.cpp/README.md](../../README.md) for more details.

2. Download the audio files.

   ```
   $ make get-audio
   ```

3. Set up the environment to compute the WER score.

   ```
   $ pip install -r requirements.txt
   ```

   For example, if you use `virtualenv`, you can set it up as follows:

   ```
   $ python3 -m venv venv
   $ . venv/bin/activate
   $ pip install -r requirements.txt
   ```

4. Run the benchmark test.

   ```
   $ make
   ```

## How-to guides

### How to change the inference parameters

Create `eval.conf` and override variables.

```
WHISPER_MODEL = large-v3-turbo
WHISPER_FLAGS = --no-prints --threads 8 --language en --output-txt
```

Check out `eval.mk` for more details.

### How to perform the benchmark test on a 10-hour subset

Earnings-21 provides a small but representative subset (approximately
10 hours of audio data) to evaluate ASR systems quickly.

To switch to the subset, create `eval.conf` and add the following line:

```
EARNINGS21_EVAL10 = yes
```

### How to run the benchmark test using VAD

First, you need to download a VAD model:

```
$ # Execute the commands below in the project root dir.
$ ./models/download-vad-model.sh silero-v5.1.2
```

Create `eval.conf` with the following content:

```
WHISPER_FLAGS = --no-prints --language en --output-txt --vad --vad-model ../../models/ggml-silero-v5.1.2.bin
```
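For orientation, the WER score reported in step 4 is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. The evaluation scripts compute it with the `jiwer` package; a minimal pure-Python sketch of the metric itself looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[j] holds the edit distance between the ref words seen so far
    # and the first j hypothesis words (single-row dynamic programming).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev is the diagonal value d[i-1][j-1]; substitution costs
            # 0 when the words match, 1 otherwise.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)

print(wer("a b c", "a x c"))  # one substitution out of three words
```

Both sides are normalized (lowercased, punctuation stripped) before scoring, so cosmetic differences do not count as errors.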

tests/earnings21/eval.mk

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
PYTHON = python

WHISPER_PREFIX = ../../
WHISPER_MODEL = tiny

WHISPER_CLI = $(WHISPER_PREFIX)build/bin/whisper-cli
WHISPER_FLAGS = --no-prints --language en --output-txt

# You can create eval.conf to override the WHISPER_* variables
# defined above.
-include eval.conf

# Add `EARNINGS21_EVAL10 = yes` to eval.conf to switch to a
# 10-hour subset. See "speech-datasets/earnings21/README.md" for
# more details about this subset.
ifdef EARNINGS21_EVAL10
METADATA_CSV = speech-datasets/earnings21/eval10-file-metadata.csv
AUDIO_SRCS = speech-datasets/earnings21/media/4320211.mp3 \
             speech-datasets/earnings21/media/4341191.mp3 \
             speech-datasets/earnings21/media/4346818.mp3 \
             speech-datasets/earnings21/media/4359971.mp3 \
             speech-datasets/earnings21/media/4365024.mp3 \
             speech-datasets/earnings21/media/4366522.mp3 \
             speech-datasets/earnings21/media/4366893.mp3 \
             speech-datasets/earnings21/media/4367535.mp3 \
             speech-datasets/earnings21/media/4383161.mp3 \
             speech-datasets/earnings21/media/4384964.mp3 \
             speech-datasets/earnings21/media/4387332.mp3
else
METADATA_CSV = speech-datasets/earnings21/earnings21-file-metadata.csv
AUDIO_SRCS = $(sort $(wildcard speech-datasets/earnings21/media/*.mp3))
endif

TRANS_TXTS = $(addsuffix .txt, $(AUDIO_SRCS))

# We output the evaluation result to this file.
DONE = $(WHISPER_MODEL).txt

all: $(DONE)

$(DONE): $(TRANS_TXTS)
	$(PYTHON) eval.py $(METADATA_CSV) > $@.tmp
	mv $@.tmp $@

# Note: This task writes to a temporary file first to
# create the target file atomically.
%.mp3.txt: %.mp3
	$(WHISPER_CLI) $(WHISPER_FLAGS) --model $(WHISPER_PREFIX)models/ggml-$(WHISPER_MODEL).bin --file $^ --output-file $^.tmp
	mv $^.tmp.txt $^.txt

archive:
	tar -czf $(WHISPER_MODEL).tar.gz --exclude="*.mp3" speech-datasets/earnings21/media $(DONE)

clean:
	@rm -f $(TRANS_TXTS)
	@rm -f $(DONE)

.PHONY: all archive clean
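Both rules in eval.mk write to a temporary name first and `mv` the result into place, so an interrupted transcription never leaves a truncated `.txt` that a later `make` would mistake for a finished target. The same write-then-rename idiom can be sketched in Python (a minimal sketch; `write_atomically` is a hypothetical helper, not part of the test suite):

```python
import os
import tempfile

def write_atomically(path: str, text: str) -> None:
    # Write to a temporary file in the destination directory, then
    # rename it over the target. os.replace() is atomic on POSIX when
    # source and destination are on the same filesystem, so readers
    # never observe a half-written file.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as fp:
            fp.write(text)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # clean up the partial file on any failure
        raise
```

Creating the temporary file in the same directory as the target (rather than in `/tmp`) is what keeps the final rename a same-filesystem, atomic operation.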

tests/earnings21/eval.py

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
import os
import sys
import glob
import jiwer
from normalizers import EnglishTextNormalizer

def decode_hypothesis(b):
    try:
        # Depending on the platform, Whisper can emit a left double quotation
        # mark as 0x93, which is Microsoft's extension to ASCII. See #3185
        # for the background.
        return b.decode('windows-1252')
    except UnicodeDecodeError:
        return b.decode('utf-8', errors='ignore')

def get_reference():
    ref = {}
    for path in glob.glob("speech-datasets/earnings21/transcripts/nlp_references/*.nlp"):
        code = os.path.basename(path).replace(".nlp", "")
        buf = []
        with open(path) as fp:
            fp.readline()  # skip the header line
            for line in fp:
                token = line.split("|", maxsplit=1)[0]
                buf.append(token)
        ref[code] = " ".join(buf)
    return ref

def get_hypothesis():
    hyp = {}
    for path in glob.glob("speech-datasets/earnings21/media/*.mp3.txt"):
        with open(path, 'rb') as fp:
            text = decode_hypothesis(fp.read()).strip()
        code = os.path.basename(path).replace(".mp3.txt", "")
        hyp[code] = text
    return hyp

def get_codes(metadata_csv):
    codes = []
    with open(metadata_csv) as fp:
        fp.readline()  # skip the CSV header
        for line in fp:
            codes.append(line.split(",")[0])
    return sorted(codes)

def main():
    if len(sys.argv) < 2:
        print("Usage: %s METADATA_CSV" % sys.argv[0], file=sys.stderr)
        return 1

    metadata_csv = sys.argv[1]
    normalizer = EnglishTextNormalizer()

    ref_orig = get_reference()
    hyp_orig = get_hypothesis()

    ref_clean = []
    hyp_clean = []

    for code in get_codes(metadata_csv):
        ref_clean.append(normalizer(ref_orig[code]))
        hyp_clean.append(normalizer(hyp_orig[code]))

    wer = jiwer.wer(ref_clean, hyp_clean)
    print(f"WER: {wer * 100:.2f}%")

if __name__ == "__main__":
    sys.exit(main())
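The Windows-1252 fallback in `decode_hypothesis` can be exercised in isolation. Byte 0x93 is an invalid UTF-8 lead byte but maps to a left double quotation mark (U+201C) in Windows-1252, while a handful of bytes (e.g. 0x81) are undefined even in Windows-1252, which is what triggers the lenient UTF-8 path. A standalone sketch mirroring the function:

```python
def decode_hypothesis(b: bytes) -> str:
    # Prefer Windows-1252, which covers bytes like 0x93/0x94
    # ("smart" quotes); fall back to lenient UTF-8 for the few
    # bytes that Windows-1252 leaves undefined.
    try:
        return b.decode('windows-1252')
    except UnicodeDecodeError:
        return b.decode('utf-8', errors='ignore')

print(decode_hypothesis(b'\x93quote\x94'))  # "quote" with curly quotes
```

Note the trade-off: valid UTF-8 multi-byte sequences would also decode "successfully" as Windows-1252 mojibake, but for this suite the text is normalized afterwards (punctuation stripped), so the WER figure is unaffected.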

tests/earnings21/normalizers/LICENSE

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
Code in this directory is adapted from OpenAI Whisper project
(https://github.com/openai/whisper) and carries the following
copyright and license.

MIT License

Copyright (c) 2022 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
tests/earnings21/normalizers/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
from .basic import BasicTextNormalizer as BasicTextNormalizer
from .english import EnglishTextNormalizer as EnglishTextNormalizer

tests/earnings21/normalizers/basic.py

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
import re
import unicodedata

import regex

# non-ASCII letters that are not separated by "NFKD" normalization
ADDITIONAL_DIACRITICS = {
    "œ": "oe",
    "Œ": "OE",
    "ø": "o",
    "Ø": "O",
    "æ": "ae",
    "Æ": "AE",
    "ß": "ss",
    "ẞ": "SS",
    "đ": "d",
    "Đ": "D",
    "ð": "d",
    "Ð": "D",
    "þ": "th",
    "Þ": "th",
    "ł": "l",
    "Ł": "L",
}


def remove_symbols_and_diacritics(s: str, keep=""):
    """
    Replace any other markers, symbols, and punctuations with a space,
    and drop any diacritics (category 'Mn' and some manual mappings)
    """
    return "".join(
        (
            c
            if c in keep
            else (
                ADDITIONAL_DIACRITICS[c]
                if c in ADDITIONAL_DIACRITICS
                else (
                    ""
                    if unicodedata.category(c) == "Mn"
                    else " " if unicodedata.category(c)[0] in "MSP" else c
                )
            )
        )
        for c in unicodedata.normalize("NFKD", s)
    )


def remove_symbols(s: str):
    """
    Replace any other markers, symbols, punctuations with a space, keeping diacritics
    """
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


class BasicTextNormalizer:
    def __init__(self, remove_diacritics: bool = False, split_letters: bool = False):
        self.clean = (
            remove_symbols_and_diacritics if remove_diacritics else remove_symbols
        )
        self.split_letters = split_letters

    def __call__(self, s: str):
        s = s.lower()
        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # remove words between brackets
        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parenthesis
        s = self.clean(s).lower()

        if self.split_letters:
            s = " ".join(regex.findall(r"\X", s, regex.U))

        s = re.sub(
            r"\s+", " ", s
        )  # replace any successive whitespace characters with a space

        return s
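The normalization pipeline above can be condensed into a standalone sketch for experimentation: the same two regex passes plus the `remove_symbols` cleaner, with a final `strip()` added for tidy output (diacritic removal and `split_letters` are left out):

```python
import re
import unicodedata

def remove_symbols(s: str) -> str:
    # Replace marks, symbols, and punctuation (Unicode categories
    # M, S, P) with a space, keeping diacritics intact.
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )

def normalize(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # drop [bracketed] annotations
    s = re.sub(r"\(([^)]+?)\)", "", s)       # drop (parenthesized) asides
    s = remove_symbols(s)
    return re.sub(r"\s+", " ", s).strip()    # collapse whitespace

print(normalize("Hello, [noise] WORLD!"))  # -> hello world
```

Because both reference and hypothesis pass through the same normalizer before scoring, differences in casing, punctuation, and non-speech annotations do not inflate the WER.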
