Commit b9d27b1

tests : add a new benchmark test for long-form audio (#3185)
* tests : add a new benchmark test for long-form audio

  Based on the "Earnings-21" corpus by Del Rio et al.:
  "Earnings-21: A Practical Benchmark for ASR in the Wild" (2021),
  https://arxiv.org/abs/2104.11348

  This dataset contains 39 hours of long-form speech sourced from public
  earnings calls. Each recording contains roughly 50 minutes of English
  dialogue between multiple speakers (2-20 persons). This benchmark suite
  should allow us to evaluate the performance of whisper.cpp on long-form
  audio data.

* tests : apply PR feedback to 'earnings21/README.md'

  Based on feedback from Daniel Bevenius:
  - Simplify how to download & prepare a Silero VAD model.
  - Fix typo: inferece -> inference

* tests : avoid crashing on non-UTF-8 characters

  Based on feedback from Daniel Bevenius: add an 'errors' parameter to
  open() in order to avoid an unhandled exception on invalid UTF-8 bytes.

* tests : try to interpret the hypothesis as Windows-1252

  Based on the discussion in PR #3185. Evidently whisper.cpp can represent
  a quotation mark as 0x93, which implies Windows-1252 (Microsoft's
  extension to ASCII) and cannot be decoded as UTF-8. Add an explicit
  decoding loop to address the issue.

Signed-off-by: Fujimoto Seiji <[email protected]>
1 parent 0ed00d9 commit b9d27b1

File tree

11 files changed: +2639 −0 lines changed

tests/earnings21/.gitignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
__pycache__
*.tar.gz
*.txt
eval.conf
venv
speech-datasets

tests/earnings21/Makefile

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
GIT_URL = https://github.com/revdotcom/speech-datasets

all: eval

eval:
	$(MAKE) -f eval.mk

clean:
	$(MAKE) -f eval.mk clean

get-audio:
	git clone --depth 1 --filter=blob:none --sparse $(GIT_URL)
	git -C speech-datasets sparse-checkout init --cone
	git -C speech-datasets sparse-checkout set earnings21

.PHONY: all eval clean get-audio

tests/earnings21/README.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# whisper.cpp/tests/earnings21

[Earnings-21](https://arxiv.org/abs/2104.11348) is a real-world benchmark
dataset that contains 39 hours of long-form English speech, sourced from
public earnings calls.

This directory contains a set of scripts to evaluate the performance of
whisper.cpp on the Earnings-21 corpus.

## Quick Start

1. (Prerequisite) Compile `whisper-cli` and prepare the Whisper
   model in `ggml` format.

   ```
   $ # Execute the commands below in the project root dir.
   $ cmake -B build
   $ cmake --build build --config Release
   $ ./models/download-ggml-model.sh tiny
   ```

   Consult [whisper.cpp/README.md](../../README.md) for more details.

2. Download the audio files.

   ```
   $ make get-audio
   ```

3. Set up the environment to compute the WER score.

   ```
   $ pip install -r requirements.txt
   ```

   For example, if you use `virtualenv`, you can set it up as follows:

   ```
   $ python3 -m venv venv
   $ . venv/bin/activate
   $ pip install -r requirements.txt
   ```

4. Run the benchmark test.

   ```
   $ make
   ```

## How-to guides

### How to change the inference parameters

Create `eval.conf` and override variables.

```
WHISPER_MODEL = large-v3-turbo
WHISPER_FLAGS = --no-prints --threads 8 --language en --output-txt
```

Check out `eval.mk` for more details.

### How to perform the benchmark test on a 10-hour subset

Earnings-21 provides a small but representative subset (approximately
10 hours of audio data) to evaluate ASR systems quickly.

To switch to the subset, create `eval.conf` and add the following line:

```
EARNINGS21_EVAL10 = yes
```

### How to run the benchmark test using VAD

First, you need to download a VAD model:

```
$ # Execute the commands below in the project root dir.
$ ./models/download-vad-model.sh silero-v5.1.2
```

Create `eval.conf` with the following content:

```
WHISPER_FLAGS = --no-prints --language en --output-txt --vad --vad-model ../../models/ggml-silero-v5.1.2.bin
```
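For orientation, the WER score reported in step 4 is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. The evaluation scripts compute it with the `jiwer` package; a minimal pure-Python sketch of the metric itself looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[j] holds the edit distance between the ref words seen so far
    # and the first j hypothesis words (single-row dynamic programming).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev is the diagonal value d[i-1][j-1]; substitution costs
            # 0 when the words match, 1 otherwise.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)

print(wer("a b c", "a x c"))  # one substitution out of three words
```

Both sides are normalized (lowercased, punctuation stripped) before scoring, so cosmetic differences do not count as errors.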

tests/earnings21/eval.mk

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
PYTHON = python

WHISPER_PREFIX = ../../
WHISPER_MODEL = tiny

WHISPER_CLI = $(WHISPER_PREFIX)build/bin/whisper-cli
WHISPER_FLAGS = --no-prints --language en --output-txt

# You can create eval.conf to override the WHISPER_* variables
# defined above.
-include eval.conf

# Add `EARNINGS21_EVAL10 = yes` to eval.conf to switch to a
# 10-hour subset. See "speech-datasets/earnings21/README.md" for
# more details about this subset.
ifdef EARNINGS21_EVAL10
METADATA_CSV = speech-datasets/earnings21/eval10-file-metadata.csv
AUDIO_SRCS = speech-datasets/earnings21/media/4320211.mp3 \
             speech-datasets/earnings21/media/4341191.mp3 \
             speech-datasets/earnings21/media/4346818.mp3 \
             speech-datasets/earnings21/media/4359971.mp3 \
             speech-datasets/earnings21/media/4365024.mp3 \
             speech-datasets/earnings21/media/4366522.mp3 \
             speech-datasets/earnings21/media/4366893.mp3 \
             speech-datasets/earnings21/media/4367535.mp3 \
             speech-datasets/earnings21/media/4383161.mp3 \
             speech-datasets/earnings21/media/4384964.mp3 \
             speech-datasets/earnings21/media/4387332.mp3
else
METADATA_CSV = speech-datasets/earnings21/earnings21-file-metadata.csv
AUDIO_SRCS = $(sort $(wildcard speech-datasets/earnings21/media/*.mp3))
endif

TRANS_TXTS = $(addsuffix .txt, $(AUDIO_SRCS))

# We output the evaluation result to this file.
DONE = $(WHISPER_MODEL).txt

all: $(DONE)

$(DONE): $(TRANS_TXTS)
	$(PYTHON) eval.py $(METADATA_CSV) > $@.tmp
	mv $@.tmp $@

# Note: This task writes to a temporary file first to
# create the target file atomically.
%.mp3.txt: %.mp3
	$(WHISPER_CLI) $(WHISPER_FLAGS) --model $(WHISPER_PREFIX)models/ggml-$(WHISPER_MODEL).bin --file $^ --output-file $^.tmp
	mv $^.tmp.txt $^.txt

archive:
	tar -czf $(WHISPER_MODEL).tar.gz --exclude="*.mp3" speech-datasets/earnings21/media $(DONE)

clean:
	@rm -f $(TRANS_TXTS)
	@rm -f $(DONE)

.PHONY: all archive clean
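Both rules in eval.mk write to a temporary name first and `mv` the result into place, so an interrupted transcription never leaves a truncated `.txt` that a later `make` would mistake for a finished target. The same write-then-rename idiom can be sketched in Python (a minimal sketch; `write_atomically` is a hypothetical helper, not part of the test suite):

```python
import os
import tempfile

def write_atomically(path: str, text: str) -> None:
    # Write to a temporary file in the destination directory, then
    # rename it over the target. os.replace() is atomic on POSIX when
    # source and destination are on the same filesystem, so readers
    # never observe a half-written file.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as fp:
            fp.write(text)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # clean up the partial file on any failure
        raise
```

Creating the temporary file in the same directory as the target (rather than in `/tmp`) is what keeps the final rename a same-filesystem, atomic operation.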

tests/earnings21/eval.py

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
import os
import sys
import glob
import jiwer
from normalizers import EnglishTextNormalizer

def decode_hypothesis(b):
    try:
        # Depending on the platform, Whisper can emit a left double quotation
        # mark as 0x93, which is Microsoft's extension to ASCII. See #3185
        # for the background.
        return b.decode('windows-1252')
    except UnicodeDecodeError:
        return b.decode('utf-8', errors='ignore')

def get_reference():
    ref = {}
    for path in glob.glob("speech-datasets/earnings21/transcripts/nlp_references/*.nlp"):
        code = os.path.basename(path).replace(".nlp", "")
        buf = []
        with open(path) as fp:
            fp.readline()  # skip the header line
            for line in fp:
                token = line.split("|", maxsplit=1)[0]
                buf.append(token)
        ref[code] = " ".join(buf)
    return ref

def get_hypothesis():
    hyp = {}
    for path in glob.glob("speech-datasets/earnings21/media/*.mp3.txt"):
        with open(path, 'rb') as fp:
            text = decode_hypothesis(fp.read()).strip()
        code = os.path.basename(path).replace(".mp3.txt", "")
        hyp[code] = text
    return hyp

def get_codes(metadata_csv):
    codes = []
    with open(metadata_csv) as fp:
        fp.readline()  # skip the CSV header
        for line in fp:
            codes.append(line.split(",")[0])
    return sorted(codes)

def main():
    if len(sys.argv) < 2:
        print("Usage: %s METADATA_CSV" % sys.argv[0], file=sys.stderr)
        return 1

    metadata_csv = sys.argv[1]
    normalizer = EnglishTextNormalizer()

    ref_orig = get_reference()
    hyp_orig = get_hypothesis()

    ref_clean = []
    hyp_clean = []

    for code in get_codes(metadata_csv):
        ref_clean.append(normalizer(ref_orig[code]))
        hyp_clean.append(normalizer(hyp_orig[code]))

    wer = jiwer.wer(ref_clean, hyp_clean)
    print(f"WER: {wer * 100:.2f}%")

if __name__ == "__main__":
    sys.exit(main())
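The Windows-1252 fallback in `decode_hypothesis` can be exercised in isolation. Byte 0x93 is an invalid UTF-8 lead byte but maps to a left double quotation mark (U+201C) in Windows-1252, while a handful of bytes (e.g. 0x81) are undefined even in Windows-1252, which is what triggers the lenient UTF-8 path. A standalone sketch mirroring the function:

```python
def decode_hypothesis(b: bytes) -> str:
    # Prefer Windows-1252, which covers bytes like 0x93/0x94
    # ("smart" quotes); fall back to lenient UTF-8 for the few
    # bytes that Windows-1252 leaves undefined.
    try:
        return b.decode('windows-1252')
    except UnicodeDecodeError:
        return b.decode('utf-8', errors='ignore')

print(decode_hypothesis(b'\x93quote\x94'))  # "quote" with curly quotes
```

Note the trade-off: valid UTF-8 multi-byte sequences would also decode "successfully" as Windows-1252 mojibake, but for this suite the text is normalized afterwards (punctuation stripped), so the WER figure is unaffected.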

tests/earnings21/normalizers/LICENSE

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
Code in this directory is adapted from OpenAI Whisper project
(https://github.com/openai/whisper) and carries the following
copyright and license.

MIT License

Copyright (c) 2022 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
tests/earnings21/normalizers/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
from .basic import BasicTextNormalizer as BasicTextNormalizer
from .english import EnglishTextNormalizer as EnglishTextNormalizer

tests/earnings21/normalizers/basic.py

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
import re
import unicodedata

import regex

# non-ASCII letters that are not separated by "NFKD" normalization
ADDITIONAL_DIACRITICS = {
    "œ": "oe",
    "Œ": "OE",
    "ø": "o",
    "Ø": "O",
    "æ": "ae",
    "Æ": "AE",
    "ß": "ss",
    "ẞ": "SS",
    "đ": "d",
    "Đ": "D",
    "ð": "d",
    "Ð": "D",
    "þ": "th",
    "Þ": "th",
    "ł": "l",
    "Ł": "L",
}


def remove_symbols_and_diacritics(s: str, keep=""):
    """
    Replace any other markers, symbols, and punctuations with a space,
    and drop any diacritics (category 'Mn' and some manual mappings)
    """
    return "".join(
        (
            c
            if c in keep
            else (
                ADDITIONAL_DIACRITICS[c]
                if c in ADDITIONAL_DIACRITICS
                else (
                    ""
                    if unicodedata.category(c) == "Mn"
                    else " " if unicodedata.category(c)[0] in "MSP" else c
                )
            )
        )
        for c in unicodedata.normalize("NFKD", s)
    )


def remove_symbols(s: str):
    """
    Replace any other markers, symbols, punctuations with a space, keeping diacritics
    """
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


class BasicTextNormalizer:
    def __init__(self, remove_diacritics: bool = False, split_letters: bool = False):
        self.clean = (
            remove_symbols_and_diacritics if remove_diacritics else remove_symbols
        )
        self.split_letters = split_letters

    def __call__(self, s: str):
        s = s.lower()
        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # remove words between brackets
        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parenthesis
        s = self.clean(s).lower()

        if self.split_letters:
            s = " ".join(regex.findall(r"\X", s, regex.U))

        s = re.sub(
            r"\s+", " ", s
        )  # replace any successive whitespace characters with a space

        return s
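The normalization pipeline above can be condensed into a standalone sketch for experimentation: the same two regex passes plus the `remove_symbols` cleaner, with a final `strip()` added for tidy output (diacritic removal and `split_letters` are left out):

```python
import re
import unicodedata

def remove_symbols(s: str) -> str:
    # Replace marks, symbols, and punctuation (Unicode categories
    # M, S, P) with a space, keeping diacritics intact.
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )

def normalize(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # drop [bracketed] annotations
    s = re.sub(r"\(([^)]+?)\)", "", s)       # drop (parenthesized) asides
    s = remove_symbols(s)
    return re.sub(r"\s+", " ", s).strip()    # collapse whitespace

print(normalize("Hello, [noise] WORLD!"))  # -> hello world
```

Because both reference and hypothesis pass through the same normalizer before scoring, differences in casing, punctuation, and non-speech annotations do not inflate the WER.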
