
Mis-leading whisper-bench (Now with more Macs) #3139

Open · peardox opened this issue May 10, 2025 · 10 comments

Comments

peardox commented May 10, 2025

The numbers returned by whisper-bench are misleading

I've got a Mac Mini M4 256G (the cheap one) and a Lenovo laptop with a 4070 GPU in it. I usually use the laptop in Hybrid mode (Performance mode gets noisy); the Mac, of course, is silent all the time anyway.

I've been relying on whisper-bench to indicate which device is faster. Thing is, it's wrong (or, more accurately, rather misleading).

I just got the M4 and my laptop on a fairly equal footing when it comes to whisper.cpp facilities: both have OpenVINO available, and one has the M4 with OpenCL while the other has CUDA.

I'd always thought that the Mac was slower than the laptop - well, until I drag-raced them against each other via the CLI with a 35-minute public-domain recording of Aladdin and the Magic Lamp.

The figures come out like this (all tests use the medium.en model)...

whisper-bench (total runtime in seconds)

PC (Perf Mode) = 3.694
PC (Hybrid Mode) = 3.721
Mac M4 Mini 256G = 6.929

So the laptop's performance mode is 187.6% the speed of the M4?

Or is it?

Next I ran whisper-cli over the recording of Aladdin and the Magic Lamp (35m 06s, mp3 @ 64k, 16M file size, 5379 words).

whisper-cli (total runtime in seconds + words transcribed per minute)

Mac M4 Mini 256G = 172.623 = 1869.623 wpm
PC (Perf Mode) = 186.730 = 1728.378 wpm
PC (Hybrid Mode) = 202.850 = 1591.027 wpm

Now the Mac is 108% the speed of the Laptop in Performance Mode
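
For reference, the wpm figures here are just the transcript word count divided by the runtime in minutes; a quick sketch of the arithmetic (word count, duration, and runtimes as reported above):

# Derive words-per-minute and real-time factor from the reported numbers
WORDS = 5379                  # words in the Aladdin transcript
AUDIO_SECS = 35 * 60 + 6      # 35m 06s recording
runtimes = {
    "Mac Mini M4": 172.623,   # total runtime in seconds
    "PC (Perf Mode)": 186.730,
    "PC (Hybrid Mode)": 202.850,
}
for name, secs in runtimes.items():
    wpm = WORDS / (secs / 60)
    rtf = AUDIO_SECS / secs   # how many times faster than real time
    print(f"{name}: {wpm:.3f} wpm, {rtf:.1f}x real time")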

In the real world of course we'd be using whisper.cpp for things more like the whisper-cli test

If anyone wants to run the same test against their setup you can find the audio file here (Archive.org)

Mac was 1/3rd the price of the Laptop (but the Laptop plays better games)

Suppose I'd better switch to Linux on the Laptop and try the same test there next...

peardox commented May 11, 2025

Linux results create a new winner

Bench : PC (Perf Linux) = 2.975
CLI : PC (Perf Linux) = 126.425 = 2552.817 wpm

So the same PC gets 25% and 47% faster in these two tests vs Win11.

Admittedly the laptop is in this case running Linux off an external USB drive, so the overall experience is a tad clunky, but this won't affect the runtimes.

peardox commented May 14, 2025

Rented some Mac Minis today, so I now have a comparison of the M1, M2, and M2 Pro to add to the list, and a new 2nd-place holder.
New 1st place (17/05/25):

Platform                              Timing (s)   Words Per Min
PC (Perf Vulkan)   32G 8C/16T/4070M      93.005    3470.136 wpm
Mac Mini M4 Pro    64G 14C/20G/16N      118.842    2715.707 wpm
PC (Perf Linux)    32G 8C/16T/4070M     126.425    2552.817 wpm
Mac Mini M2 Pro    16G 10C/16G/16N      156.521    2061.960 wpm
Mac Mini M4        16G 10C/10G/16N      172.623    1869.623 wpm
PC (Perf Mode)     32G 8C/16T/4070M     186.730    1728.378 wpm
PC (Hybrid Mode)   32G 8C/16T/4070M     202.850    1591.027 wpm
Mac Mini M2        16G 8C/10G/16N       223.157    1446.246 wpm
Mac Mini M1         8G 8C/ 8G/16N       357.103     903.772 wpm
Pi5                 8G                 4915.470      65.658 wpm

And even the last-place finisher (the Pi 5) can type faster than me.

The world record typing speed is 360 wpm :)

peardox changed the title from "Mis-leading whisper-bench" to "Mis-leading whisper-bench (Now with more Macs)" on May 14, 2025
kth8 commented May 15, 2025

Thought I'd share my results. I performed three tests on a base M1 MBA, first using whisper-cli:

import time
import subprocess

# Time a full whisper-cli run (note: this includes process startup and model load)
start_time = time.time()
subprocess.run(
    ["whisper-cli", "-m", "ggml-medium.en.bin", "shortstory034_aladdinandthemagiclamp_llf_64kb.mp3"],
    capture_output=True,
    check=True,
)
print(time.time() - start_time)

Second using Whisper with mlx-audio:

import time
from mlx_audio.stt.utils import load_model

# Model load happens before the timer starts, unlike the whisper-cli test
stt_model = load_model("mlx-community/whisper-medium.en-mlx")
start_time = time.time()
result = stt_model.generate("shortstory034_aladdinandthemagiclamp_llf_64kb.mp3")
print(time.time() - start_time)

Third using Parakeet with mlx-audio:

import time
from mlx_audio.stt.models.parakeet import Model as ParakeetSTTModel

# Model load happens before the timer starts here too
stt_model = ParakeetSTTModel.from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
start_time = time.time()
result = stt_model.generate(
    "shortstory034_aladdinandthemagiclamp_llf_64kb.mp3",
    chunk_duration=60,     # process the audio in 60-second chunks
    overlap_duration=5,    # with 5 seconds of overlap between chunks
)
print(time.time() - start_time)

Results: whisper-cli seems like it wasn't able to fully utilize the GPU, which is why it was so slow. parakeet-tdt-0.6b-v2, being much faster as well as rank 1 on the Open ASR Leaderboard, seems like the clear winner to me.
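
(Worth noting when comparing these numbers: the whisper-cli test times the whole process, model load included, while the two mlx-audio tests start the clock only after the model is loaded. A minimal sketch that separates the two phases, using the same mlx_audio calls as above:)

import time
from mlx_audio.stt.utils import load_model

# Time model load and transcription separately for a fairer comparison
t0 = time.perf_counter()
stt_model = load_model("mlx-community/whisper-medium.en-mlx")
t1 = time.perf_counter()
result = stt_model.generate("shortstory034_aladdinandthemagiclamp_llf_64kb.mp3")
t2 = time.perf_counter()
print(f"load: {t1 - t0:.3f}s  generate: {t2 - t1:.3f}s  total: {t2 - t0:.3f}s")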

@tennyson-mccalla

Have any of you done any comparisons with WhisperKit?

peardox commented May 15, 2025

Not me - that one is too platform-specific for my liking.

@tennyson-mccalla

Fair enough.

peardox commented May 15, 2025

It'd be interesting to hear from @kth8 why the speeds are so different when they're doing the same thing.

e.g. is it exactly the same? The whisper-cli version, for example, outputs all the text while grabbing all the tokens.

kth8 commented May 15, 2025

Another option I've been using is Google Gemini 2.5. Considering the potato specs of my MBA, it's the fastest option, and it can handle complex system prompts with diarization.

import os
import time
from google import genai

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

prompt = """
You are a dedicated, expert, and highly precise audio transcription model.
Your sole function is to provide exceptionally accurate, verbatim transcriptions from audio input.

Your transcription must strictly adhere to the following guidelines:

*   **Verbatim Accuracy:**
    *   Capture *every* spoken word exactly as uttered, including hesitations (e.g., "um," "uh"), filler words, false starts, self-corrections, and repetitions.
    *   Do not omit, paraphrase, or "correct" spoken imperfections into cleaner language. The goal is a faithful representation of the raw speech.

*   **Standard Formatting & Readability:**
    *   Apply correct English spelling, grammar, punctuation (periods, commas, question marks, etc.), and capitalization as appropriate for standard written text.
    *   Ensure the output is formatted as clean, readable prose, with sentences and paragraphs structured naturally.

*   **Diarization:**
    *   Required only when there are 2 or more speakers present.
    *   Clearly attribute speech to the correct speaker on a new line for each speaker turn.
    *   Attempt to determine speaker names from the audio context (e.g., if they introduce themselves or are referred to by name). Use these identified names consistently.
    *   If names cannot be reliably determined for one or more speakers, use distinct, consistent placeholder labels for each unique unknown speaker: "Speaker 1", "Speaker 2", "Speaker 3", etc. Assign these sequentially as new unknown speakers are identified.
*   **Format for speaker turns (with an empty line separating subsequent turns):**
        `SPEAKER_NAME_OR_LABEL: Speech text here.`

        Example with identified name: `John Doe: Hello, this is my statement.`

        Example with placeholder: `Speaker 1: And I agree with that.`

        Example subsequent turn: `Jane Smith: Following up on John's point...`

        Example with another placeholder: `Speaker 2: I have a question.`

*   **Pure Transcript Output:**
    *   The final output must consist *exclusively* of the transcribed text (with diarization labels formatted as specified above, if applicable).
    *   **Absolutely NO:**
        *   Speaker tag when there is only 1 speaker present.
        *   Timestamps of any kind (e.g., `[00:00:00.000 --> 00:00:05.000]`).
        *   Confidence scores or any other processing metadata.
        *   Editor's notes, comments, or annotations within the text (e.g., `[unintelligible]`, `[laughter]`, `[pause]`). If a word is truly unintelligible, transcribe your best guess or omit if absolutely impossible, but do not insert an annotation.
        *   Introductory or concluding phrases from you, the model (e.g., "Here is the transcription:", "Transcription complete.", "I hope this helps!").
        *   Summaries, analyses, or any content other than the direct transcription.
""".strip()

start_time = time.time()
# Upload the audio file, run the model against it, then clean up the upload
audio_file = client.files.upload(file="shortstory034_aladdinandthemagiclamp_llf_64kb.mp3")
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17", contents=[prompt, audio_file]
)
client.files.delete(name=audio_file.name)
# print(response.text)
print(time.time() - start_time)

kth8 commented May 16, 2025

It'd be interesting to hear from @kth8 why the speeds are so different when they're doing the same thing.

e.g. is it exactly the same? The whisper-cli version, for example, outputs all the text while grabbing all the tokens.

Using whisper-cli I see GPU usage bounce between 75-100% instead of being pinned. whisper-server has slightly higher usage, between 90-100%. The mlx-audio library seems better optimized if you can use it on a Mac.

peardox commented May 16, 2025

The mlx-audio library seems better optimized if you can use it on a Mac.

I've yet to delve into the audio side. whisper-bench tends to go for SDL-related stuff on initial examination. There's notionally FFmpeg for Linux, but I want it portable...

Anyway - a new winner (though obviously not for long) for my version(s).

After further investigation I got mine down to 93 secs under Windows (the lower figures got me into a frenzy of benchmarking).

The full output after the end of the Aladdin transcription is now...

whisper_print_timings:     load time =  1300.74 ms
whisper_print_timings:     fallbacks =     0 p /     0 h
whisper_print_timings:      mel time =   821.07 ms
whisper_print_timings:   sample time = 10798.19 ms / 37309 runs (   0.29 ms per run)
whisper_print_timings:   encode time = 10380.34 ms /    79 runs ( 131.40 ms per run)
whisper_print_timings:   decode time =   387.50 ms /    51 runs (   7.60 ms per run)
whisper_print_timings:   batchd time = 65441.72 ms / 36864 runs (   1.78 ms per run)
whisper_print_timings:   prompt time =  2064.51 ms / 17311 runs (   0.12 ms per run)
whisper_print_timings:    total time = 92.920 s
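
(A side observation: nearly all of that wall time is the batchd stage - about 65 of the 93 seconds. A tiny sketch that sums the per-stage figures from the block above:)

import re

# The per-stage lines from the whisper_print_timings block above
timings_text = """
whisper_print_timings:     load time =  1300.74 ms
whisper_print_timings:      mel time =   821.07 ms
whisper_print_timings:   sample time = 10798.19 ms / 37309 runs
whisper_print_timings:   encode time = 10380.34 ms /    79 runs
whisper_print_timings:   decode time =   387.50 ms /    51 runs
whisper_print_timings:   batchd time = 65441.72 ms / 36864 runs
whisper_print_timings:   prompt time =  2064.51 ms / 17311 runs
"""

# Extract "<stage> time = <N> ms" pairs and print them largest first
stages = {m[0]: float(m[1]) for m in re.findall(r"(\w+) time =\s*([\d.]+) ms", timings_text)}
for stage, ms in sorted(stages.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>8}: {ms / 1000:8.2f} s")
print(f"     sum: {sum(stages.values()) / 1000:8.2f} s of the 92.920 s total")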

Linux should shave some more off that as well (but I need to rebuild my external Linux drive, so that's a "coming soon").

The trick was rather unintuitive. DON'T use CUDA / BLAS / OpenVINO. Stick to pure Vulkan with CPU for backup.

That's pretty portable (but not perfect) for the non-Mac world.
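
For anyone wanting to try the same thing, here's a minimal sketch of a Vulkan-only build, driven from Python for consistency with the snippets above (GGML_VULKAN is the flag documented in the whisper.cpp README; the build directory name is just a convention):

import subprocess

# Configure whisper.cpp with the Vulkan backend (no CUDA/BLAS/OpenVINO flags)
subprocess.run(["cmake", "-B", "build", "-DGGML_VULKAN=1"], check=True)
# Build in Release mode using all available cores
subprocess.run(["cmake", "--build", "build", "-j", "--config", "Release"], check=True)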

A few things are high on the let's-explore list now. GPU support may be a nice option for OpenVINO (only CPU is easily available AFAIK).

And, of course, playing with the audio is ripe for more speed increases. The sample rate on the test audio is 22.5 kHz, which must be producing way more input data than needed (bright thought: I'll resample to 8k and see what happens).
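
(One hedge on that idea: Whisper's front end works on 16 kHz mono audio, so whisper-cli resamples whatever it's fed to 16 kHz anyway; dropping below that shouldn't shrink the model's input. A minimal resampling sketch via ffmpeg, assuming ffmpeg is installed and using the file name from earlier:)

import subprocess

# Resample the test MP3 to 16 kHz mono WAV - the rate Whisper models consume natively
subprocess.run(
    [
        "ffmpeg", "-i", "shortstory034_aladdinandthemagiclamp_llf_64kb.mp3",
        "-ar", "16000",   # target sample rate
        "-ac", "1",       # mono
        "aladdin_16k.wav",
    ],
    check=True,
)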

I've got a really junky old laptop that should prove useful for non-CUDA testing. My fork of the project lets me select which backends to use, which is also a great benefit.
