@sangbumlikeagod (Contributor) commented Sep 4, 2025

Purpose

Currently, vLLM does not support the timestamp feature for Whisper that the original Whisper API provides, even though there is significant demand for it, as evidenced by discussions such as #15012 (comment).

To address this, this PR implements support for segment-level timestamps.

Test Plan

  1. Run a vLLM server with a Whisper model loaded.
  2. Send a request to the v1/audio/transcriptions API with response_format set to "verbose_json".
  3. Inspect the result.
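
For reference, the same flow can be exercised with the OpenAI Python client instead of curl. This is a minimal sketch, assuming a vLLM server listening on localhost:8000; with verbose_json the client returns the segments shown in the results below.

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is required but unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=f,
        language="en",
        temperature=0.0,
        response_format="verbose_json",
    )

for segment in result.segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")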

Test Result

transcription

request

curl --location 'http://175.123.89.199:8051/v1/audio/transcriptions' \
--header 'Authorization: ••••••' \
--form 'model="openai/whisper-large-v3"' \
--form 'language="en"' \
--form 'response_format="verbose_json"' \
--form 'temperature="0"' \
--form 'file=@"/C:/Users/audio/audio.wav"'

result

{
    "duration": "17.653125",
    "language": "en",
    "text": " *OUTRO MUSIC* Hey, who remembers PlayStation All-Stars Battle Royale? Oh Jesus. Sorry, I should have given you more warning. Here, use this bucket to collect your",
    "segments": [
        {
            "id": 42,
            "avg_logprob": -1.0,
            "compression_ratio": -1.0,
            "end": 11.540000000000001,
            "no_speech_prob": -1.0,
            "seek": 0,
            "start": 0.0,
            "temperature": -1.0,
            "text": " *OUTRO MUSIC*",
            "tokens": [
                1853,
                27276,
                7142,
                16924,
                9
            ]
        },
        {
            "id": 42,
            "avg_logprob": -1.0,
            "compression_ratio": -1.0,
            "end": 17.66,
            "no_speech_prob": -1.0,
            "seek": 0,
            "start": 11.540000000000001,
            "temperature": -1.0,
            "text": " Hey, who remembers PlayStation All-Stars Battle Royale? Oh Jesus. Sorry, I should have given you more warning. Here, use this bucket to collect your",
            "tokens": [
                1911,
                11,
                567,
                26228,
                20599,
                1057,
                12,
                4520,
                685,
                11846,
                8751,
                1220,
                30,
                876,
                2705,
                13,
                4919,
                11,
                286,
                820,
                362,
                2212,
                291,
                544,
                9164,
                13,
                1692,
                11,
                764,
                341,
                13058,
                281,
                2500,
                428
            ]
        }
    ],
    "words": null
}
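
As an aside, the segments array above is enough to build subtitles. A minimal sketch that converts a parsed verbose_json response body into SRT (the helper names here are illustrative, not part of this PR):

import json

def to_srt_timestamp(seconds: float) -> str:
    # SRT expects HH:MM:SS,mmm.
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(response_body: str) -> str:
    segments = json.loads(response_body)["segments"]
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)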

translation

request

curl --location 'http://175.123.89.199:8051/v1/audio/translations' \
--header 'Authorization: Bearer EMPTY' \
--form 'model="openai/whisper-large-v3"' \
--form 'response_format="verbose_json"' \
--form 'temperature="0"' \
--form 'file=@"/C:/Users/audio/korean_audio.mp3"'

response

{
    "text": " Give me a gift for my 7-day subscription.",
    "language": null,
    "duration": "4.336375",
    "segments": [
        {
            "id": 12,
            "avg_logprob": -1.0,
            "compression_ratio": -1.0,
            "end": -1007.0600000000001,
            "no_speech_prob": -1.0,
            "seek": 0,
            "start": 0.0,
            "temperature": -1.0,
            "text": " Give me a gift for my 7-day subscription.",
            "tokens": [
                5303,
                385,
                257,
                5306,
                337,
                452,
                1614,
                12,
                810,
                17231,
                13
            ]
        }
    ]
}

transcription in other languages (Korean in this case)

request

curl --location 'http://175.123.89.199:8051/v1/audio/transcriptions' \
--header 'Authorization: ••••••' \
--form 'language="ko"' \
--form 'temperature="0"' \
--form 'model="openai/whisper-large-v3"' \
--form 'response_format="verbose_json"' \
--form 'file=@"/C:/Users/audio/korean_audio.mp3"'

response

{
    "duration": "4.336375",
    "language": "ko",
    "text": " 구독권 칠일치 에이한테 선물해줘.",
    "segments": [
        {
            "id": 14,
            "avg_logprob": -1.0,
            "compression_ratio": -1.0,
            "end": -1007.02,
            "no_speech_prob": -1.0,
            "seek": 0,
            "start": 0.0,
            "temperature": -1.0,
            "text": " 구독권 칠일치 에이한테 선물해줘.",
            "tokens": [
                32800,
                23605,
                6639,
                254,
                6403,
                8464,
                20122,
                1129,
                15863,
                44956,
                44487,
                246,
                13
            ]
        }
    ],
    "words": null
}

Note

There's a lingering issue in this PR: even if we remove the '<|notimestamps|>' token from the decoder prompt, the model still predicts it.

The most straightforward fix is a logits processor that sets the probability of '<|notimestamps|>' to negative infinity during inference; Hugging Face's Transformers library uses a similar approach.
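
For context, this is the shape that suppression takes. A minimal sketch, assuming the token id has been looked up from the tokenizer; the function here is illustrative, not vLLM's actual API:

import torch

def suppress_notimestamps(logits: torch.Tensor, notimestamps_id: int) -> torch.Tensor:
    # Pushing the logit to -inf zeroes the token's probability after
    # softmax, so sampling can only pick timestamp tokens instead.
    logits[..., notimestamps_id] = float("-inf")
    return logits

# The id would be resolved once from the tokenizer, e.g.:
# notimestamps_id = tokenizer.convert_tokens_to_ids("<|notimestamps|>")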

However, vLLM's multi-step process prevents us from applying such logits processing during inference, so I used an indirect approach.

I replaced '<|notimestamps|>' with '<|0.00|>' in the decoder prompt, which primes the model to generate the remaining timestamp tokens. This approach might cause issues, but I tested it multiple times and it appears to work well. Additionally, timestamps are only produced in 'verbose_json' mode, the new response format introduced in this PR.

Therefore, it should not affect current users, which is why I submitted the PR despite this issue. A sketch of the substitution follows.
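
Concretely, the workaround is a prompt-level substitution rather than a logit mask. The idea, using the standard Whisper decoder prompt format (the exact prompt construction in this PR may differ):

# Decoder prompt when timestamps are disabled:
#   <|startoftranscript|><|en|><|transcribe|><|notimestamps|>
prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"

# Swapping the suppressor for an initial timestamp token primes the
# model to keep emitting timestamp tokens for every segment:
prompt = prompt.replace("<|notimestamps|>", "<|0.00|>")
# -> "<|startoftranscript|><|en|><|transcribe|><|0.00|>"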

Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds the 'verbose_json' response format to Whisper transcription and translation, a valuable feature for providing segment-level timestamps. However, the current implementation contains several critical and high-severity bugs that must be addressed: the conditional logic for selecting the response class in translations is inverted, which will cause incorrect behavior; the new function for creating segments, _get_verbose_segments, is called with incorrect arguments, which will lead to a runtime error; and this function also has logic errors in segment ID assignment and timestamp calculation. These issues need to be resolved to ensure the feature works correctly.

@mr8bit commented Sep 16, 2025

Hi! When I run your PR using the command below:

vllm serve \
  --host 0.0.0.0 \
  --port 8080 \
  /models/openai/whisper-large-v3 \
  --served-model-name openai/whisper-large-v3 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --enable-prompt-tokens-details \
  --enable-force-include-usage

I get the following error when requesting response_format=verbose_json:

curl --location 'http://localhost:8080/v1/audio/transcriptions' \
--form 'file=@"./audio.mp3"' \
--form 'model="openai/whisper-large-v3"' \
--form 'response_format="verbose_json"' \
--form 'temperature="0"'

{
    "error": {
        "message": "can only concatenate tuple (not \"list\") to tuple",
        "type": "Internal Server Error",
        "param": null,
        "code": 500
    }
}

Could you advise what I might be doing wrong?

@sangbumlikeagod (Contributor, Author) commented:

Hello @mr8bit,

Sorry for the mistake: I accidentally dropped the cast while fixing the mypy issue.
I've now updated and committed the changes. Could you please test it again?
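
For anyone hitting the same trace: the error comes from Python refusing to concatenate a tuple and a list directly. A minimal reproduction of this class of bug and the cast that fixes it (illustrative values, not the PR's actual code):

prompt_ids = (50258, 50259)  # existing token ids as a tuple
extra_ids = [50359, 50364]   # ids to append, produced as a list

# prompt_ids + extra_ids  # TypeError: can only concatenate tuple (not "list") to tuple
prompt_ids = prompt_ids + tuple(extra_ids)  # explicit cast restores the concatenation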

@mr8bit commented Sep 23, 2025

@sangbumlikeagod
Yes, everything works, thank you very much.

Do you plan to add timestamp_granularities for words?

@sangbumlikeagod (Contributor, Author) commented:

@mr8bit
Yes, I do. Once this PR gets merged, I'll follow up with more PRs to fully support the timestamp feature.

@NickLucche (Collaborator) commented:

Apologies for the delay, I missed the notification on this one. Will try to take a look asap

@sangbumlikeagod (Contributor, Author) commented:

@NickLucche Thanks a lot! Please feel free to suggest any improvements; I'll put them in right away.
