@sangbumlikeagod (Contributor) commented Sep 4, 2025

Purpose

Currently, vLLM does not support the timestamp feature for Whisper that the original Whisper API provides, even though there is significant demand for it, as evidenced by discussions such as #15012 (comment).

To address this, this PR implements support for segment-level timestamps.

Test Plan

  1. Run a vLLM server with a Whisper model loaded.
  2. Send a request to the v1/audio/transcriptions API with response_format set to "verbose_json".
  3. Inspect the result.
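
For reference, the same flow can be exercised with the OpenAI Python client instead of curl. This is a minimal sketch, assuming a vLLM server listening on localhost:8000; with verbose_json the client returns the segments shown in the results below.

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is required but unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=f,
        language="en",
        temperature=0.0,
        response_format="verbose_json",
    )

for segment in result.segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")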

Test Result

transcription

request

curl --location 'http://175.123.89.199:8051/v1/audio/transcriptions' \
--header 'Authorization: ••••••' \
--form 'model="openai/whisper-large-v3"' \
--form 'language="en"' \
--form 'response_format="verbose_json"' \
--form 'temperature="0"' \
--form 'file=@"/C:/Users/audio/audio.wav"'

result

{
    "duration": "17.653125",
    "language": "en",
    "text": " *OUTRO MUSIC* Hey, who remembers PlayStation All-Stars Battle Royale? Oh Jesus. Sorry, I should have given you more warning. Here, use this bucket to collect your",
    "segments": [
        {
            "id": 42,
            "avg_logprob": -1.0,
            "compression_ratio": -1.0,
            "end": 11.540000000000001,
            "no_speech_prob": -1.0,
            "seek": 0,
            "start": 0.0,
            "temperature": -1.0,
            "text": " *OUTRO MUSIC*",
            "tokens": [
                1853,
                27276,
                7142,
                16924,
                9
            ]
        },
        {
            "id": 42,
            "avg_logprob": -1.0,
            "compression_ratio": -1.0,
            "end": 17.66,
            "no_speech_prob": -1.0,
            "seek": 0,
            "start": 11.540000000000001,
            "temperature": -1.0,
            "text": " Hey, who remembers PlayStation All-Stars Battle Royale? Oh Jesus. Sorry, I should have given you more warning. Here, use this bucket to collect your",
            "tokens": [
                1911,
                11,
                567,
                26228,
                20599,
                1057,
                12,
                4520,
                685,
                11846,
                8751,
                1220,
                30,
                876,
                2705,
                13,
                4919,
                11,
                286,
                820,
                362,
                2212,
                291,
                544,
                9164,
                13,
                1692,
                11,
                764,
                341,
                13058,
                281,
                2500,
                428
            ]
        }
    ],
    "words": null
}
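
As an aside, the segments array above is enough to build subtitles. A minimal sketch that converts a parsed verbose_json response body into SRT (the helper names here are illustrative, not part of this PR):

import json

def to_srt_timestamp(seconds: float) -> str:
    # SRT expects HH:MM:SS,mmm.
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(response_body: str) -> str:
    segments = json.loads(response_body)["segments"]
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)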

translation

request

curl --location 'http://175.123.89.199:8051/v1/audio/translations' \
--header 'Authorization: Bearer EMPTY' \
--form 'model="openai/whisper-large-v3"' \
--form 'response_format="verbose_json"' \
--form 'temperature="0"' \
--form 'file=@"/C:/Users/audio/korean_audio.mp3"'

response

{
    "text": " Give me a gift for my 7-day subscription.",
    "language": null,
    "duration": "4.336375",
    "segments": [
        {
            "id": 12,
            "avg_logprob": -1.0,
            "compression_ratio": -1.0,
            "end": -1007.0600000000001,
            "no_speech_prob": -1.0,
            "seek": 0,
            "start": 0.0,
            "temperature": -1.0,
            "text": " Give me a gift for my 7-day subscription.",
            "tokens": [
                5303,
                385,
                257,
                5306,
                337,
                452,
                1614,
                12,
                810,
                17231,
                13
            ]
        }
    ]
}

transcription in other languages (Korean in this case)

request

curl --location 'http://175.123.89.199:8051/v1/audio/transcriptions' \
--header 'Authorization: ••••••' \
--form 'language="ko"' \
--form 'temperature="0"' \
--form 'model="openai/whisper-large-v3"' \
--form 'response_format="verbose_json"' \
--form 'file=@"/C:/Users/audio/korean_audio.mp3"'

response

{
    "duration": "4.336375",
    "language": "ko",
    "text": " 구독권 칠일치 에이한테 선물해줘.",
    "segments": [
        {
            "id": 14,
            "avg_logprob": -1.0,
            "compression_ratio": -1.0,
            "end": -1007.02,
            "no_speech_prob": -1.0,
            "seek": 0,
            "start": 0.0,
            "temperature": -1.0,
            "text": " 구독권 칠일치 에이한테 선물해줘.",
            "tokens": [
                32800,
                23605,
                6639,
                254,
                6403,
                8464,
                20122,
                1129,
                15863,
                44956,
                44487,
                246,
                13
            ]
        }
    ],
    "words": null
}

Note

There's a lingering issue in this PR: even if we remove the '<|notimestamps|>' token from the decoder prompt, the model still predicts it.

The most straightforward fix is a logits processor that sets the probability of '<|notimestamps|>' to negative infinity during inference; Hugging Face's Transformers library uses a similar approach.
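
For context, this is the shape that suppression takes. A minimal sketch, assuming the token id has been looked up from the tokenizer; the function here is illustrative, not vLLM's actual API:

import torch

def suppress_notimestamps(logits: torch.Tensor, notimestamps_id: int) -> torch.Tensor:
    # Pushing the logit to -inf zeroes the token's probability after
    # softmax, so sampling can only pick timestamp tokens instead.
    logits[..., notimestamps_id] = float("-inf")
    return logits

# The id would be resolved once from the tokenizer, e.g.:
# notimestamps_id = tokenizer.convert_tokens_to_ids("<|notimestamps|>")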

However, vLLM's multi-step process prevents us from applying such logits processing during inference, so I used an indirect approach.

I replaced '<|notimestamps|>' with '<|0.00|>' in the decoder prompt, which primes the model to generate the remaining timestamp tokens. This approach might cause issues, but I tested it multiple times and it appears to work well. Additionally, timestamps are only produced in 'verbose_json' mode, the new response format introduced in this PR.

Therefore, it should not affect current users, which is why I submitted the PR despite this issue. A sketch of the substitution follows.
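
Concretely, the workaround is a prompt-level substitution rather than a logit mask. The idea, using the standard Whisper decoder prompt format (the exact prompt construction in this PR may differ):

# Decoder prompt when timestamps are disabled:
#   <|startoftranscript|><|en|><|transcribe|><|notimestamps|>
prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"

# Swapping the suppressor for an initial timestamp token primes the
# model to keep emitting timestamp tokens for every segment:
prompt = prompt.replace("<|notimestamps|>", "<|0.00|>")
# -> "<|startoftranscript|><|en|><|transcribe|><|0.00|>"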

Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds the 'verbose_json' response format to Whisper transcription and translation, a valuable feature for providing segment-level timestamps. However, the current implementation contains several critical and high-severity bugs that must be addressed: the conditional logic for selecting the response class in translations is inverted, which will cause incorrect behavior; the new function for creating segments, _get_verbose_segments, is called with incorrect arguments, which will lead to a runtime error; and this function also has logic errors in segment ID assignment and timestamp calculation. These issues need to be resolved to ensure the feature works correctly.

@mr8bit commented Sep 16, 2025

Hi! When I run your PR using the command below:

vllm serve \
  --host 0.0.0.0 \
  --port 8080 \
  /models/openai/whisper-large-v3 \
  --served-model-name openai/whisper-large-v3 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --enable-prompt-tokens-details \
  --enable-force-include-usage

I get the following error when requesting response_format=verbose_json:

curl --location 'http://localhost:8080/v1/audio/transcriptions' \
--form 'file=@"./audio.mp3"' \
--form 'model="openai/whisper-large-v3"' \
--form 'response_format="verbose_json"' \
--form 'temperature="0"'

{
    "error": {
        "message": "can only concatenate tuple (not \"list\") to tuple",
        "type": "Internal Server Error",
        "param": null,
        "code": 500
    }
}

Could you advise what I might be doing wrong?

@sangbumlikeagod (Contributor, Author) commented:

Hello @mr8bit,

Sorry for the mistake: I accidentally dropped the cast while fixing the mypy issue.
I've now updated and committed the changes. Could you please test it again?
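
For anyone hitting the same trace: the error comes from Python refusing to concatenate a tuple and a list directly. A minimal reproduction of this class of bug and the cast that fixes it (illustrative values, not the PR's actual code):

prompt_ids = (50258, 50259)  # existing token ids as a tuple
extra_ids = [50359, 50364]   # ids to append, produced as a list

# prompt_ids + extra_ids  # TypeError: can only concatenate tuple (not "list") to tuple
prompt_ids = prompt_ids + tuple(extra_ids)  # explicit cast restores the concatenation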

@mr8bit commented Sep 23, 2025

@sangbumlikeagod
Yes, everything works, thank you very much.

Do you plan to add timestamp_granularities for words?

@sangbumlikeagod (Contributor, Author) commented:

@mr8bit
Yes, I do. Once this PR gets merged, I'll follow up with more PRs to fully support the timestamp feature.

@NickLucche (Collaborator) commented:

Apologies for the delay, I missed the notification on this one. Will try to take a look asap

@sangbumlikeagod (Contributor, Author) commented:

@NickLucche Thanks a lot! Please feel free to suggest any improvements; I'll put them in right away.
