[Frontend] add 'verbose_json' and 'timestamp' feature on Whisper Transcription/Translation #24209
Conversation
Signed-off-by: sangbumlikeagod <[email protected]>
Code Review
This pull request adds the 'verbose_json' response format to Whisper transcription and translation, a valuable feature for providing segment-level timestamps. However, the current implementation contains several critical and high-severity bugs that must be addressed. The conditional that selects the response class for translations is inverted, which will cause incorrect behavior. The new segment-building function, `_get_verbose_segments`, is called with incorrect arguments, which will lead to a runtime error, and it also contains logic errors in segment ID assignment and timestamp calculation. These issues need to be resolved to ensure the feature works correctly.
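To make the review's last two points concrete, here is a minimal sketch of segment construction with sequential IDs and chunk-offset timestamps. All names here (`Segment`, `get_verbose_segments`, the `(start, end, text)` chunk shape) are illustrative assumptions, not the PR's actual code.

```python
# Hypothetical sketch: building verbose_json-style segments. Segment ids must
# start at 0 and increase by 1, and chunk-relative timestamps must be offset
# by the chunk's absolute start time.
from dataclasses import dataclass


@dataclass
class Segment:
    id: int
    start: float
    end: float
    text: str


def get_verbose_segments(chunks, chunk_offset=0.0):
    """chunks: iterable of (start, end, text) tuples relative to one audio chunk."""
    segments = []
    for i, (start, end, text) in enumerate(chunks):
        segments.append(Segment(
            id=i,                        # sequential ids, restarting from 0
            start=chunk_offset + start,  # convert chunk-relative -> absolute time
            end=chunk_offset + end,
            text=text,
        ))
    return segments


segs = get_verbose_segments([(0.0, 2.5, "hello"), (2.5, 4.0, "world")],
                            chunk_offset=30.0)
print([(s.id, s.start, s.end) for s in segs])
```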
Hi! When I run your PR, I get the following error when requesting:

```shell
curl --location 'http://localhost:8080/v1/audio/transcriptions' \
  --form 'file=@"./audio.mp3"' \
  --form 'model="openai/whisper-large-v3"' \
  --form 'response_format="verbose_json"' \
  --form 'temperature="0"'
```

```json
{
  "error": {
    "message": "can only concatenate tuple (not \"list\") to tuple",
    "type": "Internal Server Error",
    "param": null,
    "code": 500
  }
}
```

Could you advise what I might be doing wrong?
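The reported message is the standard Python error for mixing sequence types, which matches the "casting part" mentioned in the reply below. A minimal reproduction of the error class and the one-line fix (the variable names are illustrative, not taken from the PR):

```python
# Reproduce the reported TypeError: a tuple cannot be concatenated with a list.
prompt_tokens = (1, 2, 3)   # e.g. prompt token ids stored as a tuple
generated = [4, 5]          # e.g. generated token ids stored as a list

try:
    combined = prompt_tokens + generated
except TypeError as e:
    print(e)  # can only concatenate tuple (not "list") to tuple

# Fix: cast one side so both operands are the same sequence type.
combined = list(prompt_tokens) + generated
print(combined)
```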
Hello @mr8bit! Sorry for the mistake; I accidentally dropped the casting part while fixing the mypy issue.
@sangbumlikeagod Do you plan to add timestamp_granularities for words?
@mr8bit
Apologies for the delay, I missed the notification on this one. Will try to take a look ASAP.
@NickLucche Thanks a lot! Please feel free to suggest any improvements; I'll put them in right away.
Purpose
Currently, vLLM does not support the timestamp feature for Whisper that is available in the original Whisper API. There is significant demand for this feature, as evidenced by discussions such as #15012 (comment).
To address this, this PR implements support for segment-level timestamps.
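For reference, this is the general shape of a verbose_json payload. Field names here follow the public OpenAI Whisper API (task, language, duration, text, segments); the exact fields vLLM emits may differ, so treat this as a sketch, and the small validation loop shows the invariants segments should satisfy:

```python
# Sketch of a verbose_json transcription payload (field names assumed from the
# public OpenAI Whisper API; values are made up for illustration).
import json

sample = json.loads("""
{
  "task": "transcribe",
  "language": "english",
  "duration": 4.0,
  "text": "hello world",
  "segments": [
    {"id": 0, "start": 0.0, "end": 2.5, "text": "hello"},
    {"id": 1, "start": 2.5, "end": 4.0, "text": "world"}
  ]
}
""")

# Segment ids should be sequential and timestamps non-decreasing.
prev_end = 0.0
for i, seg in enumerate(sample["segments"]):
    assert seg["id"] == i
    assert prev_end <= seg["start"] <= seg["end"]
    prev_end = seg["end"]
print(len(sample["segments"]))
```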
Test Plan
Test Result
- transcription: request / result
- translation: request / response
- transcription in other languages (Korean in this case): request / response
Note
There's a lingering issue in this PR: even if we remove the '<|notimestamps|>' token from the decoder prompt, the model still predicts it.
The most straightforward solution would be to set the logit of '<|notimestamps|>' to negative infinity during the model's inference; Hugging Face's Transformers library uses a similar approach.
However, vLLM's multi-step process prevents us from applying logits processing during inference, so I used an indirect approach: I replaced '<|notimestamps|>' with '<|0.00|>' so that the model generates the other timestamp tokens. This approach might cause issues, but I tested it multiple times and it appears to work well. Additionally, this timestamp handling only applies in 'verbose_json' mode, the new feature introduced in this PR, so it should not affect current users. That is why I submitted the PR despite this issue.
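The "straightforward" approach described above can be sketched as a logits processor that masks the token. This is a toy illustration, similar in spirit to Hugging Face's timestamp logits processing; the token id and the `(token_ids, logits)` signature are assumptions, and the PR deliberately does not use this approach:

```python
# Hypothetical sketch: suppress '<|notimestamps|>' by forcing its logit to -inf
# so the model can never select it. The token id below is assumed; the real id
# depends on the tokenizer.
import math

NOTIMESTAMPS_TOKEN_ID = 50363  # assumed id for '<|notimestamps|>'


def suppress_notimestamps(token_ids, logits):
    # Logits-processor style signature: (previously generated ids, logits) -> logits.
    logits = list(logits)
    if NOTIMESTAMPS_TOKEN_ID < len(logits):
        logits[NOTIMESTAMPS_TOKEN_ID] = -math.inf
    return logits


vocab_logits = [0.1] * 50364
out = suppress_notimestamps([], vocab_logits)
print(out[NOTIMESTAMPS_TOKEN_ID])
```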
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.