[Frontend] add previous context to whisper transcription over 30s audio #20249
Conversation
Summary of Changes
Hello @sangbumlikeagod, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant enhancement to vLLM's Whisper ASR capabilities by enabling the use of previous transcription results as contextual prompts for subsequent audio segments. This refactors the audio processing pipeline to be asynchronous and iterative, allowing for dynamic prompting that improves the coherence of transcriptions for audio longer than 30 seconds.
Highlights
- Previous Context for Whisper ASR: Implemented a mechanism to pass the transcribed text from previous audio segments as a prompt to subsequent segments, enhancing contextual continuity for long audio transcriptions in vLLM's Whisper ASR.
- Dynamic Prompt Generation: Modified the `get_decoder_prompt` method in the Whisper model to accept and incorporate the `previous_text` into the decoder prompt, allowing the model to be 'primed' with the context of preceding audio segments.
- Asynchronous Processing Flow Refactor: Refactored the `_preprocess_speech_to_text` function to be an `AsyncGenerator` that yields prompts and audio durations sequentially. This enables an iterative processing flow where the output of one audio chunk can dynamically inform the prompt for the next (see the sketch after this list).
- API Parameter Updates: Updated the `transcription_stream_generator` and `translation_stream_generator` methods to accept `sampling_params` and the new `previous_context` parameter, aligning the API with the new asynchronous and context-aware processing logic.
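To make the refactored flow concrete, here is a minimal, self-contained sketch of the generator side. It is illustrative only, not the PR's actual code: the hard-coded English/transcribe tokens, the `audio_chunks` argument, and the fixed 30-second duration are assumptions.

```python
from typing import AsyncGenerator


async def preprocess_speech_to_text(
    audio_chunks: list[bytes],
    previous_text: list[str],
) -> AsyncGenerator[tuple[str, float], None]:
    """Yield one (decoder_prompt, duration) pair per audio chunk,
    folding the carried-over context into each prompt."""
    for _chunk in audio_chunks:
        prompt = ("<|startoftranscript|><|en|><|transcribe|>"
                  f"<|notimestamps|>{previous_text[0]}")
        yield prompt, 30.0  # illustrative fixed chunk duration
```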
Code Review
This pull request introduces a valuable feature for Whisper transcriptions by adding context from previous audio segments. The overall approach of using an async generator for prompts is sound. I've identified one critical issue where a method was likely renamed in one place but not another, which will cause a runtime error. I've also pointed out a medium-severity issue related to code duplication that could be improved for better maintainability. Once these are addressed, the PR should be in good shape.
```python
asyncPromptGenerator = self._preprocess_transcription(
    request=request,
    audio_data=audio_data,
    previous_text=previous_text
```
There seems to be a typo here. The method `_preprocess_speech_to_text` was modified to be an async generator, but here it is called as `_preprocess_transcription`. This will result in an `AttributeError`, as `_preprocess_transcription` is not defined in this class. It should be `_preprocess_speech_to_text`.
```diff
-asyncPromptGenerator = self._preprocess_transcription(
+asyncPromptGenerator = self._preprocess_speech_to_text(
     request=request,
     audio_data=audio_data,
     previous_text=previous_text
```
```python
previous_text[0] = ' '.join(
    partial_text.strip()
    .split(' ')[-5:]
)
```
This logic to extract the last 5 words for context is duplicated in `_speech_to_text_stream_generator` on lines 324-327. To improve maintainability and avoid code duplication, consider extracting it into a private helper method. Additionally, the number of words (5) is a magic number; it would be better to define it as a constant at the class or module level, for example `_CONTEXT_WORDS = 5`.
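A minimal sketch of what that helper could look like (module-level here for brevity; in the PR it would more likely be a private method on the serving class, and both names below are illustrative, not existing code):

```python
_CONTEXT_WORDS = 5  # trailing words carried over as context for the next chunk


def extract_context(partial_text: str, num_words: int = _CONTEXT_WORDS) -> str:
    """Return the last `num_words` words of a transcription chunk."""
    return ' '.join(partial_text.strip().split(' ')[-num_words:])
```

Both duplicated sites would then reduce to `previous_text[0] = extract_context(partial_text)`.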
Hey, thanks for your work!
Just some very early feedback as I was glancing at your PR: I think the main concern is making sure the previous text does not end up filling up the entire max-model-len of the decoder (the "generation budget"), so we can probably feed a smallish window, I suppose.
Also, make sure to account for the additional `previous_text` tokens here: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/speech_to_text.py#L267.
Finally, I think this could benefit from, or at least anticipate, some of the timestamp/segment feature (e.g. the `verbose_json` format) that people have been asking about here too: #15012 (comment).
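One way to address the generation-budget concern would be to clamp the carried-over text to a fixed token budget before it is folded into the prompt. A rough sketch, assuming a Hugging Face-style tokenizer exposing `encode`/`decode`; the function name and the `max_context_tokens` default are placeholders, not existing vLLM API:

```python
def clamp_previous_text(previous_text: str,
                        tokenizer,
                        max_context_tokens: int = 32) -> str:
    """Keep only the last max_context_tokens tokens of the previous text."""
    token_ids = tokenizer.encode(previous_text, add_special_tokens=False)
    if len(token_ids) <= max_context_tokens:
        return previous_text
    return tokenizer.decode(token_ids[-max_context_tokens:])
```

Whatever budget is chosen, those tokens would also need to be counted in the prompt-length accounting linked above.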
Sure! I will consider them and apply them ASAP.
This pull request has merge conflicts that must be resolved before it can be merged.
Hello @NickLucche,
Hey @sangbumlikeagod, split into two PRs if you can, it makes reviewers' lives easier :)
Hello @NickLucche, I just uploaded a separate PR: #24209. It's about the segment-timestamp options you mentioned last time. It still needs some more progress, but could you take a look and give me some feedback on it? I'll continue working on this PR after I finish the other one. Thanks!
Purpose
According to the Whisper prompt feature discussion on GitHub, Whisper supports a prompt feature that allows previous context to be added to the given audio input. Currently, vLLM supports processing audio longer than 30 seconds, but it does not provide previous results as a prompt. This PR aims to solve that issue.
Key Changes:
`async def _preprocess_speech_to_text` now includes a `previous_text: list[str]` parameter, which works like a pointer in C++: it references the previous segment's result so it can be used as a prompt for the next segment.
`get_decoder_prompt` is called as follows:

```python
get_decoder_prompt(  # type: ignore[attr-defined]
    lang, self.task_type,
    request.prompt,
    previous_text[0])
```
The `get_decoder_prompt` method is defined as:

```python
@classmethod
def get_decoder_prompt(cls, language: str, task_type: str,
                       prompt: str, previous_text: str) -> str:
    return (f"<|startoftranscript|><|{language}|><|{task_type}|>"
            f"<|notimestamps|>{prompt}{previous_text}")
```
Now, `previous_text` (initially empty) is included in the prompt.
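As a worked example (the language, task, and carried-over snippet are made up), a second chunk whose predecessor ended with "the quarterly results were strong", with an empty `request.prompt`, would be primed with:

```python
prompt = ("<|startoftranscript|><|en|><|transcribe|>"
          "<|notimestamps|>the quarterly results were strong")
```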
Async Generator Integration:
The result generated for each audio chunk is used as a prompt for the subsequent chunk's inference.
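A self-contained sketch of this feedback loop, with a stand-in generation call so it runs on its own; the chunking, prompt template, and 5-word context window are illustrative rather than the PR's exact code:

```python
import asyncio


async def fake_generate(prompt: str, chunk: bytes) -> str:
    """Stand-in for the per-chunk vLLM generation call."""
    await asyncio.sleep(0)
    return f"transcription of a {len(chunk)}-byte chunk"


async def transcribe_long_audio(audio_chunks: list[bytes]) -> str:
    previous_text = [""]  # single-element list, mutated in place between chunks
    pieces: list[str] = []
    for chunk in audio_chunks:
        prompt = ("<|startoftranscript|><|en|><|transcribe|>"
                  f"<|notimestamps|>{previous_text[0]}")
        partial_text = await fake_generate(prompt, chunk)
        pieces.append(partial_text)
        # carry a short tail of this chunk's text into the next prompt
        previous_text[0] = ' '.join(partial_text.strip().split(' ')[-5:])
    return ' '.join(pieces)


# Example: asyncio.run(transcribe_long_audio([b"\x00" * 16000, b"\x00" * 16000]))
```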
Remaining Issues:
Prompt Position:
The current position of the prompt may not be optimal. This needs to be adjusted for better results.
Prompt Formatting Impact:
The way prompts are set up significantly affects Whisper’s inference results. We need to explore the most effective prompting methods.
To test this, I plan to try various prompt styles, such as those described in the OpenAI Cookbook’s Whisper Prompting Guide, including generating fictitious prompts with GPT.
Summary:
This PR enables vLLM to use previous audio segment results as prompts for subsequent segments, improving context continuity in Whisper-based transcription. However, the position and formatting of prompts have a major impact on results, so further testing of different prompt strategies is needed.
Test Plan
Test Result
(Optional) Documentation Update