Conversation

@sangbumlikeagod (Contributor) commented Jun 30, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

As noted in the Whisper prompt feature discussion on GitHub, Whisper supports a prompt feature that lets previous context be supplied alongside a given audio input. vLLM can already process audio longer than 30 seconds by splitting it into segments, but it does not feed each segment's transcription back as a prompt for the next one. This PR aims to solve that issue.

Key Changes:

async def _preprocess_speech_to_text now takes a previous_text: list[str] parameter. The single-element list acts like a pointer in C++: it holds a mutable reference to the previous segment's result, so that result can be used as a prompt for the next segment.
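To illustrate the single-element-list pattern (a minimal sketch; the helper name is hypothetical, not from the diff):

```python
# Strings are immutable in Python, so a callee cannot rebind the caller's
# variable. Mutating index 0 of a shared one-element list makes the update
# visible everywhere the list is held.
previous_text = [""]  # starts empty; index 0 holds the latest context

def update_context(ctx: list[str], segment_text: str) -> None:
    ctx[0] = segment_text  # the caller sees this change

update_context(previous_text, "hello world")
assert previous_text[0] == "hello world"
```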

get_decoder_prompt is called as follows:

```python
get_decoder_prompt(  # type: ignore[attr-defined]
    lang, self.task_type,
    request.prompt,
    previous_text[0])
```

The get_decoder_prompt method is defined as:

```python
@classmethod
def get_decoder_prompt(cls, language: str, task_type: str,
                       prompt: str, previous_text: str) -> str:
    return (f"<|startoftranscript|><|{language}|><|{task_type}|>"
            f"<|notimestamps|>{prompt}{previous_text}")
```

Now, previous_text (initially empty) is included in the prompt.
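For illustration (example values, not from the diff): with language "en", task_type "transcribe", an empty request prompt, and previous_text "the quick brown fox", the resulting decoder prompt is:

```
<|startoftranscript|><|en|><|transcribe|><|notimestamps|>the quick brown fox
```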

Async Generator Integration:
The result of each previous generation is fed back as the prompt for the subsequent segment's inference, as shown in the sketch below.
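A minimal sketch of that flow (simplified and hypothetical beyond the names visible in the diff, such as _preprocess_speech_to_text and previous_text):

```python
async def _run_chunks(self, request, audio_data) -> str:
    # Sketch: each chunk's transcription updates previous_text, which the
    # async generator reads when building the prompt for the next chunk.
    previous_text = [""]
    full_text = []
    prompt_generator = self._preprocess_speech_to_text(
        request=request,
        audio_data=audio_data,
        previous_text=previous_text,
    )
    async for prompt, duration in prompt_generator:
        text = await self._transcribe_chunk(prompt)  # hypothetical helper
        previous_text[0] = text  # feeds the next iteration's prompt
        full_text.append(text)
    return " ".join(full_text)
```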

Remaining Issues:

Prompt Position:
The current placement of the prompt within the decoder input may not be optimal and needs adjusting for better results.

Prompt Formatting Impact:
The way prompts are set up significantly affects Whisper’s inference results. We need to explore the most effective prompting methods.

To test this, I plan to try various prompt styles, such as those described in the OpenAI Cookbook’s Whisper Prompting Guide, including generating fictitious prompts with GPT.

Summary:
This PR enables vLLM to use previous audio segment results as prompts for subsequent segments, improving context continuity in Whisper-based transcription. However, the position and formatting of prompts have a major impact on results, so further testing of different prompt strategies is needed.

Test Plan

Test Result

(Optional) Documentation Update

@sangbumlikeagod sangbumlikeagod changed the title Sangbumlikeagod/frontend/add previous context [Frontend] add previous context to whisper transcription over 30s audio Jun 30, 2025
@mergify mergify bot added the frontend label Jun 30, 2025
@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @sangbumlikeagod, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to vLLM's Whisper ASR capabilities by enabling the use of previous transcription results as contextual prompts for subsequent audio segments. This refactors the audio processing pipeline to be asynchronous and iterative, allowing for dynamic prompting that improves the coherence of transcriptions for audio longer than 30 seconds.

Highlights

  • Previous Context for Whisper ASR: Implemented a mechanism to pass the transcribed text from previous audio segments as a prompt to subsequent segments, enhancing contextual continuity for long audio transcriptions in vLLM's Whisper ASR.
  • Dynamic Prompt Generation: Modified the get_decoder_prompt method in the Whisper model to accept and incorporate the previous_text into the decoder prompt, allowing the model to be 'primed' with the context of preceding audio segments.
  • Asynchronous Processing Flow Refactor: Refactored the _preprocess_speech_to_text function to be an AsyncGenerator that yields prompts and audio durations sequentially. This enables an iterative processing flow where the output of one audio chunk can dynamically inform the prompt for the next.
  • API Parameter Updates: Updated the transcription_stream_generator and translation_stream_generator methods to accept sampling_params and the new previous_context parameter, aligning the API with the new asynchronous and context-aware processing logic.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a valuable feature for Whisper transcriptions by adding context from previous audio segments. The overall approach of using an async generator for prompts is sound. I've identified one critical issue where a method was likely renamed in one place but not another, which will cause a runtime error. I've also pointed out a medium-severity issue related to code duplication that could be improved for better maintainability. Once these are addressed, the PR should be in good shape.

Comment on lines 172 to 175:

```python
asyncPromptGenerator = self._preprocess_transcription(
    request=request,
    audio_data=audio_data,
    previous_text=previous_text
```

critical

There seems to be a typo here. The method _preprocess_speech_to_text was modified to be an async generator, but here it's being called as _preprocess_transcription. This will result in an AttributeError as _preprocess_transcription is not defined in this class. It should be _preprocess_speech_to_text.

Suggested change:

```diff
-asyncPromptGenerator = self._preprocess_transcription(
+asyncPromptGenerator = self._preprocess_speech_to_text(
     request=request,
     audio_data=audio_data,
     previous_text=previous_text
```

Comment on lines 216 to 219:

```python
previous_text[0] = ' '.join(
    partial_text.strip()
    .split(' ')[-5:]
)
```

medium

This logic to extract the last 5 words for context is duplicated in _speech_to_text_stream_generator on lines 324-327. To improve maintainability and avoid code duplication, consider extracting this into a private helper method. Additionally, the number of words (5) is a magic number. It would be better to define it as a constant at the class or module level, for example _CONTEXT_WORDS = 5.
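A minimal sketch of that refactor (the constant name comes from this review comment; the helper name is illustrative):

```python
_CONTEXT_WORDS = 5  # trailing words carried over as context between chunks

def _trailing_context(self, partial_text: str) -> str:
    # Keep only the last _CONTEXT_WORDS words of the running transcription.
    return ' '.join(partial_text.strip().split(' ')[-_CONTEXT_WORDS:])
```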


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@NickLucche (Collaborator) left a comment

Hey, thanks for your work!
Just some very early feedback as I was glancing at your PR: I think the main concern is making sure the previous text does not end up filling all of the decoder's max-model-len (the "generation budget"), so we should probably feed only a smallish window.

Also, make sure to account for the additional previous_text tokens here: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/speech_to_text.py#L267.

Finally, I think this could benefit from, or at least anticipate, some of the timestamp/segment features (e.g. the verbose_json format) that people have been asking about in #15012 (comment).
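A hedged sketch of that token-budget concern (all names and the window size are assumptions, not vLLM API):

```python
def clamp_context(previous_text: str, tokenizer, max_context_tokens: int = 32) -> str:
    # Cap the carried-over context so it cannot eat into the decoder's
    # max-model-len (the generation budget).
    token_ids = tokenizer.encode(previous_text)
    if len(token_ids) <= max_context_tokens:
        return previous_text
    return tokenizer.decode(token_ids[-max_context_tokens:])
```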

@sangbumlikeagod (Contributor, Author) commented:

Sure! I will consider these points and apply them ASAP.

mergify bot commented Jul 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sangbumlikeagod.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 3, 2025
@mergify mergify bot removed the needs-rebase label Jul 3, 2025
@sangbumlikeagod (Contributor, Author) commented:

Hello @NickLucche,
I have a few questions.
As you said, I was planning to add the verbose_json feature options in my PR. However, I realized that to fully support verbose_json, several other features such as "log_probs" and "time_stamp" also need to be implemented.
Would it be better to add all these features in a single PR, or should I split them into separate PRs?
Thank you!

@NickLucche (Collaborator) commented:

Hey @sangbumlikeagod, split it into two PRs if you can; it makes reviewers' lives easier :)
Perhaps we can focus on getting "segments" properly typed in a single PR and then add the verbose_json response later as a composition.
Mind that we don't need all the options implemented right away; e.g. "compression_ratio" and "no_speech_prob" are optional/nice-to-have. The most important part is the start/end tokens that define the segment.

@sangbumlikeagod (Contributor, Author) commented:

Hello @NickLucche,

I just uploaded another separate PR: #24209

It covers the segment/timestamp options you mentioned last time. It still needs more work, but could you take a look and give me some feedback on it?

I'll continue working on this PR after I finish the other one.

Thanks!
