[Frontend] add previous context to whisper transcription over 30s audio #20249
Conversation
Summary of Changes
Hello @sangbumlikeagod, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant enhancement to vLLM's Whisper ASR capabilities by enabling the use of previous transcription results as contextual prompts for subsequent audio segments. This refactors the audio processing pipeline to be asynchronous and iterative, allowing for dynamic prompting that improves the coherence of transcriptions for audio longer than 30 seconds.
Highlights
- Previous Context for Whisper ASR: Implemented a mechanism to pass the transcribed text from previous audio segments as a prompt to subsequent segments, enhancing contextual continuity for long audio transcriptions in vLLM's Whisper ASR.
- Dynamic Prompt Generation: Modified the `get_decoder_prompt` method in the Whisper model to accept and incorporate the `previous_text` into the decoder prompt, allowing the model to be 'primed' with the context of preceding audio segments.
- Asynchronous Processing Flow Refactor: Refactored the `_preprocess_speech_to_text` function to be an `AsyncGenerator` that yields prompts and audio durations sequentially. This enables an iterative processing flow where the output of one audio chunk can dynamically inform the prompt for the next (see the sketch after this list).
- API Parameter Updates: Updated the `transcription_stream_generator` and `translation_stream_generator` methods to accept `sampling_params` and the new `previous_context` parameter, aligning the API with the new asynchronous and context-aware processing logic.
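To make the refactored flow concrete, here is a minimal, self-contained sketch of the generator side. It is illustrative only, not the PR's actual code: the hard-coded English/transcribe tokens, the `audio_chunks` argument, and the fixed 30-second duration are assumptions.

```python
from typing import AsyncGenerator


async def preprocess_speech_to_text(
    audio_chunks: list[bytes],
    previous_text: list[str],
) -> AsyncGenerator[tuple[str, float], None]:
    """Yield one (decoder_prompt, duration) pair per audio chunk,
    folding the carried-over context into each prompt."""
    for _chunk in audio_chunks:
        prompt = ("<|startoftranscript|><|en|><|transcribe|>"
                  f"<|notimestamps|>{previous_text[0]}")
        yield prompt, 30.0  # illustrative fixed chunk duration
```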
Code Review
This pull request introduces a valuable feature for Whisper transcriptions by adding context from previous audio segments. The overall approach of using an async generator for prompts is sound. I've identified one critical issue where a method was likely renamed in one place but not another, which will cause a runtime error. I've also pointed out a medium-severity issue related to code duplication that could be improved for better maintainability. Once these are addressed, the PR should be in good shape.
```python
asyncPromptGenerator = self._preprocess_transcription(
    request=request,
    audio_data=audio_data,
    previous_text=previous_text
```
There seems to be a typo here. The method `_preprocess_speech_to_text` was modified to be an async generator, but here it is called as `_preprocess_transcription`. This will result in an `AttributeError`, as `_preprocess_transcription` is not defined in this class. It should be `_preprocess_speech_to_text`.
```diff
-asyncPromptGenerator = self._preprocess_transcription(
+asyncPromptGenerator = self._preprocess_speech_to_text(
     request=request,
     audio_data=audio_data,
     previous_text=previous_text
```
```python
previous_text[0] = ' '.join(
    partial_text.strip()
    .split(' ')[-5:]
)
```
This logic to extract the last 5 words for context is duplicated in `_speech_to_text_stream_generator` on lines 324-327. To improve maintainability and avoid code duplication, consider extracting it into a private helper method. Additionally, the number of words (5) is a magic number; it would be better to define it as a constant at the class or module level, for example `_CONTEXT_WORDS = 5`.
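A minimal sketch of what that helper could look like (module-level here for brevity; in the PR it would more likely be a private method on the serving class, and both names below are illustrative, not existing code):

```python
_CONTEXT_WORDS = 5  # trailing words carried over as context for the next chunk


def extract_context(partial_text: str, num_words: int = _CONTEXT_WORDS) -> str:
    """Return the last `num_words` words of a transcription chunk."""
    return ' '.join(partial_text.strip().split(' ')[-num_words:])
```

Both duplicated sites would then reduce to `previous_text[0] = extract_context(partial_text)`.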
Hey, thanks for your work!
Just some very early feedback as I was glancing at your PR: I think the main concern is making sure the previous text does not end up filling up the entire max-model-len of the decoder (the "generation budget"), so we can probably feed a smallish window, I suppose.
Also, make sure to account for the additional `previous_text` tokens here: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/speech_to_text.py#L267.
Finally, I think this could benefit from, or at least anticipate, some of the timestamp/segment feature (e.g. the `verbose_json` format) that people have been asking about here too: #15012 (comment).
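One way to address the generation-budget concern would be to clamp the carried-over text to a fixed token budget before it is folded into the prompt. A rough sketch, assuming a Hugging Face-style tokenizer exposing `encode`/`decode`; the function name and the `max_context_tokens` default are placeholders, not existing vLLM API:

```python
def clamp_previous_text(previous_text: str,
                        tokenizer,
                        max_context_tokens: int = 32) -> str:
    """Keep only the last max_context_tokens tokens of the previous text."""
    token_ids = tokenizer.encode(previous_text, add_special_tokens=False)
    if len(token_ids) <= max_context_tokens:
        return previous_text
    return tokenizer.decode(token_ids[-max_context_tokens:])
```

Whatever budget is chosen, those tokens would also need to be counted in the prompt-length accounting linked above.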
Sure! I will consider them and apply them ASAP.
This pull request has merge conflicts that must be resolved before it can be merged.
Hello @NickLucche,
Hey @sangbumlikeagod, split into two PRs if you can, it makes reviewers' lives easier :)
Hello @NickLucche, I just uploaded a separate PR: #24209. It's about the segment-timestamp options you mentioned last time. It still needs some more progress, but could you take a look and give me some feedback on it? I'll continue working on this PR after I finish the other one. Thanks!
Purpose
According to the Whisper prompt feature discussion on GitHub, Whisper supports a prompt feature that allows previous context to be added to the given audio input. Currently, vLLM supports processing audio longer than 30 seconds, but it does not provide previous results as a prompt. This PR aims to solve that issue.
Key Changes:
`async def _preprocess_speech_to_text` now includes a `previous_text: list[str]` parameter, which works like a pointer in C++: it references the previous segment's result so it can be used as a prompt for the next segment.
`get_decoder_prompt` is called as follows:

```python
get_decoder_prompt(  # type: ignore[attr-defined]
    lang, self.task_type,
    request.prompt,
    previous_text[0])
```
The `get_decoder_prompt` method is defined as:

```python
@classmethod
def get_decoder_prompt(cls, language: str, task_type: str,
                       prompt: str, previous_text: str) -> str:
    return (f"<|startoftranscript|><|{language}|><|{task_type}|>"
            f"<|notimestamps|>{prompt}{previous_text}")
```
Now, `previous_text` (initially empty) is included in the prompt.
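As a worked example (the language, task, and carried-over snippet are made up), a second chunk whose predecessor ended with "the quarterly results were strong", with an empty `request.prompt`, would be primed with:

```python
prompt = ("<|startoftranscript|><|en|><|transcribe|>"
          "<|notimestamps|>the quarterly results were strong")
```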
Async Generator Integration:
The result generated for each audio chunk is used as a prompt for the subsequent chunk's inference.
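A self-contained sketch of this feedback loop, with a stand-in generation call so it runs on its own; the chunking, prompt template, and 5-word context window are illustrative rather than the PR's exact code:

```python
import asyncio


async def fake_generate(prompt: str, chunk: bytes) -> str:
    """Stand-in for the per-chunk vLLM generation call."""
    await asyncio.sleep(0)
    return f"transcription of a {len(chunk)}-byte chunk"


async def transcribe_long_audio(audio_chunks: list[bytes]) -> str:
    previous_text = [""]  # single-element list, mutated in place between chunks
    pieces: list[str] = []
    for chunk in audio_chunks:
        prompt = ("<|startoftranscript|><|en|><|transcribe|>"
                  f"<|notimestamps|>{previous_text[0]}")
        partial_text = await fake_generate(prompt, chunk)
        pieces.append(partial_text)
        # carry a short tail of this chunk's text into the next prompt
        previous_text[0] = ' '.join(partial_text.strip().split(' ')[-5:])
    return ' '.join(pieces)


# Example: asyncio.run(transcribe_long_audio([b"\x00" * 16000, b"\x00" * 16000]))
```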
Remaining Issues:
Prompt Position:
The current position of the prompt may not be optimal. This needs to be adjusted for better results.
Prompt Formatting Impact:
The way prompts are set up significantly affects Whisper’s inference results. We need to explore the most effective prompting methods.
To test this, I plan to try various prompt styles, such as those described in the OpenAI Cookbook’s Whisper Prompting Guide, including generating fictitious prompts with GPT.
Summary:
This PR enables vLLM to use previous audio segment results as prompts for subsequent segments, improving context continuity in Whisper-based transcription. However, the position and formatting of prompts have a major impact on results, so further testing of different prompt strategies is needed.
Test Plan
Test Result
(Optional) Documentation Update