【Spec Decode】support async scheduling with eagle speculative decoding #25872

woodlgz · 2025-09-29T10:05:40Z

Previous work (19970 and 23569) enabled async scheduling for common scenarios but left speculative decoding unsupported.

Purpose

This pull request aims to make EAGLE speculative decoding work with async scheduling.

Test Plan

We ran the benchmark with the following commands on an NVIDIA L20 machine:

benchmark server

VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve meta-llama/Llama-3.1-8B-Instruct --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 4, "disable_padded_drafter_batch": false}' --max-model-len 2048 --no-enable-prefix-caching --async-scheduling

benchmark client

vllm bench serve --port 8000 --save-result --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --max-concurrency 1

Test Result

Version	Mean TPOT (ms)	P99 TPOT (ms)	Mean ITL (ms)	P99 ITL (ms)
main	9.91	14.94	28.14	28.52
this PR	9.57	14.42	27.17	28.15

Mean TPOT and mean ITL both improve by about 3.4% relative to main.
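The gain quoted above can be rechecked from the table with a few lines of Python. This is only a sketch of the arithmetic: the dictionary keys and the `relative_gain` helper are illustrative names, and the numbers are copied from the Test Result table.

```python
# Latencies in milliseconds, copied from the Test Result table.
main = {"mean_tpot": 9.91, "p99_tpot": 14.94, "mean_itl": 28.14, "p99_itl": 28.52}
this_pr = {"mean_tpot": 9.57, "p99_tpot": 14.42, "mean_itl": 27.17, "p99_itl": 28.15}


def relative_gain(before: float, after: float) -> float:
    """Percentage latency reduction when going from `before` to `after`."""
    return (before - after) / before * 100.0


# One relative gain per metric (illustrative helper, not part of vLLM).
gains = {name: relative_gain(main[name], this_pr[name]) for name in main}

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}% lower than main")
```

The two mean metrics land close to 3.4%, while the P99 ITL gain is noticeably smaller, so the headline figure describes the means rather than every column.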
[Figure: step gaps before optimization]

[Figure: step gaps after optimization]

github-actions · 2025-09-29T10:05:52Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request introduces support for asynchronous scheduling with Eagle speculative decoding. The changes are extensive, touching the scheduler, model runner, and speculative decoding logic. The core of the implementation is to make the speculative decoding flow asynchronous to improve performance, primarily by avoiding CPU-GPU synchronizations. This is achieved by using GPU events for synchronization, caching intermediate GPU tensors between steps, and introducing a 'fix-up' step to correct token counts based on rejected speculative tokens from the previous step. The changes appear to be well-thought-out and correctly implement the asynchronous logic. I have reviewed the code and did not find any issues of high or critical severity.
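The "fix-up" idea described in the review can be illustrated with a small, pure-Python sketch. It is not the PR's implementation: `RequestState`, `schedule_step`, `record_rejections`, and `fix_up` are hypothetical names, and only `prev_num_rejected_tokens` is borrowed from a commit title in this PR. The assumption is that, under async scheduling, a step is scheduled as if all draft tokens will be accepted, and the count is corrected at the start of the next step once the number of rejections is known.

```python
from dataclasses import dataclass


@dataclass
class RequestState:
    """Hypothetical per-request bookkeeping; not a vLLM class."""

    num_computed_tokens: int = 0
    # Rejections from the previous step, applied lazily by fix_up().
    prev_num_rejected_tokens: int = 0


def schedule_step(req: RequestState, num_draft_tokens: int) -> int:
    """Optimistically advance as if every draft token will be accepted."""
    num_scheduled = 1 + num_draft_tokens  # one verified token plus the drafts
    req.num_computed_tokens += num_scheduled
    return num_scheduled


def record_rejections(req: RequestState, num_draft_tokens: int, num_accepted: int) -> None:
    """Cache how many draft tokens the previous step rejected."""
    req.prev_num_rejected_tokens = num_draft_tokens - num_accepted


def fix_up(req: RequestState) -> None:
    """Correct the optimistic count before the next step is prepared."""
    req.num_computed_tokens -= req.prev_num_rejected_tokens
    req.prev_num_rejected_tokens = 0


req = RequestState(num_computed_tokens=10)
schedule_step(req, num_draft_tokens=4)      # optimistic count: 10 + 5 = 15
record_rejections(req, num_draft_tokens=4, num_accepted=1)
fix_up(req)                                 # 3 rejected drafts are rolled back
print(req.num_computed_tokens)
```

In the real change the rejection counts reportedly stay on the GPU and are ordered with GPU events rather than a blocking copy to the CPU; the sketch keeps everything on the CPU purely to show the bookkeeping.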

lhtin · 2025-09-29T12:49:25Z

Is this a duplicate of PR #24799?

mergify · 2025-10-03T07:03:32Z

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @woodlgz.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

woodlgz requested review from ApostaC, WoosukKwon, alexm-redhat, benchislett, comaniac, heheda12345, luccafong, njhill, robertgshaw2-redhat and ywang96 as code owners September 29, 2025 10:05

mergify bot added speculative-decoding v1 labels Sep 29, 2025

gemini-code-assist bot reviewed Sep 29, 2025

woodlgz added 3 commits September 30, 2025 17:02

support eagle speculative decoding with async-scheduling

1e4e656

Signed-off-by: guozelin <[email protected]>

[Spec Decode] skip caching for prev_num_rejected_tokens if num_draft_…

f5d70f8

…tokens = 0 Signed-off-by: guozelin <[email protected]>

bugfix: partial prefill with speculative decoding must have no propos…

e09923f

…e tokens placeholders Signed-off-by: guozelin <[email protected]>

woodlgz force-pushed the feature-async-scheduling-spec-decode-0925 branch from 94fa608 to ec5fae1 Compare September 30, 2025 09:03

bugfix: spec-decoding with async-scheduling avoid potential data conf…

a6e4caf

…licts Signed-off-by: guozelin <[email protected]>

woodlgz force-pushed the feature-async-scheduling-spec-decode-0925 branch from ec5fae1 to a6e4caf Compare September 30, 2025 09:33

mergify bot added the needs-rebase label Oct 3, 2025

woodlgz closed this Oct 8, 2025
