[Core] Async Scheduling X Spec Decoding Compatibility #24799
Conversation
Code Review
This pull request adds support for speculative decoding with asynchronous scheduling, which is a great feature enhancement. The core logic of handling draft tokens within the worker process for async scheduling is sound. However, I've identified a few critical issues in gpu_model_runner.py related to tensor manipulation for scatter operations that will likely cause runtime errors. There's also a minor logic error in how speculative token lists are truncated. The proposed fixes are straightforward. Once these issues are addressed, the implementation should be solid.
@Ronald1995 I think it might be related to the larger model causing a rare race condition more than it would be due to an MTP-specific difference, for the reasons you identified. But I have no concrete information on the cause of this regression besides the AR discrepancy issue I measured.
@benchislett Ok, I have fixed the issues you reviewed recently and added explanations for the questions. As for this issue, you reminded me that you set the
@benchislett I find that the bench server prints many lines of logged acceptance metrics, and they vary irregularly, so I think the log you showed may not prove there is an accuracy issue. I compared the output content for sync scheduling and async scheduling with the prm800k_500 dataset.
@Ronald1995 I think you are misunderstanding the issue. The problem appears to be that draft tokens are not being generated (or received) properly. The verification code is fine, but fewer tokens are accepted when using this feature (async sched + spec) than without (only spec). Running the same experiment with the flag on/off, I should see (almost) exactly the same number of drafted and accepted tokens. Instead, I get the following data (from my prev post):
This is not just a performance issue. It means that the draft tokens are getting rejected too often. For example, if there is a race condition and the verification buffer is not filled in time, some tokens in the input might not be updated in time and the verification could reject more readily. I think I have shown sufficient evidence to believe there is an issue here.
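For reference, this is the metric being compared; a tiny sketch with placeholder numbers purely for illustration, not measurements:

# Hypothetical helper for the comparison described above; the counters would
# come from each run's spec-decoding metrics.
def spec_decode_stats(num_drafted: int, num_accepted: int, num_steps: int):
    acceptance_rate = num_accepted / num_drafted      # fraction of drafts kept
    mean_accept_len = 1 + num_accepted / num_steps    # bonus token + accepted drafts per step
    return acceptance_rate, mean_accept_len

# With async scheduling toggled on/off over the same workload, both values
# should be (almost) identical; a consistent drop indicates drafts are being
# lost or rejected, not just a latency difference.
print(spec_decode_stats(num_drafted=30_000, num_accepted=24_000, num_steps=10_000))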
As you can see from the benchmark logs I posted, the engine iteration is actually observably faster when running with async scheduling:
but the TPOT is slower, due to fewer tokens being accepted:
Ok, I got your point. I will reproduce your test to debug it.
@benchislett I have run some tests to reproduce your result; here are my results.

Test 1, server:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
The client script is the same as yours.
sync_scheduling result:
In this config, both ITL and TPOT improve with async_scheduling: ITL by 3.3%, TPOT by 3.3%.

Test 2, same server command but with @support_torch_compile commented out on DeepSeekMTP:
#@support_torch_compile
class DeepSeekMTP(nn.Module, SupportsPP):
server: VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
The client script is the same as yours.
sync_scheduling result:
In this config, both ITL and TPOT improve with async_scheduling: ITL by 5.3%, TPOT by 5.4%.

Test 3, server:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --enforce-eager --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
The client script is the same as yours.
sync_scheduling result:
There is a performance loss when enabling async_scheduling with max_concurrency=1; I have verified that if max_concurrency is increased, async_scheduling does give a speedup. The point here is that with cudagraph disabled, the "lower ITL but higher TPOT" pattern does not occur. I suspect there are hidden bugs in cudagraph with DeepseekMTP, and I need to spend more time to figure that out. But as for this PR, I have run a lot of tests and I think the implementation of async_scheduling with spec decoding is itself fine. I will add an assertion in the code to ensure that when async_scheduling and deepseek_mtp are used together, num_speculative_tokens is less than or equal to 1, and add
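The guard I have in mind looks roughly like this (a sketch only; where it lives and the exact parameter names are assumptions, not the actual vLLM config fields):

# Sketch only: parameter names are assumptions, not the actual vLLM config fields.
def check_async_spec_compat(async_scheduling: bool,
                            spec_method: str | None,
                            num_speculative_tokens: int) -> None:
    if not async_scheduling or spec_method is None:
        return
    if spec_method == "deepseek_mtp" and num_speculative_tokens > 1:
        raise ValueError(
            "Async scheduling with DeepSeek MTP currently requires "
            "num_speculative_tokens <= 1 until the cudagraph interaction "
            "is understood.")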
@Ronald1995 I am not fully convinced that this issue is resolved. I investigated further last week and I am still able to consistently reproduce the issue on Blackwell. Adding a
If the EAGLE prepare_inputs and the main model's prepare_inputs share any CPU-side data, I believe it is possible that one of them could overwrite this data while the other has an async HtoD memcpy in flight, leading to a race condition. We have an event in the main model's prepare_inputs to ensure that this does not happen between iterations of the main model, but there is intentionally no such safeguard for this in the spec decoding PR. I will validate whether this is the cause of the issue I am seeing, and investigate if so. Otherwise, I am happy with the state of the PR and hope it can be merged this week. Thank you for your continued effort!
@Ronald1995 I have confirmed the issue and propose the following patch: In
This enforces a synchronization between prepare_inputs of the base model and the EAGLE drafter. With this patch, there can be no overlap between the draft model's prepare_inputs CPU execution and the GPU execution of the target model's prepare_inputs for the same step. Overlap across iterations is still possible, and profiling in Nsight Systems for Llama 3.1 8B with EAGLE3 indicates that this is not a significant block. CPU-GPU overlapping still occurs during the target model's forward pass. You are welcome to benchmark and profile with the patch if you are suspicious of the performance implications. While it would be nice to identify a specific problematic shared buffer that could simply be replicated, this patch will serve in the meantime to alleviate the issue until the root cause is identified. Even then, it might be beneficial to keep it as a sanity check.
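For reference, a rough sketch of the synchronization I have in mind (the exact hook points inside prepare_inputs are assumptions based on the description above):

import torch

# Event recorded after the target model's prepare_inputs issues its async
# HtoD copies; the drafter waits on it before touching any shared host buffers.
prepare_inputs_done = torch.cuda.Event()

def after_target_prepare_inputs() -> None:
    # Hypothetically called at the end of the target model's prepare_inputs.
    prepare_inputs_done.record()

def before_drafter_prepare_inputs() -> None:
    # Block the CPU until the recorded GPU work has completed, so the drafter's
    # CPU-side prepare_inputs cannot overwrite buffers with a copy in flight.
    prepare_inputs_done.synchronize()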
An additional concern I have identified is that this PR does not support "disable_padded_drafter_batch". I believe two changes would be necessary to enable this:
- Update self.valid_sampled_token_count_cpu in the disable_padded_drafter_batch pathway of propose_draft_token_ids, following prepare_next_token_ids_cpu.
- Add an exception to the "if not self.use_async_scheduling" check in _bookkeeping_sync, since it is getting thrown off and leaving valid_sampled_token_ids empty.

This is very similar to the work that would be needed to give other speculative decoding methods overlapping support, and does not need to be included in this PR. However, in the meantime, please add some validation that raises a warning/error if "disable_padded_drafter_batch" is enabled, since it currently seems to lead to an ugly crash.
@benchislett Thanks for the information. I will do more performance testing with your advised patch and try to identify the specific problematic shared buffer.
Ok, I will fix this.
@benchislett I have identified the specific problematic shared buffer. Please see the code in
@benchislett I have added validation in arg_utils.py to make sure
@Ronald1995 sync scheduling should be functional and unchanged. I can confirm this but I'm pretty sure the only issue is
        common_attn_metadata.seq_lens_cpu += 1
        # For the requests that exceed the max model length, we set the
        # sequence length to 1 to minimize their overheads in attention.
        common_attn_metadata.seq_lens.masked_fill(exceeds_max_model_len, 1)
actually in-place is ok here because it's a new tensor
Suggested change:
-        common_attn_metadata.seq_lens.masked_fill(exceeds_max_model_len, 1)
+        common_attn_metadata.seq_lens.masked_fill_(exceeds_max_model_len, 1)
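A quick stand-alone illustration of why the in-place variant matters here:

import torch

seq_lens = torch.tensor([5, 9, 7])
exceeds_max_model_len = torch.tensor([False, True, False])

# Out-of-place: returns a new tensor; if the result is discarded, seq_lens
# is never actually clamped.
seq_lens.masked_fill(exceeds_max_model_len, 1)
print(seq_lens)  # tensor([5, 9, 7])

# In-place: mutates seq_lens directly.
seq_lens.masked_fill_(exceeds_max_model_len, 1)
print(seq_lens)  # tensor([5, 1, 7])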
But isn't it only the CPU tensor which needs to be copied?
Ok, I will validate whether only the CPU tensor needs to be copied.
@Ronald1995 I think it makes more sense to solve this problem by calling
Good work finding the root cause!
vllm/config/speculative.py (outdated)

        # default.

-        if self.method in MTP_MODEL_TYPES:
+        if self.method in get_args(MTPModelTypes):
This warning is printing erroneously since "mtp" was added to "MTPModelTypes"
Suggested change:
-        if self.method in get_args(MTPModelTypes):
+        if self.method in get_args(MTPModelTypes) and self.method != "mtp":
Ok, I will fix this.
vllm/engine/arg_utils.py (outdated)

                "async scheduling."
                "Currently, async scheduling is only supported "
                "with EAGLE/MTP kind of speculative decodeing and "
                "disable_padded_drafter_batch must to be false."
please make this a separate check and error message for clarity. Or, at least specify which constraint was not met in the error message.
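For example, something along these lines (a sketch only; the attribute names on the speculative config are assumptions):

# Sketch of separate checks with specific error messages; the attribute names
# on the speculative config are assumptions.
def validate_async_spec_args(async_scheduling: bool, spec_cfg) -> None:
    if not async_scheduling or spec_cfg is None:
        return
    if spec_cfg.method not in ("eagle", "eagle3", "mtp"):
        raise ValueError(
            "Async scheduling currently only supports EAGLE/MTP-style "
            f"speculative decoding, got method={spec_cfg.method!r}.")
    if spec_cfg.disable_padded_drafter_batch:
        raise ValueError(
            "Async scheduling with speculative decoding requires "
            "disable_padded_drafter_batch=False.")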
Ok, I will fix this.
Thanks @Ronald1995 @benchislett for all of the work on this! I am taking a look now too, and I think it's important for @WoosukKwon to review. One thing missing is an e2e CI test covering this. It should be added to https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/test_async_sched_and_preempt.py so that we also test e2e permutations of this in conjunction with request preemption, penalty sampling parameters, and (soon to be merged) structured outputs. We should also have an e2e test that verifies the acceptance rate matches when running with/without.
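As a starting point, something like the following (a rough sketch; the model name, speculative_config keys, and the async_scheduling engine arg are placeholders/assumptions that would need to match the real test setup):

# Rough sketch of an e2e check; all names below are placeholders/assumptions.
from vllm import LLM, SamplingParams

def test_spec_decode_matches_with_async_scheduling():
    prompts = ["The capital of France is", "Explain KV caching in one sentence:"]
    params = SamplingParams(temperature=0.0, max_tokens=64)
    spec = {"method": "eagle3", "num_speculative_tokens": 3}

    # Greedy outputs with sync scheduling as the reference.
    ref = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
              speculative_config=spec).generate(prompts, params)
    # Same config with async scheduling enabled.
    out = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
              speculative_config=spec,
              async_scheduling=True).generate(prompts, params)

    for r, o in zip(ref, out):
        assert r.outputs[0].text == o.outputs[0].text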
Thanks very much for all of the work on this @Ronald1995.
I have not yet reviewed the changes to gpu_model_runner.py, which is the part of most concern given the complexity. The changes outside of that look ok at least!
    def _update_computed_tokens(
        self,
        request: Request,
        scheduled_spec_token_ids: list[int],
        generated_token_ids: list[int],
        spec_decoding_status: SpecDecodingStats | None,
    ):
        num_draft_tokens = len(scheduled_spec_token_ids)
        num_accepted = len(generated_token_ids) - 1
        num_rejected = num_draft_tokens - num_accepted
        # num_computed_tokens represents the number of tokens
        # processed in the current step, considering scheduled
        # tokens and rejections. If some tokens are rejected,
        # num_computed_tokens is decreased by the number of rejected
        # tokens.
        request.num_computed_tokens -= num_rejected
        spec_decoding_stats = self.make_spec_decoding_stats(
            spec_decoding_status,
            num_draft_tokens=num_draft_tokens,
            num_accepted_tokens=num_accepted,
        )
        return spec_decoding_stats
I think duplication can be reduced here, perhaps keep this part outside of the method:
num_draft_tokens = len(scheduled_spec_token_ids)
num_accepted = len(generated_token_ids) - 1
num_rejected = num_draft_tokens - num_accepted
and then in the async_scheduler override, just update the placeholder count and then call
return super()._update_computed_tokens(...)
Ok, I will fix this.
        return engine_core_outputs

    def _update_computed_tokens(
Suggested change:
-    def _update_computed_tokens(
+    def _update_computed_tokens_after_spec(
        # when using async scheduling we can't get draft token ids in advance,
        # so we update draft token ids in the worker process and don't
        # need to update draft token ids here.
        if self.use_spec_decode and model_executed and not self.async_scheduling:
I'm not sure about this, but would it make sense for the model executor to just return None from take_draft_token_ids in the async scheduling case? Then no changes are needed to this file.
Because with async_scheduling the draft_token_ids are assigned to the request directly in gpu_model_runner, it never calls take_draft_token_ids in the model executor; this saves the time of copying draft_token_ids from GPU to CPU.
            if self.num_spec_tokens > 0:
                request.spec_token_ids = [-1] * self.num_spec_tokens
perhaps simplify to
Suggested change:
-            if self.num_spec_tokens > 0:
-                request.spec_token_ids = [-1] * self.num_spec_tokens
+            request.spec_token_ids = [-1] * self.num_spec_tokens
Ok, I will fix it.
        spec_decode_tokens = scheduler_output.scheduled_spec_decode_tokens
        for req_id in scheduler_output.num_scheduled_tokens:
            request = self.requests[req_id]
            spec_tokens = len(spec_decode_tokens.get(req_id, []))
Suggest renaming the var for clarity
Suggested change:
-            spec_tokens = len(spec_decode_tokens.get(req_id, []))
+            cur_num_spec_tokens = len(spec_decode_tokens.get(req_id, ()))
Ok, I will fix it.
vllm/v1/core/sched/scheduler.py (outdated)

                del request.spec_token_ids[num_scheduled_spec_tokens:]
                scheduled_spec_decode_tokens[request.request_id] = (
-                    request.spec_token_ids
+                    request.spec_token_ids.copy()
Could you explain why the copy is needed here? (I'm not saying that it's unnecessary necessarily, I just haven't looked closely enough to understand why it is now needed)
request.spec_token_ids is updated in _update_after_schedule, so I thought the value of scheduled_spec_decode_tokens[request.request_id] could be modified by mistake. But I have validated that removing the copy is fine, so I will fix this.
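For reference, the aliasing concern looks like this in isolation (regardless of whether it actually bites here):

spec_token_ids = [11, 12, 13]
scheduled = {"req-0": spec_token_ids}          # stores the same list object
del spec_token_ids[1:]                         # later in-place truncation
print(scheduled["req-0"])                      # [11] -- the dict sees the change

spec_token_ids = [11, 12, 13]
scheduled = {"req-0": spec_token_ids.copy()}   # snapshot decoupled from the request
del spec_token_ids[1:]
print(scheduled["req-0"])                      # [11, 12, 13]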
            # when enable use_async_scheduling, we shouldn't use in place
            # operations in case they are modified in next step `prepare_input`
            # of main model.
            if self.use_async_scheduling:
                # Increment the sequence lengths.
                common_attn_metadata.seq_lens = common_attn_metadata.seq_lens + 1
                common_attn_metadata.seq_lens_cpu = (
                    common_attn_metadata.seq_lens_cpu + 1
                )
                # For the requests that exceed the max model length, we set the
                # sequence length to 1 to minimize their overheads in attention.
                common_attn_metadata.seq_lens.masked_fill(exceeds_max_model_len, 1)
            else:
                # Increment the sequence lengths.
                common_attn_metadata.seq_lens += 1
                common_attn_metadata.seq_lens_cpu += 1
                # For the requests that exceed the max model length, we set the
                # sequence length to 1 to minimize their overheads in attention.
                common_attn_metadata.seq_lens.masked_fill_(exceeds_max_model_len, 1)
So the race condition is related to the seq_lens_cpu tensor? I don't think anything should be needed apart from cloning it at the right place, if so (or ensuring a copy is made via other means, e.g. an out-of-place op). In particular, I don't think any change to the GPU tensors should be needed if they are only accessed in the main CUDA stream (e.g. common_attn_metadata.seq_lens).
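If it is indeed just the CPU tensor, the minimal change might look something like this (a sketch, assuming the out-of-place add is only needed for the host-side copy):

                # Sketch: keep the GPU tensor updates in place (main stream only),
                # and only make the CPU-side tensor a fresh copy so the next step's
                # prepare_input cannot overwrite data still being copied to the device.
                common_attn_metadata.seq_lens += 1                 # GPU, in place
                common_attn_metadata.seq_lens_cpu = (
                    common_attn_metadata.seq_lens_cpu + 1          # CPU, out of place -> new tensor
                )
                common_attn_metadata.seq_lens.masked_fill_(exceeds_max_model_len, 1)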
I will validate whether only the seq_lens_cpu tensor has race conditions.
Purpose
PR #19970 implements async scheduling, and PR #23569 implements prepare_input overlap on top of PR #19970. PR #24539 refactors the EAGLE spec decode logic so that it no longer relies on the CPU-side sampled token ids. This PR is based on #24539 and aims to support spec decode with async scheduling. When both async scheduling and spec decode are enabled, we no longer copy draft token ids back to the scheduler; instead we cache them in gpu_model_runner and use the cached _draft_token_ids to update input_ids directly for the next execute_model step. Because ngram and Medusa currently rely on the CPU-side sampled token ids (this could be refactored in the future), this PR only supports EAGLE spec decode with async scheduling.
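Conceptually, the flow looks roughly like this (a toy sketch; the runner attribute and method names are simplifications of the description above, not the exact implementation):

import torch

class AsyncSpecRunnerSketch:
    """Toy illustration of keeping draft tokens in the worker with async scheduling."""

    def __init__(self) -> None:
        self._draft_token_ids: torch.Tensor | None = None  # stays on the GPU

    def finish_step(self, draft_token_ids: torch.Tensor) -> None:
        # Instead of copying drafts to the CPU and returning them to the
        # scheduler, cache them in the runner for the next step.
        self._draft_token_ids = draft_token_ids

    def prepare_next_step(self, input_ids: torch.Tensor,
                          draft_positions: torch.Tensor) -> torch.Tensor:
        # Write the cached drafts directly into the next step's input_ids;
        # the scheduler only ever saw placeholder ids for these slots.
        if self._draft_token_ids is not None:
            input_ids[draft_positions] = self._draft_token_ids.reshape(-1)
        return input_ids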
Test Plan
We will run an e2e test.
Test config:
Test device: NVIDIA A100
Test Result
Performance
Precision
I compared the outputs of async_scheduling and sync_scheduling with speculative decoding; the outputs are exactly the same, so async_scheduling does not introduce a precision problem.