
Conversation

Ronald1995
Contributor

@Ronald1995 Ronald1995 commented Sep 13, 2025

Purpose

PR #19970 implements async_scheduling, and PR #23569 implements prepare_input overlap based on PR #19970. PR #24539 refactors the EAGLE spec decode logic so that it no longer relies on the CPU copy of the sampled token ids.

This PR is based on #24539 and aims to support spec decode with async_scheduling. When both async_scheduling and spec decode are enabled, we no longer copy the draft token ids back to the scheduler; instead we cache them in gpu_model_runner and update input_ids directly from _draft_token_ids for the next execute_model step.

Because ngram and medusa still rely on the CPU copy of the sampled token ids, they may be refactored in the future; for now this PR only supports EAGLE spec decode with async_scheduling.
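
To make the mechanism concrete, here is a minimal sketch of the idea (the names cached_draft_token_ids and draft_slots are illustrative, not the actual gpu_model_runner fields): the draft tokens stay on the GPU and are scattered straight into the next step's input_ids buffer, with no copy back to the scheduler.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Draft proposals cached from the previous execute_model call,
# one row per request (here: 2 requests x 2 speculative tokens).
cached_draft_token_ids = torch.tensor([[11, 12], [21, 22]], device=device)

# Flat input_ids buffer for the next step and the slots reserved for each
# request's draft tokens (shapes made up for the example).
input_ids = torch.zeros(8, dtype=torch.long, device=device)
draft_slots = torch.tensor([1, 2, 5, 6], device=device)

# Update input_ids directly from the cached draft tokens.
input_ids[draft_slots] = cached_draft_token_ids.flatten()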

Test Plan

We run the following e2e test.

  • async_scheduling + EAGLE-LLaMA3-Instruct-8B draft model, make sure it works well.

Test config:

# dataset is prm800k, read the jsonl and make prompts.
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0, max_tokens=1024)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    max_model_len=2048,
    max_num_seqs=128,
    max_num_batched_tokens=4096,
    async_scheduling=True, 
    speculative_config={
            "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
            "draft_tensor_parallel_size": 1,
            "num_speculative_tokens": 2,
            "method": "eagle",
        },
    seed=1234
)

Test device: NVIDIA A100

Test Result

Performance

num_prompts    async_scheduling (tps)    sync_scheduling (tps)    speedup
24             2356                      2314                     1.8%
48             3759                      3539                     6.2%
96             5110                      4770                     7.1%

Precision

I compared the outputs of async_scheduling and sync_scheduling with speculative decoding; they are exactly the same, so async_scheduling does not introduce any precision problem.
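
For reference, the precision check boils down to comparing the generated text of the two runs element by element; a sketch with an assumed helper name (compare_outputs) is below.

def compare_outputs(async_outputs, sync_outputs) -> bool:
    """Return True if the two runs produced identical completions.

    Both arguments are the lists of RequestOutput returned by LLM.generate(),
    one from a run with async_scheduling=True and one with it disabled.
    """
    assert len(async_outputs) == len(sync_outputs)
    return all(
        a.outputs[0].text == s.outputs[0].text
        for a, s in zip(async_outputs, sync_outputs)
    )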


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


mergify bot commented Sep 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 13, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for speculative decoding with asynchronous scheduling, which is a great feature enhancement. The core logic of handling draft tokens within the worker process for async scheduling is sound. However, I've identified a few critical issues in gpu_model_runner.py related to tensor manipulation for scatter operations that will likely cause runtime errors. There's also a minor logic error in how speculative token lists are truncated. The proposed fixes are straightforward. Once these issues are addressed, the implementation should be solid.

@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch 2 times, most recently from f417e8f to b530bf3 Compare September 13, 2025 07:57
@mergify mergify bot removed the needs-rebase label Sep 13, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 8172c2b to 163f9ab Compare September 13, 2025 09:42
@robertgshaw2-redhat robertgshaw2-redhat changed the title async_scheduling for sepc code [Core] Async Scheduling X Spec Decoding Compatibility Sep 13, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 4466156 to f971753 Compare September 15, 2025 01:29

mergify bot commented Sep 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 18, 2025
@Ronald1995 Ronald1995 changed the title [Core] Async Scheduling X Spec Decoding Compatibility [WIP][Core] Async Scheduling X Spec Decoding Compatibility Sep 19, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 13773be to 337aab8 Compare September 20, 2025 11:51
@Ronald1995 Ronald1995 requested a review from ApostaC as a code owner September 20, 2025 11:51
@mergify mergify bot removed the needs-rebase label Sep 20, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch 3 times, most recently from 3630428 to 3ad3c1b Compare September 21, 2025 09:20
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 0c5ea7a to 6cf56a7 Compare October 16, 2025 08:02
@benchislett
Collaborator

@Ronald1995 I think it might be related to the larger model causing a rare race condition more than it would be due to an MTP-specific difference, for the reasons you identified. But I have no concrete information on the cause of this regression besides the AR discrepancy issue I measured.

@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 6cf56a7 to 04b3625 Compare October 16, 2025 14:34
@Ronald1995
Contributor Author

@Ronald1995 I think it might be related to the larger model causing a rare race condition more than it would be due to an MTP-specific difference, for the reasons you identified. But I have no concrete information on the cause of this regression besides the AR discrepancy issue I measured.

@benchislett OK, I have fixed the issues from your recent review and added explanations for the questions.

As for this issue, you reminded me that you set --max-concurrency 1 for the bench client. PR #19970 shows that the speedup from async_scheduling is positively correlated with the number of scheduled requests. The async scheduler adds two extra threads and extra prepare_input_ids work, which costs some performance; if the speedup is smaller than that cost, the end-to-end performance can regress, especially for larger models, where the forward pass is longer and the relative speedup from async_scheduling is smaller.

That explains why the Total Token throughput metric of deepseek-r1 regresses with --max-concurrency 1 under async_scheduling; if we raise max-concurrency, the metric should improve. But the Avg Draft acceptance rate also regresses, which still confuses me. I will debug it and report the result later.


@benchislett I find that the bench server prints many lines of the logged acceptance-metrics text, and they vary irregularly, so I don't think the log you showed proves there is an accuracy issue. I compared the output content of sync scheduling and async scheduling on the prm800k_500 dataset:

  • Meta-Llama-3-8B-Instruct: eagle method, the outputs are exactly the same.
  • DeepSeek-V3-4layers-MTP-FP8: mtp method, the outputs are exactly the same.

So I believe this PR does not introduce accuracy issues. As for the performance loss, as I said, it is possible with --max-concurrency 1 for a larger model; if we raise max-concurrency, async_scheduling will show a speedup.

@benchislett
Collaborator

@Ronald1995 I think you are misunderstanding the issue. The problem appears to be that draft tokens are not being generated (or received) properly. The verification code is fine, but fewer tokens are accepted when using this feature (async sched + spec) than without (only spec). Running the same experiment with the flag on/off, I should see (almost) exactly the same number of drafted and accepted tokens. Instead, I get the following data (from my prev post):

Accepted: 1024 tokens, Drafted: 2034 tokens # Without async sched
Accepted: 656 tokens, Drafted: 2094 tokens # With async sched

This is not just a performance issue. It means that the draft tokens are getting rejected too often. For example, if there is a race condition and the verification buffer is not filled in time, some tokens in the input might not be updated in time and the verification could reject more readily. I think I have shown sufficient evidence to believe there is an issue here.
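
Doing the arithmetic on the counts quoted above makes the gap explicit (a quick check of the same numbers, not new data):

acceptance_without_async = 1024 / 2034  # ~0.503 accepted per drafted token
acceptance_with_async = 656 / 2094      # ~0.313 accepted per drafted token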

@benchislett
Collaborator

As you can see from the benchmark logs I posted, the engine iteration is actually observably faster when running with async scheduling:

Mean ITL (ms):                           14.45     
Median ITL (ms):                         14.45     
P99 ITL (ms):                            14.78   
...
Mean ITL (ms):                           14.02     
Median ITL (ms):                         13.98     
P99 ITL (ms):                            19.92     

but the TPOT is slower, due to fewer tokens being accepted:

Mean TPOT (ms):                          5.84      
Median TPOT (ms):                        5.61      
P99 TPOT (ms):                           7.65    

Mean TPOT (ms):                          6.93      
Median TPOT (ms):                        6.81      
P99 TPOT (ms):                           9.04

@Ronald1995
Contributor Author

@Ronald1995 I think you are misunderstanding the issue. The problem appears to be that draft tokens are not being generated (or received) properly. The verification code is fine, but fewer tokens are accepted when using this feature (async sched + spec) than without (only spec). Running the same experiment with the flag on/off, I should see (almost) exactly the same number of drafted and accepted tokens. Instead, I get the following data (from my prev post):

Accepted: 1024 tokens, Drafted: 2034 tokens # Without async sched
Accepted: 656 tokens, Drafted: 2094 tokens # With async sched

This is not just a performance issue. It means that the draft tokens are getting rejected too often. For example, if there is a race condition and the verification buffer is not filled in time, some tokens in the input might not be updated in time and the verification could reject more readily. I think I have shown sufficient evidence to believe there is an issue here.

OK, I got your point. I will reproduce your test and debug it.

@Ronald1995
Contributor Author

Ronald1995 commented Oct 20, 2025

@benchislett I ran some tests to reproduce your result; here are my findings.

  • use your original config
    result is the same as yours. async_scheduling has lower ITL but higher TPOT.
  • set num_speculative_tokens = 1
    server:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

client script is the same as yours.
async_scheduling result:

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  307.59    
Total input tokens:                      5535      
Total generated tokens:                  20375     
Request throughput (req/s):              0.26      
Output token throughput (tok/s):         66.24     
Peak output token throughput (tok/s):    39.00     
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          84.24     
---------------Time to First Token----------------
Mean TTFT (ms):                          89.46     
Median TTFT (ms):                        77.82     
P99 TTFT (ms):                           197.69    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.80     
Median TPOT (ms):                        14.51     
P99 TPOT (ms):                           16.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.45     
Median ITL (ms):                         26.42     
P99 ITL (ms):                            27.84     
==================================================

sync_scheduling:

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  316.42    
Total input tokens:                      5535      
Total generated tokens:                  20375     
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         64.39     
Peak output token throughput (tok/s):    37.00     
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          81.88     
---------------Time to First Token----------------
Mean TTFT (ms):                          74.65     
Median TTFT (ms):                        62.14     
P99 TTFT (ms):                           220.20    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.29     
Median TPOT (ms):                        15.00     
P99 TPOT (ms):                           17.32     
---------------Inter-token Latency----------------
Mean ITL (ms):                           27.33     
Median ITL (ms):                         27.31     
P99 ITL (ms):                            28.01     
==================================================

In this config, both ITL and TPOT improve with async_scheduling: ITL speedup 3.3%, TPOT speedup 3.3%.

  • Comment out the @support_torch_compile decorator on DeepSeekMTP (still with num_speculative_tokens = 1):

#@support_torch_compile
class DeepSeekMTP(nn.Module, SupportsPP):

server:

VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

client script is the same as yours.
async_scheduling result:

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  259.19    
Total input tokens:                      5535      
Total generated tokens:                  20375     
Request throughput (req/s):              0.31      
Output token throughput (tok/s):         78.61     
Peak output token throughput (tok/s):    34.00     
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          99.97     
---------------Time to First Token----------------
Mean TTFT (ms):                          105.27    
Median TTFT (ms):                        85.59     
P99 TTFT (ms):                           475.04    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.35     
Median TPOT (ms):                        11.91     
P99 TPOT (ms):                           16.17     
---------------Inter-token Latency----------------
Mean ITL (ms):                           30.40     
Median ITL (ms):                         30.20     
P99 ITL (ms):                            48.17     
==================================================

sync_scheduling result:

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  270.94    
Total input tokens:                      5535      
Total generated tokens:                  20375     
Request throughput (req/s):              0.30      
Output token throughput (tok/s):         75.20     
Peak output token throughput (tok/s):    32.00     
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          95.63     
---------------Time to First Token----------------
Mean TTFT (ms):                          81.04     
Median TTFT (ms):                        63.58     
P99 TTFT (ms):                           404.71    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.02     
Median TPOT (ms):                        12.57     
P99 TPOT (ms):                           17.34     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.00     
Median ITL (ms):                         32.01     
P99 ITL (ms):                            32.81     
==================================================

In this config, both ITL and TPOT improve with async_scheduling: ITL speedup 5.3%, TPOT speedup 5.4%.

  • set num_speculative_tokens = 3 and set enforce_eager=True
    server:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --enforce-eager --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

client script is the same as yours.
async_scheduling result:

============ Serving Benchmark Result ============
Successful requests:                     10        
Maximum request concurrency:             1         
Benchmark duration (s):                  174.77    
Total input tokens:                      789       
Total generated tokens:                  2560      
Request throughput (req/s):              0.06      
Output token throughput (tok/s):         14.65     
Peak output token throughput (tok/s):    7.00      
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          19.16     
---------------Time to First Token----------------
Mean TTFT (ms):                          929.44    
Median TTFT (ms):                        316.35    
P99 TTFT (ms):                           3583.42   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          64.89     
Median TPOT (ms):                        60.77     
P99 TPOT (ms):                           82.47     
---------------Inter-token Latency----------------
Mean ITL (ms):                           161.43    
Median ITL (ms):                         158.79    
P99 ITL (ms):                            225.90    
==================================================

sync_scheduling result:

============ Serving Benchmark Result ============
Successful requests:                     10        
Maximum request concurrency:             1         
Benchmark duration (s):                  168.98    
Total input tokens:                      789       
Total generated tokens:                  2560      
Request throughput (req/s):              0.06      
Output token throughput (tok/s):         15.15     
Peak output token throughput (tok/s):    7.00      
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          19.82     
---------------Time to First Token----------------
Mean TTFT (ms):                          622.80    
Median TTFT (ms):                        161.48    
P99 TTFT (ms):                           1925.25   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          63.82     
Median TPOT (ms):                        59.06     
P99 TPOT (ms):                           85.40     
---------------Inter-token Latency----------------
Mean ITL (ms):                           158.78    
Median ITL (ms):                         158.38    
P99 ITL (ms):                            219.83    
==================================================

There is a performance loss when async_scheduling is enabled with max_concurrency=1; I have verified that increasing max_concurrency makes async_scheduling faster. The key point is that when cudagraph is disabled, the pattern of lower ITL but higher TPOT no longer occurs.

I suspect there are hidden bugs in cudagraph with DeepSeekMTP; I need to spend more time to figure it out. But as for this PR, I have run a lot of tests and I think the implementation of async_scheduling with spec decoding itself is fine.

I will add an assertion to ensure that when async_scheduling is used with deepseek_mtp, num_speculative_tokens is less than or equal to 1, and add a TODO to fix this issue in another PR. With that in place, I hope this PR can be merged first. Please let me know what you think, thanks!

@benchislett
Collaborator

@Ronald1995 I am not fully convinced that this issue is resolved. I investigated further last week and I am still able to consistently reproduce the issue on blackwell. Adding a torch.cuda.synchronize() into the gpu_model_runner.execute_model code almost anywhere will alleviate the issue. As such I suspect there might be some problems overlapping the draft model prepare_inputs and the next iteration's prepare_inputs. I will take a closer look today and inspect the individual data structures to see if there is any problem.

If the EAGLE prepare_inputs and main model's prepare_inputs share any cpu-side data, I believe it might be possible that one of them could overwrite this data while the other has an async HtoD memcpy in-flight, leading to a race condition. We have an event in the main model's prepare_inputs to ensure that this does not happen between iterations of the main model, but there is intentionally no safeguard for this in the spec decoding PR. I will validate if this is the cause of the issue I am seeing, and investigate if so.

Otherwise, I am happy with the state of the PR and am hoping it can be merged this week. Thank you for your continued effort!
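
As a toy illustration of the kind of hazard described above (not vLLM code; the buffer names are made up): an async host-to-device copy is still in flight when the pinned CPU staging buffer is overwritten by the next prepare_inputs.

import torch

if torch.cuda.is_available():
    # Pinned CPU staging buffer shared between iterations (illustrative).
    cpu_buf = torch.zeros(1024, dtype=torch.int64, pin_memory=True)
    gpu_buf = torch.empty(1024, dtype=torch.int64, device="cuda")

    cpu_buf.fill_(1)
    gpu_buf.copy_(cpu_buf, non_blocking=True)  # copy of the "1"s is enqueued
    cpu_buf.fill_(2)  # overwritten too early: the in-flight copy may now
                      # transfer "2"s instead of the intended "1"s
    torch.cuda.synchronize()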

@benchislett
Collaborator

@Ronald1995 I have confirmed the issue and propose the following patch:

In gpu_model_runner.py:2908 (propose_draft_token_ids):

        elif self.speculative_config.use_eagle():
            if self.prepare_inputs_event is not None: # new
                self.prepare_inputs_event.synchronize() # new
            assert isinstance(self.drafter, EagleProposer)

This enforces a synchronization between prepare_inputs of the base model and the EAGLE drafter. With this patch, there can be no overlap between the draft model's prepare_inputs cpu execution and the gpu execution of the target model's prepare_inputs for the same step. It is still able to overlap between iterations, and profiling on nsight systems for Llama 3.1 8B with EAGLE3 indicates that this is not a significant block. CPU-GPU overlapping still occurs during the target model's forward pass. You are welcome to benchmark and profile with patch if you are suspicious of the performance implications.

While it would be nice to have identified a specific problematic shared buffer that can simply be replicated, this patch will serve in the meantime to alleviate the issue until the root cause can be identified. Even then, it might be beneficial to have this as a sanity check.
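
For anyone following along, here is a rough standalone sketch of how such an event guard works (the exact placement of record() inside vLLM's prepare_inputs is assumed here, not verified):

import torch

# Assumes a CUDA device; names mirror the patch but the bodies are illustrative.
prepare_inputs_event = torch.cuda.Event()

def prepare_inputs(cpu_buf: torch.Tensor, gpu_buf: torch.Tensor) -> None:
    # Enqueue the async H2D copy, then record the event on the same stream.
    gpu_buf.copy_(cpu_buf, non_blocking=True)
    prepare_inputs_event.record()

def propose_draft_token_ids() -> None:
    # Block the host until the recorded copies have completed before the
    # EAGLE drafter touches buffers shared with prepare_inputs.
    prepare_inputs_event.synchronize()
    # ... run the drafter ...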

Collaborator

@benchislett benchislett left a comment

An additional concern I have identified is that this PR does not support "disable_padded_drafter_batch". I believe two changes would be necessary to enable this:

  • Update self.valid_sampled_token_count_cpu in the disable_padded_drafter_batch pathway of propose_draft_token_ids following prepare_next_token_ids_cpu.
  • add an exception to if not self.use_async_scheduling in _bookkeeping_sync, since this is getting thrown off and leaving valid_sampled_token_ids as empty.

This is very similar work that would be needed to enable other speculative decoding methods from having overlapping support, and does not need to be included in this PR. However, in the meantime, please add some validation that will raise a warning/error if "disable_padded_drafter_batch" is enabled, since this currently seems to lead to an ugly crash.
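
A minimal sketch of the requested validation, with assumed argument names rather than vLLM's actual arg_utils.py structures:

from typing import Optional

def validate_async_spec_config(
    async_scheduling: bool,
    speculative_config: Optional[dict],
) -> None:
    # Reject the unsupported combination up front instead of crashing later.
    if not async_scheduling or speculative_config is None:
        return
    if speculative_config.get("disable_padded_drafter_batch", False):
        raise ValueError(
            "async_scheduling with speculative decoding requires "
            "disable_padded_drafter_batch=False."
        )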

@Ronald1995
Contributor Author

@Ronald1995 I have confirmed the issue and propose the following patch:

In gpu_model_runner.py:2908 (propose_draft_token_ids):

        elif self.speculative_config.use_eagle():
            if self.prepare_inputs_event is not None: # new
                self.prepare_inputs_event.synchronize() # new
            assert isinstance(self.drafter, EagleProposer)

This enforces a synchronization between prepare_inputs of the base model and the EAGLE drafter. With this patch, there can be no overlap between the draft model's prepare_inputs cpu execution and the gpu execution of the target model's prepare_inputs for the same step. It is still able to overlap between iterations, and profiling on nsight systems for Llama 3.1 8B with EAGLE3 indicates that this is not a significant block. CPU-GPU overlapping still occurs during the target model's forward pass. You are welcome to benchmark and profile with patch if you are suspicious of the performance implications.

While it would be nice to have identified a specific problematic shared buffer that can simply be replicated, this patch will serve in the meantime to alleviate the issue until the root cause can be identified. Even then, it might be beneficial to have this as a sanity check.

@benchislett Thanks for the information. I will do more performance testing with your proposed patch and try to identify the specific problematic shared buffer.

@Ronald1995
Contributor Author

An additional concern I have identified is that this PR does not support "disable_padded_drafter_batch". I believe two changes would be necessary to enable this:

  • Update self.valid_sampled_token_count_cpu in the disable_padded_drafter_batch pathway of propose_draft_token_ids following prepare_next_token_ids_cpu.
  • add an exception to if not self.use_async_scheduling in _bookkeeping_sync, since this is getting thrown off and leaving valid_sampled_token_ids as empty.

This is very similar work that would be needed to enable other speculative decoding methods from having overlapping support, and does not need to be included in this PR. However, in the meantime, please add some validation that will raise a warning/error if "disable_padded_drafter_batch" is enabled, since this currently seems to lead to an ugly crash.

ok, i will fix this.

@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch 2 times, most recently from f0b1a83 to a55d38f Compare October 21, 2025 14:55
@Ronald1995
Contributor Author

Ronald1995 commented Oct 21, 2025


@benchislett I have identified the specific problematic shared buffer. Please see the code in eagle.py; the root cause is the in-place operations in the propose function.

@Ronald1995
Contributor Author

Ronald1995 commented Oct 21, 2025


@benchislett I have added validation in arg_utils.py to ensure disable_padded_drafter_batch=False when async_scheduling is enabled.
As for the other two changes you suggested: do you mean that sync_scheduling would crash when disable_padded_drafter_batch=True? If so, I think those changes are unnecessary, because disable_padded_drafter_batch=True does not affect sync_scheduling in this PR, and I have tested this in my local environment.

@benchislett
Collaborator

@Ronald1995 sync scheduling should be functional and unchanged. I can confirm this but I'm pretty sure the only issue is spec + disable-padded-drafter-batch + async sched. Thanks for adding the validation

common_attn_metadata.seq_lens_cpu += 1
# For the requests that exceed the max model length, we set the
# sequence length to 1 to minimize their overheads in attention.
common_attn_metadata.seq_lens.masked_fill(exceeds_max_model_len, 1)
Member

@njhill njhill Oct 21, 2025

actually in-place is ok here because it's a new tensor

Suggested change
common_attn_metadata.seq_lens.masked_fill(exceeds_max_model_len, 1)
common_attn_metadata.seq_lens.masked_fill_(exceeds_max_model_len, 1)

But isn't it only the CPU tensor which needs to be copied?

Contributor Author

ok, i will validate if only cpu tensor needs to be copied.

@benchislett
Collaborator

@Ronald1995 I think it makes more sense to solve this problem by calling .clone() on the relevant tensors in prepare_inputs and prepare_inputs_padded in eagle.py. I plan to rewrite that logic into a custom kernel anyways, so it is preferable if those metadata are mutable in the first place. I confirmed myself that this implementation resolves the accuracy discrepancy also.

@benchislett
Collaborator

Good work finding the root cause!

# default.

if self.method in MTP_MODEL_TYPES:
if self.method in get_args(MTPModelTypes):
Collaborator

This warning is printing erroneously since "mtp" was added to "MTPModelTypes"

Suggested change
if self.method in get_args(MTPModelTypes):
if self.method in get_args(MTPModelTypes) and self.method != "mtp":

Contributor Author

ok, i will fix this.

"async scheduling."
"Currently, async scheduling is only supported "
"with EAGLE/MTP kind of speculative decodeing and "
"disable_padded_drafter_batch must to be false."
Collaborator

please make this a separate check and error message for clarity. Or, at least specify which constraint was not met in the error message.

Contributor Author

ok, i will fix this.

@njhill
Member

njhill commented Oct 21, 2025

Thanks @Ronald1995 @benchislett for all of the work on this! I am taking a look now too, and I think it's important for @WoosukKwon to review.

One thing missing is an e2e CI test covering this. It should be added to https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/test_async_sched_and_preempt.py so that we also test e2e permutations of this in conjunction with request preemption, penalty sampling parameters, and (soon to be merged) structured outputs.

We should also have an e2e test that verifies the acceptance rate matches when running with/without.

Member

@njhill njhill left a comment

Thanks very much for all of the work on this @Ronald1995.

I have not yet reviewed the changes to gpu_model_runner.py which is the part of most concern given the complexity. The changes outside of that look ok at least!

Comment on lines 1088 to 1112
def _update_computed_tokens(
    self,
    request: Request,
    scheduled_spec_token_ids: list[int],
    generated_token_ids: list[int],
    spec_decoding_status: SpecDecodingStats | None,
):
    num_draft_tokens = len(scheduled_spec_token_ids)
    num_accepted = len(generated_token_ids) - 1
    num_rejected = num_draft_tokens - num_accepted
    # num_computed_tokens represents the number of tokens
    # processed in the current step, considering scheduled
    # tokens and rejections. If some tokens are rejected,
    # num_computed_tokens is decreased by the number of rejected
    # tokens.
    request.num_computed_tokens -= num_rejected
    spec_decoding_stats = self.make_spec_decoding_stats(
        spec_decoding_status,
        num_draft_tokens=num_draft_tokens,
        num_accepted_tokens=num_accepted,
    )
    return spec_decoding_stats
Member

I think duplication can be reduced here, perhaps keep this part outside of the method:

        num_draft_tokens = len(scheduled_spec_token_ids)
        num_accepted = len(generated_token_ids) - 1
        num_rejected = num_draft_tokens - num_accepted

and then in the async_scheduler override, just update the placeholder count and then call

return super()._update_computed_tokens(...)

Contributor Author

ok, i will fix this.


return engine_core_outputs

def _update_computed_tokens(
Member

Suggested change
def _update_computed_tokens(
def _update_computed_tokens_after_spec(

Comment on lines +334 to +337
# when using async scheduling we can't get draft token ids in advance,
# so we update draft token ids in the worker process and don't
# need to update draft token ids here.
if self.use_spec_decode and model_executed and not self.async_scheduling:
Member

I'm not sure about this but would it make sense for the model executor to just return None from take_draft_token_ids in the async scheduling case? Then no changes are needed to this file.

Contributor Author

Because with async_scheduling the draft_token_ids are assigned to the request directly in gpu_model_runner, take_draft_token_ids is never called on the model executor, which saves the time of copying draft_token_ids from GPU to CPU.

Comment on lines 35 to 36
if self.num_spec_tokens > 0:
request.spec_token_ids = [-1] * self.num_spec_tokens
Member

perhaps simplify to

Suggested change
if self.num_spec_tokens > 0:
request.spec_token_ids = [-1] * self.num_spec_tokens
request.spec_token_ids = [-1] * self.num_spec_tokens

Contributor Author

ok, i will fix it.

spec_decode_tokens = scheduler_output.scheduled_spec_decode_tokens
for req_id in scheduler_output.num_scheduled_tokens:
request = self.requests[req_id]
spec_tokens = len(spec_decode_tokens.get(req_id, []))
Member

Suggest renaming the var for clarity

Suggested change
spec_tokens = len(spec_decode_tokens.get(req_id, []))
cur_num_spec_tokens = len(spec_decode_tokens.get(req_id, ()))

Contributor Author

ok, i will fix it

del request.spec_token_ids[num_scheduled_spec_tokens:]
scheduled_spec_decode_tokens[request.request_id] = (
request.spec_token_ids
request.spec_token_ids.copy()
Member

Could you explain why the copy is needed here? (I'm not saying that it's unnecessary necessarily, I just haven't looked closely enough to understand why it is now needed)

Contributor Author

request.spec_token_ids is updated in _update_after_schedule, so I thought the value of scheduled_spec_decode_tokens[request.request_id] might be modified by mistake. However, I have validated that removing the copy is fine; I will fix this.

Comment on lines +391 to +411
# when enable use_async_scheduling, we shouldn't use in place
# operations in case they are modified in next step `prepare_input`
# of main model.
if self.use_async_scheduling:
    # Increment the sequence lengths.
    common_attn_metadata.seq_lens = common_attn_metadata.seq_lens + 1
    common_attn_metadata.seq_lens_cpu = (
        common_attn_metadata.seq_lens_cpu + 1
    )
    # For the requests that exceed the max model length, we set the
    # sequence length to 1 to minimize their overheads in attention.
    common_attn_metadata.seq_lens.masked_fill(exceeds_max_model_len, 1)
else:
    # Increment the sequence lengths.
    common_attn_metadata.seq_lens += 1
    common_attn_metadata.seq_lens_cpu += 1
    # For the requests that exceed the max model length, we set the
    # sequence length to 1 to minimize their overheads in attention.
    common_attn_metadata.seq_lens.masked_fill_(exceeds_max_model_len, 1)
Member

So the race condition is related to the seq_lens_cpu tensor?

I don't think anything should be needed apart from to clone this at the right place if so (or ensure a copy is made via other means e.g. out-of-place op.)

In particular I don't think any change to the GPU tensors should be needed if they are only accessed in the main cuda stream (e.g. common_attn_metadata.seq_lens).

Contributor Author

I will validate whether only the seq_lens_cpu tensor has the race condition.

@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 1b36850 to f7da030 Compare October 22, 2025 14:04
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from f7da030 to 0eee271 Compare October 22, 2025 14:11