[Core] Async Scheduling X Spec Decoding Compatibility #24799
Conversation
Code Review
This pull request adds support for speculative decoding with asynchronous scheduling, which is a great feature enhancement. The core logic of handling draft tokens within the worker process for async scheduling is sound. However, I've identified a few critical issues in gpu_model_runner.py related to tensor manipulation for scatter operations that will likely cause runtime errors. There's also a minor logic error in how speculative token lists are truncated. The proposed fixes are straightforward. Once these issues are addressed, the implementation should be solid.
@Ronald1995 I think it might be related to the larger model causing a rare race condition more than it would be due to an MTP-specific difference, for the reasons you identified. But I have no concrete information on the cause of this regression besides the AR discrepancy issue I measured.
@benchislett Ok, I have fixed the issues you reviewed recently and added explanations for the questions. As for this issue, you reminded me that you set the
@benchislett I find that the bench server prints many lines of logged acceptance metrics, and they vary irregularly, so I think the log you showed may not prove there is an accuracy issue. I compared the output content for sync scheduling and async scheduling with the prm800k_500 dataset.
@Ronald1995 I think you are misunderstanding the issue. The problem appears to be that draft tokens are not being generated (or received) properly. The verification code is fine, but fewer tokens are accepted when using this feature (async sched + spec) than without (only spec). Running the same experiment with the flag on/off, I should see (almost) exactly the same number of drafted and accepted tokens. Instead, I get the following data (from my prev post):
This is not just a performance issue. It means that the draft tokens are getting rejected too often. For example, if there is a race condition and the verification buffer is not filled in time, some tokens in the input might not be updated in time and the verification could reject more readily. I think I have shown sufficient evidence to believe there is an issue here.
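For reference, this is the metric being compared; a tiny sketch with placeholder numbers purely for illustration, not measurements:

# Hypothetical helper for the comparison described above; the counters would
# come from each run's spec-decoding metrics.
def spec_decode_stats(num_drafted: int, num_accepted: int, num_steps: int):
    acceptance_rate = num_accepted / num_drafted      # fraction of drafts kept
    mean_accept_len = 1 + num_accepted / num_steps    # bonus token + accepted drafts per step
    return acceptance_rate, mean_accept_len

# With async scheduling toggled on/off over the same workload, both values
# should be (almost) identical; a consistent drop indicates drafts are being
# lost or rejected, not just a latency difference.
print(spec_decode_stats(num_drafted=30_000, num_accepted=24_000, num_steps=10_000))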
As you can see from the benchmark logs I posted, the engine iteration is actually observably faster when running with async scheduling:
but the TPOT is slower, due to fewer tokens being accepted:
Ok, I got your point. I will reproduce your test to debug it.
@benchislett I have run some tests to reproduce your result; here are my results.

Test 1, server:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
The client script is the same as yours.
sync_scheduling result:
In this config, both ITL and TPOT improve with async_scheduling: ITL by 3.3%, TPOT by 3.3%.

Test 2, same server command but with @support_torch_compile commented out on DeepSeekMTP:
#@support_torch_compile
class DeepSeekMTP(nn.Module, SupportsPP):
server: VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
The client script is the same as yours.
sync_scheduling result:
In this config, both ITL and TPOT improve with async_scheduling: ITL by 5.3%, TPOT by 5.4%.

Test 3, server:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --enforce-eager --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
The client script is the same as yours.
sync_scheduling result:
There is a performance loss when enabling async_scheduling with max_concurrency=1; I have verified that if max_concurrency is increased, async_scheduling does give a speedup. The point here is that with cudagraph disabled, the "lower ITL but higher TPOT" pattern does not occur. I suspect there are hidden bugs in cudagraph with DeepseekMTP, and I need to spend more time to figure that out. But as for this PR, I have run a lot of tests and I think the implementation of async_scheduling with spec decoding is itself fine. I will add an assertion in the code to ensure that when async_scheduling and deepseek_mtp are used together, num_speculative_tokens is less than or equal to 1, and add
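The guard I have in mind looks roughly like this (a sketch only; where it lives and the exact parameter names are assumptions, not the actual vLLM config fields):

# Sketch only: parameter names are assumptions, not the actual vLLM config fields.
def check_async_spec_compat(async_scheduling: bool,
                            spec_method: str | None,
                            num_speculative_tokens: int) -> None:
    if not async_scheduling or spec_method is None:
        return
    if spec_method == "deepseek_mtp" and num_speculative_tokens > 1:
        raise ValueError(
            "Async scheduling with DeepSeek MTP currently requires "
            "num_speculative_tokens <= 1 until the cudagraph interaction "
            "is understood.")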
@Ronald1995 I am not fully convinced that this issue is resolved. I investigated further last week and I am still able to consistently reproduce the issue on Blackwell. Adding a
If the EAGLE prepare_inputs and the main model's prepare_inputs share any CPU-side data, I believe it is possible that one of them could overwrite this data while the other has an async HtoD memcpy in flight, leading to a race condition. We have an event in the main model's prepare_inputs to ensure that this does not happen between iterations of the main model, but there is intentionally no such safeguard for this in the spec decoding PR. I will validate whether this is the cause of the issue I am seeing, and investigate if so. Otherwise, I am happy with the state of the PR and hope it can be merged this week. Thank you for your continued effort!
@Ronald1995 I have confirmed the issue and propose the following patch: In
This enforces a synchronization between prepare_inputs of the base model and the EAGLE drafter. With this patch, there can be no overlap between the draft model's prepare_inputs CPU execution and the GPU execution of the target model's prepare_inputs for the same step. Overlap across iterations is still possible, and profiling in Nsight Systems for Llama 3.1 8B with EAGLE3 indicates that this is not a significant block. CPU-GPU overlapping still occurs during the target model's forward pass. You are welcome to benchmark and profile with the patch if you are suspicious of the performance implications. While it would be nice to identify a specific problematic shared buffer that could simply be replicated, this patch will serve in the meantime to alleviate the issue until the root cause is identified. Even then, it might be beneficial to keep it as a sanity check.
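For reference, a rough sketch of the synchronization I have in mind (the exact hook points inside prepare_inputs are assumptions based on the description above):

import torch

# Event recorded after the target model's prepare_inputs issues its async
# HtoD copies; the drafter waits on it before touching any shared host buffers.
prepare_inputs_done = torch.cuda.Event()

def after_target_prepare_inputs() -> None:
    # Hypothetically called at the end of the target model's prepare_inputs.
    prepare_inputs_done.record()

def before_drafter_prepare_inputs() -> None:
    # Block the CPU until the recorded GPU work has completed, so the drafter's
    # CPU-side prepare_inputs cannot overwrite buffers with a copy in flight.
    prepare_inputs_done.synchronize()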
An additional concern I have identified is that this PR does not support "disable_padded_drafter_batch". I believe two changes would be necessary to enable this:
- Update self.valid_sampled_token_count_cpu in the disable_padded_drafter_batch pathway of propose_draft_token_ids, following prepare_next_token_ids_cpu.
- Add an exception to the "if not self.use_async_scheduling" check in _bookkeeping_sync, since it is getting thrown off and leaving valid_sampled_token_ids empty.

This is very similar to the work that would be needed to give other speculative decoding methods overlapping support, and does not need to be included in this PR. However, in the meantime, please add some validation that raises a warning/error if "disable_padded_drafter_batch" is enabled, since it currently seems to lead to an ugly crash.
@benchislett Thanks for the information. I will do more performance testing with your advised patch and try to identify the specific problematic shared buffer.
Ok, I will fix this.
@benchislett I have identified the specific problematic shared buffer. Please see the code in
@benchislett I have added validation in arg_utils.py to make sure
@Ronald1995 sync scheduling should be functional and unchanged. I can confirm this but I'm pretty sure the only issue is
        common_attn_metadata.seq_lens_cpu += 1
        # For the requests that exceed the max model length, we set the
        # sequence length to 1 to minimize their overheads in attention.
        common_attn_metadata.seq_lens.masked_fill(exceeds_max_model_len, 1)
actually in-place is ok here because it's a new tensor
Suggested change:
-        common_attn_metadata.seq_lens.masked_fill(exceeds_max_model_len, 1)
+        common_attn_metadata.seq_lens.masked_fill_(exceeds_max_model_len, 1)
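A quick stand-alone illustration of why the in-place variant matters here:

import torch

seq_lens = torch.tensor([5, 9, 7])
exceeds_max_model_len = torch.tensor([False, True, False])

# Out-of-place: returns a new tensor; if the result is discarded, seq_lens
# is never actually clamped.
seq_lens.masked_fill(exceeds_max_model_len, 1)
print(seq_lens)  # tensor([5, 9, 7])

# In-place: mutates seq_lens directly.
seq_lens.masked_fill_(exceeds_max_model_len, 1)
print(seq_lens)  # tensor([5, 1, 7])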
But isn't it only the CPU tensor which needs to be copied?
Ok, I will validate whether only the CPU tensor needs to be copied.
@Ronald1995 I think it makes more sense to solve this problem by calling
Good work finding the root cause!
vllm/config/speculative.py (outdated)

        # default.

-        if self.method in MTP_MODEL_TYPES:
+        if self.method in get_args(MTPModelTypes):
This warning is printing erroneously since "mtp" was added to "MTPModelTypes"
Suggested change:
-        if self.method in get_args(MTPModelTypes):
+        if self.method in get_args(MTPModelTypes) and self.method != "mtp":
Ok, I will fix this.
vllm/engine/arg_utils.py (outdated)

                "async scheduling."
                "Currently, async scheduling is only supported "
                "with EAGLE/MTP kind of speculative decodeing and "
                "disable_padded_drafter_batch must to be false."
please make this a separate check and error message for clarity. Or, at least specify which constraint was not met in the error message.
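For example, something along these lines (a sketch only; the attribute names on the speculative config are assumptions):

# Sketch of separate checks with specific error messages; the attribute names
# on the speculative config are assumptions.
def validate_async_spec_args(async_scheduling: bool, spec_cfg) -> None:
    if not async_scheduling or spec_cfg is None:
        return
    if spec_cfg.method not in ("eagle", "eagle3", "mtp"):
        raise ValueError(
            "Async scheduling currently only supports EAGLE/MTP-style "
            f"speculative decoding, got method={spec_cfg.method!r}.")
    if spec_cfg.disable_padded_drafter_batch:
        raise ValueError(
            "Async scheduling with speculative decoding requires "
            "disable_padded_drafter_batch=False.")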
Ok, I will fix this.
Thanks @Ronald1995 @benchislett for all of the work on this! I am taking a look now too, and I think it's important for @WoosukKwon to review. One thing missing is an e2e CI test covering this. It should be added to https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/test_async_sched_and_preempt.py so that we also test e2e permutations of this in conjunction with request preemption, penalty sampling parameters, and (soon to be merged) structured outputs. We should also have an e2e test that verifies the acceptance rate matches when running with/without.
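As a starting point, something like the following (a rough sketch; the model name, speculative_config keys, and the async_scheduling engine arg are placeholders/assumptions that would need to match the real test setup):

# Rough sketch of an e2e check; all names below are placeholders/assumptions.
from vllm import LLM, SamplingParams

def test_spec_decode_matches_with_async_scheduling():
    prompts = ["The capital of France is", "Explain KV caching in one sentence:"]
    params = SamplingParams(temperature=0.0, max_tokens=64)
    spec = {"method": "eagle3", "num_speculative_tokens": 3}

    # Greedy outputs with sync scheduling as the reference.
    ref = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
              speculative_config=spec).generate(prompts, params)
    # Same config with async scheduling enabled.
    out = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
              speculative_config=spec,
              async_scheduling=True).generate(prompts, params)

    for r, o in zip(ref, out):
        assert r.outputs[0].text == o.outputs[0].text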
Thanks very much for all of the work on this @Ronald1995.
I have not yet reviewed the changes to gpu_model_runner.py, which is the part of most concern given the complexity. The changes outside of that look ok at least!
    def _update_computed_tokens(
        self,
        request: Request,
        scheduled_spec_token_ids: list[int],
        generated_token_ids: list[int],
        spec_decoding_status: SpecDecodingStats | None,
    ):
        num_draft_tokens = len(scheduled_spec_token_ids)
        num_accepted = len(generated_token_ids) - 1
        num_rejected = num_draft_tokens - num_accepted
        # num_computed_tokens represents the number of tokens
        # processed in the current step, considering scheduled
        # tokens and rejections. If some tokens are rejected,
        # num_computed_tokens is decreased by the number of rejected
        # tokens.
        request.num_computed_tokens -= num_rejected
        spec_decoding_stats = self.make_spec_decoding_stats(
            spec_decoding_status,
            num_draft_tokens=num_draft_tokens,
            num_accepted_tokens=num_accepted,
        )
        return spec_decoding_stats
I think duplication can be reduced here, perhaps keep this part outside of the method:
num_draft_tokens = len(scheduled_spec_token_ids)
num_accepted = len(generated_token_ids) - 1
num_rejected = num_draft_tokens - num_accepted
and then in the async_scheduler override, just update the placeholder count and then call
return super()._update_computed_tokens(...)
Ok, I will fix this.
        return engine_core_outputs

    def _update_computed_tokens(
Suggested change:
-    def _update_computed_tokens(
+    def _update_computed_tokens_after_spec(
        # when using async scheduling we can't get draft token ids in advance,
        # so we update draft token ids in the worker process and don't
        # need to update draft token ids here.
        if self.use_spec_decode and model_executed and not self.async_scheduling:
I'm not sure about this, but would it make sense for the model executor to just return None from take_draft_token_ids in the async scheduling case? Then no changes are needed to this file.
Because with async_scheduling the draft_token_ids are assigned to the request directly in gpu_model_runner, it never calls take_draft_token_ids in the model executor; this saves the time of copying draft_token_ids from GPU to CPU.
            if self.num_spec_tokens > 0:
                request.spec_token_ids = [-1] * self.num_spec_tokens
perhaps simplify to
Suggested change:
-            if self.num_spec_tokens > 0:
-                request.spec_token_ids = [-1] * self.num_spec_tokens
+            request.spec_token_ids = [-1] * self.num_spec_tokens
Ok, I will fix it.
        spec_decode_tokens = scheduler_output.scheduled_spec_decode_tokens
        for req_id in scheduler_output.num_scheduled_tokens:
            request = self.requests[req_id]
            spec_tokens = len(spec_decode_tokens.get(req_id, []))
Suggest renaming the var for clarity
Suggested change:
-            spec_tokens = len(spec_decode_tokens.get(req_id, []))
+            cur_num_spec_tokens = len(spec_decode_tokens.get(req_id, ()))
Ok, I will fix it.
vllm/v1/core/sched/scheduler.py (outdated)

                del request.spec_token_ids[num_scheduled_spec_tokens:]
                scheduled_spec_decode_tokens[request.request_id] = (
-                    request.spec_token_ids
+                    request.spec_token_ids.copy()
Could you explain why the copy is needed here? (I'm not saying that it's unnecessary necessarily, I just haven't looked closely enough to understand why it is now needed)
request.spec_token_ids is updated in _update_after_schedule, so I thought the value of scheduled_spec_decode_tokens[request.request_id] could be modified by mistake. But I have validated that removing the copy is fine, so I will fix this.
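For reference, the aliasing concern looks like this in isolation (regardless of whether it actually bites here):

spec_token_ids = [11, 12, 13]
scheduled = {"req-0": spec_token_ids}          # stores the same list object
del spec_token_ids[1:]                         # later in-place truncation
print(scheduled["req-0"])                      # [11] -- the dict sees the change

spec_token_ids = [11, 12, 13]
scheduled = {"req-0": spec_token_ids.copy()}   # snapshot decoupled from the request
del spec_token_ids[1:]
print(scheduled["req-0"])                      # [11, 12, 13]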
            # when enable use_async_scheduling, we shouldn't use in place
            # operations in case they are modified in next step `prepare_input`
            # of main model.
            if self.use_async_scheduling:
                # Increment the sequence lengths.
                common_attn_metadata.seq_lens = common_attn_metadata.seq_lens + 1
                common_attn_metadata.seq_lens_cpu = (
                    common_attn_metadata.seq_lens_cpu + 1
                )
                # For the requests that exceed the max model length, we set the
                # sequence length to 1 to minimize their overheads in attention.
                common_attn_metadata.seq_lens.masked_fill(exceeds_max_model_len, 1)
            else:
                # Increment the sequence lengths.
                common_attn_metadata.seq_lens += 1
                common_attn_metadata.seq_lens_cpu += 1
                # For the requests that exceed the max model length, we set the
                # sequence length to 1 to minimize their overheads in attention.
                common_attn_metadata.seq_lens.masked_fill_(exceeds_max_model_len, 1)
So the race condition is related to the seq_lens_cpu tensor? I don't think anything should be needed apart from cloning it at the right place, if so (or ensuring a copy is made via other means, e.g. an out-of-place op). In particular, I don't think any change to the GPU tensors should be needed if they are only accessed in the main CUDA stream (e.g. common_attn_metadata.seq_lens).
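If it is indeed just the CPU tensor, the minimal change might look something like this (a sketch, assuming the out-of-place add is only needed for the host-side copy):

                # Sketch: keep the GPU tensor updates in place (main stream only),
                # and only make the CPU-side tensor a fresh copy so the next step's
                # prepare_input cannot overwrite data still being copied to the device.
                common_attn_metadata.seq_lens += 1                 # GPU, in place
                common_attn_metadata.seq_lens_cpu = (
                    common_attn_metadata.seq_lens_cpu + 1          # CPU, out of place -> new tensor
                )
                common_attn_metadata.seq_lens.masked_fill_(exceeds_max_model_len, 1)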
I will validate whether only the seq_lens_cpu tensor has race conditions.
Purpose
PR #19970 implements async scheduling, and PR #23569 implements prepare_input overlap on top of PR #19970. PR #24539 refactors the EAGLE spec decode logic so that it no longer relies on the CPU-side sampled token ids. This PR is based on #24539 and aims to support spec decode with async scheduling. When both async scheduling and spec decode are enabled, we no longer copy draft token ids back to the scheduler; instead we cache them in gpu_model_runner and use the cached _draft_token_ids to update input_ids directly for the next execute_model step. Because ngram and Medusa currently rely on the CPU-side sampled token ids (this could be refactored in the future), this PR only supports EAGLE spec decode with async scheduling.
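Conceptually, the flow looks roughly like this (a toy sketch; the runner attribute and method names are simplifications of the description above, not the exact implementation):

import torch

class AsyncSpecRunnerSketch:
    """Toy illustration of keeping draft tokens in the worker with async scheduling."""

    def __init__(self) -> None:
        self._draft_token_ids: torch.Tensor | None = None  # stays on the GPU

    def finish_step(self, draft_token_ids: torch.Tensor) -> None:
        # Instead of copying drafts to the CPU and returning them to the
        # scheduler, cache them in the runner for the next step.
        self._draft_token_ids = draft_token_ids

    def prepare_next_step(self, input_ids: torch.Tensor,
                          draft_positions: torch.Tensor) -> torch.Tensor:
        # Write the cached drafts directly into the next step's input_ids;
        # the scheduler only ever saw placeholder ids for these slots.
        if self._draft_token_ids is not None:
            input_ids[draft_positions] = self._draft_token_ids.reshape(-1)
        return input_ids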
Test Plan
We will run an e2e test.
Test config:
Test device: NVIDIA A100
Test Result
Performance
Precision
I compared the outputs of async_scheduling and sync_scheduling with speculative decoding; the outputs are exactly the same, so async_scheduling does not introduce a precision problem.