[Core][KVConnector] Propagate all tokens on resumed preemptions #24926
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically, and you can ask your reviewers to trigger select CI tests on top of it. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request correctly adds functionality to propagate all token IDs for resumed preempted requests when using a KVConnector, which is a valuable improvement for external persistent KV cache implementations. The changes are well-contained and follow the intended logic. However, I've identified a critical issue where a missing `else` block can lead to an `IndexError` in common configurations where neither pipeline parallelism nor a KV connector is used. I've provided a suggestion to fix this and make the logic more robust.
Is this fixing a bug you ran into?
This added signal, emitted only for resumed preemptions, is a prerequisite for supporting external prefix caching efficiently.
vllm/v1/core/sched/output.py (Outdated)
# If resumed_from_preemption is True, propagate the token ids to the
# connector, otherwise it will be empty.
token_ids: list[list[int]]
- Could you elaborate more on why the KV connector needs this information?
- Can you rename it to resumed_req_token_ids or something like that?
- Empty lists still cost serialization and GC. Can you use None instead?
Sure. The token_ids (and their hashes), aligned with the block_ids, are required for the build_connector_meta() -> ... -> save_kv_layer() workflow of the external prefix-caching KVConnector that I've been working on.
For normal requests, token_ids can be extracted from scheduler_output.scheduled_new_reqs[idx].prompt_token_ids.
But for resumed preempted requests, the tokens are currently not propagated. Compared to this simple propagation, persisting the tokens in advance on the KVConnector side is not preferred and is error-prone, since the prefill tokens of such requests are [prompt token ids] + [decoded token ids before preemption]. A rough sketch of this flow is shown below.
Changes 2 & 3 applied : ).
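For illustration, here is a minimal sketch (not the connector from this PR) of how a connector's build_connector_meta() path could recover the full token ids for every scheduled request. The field names scheduled_new_reqs, prompt_token_ids, scheduled_cached_reqs, req_ids, resumed_from_preemption, and resumed_req_token_ids come from the snippets in this conversation; the helper name, the per-request req_id attribute on new requests, and the returned dict are assumptions made for the sketch.

```python
# Illustrative sketch only; stands in for the metadata-building step of a
# prefix-caching KVConnector, not for the actual implementation.
def collect_full_token_ids(scheduler_output) -> dict[str, list[int]]:
    tokens_per_req: dict[str, list[int]] = {}

    # Newly scheduled requests: the prompt tokens are available directly.
    for new_req in scheduler_output.scheduled_new_reqs:
        tokens_per_req[new_req.req_id] = list(new_req.prompt_token_ids)

    # Cached requests: only those resumed from preemption carry their full
    # prefill tokens, i.e. prompt tokens + tokens decoded before preemption.
    cached = scheduler_output.scheduled_cached_reqs
    for i, req_id in enumerate(cached.req_ids):
        if cached.resumed_from_preemption[i]:
            tokens_per_req[req_id] = list(cached.resumed_req_token_ids[i])

    return tokens_per_req
```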
@QierLi could you sign off your commits for the DCO? Would be good to get @WoosukKwon's final approval for this change.
Force-pushed from d3e3e40 to 2b2ec15
req_ids=[],
resumed_from_preemption=[],
new_token_ids=[],
resumed_req_token_ids=[],
can you add a proper test case to make sure this is populated correctly?
Sure. Created a new test case covering preemption -> resumption, including this field:
pytest tests/v1/core/test_scheduler.py::test_priority_scheduling_preemption_and_resumption_when_out_of_kv
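The real test lives in tests/v1/core/test_scheduler.py and builds on that file's scheduler fixtures, which are not reproduced here. As a hedged sketch, the core assertion on the new field might look like the following; the per-entry None convention for non-resumed requests and the helper name are assumptions, not the exact test code.

```python
# Hypothetical assertion sketch; scheduler_output and prompt_token_ids would
# come from the scheduler test fixtures in test_scheduler.py.
def check_resumed_req_token_ids(scheduler_output, prompt_token_ids):
    cached = scheduler_output.scheduled_cached_reqs
    for i, resumed in enumerate(cached.resumed_from_preemption):
        if resumed:
            # A resumed request should carry its complete token list:
            # prompt tokens followed by tokens decoded before preemption.
            full_tokens = cached.resumed_req_token_ids[i]
            assert full_tokens is not None
            assert full_tokens[: len(prompt_token_ids)] == prompt_token_ids
        else:
            # Non-resumed requests carry no extra tokens (assumed None entry).
            assert cached.resumed_req_token_ids[i] is None
```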
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 6e7ceb2 to 519785a
Force-pushed from ac9f292 to b796205
Signed-off-by: Qier Li <[email protected]>
Force-pushed from b796205 to ed38059
Squashed and rebased. @WoosukKwon, could you review when you get a chance? Thanks : )
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Qier Li <[email protected]>
…-project#24926) Signed-off-by: Qier Li <[email protected]> Co-authored-by: Qier Li <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…-project#24926) Signed-off-by: Qier Li <[email protected]> Co-authored-by: Qier Li <[email protected]> Signed-off-by: Dhruvil Bhatt <[email protected]>
Purpose
Propagate the complete token ids to the KVConnector when a request resumes after being preempted; the prefill token ids of such a request are actually [prompt token ids] + [decoded token ids before it was preempted].
I found it necessary to propagate these tokens, aligned with the newly scheduled requests, while working on external prefix caching via the V1 KVConnector.
This PR adds an additional resumed_req_token_ids field, which is None (a no-op) in all other cases.
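For readers unfamiliar with the scheduler output, a rough sketch of how the new field might sit in CachedRequestData follows. Only the fields mentioned in this conversation are shown, and the exact annotations in vllm/v1/core/sched/output.py may differ; the Optional-entry shape is an assumption based on the description above.

```python
# Partial, illustrative sketch of CachedRequestData; not the real definition.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedRequestData:
    req_ids: list[str]
    resumed_from_preemption: list[bool]
    new_token_ids: list[list[int]]
    # For requests resumed from preemption: the full token list
    # (prompt tokens + tokens decoded before preemption).
    # None entries for every other request, so the common case stays cheap.
    resumed_req_token_ids: list[Optional[list[int]]]
```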
Test Plan
Added a test case, test_priority_scheduling_preemption_and_resumption_when_out_of_kv, covering preemption -> resumption in test_scheduler.py.
Test Result
Passed existing and newly added tests.