
Conversation

QierLi
Contributor

@QierLi QierLi commented Sep 16, 2025

Purpose

Propagate the complete token ids to the KV connector when a request resumes after being preempted. For such requests, the prefill token ids are actually [prompt token ids] + [token ids decoded before preemption].

I found it necessary to propagate these tokens - consistent with what newly scheduled requests already provide - while working on external prefix caching via the V1 KVConnector.

This PR adds an additional resumed_req_token_ids field, which is None (a no-op) in all other cases.
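
For reference, here is a minimal sketch of how the scheduler-side cached request data could carry this field, assuming a per-request list where non-resumed entries are None; the field names follow the diff discussed below, but this is not the exact vLLM definition.

from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedRequestData:
    req_ids: list[str]
    # True at index i if req_ids[i] was preempted and is now resuming.
    resumed_from_preemption: list[bool]
    new_token_ids: list[list[int]]
    # For a resumed request, the full prefill token ids:
    # [prompt token ids] + [token ids decoded before preemption].
    # Entries are None (no-op) for requests that were not resumed.
    resumed_req_token_ids: Optional[list[Optional[list[int]]]] = None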

Test Plan

Added a preemption -> resumption test case, test_priority_scheduling_preemption_and_resumption_when_out_of_kv, in test_scheduler.py.

Test Result

Existing and newly added tests pass.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@QierLi QierLi requested a review from heheda12345 as a code owner September 16, 2025 00:30

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the v1 label Sep 16, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly adds functionality to propagate all token IDs for resumed preempted requests when using a KVConnector, which is a valuable improvement for external persistent KV cache implementations. The changes are well-contained and follow the intended logic. However, I've identified a critical issue where a missing else block can lead to an IndexError in common configurations where neither pipeline parallelism nor a KV connector is used. I've provided a suggestion to fix this and make the logic more robust.

@robertgshaw2-redhat
Collaborator

Is this fixing a bug you ran into?

@QierLi
Contributor Author

QierLi commented Sep 16, 2025

Is this fixing a bug you ran into?

This added signal - populated only when a request resumes after preemption - is a prerequisite for supporting external prefix caching efficiently.

Comment on lines 90 to 92
# If resumed_from_preemption is True, propogate the token ids to the
# connector, otherwise will be empty.
token_ids: list[list[int]]
Collaborator


  1. Could you elaborate more why the kv connector needs this information?
  2. Can you rename it to resumed_req_token_ids or something like that?
  3. Empty lists still cost serialization and GC. Can you use None instead?

Contributor Author

@QierLi QierLi Sep 16, 2025


Sure. The token_ids (and their hashes), aligned with the block_ids, are required for the build_connector_meta() -> ... -> save_kv_layer() workflow of the external prefix caching KVConnector I've been working on.

For newly scheduled requests, token_ids can be extracted from scheduler_output.scheduled_new_reqs[idx].prompt_token_ids, but for resumed preempted requests the tokens are currently not propagated. Compared with this simple propagation, persisting the tokens in advance on the KVConnector side is not preferred and is error prone, since the prefill tokens of a resumed request are [prompt token ids] + [token ids decoded before preemption].

Changes 2 & 3 applied : ).

@QierLi QierLi requested a review from njhill September 24, 2025 03:45
@njhill
Member

njhill commented Sep 24, 2025

@QierLi could you sign off your commits for the DCO?

Would be good to get @WoosukKwon's final approval for this change.

@QierLi QierLi requested a review from ApostaC as a code owner September 25, 2025 07:11
@QierLi QierLi force-pushed the context_to_kv_connector branch from d3e3e40 to 2b2ec15 Compare September 25, 2025 07:15
req_ids=[],
resumed_from_preemption=[],
new_token_ids=[],
resumed_req_token_ids=[],
Collaborator


can you add a proper test case to make sure this is populated correctly?

Contributor Author


Sure. I created a new preemption -> resumption test case that covers this field:

pytest tests/v1/core/test_scheduler.py::test_priority_scheduling_preemption_and_resumption_when_out_of_kv
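
For illustration, the core assertion in such a test could look roughly like the following; the helper and attribute names are placeholders based on this thread, not the actual test code.

def assert_resumed_tokens_propagated(scheduler_output, request):
    cached = scheduler_output.scheduled_cached_reqs
    idx = cached.req_ids.index(request.request_id)
    assert cached.resumed_from_preemption[idx]
    # The propagated ids should equal the prompt tokens plus the tokens
    # decoded before the request was preempted.
    expected = list(request.prompt_token_ids) + list(request.output_token_ids)
    assert cached.resumed_req_token_ids[idx] == expected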

@QierLi QierLi requested a review from yeqcharlotte September 27, 2025 08:40
@QierLi QierLi changed the title [KVConnector] Propagate all tokens on resumed preemptions [Core][KVConnector] Propagate all tokens on resumed preemptions Sep 30, 2025

mergify bot commented Oct 3, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @QierLi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@QierLi QierLi force-pushed the context_to_kv_connector branch from b796205 to ed38059 Compare October 5, 2025 21:55
@QierLi
Contributor Author

QierLi commented Oct 5, 2025

Squashed and rebased. @WoosukKwon Could you review when you get a chance? Thanks : )


mergify bot commented Oct 8, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @QierLi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 8, 2025
Collaborator

@WoosukKwon WoosukKwon left a comment


LGTM. @QierLi Can you please rebase? Considering #26385, I think we can always include the whole token ids regardless of whether the connector is used.

@mergify mergify bot removed the needs-rebase label Oct 8, 2025
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 9, 2025
@DarkLight1337 DarkLight1337 merged commit d17f0fb into vllm-project:main Oct 9, 2025
46 checks passed
845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
