[P/D][V1] KV Connector API V1 #15960
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Co-authored-by: KuntaiDu <[email protected]> Co-authored-by: YaoJiayi <[email protected]> Signed-off-by: ApostaC <[email protected]>
Force-pushed from b3d71bb to 6a12481
Signed-off-by: ApostaC <[email protected]>
vllm/distributed/kv_transfer/kv_connector/v1/shared_storage_connector.py
@ApostaC I cherry-picked this PR into our repo and ran the example. The following is my log.
I extracted the key log.
Why is the first token of …
@maobaolong In the example, the prefill instance first generates a new token and "sends" the context + the newly generated token together to the decode instance. The decoder then starts generating based on that. Therefore, if you look at the "first generated token" on prefill instance and decoder instance, they should be different. Example:
Hoping this answers your question.
gpu_memory_utilization=0.8,
kv_transfer_config=KVTransferConfig.from_cli(
    '{"kv_connector":"SharedStorageConnector","kv_role":"kv_both", '
    '"kv_extra_config": {"shared_storage_path": "local_storage"}}')
`kv_extra_config` -> `kv_connector_extra_config`

A minor issue.
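For illustration only, here is a sketch of what the corrected CLI JSON string would look like under the renamed field; the `kv_connector_extra_config` name comes from the review comment above, and the string is just parsed with the standard library to check its shape:

```python
import json

# Hypothetical corrected config string using the renamed field
# "kv_connector_extra_config" (per the review comment above).
kv_transfer_json = (
    '{"kv_connector": "SharedStorageConnector", "kv_role": "kv_both", '
    '"kv_connector_extra_config": {"shared_storage_path": "local_storage"}}'
)

config = json.loads(kv_transfer_json)
print(config["kv_connector_extra_config"]["shared_storage_path"])  # prints "local_storage"
```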
Thanks for this PR 😃. Here are some small proposed changes to restore V0 support (broken by this PR).
Should we just deprecate V0?
Just getting the tests green.
Signed-off-by: ApostaC <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: remi <[email protected]> Co-authored-by: [email protected] <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Rémi Delacourt <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: Yang Wang <[email protected]>
But is there a demo to run? Can I run it like this?
Thanks! Which branch of the source code should I use? Is it this one: https://github.com/ApostaC/vllm/tree/local-dev/lmcache-v1-connector-pr? Does it support xPyD now? Multiple nodes? And which version of LMCache should I install, and how?
How do I run this with vLLM serve?
@khayamgondal Please use LMCacheConnectorV1 rather than LMCacheConnector. Here's the example script: https://github.com/vllm-project/vllm/blob/main/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
Thanks, I just figured it out and was about to reply here.
Do you know if there is a way to find out the exact LMCache KV size? We can specify the max CPU memory and disk size, but is there a way to see the actual KV size in LMCache?
Hi, the error is shown below:
The connector must be initialized to be the subclass of the KVConnectorBase_V1
How can I modify the official P/D examples? Do I need to use "kv_connector":"SharedStorageConnector"? Thank you.
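For context, a self-contained sketch of the check behind that error: the v1 engine requires the configured connector class to subclass `KVConnectorBase_V1`, so a v0-only connector class fails validation. The base class and `validate` helper here are stand-ins for illustration, not the actual vLLM code:

```python
class KVConnectorBase_V1:  # stand-in for vLLM's actual base class
    pass

class GoodConnector(KVConnectorBase_V1):
    """A connector built against the v1 API."""

class LegacyV0Connector:
    """A connector built against the old v0 API (no v1 base class)."""

def validate(connector_cls) -> bool:
    # Mirrors the error message quoted above: the configured class
    # must be a subclass of KVConnectorBase_V1.
    if not issubclass(connector_cls, KVConnectorBase_V1):
        raise TypeError(
            "The connector must be initialized to be the subclass of "
            "the KVConnectorBase_V1")
    return True

print(validate(GoodConnector))  # prints True
```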
APIS ARE SUBJECT TO CHANGE IN FOLLOW UPS
TL;DR:
This PR opens the KV connector API in v1 to support disaggregated prefill. It also includes a minimal functional implementation as an example of how to use the connector API.
Detailed design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk
This PR is co-authored by:
TODOs in the upcoming PRs
Key design choices
The vLLM scheduler calculates which set of tokens needs a KV store or KV load, and the workers perform the actual KV store or load operations.
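This scheduler/worker split can be sketched as follows. This is a minimal illustration of the division of labor, not the actual vLLM implementation; all function names and data shapes here are assumptions:

```python
# Sketch: the scheduler decides WHICH tokens need a KV store/load,
# and the worker performs the actual transfer.

def scheduler_plan(num_prompt_tokens: int, num_external_tokens: int,
                   num_local_computed: int) -> dict:
    """Prefill side: store the whole prompt's KV. Decode side: load the
    KV for tokens the external store has beyond the local prefix cache."""
    return {
        "store_range": (0, num_prompt_tokens),
        "load_range": (num_local_computed,
                       min(num_external_tokens, num_prompt_tokens)),
    }

def worker_execute(plan: dict, kv_store: dict, request_id: str) -> int:
    """Worker side: carry out the store the scheduler planned."""
    start, end = plan["store_range"]
    kv_store[request_id] = list(range(start, end))  # stand-in for KV blocks
    return end - start

plan = scheduler_plan(num_prompt_tokens=8, num_external_tokens=8,
                      num_local_computed=3)
kv_store = {}
stored = worker_execute(plan, kv_store, "req-0")
print(plan["load_range"], stored)  # prints (3, 8) 8
```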
High-level design of the KV connector in v1
The figure below shows the high-level design of the connector

In the design, every process in vLLM will have a corresponding connector. Specifically, we have:

Scheduler connector

- On prefill nodes, the scheduler connector needs to parse the scheduler's output and determine which tokens should have their KV cache transmitted to the decoder nodes.
- On decoder nodes, the scheduler connector needs to return the "correct" `num_computed_tokens` and `computed_blocks` when calling `get_computed_tokens`.

Worker connector
The figure below shows how the worker connector works with the attention module to achieve layer-by-layer KV cache store and load:
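The layer-by-layer interaction can be sketched like this. The hook names (`start_load_kv`, `save_kv_layer`) are assumptions for illustration and the "attention computation" is elided; the point is only the per-layer ordering of load and save around each attention layer:

```python
# Sketch: before each attention layer runs, the worker connector can load
# that layer's KV; after it runs, it can save the freshly computed KV.

class SketchWorkerConnector:
    def __init__(self):
        self.log = []  # records the order of KV operations

    def start_load_kv(self, layer: int):
        self.log.append(f"load L{layer}")

    def save_kv_layer(self, layer: int):
        self.log.append(f"save L{layer}")

def run_forward(connector: SketchWorkerConnector, num_layers: int):
    for layer in range(num_layers):
        connector.start_load_kv(layer)   # load can overlap with compute
        # ... attention computation for this layer would run here ...
        connector.save_kv_layer(layer)   # stream out the new KV

conn = SketchWorkerConnector()
run_forward(conn, 2)
print(conn.log)  # prints ['load L0', 'save L0', 'load L1', 'save L1']
```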
Working with outside orchestrator
In more advanced use cases like xPyD, the connector may need to learn from the outside orchestrator which decoder node to send the KV cache to. We believe different infrastructure providers may have very different orchestration logic, and thus such logic should reside outside of vLLM.
The figure below explains the workflow among the orchestrator, vLLM, and the connector:
At a high level, the orchestrator should determine when to send the request to which node. Also, the connector may give the orchestrator some feedback, such as "KV cache transfer finished" (depending on the implementation).
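A hedged sketch of that workflow: the orchestrator picks a decode node for each request, and the connector can report back "KV cache transfer finished" events. The routing policy (least-loaded) and all names here are illustrative assumptions, since the PR deliberately leaves orchestration outside vLLM:

```python
# Sketch: orchestrator-side routing plus connector-side feedback.

def pick_decode_node(node_loads: dict) -> str:
    """One possible policy: route to the least-loaded decode node."""
    return min(node_loads, key=node_loads.get)

class TransferFeedback:
    """Collects 'KV cache transfer finished' signals from the connector."""
    def __init__(self):
        self.finished = set()

    def on_transfer_done(self, request_id: str):
        self.finished.add(request_id)

node = pick_decode_node({"decode-0": 3, "decode-1": 1})
feedback = TransferFeedback()
feedback.on_transfer_done("req-0")
print(node, "req-0" in feedback.finished)  # prints decode-1 True
```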
For more details, please refer to our design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk
Extra note