
Conversation

ApostaC
Collaborator

@ApostaC ApostaC commented Apr 2, 2025

APIS ARE SUBJECT TO CHANGE IN FOLLOW UPS

TL;DR:

This PR introduces the KV connector API in v1 to support disaggregated prefill. It also includes a minimal functional implementation as an example of how to use the connector API.

Detailed design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk

This PR is co-authored by:

TODOs in the upcoming PRs

Key design choices

  • Implement disagg prefill under the hood of v1's prefix caching and chunked prefill semantics:
    the vLLM scheduler calculates which set of tokens needs a KV store or KV load, and the workers perform the actual KV store or load operations.
  • Provide layer-wise async API support
  • KV cache prefetching and request orchestration should happen outside vLLM so that the changes in the core can be minimized

High-level design of the KV connector in v1

The figure below shows the high-level design of the connector:
[Figure: high-level design of the KV connector in v1]

In the design, every process in vLLM will have a corresponding connector. Specifically, we have:

  • Scheduler connector: the connector that lives in the same process as the scheduler. It schedules the KV cache transfer ops.
  • Worker connectors: the connectors that live in the worker processes. They execute the KV cache transfer ops.

Scheduler connector

On prefill nodes, the scheduler connector needs to parse the scheduler's output and determine what tokens should have their KV cache transmitted to the decoder nodes.

On decoder nodes, the scheduler connector needs to return the "correct" num_computed_tokens and computed_blocks when calling get_computed_tokens.
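
To make this concrete, the sketch below shows roughly what the scheduler-side hooks look like. The method names follow this PR's design doc, but (as noted above) the APIs are subject to change, and the bodies are placeholders rather than the real implementation:

class SchedulerConnectorSketch:
    # Decode nodes: report how many of this request's tokens already have KV
    # available externally (beyond the local prefix-cache hit), so the
    # scheduler can count them as computed instead of re-running prefill.
    def get_num_new_matched_tokens(self, request, num_computed_tokens: int) -> int:
        return 0  # placeholder: e.g. look the request up in external KV storage

    # Called after the scheduler allocates KV blocks for the request, so the
    # connector can record which blocks the external KV should be loaded into.
    def update_state_after_alloc(self, request, num_external_tokens: int) -> None:
        pass

    # Prefill nodes: parse the scheduler output and mark which tokens need a
    # KV store; decode nodes: mark which tokens need a KV load. The returned
    # metadata is shipped to the worker connectors along with the model input.
    def build_connector_meta(self, scheduler_output):
        return {"save": [], "load": []}  # placeholder metadata format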

Worker connector

The figure below shows how the worker connector works with the attention module to achieve layer-by-layer KV cache store and load:
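
Roughly, the hooks the attention module calls look like the sketch below (again, names follow this PR's design and may change in follow-ups; the bodies are placeholders):

class WorkerConnectorSketch:
    # Called once before the forward pass: kick off (possibly async) loads of
    # external KV into the paged-cache blocks chosen by the scheduler connector.
    def start_load_kv(self, forward_context) -> None:
        pass

    # Called by the attention module before a layer computes attention:
    # block until that layer's KV load has landed in the paged cache.
    def wait_for_layer_load(self, layer_name: str) -> None:
        pass

    # Called right after a layer writes its new KV into the paged cache:
    # start an (async) save of the KV for the tokens marked by the scheduler.
    def save_kv_layer(self, layer_name: str, kv_layer, attn_metadata) -> None:
        pass

    # Called at the end of the forward pass: drain all outstanding saves.
    def wait_for_save(self) -> None:
        pass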

Working with outside orchestrator

In more advanced use cases like xPyD, the connector may need to learn from the outside orchestrator which decoder node to send the KV cache to. We believe different infrastructure providers may have very different orchestration logic, and thus such logic should reside outside of vLLM.

The figure below explains the workflow among the orchestrator, vLLM, and the connector:

[Figure: workflow among the orchestrator, vLLM, and the connector]

At a high level, the orchestrator determines when to send each request and to which node. The connector may also give the orchestrator some feedback, such as "KV cache transfer finished" (depending on the implementation).
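
As a purely illustrative sketch (the endpoints, payload shapes, and the "transfer finished" signal below are assumptions, not part of this PR), an external orchestrator could look roughly like this:

import requests  # assumes both instances expose OpenAI-compatible HTTP APIs

PREFILL_URL = "http://prefill-node:8000/v1/completions"  # hypothetical endpoint
DECODE_URL = "http://decode-node:8001/v1/completions"    # hypothetical endpoint

def orchestrate(prompt: str, model: str) -> str:
    # 1. Send the request to a prefill instance with max_tokens=1, so it only
    #    produces the first token while its connector stores/sends the KV cache.
    prefill = requests.post(PREFILL_URL, json={
        "model": model, "prompt": prompt, "max_tokens": 1}).json()
    first_token = prefill["choices"][0]["text"]

    # 2. (Implementation-dependent) wait for the connector's feedback that the
    #    KV cache transfer for this request has finished.

    # 3. Forward prompt + first token to a decode instance, whose connector
    #    loads the transferred KV instead of recomputing the prefill.
    decode = requests.post(DECODE_URL, json={
        "model": model, "prompt": prompt + first_token, "max_tokens": 64}).json()
    return first_token + decode["choices"][0]["text"]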

For more details, please refer to our design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk

Extra note

  • This PR's goal is to ship the connector API with just a minimal functional implementation. We are working on a better (more performant, more stable) implementation, which will land in a new PR soon.


github-actions bot commented Apr 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation, ci/build, and v1 labels Apr 2, 2025

mergify bot commented Apr 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ApostaC.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 2, 2025
@ApostaC
Collaborator Author

ApostaC commented Apr 2, 2025

cc @KuntaiDu @YaoJiayi

Co-authored-by: KuntaiDu <[email protected]>
Co-authored-by: YaoJiayi <[email protected]>

Signed-off-by: ApostaC <[email protected]>
@ApostaC ApostaC force-pushed the local-dev/v1-disagg branch from b3d71bb to 6a12481 on April 2, 2025 19:11
@mergify mergify bot removed the needs-rebase label Apr 2, 2025
@ApostaC ApostaC changed the title [Core] KV Connector API for v1 disaggregated prefill support [Core][P/D Disagg] KV Connector API for v1 disaggregated prefill support Apr 2, 2025
@maobaolong
Contributor

maobaolong commented Apr 3, 2025

@ApostaC I cherry-picked this PR into our repo and ran the example. Here is my log:

INFO 04-03 07:33:59 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:33:59 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={}
INFO 04-03 07:33:59 [shared_storage_connector.py:92] Shared storage path is /tmp
Processed prompts:   0%|                                       | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:131] Start loading KV cache from the connector
Processed prompts: 100%|██████████████████████████| 4/4 [00:00<00:00, 50.53it/s, est. speed input: 38147.85 toks/s, output: 50.56 toks/s]
Prompt: 'Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is', Generated text: ' the'
Prompt: 'Hi Hi Hi Hi Hi ................i Hi Hi Hi The capital of France is', Generated text: ' the'
Prompt: 'Hey Hey..................Hey Your name is', Generated text: ' the'
Prompt: 'Hey Hey Hey ............... Hey Hey The capital of China is', Generated text: ' '
Saved 4 prompts to output.txt
[rank0]:[W403 07:34:00.128094287 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO 04-03 07:34:03 [__init__.py:239] Automatically detected platform cuda.
Loaded 4 prompts from output.txt
INFO 04-03 07:34:09 [config.py:591] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 04-03 07:34:10 [config.py:1712] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-03 07:34:10 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-03 07:34:10 [core.py:54] Initializing a V1 LLM engine (v0.7.4.dev711+gee96432c) with config: model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/disc/data1/Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 04-03 07:34:11 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ffe6832ea80>
INFO 04-03 07:34:11 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-03 07:34:11 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:34:11 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-03 07:34:11 [shared_storage_connector.py:92] Shared storage path is local_storage
INFO 04-03 07:34:11 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-03 07:34:11 [gpu_model_runner.py:1179] Starting to load model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct...
INFO 04-03 07:34:11 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.10it/s]

INFO 04-03 07:34:11 [loader.py:447] Loading weights took 0.50 seconds
INFO 04-03 07:34:12 [gpu_model_runner.py:1191] Model loading took 2.8871 GB and 0.585285 seconds
INFO 04-03 07:34:12 [kv_cache_utils.py:566] GPU KV cache size: 2,644,896 tokens
INFO 04-03 07:34:12 [kv_cache_utils.py:569] Maximum concurrency for 32,768 tokens per request: 80.72x
INFO 04-03 07:34:12 [core.py:152] init engine (profile, create kv cache, warmup model) took 0.88 seconds
INFO 04-03 07:34:12 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:34:12 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-03 07:34:12 [shared_storage_connector.py:92] Shared storage path is local_storage
Processed prompts:   0%|                                       | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
Processed prompts: 100%|█████████████████████████| 4/4 [00:00<00:00, 16.85it/s, est. speed input: 12724.93 toks/s, output: 168.48 toks/s]
Prompt: 'Hi Hi Hi Hi Hi......................Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is the', Generated text: ' answer: "The answer is: "The answer'
Prompt: 'Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi............. Hi Hi Hi Hi Hi The capital of France is the', Generated text: ' first step. 1. 202'
Prompt: 'Hey Hey Hey Hey ......... Hey Hey Hey Hey Hey Your name is the', Generated text: ' best way to find the best way to find the'
Prompt: 'Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' 100000000'
[rank0]:[W403 07:34:13.321916884 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Here are the key lines from the log:

Prompt: 'Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is', Generated text: ' the'
Prompt: 'Hi Hi Hi Hi Hi ................i Hi Hi Hi The capital of France is', Generated text: ' the'
Prompt: 'Hey Hey..................Hey Your name is', Generated text: ' the'
Prompt: 'Hey Hey Hey ............... Hey Hey The capital of China is', Generated text: ' '
Saved 4 prompts to output.txt
.....
....
Prompt: 'Hi Hi Hi Hi Hi......................Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is the', Generated text: ' answer: "The answer is: "The answer'
Prompt: 'Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi............. Hi Hi Hi Hi Hi The capital of France is the', Generated text: ' first step. 1. 202'
Prompt: 'Hey Hey Hey Hey ......... Hey Hey Hey Hey Hey Your name is the', Generated text: ' best way to find the best way to find the'
Prompt: 'Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' 100000000'

Why is the first generated token from the decode instance different from the first token generated by the prefill instance?

@ApostaC
Collaborator Author

ApostaC commented Apr 3, 2025

Why is the first generated token from the decode instance different from the first token generated by the prefill instance?

@maobaolong In the example, the prefill instance first generates a new token and "sends" the context + the newly generated token together to the decode instance. The decoder then starts generating based on that. Therefore, if you look at the "first generated token" on the prefill instance and the decode instance, they should be different.

Example:

  • prefill instance prompt: Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is
  • prefill instance generation: the
  • decoder instance prompt: Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is **the** (here, "the" is the first token generated on the prefill instance)
  • decoder generation: answer: "The answer is: "The answer

Hope this answers your question.

gpu_memory_utilization=0.8,
kv_transfer_config=KVTransferConfig.from_cli(
'{"kv_connector":"SharedStorageConnector","kv_role":"kv_both", '
'"kv_extra_config": {"shared_storage_path": "local_storage"}}')
Contributor


kv_extra_config -> kv_connector_extra_config

A minor issue
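
For reference, a corrected sketch of the example with the fixed key name (the surrounding arguments mirror the snippet above; treat the exact import paths as an assumption):

from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder; the log above used a local copy of this model
    gpu_memory_utilization=0.8,
    kv_transfer_config=KVTransferConfig.from_cli(
        '{"kv_connector":"SharedStorageConnector","kv_role":"kv_both", '
        '"kv_connector_extra_config": {"shared_storage_path": "local_storage"}}'),
)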

Contributor

@hasB4K hasB4K left a comment


Thanks for this PR 😃. Here are some small change proposals to restore support for V0 (broken by this PR).

@robertgshaw2-redhat
Collaborator

Should we just deprecate V0?

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) April 17, 2025 15:07
@robertgshaw2-redhat
Collaborator

May I ask when the branch will be merged?

Just getting tests green

@simon-mo simon-mo disabled auto-merge April 17, 2025 20:22
@simon-mo simon-mo merged commit 3408e47 into vllm-project:main Apr 17, 2025
50 of 55 checks passed
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: remi <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
@Huixxi

Huixxi commented Apr 23, 2025

But is there a demo to run? Can I run it like this?

VLLM_USE_V1=1 python3 -m vllm.entrypoints.openai.api_server --model xxx --port 8500 --max-model-len 8192 --gpu-memory-utilization 0.9 --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer"}' --trust-remote-code --tensor-parallel-size 8

@ApostaC
Collaborator Author

ApostaC commented Apr 23, 2025

But is there a demo to run? Can I run like this?

@Huixxi There is an example in #16625

@Huixxi

Huixxi commented Apr 24, 2025

But is there a demo to run? Can I run it like this?

@Huixxi There is an example in #16625

Thanks! Which branch of the source code should I use? Is https://github.com/ApostaC/vllm/tree/local-dev/lmcache-v1-connector-pr the right one? Does it support xPyD now? Multiple nodes? Also, which version of LMCache should I install, and how?

@khayamgondal

How do I run this with vLLM serve?
python3 -m vllm.entrypoints.openai.api_server --model /extended/downloaded/Meta-Llama-3.1-70B-Instruct-quantized.w8a8/ --kv-transfer-config '{"kv_connector":"LMCacheConnector", "kv_role":"kv_both"}' --max-model-len 4096 --gpu-memory-utilization 0.9
It gives this error:

llm.v1.worker.gpu_worker.Worker object at 0xf17daec353d0>
INFO 04-28 18:01:26 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
ERROR 04-28 18:01:26 [core.py:396] EngineCore failed to start.
ERROR 04-28 18:01:26 [core.py:396] Traceback (most recent call last):
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 04-28 18:01:26 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-28 18:01:26 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/engine/core.py", line 329, in __init__
ERROR 04-28 18:01:26 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/engine/core.py", line 64, in __init__
ERROR 04-28 18:01:26 [core.py:396]     self.model_executor = executor_class(vllm_config)
ERROR 04-28 18:01:26 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-28 18:01:26 [core.py:396]     self._init_executor()
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 04-28 18:01:26 [core.py:396]     self.collective_rpc("init_device")
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-28 18:01:26 [core.py:396]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-28 18:01:26 [core.py:396]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/utils.py", line 2456, in run_method
ERROR 04-28 18:01:26 [core.py:396]     return func(*args, **kwargs)
ERROR 04-28 18:01:26 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
ERROR 04-28 18:01:26 [core.py:396]     self.worker.init_device()  # type: ignore
ERROR 04-28 18:01:26 [core.py:396]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 135, in init_device
ERROR 04-28 18:01:26 [core.py:396]     init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 329, in init_worker_distributed_environment
ERROR 04-28 18:01:26 [core.py:396]     ensure_kv_transfer_initialized(vllm_config)
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
ERROR 04-28 18:01:26 [core.py:396]     _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
ERROR 04-28 18:01:26 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py", line 63, in create_connector_v1
ERROR 04-28 18:01:26 [core.py:396]     assert issubclass(connector_cls, KVConnectorBase_V1)
ERROR 04-28 18:01:26 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] AssertionError
Process EngineCore_0:
Traceback (most recent call last):
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 400, in run_engine_core
    raise e
  File "/workspace/vllm/vllm/v1/engine/core.py", line 387, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core.py", line 329, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/workspace/vllm/vllm/v1/engine/core.py", line 64, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 46, in _init_executor
    self.collective_rpc("init_device")
  File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/utils.py", line 2456, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
    self.worker.init_device()  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 135, in init_device
    init_worker_distributed_environment(self.vllm_config, self.rank,
  File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 329, in init_worker_distributed_environment
    ensure_kv_transfer_initialized(vllm_config)
  File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
    _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py", line 63, in create_connector_v1
    assert issubclass(connector_cls, KVConnectorBase_V1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
[rank0]:[W428 18:01:26.726953847 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
    uvloop.run(run_server(args))
  File "/workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
    return cls(
           ^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

@ApostaC
Collaborator Author

ApostaC commented Apr 28, 2025

@khayamgondal Please use LMCacheConnectorV1 rather than LMCacheConnector.
Here's the example script: https://github.com/vllm-project/vllm/blob/main/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
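
With that change, the command above becomes something like this (only the connector name changes; everything else is kept from your command):

python3 -m vllm.entrypoints.openai.api_server \
    --model /extended/downloaded/Meta-Llama-3.1-70B-Instruct-quantized.w8a8/ \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
    --max-model-len 4096 --gpu-memory-utilization 0.9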

@khayamgondal

khayamgondal commented Apr 28, 2025 via email

jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: remi <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: remi <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: remi <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Agata Dobrzyniewicz <[email protected]>
@zejun-chen

zejun-chen commented May 11, 2025

Hi,
With this PR, I got a failure when running the following script:
https://docs.vllm.ai/en/stable/getting_started/examples/disaggregated_prefill.html#disaggregated-prefill

The error is shown below:

ERROR 05-10 23:29:44 [core.py:396] EngineCore failed to start.
ERROR 05-10 23:29:44 [core.py:396] Traceback (most recent call last):
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-10 23:29:44 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-10 23:29:44 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 05-10 23:29:44 [core.py:396]     self.model_executor = executor_class(vllm_config)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-10 23:29:44 [core.py:396]     self._init_executor()
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 05-10 23:29:44 [core.py:396]     self.collective_rpc("init_device")
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-10 23:29:44 [core.py:396]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/utils.py", line 2456, in run_method
ERROR 05-10 23:29:44 [core.py:396]     return func(*args, **kwargs)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
ERROR 05-10 23:29:44 [core.py:396]     self.worker.init_device()  # type: ignore
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 135, in init_device
ERROR 05-10 23:29:44 [core.py:396]     init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 336, in init_worker_distributed_environment
ERROR 05-10 23:29:44 [core.py:396]     ensure_kv_transfer_initialized(vllm_config)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
ERROR 05-10 23:29:44 [core.py:396]     _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 66, in create_connector_v1
ERROR 05-10 23:29:44 [core.py:396]     assert issubclass(connector_cls, KVConnectorBase_V1)
ERROR 05-10 23:29:44 [core.py:396] AssertionError

The connector must be a subclass of KVConnectorBase_V1:

        connector_name = config.kv_transfer_config.kv_connector
        connector_cls = cls._registry[connector_name]()
        assert issubclass(connector_cls, KVConnectorBase_V1)

How should I modify the official PD examples? Do I need to use "kv_connector":"SharedStorageConnector"?

Thank you.

