
Conversation

ApostaC
Collaborator

@ApostaC ApostaC commented Apr 2, 2025

APIS ARE SUBJECT TO CHANGE IN FOLLOW UPS

TL;DR:

This PR introduces the KV connector API in v1 to support disaggregated prefill. It also includes a minimal functional implementation as an example of how to use the connector API.

Detailed design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk

This PR is co-authored by:

TODOs in the upcoming PRs

Key design choices

  • Implement disagg prefill under the hood of v1's prefix caching and chunked prefill semantics:
    the vLLM scheduler calculates which set of tokens needs a KV store or KV load, and the workers perform the actual KV store or load operations.
  • Provide layer-wise async API support
  • KV cache prefetching and request orchestration should happen outside vLLM so that the changes in the core can be minimized

High-level design of the KV connector in v1

The figure below shows the high-level design of the connector:
[Figure: high-level design of the KV connector in v1]

In the design, every process in vLLM will have a corresponding connector. Specifically, we have:

  • Scheduler connector: the connector that lives in the same process as the scheduler. It schedules the KV cache transfer ops.
  • Worker connectors: the connectors that live in the worker processes. They execute the KV cache transfer ops.

Scheduler connector

On prefill nodes, the scheduler connector needs to parse the scheduler's output and determine what tokens should have their KV cache transmitted to the decoder nodes.

On decoder nodes, the scheduler connector needs to return the "correct" num_computed_tokens and computed_blocks when calling get_computed_tokens.
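
To make this concrete, the sketch below shows roughly what the scheduler-side hooks look like. The method names follow this PR's design doc, but (as noted above) the APIs are subject to change, and the bodies are placeholders rather than the real implementation:

class SchedulerConnectorSketch:
    # Decode nodes: report how many of this request's tokens already have KV
    # available externally (beyond the local prefix-cache hit), so the
    # scheduler can count them as computed instead of re-running prefill.
    def get_num_new_matched_tokens(self, request, num_computed_tokens: int) -> int:
        return 0  # placeholder: e.g. look the request up in external KV storage

    # Called after the scheduler allocates KV blocks for the request, so the
    # connector can record which blocks the external KV should be loaded into.
    def update_state_after_alloc(self, request, num_external_tokens: int) -> None:
        pass

    # Prefill nodes: parse the scheduler output and mark which tokens need a
    # KV store; decode nodes: mark which tokens need a KV load. The returned
    # metadata is shipped to the worker connectors along with the model input.
    def build_connector_meta(self, scheduler_output):
        return {"save": [], "load": []}  # placeholder metadata format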

Worker connector

The figure below shows how the worker connector works with the attention module to achieve layer-by-layer KV cache store and load:
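
Roughly, the hooks the attention module calls look like the sketch below (again, names follow this PR's design and may change in follow-ups; the bodies are placeholders):

class WorkerConnectorSketch:
    # Called once before the forward pass: kick off (possibly async) loads of
    # external KV into the paged-cache blocks chosen by the scheduler connector.
    def start_load_kv(self, forward_context) -> None:
        pass

    # Called by the attention module before a layer computes attention:
    # block until that layer's KV load has landed in the paged cache.
    def wait_for_layer_load(self, layer_name: str) -> None:
        pass

    # Called right after a layer writes its new KV into the paged cache:
    # start an (async) save of the KV for the tokens marked by the scheduler.
    def save_kv_layer(self, layer_name: str, kv_layer, attn_metadata) -> None:
        pass

    # Called at the end of the forward pass: drain all outstanding saves.
    def wait_for_save(self) -> None:
        pass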

Working with outside orchestrator

In more advanced use cases like xPyD, the connector may need to learn from the outside orchestrator which decoder node to send the KV cache to. We believe different infrastructure providers may have very different orchestration logic, and thus such logic should reside outside of vLLM.

The figure below explains the workflow among the orchestrator, vLLM, and the connector:

[Figure: workflow among the orchestrator, vLLM, and the connector]

At a high level, the orchestrator determines when to send each request and to which node. The connector may also give the orchestrator some feedback, such as "KV cache transfer finished" (depending on the implementation).
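
As a purely illustrative sketch (the endpoints, payload shapes, and the "transfer finished" signal below are assumptions, not part of this PR), an external orchestrator could look roughly like this:

import requests  # assumes both instances expose OpenAI-compatible HTTP APIs

PREFILL_URL = "http://prefill-node:8000/v1/completions"  # hypothetical endpoint
DECODE_URL = "http://decode-node:8001/v1/completions"    # hypothetical endpoint

def orchestrate(prompt: str, model: str) -> str:
    # 1. Send the request to a prefill instance with max_tokens=1, so it only
    #    produces the first token while its connector stores/sends the KV cache.
    prefill = requests.post(PREFILL_URL, json={
        "model": model, "prompt": prompt, "max_tokens": 1}).json()
    first_token = prefill["choices"][0]["text"]

    # 2. (Implementation-dependent) wait for the connector's feedback that the
    #    KV cache transfer for this request has finished.

    # 3. Forward prompt + first token to a decode instance, whose connector
    #    loads the transferred KV instead of recomputing the prefill.
    decode = requests.post(DECODE_URL, json={
        "model": model, "prompt": prompt + first_token, "max_tokens": 64}).json()
    return first_token + decode["choices"][0]["text"]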

For more details, please refer to our design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk

Extra note

  • This PR's goal is to ship the connector API with just a minimal functional implementation. We are working on a better (more performant, more stable) implementation, which will land in a new PR soon.


github-actions bot commented Apr 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation, ci/build, and v1 labels Apr 2, 2025

mergify bot commented Apr 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ApostaC.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 2, 2025
@ApostaC
Collaborator Author

ApostaC commented Apr 2, 2025

cc @KuntaiDu @YaoJiayi

Co-authored-by: KuntaiDu <[email protected]>
Co-authored-by: YaoJiayi <[email protected]>

Signed-off-by: ApostaC <[email protected]>
@ApostaC ApostaC force-pushed the local-dev/v1-disagg branch from b3d71bb to 6a12481 on April 2, 2025 19:11
@mergify mergify bot removed the needs-rebase label Apr 2, 2025
@ApostaC ApostaC changed the title [Core] KV Connector API for v1 disaggregated prefill support [Core][P/D Disagg] KV Connector API for v1 disaggregated prefill support Apr 2, 2025
@maobaolong
Contributor

maobaolong commented Apr 3, 2025

@ApostaC I cherry-picked this PR into our repo and ran the example. Here is my log:

INFO 04-03 07:33:59 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:33:59 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={}
INFO 04-03 07:33:59 [shared_storage_connector.py:92] Shared storage path is /tmp
Processed prompts:   0%|                                       | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:131] Start loading KV cache from the connector
Processed prompts: 100%|██████████████████████████| 4/4 [00:00<00:00, 50.53it/s, est. speed input: 38147.85 toks/s, output: 50.56 toks/s]
Prompt: 'Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is', Generated text: ' the'
Prompt: 'Hi Hi Hi Hi Hi ................i Hi Hi Hi The capital of France is', Generated text: ' the'
Prompt: 'Hey Hey..................Hey Your name is', Generated text: ' the'
Prompt: 'Hey Hey Hey ............... Hey Hey The capital of China is', Generated text: ' '
Saved 4 prompts to output.txt
[rank0]:[W403 07:34:00.128094287 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO 04-03 07:34:03 [__init__.py:239] Automatically detected platform cuda.
Loaded 4 prompts from output.txt
INFO 04-03 07:34:09 [config.py:591] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 04-03 07:34:10 [config.py:1712] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-03 07:34:10 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-03 07:34:10 [core.py:54] Initializing a V1 LLM engine (v0.7.4.dev711+gee96432c) with config: model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/disc/data1/Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 04-03 07:34:11 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ffe6832ea80>
INFO 04-03 07:34:11 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-03 07:34:11 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:34:11 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-03 07:34:11 [shared_storage_connector.py:92] Shared storage path is local_storage
INFO 04-03 07:34:11 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-03 07:34:11 [gpu_model_runner.py:1179] Starting to load model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct...
INFO 04-03 07:34:11 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.10it/s]

INFO 04-03 07:34:11 [loader.py:447] Loading weights took 0.50 seconds
INFO 04-03 07:34:12 [gpu_model_runner.py:1191] Model loading took 2.8871 GB and 0.585285 seconds
INFO 04-03 07:34:12 [kv_cache_utils.py:566] GPU KV cache size: 2,644,896 tokens
INFO 04-03 07:34:12 [kv_cache_utils.py:569] Maximum concurrency for 32,768 tokens per request: 80.72x
INFO 04-03 07:34:12 [core.py:152] init engine (profile, create kv cache, warmup model) took 0.88 seconds
INFO 04-03 07:34:12 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:34:12 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-03 07:34:12 [shared_storage_connector.py:92] Shared storage path is local_storage
Processed prompts:   0%|                                       | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
Processed prompts: 100%|█████████████████████████| 4/4 [00:00<00:00, 16.85it/s, est. speed input: 12724.93 toks/s, output: 168.48 toks/s]
Prompt: 'Hi Hi Hi Hi Hi......................Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is the', Generated text: ' answer: "The answer is: "The answer'
Prompt: 'Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi............. Hi Hi Hi Hi Hi The capital of France is the', Generated text: ' first step. 1. 202'
Prompt: 'Hey Hey Hey Hey ......... Hey Hey Hey Hey Hey Your name is the', Generated text: ' best way to find the best way to find the'
Prompt: 'Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' 100000000'
[rank0]:[W403 07:34:13.321916884 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Here are the key lines from the log:

Prompt: 'Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is', Generated text: ' the'
Prompt: 'Hi Hi Hi Hi Hi ................i Hi Hi Hi The capital of France is', Generated text: ' the'
Prompt: 'Hey Hey..................Hey Your name is', Generated text: ' the'
Prompt: 'Hey Hey Hey ............... Hey Hey The capital of China is', Generated text: ' '
Saved 4 prompts to output.txt
.....
....
Prompt: 'Hi Hi Hi Hi Hi......................Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is the', Generated text: ' answer: "The answer is: "The answer'
Prompt: 'Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi............. Hi Hi Hi Hi Hi The capital of France is the', Generated text: ' first step. 1. 202'
Prompt: 'Hey Hey Hey Hey ......... Hey Hey Hey Hey Hey Your name is the', Generated text: ' best way to find the best way to find the'
Prompt: 'Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' 100000000'

Why is the first generated token from the decode instance different from the first token generated by the prefill instance?

@ApostaC
Collaborator Author

ApostaC commented Apr 3, 2025

Why is the first generated token from the decode instance different from the first token generated by the prefill instance?

@maobaolong In the example, the prefill instance first generates a new token and "sends" the context + the newly generated token together to the decode instance. The decoder then starts generating based on that. Therefore, if you look at the "first generated token" on the prefill instance and the decode instance, they should be different.

Example:

  • prefill instance prompt: Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is
  • prefill instance generation: the
  • decoder instance prompt: Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is **the** (here, "the" is the first token generated on the prefill instance)
  • decoder generation: answer: "The answer is: "The answer

Hope this answers your question.

gpu_memory_utilization=0.8,
kv_transfer_config=KVTransferConfig.from_cli(
'{"kv_connector":"SharedStorageConnector","kv_role":"kv_both", '
'"kv_extra_config": {"shared_storage_path": "local_storage"}}')
Contributor


kv_extra_config -> kv_connector_extra_config

A minor issue
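
For reference, a corrected sketch of the example with the fixed key name (the surrounding arguments mirror the snippet above; treat the exact import paths as an assumption):

from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder; the log above used a local copy of this model
    gpu_memory_utilization=0.8,
    kv_transfer_config=KVTransferConfig.from_cli(
        '{"kv_connector":"SharedStorageConnector","kv_role":"kv_both", '
        '"kv_connector_extra_config": {"shared_storage_path": "local_storage"}}'),
)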

Contributor

@hasB4K hasB4K left a comment


Thanks for this PR 😃. Here are some small change proposals to restore support for V0 (broken by this PR).

@robertgshaw2-redhat
Collaborator

Should we just deprecate V0?

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) April 17, 2025 15:07
@robertgshaw2-redhat
Collaborator

May I ask when the branch will be merged?

Just getting tests green

@simon-mo simon-mo disabled auto-merge April 17, 2025 20:22
@simon-mo simon-mo merged commit 3408e47 into vllm-project:main Apr 17, 2025
50 of 55 checks passed
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: remi <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
@Huixxi

Huixxi commented Apr 23, 2025

But is there a demo to run? Can I run it like this?

VLLM_USE_V1=1 python3 -m vllm.entrypoints.openai.api_server --model xxx --port 8500 --max-model-len 8192 --gpu-memory-utilization 0.9 --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer"}' --trust-remote-code --tensor-parallel-size 8

@ApostaC
Collaborator Author

ApostaC commented Apr 23, 2025

But is there a demo to run? Can I run like this?

@Huixxi There is an example in #16625

@Huixxi

Huixxi commented Apr 24, 2025

But is there a demo to run? Can I run it like this?

@Huixxi There is an example in #16625

Thanks! Which branch of the source code should I use? Is https://github.com/ApostaC/vllm/tree/local-dev/lmcache-v1-connector-pr the right one? Does it support xPyD now? Multiple nodes? Also, which version of LMCache should I install, and how?

@khayamgondal

How do I run this with vLLM serve?
python3 -m vllm.entrypoints.openai.api_server --model /extended/downloaded/Meta-Llama-3.1-70B-Instruct-quantized.w8a8/ --kv-transfer-config '{"kv_connector":"LMCacheConnector", "kv_role":"kv_both"}' --max-model-len 4096 --gpu-memory-utilization 0.9
It gives this error:

llm.v1.worker.gpu_worker.Worker object at 0xf17daec353d0>
INFO 04-28 18:01:26 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
ERROR 04-28 18:01:26 [core.py:396] EngineCore failed to start.
ERROR 04-28 18:01:26 [core.py:396] Traceback (most recent call last):
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 04-28 18:01:26 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-28 18:01:26 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/engine/core.py", line 329, in __init__
ERROR 04-28 18:01:26 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/engine/core.py", line 64, in __init__
ERROR 04-28 18:01:26 [core.py:396]     self.model_executor = executor_class(vllm_config)
ERROR 04-28 18:01:26 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-28 18:01:26 [core.py:396]     self._init_executor()
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 04-28 18:01:26 [core.py:396]     self.collective_rpc("init_device")
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-28 18:01:26 [core.py:396]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-28 18:01:26 [core.py:396]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/utils.py", line 2456, in run_method
ERROR 04-28 18:01:26 [core.py:396]     return func(*args, **kwargs)
ERROR 04-28 18:01:26 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
ERROR 04-28 18:01:26 [core.py:396]     self.worker.init_device()  # type: ignore
ERROR 04-28 18:01:26 [core.py:396]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 135, in init_device
ERROR 04-28 18:01:26 [core.py:396]     init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 329, in init_worker_distributed_environment
ERROR 04-28 18:01:26 [core.py:396]     ensure_kv_transfer_initialized(vllm_config)
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
ERROR 04-28 18:01:26 [core.py:396]     _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
ERROR 04-28 18:01:26 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py", line 63, in create_connector_v1
ERROR 04-28 18:01:26 [core.py:396]     assert issubclass(connector_cls, KVConnectorBase_V1)
ERROR 04-28 18:01:26 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] AssertionError
Process EngineCore_0:
Traceback (most recent call last):
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 400, in run_engine_core
    raise e
  File "/workspace/vllm/vllm/v1/engine/core.py", line 387, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core.py", line 329, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/workspace/vllm/vllm/v1/engine/core.py", line 64, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 46, in _init_executor
    self.collective_rpc("init_device")
  File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/utils.py", line 2456, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
    self.worker.init_device()  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 135, in init_device
    init_worker_distributed_environment(self.vllm_config, self.rank,
  File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 329, in init_worker_distributed_environment
    ensure_kv_transfer_initialized(vllm_config)
  File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
    _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py", line 63, in create_connector_v1
    assert issubclass(connector_cls, KVConnectorBase_V1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
[rank0]:[W428 18:01:26.726953847 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
    uvloop.run(run_server(args))
  File "/workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
    return cls(
           ^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

@ApostaC
Collaborator Author

ApostaC commented Apr 28, 2025

@khayamgondal Please use LMCacheConnectorV1 rather than LMCacheConnector.
Here's the example script: https://github.com/vllm-project/vllm/blob/main/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
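
With that change, the command above becomes something like this (only the connector name changes; everything else is kept from your command):

python3 -m vllm.entrypoints.openai.api_server \
    --model /extended/downloaded/Meta-Llama-3.1-70B-Instruct-quantized.w8a8/ \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
    --max-model-len 4096 --gpu-memory-utilization 0.9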

@khayamgondal

khayamgondal commented Apr 28, 2025 via email

jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: remi <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: remi <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: remi <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Agata Dobrzyniewicz <[email protected]>
@zejun-chen

zejun-chen commented May 11, 2025

Hi,
With this PR, I got a failure when running the following script:
https://docs.vllm.ai/en/stable/getting_started/examples/disaggregated_prefill.html#disaggregated-prefill

The error is shown below:

ERROR 05-10 23:29:44 [core.py:396] EngineCore failed to start.
ERROR 05-10 23:29:44 [core.py:396] Traceback (most recent call last):
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-10 23:29:44 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-10 23:29:44 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 05-10 23:29:44 [core.py:396]     self.model_executor = executor_class(vllm_config)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-10 23:29:44 [core.py:396]     self._init_executor()
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 05-10 23:29:44 [core.py:396]     self.collective_rpc("init_device")
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-10 23:29:44 [core.py:396]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/utils.py", line 2456, in run_method
ERROR 05-10 23:29:44 [core.py:396]     return func(*args, **kwargs)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
ERROR 05-10 23:29:44 [core.py:396]     self.worker.init_device()  # type: ignore
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 135, in init_device
ERROR 05-10 23:29:44 [core.py:396]     init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 336, in init_worker_distributed_environment
ERROR 05-10 23:29:44 [core.py:396]     ensure_kv_transfer_initialized(vllm_config)
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
ERROR 05-10 23:29:44 [core.py:396]     _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
ERROR 05-10 23:29:44 [core.py:396]   File "/raid/miniforge3/envs/zejun/lib/python3.10/site-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 66, in create_connector_v1
ERROR 05-10 23:29:44 [core.py:396]     assert issubclass(connector_cls, KVConnectorBase_V1)
ERROR 05-10 23:29:44 [core.py:396] AssertionError

The connector must be a subclass of KVConnectorBase_V1:

        connector_name = config.kv_transfer_config.kv_connector
        connector_cls = cls._registry[connector_name]()
        assert issubclass(connector_cls, KVConnectorBase_V1)

How should I modify the official PD examples? Do I need to use "kv_connector":"SharedStorageConnector"?

Thank you.

