[V1/0][P/D] XpYd based on p2p communication without cache store #15806
Conversation
Signed-off-by: Abatom <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
@robertgshaw2-redhat ping!
I have blocked off my afternoon to focus on this PR. Thanks!
Are you okay with me deleting the proxy in a follow-up?
```python
logger = logging.getLogger(__name__)


class P2pNcclPipe:
```
This does not inherit from KVPipeBase. I think we should leverage the base class to make sure the implementations remain consistent.
I need to add two parameters, tensor_id: str = "" and remote_address: Optional[str] = None, to both the send_tensor and recv_tensor functions in KVPipeBase to make sure the implementations remain consistent. Is this approach acceptable?
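For concreteness, here is a rough sketch of what the extended interface could look like under that proposal. This is an illustration of the two added parameters only; the exact method set and signatures of KVPipeBase in vLLM may differ and this is not the merged code.

```python
# Illustrative sketch only: KVPipeBase extended with tensor_id and
# remote_address as proposed above. Not the merged implementation.
from abc import ABC, abstractmethod
from typing import Optional

import torch


class KVPipeBase(ABC):

    @abstractmethod
    def send_tensor(self,
                    tensor: Optional[torch.Tensor],
                    tensor_id: str = "",
                    remote_address: Optional[str] = None) -> None:
        """Send a tensor, optionally addressed to a specific remote peer."""
        raise NotImplementedError

    @abstractmethod
    def recv_tensor(self,
                    tensor_id: str = "",
                    remote_address: Optional[str] = None
                    ) -> Optional[torch.Tensor]:
        """Receive a tensor, optionally from a specific remote peer."""
        raise NotImplementedError

    @abstractmethod
    def close(self) -> None:
        """Release the resources held by the pipe."""
        raise NotImplementedError
```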
```python
if remote_address not in self.socks:
    sock = self.context.socket(zmq.DEALER)
    sock.setsockopt_string(zmq.IDENTITY, self.zmq_address)
    sock.connect(f"tcp://{remote_address}")
```
For this PR, can we leverage ipc? In general, using ZMQ sockets with pickle is insecure (see below). With TCP, any arbitrary user with access to the route can cause remote code execution. For this prototyping stage, we should only have ipc. When we move to production deployments, we can turn on tcp, but with security caveats.
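For reference, a minimal sketch of what an ipc endpoint could look like in place of the tcp connect above; the socket-path naming here is an assumption, not the scheme used in this PR.

```python
# Minimal sketch (not this PR's code): connecting the DEALER socket over
# IPC instead of TCP for same-host prototyping. The socket path below is
# an assumed naming scheme.
import zmq

context = zmq.Context()
sock = context.socket(zmq.DEALER)
sock.setsockopt_string(zmq.IDENTITY, "decode_worker_0")

# tcp://<host>:<port> exposes the pickle-based channel to anything that can
# reach the route; ipc:// limits it to local processes that can access the
# socket file.
sock.connect("ipc:///tmp/vllm_kv_pipe_prefill_0.sock")
```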
Thanks @Abatom - generally looks good. Key concerns:
No problem.
Thank you very much for your code review. I will complete the revisions as soon as possible.
After trying this out over the past few days, what I observed is that prefill throughput does increase a lot, but decode throughput shows no change compared with v1. Could it be that the decode side needs some special configuration? In my scenario, each prompt is about 1k tokens and the response is about 2-3k tokens, so it is a fairly heavy token-generation workload.
@WXFMAV, What model, what GPU? It's likely that the temporary GPU memory allocation is too small. You can try reducing the
These are NVIDIA A100 80GB GPUs with a Qwen 7B model. I compared this method with the original vLLM v1 baseline; both experiments use four GPUs and the same request rate of 10 QPS. The baseline works well, but the xPyD method soon builds up pending requests and consumes the KV cache rapidly, indicating that the decode instances are overloaded. So I think the prefill/decode-separated configuration does not improve decode throughput in my scenario, which puzzles me. GPU utilization settings of 70% and 60% were both tested but did not lead to improvements.
@WXFMAV OK, I think there are still some parameters that haven't been properly configured. The adjustable parameters include the ratio of P instances to D instances (try 1:7), reducing
Fine, thanks, let me try it!
When is this PR expected to be merged?
After the support for V1 is perfected, it should be possible to merge.
I tried this configuration, but the results still don't look good. The setup is now 1P7D with 10 QPS of requests; after roughly 16k cumulative requests there are a large number of "Failed to receive all KVs" errors. The model is Qwen2-7B, and the dataset is a task with an average input of 1000 tokens and an output of 2000-3000 tokens. The first ~1000 prompts run without errors, but around 16k prompts the errors start. The GPU configuration is as follows:
The specific prefill and decode launch commands are as follows:
The prefill metrics are as follows:
The decode metrics basically all show a large number of pending seqs:
The receive-overflow error from one of the decode instances is as follows:
With 4 GPUs and the baseline configuration, 10 QPS of requests can be handled. The baseline KV-cache usage is as follows:
The baseline launch command is as follows:
@WXFMAV This is because the current proxy is still quite simple: it just picks a D instance at random, so the load is fairly even at first but becomes uneven later on.
### Description: Multi-node Prefill/Decode Disaggregated Deployment with FlagCX

This PR implements support for multi-node disaggregated deployment of **prefill** and **decode** stages using `xPyD` Disaggregation:
- Scheduling strategies for PD instances currently supported: `robin`, `random`; the default is `robin`.
- It introduces a new communication backend based on [FlagCX](https://github.com/FlagOpen/FlagCX). Merge [FlagCX Adapter](#461).
- KV cache transfer is enabled via [p2pConnector](vllm-project/vllm#15806) in `vLLM`.

---

### How to Use

**Step 1**: Install [FlagCX](https://github.com/FlagOpen/FlagCX?tab=readme-ov-file#quick-start)

**Step 2**: Install the `vLLM` version from [FlagScale](https://github.com/FlagOpen/FlagScale?tab=readme-ov-file#setup)

**Step 3**: Define your config files under `./examples/qwen/conf`

**Step 4**: Launch the distributed deployment

```bash
python run.py --config-path ./examples/qwen/conf --config-name config_qwen2.5_7b_disagg_xpyd action=run
```

**Step 5**: Send requests to the deployed service

```bash
curl -X POST -s http://localhost:10001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen2.5-7B-Instruct",
    "prompt": "Introduce Bruce Lee in details",
    "max_tokens": 100,
    "temperature": 0,
    "stream": true
  }'
```
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Abatom <[email protected]>
An implementation of XpYd with dynamic scaling based on point-to-point communication, partly inspired by Dynamo.
DeepSeek R1 On H20 with 1P2D and V0
In the DeepSeek-R1 inference scenario with 1k input and 1k output tokens, where the three H20 machines (1P2D) achieve a TTFT of around 2 seconds and a TPOT ≤ 100 ms, the throughput improvement over deploying vLLM independently on three single H20 machines is 115% (2396/3/370 − 1, i.e. roughly 2396/3 ≈ 799 per machine versus the 370 single-machine baseline).
Architecture diagram
Explanations:

--max-num-seqs should be set to a smaller value. In my scenario, I set it to 5 to avoid filling up the buffer of the D instance, which would otherwise cause the D instance to recompute the prefill. To address the need to set --max-num-seqs to such a small value, I will implement a local memory pool that absorbs sudden increases in KV cache and avoids prefill recomputation.

TODO
Install vLLM
Environment configuration
Delete this line in vllm/env_override.py
It is highly recommended to set NCCL_CUMEM_ENABLE=1, allowing NCCL to use CUDA Unified Memory (CUmem) for communication. This can reduce the overhead of GPU memory copying and improve the efficiency of multi-GPU or cross-node communication.

In --kv-transfer-config, the sending type is one of three mutually exclusive options: PUT, GET, PUT_ASYNC.
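As a rough illustration of the settings above, one of the instances might be launched as in the sketch below. The connector name "P2pConnector" and the "send_type" key are assumptions for illustration, not confirmed names from this PR; the model path, port, and role are placeholders.

```bash
# Illustrative only: environment variable plus kv-transfer config.
# "P2pConnector" and "send_type" are assumed names; the model path, port,
# role, and the small --max-num-seqs value echo the explanation above.
export NCCL_CUMEM_ENABLE=1

vllm serve /models/Qwen2.5-7B-Instruct \
    --port 20001 \
    --max-num-seqs 5 \
    --kv-transfer-config \
    '{"kv_connector":"P2pConnector","kv_role":"kv_producer","kv_connector_extra_config":{"send_type":"PUT_ASYNC"}}'
```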
How to run 1P2D with V0? (Stable)
Node 1 (IP:1.1.1.1)
Node 1 (IP:1.1.1.1)
Node 2 (IP:2.2.2.2)
Node 3 (IP:3.3.3.3)
How to run 1P2D with V1? (Unstable)
Node 1 (IP:1.1.1.1)
Node 1 (IP:1.1.1.1)
Node 2 (IP:2.2.2.2)
Node 3 (IP:3.3.3.3)
Request
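A request to the proxy might look roughly like the sketch below; the address, port, and model path are placeholders for whatever the proxy above listens on, not the exact values used in the runs described here.

```bash
# Hypothetical request to the proxy front-end; address, port, and model
# path are placeholders.
curl -X POST -s http://<proxy-ip>:<proxy-port>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/models/Qwen2.5-7B-Instruct",
        "prompt": "Introduce Bruce Lee in detail",
        "max_tokens": 100,
        "temperature": 0
    }'
```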