
Conversation


@panf2333 panf2333 commented Jan 7, 2025

Added vLLM Connect to launch a proxy service that connects to the vLLM server via ZMQ. This improves the performance of prefill-decode disaggregation by 5-10% (TTFT) and 5-15% (ITL) on average.

The key changes in this PR are replacing HTTP with ZMQ for communication between the proxy and the vLLM server, and using socket pools to maintain persistent ZMQ connections, which reduces reconnection overhead.
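
As a rough illustration of the socket-pool idea (a minimal sketch with hypothetical names, not the PR's actual classes): one DEALER socket is created lazily per backend address and reused across requests, so the proxy never pays a reconnect cost on the hot path.

import zmq

class SocketPool:
    """Keep one persistent DEALER socket per backend address (illustrative sketch)."""

    def __init__(self) -> None:
        self._ctx = zmq.Context.instance()
        self._sockets: dict[str, zmq.Socket] = {}

    def get(self, addr: str) -> zmq.Socket:
        # Create the socket on first use, then reuse it for later requests.
        sock = self._sockets.get(addr)
        if sock is None:
            sock = self._ctx.socket(zmq.DEALER)
            sock.connect(addr)
            self._sockets[addr] = sock
        return sock

    def close(self) -> None:
        for sock in self._sockets.values():
            sock.close(linger=0)
        self._sockets.clear()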

We have attached the benchmark results and the detailed configuration needed to reproduce them.

Benchmark

Without CUDA_LAUNCH_BLOCKING=1, TTFT improves by 5-10% on average and ITL by 5-15% on average.

Parameters

  • GPU device: 2 * H100 80G
  • Model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Parameters: gpu-memory-utilization 0.6 + kv_buffer_size 5e9
  • Dataset: input 1024, output 6
  • QPS: 12, 24, 48, 96
  • Total Request: 96

Evaluation Steps

  1. Start the disagg HTTP proxy and 2 vLLM server instances (1 prefill and 1 decode).
  2. Run the script to test QPS in [12, 24, 48, 96]; repeat each QPS 3 times and take the average of the metrics.
  3. Start the disagg ZMQ proxy and 2 vLLM server instances, then repeat the previous process.

image

When I set CUDA_LAUNCH_BLOCKING=1, the trend was similar to before, but the gains were larger: prefill-decode disaggregation improved by 30-50% (TTFT) and 3x-15x (ITL) on average.

Parameters

  • GPU device: 2 * H100 80G
  • Model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Parameters: gpu-memory-utilization 0.6 + kv_buffer_size 5e9
  • Dataset: input 1024, output 6
  • CUDA_LAUNCH_BLOCKING=1
  • QPS: 12, 24, 48, 96
  • Total Request: 96

Evaluation Steps

  1. Start the disagg HTTP proxy and 2 vLLM server instances (1 prefill and 1 decode).
  2. Run the script to test QPS in [12, 24, 48, 96]; repeat each QPS 3 times and take the average of the metrics.
  3. Start the disagg ZMQ proxy and 2 vLLM server instances, then repeat the previous process.

image

Design of ZMQ-based Client-Server Communication

High-level Overview

image

Design of ZMQ-based Communication

image


github-actions bot commented Jan 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@robertgshaw2-redhat
Collaborator

Ping when ready. NOTE for reviewers: do not merge until @russellb and I have a chance to review.

Collaborator

@simon-mo I am not familiar with ZMQ --- is dealer the right technical choice?

Author

https://zguide.zeromq.org/docs/chapter3/#The-DEALER-to-DEALER-Combination

We need to proactively send messages to workers in this scenario.

  1. ROUTER is not suitable for initiating messages because it doesn't know the identities of other receivers until it receives the first message. Only then can it establish routes for interaction.

  2. REQ requires receiving a reply to each message before the next one can be sent, which doesn't meet our requirements.

  3. DEALER allows us to actively send messages and supports asynchronous multi-send and multi-receive, making it the more suitable pattern. Note that we need to maintain the DEALER's identity (see the sketch below).
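
A minimal sketch of the proxy-side DEALER usage described above (the identity and address here are illustrative, not the PR's actual wiring):

import zmq

ctx = zmq.Context.instance()

dealer = ctx.socket(zmq.DEALER)
dealer.setsockopt(zmq.IDENTITY, b"proxy-1")  # keep a stable DEALER identity
dealer.connect("tcp://127.0.0.1:5570")       # illustrative server address

# Unlike REQ, multiple messages can be sent without waiting for replies.
dealer.send_multipart([b"request-1"])
dealer.send_multipart([b"request-2"])

# Replies arrive asynchronously and can be matched by application-level IDs.
reply = dealer.recv_multipart()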

Collaborator

A potential optimization (you don't need to implement it in this PR): return the first token generated by the prefill instance in this async for, instead of reposting the request to the decode instance and waiting for the first token from there.

@KuntaiDu KuntaiDu requested a review from youkaichao January 8, 2025 01:06
Collaborator

ZMQ sockets are not thread-safe. This cannot run in a background thread; it must be in an asyncio task.

Collaborator

@robertgshaw2-redhat robertgshaw2-redhat Jan 8, 2025

This means you cannot use the built-in proxy, since it does not use async sockets. As in prior versions of vLLM, you will have to write your own proxy (it's ~10 LOC).
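
A minimal sketch of such a hand-rolled proxy using asyncio-friendly sockets (the addresses and socket types are illustrative assumptions, not this PR's actual code):

import asyncio
import zmq
import zmq.asyncio

async def simple_proxy(frontend_addr: str, backend_addr: str) -> None:
    ctx = zmq.asyncio.Context.instance()
    frontend = ctx.socket(zmq.ROUTER)   # faces clients
    backend = ctx.socket(zmq.DEALER)    # faces workers
    frontend.bind(frontend_addr)
    backend.bind(backend_addr)

    async def forward(src: zmq.asyncio.Socket, dst: zmq.asyncio.Socket) -> None:
        while True:
            await dst.send_multipart(await src.recv_multipart())

    # Both directions run as asyncio tasks; no background threads touch the sockets.
    await asyncio.gather(forward(frontend, backend), forward(backend, frontend))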

Author

The test scripts test_connect_server1.py and test_connect_server2.py were used to simulate model responses. I've since removed them.

Collaborator

use destroy(linger=0)

Author

Great point! To ensure immediate termination and avoid potential blocking, I'll switch to using destroy(linger=0) instead of term(). I also replaced it in vllm/entrypoints/launcher.py.

https://pyzmq.readthedocs.io/en/latest/api/zmq.html#context
After interrupting all blocking calls, term shall block until the following conditions are satisfied:

  1. All sockets open within context have been closed.
  2. For each socket within context, all messages sent on the socket have either been physically transferred to a network peer, or the socket’s linger period set with the zmq.LINGER socket option has expired.
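
A small sketch of the difference (the address is illustrative; this is not the launcher's actual shutdown code):

import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.DEALER)
sock.connect("tcp://127.0.0.1:5570")  # connect is asynchronous, so the peer may not exist yet
sock.send(b"bye")                     # message is queued locally

# ctx.term() would block until the queued message is delivered or LINGER expires.
ctx.destroy(linger=0)  # closes every socket with LINGER=0, then terminates immediately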

Collaborator

ZMQ sockets are not thread-safe. You cannot run this in a background thread.

Author

I appreciate you pointing out the potential thread-safety issues with ZMQ sockets. You are completely correct; by default, they are not thread-safe. I will prioritize finding a more thread-safe alternative in the future to ensure robust operation in multi-threaded environments.

Since zmq.proxy() is a synchronous function, executing it directly in the main thread can block the server.

Currently, these two sockets are used exclusively within this thread. While I believe there are no immediate thread-safety concerns, it's prudent to consider future scalability and maintainability. Can we address the potential thread-safety issues in a subsequent PR?

https://zguide.zeromq.org/docs/chapter2/#ZeroMQ-s-Built-In-Proxy-Function
It’s exactly like starting the main loop of rrbroker.
image

https://github.com/booksbyus/zguide/blob/master/examples/Python/rrbroker.py

Author

@robertgshaw2-neuralmagic Hi Robert, I have resolved this issue by using a ThreadProxy. The sockets are created inside the proxy, so nothing else can access them and there won't be any thread-safety issues.
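
A hedged sketch of that resolution using pyzmq's built-in proxy device (addresses are illustrative): zmq.devices.ThreadProxy creates its frontend/backend sockets inside its own thread, so no other thread ever touches them.

import zmq
from zmq.devices import ThreadProxy

proxy = ThreadProxy(zmq.ROUTER, zmq.DEALER)
proxy.bind_in("tcp://0.0.0.0:5570")  # client-facing frontend
proxy.bind_out("inproc://workers")   # in-process backend for the workers
proxy.start()                        # sockets are created and used only in the proxy's thread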

@robertgshaw2-redhat
Collaborator

robertgshaw2-redhat commented Jan 8, 2025

@panf2333 - Thanks for the PR! Disaggregated serving is a hugely important initiative for vLLM in 2025.

I am responsible for the multiprocessing + asyncio + zmq architecture of vLLM, so I am going to review this in detail. I am having some trouble following the design here. Can you make a simple diagram that charts out what these objects are to ease the review?

Thanks!

@panf2333
Author

panf2333 commented Jan 8, 2025

@robertgshaw2-neuralmagic It's my pleasure. I'll put together a diagram and send it over shortly.

@panf2333
Author

panf2333 commented Jan 8, 2025

@robertgshaw2-neuralmagic Here are some simple diagrams that I hope will help you better understand this PR. I have also updated the PR description.

The relationship between the client, the connector, and the vLLM server

image

The ZMQ details between the connector and the vLLM server

image

@panf2333 panf2333 marked this pull request as ready for review January 8, 2025 10:20
@panf2333 panf2333 changed the title from "Disaggregate prefill decode with zmq" to "[Frontend] Disaggregate prefill decode with zmq" on Jan 8, 2025
@panf2333 panf2333 force-pushed the disaggregate_prefill_decode_with_zmq branch from 1bc97ec to 0728a42 on January 8, 2025 16:39
Member

@russellb russellb left a comment

I don't understand the whole design yet, but I have one early comment: is all zmq communication local? If so, can you please use ipc:// sockets instead of tcp://? That will avoid some security concerns.

@panf2333
Author

panf2333 commented Jan 9, 2025

@russellb I completely agree that security is a paramount concern.

Given the Disaggregated serving feature's potential to dispatch requests to other nodes, it's crucial to establish a secure communication channel between the connector proxy, prefill node, and decode node.

To connect the connector proxy, prefill nodes, and decode nodes, we should use 'tcp://'.

in vllm/entrypoints/disagg_connector.py
async def run_disagg_connector(args, **uvicorn_kwargs) -> None:

in vllm/entrypoints/launcher.py

async def serve_zmq(arg, zmq_server_port: int, app: FastAPI) -> None:
    """Server routine"""
    logger.info("zmq Server start arg: %s, zmq_server_port: %d", arg,
                zmq_server_port)
    url_worker = "inproc://workers"
    url_client = f"tcp://0.0.0.0:{zmq_server_port}"

On the server side, we use "inproc://workers" to dispatch the messages to workers.
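
For illustration, a hedged sketch of this frontend/backend split following the zguide "mtserver" pattern (the PR's real handler lives in vllm/entrypoints/launcher.py and may differ; the port and worker logic here are placeholders):

import threading
import zmq

def worker(ctx: zmq.Context, url_worker: str) -> None:
    sock = ctx.socket(zmq.REP)
    sock.connect(url_worker)
    while True:
        request = sock.recv()
        sock.send(b"processed:" + request)  # placeholder for the real request handler

ctx = zmq.Context.instance()
url_worker = "inproc://workers"
url_client = "tcp://0.0.0.0:5570"  # zmq_server_port is illustrative here

frontend = ctx.socket(zmq.ROUTER)  # faces the connector proxy over tcp
backend = ctx.socket(zmq.DEALER)   # fans requests out to in-process workers
frontend.bind(url_client)
backend.bind(url_worker)

for _ in range(4):
    threading.Thread(target=worker, args=(ctx, url_worker), daemon=True).start()

zmq.proxy(frontend, backend)  # forwards between clients and workers (blocking call)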

@russellb
Member

russellb commented Jan 9, 2025

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?

I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@panf2333
Author

@russellb I appreciate you raising this concern.
I will integrate the pyzmq.auth module to enhance security in a follow-up PR. For now, I will change to ipc://.

https://pyzmq.readthedocs.io/en/latest/api/zmq.auth.html
It is based on ZAP authentication and CURVE authentication.
The design documents are here (the Lark doc is recommended).

lark doc: https://qus2es1bg99i.larksuite.com/wiki/Pbi1wFUTaiBZneksfytuQxrSsTe?from=from_copylink

google doc: https://docs.google.com/document/d/1ZwFij2OEx_K1xBx2EBx5FKfXQ9EJEGU6shYh-9MJdPs/edit?usp=sharing
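
To make the follow-up plan concrete, a hedged server-side sketch of CURVE with pyzmq's zmq.auth (the key directory, port, and allow-any policy are illustrative assumptions, not the planned design):

import os
import zmq
import zmq.auth
from zmq.auth.thread import ThreadAuthenticator

ctx = zmq.Context.instance()

# ZAP handler running in a background thread; here it accepts any client key.
auth = ThreadAuthenticator(ctx)
auth.start()
auth.configure_curve(domain="*", location=zmq.auth.CURVE_ALLOW_ANY)

# Generate and load a server certificate (one-time setup in a real deployment).
os.makedirs("certs", exist_ok=True)
public_file, secret_file = zmq.auth.create_certificates("certs", "server")
server_public, server_secret = zmq.auth.load_certificate(secret_file)

sock = ctx.socket(zmq.ROUTER)
sock.curve_publickey = server_public
sock.curve_secretkey = server_secret
sock.curve_server = True  # enables CURVE encryption plus ZAP authentication
sock.bind("tcp://0.0.0.0:5570")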

@panf2333
Author

@russellb Hi Russell, for now, I've used 'ipc://' to address immediate security concerns. However, I'll be addressing network security comprehensively in a future PR. I plan to leverage pyzmq.auth to implement robust authentication and authorization mechanisms.

@russellb
Member

I don't think that's sufficient. We also need a viable option for encryption, ideally with TLS.

@panf2333
Author

@russellb I believe the disaggregation feature might benefit from optional TLS encryption. While encryption enhances security, it may introduce a slight performance overhead. Do you mean we could provide a configuration option to enable TLS encryption? That would let users choose the security level they need. I think users prefer to deploy clusters within secure environments such as intranets, so they want to maximize performance.

I will do in-depth research on auth and encryption before deciding on an approach. Until then, ZMQ will only be allowed to run locally. How about this?




@russellb
Member

That's fine. I'm completely OK with using it local-only.

Collaborator

@KuntaiDu KuntaiDu left a comment

Great work! Can you change the disaggregated prefill example file under the examples folder? Let's give newcomers a handle to run the disaggregated prefill example without having to figure out how to correctly set all the CLI args.


mergify bot commented Jan 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @panf2333.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Robert Shaw and others added 21 commits March 22, 2025 17:12

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale (Over 90 days of inactivity) label on Jul 22, 2025

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions bot closed this Aug 22, 2025
