
Conversation


@panf2333 panf2333 commented Jan 7, 2025

Added vLLM Connect to launch a proxy service that connects to the vLLM server via ZMQ. This improves the performance of prefill-decode disaggregation by 5-10% (TTFT) and 5-15% (ITL) on average.

The key changes in this PR are replacing HTTP with ZMQ for communication between the proxy and the vLLM server, and using socket pools to maintain persistent ZMQ connections, which reduces reconnection overhead.
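
As a rough illustration of the socket-pool idea (a minimal sketch with hypothetical names, not the PR's actual classes): one DEALER socket is created lazily per backend address and reused across requests, so the proxy never pays a reconnect cost on the hot path.

import zmq

class SocketPool:
    """Keep one persistent DEALER socket per backend address (illustrative sketch)."""

    def __init__(self) -> None:
        self._ctx = zmq.Context.instance()
        self._sockets: dict[str, zmq.Socket] = {}

    def get(self, addr: str) -> zmq.Socket:
        # Create the socket on first use, then reuse it for later requests.
        sock = self._sockets.get(addr)
        if sock is None:
            sock = self._ctx.socket(zmq.DEALER)
            sock.connect(addr)
            self._sockets[addr] = sock
        return sock

    def close(self) -> None:
        for sock in self._sockets.values():
            sock.close(linger=0)
        self._sockets.clear()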

We have attached the benchmark results and the detailed configuration needed to reproduce them.

Benchmark

Without CUDA_LAUNCH_BLOCKING=1, TTFT improves by 5-10% on average and ITL by 5-15% on average.

Parameters

  • GPU device: 2 * H100 80G
  • Model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Parameters: gpu-memory-utilization 0.6 + kv_buffer_size 5e9
  • Dataset: input 1024, output 6
  • QPS: 12, 24, 48, 96
  • Total Request: 96

Evaluation Steps

  1. Start the disagg HTTP proxy and 2 vLLM server instances (1 prefill and 1 decode).
  2. Run the script to test QPS in [12, 24, 48, 96]; repeat each QPS 3 times and take the average of the metrics.
  3. Start the disagg ZMQ proxy and 2 vLLM server instances, then repeat the previous process.

image

When I set CUDA_LAUNCH_BLOCKING=1, the trend was similar to before, but the gains were larger: prefill-decode disaggregation improved by 30-50% (TTFT) and 3x-15x (ITL) on average.

Parameters

  • GPU device: 2 * H100 80G
  • Model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Parameters: gpu-memory-utilization 0.6 + kv_buffer_size 5e9
  • Dataset: input 1024, output 6
  • CUDA_LAUNCH_BLOCKING=1
  • QPS: 12, 24, 48, 96
  • Total Request: 96

Evaluation Steps

  1. Start the disagg HTTP proxy and 2 vLLM server instances (1 prefill and 1 decode).
  2. Run the script to test QPS in [12, 24, 48, 96]; repeat each QPS 3 times and take the average of the metrics.
  3. Start the disagg ZMQ proxy and 2 vLLM server instances, then repeat the previous process.

image

Design of ZMQ-based Client-Server Communication

High-level Overview

image

Design of ZMQ-based Communication

image


github-actions bot commented Jan 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@robertgshaw2-redhat
Collaborator

Ping when ready. NOTE for reviewers: do not merge until @russellb and I have a chance to review.

Collaborator

@simon-mo I am not familiar with ZMQ --- is dealer the right technical choice?

Author

https://zguide.zeromq.org/docs/chapter3/#The-DEALER-to-DEALER-Combination

We need to proactively send messages to workers in this scenario.

  1. ROUTER is not suitable for initiating messages because it doesn't know the identities of other receivers until it receives the first message. Only then can it establish routes for interaction.

  2. REQ requires receiving a reply to each message before the next one can be sent, which doesn't meet our requirements.

  3. DEALER allows us to actively send messages and supports asynchronous multi-send and multi-receive, making it the more suitable pattern. Note that we need to maintain the DEALER's identity (see the sketch below).
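
A minimal sketch of the proxy-side DEALER usage described above (the identity and address here are illustrative, not the PR's actual wiring):

import zmq

ctx = zmq.Context.instance()

dealer = ctx.socket(zmq.DEALER)
dealer.setsockopt(zmq.IDENTITY, b"proxy-1")  # keep a stable DEALER identity
dealer.connect("tcp://127.0.0.1:5570")       # illustrative server address

# Unlike REQ, multiple messages can be sent without waiting for replies.
dealer.send_multipart([b"request-1"])
dealer.send_multipart([b"request-2"])

# Replies arrive asynchronously and can be matched by application-level IDs.
reply = dealer.recv_multipart()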

Collaborator

A potential optimization (you don't need to implement it in this PR): return the first token generated by the prefill instance in this async for, instead of reposting the request to the decode instance and waiting for the first token from there.

@KuntaiDu KuntaiDu requested a review from youkaichao January 8, 2025 01:06
Collaborator

ZMQ sockets are not thread-safe. This cannot run in a background thread; it must be in an asyncio task.

Collaborator

@robertgshaw2-redhat robertgshaw2-redhat Jan 8, 2025

This means you cannot use the built-in proxy, since it does not use async sockets. As in prior versions of vLLM, you will have to write your own proxy (it's ~10 LOC).
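
A minimal sketch of such a hand-rolled proxy using asyncio-friendly sockets (the addresses and socket types are illustrative assumptions, not this PR's actual code):

import asyncio
import zmq
import zmq.asyncio

async def simple_proxy(frontend_addr: str, backend_addr: str) -> None:
    ctx = zmq.asyncio.Context.instance()
    frontend = ctx.socket(zmq.ROUTER)   # faces clients
    backend = ctx.socket(zmq.DEALER)    # faces workers
    frontend.bind(frontend_addr)
    backend.bind(backend_addr)

    async def forward(src: zmq.asyncio.Socket, dst: zmq.asyncio.Socket) -> None:
        while True:
            await dst.send_multipart(await src.recv_multipart())

    # Both directions run as asyncio tasks; no background threads touch the sockets.
    await asyncio.gather(forward(frontend, backend), forward(backend, frontend))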

Author

The test scripts test_connect_server1.py and test_connect_server2.py were used to simulate model responses. I've since removed them.

Collaborator

use destroy(linger=0)

Author

Great point! To ensure immediate termination and avoid potential blocking, I'll switch to using destroy(linger=0) instead of term(). I also replaced it in vllm/entrypoints/launcher.py.

https://pyzmq.readthedocs.io/en/latest/api/zmq.html#context
After interrupting all blocking calls, term shall block until the following conditions are satisfied:

  1. All sockets open within context have been closed.
  2. For each socket within context, all messages sent on the socket have either been physically transferred to a network peer, or the socket’s linger period set with the zmq.LINGER socket option has expired.
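
A small sketch of the difference (the address is illustrative; this is not the launcher's actual shutdown code):

import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.DEALER)
sock.connect("tcp://127.0.0.1:5570")  # connect is asynchronous, so the peer may not exist yet
sock.send(b"bye")                     # message is queued locally

# ctx.term() would block until the queued message is delivered or LINGER expires.
ctx.destroy(linger=0)  # closes every socket with LINGER=0, then terminates immediately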

Collaborator

ZMQ sockets are not thread-safe. You cannot run this in a background thread.

Author

I appreciate you pointing out the potential thread-safety issues with ZMQ sockets. You are completely correct; by default, they are not thread-safe. I will prioritize finding a more thread-safe alternative in the future to ensure robust operation in multi-threaded environments.

Since zmq.proxy() is a synchronous function, executing it directly in the main thread can block the server.

Currently, these two sockets are used exclusively within this thread. While I believe there are no immediate thread-safety concerns, it's prudent to consider future scalability and maintainability. Can we address the potential thread-safety issues in a subsequent PR?

https://zguide.zeromq.org/docs/chapter2/#ZeroMQ-s-Built-In-Proxy-Function
It’s exactly like starting the main loop of rrbroker.
image

https://github.com/booksbyus/zguide/blob/master/examples/Python/rrbroker.py

Author

@robertgshaw2-neuralmagic Hi Robert, I have resolved this issue by using a ThreadProxy. The sockets are created inside the proxy, so nothing else can access them and there won't be any thread-safety issues.
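
A hedged sketch of that resolution using pyzmq's built-in proxy device (addresses are illustrative): zmq.devices.ThreadProxy creates its frontend/backend sockets inside its own thread, so no other thread ever touches them.

import zmq
from zmq.devices import ThreadProxy

proxy = ThreadProxy(zmq.ROUTER, zmq.DEALER)
proxy.bind_in("tcp://0.0.0.0:5570")  # client-facing frontend
proxy.bind_out("inproc://workers")   # in-process backend for the workers
proxy.start()                        # sockets are created and used only in the proxy's thread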

@robertgshaw2-redhat
Collaborator

robertgshaw2-redhat commented Jan 8, 2025

@panf2333 - Thanks for the PR! Disaggregated serving is a hugely important initiative for vLLM in 2025.

I am responsible for the multiprocessing + asyncio + zmq architecture of vLLM, so I am going to review this in detail. I am having some trouble following the design here. Can you make a simple diagram that charts out what these objects are to ease the review?

Thanks!

@panf2333
Author

panf2333 commented Jan 8, 2025

@robertgshaw2-neuralmagic It's my pleasure. I'll put together a diagram and send it over shortly.

@panf2333
Author

panf2333 commented Jan 8, 2025

@robertgshaw2-neuralmagic Here are some simple diagrams that I hope will help you better understand this PR. I have also updated the PR description.

The relationship between the client, the connector, and the vLLM server

image

The ZMQ details between the connector and the vLLM server

image

@panf2333 panf2333 marked this pull request as ready for review January 8, 2025 10:20
@panf2333 panf2333 changed the title from "Disaggregate prefill decode with zmq" to "[Frontend] Disaggregate prefill decode with zmq" on Jan 8, 2025
@panf2333 panf2333 force-pushed the disaggregate_prefill_decode_with_zmq branch from 1bc97ec to 0728a42 on January 8, 2025 16:39
Member

@russellb russellb left a comment

I don't understand the whole design yet, but I have one early comment: is all zmq communication local? If so, can you please use ipc:// sockets instead of tcp://? That will avoid some security concerns.

@panf2333
Author

panf2333 commented Jan 9, 2025

@russellb I completely agree that security is a paramount concern.

Given the Disaggregated serving feature's potential to dispatch requests to other nodes, it's crucial to establish a secure communication channel between the connector proxy, prefill node, and decode node.

To connect the connector proxy, prefill nodes, and decode nodes, we should use 'tcp://'.

in vllm/entrypoints/disagg_connector.py
async def run_disagg_connector(args, **uvicorn_kwargs) -> None:

in vllm/entrypoints/launcher.py

async def serve_zmq(arg, zmq_server_port: int, app: FastAPI) -> None:
    """Server routine"""
    logger.info("zmq Server start arg: %s, zmq_server_port: %d", arg,
                zmq_server_port)
    url_worker = "inproc://workers"
    url_client = f"tcp://0.0.0.0:{zmq_server_port}"

On the server side, we use "inproc://workers" to dispatch the messages to workers.
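
For illustration, a hedged sketch of this frontend/backend split following the zguide "mtserver" pattern (the PR's real handler lives in vllm/entrypoints/launcher.py and may differ; the port and worker logic here are placeholders):

import threading
import zmq

def worker(ctx: zmq.Context, url_worker: str) -> None:
    sock = ctx.socket(zmq.REP)
    sock.connect(url_worker)
    while True:
        request = sock.recv()
        sock.send(b"processed:" + request)  # placeholder for the real request handler

ctx = zmq.Context.instance()
url_worker = "inproc://workers"
url_client = "tcp://0.0.0.0:5570"  # zmq_server_port is illustrative here

frontend = ctx.socket(zmq.ROUTER)  # faces the connector proxy over tcp
backend = ctx.socket(zmq.DEALER)   # fans requests out to in-process workers
frontend.bind(url_client)
backend.bind(url_worker)

for _ in range(4):
    threading.Thread(target=worker, args=(ctx, url_worker), daemon=True).start()

zmq.proxy(frontend, backend)  # forwards between clients and workers (blocking call)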

@russellb
Member

russellb commented Jan 9, 2025

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?

I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@panf2333
Author

@russellb I appreciate you raising this concern.
I will integrate the pyzmq.auth module to enhance security in a follow-up PR. For now, I will change to ipc://.

https://pyzmq.readthedocs.io/en/latest/api/zmq.auth.html
It is based on ZAP authentication and CURVE authentication.
The design documents are here (the Lark doc is recommended).

lark doc: https://qus2es1bg99i.larksuite.com/wiki/Pbi1wFUTaiBZneksfytuQxrSsTe?from=from_copylink

google doc: https://docs.google.com/document/d/1ZwFij2OEx_K1xBx2EBx5FKfXQ9EJEGU6shYh-9MJdPs/edit?usp=sharing
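
To make the follow-up plan concrete, a hedged server-side sketch of CURVE with pyzmq's zmq.auth (the key directory, port, and allow-any policy are illustrative assumptions, not the planned design):

import os
import zmq
import zmq.auth
from zmq.auth.thread import ThreadAuthenticator

ctx = zmq.Context.instance()

# ZAP handler running in a background thread; here it accepts any client key.
auth = ThreadAuthenticator(ctx)
auth.start()
auth.configure_curve(domain="*", location=zmq.auth.CURVE_ALLOW_ANY)

# Generate and load a server certificate (one-time setup in a real deployment).
os.makedirs("certs", exist_ok=True)
public_file, secret_file = zmq.auth.create_certificates("certs", "server")
server_public, server_secret = zmq.auth.load_certificate(secret_file)

sock = ctx.socket(zmq.ROUTER)
sock.curve_publickey = server_public
sock.curve_secretkey = server_secret
sock.curve_server = True  # enables CURVE encryption plus ZAP authentication
sock.bind("tcp://0.0.0.0:5570")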

@panf2333
Author

@russellb Hi Russell, for now, I've used 'ipc://' to address immediate security concerns. However, I'll be addressing network security comprehensively in a future PR. I plan to leverage pyzmq.auth to implement robust authentication and authorization mechanisms.

@russellb
Member

I don't think that's sufficient. We also need a viable option for encryption, ideally with TLS.

@panf2333
Author

@russellb I believe the disaggregation feature might benefit from optional TLS encryption. While encryption enhances security, it may introduce a slight performance overhead. Do you mean we could provide a configuration option to enable TLS encryption? That would let users choose the security level they need. I think users prefer to deploy clusters within secure environments such as intranets, so they want to maximize performance.

I will do in-depth research on auth and encryption before deciding on an approach. Until then, ZMQ will only be allowed to run locally. How about this?




@russellb
Member

That's fine. I'm completely OK with using it local-only.

Collaborator

@KuntaiDu KuntaiDu left a comment

Great work! Can you change the disaggregated prefill example file under the examples folder? Let's give newcomers a handle to run the disaggregated prefill example without having to figure out how to correctly set all the CLI args.


mergify bot commented Jan 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @panf2333.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Robert Shaw and others added 21 commits March 22, 2025 17:12

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale (Over 90 days of inactivity) label on Jul 22, 2025

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions bot closed this Aug 22, 2025
