[Core] Nanoflow-style Computation-Communication Overlap #23592
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a Nanoflow-style computation-communication overlap optimization, which is a significant performance enhancement. The implementation is non-intrusive, leveraging Torch FX graph transformations to partition batches and overlap operations. The changes are well-structured, with clear separation of concerns in the new `nanoflow` module. My main feedback is a limitation in the batch splitting logic, which currently only supports up to two nano-batches and thus contradicts the `max_num_nano_batches` configuration. Addressing this would make the feature more flexible and powerful for performance tuning.
vllm/utils/nano_split.py
return cu_num_tokens, arange

def prepare_nano_split_and_set_hooks(
We should unify the logic here with #21153 so that the attention splitting can be shared between this and the upcoming MoE dual batch overlap implementation.
Thanks for the suggestion, and this makes sense to me! The main challenge is that the current `split_attn_metadata` interface in that PR takes the original `common_attn_metadata` as input. This forces the splitting logic into `_prepare_inputs`, which couples it tightly with the existing preparation logic and makes the integration more intrusive. There are a few options to unify the logic while keeping things flexible:
- Add new interfaces that work directly from the scheduler output
- Have `_prepare_inputs` return the original `common_attn_metadata`
- Put the original `common_attn_metadata` into the builder-generated metadata, so it can be accessed later through the forward context (roughly sketched below)
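For concreteness, option 3 could look roughly like this (all names are hypothetical and only illustrate the idea, not the actual vLLM interfaces):

```python
# Hypothetical illustration of option 3: carry the original (unsplit)
# common_attn_metadata alongside the builder-generated metadata so it can be
# re-split later via the forward context. Not actual vLLM code.
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PerLayerAttnMetadata:
    backend_metadata: Any                       # what builder.build() produces today
    common_attn_metadata: Optional[Any] = None  # original metadata kept for re-splitting


def build_with_original(builder: Any, common_attn_metadata: Any) -> PerLayerAttnMetadata:
    built = builder.build(common_attn_metadata)
    return PerLayerAttnMetadata(backend_metadata=built,
                                common_attn_metadata=common_attn_metadata)
```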
I think moving the splitting logic into prepare inputs is fine; based on my understanding, for options 2 and 3 we still call `_prepare_inputs`, which means duplicating `builder.build` calls. Since this is on the hot path (it directly impacts TPOT in low-QPS regimes), we should be minimizing duplicated work as much as possible. I could potentially see option 1 working, but it would likely lead to duplicated code.
I think micro-batching will become fairly commonly used, both through Nanoflow and the wide-EP micro-batching @SageMoore and I are working on, so I think it's fine for it to be a first-class citizen in the gpu_model_runner. We should have a draft PR up very soon so you can see our planned gpu_model_runner changes 👍 (cc @SageMoore)
Got it. Looking forward to seeing the planned changes!
Signed-off-by: Yi Pan <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Yi Pan <[email protected]>
@Conless is this ready for review again?
Hi @ProExpertProg, the current version works well but has some conflicts with the latest DBO PRs #23693 and #24845. I'll fix them soon and ping you later!
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Yi Pan <[email protected]>
Hi @ProExpertProg, it's ready for review again! The performance of the current version is even better with the attention-metadata splitting logic from the DBO PRs.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Yi Pan <[email protected]>
Overview
Following discussions with Woosuk Kwon ([email protected]), this PR integrates the computation-communication overlap technique from Nanoflow [1] into vLLM using a non-intrusive, compilation-based approach. The integration operates at the Torch FX graph level by partitioning input batches into nano-batches and duplicating selected operations to overlap compute and communication operations. Key features include:
[1] NanoFlow: Towards Optimal Large Language Model Serving Throughput, OSDI 2025
Design and Implementation
To enable seamless and transparent intra-device parallelism, this PR introduces a graph-level transformation applied during model compilation. The transformation operates on the traced `torch.fx.Graph`, rather than the original model source code, making it entirely transparent to model implementations. This allows for broad applicability across different models and deployment backends. The figure below illustrates the overall pipeline of our approach:
**Graph Transformation:** During the compilation phase, the traced `torch.fx.Graph` is passed to a transformation function that partitions the graph into submodules based on resource usage patterns (e.g., computation vs. communication). These submodules are then duplicated to process different parts of the input batch, enabling pipelined execution with overlapping compute and communication. Notably, the graph is input-batch-size agnostic. The resulting transformed graphs are cached in a split manager to avoid runtime recompilation and thereby minimize CPU overhead.
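To make the partitioning idea concrete, here is a minimal standalone sketch (not the actual pass in this PR) that uses `torch.fx.passes.split_module` to separate compute nodes from communication nodes; `fake_all_reduce` stands in for a real collective such as a tensor-parallel all-reduce:

```python
# Minimal standalone sketch: partition a traced FX graph into compute vs.
# communication submodules. The actual PR pass is more involved (duplication
# per nano-batch, caching in the split manager, ...).
import torch
import torch.fx as fx
from torch.fx.passes.split_module import split_module


def fake_all_reduce(x: torch.Tensor) -> torch.Tensor:
    return x  # placeholder for dist.all_reduce(x) in a real TP setup


fx.wrap("fake_all_reduce")  # keep the call as a leaf node when tracing


class ToyLayer(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.up = torch.nn.Linear(64, 256)
        self.down = torch.nn.Linear(256, 64)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.down(torch.relu(self.up(x)))  # computation
        return fake_all_reduce(y)              # communication


layer = ToyLayer()
traced = fx.symbolic_trace(layer)


def partition(node: fx.Node) -> int:
    # Communication nodes go to partition 1, everything else to partition 0.
    return 1 if node.op == "call_function" and node.target is fake_all_reduce else 0


split = split_module(traced, layer, partition)
# The resulting compute-only and communication-only submodules can then be
# duplicated per nano-batch and interleaved so that one nano-batch's collective
# overlaps with another nano-batch's GEMMs.
print(split.graph)
```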
**Attention Metadata Preparation:** At run time, the model runner provides input batch information to a context preparation function. This function determines the nano-batch sizes, prepares the necessary attention metadata for each nano-batch, and stores all globally shared data (such as `vllm.ForwardContext` instances for each nano-batch) into the split manager.
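As an illustration of the kind of size decision involved (the helper below is illustrative, not the code added by this PR), a two-way split can be chosen along the per-request cumulative token counts so that each request lands entirely in one nano-batch:

```python
# Illustrative only: pick a two-way nano-batch split aligned to request
# boundaries, given cumulative per-request token counts.
import torch


def pick_nano_batch_split(cu_num_tokens: torch.Tensor) -> int:
    """Return how many requests go into the first nano-batch.

    `cu_num_tokens` holds cumulative tokens per request, e.g.
    tensor([3, 8, 15, 16]) for requests with 3, 5, 7 and 1 tokens.
    """
    total_tokens = int(cu_num_tokens[-1])
    half = torch.tensor(total_tokens // 2)
    # Index of the first request whose cumulative count reaches half the batch.
    split_req = int(torch.searchsorted(cu_num_tokens, half))
    return split_req + 1


print(pick_nano_batch_split(torch.tensor([3, 8, 15, 16])))  # -> 2
```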
**Run-time Hook:** During execution, the model’s `forward` method is redirected to a custom runtime callable. This callable retrieves the appropriate cached graph module and executes it using custom forward hooks. These hooks dynamically override the global forward context for each nano-batch to ensure correct and efficient execution.
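A schematic sketch of the hook idea (hypothetical names; not the actual implementation in this PR): a forward pre-hook swaps in the forward context of the nano-batch that a duplicated submodule is about to process.

```python
from typing import Any, Dict

import torch

_CURRENT_FORWARD_CONTEXT: Any = None  # stand-in for vLLM's global forward context


def make_context_hook(nano_batch_contexts: Dict[int, Any], nano_batch_id: int):
    """Build a pre-hook that activates the context of `nano_batch_id`."""

    def pre_hook(module: torch.nn.Module, args: tuple) -> None:
        global _CURRENT_FORWARD_CONTEXT
        # Attention layers inside this submodule will now see the metadata
        # belonging to their own nano-batch.
        _CURRENT_FORWARD_CONTEXT = nano_batch_contexts[nano_batch_id]

    return pre_hook


# Usage: one hook per duplicated submodule, e.g.
#   submod.register_forward_pre_hook(make_context_hook(contexts, nano_batch_id=0))
```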
**Graceful Degradation:** Splitting into nano-batches does incur overheads for small input batches. To maintain robustness across varying workloads and avoid GPU underutilization, the system automatically skips nano-batch splitting when the total token batch size is below a threshold (`min_nano_split_tokens`, default: 1024). Additionally, because this is a graph-level optimization, the entire feature can be toggled via a simple configuration flag, ensuring no performance regressions when disabled.
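The degradation check itself reduces to something like the sketch below; the threshold name and its default come from this description, while the surrounding function is purely illustrative.

```python
MIN_NANO_SPLIT_TOKENS = 1024


def should_split_into_nano_batches(num_tokens: int, enabled: bool = True) -> bool:
    # For small batches the extra kernel launches and smaller GEMMs outweigh
    # any overlap benefit, so fall back to the unsplit graph.
    return enabled and num_tokens >= MIN_NANO_SPLIT_TOKENS
```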
Evaluations
We tested the current implementation with the LLaMA 3-8B model on 2xH200 GPUs (TP=2, use_cudagraph=False) using `benchmark_throughput.py`. It reduces single-iteration latency by 13% and increases end-to-end throughput by up to 8%. Results are reported for three workloads:
- Input 512, output 0
- Input 512, output 512
- Input 1024, output 512
Discussion & Future Work
In the future, we plan to add these features on top of the current design:
Co-authored-by: Kan Zhu <[email protected]>
Co-authored-by: Yilong Zhao <[email protected]>
Co-authored-by: Ziren Wang <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Baris Kasikci <[email protected]>