
Conversation


@Conless Conless commented Aug 25, 2025

Overview

Following discussions with Woosuk Kwon ([email protected]), this PR integrates the computation-communication overlap technique from NanoFlow [1] into vLLM using a non-intrusive, compilation-based approach. The integration operates at the Torch FX graph level, partitioning input batches into nano-batches and duplicating selected operations so that computation and communication can overlap. Key features include:

  • Non-intrusive code path: This design introduces fewer than 50 lines of changes to the core vLLM codebase and avoids any model-specific modifications.
  • Robust performance gain: It delivers ~10% speedup for certain workloads (e.g., large-batch inference), while guarding against performance regression for small-batch inference via a simple heuristic.

[1] NanoFlow: Towards Optimal Large Language Model Serving Throughput, OSDI 2025

Design and Implementation

To enable seamless and transparent intra-device parallelism, this PR introduces a graph-level transformation applied during model compilation. The transformation operates on the traced torch.fx.Graph, rather than the original model source code, making it entirely transparent to model implementations. This allows for broad applicability across different models and deployment backends.

The figure below illustrates the overall pipeline of our approach:

  • Graph Transformation: During the compilation phase, the traced torch.fx.Graph is passed to a transformation function that partitions the graph into submodules based on resource usage patterns (e.g., computation vs. communication). These submodules are then duplicated to process different parts of the input batch, enabling pipelined execution with overlapping compute and communication; a minimal sketch of this pass follows the list. Notably, the transformed graph is agnostic to the input batch size. The resulting transformed graphs are cached in a split manager to avoid runtime recompilation and thereby minimize CPU overhead.

  • Attention Metadata Preparation: At run time, the model runner provides input batch information to a context preparation function. This function determines the nano-batch sizes, prepares the necessary attention metadata for each nano-batch, and stores all globally shared data (such as vllm.ForwardContext instances for each nano-batch) into the split manager.

  • Run-time Hook: During execution, the model’s forward method is redirected to a custom runtime callable. This callable retrieves the appropriate cached graph module and executes it using custom forward hooks. These hooks dynamically override the global forward context for each nano-batch to ensure correct and efficient execution (a minimal runtime sketch follows the figure below).

  • Graceful Degradation: Splitting into nano-batches incurs overhead for small input batches. To maintain robustness across varying workloads and avoid GPU underutilization, the system automatically skips nano-batch splitting when the total number of tokens in the batch is below a threshold (min_nano_split_tokens, default: 1024). Additionally, because this is a graph-level optimization, the entire feature can be toggled via a simple configuration flag, ensuring no performance regression when it is disabled.
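
As a rough illustration of the graph-transformation step, the sketch below cuts a traced graph into alternating compute/communication submodules with torch.fx.passes.split_module, classifying nodes by a simple name match on all_reduce. Every name here (partition_by_resource, fake_all_reduce, ToyLayer) is an illustrative stand-in rather than an identifier from this PR; the actual pass additionally duplicates the submodules across nano-batches and caches the transformed graphs in the split manager.

```python
import torch
import torch.fx as fx
from torch.fx.passes.split_module import split_module


def fake_all_reduce(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a collective such as torch.distributed.all_reduce, so the
    # example runs without initializing a process group.
    return x


fx.wrap("fake_all_reduce")  # keep calls to it as graph nodes instead of tracing through them


def _is_comm(node: fx.Node) -> bool:
    # Classify a node as "communication" by its target name; the real pass
    # classifies by resource-usage patterns rather than a name match.
    return node.op == "call_function" and "all_reduce" in str(node.target)


def partition_by_resource(gm: fx.GraphModule) -> fx.GraphModule:
    """Cut the traced graph into alternating compute/communication submodules,
    which can then be duplicated across nano-batches so that one nano-batch's
    communication overlaps with another's compute."""
    partition_id = 0
    last_was_comm = None

    def callback(node: fx.Node) -> int:
        nonlocal partition_id, last_was_comm
        is_comm = _is_comm(node)
        if last_was_comm is not None and is_comm != last_was_comm:
            partition_id += 1  # new submodule at every compute/communication boundary
        last_was_comm = is_comm
        return partition_id

    return split_module(gm, gm, callback)


class ToyLayer(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.up = torch.nn.Linear(16, 16)
        self.down = torch.nn.Linear(16, 16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.up(x)           # compute
        x = fake_all_reduce(x)   # "communication"
        return self.down(x)      # compute


if __name__ == "__main__":
    split = partition_by_resource(fx.symbolic_trace(ToyLayer()))
    print(split)  # submod_0 (compute), submod_1 (comm), submod_2 (compute)
```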

[Figure "design": overall pipeline of the graph transformation, attention metadata preparation, and runtime hooks]
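
On the runtime side, the sketch below shows the threshold heuristic and the hook-based context switching described in the last two bullets, assuming a module-level variable stands in for the global forward context. All names (NanoContext, run_nano_batches) are illustrative; in the PR the contexts are vllm.ForwardContext instances prepared during attention-metadata preparation and retrieved from the split manager.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

import torch


@dataclass
class NanoContext:
    # Stand-in for the per-nano-batch forward context (attention metadata and
    # related state) that the PR prepares as vllm.ForwardContext objects.
    attn_metadata: Any


_current_context: Optional[NanoContext] = None  # stand-in for the global forward context


def _context_hook(ctx: NanoContext):
    def hook(module: torch.nn.Module, args: tuple) -> None:
        # Forward pre-hook: swap in this nano-batch's context just before the
        # cached graph module runs, mirroring the run-time hook described above.
        global _current_context
        _current_context = ctx
    return hook


def run_nano_batches(
    graph_module: torch.nn.Module,
    nano_inputs: List[torch.Tensor],
    nano_contexts: List[NanoContext],
    min_nano_split_tokens: int = 1024,
) -> torch.Tensor:
    total_tokens = sum(x.shape[0] for x in nano_inputs)
    if total_tokens < min_nano_split_tokens:
        # Graceful degradation: below the threshold, skip splitting and run the
        # whole batch once (the PR falls back to the ordinary single-batch path).
        return graph_module(torch.cat(nano_inputs, dim=0))

    outputs = []
    for inp, ctx in zip(nano_inputs, nano_contexts):
        handle = graph_module.register_forward_pre_hook(_context_hook(ctx))
        try:
            # The real runtime callable interleaves the duplicated submodules so
            # that communication and compute of different nano-batches overlap;
            # this sequential loop only shows the per-nano-batch context switch.
            outputs.append(graph_module(inp))
        finally:
            handle.remove()
    return torch.cat(outputs, dim=0)
```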

Evaluations

We tested the current implementation with the Llama 3 8B model on 2x H200 GPUs (TP=2, use_cudagraph=False) using benchmark_throughput.py. It reduces single-iteration latency by 13% and increases end-to-end throughput by up to 8%.

Throughput (token/s):

  Workload                 vLLM       vLLM w/ NanoFlow
  Input 512, output 0      69533.7    74532.8
  Input 512, output 512    23018.2    24066.4
  Input 1024, output 512   31135.3    32587.0

Discussion & Future Work

In the future, we plan to add these features on top of the current design:

  • Explore CUDA graph compatibility
  • Implement more types of intra-device parallelism, such as overlapping compute-bound and memory-bound operators, and dual-batch overlap for MoE models

Co-authored-by: Kan Zhu <[email protected]>
Co-authored-by: Yilong Zhao <[email protected]>
Co-authored-by: Ziren Wang <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Baris Kasikci <[email protected]>


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.


mergify bot commented Aug 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Conless.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 25, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Nanoflow-style computation-communication overlap optimization, which is a significant performance enhancement. The implementation is non-intrusive, leveraging Torch FX graph transformations to partition batches and overlap operations. The changes are well-structured, with clear separation of concerns in the new nanoflow module. My main feedback is regarding a limitation in the batch splitting logic that currently only supports up to two nano-batches, which contradicts the max_num_nano_batches configuration. Addressing this would make the feature more flexible and powerful for performance tuning.

return cu_num_tokens, arange


def prepare_nano_split_and_set_hooks(
Collaborator

We should unify the logic here with #21153 so that the attention splitting can be shared between this and the upcoming MoE dual-batch overlap implementation.

Author

Thanks for the suggestion, and this makes sense to me! The main challenge is that the current split_attn_metadata interface in that PR takes the original common_attn_metadata as input. This forces the splitting logic into _prepare_inputs, which couples it tightly with the existing preparation logic and makes the integration more intrusive. There are a few options to unify the logic while keeping things flexible:

  1. Add new interfaces that work directly from the scheduler output
  2. Have _prepare_inputs return the original common_attn_metadata
  3. Put the original common_attn_metadata into the builder-generated metadata, so it can be accessed later through the forward context (rough sketch below)
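
To make option 3 concrete, here is a rough sketch with hypothetical stand-in types (not the actual vLLM class names): the builder-produced metadata keeps a reference to the original common metadata, so a later nano-batch or dual-batch splitter can reach it through the forward context instead of re-running _prepare_inputs.

```python
from dataclasses import dataclass
from typing import Any

import torch


@dataclass
class CommonMetadata:
    # Stand-in for the original common_attn_metadata produced in _prepare_inputs.
    query_start_loc: torch.Tensor
    seq_lens: torch.Tensor


@dataclass
class BuiltMetadata:
    # Stand-in for the backend builder's output; it keeps the common metadata
    # around so splitting can happen later, outside _prepare_inputs.
    backend_specific: Any
    common: CommonMetadata


def common_metadata_from_context(built: BuiltMetadata) -> CommonMetadata:
    # A nano-batch (or DBO) splitter retrieves the original metadata via the
    # forward context and splits it there, keeping _prepare_inputs untouched.
    return built.common
```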

Collaborator

I think moving the splitting logic into prepare inputs is fine; based on my understanding, for options 2 and 3 we still call _prepare_inputs, which means duplicating builder.build calls. Since this is on the hot path (it directly impacts TPOT in low-QPS regimes), we should minimize duplicated work as much as possible. I could potentially see option 1 working, but it would likely lead to duplicated code.

I think micro-batching will become fairly commonly used, both through NanoFlow and the wide-EP micro-batching @SageMoore and I are working on, so I think it's fine for it to be a first-class citizen in the gpu_model_runner. We should have a draft PR up very soon so you can see our planned gpu_model_runner changes 👍 (cc @SageMoore)

Author

Got it. Looking forward to seeing the planned changes!

Signed-off-by: Yi Pan <[email protected]>
@mergify mergify bot removed the needs-rebase label Sep 1, 2025

mergify bot commented Sep 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Conless.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 8, 2025
@ProExpertProg ProExpertProg moved this from To triage to In progress in torch.compile integration Sep 12, 2025
@ProExpertProg ProExpertProg moved this from In progress to Done in torch.compile integration Sep 12, 2025
@ProExpertProg ProExpertProg moved this from Done to In review in torch.compile integration Sep 12, 2025
@mergify mergify bot removed the needs-rebase label Sep 22, 2025
@ProExpertProg
Collaborator

@Conless is this ready for review again?

@Conless
Author

Conless commented Sep 25, 2025

@Conless is this ready for review again?

Hi @ProExpertProg, the current version works well but has some conflicts with the latest DBO PRs #23693 and #24845. I'll fix them soon and ping you later!


mergify bot commented Sep 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Conless.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 25, 2025
@mergify mergify bot removed the needs-rebase label Oct 2, 2025
Signed-off-by: Yi Pan <[email protected]>
@Conless
Author

Conless commented Oct 3, 2025

Hi @ProExpertProg, it's ready for review again! The performance of the current version is even better with the attention metadata splitting logic from the DBO PRs.


mergify bot commented Oct 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Conless.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 8, 2025
@mergify mergify bot removed the needs-rebase label Oct 10, 2025
Signed-off-by: Yi Pan <[email protected]>
