[Core] Nanoflow-style Computation-Communication Overlap #23592
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a Nanoflow-style computation-communication overlap optimization, which is a significant performance enhancement. The implementation is non-intrusive, leveraging Torch FX graph transformations to partition batches and overlap operations. The changes are well-structured, with clear separation of concerns in the new `nanoflow` module. My main feedback is a limitation in the batch splitting logic, which currently only supports up to two nano-batches and thus contradicts the `max_num_nano_batches` configuration. Addressing this would make the feature more flexible and powerful for performance tuning.
vllm/utils/nano_split.py
return cu_num_tokens, arange

def prepare_nano_split_and_set_hooks(
We should unify the logic here with #21153 so that the attention splitting can be shared between this and the upcoming MoE dual batch overlap implementation.
Thanks for the suggestion, and this makes sense to me! The main challenge is that the current `split_attn_metadata` interface in that PR takes the original `common_attn_metadata` as input. This forces the splitting logic into `_prepare_inputs`, which couples it tightly with the existing preparation logic and makes the integration more intrusive. There are a few options to unify the logic while keeping things flexible:
- Add new interfaces that work directly from the scheduler output
- Have `_prepare_inputs` return the original `common_attn_metadata`
- Put the original `common_attn_metadata` into the builder-generated metadata, so it can be accessed later through the forward context (roughly sketched below)
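For concreteness, option 3 could look roughly like this (all names are hypothetical and only illustrate the idea, not the actual vLLM interfaces):

```python
# Hypothetical illustration of option 3: carry the original (unsplit)
# common_attn_metadata alongside the builder-generated metadata so it can be
# re-split later via the forward context. Not actual vLLM code.
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PerLayerAttnMetadata:
    backend_metadata: Any                       # what builder.build() produces today
    common_attn_metadata: Optional[Any] = None  # original metadata kept for re-splitting


def build_with_original(builder: Any, common_attn_metadata: Any) -> PerLayerAttnMetadata:
    built = builder.build(common_attn_metadata)
    return PerLayerAttnMetadata(backend_metadata=built,
                                common_attn_metadata=common_attn_metadata)
```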
I think moving the splitting logic into prepare inputs is fine; based on my understanding, for options 2 and 3 we still call `_prepare_inputs`, which means duplicating `builder.build` calls. Since this is on the hot path (it directly impacts TPOT in low-QPS regimes), we should be minimizing duplicated work as much as possible. I could potentially see option 1 working, but it would likely lead to duplicated code.
I think micro-batching will become fairly commonly used, both through Nanoflow and the wide-EP micro-batching @SageMoore and I are working on, so I think it's fine for it to be a first-class citizen in the gpu_model_runner. We should have a draft PR up very soon so you can see our planned gpu_model_runner changes 👍 (cc @SageMoore)
Got it. Looking forward to seeing the planned changes!
Signed-off-by: Yi Pan <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Yi Pan <[email protected]>
@Conless is this ready for review again?
Hi @ProExpertProg, the current version works well but has some conflicts with the latest DBO PRs #23693 and #24845. I'll fix them soon and ping you later!
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Yi Pan <[email protected]>
Hi @ProExpertProg, it's ready for review again! The performance of the current version is even better with the attention-metadata splitting logic from the DBO PRs.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Yi Pan <[email protected]>
Overview
Following discussions with Woosuk Kwon ([email protected]), this PR integrates the computation-communication overlap technique from Nanoflow [1] into vLLM using a non-intrusive, compilation-based approach. The integration operates at the Torch FX graph level by partitioning input batches into nano-batches and duplicating selected operations to overlap compute and communication operations. Key features include:
[1] NanoFlow: Towards Optimal Large Language Model Serving Throughput, OSDI 2025
Design and Implementation
To enable seamless and transparent intra-device parallelism, this PR introduces a graph-level transformation applied during model compilation. The transformation operates on the traced `torch.fx.Graph`, rather than the original model source code, making it entirely transparent to model implementations. This allows for broad applicability across different models and deployment backends. The figure below illustrates the overall pipeline of our approach:
**Graph Transformation:** During the compilation phase, the traced `torch.fx.Graph` is passed to a transformation function that partitions the graph into submodules based on resource usage patterns (e.g., computation vs. communication). These submodules are then duplicated to process different parts of the input batch, enabling pipelined execution with overlapping compute and communication. Notably, the graph is input-batch-size agnostic. The resulting transformed graphs are cached in a split manager to avoid runtime recompilation and thereby minimize CPU overhead.
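To make the partitioning idea concrete, here is a minimal standalone sketch (not the actual pass in this PR) that uses `torch.fx.passes.split_module` to separate compute nodes from communication nodes; `fake_all_reduce` stands in for a real collective such as a tensor-parallel all-reduce:

```python
# Minimal standalone sketch: partition a traced FX graph into compute vs.
# communication submodules. The actual PR pass is more involved (duplication
# per nano-batch, caching in the split manager, ...).
import torch
import torch.fx as fx
from torch.fx.passes.split_module import split_module


def fake_all_reduce(x: torch.Tensor) -> torch.Tensor:
    return x  # placeholder for dist.all_reduce(x) in a real TP setup


fx.wrap("fake_all_reduce")  # keep the call as a leaf node when tracing


class ToyLayer(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.up = torch.nn.Linear(64, 256)
        self.down = torch.nn.Linear(256, 64)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.down(torch.relu(self.up(x)))  # computation
        return fake_all_reduce(y)              # communication


layer = ToyLayer()
traced = fx.symbolic_trace(layer)


def partition(node: fx.Node) -> int:
    # Communication nodes go to partition 1, everything else to partition 0.
    return 1 if node.op == "call_function" and node.target is fake_all_reduce else 0


split = split_module(traced, layer, partition)
# The resulting compute-only and communication-only submodules can then be
# duplicated per nano-batch and interleaved so that one nano-batch's collective
# overlaps with another nano-batch's GEMMs.
print(split.graph)
```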
**Attention Metadata Preparation:** At run time, the model runner provides input batch information to a context preparation function. This function determines the nano-batch sizes, prepares the necessary attention metadata for each nano-batch, and stores all globally shared data (such as `vllm.ForwardContext` instances for each nano-batch) into the split manager.
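As an illustration of the kind of size decision involved (the helper below is illustrative, not the code added by this PR), a two-way split can be chosen along the per-request cumulative token counts so that each request lands entirely in one nano-batch:

```python
# Illustrative only: pick a two-way nano-batch split aligned to request
# boundaries, given cumulative per-request token counts.
import torch


def pick_nano_batch_split(cu_num_tokens: torch.Tensor) -> int:
    """Return how many requests go into the first nano-batch.

    `cu_num_tokens` holds cumulative tokens per request, e.g.
    tensor([3, 8, 15, 16]) for requests with 3, 5, 7 and 1 tokens.
    """
    total_tokens = int(cu_num_tokens[-1])
    half = torch.tensor(total_tokens // 2)
    # Index of the first request whose cumulative count reaches half the batch.
    split_req = int(torch.searchsorted(cu_num_tokens, half))
    return split_req + 1


print(pick_nano_batch_split(torch.tensor([3, 8, 15, 16])))  # -> 2
```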
**Run-time Hook:** During execution, the model’s `forward` method is redirected to a custom runtime callable. This callable retrieves the appropriate cached graph module and executes it using custom forward hooks. These hooks dynamically override the global forward context for each nano-batch to ensure correct and efficient execution.
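A schematic sketch of the hook idea (hypothetical names; not the actual implementation in this PR): a forward pre-hook swaps in the forward context of the nano-batch that a duplicated submodule is about to process.

```python
from typing import Any, Dict

import torch

_CURRENT_FORWARD_CONTEXT: Any = None  # stand-in for vLLM's global forward context


def make_context_hook(nano_batch_contexts: Dict[int, Any], nano_batch_id: int):
    """Build a pre-hook that activates the context of `nano_batch_id`."""

    def pre_hook(module: torch.nn.Module, args: tuple) -> None:
        global _CURRENT_FORWARD_CONTEXT
        # Attention layers inside this submodule will now see the metadata
        # belonging to their own nano-batch.
        _CURRENT_FORWARD_CONTEXT = nano_batch_contexts[nano_batch_id]

    return pre_hook


# Usage: one hook per duplicated submodule, e.g.
#   submod.register_forward_pre_hook(make_context_hook(contexts, nano_batch_id=0))
```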
**Graceful Degradation:** Splitting into nano-batches does incur overheads for small input batches. To maintain robustness across varying workloads and avoid GPU underutilization, the system automatically skips nano-batch splitting when the total token batch size is below a threshold (`min_nano_split_tokens`, default: 1024). Additionally, because this is a graph-level optimization, the entire feature can be toggled via a simple configuration flag, ensuring no performance regressions when disabled.
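The degradation check itself reduces to something like the sketch below; the threshold name and its default come from this description, while the surrounding function is purely illustrative.

```python
MIN_NANO_SPLIT_TOKENS = 1024


def should_split_into_nano_batches(num_tokens: int, enabled: bool = True) -> bool:
    # For small batches the extra kernel launches and smaller GEMMs outweigh
    # any overlap benefit, so fall back to the unsplit graph.
    return enabled and num_tokens >= MIN_NANO_SPLIT_TOKENS
```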
Evaluations
We tested the current implementation with the LLaMA 3-8B model on 2xH200 GPUs (TP=2, use_cudagraph=False) using `benchmark_throughput.py`. It reduces single-iteration latency by 13% and increases end-to-end throughput by up to 8%. Results are reported for three workloads:
- Input 512, output 0
- Input 512, output 512
- Input 1024, output 512
Discussion & Future Work
In the future, we plan to add these features on top of the current design:
Co-authored-by: Kan Zhu <[email protected]>
Co-authored-by: Yilong Zhao <[email protected]>
Co-authored-by: Ziren Wang <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Baris Kasikci <[email protected]>