[torch.compile] CUDAGraph Inductor partition integration #24281
Conversation
Code Review
This pull request introduces CUDAGraph partitioning by integrating custom wrappers with the Torch Inductor compiler. The core logic is added in `vllm/compilation/backends.py`. The review identifies a critical debugging statement (`breakpoint()`) that must be removed, as it will halt execution. Additionally, there are several large blocks of commented-out code and unused variables that should be cleaned up to improve code readability and maintainability.
We should add a config flag that enables this. It can be experimental for now, but we should add good documentation because it will likely stick around (even if on by default): other platforms will reuse the existing cudagraph wrapper mechanism and piecewise splitting after Dynamo!
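As a minimal sketch of what such an experimental, documented flag could look like (the field name follows this PR; the surrounding dataclass is abbreviated and illustrative, not the actual vLLM source):

```python
from dataclasses import dataclass


@dataclass
class CompilationConfig:
    # ... existing fields elided ...

    use_inductor_graph_partition: bool = False
    """(Experimental) Let Inductor partition the graph at cudagraph-unsafe
    ops and wrap each partition with vLLM's CUDAGraph wrapper, instead of
    splitting the FX graph into piecewise subgraphs before Inductor.
    Requires PyTorch 2.9+."""
```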
This PR adds an interface to allow users to specify custom cudagraph wrapper. User example: [vllm](vllm-project/vllm#24281) Pull Request resolved: pytorch#162207 Approved by: https://github.com/zou3519, https://github.com/eellison, https://github.com/ProExpertProg
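The exact hooks live in pytorch/pytorch#162207; as a rough sketch of the shape of such a per-partition wrapper (the registration call and metadata type are illustrative names, not the verified PyTorch API):

```python
# Hypothetical sketch of a custom CUDAGraph wrapper applied to each
# Inductor graph partition. The registration hook name and metadata
# contents are illustrative; see pytorch/pytorch#162207 for the real
# interface.
from typing import Any, Callable


def cudagraph_partition_wrapper(
    partition_fn: Callable[..., Any],
    metadata: Any,  # e.g. which partition this is, out of how many
) -> Callable[..., Any]:
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # A real wrapper would capture the partition into a CUDA graph
        # once per shape and replay it on later calls; here we just call
        # through to keep the sketch self-contained.
        return partition_fn(*args, **kwargs)

    return wrapped


# Registration point (illustrative):
# torch._inductor.utils.set_customized_partition_wrappers(
#     cudagraph_partition_wrapper)
```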
Looks good! Nice work. I left a few comments, mostly about comments and asserts. We should improve the documentation around this. If you want to accept some of my suggestions directly, that would also help me be added as a coauthor :D.
We should add tests: one that just extends the current piecewise cudagraph tests, and one that also tests that attention fusion happened with this splitting method (and that it's not broken). We should also check the performance of attention fusion after this PR. Let me know if you need help with which commands to run.
Additionally, it might be nice to be able to pass a list of "splitting ops" to Inductor during compilation (as opposed to at op declaration time). If we want to decide whether to exclude attention or fused_moe (or neither, or both), we can't depend on `torch._C.Tag.cudagraph_unsafe`, because that's a static property of the op. I guess for now we can depend on the old (current) splitting pathway, but it might be nice in the future to use a list inside the config.
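For illustration, a minimal sketch of the declaration-time tagging this refers to (the library and op names below are hypothetical, not vLLM's real ops, and the `cudagraph_unsafe` tag requires a PyTorch build that defines it):

```python
# Sketch: tag a custom op as cudagraph-unsafe at declaration time, so
# Inductor's graph partitioning splits around it. "my_ops" and
# "my_attention" are placeholder names.
import torch
from torch.library import Library, impl

my_lib = Library("my_ops", "FRAGMENT")
my_lib.define(
    "my_attention(Tensor q, Tensor k, Tensor v) -> Tensor",
    tags=(torch._C.Tag.cudagraph_unsafe,),  # static property of the op
)


@impl(my_lib, "my_attention", "CUDA")
def _my_attention_cuda(q: torch.Tensor, k: torch.Tensor,
                       v: torch.Tensor) -> torch.Tensor:
    # Placeholder math; a real op would call a kernel that cannot be
    # captured inside a CUDA graph (hence the tag above).
    return torch.softmax(q @ k.transpose(-1, -2), dim=-1) @ v
```

A config-driven alternative, as suggested in the comment, would instead hand the compiler a list of splitting ops at compile time rather than baking the decision into the op definition.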
Depends on pytorch/pytorch#162207 (landed and available in PyTorch 2.9).
Command:
Tests

Test 1: `test_simple_inductor_graph_partition` checks that, when `use_inductor_graph_partition=True`, we have 1 `num_piecewise_graphs_seen` and 1 `num_backend_compilations`. By contrast, `test_simple_piecewise_compile` checks that, when `use_inductor_graph_partition=False`, we have 5 `num_piecewise_graphs_seen` and 3 `num_backend_compilations`, since we are splitting at the FX graph level. For both tests, we assert that `num_cudagraph_captured=6`.

Test 2: `test_custom_compile_config` checks that `use_inductor_graph_partition=True`, `level=CompilationLevel.PIECEWISE`, and `cudagraph_mode=CUDAGraphMode.PIECEWISE` work together.

Test 3: `test_attention_quant_pattern` checks that `attention+FP8Quant` fusion happens when `use_inductor_graph_partition=True`.
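For context, a sketch of the configuration combination these tests exercise (import paths and exact field spellings are assumptions based on the names above, not copied from the test code):

```python
# Sketch: combine Inductor graph partitioning with piecewise compilation
# and piecewise CUDA graphs, as exercised by test_custom_compile_config.
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel, CUDAGraphMode

compilation_config = CompilationConfig(
    level=CompilationLevel.PIECEWISE,
    cudagraph_mode=CUDAGraphMode.PIECEWISE,
    use_inductor_graph_partition=True,
)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    compilation_config=compilation_config,
)
```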
Benchmark
Model: meta-llama/Meta-Llama-3.1-8B
Hardware: B200
With vLLM Piecewise CUDAGraph backend:
(benchmark results screenshot)

With Inductor graph partition:
(benchmark results screenshot)
TTFT is 0.4% faster and TPOT is 2% slower.
Start Time
Support Attention Fusion
Trace w/o attn fusion. We can see [prior cudagraph'ed kernels] -> vllm::scaled_fp8_quant_kernel_strided -> void vllm::reshape_and_cache_flash_kernel -> fmhaSm100Kernel -> [(next cudagraph'ed kernels) vllm::scaled_fp8_quant_kernel_strided -> _ZN7cutlass13device_kernelINS_4gemm6kernel -> ...].
(trace screenshot)

Trace w/ attn fusion. We can see [prior cudagraph'ed kernels] -> vllm::scaled_fp8_quant_kernel -> vllm::reshape_and_cache_flash_kernel -> fmhaSm100Kernel -> [(next cudagraph'ed kernels) _ZN7cutlass13device_kernelINS_4gemm6kernel -> ...]. Note that the second vllm::scaled_fp8_quant_kernel_strided is moved from the cudagraph'ed region into fmhaSm100Kernel.
(trace screenshot)