[torch.compile] CUDAGraph Inductor partition integration #24281
Conversation
Code Review
This pull request introduces CUDAGraph partitioning by integrating custom wrappers with the Torch Inductor compiler. The core logic is added in `vllm/compilation/backends.py`. The review identifies a critical debugging statement (`breakpoint()`) that must be removed, as it will halt execution. Additionally, there are several large blocks of commented-out code and unused variables that should be cleaned up to improve code readability and maintainability.
We should add a config flag that enables this. It can be experimental for now, but we should add good documentation because it will likely stick around (even if on by default): other platforms will reuse the existing cudagraph wrapper mechanism and piecewise splitting after Dynamo!
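As a minimal sketch of what such an experimental, documented flag could look like (the field name follows this PR; the surrounding dataclass is abbreviated and illustrative, not the actual vLLM source):

```python
from dataclasses import dataclass


@dataclass
class CompilationConfig:
    # ... existing fields elided ...

    use_inductor_graph_partition: bool = False
    """(Experimental) Let Inductor partition the graph at cudagraph-unsafe
    ops and wrap each partition with vLLM's CUDAGraph wrapper, instead of
    splitting the FX graph into piecewise subgraphs before Inductor.
    Requires PyTorch 2.9+."""
```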
This PR adds an interface to allow users to specify custom cudagraph wrapper. User example: [vllm](vllm-project/vllm#24281) Pull Request resolved: pytorch#162207 Approved by: https://github.com/zou3519, https://github.com/eellison, https://github.com/ProExpertProg
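The exact hooks live in pytorch/pytorch#162207; as a rough sketch of the shape of such a per-partition wrapper (the registration call and metadata type are illustrative names, not the verified PyTorch API):

```python
# Hypothetical sketch of a custom CUDAGraph wrapper applied to each
# Inductor graph partition. The registration hook name and metadata
# contents are illustrative; see pytorch/pytorch#162207 for the real
# interface.
from typing import Any, Callable


def cudagraph_partition_wrapper(
    partition_fn: Callable[..., Any],
    metadata: Any,  # e.g. which partition this is, out of how many
) -> Callable[..., Any]:
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # A real wrapper would capture the partition into a CUDA graph
        # once per shape and replay it on later calls; here we just call
        # through to keep the sketch self-contained.
        return partition_fn(*args, **kwargs)

    return wrapped


# Registration point (illustrative):
# torch._inductor.utils.set_customized_partition_wrappers(
#     cudagraph_partition_wrapper)
```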
Looks good! Nice work. I left a few comments, mostly about comments and asserts. We should improve the documentation around this. If you want to accept some of my suggestions directly, that would also help me be added as a coauthor :D.
We should add tests: one that just extends the current piecewise cudagraph tests, and one that also tests that attention fusion happened with this splitting method (and that it's not broken). We should also check the performance of attention fusion after this PR. Let me know if you need help with which commands to run.
Additionally, it might be nice to be able to pass a list of "splitting ops" to Inductor during compilation (as opposed to at op declaration time). If we want to decide whether to exclude attention or fused_moe (or neither, or both), we can't depend on `torch._C.Tag.cudagraph_unsafe`, because that's a static property of the op. I guess for now we can depend on the old (current) splitting pathway, but it might be nice in the future to use a list inside the config.
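For illustration, a minimal sketch of the declaration-time tagging this refers to (the library and op names below are hypothetical, not vLLM's real ops, and the `cudagraph_unsafe` tag requires a PyTorch build that defines it):

```python
# Sketch: tag a custom op as cudagraph-unsafe at declaration time, so
# Inductor's graph partitioning splits around it. "my_ops" and
# "my_attention" are placeholder names.
import torch
from torch.library import Library, impl

my_lib = Library("my_ops", "FRAGMENT")
my_lib.define(
    "my_attention(Tensor q, Tensor k, Tensor v) -> Tensor",
    tags=(torch._C.Tag.cudagraph_unsafe,),  # static property of the op
)


@impl(my_lib, "my_attention", "CUDA")
def _my_attention_cuda(q: torch.Tensor, k: torch.Tensor,
                       v: torch.Tensor) -> torch.Tensor:
    # Placeholder math; a real op would call a kernel that cannot be
    # captured inside a CUDA graph (hence the tag above).
    return torch.softmax(q @ k.transpose(-1, -2), dim=-1) @ v
```

A config-driven alternative, as suggested in the comment, would instead hand the compiler a list of splitting ops at compile time rather than baking the decision into the op definition.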
Depends on pytorch/pytorch#162207 (landed and available in PyTorch 2.9).
Command:
Tests

Test 1: `test_simple_inductor_graph_partition` checks that, when `use_inductor_graph_partition=True`, we have 1 `num_piecewise_graphs_seen` and 1 `num_backend_compilations`. By contrast, `test_simple_piecewise_compile` checks that, when `use_inductor_graph_partition=False`, we have 5 `num_piecewise_graphs_seen` and 3 `num_backend_compilations`, since we are splitting at the FX graph level. For both tests, we assert that `num_cudagraph_captured=6`.

Test 2: `test_custom_compile_config` checks that `use_inductor_graph_partition=True`, `level=CompilationLevel.PIECEWISE`, and `cudagraph_mode=CUDAGraphMode.PIECEWISE` work together.

Test 3: `test_attention_quant_pattern` checks that `attention+FP8Quant` fusion happens when `use_inductor_graph_partition=True`.
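For context, a sketch of the configuration combination these tests exercise (import paths and exact field spellings are assumptions based on the names above, not copied from the test code):

```python
# Sketch: combine Inductor graph partitioning with piecewise compilation
# and piecewise CUDA graphs, as exercised by test_custom_compile_config.
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel, CUDAGraphMode

compilation_config = CompilationConfig(
    level=CompilationLevel.PIECEWISE,
    cudagraph_mode=CUDAGraphMode.PIECEWISE,
    use_inductor_graph_partition=True,
)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    compilation_config=compilation_config,
)
```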
Benchmark
Model: meta-llama/Meta-Llama-3.1-8B
Hardware: B200
With vLLM Piecewise CUDAGraph backend:
(benchmark results screenshot)

With Inductor graph partition:
(benchmark results screenshot)
TTFT is 0.4% faster and TPOT is 2% slower.
Start Time
Support Attention Fusion
Trace w/o attn fusion. We can see [prior cudagraph'ed kernels] -> vllm::scaled_fp8_quant_kernel_strided -> void vllm::reshape_and_cache_flash_kernel -> fmhaSm100Kernel -> [(next cudagraph'ed kernels) vllm::scaled_fp8_quant_kernel_strided -> _ZN7cutlass13device_kernelINS_4gemm6kernel -> ...].
(trace screenshot)

Trace w/ attn fusion. We can see [prior cudagraph'ed kernels] -> vllm::scaled_fp8_quant_kernel -> vllm::reshape_and_cache_flash_kernel -> fmhaSm100Kernel -> [(next cudagraph'ed kernels) _ZN7cutlass13device_kernelINS_4gemm6kernel -> ...]. Note that the second vllm::scaled_fp8_quant_kernel_strided is moved from the cudagraph'ed region into fmhaSm100Kernel.
(trace screenshot)