
Conversation

@cascade812 (Contributor) commented Jul 16, 2025

Previously, compile_sizes had to be specified in CompilationConfig to enable sequence parallelism.
This PR removes that limitation for full CUDA graph compilation.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of these by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@cascade812 cascade812 changed the title enable sequence parallelism for full cuda graph without specifying compile sizes Enable sequence parallelism for full cuda graph without specifying compile sizes Jul 16, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables sequence parallelism for full CUDA graph compilation without specifying compile sizes. The type hint for splitting_ops was incorrect and has been corrected to list[str] to match the actual data type.
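
For reference, a minimal sketch of the corrected annotation; the field is shown in isolation with an assumed default, while vLLM's real CompilationConfig carries many more fields:

```python
from dataclasses import dataclass, field


@dataclass
class CompilationConfig:
    # Ops at which the graph is split for piecewise compilation. The values
    # are operator names (strings), hence list[str] rather than a bare list.
    splitting_ops: list[str] = field(default_factory=list)
```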

Signed-off-by: cascade812 <[email protected]>
@youkaichao (Member) left a comment

What if the number of tokens is not divisible by the TP size?

@youkaichao youkaichao requested a review from ProExpertProg July 16, 2025 09:04
@ProExpertProg (Collaborator) left a comment

I think it would be good to add some documentation for this behavior, and get some benchmarking results! And should we pad to multiples of tp size?

@cascade812 (Contributor, Author) commented

> What if the number of tokens is not divisible by the TP size?

When sequence parallelism is enabled, we always pad num_tokens to a multiple of tensor_parallel_size in gpu_model_runner.py.
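
As a minimal sketch of that padding rule (the helper name here is hypothetical; the real logic lives in gpu_model_runner.py and may differ in detail):

```python
def pad_to_multiple_of_tp(num_tokens: int, tp_size: int) -> int:
    """Round num_tokens up to the next multiple of tp_size so the hidden
    states can be split evenly across tensor-parallel ranks."""
    remainder = num_tokens % tp_size
    return num_tokens if remainder == 0 else num_tokens + (tp_size - remainder)


# e.g. 9 tokens with TP=4 pad to 12; 8 tokens are already divisible.
assert pad_to_multiple_of_tp(9, 4) == 12
assert pad_to_multiple_of_tp(8, 4) == 8
```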

Signed-off-by: cascade812 <[email protected]>
@cascade812 (Contributor, Author) commented

> I think it would be good to add some documentation for this behavior, and get some benchmarking results! And should we pad to multiples of tp size?

Added a detailed comment. And it already pads to multiples of the TP size for full graph.
As for benchmarks, sequence parallelism alone doesn't yield much performance improvement; it mainly lays the groundwork for subsequent fusion passes, such as GEMM + ReduceScatter and AllGather + GEMM fusions. Benchmarks for those fusions have been provided in the async TP PRs.
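
For context, a rough sketch of the pattern sequence parallelism sets up (illustrative functions only, not vLLM's actual pass API): the full-size all-reduce around the norm becomes a reduce-scatter before it and an all-gather after it, and those collectives sitting next to the GEMMs are what the later GEMM + ReduceScatter and AllGather + GEMM fusions absorb.

```python
import torch
import torch.distributed as dist


def block_tp_only(x, w_out, norm, residual):
    # Plain tensor parallelism: all-reduce the full activation, then run the
    # norm on all num_tokens rows on every rank.
    y = x @ w_out                       # partial result on each TP rank
    dist.all_reduce(y)                  # full [num_tokens, hidden] everywhere
    return norm(y + residual)


def block_with_sp(x, w_out, norm, residual_shard, tp_group):
    # Sequence parallelism: reduce-scatter shards the tokens, the norm runs on
    # num_tokens / tp_size rows, and an all-gather restores the full sequence.
    y = x @ w_out
    tp_size = dist.get_world_size(group=tp_group)
    shard = torch.empty(y.shape[0] // tp_size, y.shape[1],
                        dtype=y.dtype, device=y.device)
    dist.reduce_scatter_tensor(shard, y, group=tp_group)
    shard = norm(shard + residual_shard)
    out = torch.empty_like(y)
    dist.all_gather_into_tensor(out, shard, group=tp_group)
    return out
```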

@ProExpertProg (Collaborator) commented

Yes, but this improves the speedup of async TP because it now also works for shapes that weren't explicitly compiled. Could you do an end-to-end serving benchmark comparing async TP off (main), async TP on (main), and async TP on (this PR)?

@zou3519 (Collaborator) commented Jul 23, 2025

I'm deferring to @ProExpertProg and @youkaichao on this

@zou3519 zou3519 removed their request for review July 23, 2025 00:19
@cascade812 (Contributor, Author) commented Aug 10, 2025

@zou3519 I'm encountering an out-of-memory error for the KV cache when benchmarking the LLaMA 70B model on 4x H100 with full_cuda_graph=True and enable_sequence_parallelism=True on this PR.
Activation memory usage during profiling is very high: ~44 GB with the SP pass enabled vs. ~5 GB with it disabled.

I dumped a memory usage snapshot during profiling (see the screenshot below) and found the blow-up is due to buf290 not being released properly. This buffer should be freed after its only use at the line buf294 = torch.ops.vllm.all_gather.default(buf290, 0, 2, 'tp:0'); however, it isn't freed until the very end of execution.
This looks like a torch.compile bug. Do you have any insight into this issue?

```python
buf289 = torch.ops.vllm.reduce_scatter.default(buf288, 0, 2, 'tp:0')
del buf288
buf290 = buf289
assert_size_stride(buf290, (s0 // 2, 4096), (4096, 1))
del buf289
# Topologically Sorted Source Nodes: [], Original ATen: []
torch.ops._C.fused_add_rms_norm.default(input=buf290, residual=buf2, weight=arg77_1, epsilon=1e-05)
del arg77_1
# Topologically Sorted Source Nodes: [], Original ATen: []
buf294 = torch.ops.vllm.all_gather.default(buf290, 0, 2, 'tp:0')
...until the very end...
del buf290
```
[screenshot: memory usage snapshot]
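
For anyone reproducing this, a sketch of one way to capture a snapshot like the one above, using PyTorch's private (and version-dependent) memory-history API; run_profiling_step is a placeholder for whatever drives the compiled model:

```python
import torch

# Record allocator events so buffer lifetimes (e.g. buf290's) show up in the viewer.
torch.cuda.memory._record_memory_history(max_entries=100_000)
try:
    run_profiling_step()  # placeholder for the actual profiling / warmup run
finally:
    torch.cuda.memory._dump_snapshot("sp_profile_snapshot.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
# Load the .pickle file at https://pytorch.org/memory_viz to inspect it.
```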

@zou3519 (Collaborator) commented Aug 11, 2025

@cascade812 are you able to send me a tlparse of this? It generates an HTML page with all of the torch.compile logs that we can stare at (https://github.com/pytorch/tlparse?tab=readme-ov-file#tlparse-parse-structured-pt2-logs).

Otherwise I'll try to repro this on a machine.

The thing I'm curious about: are we sure that buf290 isn't used again, and what sits between buf294 = torch.ops.vllm.all_gather.default(buf290, 0, 2, 'tp:0') and the del buf290?

@zou3519 (Collaborator) commented Aug 13, 2025

Btw, @BoyuanFeng @eellison, any initial thoughts here? Inductor codegen doesn't seem to delete a buffer after its final use. I can get a tlparse later.

@BoyuanFeng (Contributor) commented

Is buf294 a slice of buf290? If buf294 (or some other view/slice) is used later, buf290 cannot be freed.

A tlparse or the generated output code (via TORCH_LOGS=output_code) would be helpful to investigate.

@cascade812 (Contributor, Author) commented

> Is buf294 a slice of buf290? If buf294 (or some other view/slice) is used later, buf290 cannot be freed.
>
> A tlparse or the generated output code (via TORCH_LOGS=output_code) would be helpful to investigate.

No, buf294 is not a slice of buf290. The all_gather operation allocates new memory for its output.

@BoyuanFeng (Contributor) commented

What is the command to repro? I can check the memory issue. Thanks!

@cascade812 (Contributor, Author) commented

> What is the command to repro? I can check the memory issue. Thanks!

@BoyuanFeng thanks! I just sent the repro steps and compilation results to you over Slack.

@ProExpertProg (Collaborator) commented

Btw, #23261 would help this pass as well.
