Enable sequence parallelism for full cuda graph without specifying compile sizes #21031
base: main
Conversation
…mpile sizes Signed-off-by: cascade812 <[email protected]>
Code Review
This pull request enables sequence parallelism for full cuda graph compilation without specifying compile sizes. The type hint for splitting_ops is incorrect and has been corrected to list[str] to match the actual data type.
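For illustration only, here is a minimal sketch of the annotation the review refers to; the class below is a hypothetical excerpt, not vLLM's actual CompilationConfig definition:

```python
from dataclasses import dataclass, field


@dataclass
class CompilationConfig:
    """Hypothetical excerpt for illustration; not the real vLLM class."""

    # splitting_ops holds operator names (strings) used to split the graph
    # for piecewise cuda graph capture, so the annotation is list[str].
    splitting_ops: list[str] = field(default_factory=list)
    # Sizes to compile eagerly; with this PR they are no longer required
    # for sequence parallelism under full cuda graph compilation.
    compile_sizes: list[int] = field(default_factory=list)
    full_cuda_graph: bool = False
```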
Signed-off-by: cascade812 <[email protected]>
What if the number of tokens is not divisible by the TP size?
I think it would be good to add some documentation for this behavior, and get some benchmarking results! And should we pad to multiples of tp size?
When sequence parallelism is enabled, we always pad num_tokens to be a multiple of tensor_parallel_size in …
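As a concrete sketch of the padding behavior described above (the helper name is hypothetical and this is not the actual vLLM code):

```python
def pad_for_sequence_parallelism(num_tokens: int, tensor_parallel_size: int) -> int:
    """Round num_tokens up to the next multiple of tensor_parallel_size.

    Hypothetical helper illustrating the behavior described above: the token
    count fed to the compiled graph is padded so it divides evenly across
    TP ranks when sequence parallelism is enabled.
    """
    remainder = num_tokens % tensor_parallel_size
    if remainder == 0:
        return num_tokens
    return num_tokens + (tensor_parallel_size - remainder)


# e.g. 1023 tokens with TP=4 are padded to 1024, so each rank handles 256
assert pad_for_sequence_parallelism(1023, 4) == 1024
assert pad_for_sequence_parallelism(1024, 4) == 1024
```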
Signed-off-by: cascade812 <[email protected]>
Added a detailed comment. And it already pads to multiples of the TP size for full cuda graph.
Yes, but this improves the speedup of async TP because it now also works for shapes that weren't explicitly compiled. Could you do an end-to-end serving benchmark comparing async TP off (main), async TP on (main), and async TP on (this PR)?
I'm deferring to @ProExpertProg and @youkaichao on this.
@zou3519 I'm encountering an out-of-memory error for the KV cache when benchmarking the LLaMA 70B model on H100x4 with full_cuda_graph=True and enable_sequence_parallelism=True for this PR. I dumped a memory usage snapshot for profiling (see the screenshot below) and found it's due to improper memory release of an intermediate buffer (buf290 in the generated code).
[screenshot: memory usage snapshot from the profiler]
@cascade812 are you able to send me a tlparse of this? It will generate an HTML page with all of the torch.compile logs that we can stare at (https://github.com/pytorch/tlparse?tab=readme-ov-file#tlparse-parse-structured-pt2-logs). Otherwise I'll try to repro this on a machine. The thing I am curious about is: are we sure that buf290 isn't used again, and what is between …
Btw, @BoyuanFeng @eellison any initial thoughts here? Inductor codegen seems to not be deleting a buffer after its final use. I can get a tlparse later.
Is buf294 a slice of buf290? If buf294 (or some other view/slice) is used later, buf290 cannot be freed. A tlparse or the generated output code (via TORCH_LOGS=output_code) would be helpful to investigate.
No, buf294 is not a slice of buf290. The all_gather operation allocates new memory for the output.
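To make the aliasing point concrete, here is a small standalone PyTorch sketch (not the Inductor-generated code; buf290 and buf294 are only stand-ins):

```python
import torch

buf = torch.randn(1024, 1024)  # stand-in for buf290

# A view/slice shares storage with its base, so keeping it alive would
# prevent the base allocation from being freed.
view = buf[:256]
assert view.untyped_storage().data_ptr() == buf.untyped_storage().data_ptr()

# An op that materializes a fresh output (as all_gather does for its result)
# owns separate storage and does not pin the input buffer.
fresh = torch.cat([view, view])  # stand-in for buf294
assert fresh.untyped_storage().data_ptr() != buf.untyped_storage().data_ptr()
```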
What is the command to repro? I can check the memory issue. Thanks!
@BoyuanFeng thanks! I just sent the repro steps and compilation result to you over Slack.
Btw #23261 would help this pass as well |
Signed-off-by: cascade812 <[email protected]>
Previously, compile_sizes had to be specified in CompilationConfig to enable sequence parallelism. This PR removes this limitation for full cuda graph compilation.
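For reference, a minimal usage sketch of what this enables. The model name is a placeholder and the flag names are taken from the discussion above, so treat this as illustrative rather than the PR's canonical example:

```python
from vllm import LLM

# Illustrative only: compile_sizes is intentionally not set; with this PR,
# sequence parallelism is expected to work for full cuda graph compilation
# without it.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,
    compilation_config={
        "full_cuda_graph": True,
        "pass_config": {
            "enable_sequence_parallelism": True,
            "enable_async_tp": True,
        },
    },
)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```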