[compile] Enable sequence parallelism for full cuda graph without specifying compile sizes #26681
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request enables sequence parallelism for full CUDA graph compilation without requiring compile_sizes. This is achieved by updating the applicability checks in SequenceParallelismPass and AsyncTPPass. The changes look good and align with the goal of the PR.
My main feedback is regarding code duplication in the is_applicable method, which is now identical in both SequenceParallelismPass and AsyncTPPass. I've suggested refactoring this into a shared helper method to improve maintainability. I've also included a minor simplification for the condition check in that method.
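A minimal sketch of the shared-helper refactor suggested above, assuming hypothetical names (_SPPassBase, full_graph_capture); the real vLLM passes read this state from the compilation config, and the exact condition may differ.

```python
class _SPPassBase:
    """Illustrative shared base so both passes reuse one applicability check."""

    def __init__(self, full_graph_capture: bool):
        # Stand-in for "full CUDA graph / inductor graph partition is enabled".
        self._full_graph_capture = full_graph_capture

    def is_applicable(self, shape: int | None) -> bool:
        # Applicable for any concrete compiled shape, or when the whole graph
        # is captured at once so no per-shape specialization is required.
        return shape is not None or self._full_graph_capture


class SequenceParallelismPass(_SPPassBase):
    pass


class AsyncTPPass(_SPPassBase):
    pass
```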
Nice work! Can you post the benchmarking numbers in this PR as well?
def is_applicable_for_shape(self, shape: int | None) -> bool:
    # only do replace for specific shapes
def is_applicable(self, shape: int | None) -> bool:
    # This pass is applied on top of the sequence parallelism pass.
Fine for now but @cascade812 isn't this pass technically fine no matter the shape? Obviously it won't match anything if sequence parallelism didn't run, but still
Yes, removing this implementation would work as well, but it would trigger the matching logic for this pass, which could add some overhead.
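A hedged sketch of the trade-off described above (function and parameter names are hypothetical, not vLLM's pass-manager API): keeping a restrictive is_applicable lets the pass manager skip pattern matching entirely when the pass cannot fire, instead of scanning the graph for matches that will never be found.

```python
def run_pass(graph_pass, graph, shape: int | None):
    # Cheap guard: skip the pass (and all matcher overhead) when it cannot apply.
    if not graph_pass.is_applicable(shape):
        return graph
    # Pattern matching and rewriting only happen when the pass can match.
    return graph_pass.apply(graph)
```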
As a follow-up, it seems like we should track down the following performance issues:
[compile] Enable sequence parallelism for full cuda graph without specifying compile sizes (vllm-project#26681) Signed-off-by: angelayi <[email protected]> Signed-off-by: 1994 <[email protected]>
[compile] Enable sequence parallelism for full cuda graph without specifying compile sizes (vllm-project#26681) Signed-off-by: angelayi <[email protected]> Signed-off-by: Dhruvil Bhatt <[email protected]>
[compile] Enable sequence parallelism for full cuda graph without specifying compile sizes (vllm-project#26681) Signed-off-by: angelayi <[email protected]> Signed-off-by: bbartels <[email protected]>
Btw, this PR causes the PostGradPassManager object to hold onto the CompilationConfig. This is not good: Inductor configs need to be serializable and deepcopy-able, the PostGradPassManager is added to the Inductor config, and the CompilationConfig holds some nn.Modules. Internally, we saw deepcopy fail on the nn.Modules and OOM because of this.
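An illustrative toy example (class and attribute names are made up, not vLLM's) of why this pattern is risky: deepcopy duplicates every parameter tensor reachable from the config, so a config that references large nn.Modules can fail or OOM when Inductor copies it.

```python
import copy

import torch.nn as nn


class ToyCompilationConfig:
    def __init__(self, module: nn.Module):
        # Holding a live module inside a config is the problematic pattern.
        self.module_refs = {"layer": module}


cfg = ToyCompilationConfig(nn.Linear(4096, 4096))
# deepcopy also copies the 4096x4096 weight tensor (and anything else
# reachable), which a deepcopy-able Inductor config should never require.
cfg_copy = copy.deepcopy(cfg)
print(cfg_copy.module_refs["layer"].weight.shape)  # torch.Size([4096, 4096])
```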
Yeah, we used to only save
I'll make an issue to track this; @luccafong tracked this one down (and has a fix somewhere).
Fixed in #27041
Purpose
Addresses #25277
Porting over the changes from @cascade812 in #21031, which enabled SP without needing to specify compile_sizes in CompilationConfig, plus some special handling so that async TP/SP can be enabled with just use_inductor_graph_partition=True.
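A hedged usage sketch of what this enables; the config and pass-config names below mirror vLLM's CompilationConfig/PassConfig, but exact spellings and accepted types should be checked against the current code rather than taken from this example.

```python
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
    tensor_parallel_size=2,
    compilation_config=CompilationConfig(
        # SP/async TP now apply to the dynamic-shape graph used for full CUDA
        # graph capture, so no compile_sizes list is needed here.
        pass_config={
            "enable_sequence_parallelism": True,
            "enable_async_tp": True,
        },
        use_inductor_graph_partition=True,
    ),
)
```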
Test Result
Here are the updated numbers:
For meta-llama/Llama-3.1-70B-Instruct, using the updated command:
For RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8:
For nvidia/Llama-3.3-70B-Instruct-FP8: