-
-
Notifications
You must be signed in to change notification settings - Fork 11.7k
[BugFix] Potential Fix for FA3 full-cudagraph IMA #25490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix a potential Invalid Memory Access in FlashAttention 3 with full CUDA graphs by ensuring the max_num_splits parameter is consistent. The change refactors the logic for setting max_num_splits to a common location. However, the current implementation introduces a critical flaw: it can lead to an UnboundLocalError because max_num_splits is not defined in all code paths. My review provides a fix for this issue to ensure the variable is always initialized. Addressing this will also help achieve the PR's goal of making the parameter consistent.
Signed-off-by: Lucas Wilkinson <[email protected]> fix Signed-off-by: Lucas Wilkinson <[email protected]> fix Signed-off-by: Lucas Wilkinson <[email protected]> comment Signed-off-by: Lucas Wilkinson <[email protected]>
58df19e to
e1c19ca
Compare
WoosukKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix!
|
@LucasWilkinson Can you please check the CI again? |
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: yewentao256 <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
@WoosukKwon reported an IMA with FA3 full-CG that was fixed by doing https://github.com/vllm-project/vllm/compare/woosuk/fa3-ima?expand=1
the theory here is that
get_scheduler_metadatawas being called with a differentmax_num_splitsthan what was being passed toFlashAttentionMetadatathis is an alternative solution that doesn't lose the logic to use
max_num_splits=0(i.e. use the heuristic) for batches larger thenmax_cudagraph_sizewe do not currently have a repo so cannot confirm this resolves @WoosukKwon 's IMA but this should be resolved regardless; we should alway make sure the arguments to
get_scheduler_metadataandFlashAttentionMetadataare inline