[FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel #21242
Conversation
Signed-off-by: tjtanaa <[email protected]>
Code Review
This pull request introduces a HIP block quantization kernel from AITER for ROCm to improve performance. The changes are mainly in `fp8_utils.py` to dispatch to this new kernel. My review identifies a critical issue with the implementation's robustness: the new code for the AITER kernel is added at the module level with a conditional import that can lead to runtime crashes (`NameError` or `ImportError`) if misconfigured or if the optional `aiter` dependency is missing. I've recommended a more robust approach to handle the optional dependency and variable definitions.
```python
if (envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_LINEAR
        and current_platform.supports_fp8()):

    import aiter as rocm_aiter
    from aiter import get_hip_quant

    aiter_per1x128_quant = get_hip_quant(rocm_aiter.QuantType.per_1x128)
```
This logic for conditionally defining `aiter_per1x128_quant` and `rocm_aiter` at the module level is fragile and can lead to runtime errors.

- **`NameError` risk:** If the condition on line 85 is false, `aiter_per1x128_quant` and `rocm_aiter` are never defined. However, the function `apply_w8a8_block_fp8_linear` uses these variables, guarded by the `use_aiter_and_is_supported` flag. If the logic that sets this flag at the call site diverges even slightly from the condition here, it will result in a `NameError` at runtime. It's much safer to unconditionally define these variables (e.g., to `None`) before this `if` block.
- **Unguarded import:** The `import aiter` on line 88 is not wrapped in a `try...except ImportError`. If `VLLM_ROCM_USE_AITER` is enabled but `aiter` is not installed, the application will crash on import. This should be handled gracefully to prevent the entire service from failing.

A robust implementation would define `aiter_per1x128_quant` and `rocm_aiter` unconditionally (e.g., as `None`) and use a `try...except` block to handle the optional import, updating the variables upon success. This would prevent both `NameError` and `ImportError` crashes.
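A minimal sketch of the suggested pattern. The module name `aiter` is real, but the boolean flags below are stand-ins for vLLM's `envs` variables and platform check, so this is illustrative rather than the PR's actual code:

```python
import importlib

# Define both names unconditionally so call sites can never hit a NameError.
rocm_aiter = None
aiter_per1x128_quant = None

# Stand-ins for envs.VLLM_ROCM_USE_AITER, envs.VLLM_ROCM_USE_AITER_LINEAR,
# and current_platform.supports_fp8() (assumptions for this sketch).
VLLM_ROCM_USE_AITER = True
VLLM_ROCM_USE_AITER_LINEAR = True
platform_supports_fp8 = True

if VLLM_ROCM_USE_AITER and VLLM_ROCM_USE_AITER_LINEAR and platform_supports_fp8:
    try:
        # Guarded import: a missing optional dependency degrades gracefully
        # instead of crashing the whole service at import time.
        rocm_aiter = importlib.import_module("aiter")
        aiter_per1x128_quant = rocm_aiter.get_hip_quant(
            rocm_aiter.QuantType.per_1x128)
    except ImportError:
        # aiter is optional; leave both names as None and fall back later.
        rocm_aiter = None
        aiter_per1x128_quant = None
```

With this shape, the hot path can simply test `aiter_per1x128_quant is not None`, which stays correct even if the call-site flag logic drifts from the condition here.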
```python
if use_aiter_and_is_supported:
    q_input, x_scale = aiter_per1x128_quant(
        input_2d.contiguous(), quant_dtype=rocm_aiter.dtypes.fp8)
```
The call to `aiter_per1x128_quant` is not guarded by a try-except block, which can lead to a crash if the `aiter` module fails to load. Wrap this call in a try-except block to handle potential import errors gracefully:

```python
try:
    q_input, x_scale = aiter_per1x128_quant(
        input_2d.contiguous(), quant_dtype=rocm_aiter.dtypes.fp8)
except NameError as e:
    raise ImportError("AITER is not properly installed or configured.") from e
```
```python
if (envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_LINEAR
        and current_platform.supports_fp8()):
```
Should this also be calling whatever sets `use_aiter_and_is_supported`?
How would you feel about moving this import into the block below that does the call?
@mgoin @SageMoore Thank you so much for the feedback. I think I will try to start implementing #21504 here so that we can reuse the conditions. The problem is that `use_aiter_and_is_supported` cannot be used to guard the import. I also think moving the import into the block below is not ideal, because it would invoke the kernel-lookup function `get_hip_quant` on every call.
I will let you know again after I fix the conditions.
Updates: I will implement #21504 after this PR.
@mgoin Thank you. Following the `use_aiter_and_is_supported` check in `fp8.py`, I have updated it to:
```diff
 if (envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_LINEAR
-    and current_platform.supports_fp8()):
+    and current_platform.is_fp8_fnuz()):
```
@SageMoore I would prefer to keep the current implementation, since avoiding repeated dictionary lookups is important for performance.
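The trade-off above can be sketched as follows. All names here are illustrative stand-ins (a toy registry in place of AITER's internal `QuantType` dispatch), showing why resolving the kernel once at import time beats a per-call lookup on the hot path:

```python
# Toy kernel registry standing in for AITER's QuantType -> kernel mapping
# (hypothetical; not AITER's actual internals).
_KERNEL_REGISTRY = {"per_1x128": lambda x: ("quantized", x)}

def get_hip_quant(quant_type):
    # Stand-in for aiter.get_hip_quant: a dictionary lookup on each call.
    return _KERNEL_REGISTRY[quant_type]

# Resolved once at module import time (the approach kept in this PR).
aiter_per1x128_quant = get_hip_quant("per_1x128")

def forward(x):
    # Hot path: a direct call, with no per-call registry lookup.
    return aiter_per1x128_quant(x)
```

Moving `get_hip_quant(...)` inside `forward` would add a lookup to every forward pass, which is the overhead the comment is trying to avoid.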
Signed-off-by: tjtanaa <[email protected]>
Okay, thanks for the update. I think this is okay for now. I would like these conditions to be less local and more centralized, as in having one function to check for this block-linear case, but we can refactor that later.
Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model.

Purpose
Introduce AITER HIP Block Scale Quantization kernel.
Verified on AITER commit `916bf3c`.
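For context, a reference sketch of what per-1x128 block-scale quantization computes: each row of the activation is split into blocks of 128 columns, and each block gets its own scale so that its max magnitude maps onto the fp8 e4m3 range. This is a plain NumPy illustration of the scheme, not AITER's kernel:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude of fp8 e4m3

def per_1x128_quant(x: np.ndarray, block: int = 128):
    """Reference per-1x128 block quantization: each row is split into
    blocks of `block` columns, and every block gets its own scale so
    that the block's max magnitude maps to the fp8 range."""
    m, k = x.shape
    assert k % block == 0
    blocks = x.reshape(m, k // block, block)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = blocks / scales  # values now lie in [-448, 448]; hardware casts to fp8
    return q.reshape(m, k), scales.squeeze(-1)

x = np.random.randn(4, 256).astype(np.float32)
q, s = per_1x128_quant(x)
# Dequantizing recovers the input up to fp8 rounding (exact here, no cast).
recon = (q.reshape(4, 2, 128) * s[..., None]).reshape(4, 256)
```

The real kernel fuses the scale computation and the fp8 cast on the GPU; this sketch only shows the block layout and scaling math.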
Test Plan
Test Result
Accuracy
vllm (pretrained=deepseek-ai/DeepSeek-R1,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
Perf Gain
(Optional) Documentation Update