[Float8] add non-decomposed version of quantize/dequantize ops for fp8 #2961
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2961
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 85614a4 with merge base 18dbe87.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
jerryzh168
left a comment
Seems OK to me; wondering if @vkuzo has additional thoughts. Not sure if there is a better alternative here to support preserving the ops for CPU.
@vkuzo Could you help review this PR?
Is this urgent? Vasiliy is not available at the moment and will be back next week.
Thanks for letting me know. Not urgent; we can wait until he is back next week.
@vkuzo Could you help review this PR?
vkuzo
left a comment
We should keep CUDA and CPU logic consistent; device is supposed to be orthogonal to quantization workflows.
I'd recommend a flag named around something like AOTI or pt2e (cc @jerryzh168 for the right name) to control whether the quant/dequant ops get decomposed or not.
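For illustration, a flag-based design along these lines might look like the minimal sketch below. The flag name, its location, and the function body are hypothetical, not existing torchao API:

```python
import torch

# Hypothetical global switch (name and location are assumptions); an
# export/PT2E entry point, not end users, would be expected to set it.
FP8_USE_NON_DECOMPOSED_QDQ = False

def quantize_affine_float8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Illustrative routing between the two implementations.
    if FP8_USE_NON_DECOMPOSED_QDQ:
        # Would dispatch to a single registered op that torch.export
        # preserves as one graph node (see the registration sketch below).
        raise NotImplementedError("non-decomposed path sketched further down")
    # Decomposed path: plain aten math that compilers trace through and fuse.
    finfo = torch.finfo(torch.float8_e4m3fn)
    return (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
```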
Hi @vkuzo Thanks for your suggestion. May I know what kind of flag you were talking about? A global flag, or an argument passed to quantization APIs?
Hi @vkuzo @jerryzh168 Could you share a little more about the design? Thanks.
I think the decision to decompose or not should be static. If we want consistent behavior for the same op across CUDA and CPU, it might be better to have separate ops, I feel.
Hi @jerryzh168 Do you think it would be better to have a non-decomposed and a decomposed version of the op rather than a CPU and a CUDA version? We did a similar thing here: https://github.com/pytorch/pytorch/blob/df4ebddbe0fa2306fb8acd09b20265046d968c10/torch/ao/quantization/fx/_decomposed.py#L1206
Yeah, just a different op seems to be the only alternative here.
Created quantize_affine_float8_non_decomposed and dequantize_affine_float8_non_decomposed separately for the non-decomposed path.
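As a rough sketch of how such ops can be registered so that they survive export as single nodes, using torch.library.custom_op. The `mylib` namespace, the simplified signatures, and the float8_e4m3fn dtype choice are illustrative assumptions, not the exact code in this PR:

```python
import torch
from torch.library import custom_op

# Sketch only: the real PR registers these in torchao's own op namespace;
# "mylib" and the simplified signatures here are illustrative.
@custom_op("mylib::quantize_affine_float8_non_decomposed", mutates_args=())
def quantize_affine_float8_non_decomposed(
    x: torch.Tensor, scale: torch.Tensor
) -> torch.Tensor:
    # Same math as the decomposed path, wrapped in one opaque op.
    finfo = torch.finfo(torch.float8_e4m3fn)
    return (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)

@quantize_affine_float8_non_decomposed.register_fake
def _(x, scale):
    # Shape/dtype propagation so torch.export can trace without real kernels.
    return torch.empty_like(x, dtype=torch.float8_e4m3fn)

@custom_op("mylib::dequantize_affine_float8_non_decomposed", mutates_args=())
def dequantize_affine_float8_non_decomposed(
    x: torch.Tensor, scale: torch.Tensor
) -> torch.Tensor:
    return x.to(torch.float32) * scale

@dequantize_affine_float8_non_decomposed.register_fake
def _(x, scale):
    return torch.empty_like(x, dtype=torch.float32)
```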
LGTM. cc @vkuzo, can you take a look again?
jerryzh168
left a comment
looks good to me
Fixes #2896
What we want to do is enable FP8 quantization in PyTorch. Similar to INT8 quantization, this requires inserting quantize and dequantize operations into the computational graph. In order to reuse the INT8 pattern-matching logic, we need to register FP8 quant and dequant ops.
To address this, we attempted to register the ops in #2379, but that PR was reverted in #2672 because it caused a performance regression on H100 GPUs. Besides, there is no need to register q/dq ops on CUDA.
For these reasons, I register the quant/dequant ops specifically for CPU.
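Continuing the hypothetical `mylib` registration sketched earlier in the thread, the effect can be seen with torch.export: each non-decomposed op stays a single node in the graph, which is what an INT8-style pattern matcher needs to anchor on.

```python
import torch

class QDQ(torch.nn.Module):
    def forward(self, x, scale):
        # Each call remains one node in the exported graph instead of
        # being traced into div/clamp/to primitives.
        q = torch.ops.mylib.quantize_affine_float8_non_decomposed(x, scale)
        return torch.ops.mylib.dequantize_affine_float8_non_decomposed(q, scale)

ep = torch.export.export(QDQ(), (torch.randn(8, 8), torch.tensor(0.05)))
print(ep.graph)  # the q/dq ops show up as single call_function nodes
```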