
Conversation

@shiyang-weng
Contributor

Fixes #2896

The goal is to enable FP8 quantization in PyTorch. As with INT8 quantization, this requires inserting quantize and dequantize operations into the computational graph. To reuse the INT8 pattern-matching logic, we need to register FP8 quant and dequant ops.

We previously attempted to register them in #2379, but that PR was reverted in #2672 because it caused a performance regression on H100 GPUs, and there is no need to register q/dq on CUDA.

For these reasons, this PR registers the quant/dequant ops specifically for CPU.
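
As a rough illustration of the idea (not this PR's actual code: the `myao::` namespace and the eager implementations below are assumptions), registering the quant/dequant pair as library ops keeps each call as a single node in the exported graph, which is what the CPU pattern matcher needs:

```python
import torch
from torch.library import custom_op


@custom_op("myao::quantize_fp8", mutates_args=())
def quantize_fp8(t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Eager implementation: scale, clamp to the float8 range, then cast.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    return (t.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)


@quantize_fp8.register_fake
def _(t, scale):
    # Shape/dtype propagation so torch.compile / torch.export can trace the op
    # without running the real kernel.
    return torch.empty_like(t, dtype=torch.float8_e4m3fn)


@custom_op("myao::dequantize_fp8", mutates_args=())
def dequantize_fp8(t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Eager implementation: upcast and rescale back to float32.
    return t.to(torch.float32) * scale


@dequantize_fp8.register_fake
def _(t, scale):
    return torch.empty_like(t, dtype=torch.float32)
```

With a registration like this, traced graphs contain `torch.ops.myao.quantize_fp8` / `torch.ops.myao.dequantize_fp8` call nodes rather than the underlying div/clamp/convert elementwise ops, so INT8-style "dq -> op -> q" patterns can be matched against them.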

@pytorch-bot

pytorch-bot bot commented Sep 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2961

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 85614a4 with merge base 18dbe87:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Sep 9, 2025
@shiyang-weng marked this pull request as draft on September 9, 2025 03:18
@Xia-Weiwen added the "topic: not user facing" label on Sep 9, 2025
@shiyang-weng marked this pull request as ready for review on September 10, 2025 01:28
@jerryzh168 requested a review from vkuzo on September 11, 2025 00:10
@jerryzh168
Contributor

Seems OK to me. Wondering if @vkuzo has additional thoughts; not sure if there is a better alternative here for preserving these ops for CPU.

@shiyang-weng
Contributor Author

@vkuzo Could you help review this PR?

@jerryzh168
Contributor

> @vkuzo Could you help review this PR?

Is this urgent? Vasiliy is not available right now and will be back next week.

@shiyang-weng
Contributor Author

> Is this urgent? Vasiliy is not available right now and will be back next week.

Thanks for letting me know. It's not urgent; we can wait until he is back next week.

@shiyang-weng
Contributor Author

@vkuzo Could you help review this PR?

@vkuzo
Contributor

We should keep the CUDA and CPU logic consistent; device is supposed to be orthogonal to quantization workflows.

I'd recommend a flag named after something like AOTI or pt2e (cc @jerryzh168 for the right name) to control whether the quant/dequant ops get decomposed or not.
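
For illustration only, one shape such a flag could take (the flag and names below are hypothetical, not an existing torchao or PyTorch setting; the thread below ends up preferring separate ops instead):

```python
import torch

# Hypothetical global switch controlling whether FP8 q/dq is decomposed.
FP8_QDQ_DECOMPOSED = True


def quantize_fp8_dispatch(t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    if FP8_QDQ_DECOMPOSED:
        # Plain tensor math: torch.compile traces through it, leaving only
        # elementwise nodes in the graph (the behavior CUDA wants).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        return (t.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    # Dispatch to a registered library op (see the sketch above) that stays
    # as a single node in the graph (the behavior CPU pattern matching wants).
    return torch.ops.myao.quantize_fp8(t, scale)
```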

@Xia-Weiwen
Collaborator

Hi @vkuzo, thanks for your suggestion. What kind of flag do you have in mind: a global flag, or an argument passed to the quantization APIs?

@Xia-Weiwen
Collaborator

> Hi @vkuzo, thanks for your suggestion. What kind of flag do you have in mind: a global flag, or an argument passed to the quantization APIs?

Hi @vkuzo @jerryzh168, could you share a little more about the design? Thanks.

@jerryzh168
Contributor

jerryzh168 commented Sep 17, 2025

I think the decision to decompose or not should be static. If we want consistent behavior for the same op across CUDA and CPU, it might be better to have separate ops.

@Xia-Weiwen
Collaborator

Xia-Weiwen commented Sep 17, 2025

Hi @jerryzh168 Do you think it would be better to have a non-decomposed and a decomposed version of the op rather than a CPU and a CUDA version? We did a similar thing here: https://github.com/pytorch/pytorch/blob/df4ebddbe0fa2306fb8acd09b20265046d968c10/torch/ao/quantization/fx/_decomposed.py#L1206
also @vkuzo

@jerryzh168
Contributor

Yeah, a separate op seems to be the only alternative here.

@shiyang-weng
Contributor Author

Created separate quantize_affine_float8_non_decomposed and dequantize_affine_float8_non_decomposed ops for the non-decomposed path.
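
For clarity, a rough sketch of the distinction (only the two `_non_decomposed` names come from this PR; the rest is illustrative):

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max


def quantize_fp8_decomposed_style(t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # A plain Python helper: torch.compile/export traces *through* it, so the
    # graph only contains the elementwise div/clamp/convert nodes (the
    # "decomposed" form) and there is no single q/dq node left to match.
    return (t.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)


# quantize_affine_float8_non_decomposed / dequantize_affine_float8_non_decomposed
# are instead registered as library ops, so each call remains one node in the
# traced graph and CPU Inductor passes can match dq -> linear -> q patterns
# around it, mirroring the existing INT8 flow.
```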

@jerryzh168
Contributor

LGTM. cc @vkuzo, can you take another look?

@jerryzh168
Contributor

Looks good to me.

@Xia-Weiwen changed the title from "[Float8] register fp8 quant/dequant only on CPU" to "[Float8] add non-decomposed version of quantize/dequantize ops for fp8" on Sep 21, 2025
@Xia-Weiwen merged commit 8525185 into pytorch:main on Sep 21, 2025
25 of 26 checks passed

Labels

CLA Signed, topic: not user facing

Development

Successfully merging this pull request may close these issues.

[CPU][FP8][Inductor] How to support fp8 quant for inductor on CPU
