
Commit 7f06046

Update on "Improve QAT nvfp4 numerics"
**Summary:** Similar to #2986, this commit improves the prepare vs convert SQNR of NVFP4 QAT from 12 to inf. This is achieved by refactoring NVFP4 QAT to mimic the PTQ numerics exactly, using a new linear class to incorporate both the quantization and mm logic. **Unit tests:** ``` python test/quantization/test_qat.py -k test_qat_nvfp4 python test/quantization/test_qat.py -k test_quantize_api_nvfp4 ``` **End-to-end tests:** Fine-tuning Llama3.2-3B with and without this PR in axolotl: - fine-tune for 1 epoch on yahma/alpaca-cleaned - batch size 512, learning rate 2e-5, no gradient accumulation Wikitext: - With this PR, QAT nvfp4 quantized model achieved 15% lower perplexity than the quantized baseline - Without this PR, QAT nvfp4 quantized model was about the same as the quantized baseline ``` ==> Llama3.2-3B_baseline_bs512/eval_float.log <== | | |none | 0|word_perplexity|↓ |9.418|± | N/A| ==> Llama3.2-3B_baseline_bs512/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |10.3681|± | N/A| # QAT with this PR (quantized) ==> Llama3.2-3B_qat_bs512/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |10.2281|± | N/A| ``` [ghstack-poisoned]
2 parents 90d6af0 + ef3682b commit 7f06046
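For context, the "prepare vs convert SQNR" metric above compares the outputs of the fake-quantized (prepare) model against the truly quantized (convert) model; an SQNR of inf means the two match bit-exactly. A minimal sketch of how such an SQNR is typically computed (`sqnr_db` is an illustrative helper written for this note, not necessarily the exact utility torchao uses):

```python
import torch

def sqnr_db(reference: torch.Tensor, candidate: torch.Tensor) -> float:
    """Signal-to-quantization-noise ratio in dB; inf when the tensors match exactly."""
    noise_power = (reference - candidate).pow(2).sum()
    if noise_power == 0:
        return float("inf")
    signal_power = reference.pow(2).sum()
    return (10 * torch.log10(signal_power / noise_power)).item()

x = torch.randn(1024)
print(sqnr_db(x, x))             # exact numerical match -> inf
print(sqnr_db(x, x + 0.01 * x))  # 1% relative error -> ~40 dB
```

Under this metric, "12 to inf" means the prepare-path outputs went from a roughly 12 dB approximation of the convert path to matching it exactly.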

File tree

1 file changed: +7 −4 lines

torchao/prototype/qat/nvfp4.py

Lines changed: 7 additions & 4 deletions
```diff
@@ -31,9 +31,10 @@ class NVFP4FakeQuantizeConfig(FakeQuantizeConfigBase):
     use_triton_kernel: bool = False
 
 
-class _NVFP4FakeQuantizedLinearForward(torch.autograd.Function):
+class _NVFP4QuantizedForwardFakeQuantizedBackward(torch.autograd.Function):
     """
-    Autograd function for NVFP4 fake quantization + addmm.
+    Autograd function for NVFP4 quantization + addmm in low precision during forward,
+    and fake quantization in high precision during backward.
     """
 
     @staticmethod
@@ -100,7 +101,9 @@ class NVFP4FakeQuantizedLinear(torch.nn.Linear):
     """
     Linear module for fake quantized NVFP4 weights and/or activations.
 
-    The forward pass follows quantization and addmm numerics in `NVFP4Tensor` exactly.
+    The forward pass follows quantization and addmm numerics in `NVFP4Tensor`
+    in lower precision exactly, while the backward pass uses dequantized
+    (fake quantized) values in high precision.
 
     Example usage::
@@ -146,7 +149,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:
             x = x.view(-1, x.shape[-1])
         else:
             batch_size = None
-        fq = _NVFP4FakeQuantizedLinearForward.apply(
+        fq = _NVFP4QuantizedForwardFakeQuantizedBackward.apply(
            x, self.weight, self.bias, self.activation_config, self.weight_config
         )
         assert fq.dtype == x.dtype
```
