Commit e16506d
Update on "Improve QAT nvfp4 numerics"
**Summary:** Similar to #2986,
this commit improves the prepare vs convert SQNR of NVFP4 QAT
from 12 to 36 with `use_per_tensor_scale`, and from 12 to inf without.
This is achieved by mimicking the PTQ flow more closely,
in particular through the following changes, in descending order
of significance (a combined sketch follows the list):
1. Simulate `f4_unpacked_to_f32` and `f32_to_f4_unpacked`,
using `torch.int32` instead of `torch.uint8`
2. Do not cast intermediate fake quantized values to the original
dtype (e.g. bf16), which loses some fidelity relative to fp32
3. Fake round blockwise scales to float8
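
A minimal sketch combining all three changes, assuming NVFP4's 16-element blocks with float8 e4m3 scales and an input whose numel is divisible by the block size. `fake_quantize_nvfp4` is a hypothetical name, and the lookup-table snap to e2m1 values stands in for the bit-level `f32_to_f4_unpacked`/`f4_unpacked_to_f32` round-trip (which manipulates bit patterns, here in `torch.int32` rather than `torch.uint8`); this is not the actual torchao implementation:

```python
import torch

# The 8 non-negative values representable in float4 e2m1; negation gives the rest.
_F4_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_nvfp4(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    orig_shape = x.shape
    # (2) Work in fp32 throughout; do not round-trip through e.g. bf16.
    blocks = x.to(torch.float32).reshape(-1, block_size)
    # (3) Compute blockwise scales and fake-round them to float8 e4m3,
    # as the PTQ convert path does.
    scale = blocks.abs().amax(dim=-1, keepdim=True) / 6.0  # 6.0 = max e2m1 magnitude
    scale = scale.to(torch.float8_e4m3fn).to(torch.float32)
    scaled = blocks / scale.clamp(min=1e-12)
    # (1) Snap each value to the nearest representable e2m1 value, emulating
    # the f32 -> f4 -> f32 round-trip (ties here break by table order rather
    # than round-to-nearest-even, which is close enough for a sketch).
    candidates = _F4_E2M1_VALUES.to(scaled.device)
    idx = (scaled.abs().unsqueeze(-1) - candidates).abs().argmin(dim=-1)
    snapped = candidates[idx] * scaled.sign()
    # (2) Return fp32 rather than casting back to the original dtype.
    return (snapped * scale).reshape(orig_shape)
```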
**Test Plan:**
```
python test/quantization/test_qat.py -k test_qat_nvfp4
python test/quantization/test_qat.py -k test_quantize_api_nvfp4
```
End-to-end tests TBD.
[ghstack-poisoned]