Introduce new W8A8-FP-CSR quantization API #3258
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3258
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 9 New Failures: as of commit f5f7a17 with merge base 3577306, the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@jcaip could you please check this PR?
cc @namgyu-youn Can you split this into two PRs? One for int8 and one for float8? In general I don't think we want to introduce weight-only sparsity configs for int8 and float8, because we don't have mixed-dtype kernel support currently. The only kernels we have are for int8 x int8 2:4 sparse and fp8 x fp8 2:4 sparse. I would like Int8SemiSparseTensor though, but I think it should live in prototype until we have a user for it. Also cc @bbeckca, who has been working on the fp8 x fp8 2:4 sparse tensor subclass migration in #3182.
@jcaip if we want to move int8 2:4 sparse to prototype, then we don't need to migrate the tensor, I think.
Okay, then I'll address only
cc @namgyu-youn I talked to @bbeckca and I think your PR is closer, so let's use it instead.
cc @jcaip to request review, thanks.
cc @namgyu-youn
I think there's a bit of confusion on what the tensor subclass should be storing and how to do the op overload.
Please take a look at https://github.com/pytorch/ao/pull/3182/files#diff-afc7dd21d2b704181a6fd55be989426c0217a2bbfb694af9eb9746239ec462ed for the appropriate logic / ops to be called.
class Float8SemiSparseTensor(TorchAOBaseTensor):
    """
    W8A8-FP-CSR: float8 quantized tensor with 2:4 semi-structured sparsity layout
nit: the comment looks wrong; CSR is compressed sparse row, and that's not the sparse format used here (2:4 sparsity).
    float8_dtype: float8 dtype variant
    """
    tensor_data_names = ["qdata", "qdata_compressed", "scale"]
I think quantized_sparse_data and quantized_sparse_metadata would be better variable names here: quantized_sparse_data holds the specified values and quantized_sparse_metadata holds the sparsity metadata.
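For illustration, a minimal sketch of that storage layout (the attribute names follow the suggestion above; the class body and tensor_attribute_names are assumptions, not the PR's actual code):

```python
import torch
from torchao.utils import TorchAOBaseTensor


class Float8SemiSparseTensorSketch(TorchAOBaseTensor):
    """Sketch: store only the 2:4 specified values, the sparsity metadata, and the scale."""

    # no dense copy of the weight is kept; dequantization reconstructs it on demand
    tensor_data_names = ["quantized_sparse_data", "quantized_sparse_metadata", "scale"]
    tensor_attribute_names = ["float8_dtype"]
```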
    )

    @property
    def qdata_fp8(self):
why do we need this?
    w_sparse.view(-1, 4).scatter_(1, pruning_inds, value=0)

    # Check for all-zero (sparsity=1) tensor
    if w_sparse.abs().max() == 0:
I think this should be supported actually? I don't see why we should error here.
    with torch.no_grad():
        w_sparse = w.clone()

    pruning_inds = w_sparse.abs().view(-1, 4).argsort(dim=1)[:, :2]
you can use this util (line 101 in 315e9b4):
    def mask_creator(
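For reference, a self-contained sketch of the 2:4 masking idea (this is only an illustration; mask_creator's actual signature is in the file linked above):

```python
import torch


def make_2to4_mask(w: torch.Tensor) -> torch.Tensor:
    """Sketch: keep the 2 largest-magnitude values in every group of 4 along the last dim."""
    groups = w.abs().view(-1, 4)
    prune_idx = groups.argsort(dim=1)[:, :2]         # 2 smallest-magnitude entries per group
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(1, prune_idx, value=False)
    return mask.view_as(w)


w = torch.randn(128, 128)
w_sparse = w * make_2to4_mask(w)  # 2:4 semi-structured sparse weight
```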
    # Store fp8 data in both dense and compressed formats
    fp8_data_fp16 = fp8_data.to(torch.float16)

    fp8_compressed = to_sparse_semi_structured(fp8_data_fp16)
We should use the torchao cutlass packing kernels here, not the default torch ones:
    sparse, meta = to_sparse_semi_structured_cutlass_sm9x_f8(dense)
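Roughly, that swap might look like the following (the call shape follows the line quoted above; the import path and surrounding variable names are assumptions):

```python
import torch
from torchao.ops import to_sparse_semi_structured_cutlass_sm9x_f8  # assumed import path


# Pack the 2:4-pruned fp8 weight with the torchao CUTLASS SM9x kernel instead of
# torch's to_sparse_semi_structured; it returns the specified values and the
# sparsity metadata, which is what the subclass should store.
def pack_fp8_weight(w_pruned: torch.Tensor):
    dense_fp8 = w_pruned.to(torch.float8_e4m3fn)
    sparse, meta = to_sparse_semi_structured_cutlass_sm9x_f8(dense_fp8)
    return sparse, meta
```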
    if not (scale > 0).all():
        raise ValueError(f"Scale contains non-positive values: min={scale.min()}")

    scale_expanded = scale.unsqueeze(1)
Is this different from Float8Tensor? Can we use the same scale calculation logic as we use there?
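For comparison, the usual rowwise fp8 scale computation looks roughly like this (a generic amax-based sketch, not necessarily Float8Tensor's exact code path):

```python
import torch


def rowwise_fp8_scale(w: torch.Tensor, fp8_dtype=torch.float8_e4m3fn) -> torch.Tensor:
    """Sketch: scale each row so its absolute max maps to the fp8 dtype's max value."""
    fp8_max = torch.finfo(fp8_dtype).max          # 448.0 for e4m3fn
    amax = w.abs().amax(dim=-1, keepdim=True)     # per-row absolute maximum
    return (amax / fp8_max).clamp(min=1e-12)      # avoids zero scales for all-zero rows


w = torch.randn(64, 128)
scale = rowwise_fp8_scale(w)
w_fp8 = (w / scale).to(torch.float8_e4m3fn)  # quantize with the rowwise scale
```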
    fp8_compressed = to_sparse_semi_structured(fp8_data_fp16)

    return cls(
        fp8_data,  # dense for dequantization
we shouldn't be storing both the dense data and the compressed data; we should be storing the sparse specified values and the sparse metadata.
        float8_dtype=float8_dtype,
    )

    def dequantize(self, output_dtype: Optional[torch.dtype] = None) -> torch.Tensor:
we should multiply by identity matrix to dequantize, like we do here:
    def get_plain(self):
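A rough illustration of the identity-matrix trick, shown with torch's default fp16 2:4 path since it is easy to run end to end (the PR's fp8 path would go through get_plain and the torchao CUTLASS ops instead; sizes and device here are just for the example):

```python
import torch
from torch.sparse import to_sparse_semi_structured

# 2:4-prune an fp16 weight (same scatter-based pruning as in the PR above)
w = torch.randn(128, 128, dtype=torch.float16, device="cuda")
pruning_inds = w.abs().view(-1, 4).argsort(dim=1)[:, :2]
w.view(-1, 4).scatter_(1, pruning_inds, value=0)
w_24 = to_sparse_semi_structured(w)

# the packed layout only supports sparse @ dense matmul, so the plain dense
# weight is recovered by multiplying with an identity matrix
identity = torch.eye(w.shape[-1], dtype=torch.float16, device="cuda")
w_dense = w_24 @ identity  # equals the pruned dense weight; apply scales afterwards to dequantize
```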
    x_vals_fp8 = scaled_x.to(torch.float8_e4m3fn)

    # MatMul
    x_padded = SparseSemiStructuredTensorCUSPARSELT._pad_dense_input(
We should use the torchao cutlass fp8 kernels, which fuse in the scale multiplication here. See:
    def _linear_fp8_act_fp8_weight_sparse_cutlass_impl(input_tensor, weight_tensor, bias):
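That fused path would look roughly like this; the op name comes from torchao's CUTLASS sparse kernels, but the argument order and import path below are assumptions, so check the referenced _linear_fp8_act_fp8_weight_sparse_cutlass_impl for the real call:

```python
import torch
from torchao.ops import rowwise_scaled_linear_sparse_cutlass_f8f8  # assumed import path


# Sketch: a single fused CUTLASS kernel does the fp8 activation x fp8 2:4-sparse
# weight matmul and applies both rowwise scales, instead of a cuSPARSELt matmul
# followed by separate padding and scale multiplications.
def fp8_sparse_linear(x_fp8, x_scale, w_sparse, w_meta, w_scale, bias=None):
    # argument order here is an assumption; see the impl referenced above
    return rowwise_scaled_linear_sparse_cutlass_f8f8(
        x_fp8, x_scale, w_sparse, w_meta, w_scale, bias
    )
```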
delete? we don't want this to be in prototype I think
this should be added to the init file without the prototype in the path
also need to add to Float8DynamicActivationFloat8WeightConfig?
Summary:
Introduce a new W8A8-FP-CSR quantization API, Float8SemiSparseTensor, which specializes in the 2:4 semi-sparse pattern using cuSPARSELt acceleration (https://docs.nvidia.com/cuda/cusparselt/).
Related Issue/PR: #2752
Future Plan:
This PR only introduces the core operations (quantization/dequantization). For better API support, we will need to introduce tensor utility operations like indexing and slicing.
Test Plan:
test/prototype/quantization/quantize_/float8/test_float8_semisparse_tensor.py