[Transform] [Quantization] Add QuTLASS support to vLLM #24440

LopezCastroRoberto · 2025-09-08T12:02:04Z

Purpose

This pull request brings in the QuTLASS library: https://github.com/iST-DASLab/qutlass

QuTLASS is a high-performance library designed for low-precision kernel support in deep learning quantization, built on top of NVIDIA CUTLASS.

QuTLASS v0.1.0 introduces 4-bit microscaling routines tailored for Large Language Model (LLM) inference on NVIDIA Blackwell GPUs.

Online rotations:
- Fused transform + quantization + scale computation.
  - Rotation matrices loaded at runtime, allowing any transformation to be applied.
- Support for both NVFP4 and MXFP4 microscaling formats.
- Multiple rotation sizes (16/32/64/128).

MXFP4 matmul kernel support powered by CUTLASS.
- QuTLASS is compatible with any matmul backend supporting microscaling formats (e.g., CUTLASS, FlashInfer).

Multiple quantization schemes:
- Quartet (i.e., Quest-like).
- Abs-Max.
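
For readers unfamiliar with the MXFP4 format mentioned above, here is a minimal pure-Python sketch of abs-max block quantization as described by the OCP Microscaling (MX) format: 32 values share one power-of-two E8M0 scale, and each element is stored as a 4-bit E2M1 value. This is only an illustration of the number format, not the QuTLASS kernel; the function names are hypothetical, and ties are broken toward the smaller magnitude for simplicity rather than round-to-nearest-even.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2    # exponent of the largest E2M1 value: 6 = 1.5 * 2**2
E8M0_BIAS = 127  # E8M0 stores a biased power-of-two exponent


def quantize_block_mxfp4(block):
    """Hypothetical helper: abs-max quantize one block to (e8m0_scale, e2m1_values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return E8M0_BIAS, [0.0] * len(block)
    # Shared exponent: align the block maximum with the top E2M1 binade.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    quantized = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp + E8M0_BIAS, quantized


def dequantize_block(e8m0, elems):
    """Hypothetical helper: undo the shared scale."""
    scale = 2.0 ** (e8m0 - E8M0_BIAS)
    return [scale * q for q in elems]


block = [6.0, 3.0, -1.5, 0.5] + [0.0] * 28  # one MXFP4 group of 32 values
e8m0, elems = quantize_block_mxfp4(block)
restored = dequantize_block(e8m0, elems)
```

NVFP4 follows the same idea with smaller groups and a different scale encoding; that variant is not sketched here.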

Microbenchmarking

benchmarks/kernels/bench_mxfp4_qutlass.py
benchmarks/kernels/bench_nvfp4_qutlass.py

QuTLASS performance on a single Qwen3-32B layer with NVIDIA RTX5090 GPU

QuTLASS performance on a single Llama-3.1-70B layer with NVIDIA B200 GPU

[WIP] End-to-end

python benchmarks/benchmark_latency.py
- daslab-testing/Llama-3.3-70B-Instruct-FPQuant-GPTQ-MXFP4-hadamard
- meta-llama/Llama-3.3-70B-Instruct

FP16

Quantization	Llama: MMLU-CoT	GSM8k	Hellaswag	Winogrande	Average	Recovery %
N/A	0.866	0.951	0.862	0.849	0.882	-

MXFP4

Quantization	Llama: MMLU-CoT	GSM8k	Hellaswag	Winogrande	Average	Recovery %
RTN	0.834	0.927	0.839	0.815	0.854	96.8
RTN + HAD (GS32)	0.839	0.936	0.841	0.836	0.863	97.8
GPTQ	0.838	0.945	0.844	0.826	0.863	97.9
GPTQ + HAD (GS32)	0.848	0.945	0.847	0.831	0.868	98.4
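
The PR description does not define "Recovery %". Assuming it is the MXFP4 average score divided by the FP16 average score, the short check below recomputes both columns from the per-task scores in the tables above. The helper names are illustrative only.

```python
# Per-task scores copied from the tables above
# (MMLU-CoT, GSM8k, Hellaswag, Winogrande).
fp16 = [0.866, 0.951, 0.862, 0.849]
mxfp4 = {
    "RTN": [0.834, 0.927, 0.839, 0.815],
    "RTN + HAD (GS32)": [0.839, 0.936, 0.841, 0.836],
    "GPTQ": [0.838, 0.945, 0.844, 0.826],
    "GPTQ + HAD (GS32)": [0.848, 0.945, 0.847, 0.831],
}


def average(scores):
    return sum(scores) / len(scores)


def recovery_percent(scores, baseline):
    # Assumption: recovery = quantized average / FP16 average, in percent.
    return 100.0 * average(scores) / average(baseline)


for name, scores in mxfp4.items():
    avg = average(scores)
    rec = recovery_percent(scores, fp16)
    print(f"{name}: average={avg:.3f}, recovery={rec:.1f}%")
```

Under that assumption, RTN + HAD and GPTQ share the same rounded average (0.863) but differ in recovery because the unrounded averages differ slightly.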

Testing

tests/kernels/quantization/test_mxfp4_qutlass.py
tests/kernels/quantization/test_nvfp4_qutlass.py

gemini-code-assist

Code Review

This pull request integrates the QuTLASS library to add support for 4-bit quantization kernels, including new custom ops, benchmarks, and tests. The changes are well-structured. I have two high-severity suggestions: one to improve build reproducibility by pinning the QuTLASS dependency to a specific version, and another to fix a bug in a new test file to prevent future issues.

gemini-code-assist · 2025-09-08T12:06:03Z

cmake/external_projects/qutlass.cmake

+  FetchContent_Declare(
+    qutlass
+    GIT_REPOSITORY https://github.com/IST-DASLab/qutlass.git
+    GIT_TAG main


Using main as the GIT_TAG can lead to non-reproducible builds and may break the build if there are incompatible changes in the QuTLASS repository's main branch. It is highly recommended to pin this to a specific commit hash or a release tag (like v0.1.0 as mentioned in the PR description) to ensure build stability and reproducibility.

GIT_TAG v0.1.0

gemini-code-assist · 2025-09-08T12:06:03Z

tests/kernels/quantization/test_mxfp4_qutlass.py

+        b_dq, *_ = _dq_fp4(b_e2m1, b_e8m0[:n, :k], alpha=1.)
+        out_ref = a_dq @ b_dq.transpose(-2, -1)
+
+        out = qutlass.matmul_ada_mxf4_bf16_tn(a_e2m1, b_e2m1, a_e8m0, b_e8m0, alpha)


The run_problem_ada function attempts to call qutlass.matmul_ada_mxf4_bf16_tn, but qutlass is not defined or imported. This will result in a NameError. Although this function is not currently called, it's best to fix it to prevent future issues.

To fix this, you should add matmul_ada_mxf4_bf16_tn to your imports at the top of the file:

from vllm._custom_ops import matmul_mxf4_bf16_tn, fusedQuantizeMx, matmul_ada_mxf4_bf16_tn

And then update this line accordingly.

Suggested change

out = qutlass.matmul_ada_mxf4_bf16_tn(a_e2m1, b_e2m1, a_e8m0, b_e8m0, alpha)

out = matmul_ada_mxf4_bf16_tn(a_e2m1, b_e2m1, a_e8m0, b_e8m0, alpha)

voipmonitor · 2025-09-08T12:14:03Z

@LopezCastroRoberto does this PR support gpt-oss on sm120? How exactly can I test some MXFP4 models with this PR? I would love to test this on an RTX 6000 Pro.

kylesayrs · 2025-09-08T19:15:53Z

benchmarks/kernels/bench_mxfp4_qutlass.py

+    return torch.tensor(
+        hadamard(group_size) * group_size**-0.5, dtype=dtype, device=device
+    )


Can you use our hadamard utility for consistency?

from compressed_tensors.transform.utils.hadamard import deterministic_hadamard_matrix

Suggested change

return torch.tensor(
    hadamard(group_size) * group_size**-0.5, dtype=dtype, device=device
)

return deterministic_hadamard_matrix(group_size, dtype=dtype, device=device) * group_size**-0.5
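
As background on the `group_size**-0.5` factor in both versions: scaling a Hadamard matrix by 1/sqrt(n) makes it orthonormal, so the rotation can be undone by its transpose. The pure-Python sketch below checks this for n = 32 using Sylvester's construction, which is an assumption made for illustration and not necessarily how either utility builds its matrix.

```python
def sylvester_hadamard(n):
    """Build an n x n Hadamard matrix (n a power of two) by Sylvester doubling."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def normalized_hadamard(n):
    scale = n ** -0.5
    return [[scale * x for x in row] for row in sylvester_hadamard(n)]


def gram(m):
    """Return m @ m^T as nested lists."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


n = 32
g = gram(normalized_hadamard(n))
# Largest deviation of H @ H^T from the identity matrix.
max_err = max(
    abs(g[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n)
)
print(f"max |H H^T - I| = {max_err:.2e}")
```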

kylesayrs · 2025-09-08T19:17:01Z

benchmarks/kernels/bench_mxfp4_qutlass.py

+
+def build_mxfp4_runner(cfg, a, b, forward_hadamard_matrix, dtype, device):
+    weight_hf_e2m1, weight_hf_scale_block = _quant_weight_mxfp4(b, forward_hadamard_matrix, device)
+    alpha = torch.Tensor([1.]).to("cuda")


Suggested change

alpha = torch.Tensor([1.]).to("cuda")

alpha = torch.tensor([1.], device="cuda")

kylesayrs · 2025-09-08T19:17:57Z

benchmarks/kernels/bench_nvfp4_qutlass.py

+
+def get_hadamard_matrix(group_size: int, dtype: torch.dtype, device: torch.device):
+    return torch.tensor(
+        hadamard(group_size) * group_size**-0.5, dtype=dtype, device=device


Same here, use our util

kylesayrs · 2025-09-08T19:19:30Z

benchmarks/kernels/bench_mxfp4_qutlass.py

+    'Llama-3.1-70B': [(8192, 8192), (8192, 57344), (28672, 8192)]
+}
+
+for model, layers in MODELS.items():


Please wrap in `if __name__ == "__main__":`

Consider adding some user arguments
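
A minimal sketch of what this request could look like, assuming hypothetical `--models` and `--batch-sizes` flags (the actual benchmark script may expose different options, and the benchmark body is replaced by a `print`):

```python
import argparse

MODELS = {
    "Llama-3.1-70B": [(8192, 8192), (8192, 57344), (28672, 8192)],
}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Kernel microbenchmark sketch")
    # Hypothetical flags, shown only to illustrate user-selectable settings.
    parser.add_argument("--models", nargs="+", default=list(MODELS),
                        choices=list(MODELS))
    parser.add_argument("--batch-sizes", nargs="+", type=int,
                        default=[1, 16, 128])
    # parse_known_args ignores flags that belong to an outer launcher.
    args, _unknown = parser.parse_known_args(argv)
    return args


def iter_problems(args):
    """Yield (model, layer_shape, batch_size) for every requested combination."""
    for model in args.models:
        for shape in MODELS[model]:
            for batch_size in args.batch_sizes:
                yield model, shape, batch_size


def main(argv=None):
    for model, shape, batch_size in iter_problems(parse_args(argv)):
        # Placeholder for the real benchmark call.
        print(f"{model} layer={shape} batch={batch_size}")


if __name__ == "__main__":
    main()
```

With the guard in place, importing the module (for example from a test) no longer starts the benchmark loop.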

kylesayrs · 2025-09-08T19:21:48Z

benchmarks/kernels/bench_nvfp4_qutlass.py

+    'Llama-3.1-70B': [(8192, 8192), (8192, 57344), (28672, 8192)]
+}
+
+for model, layers in MODELS.items():


Please wrap in `if __name__ == "__main__":`

Consider allowing users to specify arguments, that way you don't have to have commented code

kylesayrs · 2025-09-08T19:27:31Z

vllm/_custom_ops.py

+
+def fusedQuantizeMx(a: torch.Tensor,
+                    b: torch.Tensor,
+                    *,


What's the point of this *?

It means that all arguments after the * must be passed by keyword, not by position. My goal was to make the API clearer and less error-prone.

That's fair!
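
To illustrate the behavior discussed here, a small standalone example of keyword-only parameters. The function below is a hypothetical stand-in, not the real `fusedQuantizeMx`, and the `"abs_max"` option name is assumed for illustration.

```python
def fused_quantize(a, b, *, method="quest"):
    """Stand-in function: everything after the bare * is keyword-only."""
    if method not in ("quest", "abs_max"):
        raise ValueError(f"unknown method: {method!r}")
    return (a, b, method)


# Passing the option by keyword works.
result = fused_quantize(1, 2, method="abs_max")

# Passing it positionally is rejected at call time.
try:
    fused_quantize(1, 2, "abs_max")
except TypeError as exc:
    print(f"rejected positional call: {exc}")
```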

kylesayrs · 2025-09-08T19:35:54Z

vllm/_custom_ops.py

+    xh_e8m0      = torch.empty(padded_rows, padded_cols, dtype=torch.float8_e8m0fnu, device=a.device)
+
+    if method=="quest":
+        return torch.ops._qutlass_C.fusedQuantizeMxQuest(a, b, xh_e2m1, xh_e8m0)


Because these functions have a return value, you'll want to register a fake function so torch compile works right

if hasattr(torch.ops._C, "_qutlass_C"):
    @register_fake("_C::_qutlass_C::fusedQuantizeMxQuest")
    def fake_qutlass_mx_quest(a: torch.Tensor, b: torch.Tensor, xh_e2m1: torch.Tensor, xh_e8m0: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        return (torch.empty(...), torch.empty(...))

kylesayrs · 2025-09-08T19:36:34Z

vllm/qutlass_utils/utils.py

+        output_block_stride,
+        BLOCK_ROWS: tl.constexpr,
+        BLOCK_COLS: tl.constexpr,
+    ):


Is this over-indented? I think we should standardize on 4 space indent

kylesayrs · 2025-09-08T19:41:57Z

vllm/qutlass_utils/utils.py

+    return (a + b - 1) // b
+
+
+def to_blocked(input_matrix, use_triton_kernel: bool = False) -> Tensor:


Just as a style thing, consider calling triton_mx_block_rearrange in cases where you want to use the triton kernel and to_blocked otherwise

How about keeping one to_blocked but making the backend explicit (e.g. backend="torch" | "triton" | "auto")?
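
A rough sketch of the explicit-backend variant proposed in the reply. The two `_to_blocked_*` functions are placeholders that only report which path was taken; they do not implement the real block rearrangement, and the `"auto"` rule (use Triton when it is importable) is an assumption.

```python
import importlib.util
from typing import Literal

Backend = Literal["torch", "triton", "auto"]


def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def _to_blocked_torch(matrix):
    # Placeholder for the pure-PyTorch rearrangement.
    return ("torch", matrix)


def _to_blocked_triton(matrix):
    # Placeholder for the Triton kernel launch.
    return ("triton", matrix)


def _triton_available() -> bool:
    return importlib.util.find_spec("triton") is not None


def to_blocked(matrix, backend: Backend = "auto"):
    """Single entry point; the backend is chosen by name, not by a bool flag."""
    if backend == "auto":
        backend = "triton" if _triton_available() else "torch"
    if backend == "torch":
        return _to_blocked_torch(matrix)
    if backend == "triton":
        return _to_blocked_triton(matrix)
    raise ValueError(f"unknown backend: {backend!r}")


print(to_blocked([[1, 2], [3, 4]], backend="torch")[0])
print(ceil_div(33, 32))
```

Compared with `use_triton_kernel: bool`, a named backend reads more clearly at the call site and leaves room for further options without changing the signature.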

kylesayrs · 2025-09-08T19:44:54Z

benchmarks/kernels/bench_nvfp4_qutlass.py

+    # Quantize activation on-the-fly
+    def run():
+        input_hf_e2m1, input_hf_e8m0 = fusedQuantizeNv(a, forward_hadamard_matrix, global_scale)
+        input_hf_scale_block = to_blocked(input_hf_e8m0, True).view(-1,K//16)


Will the Triton JIT affect the benchmarked runtime? I.e., does the first-time compile cause the first graph to take longer than normal?

yes—the very first time is slower, but after that it's cached
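
One common way to keep a one-time compile out of the measurement is to run a few untimed warmup calls first. The sketch below shows the pattern with a pure-Python stand-in whose first call is artificially slow; `LazyKernel` and `benchmark` are illustrative names, not part of the PR. For real GPU kernels the timed region would also need device synchronization (for example `torch.cuda.synchronize()`), which is omitted here.

```python
import time


def benchmark(fn, warmup=3, iters=20):
    """Return the mean seconds per call, excluding warmup calls."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as JIT compilation
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


class LazyKernel:
    """Stand-in for a JIT-compiled kernel: slow on the first call only."""

    def __init__(self):
        self.compile_count = 0

    def __call__(self):
        if self.compile_count == 0:
            time.sleep(0.05)  # simulated first-call compilation
            self.compile_count += 1
        return sum(range(1000))


kernel = LazyKernel()
mean_seconds = benchmark(kernel)
print(f"mean per call after warmup: {mean_seconds * 1e6:.1f} us")
```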

LopezCastroRoberto · 2025-09-09T11:24:07Z

@voipmonitor This PR supports dense models only, and it's perfectly fine to use an RTX 6000 Pro. We will add usage examples to this PR soon.

We’re actively working on MoE support in QuTLASS—stay tuned :)

Signed-off-by: LopezCastroRoberto <[email protected]>

…24440) Signed-off-by: LopezCastroRoberto <[email protected]> Signed-off-by: Roberto L. Castro <[email protected]> Signed-off-by: Andrei Panferov <[email protected]> Co-authored-by: Andrei Panferov <[email protected]> Co-authored-by: Michael Goin <[email protected]> Signed-off-by: Dhruvil Bhatt <[email protected]>

bbrowning · 2025-11-25T02:12:57Z

cmake/external_projects/qutlass.cmake

+cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0a" "${CUDA_ARCHS}")
+if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND QUTLASS_ARCHS)
+
+  if(QUTLASS_ARCHS MATCHES "10\\.0a")
+    set(QUTLASS_TARGET_CC 100)
+  elseif(QUTLASS_ARCHS MATCHES "12\\.0a")
+    set(QUTLASS_TARGET_CC 120)


This excludes the DGX Spark (SM121) from using QuTLASS, since it is not 12.0a. If you adjust this to allow 12.0f or 12.1a, this will also get built for the spark. I built this on my Spark locally and all tests in test_mxfp4_qutlass.py and test_nvfp4_qutlass.py pass, but I'm not sure if that's all that's needed to verify this would work for that hardware.

Is there a way I can verify this is getting used in a running vLLM, outside of those tests?

BlackSamorez · 2025-11-25T22:11:25Z

There have been reports of NVFP4 QuTLASS failing on RTX 5090 and DGX Spark too.

bbrowning · 2025-11-25T22:54:05Z

> There have been reports of NVFP4 QuTLASS failing on RTX 5090 and DGX Spark too.

I tested locally loosening up the cmake CUDA_ARCHS matching and was able to run an example FP-Quant model fine on a DGX Spark building vLLM from source. I don't have an RTX 5090 to test, but on the surface I would have assumed it already matched this CUDA_ARCHS check.

mgoin · 2025-11-26T00:34:09Z

Can we just build for 10.0+PTX for instance?

BlackSamorez · 2025-11-26T13:16:54Z

If I remember correctly, the kernel dispatch is static w.r.t. arch because the kernels were separately tuned for each architecture. Compiling for a non-supported arch would throw runtime errors. @LopezCastroRoberto would we have to re-tune the kernels for 12.1a?

LopezCastroRoberto requested review from LucasWilkinson, WoosukKwon, tlrmchlsmth and yewentao256 as code owners September 8, 2025 12:02

mergify bot added the ci/build and performance labels Sep 8, 2025

gemini-code-assist bot reviewed Sep 8, 2025

jeejeelee requested a review from mgoin September 8, 2025 15:33

kylesayrs reviewed Sep 8, 2025

LopezCastroRoberto closed this Sep 9, 2025

LopezCastroRoberto reopened this Sep 9, 2025

LopezCastroRoberto marked this pull request as draft September 9, 2025 10:24

LopezCastroRoberto force-pushed the transforms branch from f9ca647 to dce5334 Compare September 11, 2025 09:09

github-project-automation bot added this to Structured Output Sep 11, 2025

mergify bot added the speculative-decoding, v1, tpu, and tool-calling labels Sep 11, 2025

github-project-automation bot added this to Tool Calling Sep 11, 2025

LopezCastroRoberto and others added 6 commits October 8, 2025 06:31

change dir qutlass utils

8cca014

Signed-off-by: LopezCastroRoberto <[email protected]>

remove noqa e501

b3e354e

Signed-off-by: LopezCastroRoberto <[email protected]>

Merge branch 'vllm-project:main' into transforms

50eec02

isolate QuTLASS compile flags

df98064

Signed-off-by: LopezCastroRoberto <[email protected]>

fix docstring

057cbde

Signed-off-by: LopezCastroRoberto <[email protected]>

Merge branch 'main' into transforms

df2ee30

mgoin approved these changes Oct 9, 2025

mgoin enabled auto-merge (squash) October 9, 2025 21:09

vllm-bot merged commit 96ad65b into vllm-project:main Oct 10, 2025
84 of 86 checks passed

FENP mentioned this pull request Oct 13, 2025

[Bugfix][CI/Build] Fix build error when using QUTLASS_SRC_DIR to build vllm from source. #26688

Closed

5 tasks

BlackSamorez mentioned this pull request Oct 16, 2025

Does nvfp4 inference be supported？ IST-DASLab/FP-Quant#10

Closed

daniel-fahey mentioned this pull request Oct 23, 2025

python3Packages.vllm: 0.11.0 -> 0.11.2 NixOS/nixpkgs#454872

Closed

13 tasks

BlackSamorez mentioned this pull request Oct 24, 2025

Evaluation Extremely Slow After Quantization IST-DASLab/FP-Quant#8

Open

mratsim mentioned this pull request Nov 9, 2025

[RFC]: MR-GPTQ (GPTQ+NVFP4) vllm-project/llm-compressor#2006

Open

This was referenced Nov 12, 2025

online_rotations #15162

Closed

[RFC]: 4-bit KV cache quantization through Hadamard transforms #28538

Open

bbrowning reviewed Nov 25, 2025
