[ROCm][Quantization] extend AMD Quark to support mixed-precision quantized model #24239
@@ -281,4 +281,36 @@ python quantize_quark.py --model_dir Qwen/Qwen1.5-MoE-A2.7B-Chat \
    --group_size 32
```

-The current integration supports [all combinations of FP4, FP6_E3M2, FP6_E2M3](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/ocp_mx_utils.py) used for either weights or activations. Eventually, some target hardware supports mixed-precision GEMM, such as AMD Instinct MI350/MI355, for example using FP6 for activations and FP4 for weights.
+The current integration supports [all combinations of FP4, FP6_E3M2, FP6_E2M3](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/ocp_mx_utils.py) used for either weights or activations.

## Using Quark-Quantized Layerwise Auto Mixed Precision (AMP) Models

vLLM also supports loading layerwise mixed-precision models quantized with AMD Quark. Currently, the mixed scheme {MXFP4, FP8} is supported, where FP8 denotes the FP8 per-tensor scheme. More mixed-precision schemes are planned for the near future, including:

- Unquantized Linear and/or MoE layers as an option for each layer, i.e., a mix of {MXFP4, FP8, BF16/FP16}
- An MXFP6 quantization extension, i.e., {MXFP4, MXFP6, FP8, BF16/FP16}

Review discussion:

- Suggested change: …
- MXFP6 only would be misleading as a single-scheme (bit-width) quantization.
- Just trying to make the doc less verbose (MXFP4, BF16, FP8 usability is already implied above).
- Let's make it accurate. {MXFP4, MXFP6, FP8, BF16/FP16} is a whole, a base unit for mixed precision.

Although one can maximize serving throughput by using the lowest precision supported on a given device (e.g. MXFP4 for AMD Instinct MI355, FP8 for AMD Instinct MI300), these aggressive schemes can be detrimental to accuracy recovery after quantization on target tasks. Mixed precision makes it possible to strike a balance between accuracy and throughput.

There are two steps to generate and deploy a mixed-precision model quantized with AMD Quark, as shown below.

### 1. Quantize a model using mixed precision in AMD Quark

First, a layerwise mixed-precision configuration is searched for the given LLM, and the model is then quantized using AMD Quark. A detailed tutorial on the Quark APIs will be provided later.
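For intuition only, the outcome of such a search can be thought of as a per-layer assignment of schemes. The sketch below is a conceptual illustration; the layer names, scheme labels, and dictionary structure are assumptions for illustration and do not reflect the actual Quark export format.

```python
# Conceptual result of a layerwise mixed-precision search: accuracy-sensitive
# layers keep the higher-precision FP8 per-tensor scheme, while more tolerant
# layers drop to MXFP4. Names and structure are illustrative only.
searched_precision = {
    "model.layers.0.self_attn.qkv_proj": "fp8_per_tensor",
    "model.layers.0.self_attn.o_proj": "fp8_per_tensor",
    "model.layers.0.mlp.gate_up_proj": "mxfp4",
    "model.layers.0.mlp.down_proj": "mxfp4",
    # ... one entry per quantizable layer in the model
}
```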

As examples, we provide some ready-to-use quantized mixed-precision models that show the usage in vLLM and the accuracy benefits:

- amd/Llama-2-70b-chat-hf-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8
- amd/Mixtral-8x7B-Instruct-v0.1-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8
- amd/Qwen3-8B-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8

Review discussion (on lines +301 to +305):

- Make these public + add link.
- They're going to be published.

### 2. Run inference on the quantized mixed-precision model in vLLM

Models quantized with AMD Quark using mixed precision can natively be reloaded in vLLM and, for example, evaluated with lm-evaluation-harness as follows:

```bash
lm_eval --model vllm \
    --model_args pretrained=amd/Llama-2-70b-chat-hf-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8,tensor_parallel_size=4,dtype=auto,gpu_memory_utilization=0.8,trust_remote_code=False \
    --tasks mmlu \
    --batch_size auto
```
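The same checkpoints can also be loaded directly for offline inference. Below is a minimal sketch, assuming one of the model IDs listed above and illustrative engine arguments:

```python
from vllm import LLM, SamplingParams

# The Quark mixed-precision scheme is detected from the checkpoint's
# quantization config, so no explicit quantization argument is needed here.
llm = LLM(
    model="amd/Qwen3-8B-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8",
    tensor_parallel_size=1,        # illustrative; increase for larger models
    gpu_memory_utilization=0.8,
)

outputs = llm.generate(
    ["Explain mixed-precision quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```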
@@ -0,0 +1,69 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Test quark-quantized {MXFP4, FP8} mixed precision models.

Run `pytest tests/quantization/test_mixed_precision.py`.
"""

import importlib
import importlib.metadata
import importlib.util  # needed for importlib.util.find_spec below
from dataclasses import dataclass

import lm_eval
import pytest
from packaging import version

QUARK_MXFP4_AVAILABLE = importlib.util.find_spec("quark") is not None and version.parse(
    importlib.metadata.version("amd-quark")
) >= version.parse("0.8.99")


@dataclass
class ModelCase:
    model_id: str
    tp: int


@dataclass
class EvaluationConfig:
    model_name: str

    def get_model_args(self) -> str:
        return (
            f"pretrained={self.model_name},"
            "tensor_parallel_size=4,dtype=auto,gpu_memory_utilization=0.8,trust_remote_code=False"
        )


TEST_CONFIGS = {
    # Mixed-precision (AMP) model
    # - Demonstrates end-to-end pipeline functionality
    "amd/Qwen3-8B-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8": {"arc_challenge": 0.52, "mmlu": 0.72},
    # Non-mixed-precision (PTQ) model
    # - Reference for pipeline compatibility verification -> no conflicts or breakage
    "amd/Llama-2-70b-chat-hf-FP8-MLPerf-fp8_attn_quark_format": {
        "arc_challenge": 0.53,
        "mmlu": 0.61,
    },
}


@pytest.mark.parametrize("model_name, accuracy_numbers", TEST_CONFIGS.items())
@pytest.mark.skipif(not QUARK_MXFP4_AVAILABLE, reason="amd-quark>=0.9 is not available")
def test_mixed_precision_model_accuracies(model_name: str, accuracy_numbers: dict):
    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=EvaluationConfig(model_name).get_model_args(),
        tasks=list(accuracy_numbers.keys()),
        batch_size=8,
    )

    rtol = 0.05

    for task, expect_accuracy in accuracy_numbers.items():
        measured_accuracy = results["results"][task]["acc,none"]
        assert (
            measured_accuracy - rtol < expect_accuracy
            and measured_accuracy + rtol > expect_accuracy
        ), f"Expected: {expect_accuracy} | Measured: {measured_accuracy}"
@@ -114,7 +114,14 @@ def from_config(cls, config: dict[str, Any]) -> "QuarkConfig":
            layer_quant_names = list(layer_quant_config.keys())
            layer_quant_set = set(layer_quant_names)

-           if not kv_cache_set.issubset(layer_quant_set):
+           if not (
+               kv_cache_set.issubset(layer_quant_set)
+               or any(
+                   fnmatch.fnmatchcase(layer_quant, pat)
+                   for layer_quant in list(layer_quant_set)
+                   for pat in list(kv_cache_set)
+               )
+           ):
                raise ValueError(
                    "The Quark quantized model has the "
                    "kv_cache_group parameter setting, "

@@ -124,10 +131,15 @@ def from_config(cls, config: dict[str, Any]) -> "QuarkConfig":
                )

            q_configs = [
-               cast(dict[str, Any], layer_quant_config.get(name))
-               for name in kv_cache_group
+               quant_cfg
+               for name, quant_cfg in layer_quant_config.items()
+               if any(fnmatch.fnmatchcase(name, pattern) for pattern in kv_cache_group)
            ]
-           if not all(deep_compare(q_config, q_configs[0]) for q_config in q_configs):
+
+           if not all(
+               deep_compare(q_config["output_tensors"], q_configs[0]["output_tensors"])
+               for q_config in q_configs
+           ):
                raise ValueError(
                    "The quantization method used for kv_cache should "
                    "be the same, but the quantization method for the "
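To illustrate the intent of the relaxed check above: with mixed precision, the entries matched by the kv_cache group may use different weight schemes while still sharing the same output (KV-cache) scheme, so only the `output_tensors` sub-config needs to agree. A toy sketch with made-up configs follows; only the `output_tensors` key mirrors the diff, everything else is invented for illustration:

```python
# Two per-layer configs that differ in weight scheme but share the KV-cache
# (output_tensors) scheme. All field values here are made up for illustration.
fp8_layer = {
    "weight": {"dtype": "fp8_e4m3"},
    "output_tensors": {"dtype": "fp8_e4m3", "qscheme": "per_tensor"},
}
mxfp4_layer = {
    "weight": {"dtype": "mxfp4"},
    "output_tensors": {"dtype": "fp8_e4m3", "qscheme": "per_tensor"},
}
q_configs = [fp8_layer, mxfp4_layer]

# Comparing the full configs would reject this mix (the weights differ) ...
assert not all(cfg == q_configs[0] for cfg in q_configs)
# ... but comparing only the output_tensors sub-configs accepts it.
assert all(cfg["output_tensors"] == q_configs[0]["output_tensors"] for cfg in q_configs)
```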

@@ -312,9 +324,9 @@ def _find_matched_config(
        layer_quant_config = cast(
            dict[str, Any], self.quant_config.get("layer_quant_config")
        )
-       for name_pattern in layer_quant_config:
-           if fnmatch.fnmatch(layer_name, name_pattern):

Review discussion (on removed lines -315 to -316):

- Is this change necessary? Also …
- Update as also suggested #24239 (comment).
- I am confused. What is this PR changing here?
- The gemini-code-assist bot had similar questions. Please see my comments above, e.g., #24239 (comment).
- Thanks for referencing our previous discussion, @xuebwang-amd. I'd like to clarify the change in behavior introduced by replacing … Key difference: this change fundamentally alters how layer names are matched against the … Unless there's a specific reason to remove wildcard matching, I recommend reverting to …
- Suggested change: …
- The Gemini code bot is not useful here. @xuebwang-amd, I don't understand why this PR introduces handling different from e.g. … Why would the handling in vLLM be different from what we have in Quark, e.g. when reloading models through the Transformers library? I think it is not a good thing. Maybe existing models rely on …
- There have been lots of discussions about it in this PR.
- I don't understand why it is okay to do the change here but not with the Transformers backend (reloading Quark models through Transformers). Or maybe I misunderstand something.
- I see, so you want precise matching to take precedence over wildcard matching. I'd suggest keeping the wildcard matching logic after your exact-match loop. Otherwise, it looks like the new code won't match wildcards anymore for non-mixed-precision models.
- I can fully understand your concern here. Please find my explanations above, e.g., #24239 (comment). To ensure no breakage of or conflict with existing PTQ model matching, I added a non-mixed-precision (PTQ, public) model as a reference to demonstrate pipeline compatibility in the … The conclusion is: no conflicts or breakage using the precise substring-containment matching rule.

-               return layer_quant_config[name_pattern]
+       for name_pattern, config in layer_quant_config.items():
+           if layer_name in name_pattern:

Review discussion (on added lines +327 to +328):

- Do we make sure somewhere that e.g. …
- Good question. …

+               return config

        layer_type = cast(str, type(module))
        layer_type_quant_config = cast(
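For readers following the thread above, here is a minimal standalone sketch of the two matching behaviors being discussed (glob-style wildcard matching versus substring containment); the pattern and layer name below are hypothetical:

```python
import fnmatch

# Hypothetical wildcard pattern as it might appear as a layer_quant_config key,
# and a concrete layer name as vLLM would look it up.
pattern = "*.mlp.down_proj"
layer_name = "model.layers.0.mlp.down_proj"

# Previous behavior: glob-style matching, so the '*' absorbs any prefix.
print(fnmatch.fnmatch(layer_name, pattern))   # True

# New behavior: substring containment of the layer name in the key, which a
# wildcard key does not satisfy ...
print(layer_name in pattern)                  # False

# ... whereas an exact per-layer key (as emitted for mixed-precision
# checkpoints) trivially contains itself.
print(layer_name in "model.layers.0.mlp.down_proj")  # True
```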

Review discussion:

- It's better not to draw conclusions or provide guiding descriptions about why layers are quantized or not quantized; they're searched.
- Just trying to make the doc less verbose.
- Let's make it accurate.