[Model] Gemma3: Fix GGUF loading and quantization #26189
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces support for Gemma3 GGUF quantization, addressing several key issues including model type mismatches, incorrect weight handling that led to gibberish output, and enabling proper Q4_0 compression. The changes are well-structured and include extensive testing, which provides confidence in the fix. My main feedback is to improve the robustness of the error handling in the Gemma3 model detection logic by catching more specific exceptions instead of a broad `Exception`.
Hi @22quinn, thanks for taking the time to review this. Please let me know if you have any questions or concerns. This PR is really important for us on the Gemma team. Thanks, Luciano Martins.
# Apply Gemma3-specific RMSNorm weight correction
# GemmaRMSNorm computes: output = x * (1 + weight)
# Standard PyTorch: output = x * weight
#
# GGUF stores full weight values (for x * weight)
# but vLLM's GemmaRMSNorm expects (weight - 1) since
# it adds 1 during forward pass. Without this
# correction, the model produces gibberish output.
if is_gemma3 and 'norm' in name and len(param.shape) == 1:
    param = param - 1.0
You can put the RMSNorm handling in gemma3 `load_weights`:
vllm/vllm/model_executor/models/gemma2.py
Lines 326 to 330 in 6f9adf6
if self.quant_config and self.quant_config.get_name() == "gguf" \
        and name.endswith("norm.weight"):
    # Revert +1 during llama.cpp conversion
    # see: https://github.com/ggerganov/llama.cpp/blob/2e2f8f093cd4fb6bbb87ba84f6b9684fa082f3fa/convert_hf_to_gguf.py#L3313-L3315
    loaded_weight -= 1
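For Gemma3, the equivalent check could live inside `Gemma3Model.load_weights`. A minimal sketch, assuming vLLM's usual `load_weights` conventions (`self.quant_config`, `(name, loaded_weight)` pairs) — illustrative only, not the exact diff:

```python
def load_weights(self, weights):
    # Sketch: mirror the gemma2.py pattern shown above, applied to Gemma3.
    for name, loaded_weight in weights:
        if (self.quant_config and self.quant_config.get_name() == "gguf"
                and name.endswith("norm.weight")):
            # GGUF stores the full RMSNorm scale; GemmaRMSNorm adds 1 in its
            # forward pass, so revert the +1 applied during GGUF conversion.
            loaded_weight -= 1
        ...  # continue with the regular weight_loader / stacked-params logic
```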
# Handle quantized weights (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, etc.)
if weight_type.name not in ("F32", "F16", "BF16"):
    # For quantized weights, yield raw GGUF tensor data.
    # The GGUF quantization layers will handle
    # dequantization on-demand during inference, keeping
    # weights compressed in GPU memory.
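For context, a simplified, self-contained sketch of this idea — built on the `gguf` Python package rather than vLLM's exact internals, with illustrative tensor names — yielding the raw quantized blocks plus a quantization-type marker:

```python
import gguf
import torch

def quant_weights_iterator(gguf_path: str):
    """Illustrative only: keep quantized GGUF tensors compressed at load time."""
    reader = gguf.GGUFReader(gguf_path)
    for tensor in reader.tensors:
        weight_type = tensor.tensor_type
        if weight_type.name not in ("F32", "F16", "BF16"):
            # Marker consumed by a GGUF-aware layer (shape [1], not a 0-dim scalar).
            yield f"{tensor.name}.qweight_type", torch.tensor([weight_type])
            # Raw quantized blocks; dequantization happens on demand at inference.
            yield f"{tensor.name}.qweight", torch.tensor(tensor.data)
```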
Have you checked if BF16 checkpoint can still work?
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @Isotr0py! Thank you for the feedback! I've updated the PR to move the RMSNorm handling into Gemma3's `load_weights`, as suggested.

Changes made:
Rationale for the correction:
Testing:

Additional Testing Notes

During testing, I encountered a pre-existing vLLM bug with F16/BF16 unquantized GGUF models that is unrelated to this PR: F16/BF16 unquantized GGUF models fail to load.

Affected models tested:
Root cause:

I will raise an issue for that and work on a fix as well. Similar symptoms are reported in vLLM issue #10600, where FP16 GGUF models load but produce nonsensical output. The RMSNorm correction in this PR is correct and dtype-agnostic; the F16/BF16 loading issue is a separate infrastructure bug that should be addressed in a future PR.
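For reference, a quick numerical sanity check (my own illustration, not part of the PR) that subtracting 1.0 at load time reproduces the original GGUF scaling inside GemmaRMSNorm:

```python
import torch

# GemmaRMSNorm scales by (1 + weight); GGUF stores the full scale w_gguf
# (intended for x * w_gguf). Loading (w_gguf - 1) makes the two agree.
x = torch.randn(8)
w_gguf = torch.rand(8) + 0.5                 # stand-in for a GGUF norm weight
reference = x * w_gguf                       # llama.cpp-style computation
corrected = x * (1.0 + (w_gguf - 1.0))       # vLLM GemmaRMSNorm after the fix
print(torch.allclose(reference, corrected))  # True
```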
This commit implements complete GGUF quantization support for Gemma3 models with true Q4_0 compression, addressing gibberish output and enabling ~50% memory reduction.

Changes:
1. gguf_loader.py: Add gemma3_text -> gemma3 model type mapping
2. gemma3.py:
   - Add Gemma3 RMSNorm weight correction (-1.0 offset)
   - Fix qweight_type tensor shape (scalar -> [1])
   - Fix F16 embedding handling (no reshape needed)
   - Enable GGUF quantization in linear layers
   - Handle UninitializedParameter for GGUF layers

Key fixes:
- RMSNorm correction: Gemma3 uses the (1 + weight) convention, but GGUF stores full values, requiring a -1.0 subtraction
- F16 embeddings: GGUF raw data is already in PyTorch layout, so skipping the unnecessary reshape prevents data corruption
- qweight_type shape: GGUF layers expect shape [1], not a scalar []

Tested on:
- 8 Gemma3 variants (1B-27B parameters)
- Both instruction-tuned and pretrained versions
- Q4_0 quantization format
- 100% success rate with coherent text generation

Fixes #14753, #15480

Signed-off-by: Luciano Martins <[email protected]>
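To illustrate the `qweight_type` shape fix mentioned above (my own snippet, not from the diff): a 0-dim scalar tensor and a 1-element tensor are distinct shapes, and the GGUF layers expect the latter.

```python
import torch

weight_type = 2  # e.g. the GGML quantization enum value for Q4_0

scalar = torch.tensor(weight_type)   # shape torch.Size([])  -- 0-dim scalar
fixed = torch.tensor([weight_type])  # shape torch.Size([1]) -- what GGUF layers expect
print(scalar.shape, fixed.shape)
```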
All set, @Isotr0py. Thanks in advance! Luciano Martins.
These changes look reasonable to me. Thanks for the clear documentation
Thanks!
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Luciano Martins <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Luciano Martins <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
Purpose
Fix Gemma3 GGUF quantization support in vLLM, resolving gibberish output and enabling true Q4_0 compression.
Issues Resolved: #14753, #15480
Problems Fixed:
- Model type mismatch: the HF config reports `gemma3_text`, but GGUF expects `gemma3`
Test Plan
Test Environment
Test Command
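The original test command isn't captured here; a minimal, hypothetical way to exercise a Gemma3 GGUF checkpoint through vLLM's Python API (model path and tokenizer name are placeholders) would be:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/gemma-3-1b-it-Q4_0.gguf",  # local GGUF file (placeholder path)
    tokenizer="google/gemma-3-1b-it",         # matching HF tokenizer
)
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```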
Models Tested
Comprehensive validation across 8 Gemma3 variants: 1B, 4B, 12B, and 27B, each in both pretrained (pt) and instruction-tuned (it) versions.
Test Result
Before Fix
ValueError: GGUF model with architecture gemma3 is not supported yet
After Fix
Performance Metrics
| Model | File Size | GPU Memory (Q4_0) | Memory vs BF16 | Status |
|-------|-----------|-------------------|----------------|--------|
| 1B-pt | 665 MB | ~0.37 GB | 55% | PASS |
| 1B-it | 665 MB | ~0.37 GB | 55% | PASS |
| 4B-pt | 2.48 GB | ~1.36 GB | 55% | PASS |
| 4B-it | 2.48 GB | ~1.36 GB | 55% | PASS |
| 12B-pt | 6.75 GB | ~3.71 GB | 55% | PASS |
| 12B-it | 6.75 GB | ~3.71 GB | 55% | PASS |
| 27B-pt | 14.9 GB | ~8.20 GB | 55% | PASS |
| 27B-it | 14.9 GB | ~8.20 GB | 55% | PASS |
Success Rate: 100% (8/8 models)
Sample Output Quality
Prompt: "Hello, my name is"
Output: "Alice and I am a 21 year old student from the UK. I am currently studying..."
Prompt: "The capital of France is"
Output: "Paris, a city renowned for its art, fashion, and culture..."
Prompt: "What is 2+2?"
Output: "2 + 2 = 4"
All outputs are coherent and contextually appropriate.
Changes Made
1. `gguf_loader.py` - Model Type Mapping

Location: `vllm/model_executor/model_loader/gguf_loader.py:66-69`

Added mapping for Gemma3 model type:
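The code block for this change isn't captured here; a sketch of what the described mapping plausibly looks like, assuming the loader normalizes `config.model_type` before resolving the GGUF architecture (illustrative, not the exact diff):

```python
# In gguf_loader.py, when resolving the GGUF architecture name:
model_type = config.model_type
if model_type == "gemma3_text":
    # HF configs report "gemma3_text", but GGUF files use "gemma3".
    model_type = "gemma3"
```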
2. `weight_utils.py` - GGUF Quantization Logic

Location: `vllm/model_executor/model_loader/weight_utils.py:807-862`

Changes:
- Check the GGUF `general.architecture` field for `"gemma3"`
- Apply `param - 1.0` to norm weights (architectural requirement)
- Fix the `qweight_type` shape: `torch.tensor(weight_type)` → `torch.tensor([weight_type])`
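As a standalone illustration of the architecture check — assuming the common `GGUFReader` field-decoding pattern rather than vLLM's exact code:

```python
import gguf

def is_gemma3_gguf(gguf_path: str) -> bool:
    """Illustrative: detect Gemma3 from GGUF metadata."""
    reader = gguf.GGUFReader(gguf_path)
    field = reader.fields["general.architecture"]
    # String values are stored as byte arrays; field.data indexes the value part.
    arch = bytes(field.parts[field.data[0]]).decode("utf-8")
    return arch == "gemma3"
```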
Technical Details:
- GemmaRMSNorm computes `output = x * (1 + weight)`, while standard RMSNorm computes `x * weight`
- GGUF stores full weight values, but vLLM's GemmaRMSNorm expects `weight - 1`, since it adds 1 during the forward pass

Documentation
No user-facing documentation update needed. This fix enables existing GGUF functionality for Gemma3 models without API changes.
Release Notes
Suggested entry for release notes:
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.