Quant fallback to 8w per token + other quant improvements for multimodal #154

jackzhxng · 2025-09-24T23:21:31Z

Big quantization improvements for Gemma3 4B vision (7.4 GB -> 3.0 GB)

Quantize the encoder with 8da4w (group size 32) where possible; otherwise fall back to 8da8w per token (this applies to the fc2 layers)
Quantize the LM head

optimum-cli export executorch \
    --model google/gemma-3-4b-it \
    --task "multimodal-text-to-text" \
    --max_seq_len 1024 \
    --recipe "xnnpack" \
    --use_custom_sdpa \
    --use_custom_kv_cache \
    --qlinear 8da4w \
    --qlinear_group_size 32 \
    --qlinear_encoder 8da4w,8da8w \
    --qlinear_encoder_group_size 32 \
    --qembedding 8w \
    --output_dir="gemma3_vision"
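The fallback described above ("8da4w group size 32 where possible, else 8da8w") can be sketched roughly as a per-layer choice based on whether the input dimension divides evenly into groups. This is a minimal illustration, not the code in quantization.py: `LinearShape`, `pick_linear_scheme`, the layer names, and the layer shapes are all hypothetical and chosen only to show one layer that fits the group size and one that does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearShape:
    """Hypothetical stand-in for a linear layer; only the shape matters here."""

    name: str
    in_features: int
    out_features: int


def pick_linear_scheme(layer, group_size=32, primary="8da4w", fallback="8da8w"):
    """Pick a quantization scheme name for one layer (illustrative only).

    Group-wise quantization needs the input dimension to split evenly into
    groups of `group_size`; when it does not, return the fallback scheme.
    """
    if layer.in_features % group_size == 0:
        return primary
    return fallback


# Illustrative shapes (assumed, not read from any checkpoint).
layers = [
    LinearShape("encoder.mlp.fc1", in_features=1152, out_features=4304),
    LinearShape("encoder.mlp.fc2", in_features=4304, out_features=1152),
]

plan = {layer.name: pick_linear_scheme(layer) for layer in layers}
print(plan)
```

With these assumed shapes, fc1 keeps the primary scheme because 1152 is a multiple of 32, while fc2 takes the fallback because 4304 is not.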


HuggingFaceDocBuilderDev · 2025-09-24T23:25:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


metascroy · 2025-09-29T17:24:13Z

optimum/exporters/executorch/quantization.py

+                fallback_linear_config_key = None
        else:
-            assert qlinear_group_size % 2 == 0, "Linear quantization group size must be a multiple of 2."
+            assert qlinear_group_size % 2 == 0, f"Linear quantization group size must be a multiple of 2, got {qlinear_group_size}."


Why is the group size a multiple of 2? Shouldn't it be a multiple of 32?


larryliu0820 · 2025-10-08T19:15:56Z

optimum/exporters/executorch/tasks/multimodal_text_to_text.py

+    quantize_lm_head_kwargs = {
+        "eager_model": eager_model.lm_head,
+        "qlinear_config": qlinear_config,
+    }


Can you guard this by whether eager_model has lm_head?

jackzhxng · Oct 8, 2025

Sure. Curious though, is there a model without lm_head?

larryliu0820 · Oct 17, 2025

Yeah, Voxtral doesn't have lm_head.
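One way the requested guard could look is sketched below. It is an assumption about the shape of the fix, not the change that landed in multimodal_text_to_text.py: `build_lm_head_kwargs` is a hypothetical helper, and the models are plain placeholder objects rather than real eager models.

```python
from types import SimpleNamespace


def build_lm_head_kwargs(eager_model, qlinear_config):
    """Return kwargs for quantizing the LM head, or None if there is no lm_head.

    Hypothetical helper: it only builds the dict shown in the diff and skips
    models that do not expose an `lm_head` attribute.
    """
    lm_head = getattr(eager_model, "lm_head", None)
    if lm_head is None:
        return None
    return {"eager_model": lm_head, "qlinear_config": qlinear_config}


# Placeholder objects standing in for real models.
with_head = SimpleNamespace(lm_head="lm_head module placeholder")
without_head = SimpleNamespace()

print(build_lm_head_kwargs(with_head, "8da4w"))
print(build_lm_head_kwargs(without_head, "8da4w"))
```

The caller can then skip LM head quantization whenever the helper returns None, which covers models that have no lm_head.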

Implement quantization fallback to 8w per channel if block size is in…

b1d4dab

…divisible

metascroy reviewed Sep 25, 2025

optimum/exporters/executorch/quantization.py (outdated, resolved)

metascroy reviewed Sep 25, 2025

optimum/exporters/executorch/quantization.py (outdated, resolved)

Add fallback logic

da772d2

metascroy reviewed Sep 29, 2025

optimum/exporters/executorch/quantization.py (resolved)

jackzhxng force-pushed the jz/quantize-fallback branch 2 times, most recently from 9238ad0 to 3b3ae50 Compare October 7, 2025 20:42

jackzhxng changed the title ~~Implement quantization fallback to 8w per channel~~ Implement quantization fallback to 8w per channel + other quant improvements for multimodal Oct 7, 2025

jackzhxng changed the title ~~Implement quantization fallback to 8w per channel + other quant improvements for multimodal~~ Quant fallback to 8w per token + other quant improvements for multimodal Oct 7, 2025

jackzhxng marked this pull request as ready for review October 7, 2025 21:05

larryliu0820 approved these changes Oct 7, 2025


jackzhxng force-pushed the jz/quantize-fallback branch from 3b3ae50 to d2f238e Compare October 8, 2025 17:06

Works

a872c53

jackzhxng force-pushed the jz/quantize-fallback branch from d2f238e to a872c53 Compare October 8, 2025 18:02

larryliu0820 reviewed Oct 8, 2025
