[BUG] Qwen 2.5 34B returns garbage at certain quantization levels, but not others #628

Downtown-Case · 2024-09-19T18:11:49Z

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

2.3, 2.4, 2.6 nightly, flash-attn and xformers built from source, exllama built from master branch

Describe the bug

Qwen 2.5 34B returns garbage output with certain quantizations above 4bpw, but not ones below 4bpw.

Possibly related to #621 or #627

What's unusual is that lower quantizations work, but higher ones do not.

These two quants work for me:

While this one (and a 4.04 I had locally) return garbage:

https://huggingface.co/Downtown-Case/Qwen_Qwen2.5-32B-Base-exl2-4.1bpw

Here's an example command I used for quantization: python convert.py --in_dir "/home/down/Models/Raw/Qwen_Qwen2.5-32B" -o "/home/down/FastStorage/scratch2" -m "/home/down/Models/calibration/Q32-base.json" -b 4.0 -hb 6 -cf "/home/down/Models/exllama/Qwen_Qwen2.5-32B-exl2-4.0bpw" -nr --fast_safetensors

Re-doing the calibration from scratch doesn't seem to make a difference, and that same calibration was used for the sub 4bpw quantizations.

I tried quantizing at 4.1/4.04 bpw in multiple pytorch environments, with different versions of flash-attention installed, remaking the measurements json from scratch, and so on. My test is an 75K context story at Q4 cache quantization, simply continuing it in exui. Again, the sub 4bpw quantization continue it coherently while the ones over 4bpw return garbled english, with no errors in the console.

I'm running through more troubleshooting steps now (like trying different levels of cache quantization and making more quantizations), but figured I'd post early since others seem to be having issues with Qwen.

Acknowledgements

I have looked for similar issues before submitting this one.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will ask my questions politely.

The text was updated successfully, but these errors were encountered:

Downtown-Case · 2024-09-19T18:41:06Z

Does not appear to be a quantized KV cache issue, FP16 cache returns the same garbled english.

Brand new 4bpw quantization also returns the same garbled english.

turboderp · 2024-09-19T19:04:07Z

Can you tell me more about how you're prompting the model to get garbage? If I try your 4.1bpw version, it seems to be working fine, both in 0.2.2 master and 0.2.2 dev, with FP16 or Q4 cache. Doesn't seem to break either way.

Is it possible you're running low on VRAM or something?

Downtown-Case · 2024-09-19T19:26:19Z

I have a few gigabytes of vram to spare loading it at short context.

If I load it into exui, even with the default prompt of "Once upon a time" it just starts looping and looping garbled english with the 4.1 quant, but the 3.75 is fine.

...I know, lol. I'm currently trying to reproduce it with a super minimal exllama script, and working my way up from there.

Downtown-Case · 2024-09-19T19:39:33Z

...I'm a moron. I overwrote an ancient test model in exui, and it turns out RoPE scale was set at 4.0.

I appreciate the quick response anyway!

For reference, Qwen 2.5 doesn't seem to mind Q4 cache like Qwen 2 does.

Originalimoc · 2024-11-17T16:36:14Z

It actually does. Try 7B with Q4, start tokens fine then quickly outputting garbage. But Q6+ performance does not seems to matter Q6 Q8 F16 similar rate of answer correctly. 14B+ Q4 is (mostly) fine.

Downtown-Case · 2024-11-20T17:46:00Z

You're talking about weights quantization, not cache right?

Originalimoc · 2024-12-13T19:38:21Z

Cache Q4

turboderp · 2024-12-14T19:40:47Z

Qwen2.x 7B specifically is known to have issues with the Q4 cache mode in ExLlama due to oddly normalized key vectors. Q6 and Q8 should work fine, and 7B already has a very compact key/value cache to begin with (probably related to why it's harder to quantize.)

Downtown-Case added the bug Something isn't working label Sep 19, 2024

Downtown-Case closed this as completed Sep 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[BUG] Qwen 2.5 34B returns garbage at certain quantization levels, but not others #628

[BUG] Qwen 2.5 34B returns garbage at certain quantization levels, but not others #628

Downtown-Case commented Sep 19, 2024 •

edited

Loading

Downtown-Case commented Sep 19, 2024 •

edited

Loading

Uh oh!

turboderp commented Sep 19, 2024

Uh oh!

Downtown-Case commented Sep 19, 2024 •

edited

Loading

Uh oh!

Downtown-Case commented Sep 19, 2024

Uh oh!

Originalimoc commented Nov 17, 2024

Uh oh!

Downtown-Case commented Nov 20, 2024

Uh oh!

Originalimoc commented Dec 13, 2024

Uh oh!

turboderp commented Dec 14, 2024

Uh oh!

Uh oh!

[BUG] Qwen 2.5 34B returns garbage at certain quantization levels, but not others #628

[BUG] Qwen 2.5 34B returns garbage at certain quantization levels, but not others #628

Comments

Downtown-Case commented Sep 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

OS

GPU Library

Python version

Pytorch version

Describe the bug

Acknowledgements

Downtown-Case commented Sep 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

turboderp commented Sep 19, 2024

Uh oh!

Downtown-Case commented Sep 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Downtown-Case commented Sep 19, 2024

Uh oh!

Originalimoc commented Nov 17, 2024

Uh oh!

Downtown-Case commented Nov 20, 2024

Uh oh!

Originalimoc commented Dec 13, 2024

Uh oh!

turboderp commented Dec 14, 2024

Uh oh!

Downtown-Case commented Sep 19, 2024 •

edited

Loading

Downtown-Case commented Sep 19, 2024 •

edited

Loading

Downtown-Case commented Sep 19, 2024 •

edited

Loading