Certain 70B Q4_0 quants outputting gibberish (other quant formats unaffected) #3148
Comments
I also noticed token candidates are almost identical for each (next) token:
...
The model is definitely broken, the ppl of the first blocks is …
I just made another for https://huggingface.co/TheBloke/Spicyboros-70B-2.2-GGUF/blob/main/spicyboros-70b-2.2.Q4_0.gguf, this time with commit 4f7cd6b. The file came out identical - same sha256sum - and therefore of course the same gibberish output. Very odd! I don't know if it's of any help, but here's the full log of making the new q4_0: first making the FP16, then the q4_0. We know the FP16 is fine because all the other quants are fine: https://gist.github.com/TheBloke/6fe3bb4d870e45c97acb71772906caaf#file-quant-spicyboros-q4_0-log
For what it's worth, I looked at the mean, min and max of each tensor and compared them to the Q4_K_S model, and I didn't see anything obviously out of place. The tokenizer also looks fine.
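For reference, a minimal sketch of that kind of per-tensor check, assuming the tensors are already accessible as numpy arrays; the loader name below is hypothetical, substitute whatever reader you use (gguf-py, safetensors, torch):

```python
import numpy as np

def tensor_stats(name, w):
    # Summarize one weight tensor so suspicious outliers stand out at a glance.
    w = np.asarray(w, dtype=np.float32)
    return (f"{name:40s} mean={w.mean():+.5f} "
            f"min={w.min():+.4f} max={w.max():+.4f} std={w.std():.5f}")

# Hypothetical usage -- load_all_tensors() stands in for your (name, array) iterator:
# for name, w in load_all_tensors("spicyboros-70b-2.2.fp16.gguf"):
#     print(tensor_stats(name, w))
```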
Those are the only changes in … And 2 weeks ago: 5d6f19f#diff-6745585c496560d324d1f0d6d77beebcb6dd9c3354bef41ab262535a87a376a7
Other than that, all the changes were cosmetic, all the way back to the gguf merge. So whatever got borked, it's in either of those.
@cebtenzzre That commit was about gcc warning fixes, and that is a functional change, wasn't that …
No. If the condition is true, the function returns, so the only way to get to that line is if the condition was false - the 'else' is unnecessary.
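A tiny generic illustration of that point (not the actual llama.cpp code): the two variants below behave identically, so dropping the `else` is purely cosmetic.

```python
def clamp_with_else(x, lo):
    if x < lo:
        return lo
    else:            # redundant: this branch is only reached when the condition is false
        return x

def clamp_without_else(x, lo):
    if x < lo:
        return lo
    return x         # same behaviour, one less level of nesting

assert clamp_with_else(3, 5) == clamp_without_else(3, 5) == 5
assert clamp_with_else(7, 5) == clamp_without_else(7, 5) == 7
```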
I ran …
Does anybody have a link to an f16 of any of the mentioned models? I can run a script overnight to find out whether the checksum changes across commits.
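Something like the following is a rough sketch of that overnight run; the commit list, file names and make/quantize invocations are my assumptions, adjust them to your checkout:

```python
import hashlib
import subprocess

COMMITS = ["21ac3a1", "4f7cd6b", "d54a402"]   # commits mentioned in this thread
FP16_IN = "spicyboros-70b-2.2.fp16.gguf"      # hypothetical file name
Q4_OUT  = "test.q4_0.gguf"

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for commit in COMMITS:
    subprocess.run(["git", "checkout", commit], check=True)
    subprocess.run(["make", "clean"], check=True)
    subprocess.run(["make", "-j", "quantize"], check=True)
    subprocess.run(["./quantize", FP16_IN, Q4_OUT, "q4_0"], check=True)
    print(commit, sha256(Q4_OUT))             # identical hashes => the commit didn't change q4_0
```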
Ones that reproduce the gibberish: …
One that was apparently OK on an earlier commit: …
Yes, I've seen those, but aren't they raw f32? That's not a problem, it's just that with f16 I could run wget && the script right now, while with raw I'm going to have to convert them in the morning and results would probably be tomorrow evening.
Edit: It's not that bad, HF isn't throttling much this time, only a 20 min download.
No, a 145 GiB 70B should be fp16. I think most HF uploads are. Compare to TheBloke/Llama-2-70B-fp16.
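Rough arithmetic behind that, treating the model as exactly 70B parameters (which it isn't):

```python
params = 70e9
print(f"fp16 ~ {params * 2 / 2**30:.0f} GiB, fp32 ~ {params * 4 / 2**30:.0f} GiB")
# fp16 ~ 130 GiB, fp32 ~ 261 GiB -- so a ~145 GiB upload is fp16-sized, not fp32-sized
```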
Ok. Edit: I'll finish tomorrow, it's like 5 in the morning and I can't see what I'm missing here:
I downloaded spicyboros from HF through git / git lfs, …
Something is wrong with your sentencepiece install. Here's what mine looks like:
```
$ python3 -m pip show sentencepiece | grep Version
Version: 0.1.99
$ python3 -c 'import sentencepiece; print(sentencepiece.SentencePieceProcessor.__init__)'
<function SentencePieceProcessor.Init at 0x7feeb7464ae0>
```
Thanks for looking at this, guys. I tried going back to an earlier commit from August 28th, shortly after the GGUFv2 release - commit ebcee20. I made a new FP16 with the convert.py from that commit, and made a new q4_0 of Spicyboros 70B 2.2, and it has exactly the same problem. So I'm thinking this isn't a new problem caused by a recent commit. It's something broken with GGUF q4_0 only, and on specific models only. Which is very weird.
I guess …
Here is what the quant histograms look like for vanilla LLaMA v2 70B:
```
[ 139/ 723] blk.15.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 140/ 723] blk.15.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020
[ 141/ 723] blk.15.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.076 0.056 0.039 0.025 0.021
[ 142/ 723] blk.15.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021
[ 143/ 723] blk.15.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 144/ 723] blk.15.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 145/ 723] blk.15.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 146/ 723] blk.15.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 147/ 723] blk.15.ffn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 148/ 723] blk.16.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 149/ 723] blk.16.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020
[ 150/ 723] blk.16.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021
[ 151/ 723] blk.16.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021
[ 152/ 723] blk.16.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 153/ 723] blk.16.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 154/ 723] blk.16.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 155/ 723] blk.16.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 156/ 723] blk.16.ffn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 157/ 723] blk.17.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 158/ 723] blk.17.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.077 0.056 0.038 0.025 0.020
[ 159/ 723] blk.17.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021
[ 160/ 723] blk.17.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021
[ 161/ 723] blk.17.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 162/ 723] blk.17.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 163/ 723] blk.17.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
```
Notice the Gaussian-shaped distribution, with …
Here is how the histograms look for Spicyboros:
```
[ 137/ 723] blk.15.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.043 0.025 0.063 0.073 0.081 0.143 0.090 0.143 0.081 0.074 0.068 0.015 0.043 0.016
[ 138/ 723] blk.15.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.043 0.024 0.063 0.072 0.079 0.145 0.091 0.145 0.079 0.072 0.068 0.015 0.043 0.016
[ 139/ 723] blk.15.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.044 0.024 0.063 0.071 0.079 0.145 0.090 0.145 0.079 0.072 0.069 0.015 0.044 0.016
[ 140/ 723] blk.15.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.045 0.025 0.064 0.071 0.078 0.144 0.088 0.144 0.078 0.072 0.070 0.015 0.045 0.016
[ 141/ 723] blk.15.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.063 0.074 0.081 0.141 0.088 0.141 0.081 0.075 0.068 0.015 0.044 0.016
[ 142/ 723] blk.15.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.064 0.071 0.078 0.144 0.088 0.144 0.078 0.073 0.070 0.015 0.045 0.016
[ 143/ 723] blk.15.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.065 0.071 0.078 0.145 0.089 0.146 0.077 0.072 0.070 0.015 0.045 0.016
[ 144/ 723] blk.15.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 145/ 723] blk.15.ffn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 146/ 723] blk.16.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.043 0.025 0.062 0.073 0.081 0.142 0.090 0.142 0.081 0.075 0.067 0.015 0.043 0.016
[ 147/ 723] blk.16.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.043 0.025 0.063 0.072 0.079 0.145 0.091 0.145 0.079 0.073 0.068 0.015 0.043 0.016
[ 148/ 723] blk.16.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.044 0.024 0.063 0.072 0.080 0.143 0.090 0.144 0.080 0.074 0.069 0.014 0.044 0.016
[ 149/ 723] blk.16.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.045 0.025 0.064 0.071 0.078 0.143 0.088 0.144 0.078 0.073 0.070 0.015 0.045 0.016
[ 150/ 723] blk.16.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.063 0.074 0.081 0.141 0.088 0.141 0.081 0.075 0.068 0.015 0.044 0.016
[ 151/ 723] blk.16.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.064 0.072 0.079 0.144 0.088 0.144 0.079 0.073 0.069 0.015 0.045 0.016
[ 152/ 723] blk.16.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.065 0.071 0.078 0.145 0.089 0.146 0.077 0.072 0.070 0.015 0.045 0.016
[ 153/ 723] blk.16.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 154/ 723] blk.16.ffn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 155/ 723] blk.17.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.043 0.025 0.063 0.073 0.080 0.143 0.089 0.143 0.080 0.074 0.068 0.015 0.043 0.016
[ 156/ 723] blk.17.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.043 0.024 0.063 0.071 0.079 0.145 0.091 0.145 0.079 0.073 0.069 0.015 0.043 0.016
[ 157/ 723] blk.17.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.044 0.025 0.064 0.071 0.078 0.145 0.090 0.146 0.078 0.072 0.069 0.015 0.044 0.016
[ 158/ 723] blk.17.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.045 0.025 0.064 0.071 0.078 0.144 0.088 0.144 0.078 0.073 0.070 0.015 0.045 0.016
[ 159/ 723] blk.17.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.043 0.025 0.062 0.075 0.083 0.139 0.089 0.139 0.083 0.076 0.067 0.016 0.043 0.017
[ 160/ 723] blk.17.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.063 0.073 0.080 0.143 0.088 0.143 0.080 0.074 0.069 0.015 0.044 0.016
[ 161/ 723] blk.17.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.065 0.071 0.078 0.145 0.089 0.146 0.077 0.072 0.070 0.015 0.045 0.016
[ 162/ 723] blk.17.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
```
It might be useful to plot the weight distribution in some of the tensors to get a better idea of what is going on. It would also be interesting to understand the specific reason for …
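For anyone who wants to try that, a small sketch for reproducing the bucket histogram and inspecting a tensor outside of quantize. The scaling below mimics what q4_0 does from memory (32-element blocks, scale derived from the largest-magnitude value), so treat it as an approximation rather than the reference implementation; the loader is hypothetical.

```python
import numpy as np

def q4_0_hist(w, block=32):
    # Tensor size must be divisible by the block size (true for these 8192-wide tensors).
    x = np.asarray(w, dtype=np.float32).reshape(-1, block)
    idx = np.abs(x).argmax(axis=1)
    maxv = x[np.arange(x.shape[0]), idx]                       # signed value with largest magnitude
    d = maxv / -8.0
    d[d == 0] = 1.0
    q = np.clip(x / d[:, None] + 8.5, 0, 15).astype(np.int64)  # 4-bit bucket index per weight
    return np.bincount(q.ravel(), minlength=16) / q.size

# Hypothetical usage:
# hist = q4_0_hist(load_tensor("blk.15.attn_q.weight"))
# print(" ".join(f"{p:.3f}" for p in hist))
# A healthy tensor gives the smooth, Gaussian-looking row above; the broken models
# instead show 0.000 in bucket 1 and twin spikes around buckets 7 and 9.
```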
Ah, this is interesting. I recall Jon Durbin telling me that he had implemented a suggestion from Tim Dettmers:
The 70B Jon Durbin models were made with qLoRA. But rather than merging the qLoRA adapter in 16-bit as usual, I believe he first quantised the source weights to 4-bit using BitsAndBytes and then merged the qLoRA adapter in 4-bit, before saving in 16-bit. I then quantised the 16-bit weights as normal.
I believe this is the code Jon used, which is based on Tim's suggestion: https://gist.github.com/ChrisHayduk/1a53463331f52dca205e55982baf9930
In hindsight, that is almost certainly what's different about Jon's recent 70Bs and what's causing GGUF 70B Q4_0 to break. @jondurbin could you confirm that I'm remembering correctly, and that you followed this new Tim Dettmers procedure for your 70B models?
Apparently this method will soon be available in HF PEFT, so the practice is going to become commonplace and this is likely to be an ongoing issue. I will stop making Q4_0 for 70B Jon Durbin models for now, and keep an eye out for this happening with models from other creators too.
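To make the suspected mechanism concrete, here is a deliberately simplified, hypothetical sketch of the idea (not Jon's actual code, which is in the linked gist and uses bitsandbytes NF4): round-trip the base weights through 4-bit, merge the LoRA delta against that dequantised copy, and save the result in 16-bit.

```python
import numpy as np

def quant_dequant_4bit(w, block=64):
    """Toy symmetric 4-bit round-trip (a crude stand-in for NF4, not the real thing)."""
    x = w.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    return (np.round(x / scale).clip(-8, 7) * scale).reshape(w.shape)

rng = np.random.default_rng(0)
base   = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)
lora_a = rng.normal(scale=0.01, size=(1024, 16)).astype(np.float32)
lora_b = rng.normal(scale=0.01, size=(16, 1024)).astype(np.float32)

# "Pre-quantised" merge: the LoRA delta is added on top of 4-bit round-tripped base
# weights, then saved as fp16. A conventional merge would add the delta to `base`.
merged = (quant_dequant_4bit(base) + lora_a @ lora_b).astype(np.float16)

# Each block of the round-tripped base collapses onto at most 15 discrete levels; the
# small LoRA delta only jitters those levels slightly, so the merged fp16 tensor keeps
# a gridded, non-Gaussian per-block distribution -- plausibly what the anomalous q4_0
# histograms above are picking up.
one_block = quant_dequant_4bit(base).reshape(-1, 64)[0]
print("distinct values in one 64-weight block after the round-trip:",
      np.unique(one_block).size)
```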
Indeed, here's the exact script I used: … Specifically:
```
python qlora/qmerge.py \
--base llama-2-70b-hf \
--peft spicyboros-70b-2.2-checkpoints/checkpoint-750/model_adapter \
--out spicyboros-70b-2.2
```
I can upload a non-prequantized merge version too, let me know.
Can confirm a regular merge with main llama.cpp works fine with q4_0.
@cebtenzzre It did, thank you. I ran the q4_0 quant on spicyboros and got an identical checksum to TheBloke's.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hi guys
I've just had reports that two specific Q4_0 70B models are outputting gibberish, and I've confirmed the same.
Example file with this issue: https://huggingface.co/TheBloke/Spicyboros-70B-2.2-GGUF/blob/main/spicyboros-70b-2.2.Q4_0.gguf
Second example, made 12 days ago: https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-Creative-GGUF/blob/main/airoboros-l2-70b-2.1-creative.Q4_0.gguf
I've had no reports of problems with other quants. I've tested Q4_K_M and Q5_0 from the same model and commit, and both were fine.
The bad Spicyboros q4_0 was made with commit d54a402.
At first I thought it was a recent problem until I realised there was also a file from 12 days ago with the same issue.
But a 70B q4_0 I made three days ago, with commit 21ac3a1, is fine: https://huggingface.co/TheBloke/ORCA_LLaMA_70B_QLoRA-GGUF/blob/main/orca_llama_70b_qlora.Q4_0.gguf
I notice both broken models were made by Jon Durbin - could there be something in the source model causing this? But only for q4_0? That's weird.
Full output when testing the Spicyboros 70B Q4_0 GGUF file (too long to post in one comment!): https://gist.github.com/TheBloke/b7a45d3e5ff1432f90aa221de6a5fb08#file-q4_0-gibberish-log
Trimmed log: …