Description
Hi guys
I've just had reports that two specific Q4_0 70B models are outputting gibberish, and I've confirmed the same.
Example file with this issue: https://huggingface.co/TheBloke/Spicyboros-70B-2.2-GGUF/blob/main/spicyboros-70b-2.2.Q4_0.gguf
Second example, made 12 days ago: https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-Creative-GGUF/blob/main/airoboros-l2-70b-2.1-creative.Q4_0.gguf
I've had no reports of problems with other quants. I've tested Q4_K_M and Q5_0 from the same model and commit, and both were fine.
The broken Spicyboros Q4_0 was made with commit d54a402.
At first I thought this was a recently introduced problem, until I realised the second file, made 12 days ago, shows the same issue.
But a 70B Q4_0 I made three days ago, with commit 21ac3a1, is fine: https://huggingface.co/TheBloke/ORCA_LLaMA_70B_QLoRA-GGUF/blob/main/orca_llama_70b_qlora.Q4_0.gguf
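For anyone trying to reproduce, these quants all come out of the usual llama.cpp flow, roughly the following (paths are placeholders, not my exact commands):

```sh
# Convert the HF source model to an f16 GGUF, then quantize to q4_0
python convert.py /path/to/spicyboros-70b-2.2 --outtype f16 --outfile spicyboros-70b-2.2.f16.gguf
./quantize spicyboros-70b-2.2.f16.gguf spicyboros-70b-2.2.Q4_0.gguf q4_0
```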
I notice both broken models were made by Jon Durbin - could there be something in the source model causing this? But only for Q4_0? That's weird.
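If the source weights were somehow responsible, one thing worth checking is whether the Q4_0 block scales in the file itself are corrupt. A minimal sketch of that check, assuming the gguf Python package's GGUFReader (untested, just the shape of the idea):

```python
import numpy as np
from gguf import GGUFReader  # assumption: gguf Python package with a GGUFReader

BLOCK_BYTES = 18  # block_q4_0: 2-byte f16 scale + 16 bytes of 4-bit quants (32 weights)

reader = GGUFReader("spicyboros-70b-2.2.Q4_0.gguf")
for t in reader.tensors:
    if t.tensor_type.name != "Q4_0":
        continue
    # Raw tensor bytes, one row per q4_0 block
    blocks = np.asarray(t.data, dtype=np.uint8).reshape(-1, BLOCK_BYTES)
    # The first two bytes of each block are the f16 scale
    scales = blocks[:, :2].copy().view(np.float16).ravel()
    bad = ~np.isfinite(scales)
    if bad.any():
        print(f"{t.name}: {bad.sum()} / {len(scales)} blocks have non-finite scales")
```

If that printed anything for the broken files but nothing for the good ones, it would point at the quantized data rather than the inference code.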
Full output when testing the Spicyboros 70B Q4_0 GGUF file (too long to post in one comment!): https://gist.github.com/TheBloke/b7a45d3e5ff1432f90aa221de6a5fb08#file-q4_0-gibberish-log
Trimmed log:
(pytorch2) ubuntu@a10:/workspace/git/gguf-llama (master ✔) ᐅ ./main -m /workspace/spicyboros-70B-2.2.Q4_0.gguf -c 4096 -p "A chat.\nUSER: Write a story about llamas\nASSISTANT:" -n 128
Log start
main: build = 1215 (89e8959)
main: seed = 1694547445
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A10, compute capability 8.6
llama_model_loader: loaded meta data with 20 key-value pairs and 723 tensors from /workspace/spicyboros-70B-2.2.Q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 8192, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down.weight q4_0 [ 28672, 8192, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_k.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.attn_v.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.attn_output.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.ffn_down.weight q4_0 [ 28672, 8192, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.2.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 20: blk.2.attn_k.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 21: blk.2.attn_v.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 22: blk.2.attn_output.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 24: blk.2.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 25: blk.2.ffn_down.weight q4_0 [ 28672, 8192, 1, 1 ]
llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 28: blk.3.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 29: blk.3.attn_k.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 30: blk.3.attn_v.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 31: blk.3.attn_output.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 33: blk.3.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 34: blk.3.ffn_down.weight q4_0 [ 28672, 8192, 1, 1 ]
llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.4.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 38: blk.4.attn_k.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 39: blk.4.attn_v.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 40: blk.4.attn_output.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 42: blk.4.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 43: blk.4.ffn_down.weight q4_0 [ 28672, 8192, 1, 1 ]
llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 46: blk.5.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 47: blk.5.attn_k.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 48: blk.5.attn_v.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 49: blk.5.attn_output.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 51: blk.5.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 52: blk.5.ffn_down.weight q4_0 [ 28672, 8192, 1, 1 ]
... trimmed ...
llama_model_loader: - tensor 716: blk.79.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 717: blk.79.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 718: blk.79.ffn_down.weight q4_0 [ 28672, 8192, 1, 1 ]
llama_model_loader: - tensor 719: blk.79.attn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 720: blk.79.ffn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 721: output_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 722: output.weight q6_K [ 8192, 32000, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_0: 561 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 37070.97 MB (+ 1280.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/83 layers to GPU
llm_load_tensors: VRAM used: 0 MB
....................................................................................................
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 561.47 MB
llama_new_context_with_model: VRAM scratch buffer: 560.00 MB
system_info: n_threads = 15 / 30 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 128, n_keep = 0
A chat.\nUSER: Write a story about llamas\nASSISTANT:oid◄◄letteakoÝbrieпіbrieroberiaÝiomcych Insertomengenommen prolong Feder Sebbrie◄ fifigliaÝ Matthoauthandro◄loyee◄ obser cabarfgresloyeeigliaMITgenommen◄тистиbrie stat◄
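The gibberish is obvious at a glance, but for a number to compare against the working Q4_K_M and Q5_0, a perplexity run should quantify the breakage; something like this (with wiki.test.raw being the usual wikitext-2 test file):

```sh
./perplexity -m /workspace/spicyboros-70B-2.2.Q4_0.gguf -f wiki.test.raw -c 4096
```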