Description
Name and Version
Python:
Python 3.12.11
llama-cli:
version: 6602 (72b24d9)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-quantize
Command line
llama-quantize.exe "F:\LIMI-Air\LIMI-Air.gguf" "F:\LIMI-Air\LIMI-Air-Q2_K.gguf" Q2_K
Problem description & steps to reproduce
Running llama-quantize on the GGUF produced by convert_hf_to_gguf.py from GAIR/LIMI-Air (a GLM 4.5 Air finetune) fails with:
D:/a/llama.cpp/llama.cpp/src/llama-quant.cpp:732: GGML_ASSERT((qs.n_attention_wv == n_attn_layer - pruned_attention_w) && "n_attention_wv is unexpected") failed
1. I downloaded the https://huggingface.co/GAIR/LIMI-Air repository using git.
2. I created a Python 3.12 conda environment.
3. I downloaded llama.cpp using git.
4. I installed the requirements with pip install -r requirements.txt.
5. I converted the model with python convert_hf_to_gguf.py --outfile "F:\LIMI-Air\LIMI-Air.gguf" "F:\LIMI-Air\LIMI-Air" --outtype bf16.
6. I ran llama-quantize with llama-quantize.exe "F:\LIMI-Air\LIMI-Air.gguf" "F:\LIMI-Air\LIMI-Air-Q2_K.gguf" Q2_K and hit the assertion above.
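Judging from the assertion message alone, the quantizer appears to compare the number of attn_v.weight tensors it encounters against the expected number of attention layers (block count minus any pruned attention layers). As a quick diagnostic, the converted GGUF can be inspected directly to see how many such tensors it actually contains. This is a minimal sketch, assuming the gguf Python package from llama.cpp's gguf-py directory is installed and that the tensor names follow the usual blk.N.attn_v.weight pattern:

```python
# Count attention value-projection tensors in the converted GGUF and compare
# against the block count reported in the metadata (47 per the log below).
import re
from gguf import GGUFReader  # provided by llama.cpp/gguf-py

reader = GGUFReader(r"F:\LIMI-Air\LIMI-Air.gguf")

attn_v = [t.name for t in reader.tensors
          if re.fullmatch(r"blk\.\d+\.attn_v\.weight", t.name)]
print(f"attn_v.weight tensors found: {len(attn_v)}")
print("glm4moe.block_count reported in metadata: 47")
```

If the number printed differs from the 47 blocks declared in the metadata (one unverified possibility is that the finetune does not ship the NextN/MTP layer's attention tensors, since glm4moe.nextn_predict_layers is 1), that mismatch is what the assertion is guarding against.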
First Bad Commit
No response
Relevant log output
main: build = 6602 (72b24d96)
main: built with clang version 19.1.5 for x86_64-pc-windows-msvc
main: quantizing 'F:\LIMI-Air\LIMI-Air.gguf' to 'F:\LIMI-Air\LIMI-Air-Q2_K.gguf' as Q2_K
llama_model_loader: loaded meta data with 41 key-value pairs and 780 tensors from F:\LIMI-Air\LIMI-Air.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = LIMI Air
llama_model_loader: - kv 3: general.size_label str = 128x8.0B
llama_model_loader: - kv 4: general.license str = apache-2.0
llama_model_loader: - kv 5: general.tags arr[str,5] = ["text-generation", "agent", "tool-us...
llama_model_loader: - kv 6: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 7: glm4moe.block_count u32 = 47
llama_model_loader: - kv 8: glm4moe.context_length u32 = 131072
llama_model_loader: - kv 9: glm4moe.embedding_length u32 = 4096
llama_model_loader: - kv 10: glm4moe.feed_forward_length u32 = 10944
llama_model_loader: - kv 11: glm4moe.attention.head_count u32 = 96
llama_model_loader: - kv 12: glm4moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: glm4moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: glm4moe.expert_used_count u32 = 8
llama_model_loader: - kv 16: glm4moe.attention.key_length u32 = 128
llama_model_loader: - kv 17: glm4moe.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 32
llama_model_loader: - kv 19: glm4moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 20: glm4moe.expert_count u32 = 128
llama_model_loader: - kv 21: glm4moe.expert_feed_forward_length u32 = 1408
llama_model_loader: - kv 22: glm4moe.expert_shared_count u32 = 1
llama_model_loader: - kv 23: glm4moe.leading_dense_block_count u32 = 1
llama_model_loader: - kv 24: glm4moe.expert_gating_func u32 = 2
llama_model_loader: - kv 25: glm4moe.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 26: glm4moe.expert_weights_norm bool = true
llama_model_loader: - kv 27: glm4moe.nextn_predict_layers u32 = 1
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 151329
llama_model_loader: - kv 36: tokenizer.ggml.bos_token_id u32 = 151331
llama_model_loader: - kv 37: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 38: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 39: tokenizer.ggml.eom_token_id u32 = 151338
llama_model_loader: - kv 40: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type bf16: 459 tensors
D:/a/llama.cpp/llama.cpp/src/llama-quant.cpp:732: GGML_ASSERT((qs.n_attention_wv == n_attn_layer - pruned_attention_w) && "n_attention_wv is unexpected") failed