
Misc. bug: llama-quantize LIMI Air (GLM 4.5 Air finetune) fails with n_attention_wv is unexpected #16283

@GeckoLeaf

Description


Name and Version

Python:
Python 3.12.11

llama-cli:
version: 6602 (72b24d9)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-quantize

Command line

llama-quantize.exe "F:\LIMI-Air\LIMI-Air.gguf" "F:\LIMI-Air\LIMI-Air-Q2_K.gguf" Q2_K

Problem description & steps to reproduce

Running llama-quantize on the GGUF produced by convert_hf_to_gguf.py from GAIR/LIMI-Air (a GLM 4.5 Air finetune) fails with D:/a/llama.cpp/llama.cpp/src/llama-quant.cpp:732: GGML_ASSERT((qs.n_attention_wv == n_attn_layer - pruned_attention_w) && "n_attention_wv is unexpected") failed.
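Judging only from the assertion text, qs.n_attention_wv presumably counts the attn_v.weight tensors encountered during quantization and is expected to match the number of attention layers. A quick diagnostic sketch (not part of the original repro) to check what the converted GGUF actually contains, using the gguf Python package that requirements.txt installs; the path is from my setup and the tensor-name pattern is an assumption based on the assertion's wording:

```python
# Diagnostic sketch: list the attn_v.weight tensors in the converted GGUF so the
# count can be compared against glm4moe.block_count (47 in this model).
# Assumes the gguf package from llama.cpp's requirements.txt is installed.
from gguf import GGUFReader

reader = GGUFReader(r"F:\LIMI-Air\LIMI-Air.gguf")
attn_v = [t.name for t in reader.tensors if "attn_v.weight" in t.name]
print(f"attn_v.weight tensors found: {len(attn_v)}")
for name in attn_v:
    print(name)
```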

I downloaded the https://huggingface.co/GAIR/LIMI-Air repository using git.
I created a Python 3.12 conda environment.
I downloaded llama.cpp using git.
I installed the requirements with pip install -r requirements.txt
I used convert_hf_to_gguf.py with the command python convert_hf_to_gguf.py --outfile "F:\LIMI-Air\LIMI-Air.gguf" "F:\LIMI-Air\LIMI-Air" --outtype bf16
I ran llama-quantize using the command llama-quantize.exe "F:\LIMI-Air\LIMI-Air.gguf" "F:\LIMI-Air\LIMI-Air-Q2_K.gguf" Q2_K
I encountered the error. (The same steps are collected into a single script sketch below.)
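For convenience, the steps above as one script. This is only a sketch of the same commands already listed; the paths are from my setup and it assumes it is run from the llama.cpp checkout with llama-quantize.exe on PATH:

```python
# Repro sketch: convert the HF checkpoint to bf16 GGUF, then quantize to Q2_K.
# The second step is where the GGML_ASSERT fires.
import subprocess

model_dir = r"F:\LIMI-Air\LIMI-Air"
bf16_gguf = r"F:\LIMI-Air\LIMI-Air.gguf"
q2k_gguf = r"F:\LIMI-Air\LIMI-Air-Q2_K.gguf"

# Convert the Hugging Face checkpoint to a bf16 GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "--outfile", bf16_gguf, model_dir, "--outtype", "bf16"],
    check=True,
)

# Quantize the bf16 GGUF to Q2_K.
subprocess.run(["llama-quantize.exe", bf16_gguf, q2k_gguf, "Q2_K"], check=True)
```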

First Bad Commit

No response

Relevant log output

main: build = 6602 (72b24d96)
main: built with clang version 19.1.5 for x86_64-pc-windows-msvc
main: quantizing 'F:\LIMI-Air\LIMI-Air.gguf' to 'F:\LIMI-Air\LIMI-Air-Q2_K.gguf' as Q2_K
llama_model_loader: loaded meta data with 41 key-value pairs and 780 tensors from F:\LIMI-Air\LIMI-Air.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = LIMI Air
llama_model_loader: - kv   3:                         general.size_label str              = 128x8.0B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                               general.tags arr[str,5]       = ["text-generation", "agent", "tool-us...
llama_model_loader: - kv   6:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   7:                        glm4moe.block_count u32              = 47
llama_model_loader: - kv   8:                     glm4moe.context_length u32              = 131072
llama_model_loader: - kv   9:                   glm4moe.embedding_length u32              = 4096
llama_model_loader: - kv  10:                glm4moe.feed_forward_length u32              = 10944
llama_model_loader: - kv  11:               glm4moe.attention.head_count u32              = 96
llama_model_loader: - kv  12:            glm4moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                     glm4moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:   glm4moe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                  glm4moe.expert_used_count u32              = 8
llama_model_loader: - kv  16:               glm4moe.attention.key_length u32              = 128
llama_model_loader: - kv  17:             glm4moe.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 32
llama_model_loader: - kv  19:               glm4moe.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       glm4moe.expert_count u32              = 128
llama_model_loader: - kv  21:         glm4moe.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  22:                glm4moe.expert_shared_count u32              = 1
llama_model_loader: - kv  23:          glm4moe.leading_dense_block_count u32              = 1
llama_model_loader: - kv  24:                 glm4moe.expert_gating_func u32              = 2
llama_model_loader: - kv  25:               glm4moe.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  26:                glm4moe.expert_weights_norm bool             = true
llama_model_loader: - kv  27:               glm4moe.nextn_predict_layers u32              = 1
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  36:                tokenizer.ggml.bos_token_id u32              = 151331
llama_model_loader: - kv  37:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  38:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  39:                tokenizer.ggml.eom_token_id u32              = 151338
llama_model_loader: - kv  40:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type bf16:  459 tensors
D:/a/llama.cpp/llama.cpp/src/llama-quant.cpp:732: GGML_ASSERT((qs.n_attention_wv == n_attn_layer - pruned_attention_w) && "n_attention_wv is unexpected") failed
