llama : support BailingMoE (Ling) #12634
Conversation
Thanks a lot for adding support for this model. I'm giving it a try and will let you know, as I'm most interested in
@CISC I tried it and got:

llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_output.weight' has wrong shape; expected 5376, 7168, got 7168, 5376, 1, 1

Here is the full log:

root@AI:/apool/BailingMoE/llama.cpp/build/bin# ./llama-cli -m /mradermacher/Ling-plus.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 4992 (531b3170) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 2797 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23725 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 1147 tensors from /mradermacher/Ling-plus.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bailingmoe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Ling Plus
llama_model_loader: - kv 3: general.size_label str = 64x18B
llama_model_loader: - kv 4: general.license str = mit
llama_model_loader: - kv 5: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 6: bailingmoe.block_count u32 = 88
llama_model_loader: - kv 7: bailingmoe.context_length u32 = 16384
llama_model_loader: - kv 8: bailingmoe.embedding_length u32 = 5376
llama_model_loader: - kv 9: bailingmoe.feed_forward_length u32 = 12288
llama_model_loader: - kv 10: bailingmoe.attention.head_count u32 = 56
llama_model_loader: - kv 11: bailingmoe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: bailingmoe.rope.freq_base f32 = 600000.000000
llama_model_loader: - kv 13: bailingmoe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: bailingmoe.expert_used_count u32 = 4
llama_model_loader: - kv 15: bailingmoe.attention.key_length u32 = 128
llama_model_loader: - kv 16: bailingmoe.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 1
llama_model_loader: - kv 18: bailingmoe.rope.dimension_count u32 = 128
llama_model_loader: - kv 19: bailingmoe.rope.scaling.type str = none
llama_model_loader: - kv 20: bailingmoe.leading_dense_block_count u32 = 0
llama_model_loader: - kv 21: bailingmoe.vocab_size u32 = 126464
llama_model_loader: - kv 22: bailingmoe.expert_feed_forward_length u32 = 3072
llama_model_loader: - kv 23: bailingmoe.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 24: bailingmoe.expert_count u32 = 64
llama_model_loader: - kv 25: bailingmoe.expert_shared_count u32 = 1
llama_model_loader: - kv 26: bailingmoe.expert_weights_norm bool = true
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = bailingmoe
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,126464] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,126464] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,125824] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 126080
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 126081
llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 126081
llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 37: tokenizer.chat_template str = {% for message in messages %}{% set r...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - type f32: 265 tensors
llama_model_loader: - type f16: 882 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 544.96 GiB (16.00 BPW)
load: special tokens cache size = 262
load: token to piece cache size = 0.8056 MB
print_info: arch = bailingmoe
print_info: vocab_only = 0
print_info: n_ctx_train = 16384
print_info: n_embd = 5376
print_info: n_layer = 88
print_info: n_head = 56
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 64
print_info: n_expert_used = 4
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = none
print_info: freq_base_train = 600000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 16384
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 290B
print_info: model params = 292.54 B
print_info: general.name = Ling Plus
print_info: n_layer_dense_lead = 0
print_info: n_ff_exp = 3072
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 1.0
print_info: expert_weights_norm = 1
print_info: vocab type = BPE
print_info: n_vocab = 126464
print_info: n_merges = 125824
print_info: BOS token = 126080 '<|startoftext|>'
print_info: EOS token = 126081 '<|endoftext|>'
print_info: EOT token = 126081 '<|endoftext|>'
print_info: PAD token = 126081 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 126081 '<|endoftext|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = true)
make_cpu_buft_list: disabling extra buffer types (i.e. repacking) since a GPU device is available
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_output.weight' has wrong shape; expected 5376, 7168, got 7168, 5376, 1, 1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/mradermacher/Ling-plus.gguf'
main: error: unable to load model
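For anyone hitting a similar mismatch: GGUF stores tensor dimensions fastest-varying first, the reverse of PyTorch's shape order, so this error means the loader's declared shape and the shape the converter wrote disagree about which factor comes first. A quick way to see what a file actually contains is gguf-py's reader; a minimal sketch, assuming the gguf package from llama.cpp's gguf-py is installed and using the path from the log:

```python
# Minimal sketch: dump the stored shape of the offending tensor with gguf-py.
from gguf import GGUFReader

reader = GGUFReader("/mradermacher/Ling-plus.gguf")
for tensor in reader.tensors:
    if tensor.name == "blk.0.attn_output.weight":
        # For this model, n_head * head_dim = 56 * 128 = 7168 and
        # n_embd = 5376; the loader and the file disagree on their order.
        print(tensor.name, tensor.shape)
```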
@nicoboss Oops, should be fixed now! Thank you for testing!
Thank you @CISC! Quantization works great, but I'm getting a small error during inference:

It does, however, continue to work despite the brief error. Edit: I used this command:
Yes, I noticed this too, not sure why. @ochafik, mind having a look?
Thanks a lot for fixing it. I retested, and I'm quite positively surprised by the performance: despite the model being 290B, using Q4_K_M quants I'm getting over 6 tokens/second of text generation on a Ryzen Threadripper PRO 7975WX system with 512 GiB of octa-channel DDR5-4800, without offloading any layers to the GPU or using speculative decoding.
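For a memory-bandwidth-bound MoE decode that figure is plausible. A rough sanity check in Python, using the metadata from the log above and two labeled assumptions (Q4_K_M at roughly 4.8 bits per weight, and the shared expert sized like a routed one):

```python
# Back-of-envelope decode speed for Ling-plus from the logged metadata.
# Assumptions (not from the source): Q4_K_M ~4.8 bits/weight, shared expert
# the same size as a routed one, decoding purely memory-bandwidth bound.
n_layer, n_embd, head_dim = 88, 5376, 128
n_head, n_head_kv = 56, 8
n_ff_exp, n_experts_used = 3072, 4 + 1                    # 4 routed + 1 shared

attn = n_embd * head_dim * (2 * n_head + 2 * n_head_kv)   # wq, wk, wv, wo
moe = n_experts_used * 3 * n_embd * n_ff_exp              # gate/up/down each
active = n_layer * (attn + moe) + 126464 * n_embd         # + output head

bytes_per_token = active * 4.8 / 8
bandwidth = 8 * 4800e6 * 8            # 8 channels x 4800 MT/s x 8 bytes
print(f"~{active / 1e9:.1f}B active params, "
      f"ceiling ~{bandwidth / bytes_per_token:.0f} tok/s")
```

This prints a ceiling of roughly 17 tok/s on about 30B active parameters, so the observed 6+ tok/s is around a third of the theoretical bandwidth limit, a typical efficiency for CPU decoding.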
@ggerganov I'm done nit-picking, ready for review. :)
I'm unable to run this with the latest build (attempting Ling-lite). Conversion and quantization work without issue, but when attempting to calculate the imatrix I get stuck on:

When attempting inference I get:

even with this prompt:
LM Studio gives a different error regarding the regex:
Seems to be coming from llama.cpp's unicode code, but could be unrelated somehow.
Hmmm, strange.

This is harmless as long as you don't use
No error when using llama.cpp, so maybe an LM Studio issue?
I can confirm the imatrix computation getting stuck at
Yep, tested just now, same. Very strange.
I've seen multiple reports of users on the latest llama.cpp being unable to run quants of Ling-lite on Hugging Face due to this error.

See https://huggingface.co/mradermacher/Ling-lite-GGUF/discussions/1 and https://huggingface.co/mradermacher/Ling-lite-GGUF/discussions/2. I was unable to reproduce this issue myself so far using
Yeah, if it was just the inference or LM Studio I wouldn't be as positive about it, but the imatrix not working is concerning.
Also, I wasn't using --jinja; I provided my own prompt with
@CISC I just found another concerning issue: Ling-lite-base fails convert_hf_to_gguf.py on the latest llama.cpp with the following error, making this specific model unusable:
No, the error is the result of the chat template being parsed for tool calling even though
Ok, spotted something that looks like an error in the original regex, testing...
Well, not an error per se; the issue seems to be the possessive (?+ and ++) expressions. Making them greedy fixes it, but I'm not sure that's really a fix...
Let's try atomic grouping instead.
Atomic grouping doesn't seem to be supported. :( Let's just go with greedy...
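For reference on what the swap gives up: a possessive quantifier matches like a greedy one but never hands characters back when the rest of the pattern fails, so the two variants can disagree on some inputs. A toy illustration in Python (possessive quantifiers and atomic groups require Python 3.11+; this is not the actual pretokenizer regex):

```python
# Toy example: greedy quantifiers backtrack, possessive/atomic ones do not.
import re  # Python 3.11+ for ++ and (?>...)

s = "123"
print(re.fullmatch(r"\d+3", s))      # match: \d+ backtracks to "12", leaving "3"
print(re.fullmatch(r"\d++3", s))     # None: \d++ keeps "123" and won't give it back
print(re.fullmatch(r"(?>\d+)3", s))  # None: atomic grouping behaves the same way
```

So going greedy is only safe if the surrounding pattern never relied on the no-backtracking behavior to reject a match, which appears to be the bet being taken here.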
Yeah, I never tested the base model, but I see the issue: for some reason head_dim is 0:
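A head_dim of 0 usually means the base model's config.json omits the field and the conversion code reads it without a fallback. A hedged sketch of the usual defensive pattern (a hypothetical helper, not the actual patch): derive the value from hidden_size // num_attention_heads when the key is missing or zero.

```python
# Hypothetical helper showing the common fallback when config.json omits
# head_dim; the actual fix in the follow-up PR may differ.
def get_head_dim(hparams: dict) -> int:
    head_dim = hparams.get("head_dim")
    if not head_dim:  # covers both a missing key and an explicit 0
        head_dim = hparams["hidden_size"] // hparams["num_attention_heads"]
    return head_dim

# e.g. a config with hidden_size=2048 and 16 attention heads yields 128
assert get_head_dim({"hidden_size": 2048, "num_attention_heads": 16}) == 128
```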
@CISC Thanks a lot for trying to fix the
Sigh, sorry about that; late-night bug fixing is not a great idea, I guess. New PR up. :)
Add support for BailingMoE (Ling)

Ling-plus is untested due to size; it would be great if someone with the resources could try it!

NOTE: They mention YaRN in their docs, but it's not enabled on any of the models and the code does not support it, so it's not added for now; will revisit if this changes.
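If YaRN is enabled upstream later, wiring it through at conversion time should mostly be a matter of writing the rope-scaling metadata. A minimal sketch with gguf-py (the writer methods are real; the filename, scale factor, and original context length are placeholders, not values from any released Ling model):

```python
# Sketch: what enabling YaRN at conversion time might look like.
import gguf

writer = gguf.GGUFWriter("ling.gguf", arch="bailingmoe")  # placeholder path
writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
writer.add_rope_scaling_factor(4.0)          # placeholder scale factor
writer.add_rope_scaling_orig_ctx_len(16384)  # pre-scaling training context
```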