llama : support BailingMoE (Ling) #12634

Merged
CISC merged 9 commits into ggml-org:master from bailingmoe on Mar 30, 2025

Conversation

@CISC (Collaborator) commented Mar 28, 2025

Add support for BailingMoE (Ling)

Ling-plus is untested due to size, great if someone has the resources to do so!

NOTE: They mention YaRN in their docs, but it's not enabled on any of the models and the code does not support it, so it's not added for now; will revisit if this changes.

@github-actions github-actions bot added the python (python script changes) label Mar 28, 2025
@CISC CISC requested review from ggerganov and slaren March 28, 2025 22:32
@CISC CISC added the model (Model specific) label Mar 28, 2025
@nicoboss (Contributor)

Thanks a lot for adding support for this model.

Ling-plus is untested due to size, great if someone has the resources to do so!

I'm giving it a try and will let you know, as I'm most interested in Ling-plus.

@nicoboss (Contributor)

@CISC I tried Ling-plus. Converting HF to GGUF and Q4_K_M quantizing worked without any issues, but llama.cpp fails to load the resulting model because the shape of blk.0.attn_output.weight is swapped: it expects [5376, 7168] but gets [7168, 5376, 1, 1]. The same issue occurred both when loading the source GGUF and the Q4_K_M quants.

llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_output.weight' has wrong shape; expected  5376,  7168, got  7168,  5376,     1,     1

Here is the full log:

root@AI:/apool/BailingMoE/llama.cpp/build/bin# ./llama-cli -m /mradermacher/Ling-plus.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 4992 (531b3170) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 2797 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23725 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 1147 tensors from /mradermacher/Ling-plus.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bailingmoe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Ling Plus
llama_model_loader: - kv   3:                         general.size_label str              = 64x18B
llama_model_loader: - kv   4:                            general.license str              = mit
llama_model_loader: - kv   5:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   6:                     bailingmoe.block_count u32              = 88
llama_model_loader: - kv   7:                  bailingmoe.context_length u32              = 16384
llama_model_loader: - kv   8:                bailingmoe.embedding_length u32              = 5376
llama_model_loader: - kv   9:             bailingmoe.feed_forward_length u32              = 12288
llama_model_loader: - kv  10:            bailingmoe.attention.head_count u32              = 56
llama_model_loader: - kv  11:         bailingmoe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                  bailingmoe.rope.freq_base f32              = 600000.000000
llama_model_loader: - kv  13: bailingmoe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  14:               bailingmoe.expert_used_count u32              = 4
llama_model_loader: - kv  15:            bailingmoe.attention.key_length u32              = 128
llama_model_loader: - kv  16:          bailingmoe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                          general.file_type u32              = 1
llama_model_loader: - kv  18:            bailingmoe.rope.dimension_count u32              = 128
llama_model_loader: - kv  19:               bailingmoe.rope.scaling.type str              = none
llama_model_loader: - kv  20:       bailingmoe.leading_dense_block_count u32              = 0
llama_model_loader: - kv  21:                      bailingmoe.vocab_size u32              = 126464
llama_model_loader: - kv  22:      bailingmoe.expert_feed_forward_length u32              = 3072
llama_model_loader: - kv  23:            bailingmoe.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  24:                    bailingmoe.expert_count u32              = 64
llama_model_loader: - kv  25:             bailingmoe.expert_shared_count u32              = 1
llama_model_loader: - kv  26:             bailingmoe.expert_weights_norm bool             = true
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = bailingmoe
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,126464]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,126464]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,125824]  = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 126080
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 126081
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 126081
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {% for message in messages %}{% set r...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  265 tensors
llama_model_loader: - type  f16:  882 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 544.96 GiB (16.00 BPW) 
load: special tokens cache size = 262
load: token to piece cache size = 0.8056 MB
print_info: arch             = bailingmoe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 16384
print_info: n_embd           = 5376
print_info: n_layer          = 88
print_info: n_head           = 56
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 64
print_info: n_expert_used    = 4
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = none
print_info: freq_base_train  = 600000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 16384
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 290B
print_info: model params     = 292.54 B
print_info: general.name     = Ling Plus
print_info: n_layer_dense_lead   = 0
print_info: n_ff_exp             = 3072
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 1.0
print_info: expert_weights_norm  = 1
print_info: vocab type       = BPE
print_info: n_vocab          = 126464
print_info: n_merges         = 125824
print_info: BOS token        = 126080 '<|startoftext|>'
print_info: EOS token        = 126081 '<|endoftext|>'
print_info: EOT token        = 126081 '<|endoftext|>'
print_info: PAD token        = 126081 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 126081 '<|endoftext|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = true)
make_cpu_buft_list: disabling extra buffer types (i.e. repacking) since a GPU device is available
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_output.weight' has wrong shape; expected  5376,  7168, got  7168,  5376,     1,     1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/mradermacher/Ling-plus.gguf'
main: error: unable to load model
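
For anyone triaging a shape mismatch like this, the dimensions recorded in the GGUF can be inspected directly with the gguf Python package from llama.cpp's gguf-py (also published on PyPI); a minimal sketch, reusing the file path from the log above:

from gguf import GGUFReader  # pip install gguf, or use llama.cpp's gguf-py

reader = GGUFReader("/mradermacher/Ling-plus.gguf")
for t in reader.tensors:
    if t.name == "blk.0.attn_output.weight":
        # Print the dimensions stored in the file and compare them with the
        # shape llama.cpp expects in the error message above.
        print(t.name, list(t.shape))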

@CISC (Collaborator, Author) commented Mar 29, 2025

@nicoboss Ooops, should be fixed now!

Thank you for testing!

@arch-btw (Contributor) commented Mar 29, 2025

Thank you @CISC!
I have just quantized and tested: Ling-Coder-lite (Q5_K_M)

Quantizing works great, but I'm getting a small error during inference:

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

Failed to generate tool call example: Value is not callable: null at row 1, column 155:
{% for message in messages %}{% set role = message['role'] | lower %}{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}{% set role = role | upper %}{{ '<role>' + role + '</role>' + message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ '<role>ASSISTANT</role>' }}{% endif %}
                                                                                                                                                          ^
 at row 1, column 128:
{% for message in messages %}{% set role = message['role'] | lower %}{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}{% set role = role | upper %}{{ '<role>' + role + '</role>' + message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ '<role>ASSISTANT</role>' }}{% endif %}
                                                                                                                               ^
 at row 1, column 30:
{% for message in messages %}{% set role = message['role'] | lower %}{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}{% set role = role | upper %}{{ '<role>' + role + '</role>' + message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ '<role>ASSISTANT</role>' }}{% endif %}
                             ^
 at row 1, column 1:
{% for message in messages %}{% set role = message['role'] | lower %}{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}{% set role = role | upper %}{{ '<role>' + role + '</role>' + message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ '<role>ASSISTANT</role>' }}{% endif %}
^
 at row 1, column 1:
{% for message in messages %}{% set role = message['role'] | lower %}{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}{% set role = role | upper %}{{ '<role>' + role + '</role>' + message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ '<role>ASSISTANT</role>' }}{% endif %}
^

main: llama threadpool init, n_threads = 4
main: chat template example:
<role>SYSTEM</role>You are a helpful assistant<role>HUMAN</role>Hello<role>ASSISTANT</role>Hi there<role>HUMAN</role>How are you?<role>ASSISTANT</role>

[....]

It does, however, continue to work despite the error.

Edit: I used this command:
./llama-cli -m Ling-Coder-Lite-64x1.5B-Q5_K_M.gguf --ctx-size 4096 --conversation
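
As a side note, the template itself appears to be valid Jinja: a quick sanity check with the reference jinja2 package (the sample messages below are made up) renders it cleanly, which points at the bundled minja template engine rather than at the GGUF metadata:

from jinja2 import Template  # reference Jinja2 implementation, assumed installed

chat_template = (
    "{% for message in messages %}{% set role = message['role'] | lower %}"
    "{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}{% set role = role | upper %}"
    "{{ '<role>' + role + '</role>' + message['content'] }}{% endfor %}"
    "{% if add_generation_prompt %}{{ '<role>ASSISTANT</role>' }}{% endif %}"
)
out = Template(chat_template).render(
    messages=[{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
)
print(out)  # <role>HUMAN</role>Hello<role>ASSISTANT</role>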

@CISC (Collaborator, Author) commented Mar 29, 2025

Quantizing works great, but I'm getting a small error during inference:

Yes, I noticed this too; not sure why. @ochafik, mind having a look?

@nicoboss (Contributor) commented Mar 29, 2025

@nicoboss Ooops, should be fixed now!

Thanks a lot for fixing it. I retested Ling-plus and can confirm that it is now fully working without any issues and is generating great answers to my questions.

I'm quite positively surprised by the performance. Despite the model being 290B, with Q4_K_M quants I'm getting over 6 tokens/second of text generation on a Ryzen Threadripper PRO 7975WX system with 512 GiB of octa-channel DDR5-4800, without offloading any layers to GPU or using speculative decoding.
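
That figure roughly lines up with a back-of-the-envelope bandwidth estimate from the hyperparameters in the log above (a sketch only; the ~4.8 bits/weight for Q4_K_M is an approximation, and norms, biases, embeddings and the KV cache are ignored):

# Active weights per token for Ling-plus, using the metadata llama.cpp printed above.
n_layer, n_embd, n_ff_exp = 88, 5376, 3072
n_head, n_head_kv, head_dim = 56, 8, 128
n_expert, n_expert_used, n_expert_shared = 64, 4, 1

attn = n_embd * (n_head * head_dim) * 2 + n_embd * (n_head_kv * head_dim) * 2   # Q, O, K, V
moe = n_embd * n_expert + (n_expert_used + n_expert_shared) * 3 * n_embd * n_ff_exp  # router + gate/up/down
active = n_layer * (attn + moe)      # ~29.6e9 active weights per token
bytes_per_token = active * 4.8 / 8   # Q4_K_M is roughly 4.8 bits/weight
bandwidth = 8 * 4800e6 * 8           # octa-channel DDR5-4800, bytes/s
print(bandwidth / bytes_per_token)   # ~17 tok/s memory-bandwidth ceiling; ~6 tok/s observed is a plausible fraction of that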

@CISC (Collaborator, Author) commented Mar 30, 2025

@ggerganov I'm done nit-picking, ready for review. :)

@CISC CISC merged commit 2c3f8b8 into ggml-org:master Mar 30, 2025
51 checks passed
@CISC CISC deleted the bailingmoe branch March 30, 2025 20:21
@bartowski1182 (Contributor) commented Mar 31, 2025

I'm unable to run this with the latest build (attempting Ling-lite).

Conversion and quantization work without issue, but when attempting to calculate imatrix I get stuck on compute_imatrix: tokenizing the input ..

When attempting inference I get:

Failed to generate tool call example: Value is not callable: null at row 1, column 155:
{% for message in messages %}{% set role = message['role'] | lower %}{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}{% set role = role | upper %}{{ '<role>' + role + '</role>' + message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ '<role>ASSISTANT</role>' }}{% endif %}

even with this prompt:

<role>SYSTEM</role>You are a helpful assistant<role>HUMAN</role>Hello<role>ASSISTANT</role>Hi there<role>HUMAN</role>How are you?<role>ASSISTANT</role>

@bartowski1182 (Contributor)

LM Studio gives a different error regarding the regex:

Failed to process regex: ''(?:[sSdDmMtT]|[lL][lL]|[vV][eE]|[rR][eE])|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+'
Regex error: regex_error(error_badrepeat): One of *?+{ was not preceded by a valid regular expression.

Seems to be coming from llama.cpp's unicode code, but could be unrelated somehow

@CISC (Collaborator, Author) commented Mar 31, 2025

Conversion and quantization work without issue, but when attempting to calculate imatrix I get stuck on compute_imatrix: tokenizing the input ..

Hmmm, strange.

When attempting inference I get:
Failed to generate tool call example: Value is not callable: null at row 1, column 155:

This is harmless as long as you don't use --jinja and will be fixed once @ochafik merges minja.

@CISC (Collaborator, Author) commented Mar 31, 2025

LM Studio gives a different error regarding the regex:
Seems to be coming from llama.cpp's unicode code, but could be unrelated somehow

No error when using llama.cpp, so maybe an LM Studio issue?

@nicoboss (Contributor)

I can confirm imatrix computation getting stuck at compute_imatrix: tokenizing the input .. for Ling-plus, Ling-lite, Ling-Coder-lite and Ling-Coder-lite-base. I even waited over 12 hours to make sure this is not just slow.

@CISC (Collaborator, Author) commented Mar 31, 2025

I can confirm imatrix computation getting stuck at compute_imatrix: tokenizing the input .. for Ling-plus, Ling-lite, Ling-Coder-lite and Ling-Coder-lite-base. I even waited over 12 hours to make sure this is not just slow.

Yep, tested just now, same. Very strange.

@nicoboss (Contributor) commented Mar 31, 2025

LM Studio gives a different error regarding the regex

I have seen multiple reports on Hugging Face of users on the latest llama.cpp being unable to run Ling-lite quants due to this error.

Failed to process regex: ''(?:[sSdDmMtT]|[lL][lL]|[vV][eE]|[rR][eE])|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+'
Regex error: regex_error(error_badrepeat): One of *?+{ was not preceded by a valid regular expression.
llama_model_load: error loading model: error loading model vocabulary: Failed to process regex
llama_model_load_from_file_impl: failed to load model

See https://huggingface.co/mradermacher/Ling-lite-GGUF/discussions/1 and https://huggingface.co/mradermacher/Ling-lite-GGUF/discussions/2. I have so far been unable to reproduce this issue myself using Ling-lite.Q8_0.gguf on llama-server.

@bartowski1182 (Contributor)

Yeah, if it were just the inference warning or LM Studio I wouldn't be as sure something is wrong, but the imatrix not working is concerning.

@bartowski1182 (Contributor)

Also, I wasn't using --jinja; I provided my own prompt with -p. Not sure if I need to explicitly disable jinja?

@nicoboss (Contributor)

@CISC I just found another concerning issue. Ling-lite-base fails convert_hf_to_gguf.py on latest llama.cpp with the following error, making this specific model unusable:

root@AI:/apool/llama.cpp# venv/bin/python convert_hf_to_gguf.py /bpool/Ling-lite-base --outfile /mradermacher/tmp/quant/Ling-lite-base.gguf
INFO:hf-to-gguf:Loading model: Ling-lite-base
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:blk.0.attn_output.weight,     torch.bfloat16 --> F16, shape = {2048, 2048}
Traceback (most recent call last):
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 5514, in <module>
    main()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 5508, in main
    model_instance.write()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 440, in write
    self.prepare_tensors()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 5232, in prepare_tensors
    super().prepare_tensors()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 299, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 5185, in modify_tensors
    q, k, v = data_torch.split([n_head * head_dim, n_kv_head * head_dim, n_kv_head * head_dim], dim=-2)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apool/llama.cpp/gguf-py/gguf/lazy.py", line 121, in wrapped_fn
    res = fn(*meta_args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apool/llama.cpp/gguf-py/gguf/lazy.py", line 21, in <lambda>
    (lambda s, *args, **kwargs: getattr(s, name)(*args, **kwargs)),
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apool/llama.cpp/venv/lib/python3.11/site-packages/torch/_tensor.py", line 896, in split
    return torch._VF.split_with_sizes(self, split_size, dim)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: split_with_sizes expects split_sizes to sum exactly to 3072 (input tensor's size at dimension -2), but got split_sizes=[0, 0, 0]

@CISC (Collaborator, Author) commented Mar 31, 2025

Also, I wasn't using --jinja; I provided my own prompt with -p. Not sure if I need to explicitly disable jinja?

No, the error is the result of the chat template being parsed for tool calling even though --jinja is not enabled; it's probably a bug.

@CISC (Collaborator, Author) commented Mar 31, 2025

Ok, spotted something that looks like an error in the original regex, testing...

@CISC (Collaborator, Author) commented Mar 31, 2025

Well, not an error per se; it seems the issue is the possessive (?+ and ++) expressions. Making them greedy fixes it, but I'm not sure that's really a fix...

@CISC (Collaborator, Author) commented Mar 31, 2025

Let's try atomic grouping instead.

@CISC (Collaborator, Author) commented Mar 31, 2025

Atomic grouping doesn't seem to be supported. :(

Let's just go with greedy...
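
For reference, the possessive quantifiers only exist to forbid backtracking here; since the character classes under ?+ and ++ never overlap with what follows them, rewriting them as plain greedy quantifiers should produce the same matches. A small check with the third-party regex module (assumed installed; it supports possessive quantifiers and \p classes, unlike engines such as std::regex; the sample string is arbitrary):

import regex  # third-party 'regex' module, assumed installed (pip install regex)

possessive = (r"'(?:[sSdDmMtT]|[lL][lL]|[vV][eE]|[rR][eE])"
              r"|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]++[\r\n]*"
              r"|\s*[\r\n]|\s+(?!\S)|\s+")
greedy = possessive.replace("?+", "?").replace("++", "+")  # only hits the two possessive spots

sample = "Ling-lite's pre-tokenizer, 123 tokens!\n"
assert regex.findall(possessive, sample) == regex.findall(greedy, sample)

Both forms split such inputs identically, which is why the greedy rewrite is a safe drop-in for engines without possessive-quantifier support.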

@CISC (Collaborator, Author) commented Mar 31, 2025

@CISC I just found another concerning issue. Ling-lite-base fails convert_hf_to_gguf.py on latest llama.cpp with the following error, making this specific model unusable:

Yeah, I never tested the base model. I see the issue though: for some reason head_dim is 0:
https://huggingface.co/inclusionAI/Ling-lite-base/blob/main/config.json#L44
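
The converter splits the fused QKV tensor by head_dim (see the traceback above), so a zero value turns the split sizes into [0, 0, 0]. A minimal sketch of the kind of fallback needed, not necessarily the exact patch; the hparams keys follow the Hugging Face config.json naming:

def resolve_head_dim(hparams: dict) -> int:
    # Ling-lite-base's config.json reports head_dim as 0, which makes
    # n_head * head_dim == 0 and hence QKV split sizes of [0, 0, 0].
    head_dim = hparams.get("head_dim") or 0
    if head_dim <= 0:
        head_dim = hparams["hidden_size"] // hparams["num_attention_heads"]
    return head_dim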

@nicoboss (Contributor) commented Mar 31, 2025

@CISC Thanks a lot for trying to fix the head_dim issue. Unfortunately, convert_hf_to_gguf.py for Ling-lite-base still fails on latest llama.cpp with the same exact exception. I made sure convert_hf_to_gguf.py contains your latest changes. I added some debug code and can confirm that the set_gguf_parameters function you modified never even gets called, as it crashes inside prepare_tensors() while calling super().prepare_tensors().

@CISC (Collaborator, Author) commented Apr 1, 2025

@CISC Thanks a lot for trying to fix the head_dim issue. Unfortunately convert_hf_to_gguf.py for Ling-lite-base still fails on latest llama.cpp with the same exact exception.

Sigh, sorry about that; late-night bug fixing is not a great idea, I guess. New PR up. :)

Labels: model (Model specific), python (python script changes)

5 participants