model: support MiMo-V2-Flash #18328
Conversation
I opened a follow-up PR with the fixes here: #18333

Interesting, is this the first model that actually uses a non-step SWA pattern?

@CISC it still uses a repeated SWA pattern, but the actual config specifies it directly, so I think it will be cleaner to just use whatever is already inside the config.
mimo2: wire RMS eps + MoE bias + converter guards
@Aaryan-Kapoor thanks, I merged your commit here
Co-authored-by: Aaryan-Kapoor <[email protected]>
Ah, it's just step with dense first.
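To make the "step with dense first" layout concrete, here is a minimal sketch of such a pattern as a per-layer predicate. The function name, parameters, and the exact step arithmetic are illustrative assumptions, not the PR's actual code:

```cpp
// Illustrative sketch only: a "step with dense first" SWA layout, i.e. the
// first n_dense_first layers use full (dense) attention, and afterwards one
// layer per period of `step` layers keeps full attention while the rest use
// sliding-window attention. Names and arithmetic are assumptions.
static bool layer_uses_swa(int il, int n_dense_first, int step) {
    if (il < n_dense_first) {
        return false; // leading dense layers: full attention
    }
    // the last layer of each period keeps full attention
    return ((il - n_dense_first) % step) != (step - 1);
}
```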
Logits matching against vLLM (thanks @bartowski1182 for the vLLM test), pretty close on long context. I'm still not quite sure where the differences come from, but I think this PR is ready to merge; fixes can be added later without breaking the existing GGUF.
src/llama-model.cpp (Outdated)
```cpp
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);

hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;
hparams.rope_freq_base_train_swa = 10000.0f;
```
This can be saved to metadata now with add_rope_freq_base_swa.
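For context, a sketch of what the loader-side counterpart could look like: keep the hard-coded default, but prefer a value written to GGUF metadata by the converter's add_rope_freq_base_swa. The key name below is an assumption for illustration, not necessarily the one llama.cpp defines:

```cpp
// Sketch only (key name is an assumption): fall back to the hard-coded
// default when the converter did not store an SWA rope base in metadata.
hparams.rope_freq_base_train_swa = 10000.0f;
ml.get_key(LLM_KV_ROPE_FREQ_BASE_SWA, hparams.rope_freq_base_train_swa, /*required=*/false);
```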
Added in 0cd227f
@bartowski1182 I'm reconverting the GGUF on my side, will merge this as soon as I can confirm that it still works (feel free to also run the conversion on your side)
Tested on my side and the results are unchanged:
| idx | token (llama.cpp) | logprob (llama.cpp) | token (vLLM) | logprob (vLLM) | diff (abs) |
|------|-------------------|---------------------|--------------|----------------|------------|
| 1 | '//' | -2.9061 | ' \n' | -4.2447 | 1.3386 |
| 2 | ' here' | -0.8739 | ' Here' | -1.3667 | 0.4927 |
| 3 | ' expert' | -0.9182 | ' expert' | -0.8108 | 0.1073 |
| 4 | ' AI' | -0.0318 | ' AI' | -0.0167 | 0.0150 |
| 5 | ' assistant' | -0.3031 | ' assistant' | -0.1606 | 0.1425 |
| 6 | ' designed' | -0.5695 | '.' | -1.3440 | 0.7744 |
| 7 | ' of' | -0.0011 | ' of' | -0.0000 | 0.0011 |
| 8 | ' text' | -1.4561 | '1' | -0.1285 | 1.3277 |
| 9 | ' tools' | -0.0513 | ' tools' | -0.0492 | 0.0021 |
| 10 | ' to' | -0.5703 | '.' | -0.7627 | 0.1924 |
| 1011 | ' you' | -0.0063 | ' you' | -0.0005 | 0.0058 |
| 1012 | ' need' | -0.0010 | ' need' | -0.0001 | 0.0009 |
| 1013 | ' to' | -0.0032 | ' to' | -0.0018 | 0.0014 |
| 1014 | ' use' | -0.0023 | ' use' | -0.0141 | 0.0118 |
| 1015 | ' a' | -0.0027 | ' a' | -0.0044 | 0.0017 |
| 1016 | ' tool' | -0.0144 | '1' | -0.1830 | 0.1686 |
| 1017 | ' output' | -0.0052 | ' output' | -0.0002 | 0.0049 |
| 1018 | ' the' | -0.0087 | ' the' | -0.0014 | 0.0073 |
| 1019 | ' call' | -0.0085 | ' call' | -0.0021 | 0.0065 |
| 1020 | ' in' | -0.0044 | ' in' | -0.0004 | 0.0040 |
| 5021 | ' requires' | -0.0002 | ' requires' | -0.0000 | 0.0002 |
| 5022 | ' external' | -0.0002 | ' external' | -0.0000 | 0.0002 |
| 5023 | ' data' | -0.0008 | ' data' | -0.0001 | 0.0007 |
| 5024 | ' computation' | -0.0140 | ' computation' | -0.0022 | 0.0118 |
| 5025 | ' or' | -0.0001 | ' or' | -0.0000 | 0.0000 |
| 5026 | ' actions' | -0.0002 | ' actions' | -0.0000 | 0.0002 |
| 5027 | ' beyond' | -0.0001 | ' beyond' | -0.0000 | 0.0001 |
| 5028 | ' your' | -0.0040 | ' your' | -0.0001 | 0.0040 |
| 5029 | ' internal' | -0.0005 | ' internal' | -0.0000 | 0.0004 |
| 5030 | ' knowledge' | -0.0022 | ' knowledge' | -0.0000 | 0.0021 |
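To be explicit about what the last column means: diff (abs) is just the absolute difference of the two per-token logprobs. A minimal sketch, using the (rounded) values from the first rows of the table above; recomputed diffs can differ from the table in the last digit because the inputs are rounded:

```cpp
// Minimal sketch: recompute the "diff (abs)" column from per-token logprobs.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // rounded values taken from the first rows of the table above
    std::vector<float> lp_llamacpp = { -2.9061f, -0.8739f, -0.9182f };
    std::vector<float> lp_vllm     = { -4.2447f, -1.3667f, -0.8108f };

    for (size_t i = 0; i < lp_llamacpp.size(); ++i) {
        std::printf("%zu: diff (abs) = %.4f\n", i + 1, std::fabs(lp_llamacpp[i] - lp_vllm[i]));
    }
    return 0;
}
```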
Ref HF model: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash
I'm interested in this model because it is the second one to use attention sinks (after GPT-OSS).
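For readers unfamiliar with the term, the sketch below shows the usual idea behind an attention sink: a learned per-head logit takes part in the softmax normalization without contributing a value, so the weights over real positions sum to less than one. This is my own illustrative code, not llama.cpp's implementation:

```cpp
// Illustrative attention-sink softmax (not llama.cpp code): the sink logit
// competes for probability mass but contributes no value, so sum(p) < 1 and
// "excess" attention drains into the sink.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> softmax_with_sink(const std::vector<float> & logits, float sink_logit) {
    float max_l = sink_logit;
    for (float l : logits) {
        max_l = std::max(max_l, l);
    }

    float denom = std::exp(sink_logit - max_l); // sink's share of the mass
    std::vector<float> p(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - max_l);
        denom += p[i];
    }
    for (float & pi : p) {
        pi /= denom;
    }
    return p;
}
```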
Fix #18120
Tested using a Q8_0 model (up to ~4K tokens):