
Conversation

@ngxson
Collaborator

@ngxson ngxson commented Dec 23, 2025

Ref HF model: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash

I'm interested in this model because it is the second one to use attention sinks (after GPT-OSS).

Fix #18120
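For context on the attention-sink mechanism this PR wires up: the idea (as in GPT-OSS) is an extra learned logit that takes part in the attention softmax but carries no value vector, letting a head attend "nowhere". A minimal numpy sketch of the softmax step, with illustrative names only (this is not llama.cpp's actual implementation):

```python
import numpy as np

def softmax_with_sink(scores: np.ndarray, sink_logit: float) -> np.ndarray:
    """Attention softmax with an extra 'sink' logit.

    The sink competes for probability mass in the denominator but
    contributes no value vector, so its weight is simply dropped.
    """
    s = np.concatenate([scores, [sink_logit]])
    e = np.exp(s - s.max())  # numerically stable softmax
    p = e / e.sum()
    return p[:-1]            # discard the sink's probability
```

With a sink logit of 0 and two zero scores, each real position gets 1/3 of the mass and the remaining 1/3 is absorbed by the sink, so the returned weights sum to less than 1.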

Test using the Q8_0 model (tested up to ~4K tokens):

$ llama-cli -m ../models/MiMo-V2-Flash/modelq.gguf -c 8000 -p "hi"

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b7534-d4a3c4d41
model      : modelq.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> hi

[Start thinking]
We are going to create a simple "hi" response in multiple languages.
 The user just said "hi", so we will respond with "hi" in a few different languages.
 We'll create a list of common languages and their translation for "hi".
 Then we will output each one in a formatted way.
[End thinking]

Hello! Here's a friendly "hello" in several languages:

🌍 **Common Greetings**:
- **English**: Hi  
- **Spanish**: Hola  
- **French**: Bonjour  
- **German**: Hallo  
- **Italian**: Ciao  
- **Japanese**: こんにちは (Konnichiwa)  
- **Korean**: 안녕하세요 (Annyeonghaseyo)  
- **Arabic**: مرحبًا (Marhaban)  
- **Hindi**: नमस्ते (Namaste)  
- **Portuguese**: Olá  

**Fun fact**: The earliest recorded use of "hello" in English dates back to 1827!  😊  

How can I assist you today?

[ Prompt: 4431.1 t/s | Generation: 31.3 t/s ]

@github-actions github-actions bot added model Model specific python python script changes labels Dec 23, 2025
@ngxson ngxson marked this pull request as ready for review December 23, 2025 22:41
@Aaryan-Kapoor

I opened a follow‑up PR with the fixes here: #18333
Please merge that PR (or cherry‑pick its commit(s)) into this one if you prefer.

@CISC
Collaborator

CISC commented Dec 24, 2025

Interesting, is this the first model that actually uses a non-step SWA pattern?

@ngxson
Collaborator Author

ngxson commented Dec 24, 2025

@CISC it still uses the repeated SWA pattern, but the hybrid_block_size config that controls the pattern is not written into config.json: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash/blob/main/configuration_mimo_v2_flash.py#L93-L96

So I think it will be cleaner to just use whatever is already inside config.json; more and more models do the same thing now.
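To illustrate the pattern under discussion: with a hybrid_block_size of N, each block of N layers starts with one dense (full-attention) layer followed by N-1 SWA layers ("step with dense first"). A hypothetical sketch, not the converter's actual code:

```python
def swa_pattern(n_layers: int, hybrid_block_size: int) -> list[bool]:
    # True = sliding-window attention, False = full (dense) attention.
    # The first layer of each block of `hybrid_block_size` layers is dense.
    return [(i % hybrid_block_size) != 0 for i in range(n_layers)]
```

For example, 8 layers with hybrid_block_size=4 give dense layers at indices 0 and 4 and SWA everywhere else.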

mimo2: wire RMS eps + MoE bias + converter guards
@ngxson
Collaborator Author

ngxson commented Dec 24, 2025

@Aaryan-Kapoor thanks, I merged your commit here

Co-authored-by: Aaryan-Kapoor <[email protected]>
@CISC
Collaborator

CISC commented Dec 24, 2025

@CISC it still uses the repeated SWA pattern, but the hybrid_block_size config that controls the pattern is not written into config.json: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash/blob/main/configuration_mimo_v2_flash.py#L93-L96

Ah, it's just step with dense first.

@ngxson
Collaborator Author

ngxson commented Dec 24, 2025

Logits matched against vLLM (thanks @bartowski1182 for the vLLM test); they are pretty close on long context:

| idx  | llamacpp.log   | logprob_1 | vllm.log       | logprob_2 | diff (abs) |
|------|----------------|-----------|----------------|-----------|------------|
| 1    | '//'           | -2.9061   | ' \n'          | -4.2447   | 1.3386     |
| 2    | ' here'        | -0.8739   | ' Here'        | -1.3667   | 0.4927     |
| 3    | ' expert'      | -0.9182   | ' expert'      | -0.8108   | 0.1073     |
| 4    | ' AI'          | -0.0318   | ' AI'          | -0.0167   | 0.0150     |
| 5    | ' assistant'   | -0.3031   | ' assistant'   | -0.1606   | 0.1425     |
| 6    | ' designed'    | -0.5695   | '.'            | -1.3440   | 0.7744     |
| 7    | ' of'          | -0.0011   | ' of'          | -0.0000   | 0.0011     |
| 8    | ' text'        | -1.4561   | '1'            | -0.1285   | 1.3277     |
| 9    | ' tools'       | -0.0513   | ' tools'       | -0.0492   | 0.0021     |
| 10   | ' to'          | -0.5703   | '.'            | -0.7627   | 0.1924     |
| 1011 | ' you'         | -0.0063   | ' you'         | -0.0005   | 0.0058     |
| 1012 | ' need'        | -0.0010   | ' need'        | -0.0001   | 0.0009     |
| 1013 | ' to'          | -0.0032   | ' to'          | -0.0018   | 0.0014     |
| 1014 | ' use'         | -0.0023   | ' use'         | -0.0141   | 0.0118     |
| 1015 | ' a'           | -0.0027   | ' a'           | -0.0044   | 0.0017     |
| 1016 | ' tool'        | -0.0144   | '1'            | -0.1830   | 0.1686     |
| 1017 | ' output'      | -0.0052   | ' output'      | -0.0002   | 0.0049     |
| 1018 | ' the'         | -0.0087   | ' the'         | -0.0014   | 0.0073     |
| 1019 | ' call'        | -0.0085   | ' call'        | -0.0021   | 0.0065     |
| 1020 | ' in'          | -0.0044   | ' in'          | -0.0004   | 0.0040     |
| 5021 | ' requires'    | -0.0002   | ' requires'    | -0.0000   | 0.0002     |
| 5022 | ' external'    | -0.0002   | ' external'    | -0.0000   | 0.0002     |
| 5023 | ' data'        | -0.0008   | ' data'        | -0.0001   | 0.0007     |
| 5024 | ' computation' | -0.0140   | ' computation' | -0.0022   | 0.0118     |
| 5025 | ' or'          | -0.0001   | ' or'          | -0.0000   | 0.0000     |
| 5026 | ' actions'     | -0.0002   | ' actions'     | -0.0000   | 0.0002     |
| 5027 | ' beyond'      | -0.0001   | ' beyond'      | -0.0000   | 0.0001     |
| 5028 | ' your'        | -0.0040   | ' your'        | -0.0001   | 0.0040     |
| 5029 | ' internal'    | -0.0005   | ' internal'    | -0.0000   | 0.0004     |
| 5030 | ' knowledge'   | -0.0022   | ' knowledge'   | -0.0000   | 0.0021     |

I'm still not quite sure where the differences come from, but I think this PR is ready to merge: further fixes can be added without breaking the existing GGUF.
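For reference, the "diff (abs)" column in the table above is just the absolute difference between the two backends' logprobs for the same token position; a trivial sketch (names illustrative):

```python
def logprob_diff(lp_llamacpp: float, lp_vllm: float) -> float:
    # absolute difference between the two backends' logprobs for one token
    return abs(lp_llamacpp - lp_vllm)
```

For example, row 1 of the table gives logprob_diff(-2.9061, -4.2447) ≈ 1.3386.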

ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);

hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;
hparams.rope_freq_base_train_swa = 10000.0f;
Collaborator

This can be saved to metadata now with add_rope_freq_base_swa.
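A hedged sketch of what persisting this to GGUF metadata could look like on the conversion side; StubGGUFWriter and the key name are stand-ins for illustration, and only the add_rope_freq_base_swa method name comes from this comment:

```python
class StubGGUFWriter:
    """Stand-in for gguf-py's writer, illustrating the metadata flow."""
    def __init__(self):
        self.kv = {}

    def add_rope_freq_base_swa(self, value: float):
        # the real writer would serialize this under the model's
        # rope key-value namespace; the key name here is an assumption
        self.kv["rope.freq_base_swa"] = value

writer = StubGGUFWriter()
writer.add_rope_freq_base_swa(10000.0)
```

The point is that the 10000.0 SWA rope base then comes from the GGUF file instead of being hard-coded in llama.cpp's model loader.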

Collaborator Author

Added in 0cd227f

@bartowski1182 I'm reconverting the GGUF on my side, will merge this as soon as I can confirm that it still works (feel free to also run the conversion on your side)

Collaborator Author

Tested on my side; the results are unchanged:

| idx  | llamacpp.log   | logprob_1 | vllm.log       | logprob_2 | diff (abs) |
|------|----------------|-----------|----------------|-----------|------------|
| 1    | '//'           | -2.9061   | ' \n'          | -4.2447   | 1.3386     |
| 2    | ' here'        | -0.8739   | ' Here'        | -1.3667   | 0.4927     |
| 3    | ' expert'      | -0.9182   | ' expert'      | -0.8108   | 0.1073     |
| 4    | ' AI'          | -0.0318   | ' AI'          | -0.0167   | 0.0150     |
| 5    | ' assistant'   | -0.3031   | ' assistant'   | -0.1606   | 0.1425     |
| 6    | ' designed'    | -0.5695   | '.'            | -1.3440   | 0.7744     |
| 7    | ' of'          | -0.0011   | ' of'          | -0.0000   | 0.0011     |
| 8    | ' text'        | -1.4561   | '1'            | -0.1285   | 1.3277     |
| 9    | ' tools'       | -0.0513   | ' tools'       | -0.0492   | 0.0021     |
| 10   | ' to'          | -0.5703   | '.'            | -0.7627   | 0.1924     |
| 1011 | ' you'         | -0.0063   | ' you'         | -0.0005   | 0.0058     |
| 1012 | ' need'        | -0.0010   | ' need'        | -0.0001   | 0.0009     |
| 1013 | ' to'          | -0.0032   | ' to'          | -0.0018   | 0.0014     |
| 1014 | ' use'         | -0.0023   | ' use'         | -0.0141   | 0.0118     |
| 1015 | ' a'           | -0.0027   | ' a'           | -0.0044   | 0.0017     |
| 1016 | ' tool'        | -0.0144   | '1'            | -0.1830   | 0.1686     |
| 1017 | ' output'      | -0.0052   | ' output'      | -0.0002   | 0.0049     |
| 1018 | ' the'         | -0.0087   | ' the'         | -0.0014   | 0.0073     |
| 1019 | ' call'        | -0.0085   | ' call'        | -0.0021   | 0.0065     |
| 1020 | ' in'          | -0.0044   | ' in'          | -0.0004   | 0.0040     |
| 5021 | ' requires'    | -0.0002   | ' requires'    | -0.0000   | 0.0002     |
| 5022 | ' external'    | -0.0002   | ' external'    | -0.0000   | 0.0002     |
| 5023 | ' data'        | -0.0008   | ' data'        | -0.0001   | 0.0007     |
| 5024 | ' computation' | -0.0140   | ' computation' | -0.0022   | 0.0118     |
| 5025 | ' or'          | -0.0001   | ' or'          | -0.0000   | 0.0000     |
| 5026 | ' actions'     | -0.0002   | ' actions'     | -0.0000   | 0.0002     |
| 5027 | ' beyond'      | -0.0001   | ' beyond'      | -0.0000   | 0.0001     |
| 5028 | ' your'        | -0.0040   | ' your'        | -0.0001   | 0.0040     |
| 5029 | ' internal'    | -0.0005   | ' internal'    | -0.0000   | 0.0004     |
| 5030 | ' knowledge'   | -0.0022   | ' knowledge'   | -0.0000   | 0.0021     |


Successfully merging this pull request may close these issues.

Feature Request: Support for New MoE Model - XiaomiMiMo / MiMo-V2-Flash
