Improve Mistral models integration with llama.cpp #14737

Open · juliendenize wants to merge 6 commits into master from mistral_integration

Conversation

@juliendenize commented Jul 17, 2025

Description

This PR aims to enhance the integration of Mistral models with llama.cpp by addressing several key issues and introducing new features. Here are the details:

Context

  • The current HF-to-GGUF conversion does not work directly for Mistral models because our native checkpoint format is vLLM-based. This means weights first have to be converted to Hugging Face format and then to GGUF, which is not ideal and can lead to conversion errors if the first conversion is not done correctly. It also means that adding new models to the llama.cpp ecosystem requires first adding them to Transformers.
  • We do not support chat templates natively, which means chat templates are community-maintained and not guaranteed to work correctly.
  • We use mistral-common internally for tokenization and want the community to use it to unlock the full capabilities of our models. As mistral-common is a Python library, we have opened a PR to add a REST API via FastAPI to make it easier for users who are not in the Python ecosystem.

Using mistral-common with llama.cpp

We recommend that users only use the llama-server tool with the server's /completions route for now, as it is the only route that supports token input. We also advise setting return_tokens=True in requests so that mistral-common handles detokenization, as illustrated below.
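
For illustration, a minimal raw request against llama-server could look like the following sketch (the token IDs are placeholders; in practice they come from mistral-common's /tokenize/messages route, as shown in the Example Code section further down):

import requests

# Placeholder token IDs for illustration only; real IDs are produced by
# mistral-common's /tokenize/messages route.
tokens = [1, 3, 1010, 4]

response = requests.post(
    "http://127.0.0.1:8080/completions",
    json={"prompt": tokens, "stream": False, "return_tokens": True},
)
print(response.json()["tokens"])  # generated token IDs, to be detokenized by mistral-common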

Added features

  1. Model conversion:

We have added convert_mistral_to_gguf.py, a script that converts Mistral models to GGUF directly from Hugging Face, without requiring an intermediate Transformers conversion.

  2. Model architecture:

We registered the Mistral architecture in llama.cpp to support Mistral models natively. This allows users to run Mistral models with llama.cpp without having to convert them to Hugging Face format first.

Known Limitations:

Our approach does not support multimodality:

  • mistral-common handles multimodal data, but it cannot be passed to llama.cpp via the /completions route.
  • llama.cpp only supports multimodality via chat templates, which we do not support.

This approach also requires users to interact with the llama.cpp server only through the /completions route.

Example Code

To get started, install mistral-common using the following command:

pip install git+https://github.com/mistralai/mistral-common.git@improve_llama_cpp_integration[server]

(Optional) Convert the model

HF_TOKEN=... python convert_mistral_to_gguf.py \
    mistralai/Devstral-Small-2505 --remote --ctx-train 131072 --outtype bf16

Launch the mistral-common and llama.cpp servers

Launch the mistral-common server:

HF_TOKEN=... mistral_common mistralai/Devstral-Small-2505 --port 6000

Launch the llama.cpp server:

./build/bin/llama-server -m models/Devstral-Small-2505-Q4_K_M.gguf --port 8080

Use the servers

Here is a code snippet demonstrating how to use the new features:

import requests

mistral_common_url = "http://127.0.0.1:6000"
llama_cpp_url = "http://127.0.0.1:8080"

def tokenize(messages, url):
    # Render the chat messages with mistral-common and return the prompt token IDs.
    response = requests.post(f"{url}/tokenize/messages", json=messages)
    return response.json()

def detokenize(tokens, url):
    # Convert generated token IDs back into plain text.
    response = requests.post(f"{url}/detokenize", json={"tokens": tokens})
    return response.json()

def detokenize_message(tokens, url):
    # Convert generated token IDs into a structured assistant message.
    response = requests.post(f"{url}/detokenize", json={"tokens": tokens, "as_message": True})
    return response.json()

def generate(tokens, url):
    # Ask llama-server's /completions route for a completion from raw token IDs;
    # return_tokens=True makes it return token IDs for mistral-common to detokenize.
    response = requests.post(f"{url}/completions", json={
        "prompt": tokens,
        "stream": False,
        "return_tokens": True
    })
    return response.json()

messages = [
    {"role": "system", "content": "You are Devstral a cool coding agent that can help users with their coding needs."},
    {"role": "user", "content": "Who are you and what can you do?"}
]

tokens = tokenize(messages, mistral_common_url)
print(tokens)

generated = generate(tokens, llama_cpp_url)["tokens"]
print(generated)

detokenized = detokenize(generated, mistral_common_url)
print(detokenized)

detokenized_message = detokenize_message(generated, mistral_common_url)
print(detokenized_message)

Feedback and Contributions

We believe these changes will significantly improve the integration of Mistral models with llama.cpp and provide a better experience for our users. We welcome any feedback or suggestions to further enhance this integration. Also, as we have little experience with the llama.cpp codebase, we welcome any help to improve the integration and make sure we respect the codebase and the community.

github-actions bot added the python (python script changes) label on Jul 17, 2025
@ggerganov (Member)

Thanks for the contribution. From a developer perspective, it looks like a good approach to avoid any potential tokenization / formatting problems. In general, for all models, using a reference tokenizer instead of relying on llama.cpp is always recommended. From a usability standpoint, the requirement to start a separate tokenization server is a bit of a drawback, but I understand that correctness is of higher importance.

My understanding is that most chat template problems occur during the early days of the model release, and with time tend to get polished and fixed. So this approach would be a stable alternative during such periods of instability.

@ehoogeveen-medweb

IIRC Mistral's architecture also makes use of sliding window attention (SWA), defaulting to a window size of 4096 tokens - though I don't know all the details (like which layers, if any, are full layers). It would be great if the window size could be stored in the GGUF file as well (e.g. as mistral.attention.sliding_window), and the model could eventually be hooked into llama.cpp's SWA support.
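
For reference, a rough sketch of how the window size could be recorded at conversion time, assuming gguf-py's GGUFWriter.add_sliding_window helper and a converter that exposes the model hyperparameters as a dict (the function and variable names here are purely illustrative, not code from this PR):

import gguf

def write_swa_metadata(gguf_writer: gguf.GGUFWriter, hparams: dict) -> None:
    # Illustrative only: record the SWA window size (e.g. 4096 for older
    # Mistral models) so it lands in the GGUF header as
    # "<arch>.attention.sliding_window", where llama.cpp's SWA support
    # could eventually pick it up.
    sliding_window = hparams.get("sliding_window")
    if sliding_window is not None:
        gguf_writer.add_sliding_window(sliding_window)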

juliendenize force-pushed the mistral_integration branch from b809a96 to 2865a25 on July 23, 2025
@juliendenize (Author)

Hey guys, many apologies for the late answer, and thanks a lot for your feedback.

@ggerganov

My understanding is that most chat template problems occur during the early days of the model release, and with time tend to get polished and fixed. So this approach would be a stable alternative during such periods of instability.

Exactly. What's cool with llama.cpp is that you support passing Jinja templates when serving, so people can use them once they are correct, if they want, and drop the mistral-common server! Very nice feature.

@ehoogeveen-medweb

Mistral's architecture also makes use of sliding window attention (SWA)

This actually only applies to quite old (by deep learning standards ^^) models, so we didn't add support for it. Could it be a subsequent PR?

Regarding the PR:

  • I refactored a bit to remove the Mistral arch; it didn't add value, so we think this is less of a maintainability burden!
  • I tried to make the CI green, but I think some checks cannot pass because we modified gguf-py files and the CI installs the published package, AFAIU. Is that right?

Happy to answer more questions :)

juliendenize marked this pull request as ready for review on July 23, 2025
@CISC (Collaborator) commented Jul 24, 2025

* I tried to make the CI green, but I think some checks cannot pass because we modified gguf-py files and the CI installs the published package, AFAIU. Is that right?

Partially, there's also a pydantic version conflict:
https://github.com/ggml-org/llama.cpp/actions/runs/16474192826/job/46571944794?pr=14737#step:4:291

@CISC (Collaborator) commented Jul 24, 2025

@juliendenize Please undo all the formatting/style changes, they are not relevant and add too much noise to the PR, will review afterwards. :)

@juliendenize (Author)

Partially, there's also a pydantic version conflict:

Arf, would it be OK to bump the Pydantic requirement on your side, or is that a no-go? Was there a particular reason to stay at 2.6?

@juliendenize (Author)

@juliendenize Please undo all the formatting/style changes, they are not relevant and add too much noise to the PR, will review afterwards. :)

Done, sorry about that, my own formatter was on.

Is there a formatter or linter available for Python? I didn't find one in the contributing guidelines.
I've installed flake8, but it didn't flag anything when run locally.

@CISC (Collaborator) commented Jul 24, 2025

Arf, would it be OK to bump the Pydantic requirement on your side, or is that a no-go? Was there a particular reason to stay at 2.6?

Yes, I think it's OK; it's probably just the version that was available at the time.

Is there a formatter or linter available for Python? I didn't find one in the contributing guidelines. I've installed flake8, but it didn't flag anything when run locally.

We don't use a Python formatter, only flake8 linting.

@CISC (Collaborator) commented Jul 24, 2025

Pillow conflict, should be fine to update:
https://github.com/ggml-org/llama.cpp/actions/runs/16497338367/job/46646389345?pr=14737#step:4:305

@CISC (Collaborator) commented Jul 24, 2025

Right, now we are getting somewhere. :)
https://github.com/ggml-org/llama.cpp/actions/runs/16497849941/job/46648060083?pr=14737

Edit: The unbound errors are clearly handled at init and can be silenced by # pyright: ignore[reportPossiblyUnboundVariable]
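
For anyone unfamiliar with that suppression, a minimal, self-contained illustration (not code from this PR) looks like this:

import random

# pyright flags `value` as possibly unbound because the assignment sits inside
# a conditional, even though in the real code the condition always holds at init.
if random.random() >= 0.0:  # always true, but pyright cannot prove it
    value = 42

print(value)  # pyright: ignore[reportPossiblyUnboundVariable]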

@am17an (Collaborator) commented Jul 24, 2025

@juliendenize do you also plan to make changes to convert_mistral_to_gguf.py to have mappings for audio_tower.* for the mmproj, I guess it will be necessary for the new voxtral models?

@juliendenize (Author)

Right, now we are getting somewhere. :) https://github.com/ggml-org/llama.cpp/actions/runs/16497849941/job/46648060083?pr=14737

Edit: The unbound errors are clearly handled at init and can be silenced by # pyright: ignore[reportPossiblyUnboundVariable]

Tried to make things cleaner, sorry for the back and forth.

@juliendenize (Author)

@juliendenize do you also plan to make changes to convert_mistral_to_gguf.py to have mappings for audio_tower.* for the mmproj, I guess it will be necessary for the new voxtral models?

That would be cool indeed. I didn't personally work on Voxtral (the model itself), so I might need some assistance, as I lack experience with audio models.

Is Voxtral already supported by llama.cpp? I assumed not, for now.

@am17an (Collaborator) commented Jul 24, 2025

@juliendenize do you also plan to make changes to convert_mistral_to_gguf.py to have mappings for audio_tower.* for the mmproj, I guess it will be necessary for the new voxtral models?

That would be cool indeed. I didn't personally work on Voxtral (the model itself), so I might need some assistance, as I lack experience with audio models.

Is Voxtral already supported by llama.cpp? I assumed not, for now.

Yeah not for now, but I was trying to add support and ran into issues converting to GGUF. But that should be easy to add after this PR is merged, so don't worry about it for now :)

@juliendenize (Author)

Ok so this: https://github.com/ggml-org/llama.cpp/actions/runs/16500995835/job/46660394829?pr=14737

is actually expected, because we haven't merged the corresponding PR in mistral-common yet:
Add a FastAPI app #113

We're in the process of merging it; I'm just adding a final feature, which is being able to call /v1/chat/completions to directly call the inference server (in this case llama.cpp !!). I'm moving as fast as possible on this.
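
For context, a rough sketch of what calling that route from a client could look like once the feature lands; the route and payload shape are assumptions until Add a FastAPI app #113 is merged:

import requests

# Hypothetical client call; the route and payload are assumptions until the
# mistral-common FastAPI PR (#113) is merged.
mistral_common_url = "http://127.0.0.1:6000"

response = requests.post(
    f"{mistral_common_url}/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Who are you and what can you do?"}
        ],
        "stream": False,
    },
)
print(response.json())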

@CISC (Collaborator) commented Jul 24, 2025

We're in the process of merging it; I'm just adding a final feature, which is being able to call /v1/chat/completions to directly call the inference server (in this case llama.cpp !!). I'm moving as fast as possible on this.

Ok, ping me when you're ready.

@ngxson (Collaborator) commented Jul 24, 2025

I've just had a deeper look into this PR. One concern though: most of the code inside convert_mistral_to_gguf.py is copied from convert_hf_to_gguf.py, which can make it a bit tricky to maintain in the long term, especially the code for multimodal model conversion.

Just thinking, maybe it's better to bring it right into convert_hf_to_gguf.py? AFAIU most of the complicated code in this PR is dedicated to converting the tokenizer to GGUF.

Btw, I'm also working on converting Voxtral to GGUF. I thought that would be simple, but I'm currently stuck at the tokenizer. Trying a quick hack to copy some code from this PR... will see if it works.

@ngxson (Collaborator) commented Jul 24, 2025

Ok, so as demoed in #14862, I think it might be better to merge everything into convert_hf_to_gguf. This has two big advantages:

  • Easier long-term maintenance, since we will have less duplicated code
  • Less confusion for end users: users who are not very familiar with llama.cpp may not understand that they need a dedicated script for Mistral models (or models fine-tuned from Mistral)

@CISC (Collaborator) commented Jul 24, 2025

Ok, so as demoed in #14862, I think it might be better to merge everything into convert_hf_to_gguf.

Sounds good to me.
