Improve Mistral models integration with llama.cpp #14737

Open · juliendenize wants to merge 6 commits into master from mistral_integration

Conversation

@juliendenize commented Jul 17, 2025

Description

This PR aims to enhance the integration of Mistral models with llama.cpp by addressing several key issues and introducing new features. Here are the details:

Context

  • The current HF-to-GGUF conversion does not work directly for Mistral models because our native checkpoint format is vLLM-based. This means weights first have to be converted to Hugging Face format and then to GGUF, which is not ideal and can lead to conversion errors if the first conversion is not done correctly. It also means that adding new models to the llama.cpp ecosystem requires first adding them to Transformers.
  • We do not support chat templates natively, which means chat templates are community-maintained and not guaranteed to work correctly.
  • We use mistral-common internally for tokenization and want the community to use it to unlock the full capabilities of our models. As mistral-common is a Python library, we have opened a PR to add a REST API via FastAPI to make it easier for users who are not in the Python ecosystem.

Using mistral-common with llama.cpp

We recommend that users only use the llama-server tool with the server's /completions route for now, as it is the only route that supports token input. We also advise setting return_tokens=True in requests so that mistral-common handles detokenization, as illustrated below.
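
For illustration, a minimal raw request against llama-server could look like the following sketch (the token IDs are placeholders; in practice they come from mistral-common's /tokenize/messages route, as shown in the Example Code section further down):

import requests

# Placeholder token IDs for illustration only; real IDs are produced by
# mistral-common's /tokenize/messages route.
tokens = [1, 3, 1010, 4]

response = requests.post(
    "http://127.0.0.1:8080/completions",
    json={"prompt": tokens, "stream": False, "return_tokens": True},
)
print(response.json()["tokens"])  # generated token IDs, to be detokenized by mistral-common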

Added features

  1. Model conversion:

We have added convert_mistral_to_gguf.py, a script that converts Mistral models to GGUF directly from Hugging Face, without requiring an intermediate Transformers conversion.

  2. Model architecture:

We registered the Mistral architecture in llama.cpp to support Mistral models natively. This allows users to run Mistral models with llama.cpp without having to convert them to Hugging Face format first.

Known Limitations:

Our approach does not support multimodality:

  • mistral-common handles multimodal data, but it cannot be passed to llama.cpp via the /completions route.
  • llama.cpp only supports multimodality via chat templates, which we do not support.

This approach also requires users to interact with the llama.cpp server only through the /completions route.

Example Code

To get started, install mistral-common using the following command:

pip install git+https://github.com/mistralai/mistral-common.git@improve_llama_cpp_integration[server]

(Optional) Convert the model

HF_TOKEN=... python convert_mistral_to_gguf.py \
    mistralai/Devstral-Small-2505 --remote --ctx-train 131072 --outtype bf16

Launch the mistral-common and llama.cpp servers

Launch the mistral-common server:

HF_TOKEN=... mistral_common mistralai/Devstral-Small-2505 --port 6000

Launch the llama.cpp server:

./build/bin/llama-server -m models/Devstral-Small-2505-Q4_K_M.gguf --port 8080

Use the servers

Here is a code snippet demonstrating how to use the new features:

import requests

mistral_common_url = "http://127.0.0.1:6000"
llama_cpp_url = "http://127.0.0.1:8080"

def tokenize(messages, url):
    # Render the chat messages with mistral-common and return the prompt token IDs.
    response = requests.post(f"{url}/tokenize/messages", json=messages)
    return response.json()

def detokenize(tokens, url):
    # Convert generated token IDs back into plain text.
    response = requests.post(f"{url}/detokenize", json={"tokens": tokens})
    return response.json()

def detokenize_message(tokens, url):
    # Convert generated token IDs into a structured assistant message.
    response = requests.post(f"{url}/detokenize", json={"tokens": tokens, "as_message": True})
    return response.json()

def generate(tokens, url):
    # Ask llama-server's /completions route for a completion from raw token IDs;
    # return_tokens=True makes it return token IDs for mistral-common to detokenize.
    response = requests.post(f"{url}/completions", json={
        "prompt": tokens,
        "stream": False,
        "return_tokens": True
    })
    return response.json()

messages = [
    {"role": "system", "content": "You are Devstral a cool coding agent that can help users with their coding needs."},
    {"role": "user", "content": "Who are you and what can you do?"}
]

tokens = tokenize(messages, mistral_common_url)
print(tokens)

generated = generate(tokens, llama_cpp_url)["tokens"]
print(generated)

detokenized = detokenize(generated, mistral_common_url)
print(detokenized)

detokenized_message = detokenize_message(generated, mistral_common_url)
print(detokenized_message)

Feedback and Contributions

We believe these changes will significantly improve the integration of Mistral models with llama.cpp and provide a better experience for our users. We welcome any feedback or suggestions to further enhance this integration. Also, as we have little experience with the llama.cpp codebase, we welcome any help to improve the integration and make sure we respect the codebase and the community.

github-actions bot added the python (python script changes) label on Jul 17, 2025
@ggerganov (Member)

Thanks for the contribution. From a developer perspective, it looks like a good approach to avoid any potential tokenization / formatting problems. In general, for all models, using a reference tokenizer instead of relying on llama.cpp is always recommended. From a usability standpoint, the requirement to start a separate tokenization server is a bit of a drawback, but I understand that correctness is of higher importance.

My understanding is that most chat template problems occur during the early days of the model release, and with time tend to get polished and fixed. So this approach would be a stable alternative during such periods of instability.

@ehoogeveen-medweb

IIRC Mistral's architecture also makes use of sliding window attention (SWA), defaulting to a window size of 4096 tokens - though I don't know all the details (like which layers, if any, are full layers). It would be great if the window size could be stored in the GGUF file as well (e.g. as mistral.attention.sliding_window), and the model could eventually be hooked into llama.cpp's SWA support.
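
For reference, a rough sketch of how the window size could be recorded at conversion time, assuming gguf-py's GGUFWriter.add_sliding_window helper and a converter that exposes the model hyperparameters as a dict (the function and variable names here are purely illustrative, not code from this PR):

import gguf

def write_swa_metadata(gguf_writer: gguf.GGUFWriter, hparams: dict) -> None:
    # Illustrative only: record the SWA window size (e.g. 4096 for older
    # Mistral models) so it lands in the GGUF header as
    # "<arch>.attention.sliding_window", where llama.cpp's SWA support
    # could eventually pick it up.
    sliding_window = hparams.get("sliding_window")
    if sliding_window is not None:
        gguf_writer.add_sliding_window(sliding_window)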

juliendenize force-pushed the mistral_integration branch from b809a96 to 2865a25 on July 23, 2025
@juliendenize (Author)

Hey guys, many apologies for the late answer, and thanks a lot for your feedback.

@ggerganov

My understanding is that most chat template problems occur during the early days of the model release, and with time tend to get polished and fixed. So this approach would be a stable alternative during such periods of instability.

Exactly. What's cool with llama.cpp is that you support passing Jinja templates when serving, so people can use them once they are correct, if they want, and drop the mistral-common server! Very nice feature.

@ehoogeveen-medweb

Mistral's architecture also makes use of sliding window attention (SWA)

This actually only applies to quite old (by deep learning standards ^^) models, so we didn't add support for it. Could it be a subsequent PR?

Regarding the PR:

  • I refactored a bit to remove the Mistral arch; it didn't add value, so we think this is less of a maintainability burden!
  • I tried to make the CI green, but I think some checks cannot pass because we modified gguf-py files and the CI installs the published package, AFAIU. Is that right?

Happy to answer more questions :)

juliendenize marked this pull request as ready for review on July 23, 2025
@CISC (Collaborator) commented Jul 24, 2025

* I tried to make the CI green, but I think some checks cannot pass because we modified gguf-py files and the CI installs the published package, AFAIU. Is that right?

Partially, there's also a pydantic version conflict:
https://github.com/ggml-org/llama.cpp/actions/runs/16474192826/job/46571944794?pr=14737#step:4:291

@CISC (Collaborator) commented Jul 24, 2025

@juliendenize Please undo all the formatting/style changes, they are not relevant and add too much noise to the PR, will review afterwards. :)

@juliendenize (Author)

Partially, there's also a pydantic version conflict:

Arf, would it be OK to bump the Pydantic requirement on your side, or is that a no-go? Was there a particular reason to stay at 2.6?

@juliendenize (Author)

@juliendenize Please undo all the formatting/style changes, they are not relevant and add too much noise to the PR, will review afterwards. :)

Done, sorry about that, my own formatter was on.

Is there a formatter or linter available for Python? I didn't find one in the contributing guidelines.
I've installed flake8, but it didn't flag anything when run locally.

@CISC (Collaborator) commented Jul 24, 2025

Arf, would it be OK to bump the Pydantic requirement on your side, or is that a no-go? Was there a particular reason to stay at 2.6?

Yes, I think it's OK; it's probably just the version that was available at the time.

Is there a formatter or linter available for Python? I didn't find one in the contributing guidelines. I've installed flake8, but it didn't flag anything when run locally.

We don't use a Python formatter, only flake8 linting.

@CISC (Collaborator) commented Jul 24, 2025

Pillow conflict, should be fine to update:
https://github.com/ggml-org/llama.cpp/actions/runs/16497338367/job/46646389345?pr=14737#step:4:305

@CISC (Collaborator) commented Jul 24, 2025

Right, now we are getting somewhere. :)
https://github.com/ggml-org/llama.cpp/actions/runs/16497849941/job/46648060083?pr=14737

Edit: The unbound errors are clearly handled at init and can be silenced by # pyright: ignore[reportPossiblyUnboundVariable]
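
For anyone unfamiliar with that suppression, a minimal, self-contained illustration (not code from this PR) looks like this:

import random

# pyright flags `value` as possibly unbound because the assignment sits inside
# a conditional, even though in the real code the condition always holds at init.
if random.random() >= 0.0:  # always true, but pyright cannot prove it
    value = 42

print(value)  # pyright: ignore[reportPossiblyUnboundVariable]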

@am17an (Collaborator) commented Jul 24, 2025

@juliendenize do you also plan to make changes to convert_mistral_to_gguf.py to have mappings for audio_tower.* for the mmproj, I guess it will be necessary for the new voxtral models?

@juliendenize (Author)

Right, now we are getting somewhere. :) https://github.com/ggml-org/llama.cpp/actions/runs/16497849941/job/46648060083?pr=14737

Edit: The unbound errors are clearly handled at init and can be silenced by # pyright: ignore[reportPossiblyUnboundVariable]

Tried to make things cleaner, sorry for the back and forth.

@juliendenize (Author)

@juliendenize do you also plan to make changes to convert_mistral_to_gguf.py to have mappings for audio_tower.* for the mmproj, I guess it will be necessary for the new voxtral models?

That would be cool indeed. I didn't personally work on Voxtral (the model itself), so I might need some assistance, as I lack experience with audio models.

Is Voxtral already supported by llama.cpp? I assumed not, for now.

@am17an (Collaborator) commented Jul 24, 2025

@juliendenize do you also plan to make changes to convert_mistral_to_gguf.py to have mappings for audio_tower.* for the mmproj, I guess it will be necessary for the new voxtral models?

That would be cool indeed. I didn't personally work on Voxtral (the model itself), so I might need some assistance, as I lack experience with audio models.

Is Voxtral already supported by llama.cpp? I assumed not, for now.

Yeah not for now, but I was trying to add support and ran into issues converting to GGUF. But that should be easy to add after this PR is merged, so don't worry about it for now :)

@juliendenize (Author)

Ok so this: https://github.com/ggml-org/llama.cpp/actions/runs/16500995835/job/46660394829?pr=14737

is actually expected, because we haven't merged the corresponding PR in mistral-common yet:
Add a FastAPI app #113

We're in the process of merging it; I'm just adding a final feature, which is being able to call /v1/chat/completions to directly call the inference server (in this case llama.cpp !!). I'm moving as fast as possible on this.
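
For context, a rough sketch of what calling that route from a client could look like once the feature lands; the route and payload shape are assumptions until Add a FastAPI app #113 is merged:

import requests

# Hypothetical client call; the route and payload are assumptions until the
# mistral-common FastAPI PR (#113) is merged.
mistral_common_url = "http://127.0.0.1:6000"

response = requests.post(
    f"{mistral_common_url}/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Who are you and what can you do?"}
        ],
        "stream": False,
    },
)
print(response.json())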

@CISC (Collaborator) commented Jul 24, 2025

We're in the process of merging it; I'm just adding a final feature, which is being able to call /v1/chat/completions to directly call the inference server (in this case llama.cpp !!). I'm moving as fast as possible on this.

Ok, ping me when you're ready.

@ngxson (Collaborator) commented Jul 24, 2025

I've just had a deeper look into this PR. One concern though: most of the code inside convert_mistral_to_gguf.py is copied from convert_hf_to_gguf.py, which can make it a bit tricky to maintain in the long term, especially the code for multimodal model conversion.

Just thinking, maybe it's better to bring it right into convert_hf_to_gguf.py? AFAIU most of the complicated code in this PR is dedicated to converting the tokenizer to GGUF.

Btw, I'm also working on converting Voxtral to GGUF. I thought that would be simple, but I'm currently stuck at the tokenizer. Trying a quick hack to copy some code from this PR... will see if it works.

@ngxson (Collaborator) commented Jul 24, 2025

Ok, so as demoed in #14862, I think it might be better to merge everything into convert_hf_to_gguf. This has two big advantages:

  • Easier long-term maintenance, since we will have less duplicated code
  • Less confusion for end users: users who are not very familiar with llama.cpp may not understand that they need a dedicated script for Mistral models (or models fine-tuned from Mistral)

@CISC (Collaborator) commented Jul 24, 2025

Ok, so as demoed in #14862, I think it might be better to merge everything into convert_hf_to_gguf.

Sounds good to me.
