Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Without access to raw tokens, we cannot properly use models like gpt-oss: detokenisation destroys the Harmony schema, making it impossible to cleanly parse the output.
Alternatively, llama-server could parse the Harmony framing itself and emit deltas with role=thinking and role=assistant, or more generally role=<channel_name>, with the Harmony framing tokens stripped.
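For context, here is roughly what client-side decoding could look like if the server streamed token IDs. This is a minimal sketch based on the openai_harmony library's StreamableParser as documented in its README; the raw_tokens list is a stand-in for whatever the server would actually return:

```python
# Sketch: feeding raw token IDs into openai_harmony's streaming parser.
# raw_tokens is a placeholder for token IDs streamed back by llama-server.
from openai_harmony import (
    HarmonyEncodingName,
    Role,
    StreamableParser,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
parser = StreamableParser(encoding, role=Role.ASSISTANT)

raw_tokens: list[int] = []  # placeholder: token IDs from the server

for token in raw_tokens:
    parser.process(token)
    # current_channel is e.g. "analysis" (thinking) or "final" (assistant)
    if parser.current_content_delta:
        print(parser.current_channel, parser.current_content_delta)
```

This only works if every token, including markers like <|end|>, reaches the parser, which is exactly what the current detokenised stream does not guarantee.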
I checked this is up to date:

```
$ llama-server --version
version: 6100 (65c797c4)
built with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
```
Motivation
We need the gpt-oss model to work properly, and since neither ollama nor llama.cpp appears to have implemented Harmony, we need the tools to make it work ourselves. Hence raw tokens, please. I am tripping over the fact that the detokenised stream does not include some of the Harmony markers (for example, <|end|> is missing), which breaks the openai_harmony Python library when decoding the stream.
Possible Implementation
Ollama can already dump raw tokens from llama.cpp through the C API; the REST endpoint just needs a change so that when we set a "raw" flag in the request, we get back token IDs instead of the current partially detokenised response (why let <|start|> through but not <|end|>?).
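To make the proposal concrete, here is a hedged sketch of what the client side might look like. The "raw" flag and the "tokens" response field are hypothetical additions proposed here, not existing llama-server parameters; /completion, "prompt", and "stream" are existing ones:

```python
# Sketch of the proposed API. "raw" and the "tokens" response field are
# hypothetical additions, not existing llama-server parameters.
import json
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Hello",
        "stream": True,
        "raw": True,  # hypothetical: return token IDs, not detokenised text
    },
    stream=True,
)

for line in resp.iter_lines():
    # llama-server streams SSE lines of the form "data: {...}"
    if not line or not line.startswith(b"data: "):
        continue
    chunk = json.loads(line[len(b"data: "):])
    # hypothetical response shape: {"tokens": [1234, 5678], "stop": false}
    print(chunk.get("tokens"))
```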
Cheers