
Feature Request: the REST API does not allow retrieving chat completions as raw tokens #15731

@nicholas-johnson-techxcel

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Without raw tokens, we cannot properly use models like gpt-oss because the de-tokenisation ruins the Harmony schema, making it impossible for us to cleanly parse the output.

Alternatively, the server could parse the Harmony output from the model itself and emit deltas with role=thinking and role=assistant (or, more generally, role=<channel_name>), with the Harmony frames omitted.
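To illustrate that second option, the streamed chunks might look something like this (a sketch only; the chunk shape and role names are assumptions modelled on the OpenAI streaming format, not an existing llama-server output):

# Sketch of channel-tagged streaming deltas, with the Harmony frame
# tokens (<|start|>, <|channel|>, <|message|>, <|end|>, ...) stripped out.
# The shapes and role names here are illustrative assumptions.
chunks = [
    {"choices": [{"delta": {"role": "thinking", "content": "The user asked ..."}}]},
    {"choices": [{"delta": {"role": "assistant", "content": "Hello!"}}]},
]
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    print(delta["role"], delta["content"], sep=": ")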

I checked this is up to date:

$ llama-server --version
version: 6100 (65c797c4)
built with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0

Motivation

We need the gpt-oss models to work properly, and since neither Ollama nor llama.cpp seems to have implemented Harmony, we need the tools to make it work ourselves. Hence raw tokens, please. Concretely, I am tripping over the fact that the detokenised stream does not include some of the Harmony markers (for example, <|end|> is missing), which trips up the openai_harmony Python library when stream decoding.
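For context, this is a minimal sketch of how the openai_harmony streaming parser is meant to be fed, assuming the package's StreamableParser API as documented in openai/harmony; it consumes token IDs, not text, which is why a detokenised stream with missing markers cannot be used:

# Minimal sketch of stream-decoding gpt-oss output with openai_harmony.
# The parser consumes raw token IDs, so a detokenised text stream that
# drops markers like <|end|> cannot be fed to it.
from openai_harmony import (
    HarmonyEncodingName,
    Role,
    StreamableParser,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
parser = StreamableParser(encoding, role=Role.ASSISTANT)

# raw_tokens would be the token IDs streamed back by llama-server --
# exactly what this feature request is asking for.
raw_tokens = []  # e.g. token IDs produced by the model
for token in raw_tokens:
    parser.process(token)
    # current_channel is e.g. "analysis" or "final";
    # last_content_delta is the newly decoded text, if any.
    if parser.last_content_delta:
        print(parser.current_channel, parser.last_content_delta, sep=": ")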

Possible Implementation

Ollama can already dump raw tokens from llama.cpp via the C API; the REST endpoint would just need to be changed so that, if we set a "raw" flag in our request, we get back token IDs instead of the current partially de-tokenised response (why is <|start|> let through but not <|end|>?).
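A hypothetical client-side view of what I am asking for (the "raw" flag and the "tokens" response field are assumptions for illustration; llama-server does not currently expose them):

# Hypothetical sketch of the requested behaviour; the "raw" request flag
# and the "tokens" response field are assumed names, not existing options.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "Hello"}],
        "raw": True,  # proposed flag: return token IDs, not detokenised text
    },
)
for choice in resp.json()["choices"]:
    token_ids = choice["message"]["tokens"]  # assumed field name
    # These IDs could then be fed straight into openai_harmony's
    # StreamableParser (see the sketch above).
    print(token_ids)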

Cheers
