Bug: Gemma2 tokenization seems incorrect. #8349

@AUTOMATIC1111

Description

What happened?

`tokenizer.json` from Gemma2 defines this token: `"[toxicity=0]": 255968`.

When tokenizing that text with llama.cpp, we get `[235309, 1373, 235293, 235276, 235307]`.

If I ask llama.cpp's Gemma2 to repeat this text, `[toxicity=0]`, it does so effortlessly.

If I ask the corporate-hosted Gemma2 to repeat it, it fails, responding as if there were no text there:

(screenshot of the hosted model's response)
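The mismatch can be sketched with a toy tokenizer (this is an illustration, not llama.cpp's actual implementation): the five piece IDs are the ones reported above, and when `"[toxicity=0]": 255968` is also registered in the vocabulary, greedy longest-prefix matching collapses the whole string to that single ID, which is what the hosted model appears to expect.

```python
# Toy illustration of the reported mismatch. The five piece IDs are taken
# from the llama.cpp output quoted above; matching is greedy longest-prefix
# over a tiny vocabulary (real tokenizers are more involved).

PIECES = {"[": 235309, "toxicity": 1373, "=": 235293, "0": 235276, "]": 235307}
SPECIAL = {"[toxicity=0]": 255968}  # as defined in gemma2's tokenizer.json

def tokenize(text, vocab):
    ids, i = [], 0
    while i < len(text):
        # try the longest matching piece starting at offset i
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no piece matches at offset {i}")
    return ids

# Without the special token registered: five pieces, as llama.cpp produces.
print(tokenize("[toxicity=0]", PIECES))
# With the special token registered: one ID, as the hosted model expects.
print(tokenize("[toxicity=0]", {**PIECES, **SPECIAL}))
```

This suggests the bug is in whether llama.cpp registers `[toxicity=0]` as a special token during vocabulary loading, not in the piece-level tokenization itself.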

Name and Version

version: 3317 (8e55830)
built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

Metadata


Labels

    bug-unconfirmed, medium severity (used to report medium-severity bugs in llama.cpp, e.g. malfunctioning features but still usable)
