Eval bug: UGM tokenizer sometimes outputs wrong tokens/in the wrong order #13725

@CISC

Models

nomic-embed-text-v2-moe

Problem description & steps to reproduce

./llama-tokenize -m nomic-embed-text-v2-moe.bf16.gguf -p ".??????"

llama-tokenize outputs the tokens in a different order than transformers, which produces:

     0 -> '<s>'
     6 -> ' '
     5 -> '.'
    32 -> '?'
 85908 -> '?????'
     2 -> '</s>'

First Bad Commit

No response

Relevant log output

     0 -> '<s>'
     6 -> ' '
     5 -> '.'
 85908 -> '?????'
    32 -> '?'
     2 -> '</s>'

With only -p "??????" (no leading "."), the output is correct:

     0 -> '<s>'
   705 -> ' ?'
 85908 -> '?????'
     2 -> '</s>'
