Bug: tab/space mistokenization for gemma spm models #8338

Closed

Opened by @josharian

Description

What happened?

$ ./tokenize codegemma-2b.gguf "                                         test"
[snip]
     2 -> '<bos>'
255970 -> '			'
255970 -> '			'
  2121 -> ' test'
$ echo "                                             test" | spm_encode --model codegemma-2b.model --input /dev/stdin --output_format id
255973 2195
$ echo "255970 255970 2121" | spm_decode --model codegemma-2b.model --input /dev/stdin --input_format id | jq -R .
"\t\t\t\t\t\t test"
$ echo "255973 2195" | spm_decode --model codegemma-2b.model --input /dev/stdin --input_format id | jq -R .
"\t\t\t\t\t\ttest"

Note that the input is six tabs followed by "test", i.e. "\t\t\t\t\t\ttest". Take care not to accidentally use spaces when reproducing.

Note that this is not just inserting a stray space before "test": it also splits the six tabs into two runs of three instead of a single run of six.

Inputs like this (leading indentation followed by text) are very common in code.
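
For anyone reproducing this, the reference behavior can also be checked from Python, which avoids the tab-vs-space copy/paste pitfall. This is a minimal sketch, assuming the sentencepiece package is installed and codegemma-2b.model is in the working directory (the path is a placeholder); the expected ids are the ones spm_encode prints above:

# Minimal reference check with the sentencepiece Python package.
# The model path is a placeholder; point it at the original codegemma-2b.model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="codegemma-2b.model")

text = "\t" * 6 + "test"      # six tabs, then "test" -- no spaces anywhere
ids = sp.encode(text, out_type=int)
print(ids)                    # [255973, 2195] per the spm_encode run above
print(repr(sp.decode(ids)))   # decodes back to exactly '\t\t\t\t\t\ttest'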

There are three issues here:

  • Mismatch between what the model was trained on and how llama.cpp tokenizes the same input. Inserting a space is definitely out of distribution, particularly for languages with strong formatting opinions (Go) or significant whitespace (Python).
  • The llama.cpp tokenizer doesn't round-trip (it inserts an extraneous space); see the check sketched after this list.
  • The llama.cpp tokenizer uses more tokens to represent the input (three instead of two here, ignoring <bos>).
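
To make the round-trip failure concrete, here is one way to check it end to end. This sketch assumes the llama-cpp-python bindings are installed (any other way of calling llama_tokenize/llama_detokenize would work just as well) and that both model files are local; the ids in the comments are the ones from the transcript above:

# Hedged sketch: compare llama.cpp's tokenization of the failing input against
# the SentencePiece reference, and check whether it round-trips.
# Paths are placeholders; llama-cpp-python is assumed to be installed.
import sentencepiece as spm
from llama_cpp import Llama

text = "\t" * 6 + "test"

# Reference tokenizer: the SentencePiece model the GGUF was converted from.
sp = spm.SentencePieceProcessor(model_file="codegemma-2b.model")
assert sp.decode(sp.encode(text, out_type=int)) == text  # spm round-trips exactly

# llama.cpp tokenizer, loaded without weights (vocab_only).
llm = Llama(model_path="codegemma-2b.gguf", vocab_only=True)
ids = llm.tokenize(text.encode("utf-8"), add_bos=False)
roundtrip = llm.detokenize(ids).decode("utf-8")

print("llama.cpp ids:", ids)              # [255970, 255970, 2121] in the report above
print("round-trips? ", roundtrip == text) # False: a stray space shows up before "test"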

Thanks!

Name and Version

$ ./llama-cli --version
version: 3325 (87e25a1d)

(head as of Sat Jul 6 09:22:16 2024 +0200)

What operating system are you seeing the problem on?

Linux, Mac

Relevant log output

No response

Labels

bug-unconfirmed, medium severity
