What happened?
$ ./tokenize codegemma-2b.gguf " test"
[snip]
2 -> '<bos>'
255970 -> ' '
255970 -> ' '
2121 -> ' test'
$ echo " test" | spm_encode --model codegemma-2b.model --input /dev/stdin --output_format id
255973 2195
$ echo "255970 255970 2121" | spm_decode --model codegemma-2b.model --input /dev/stdin --input_format id | jq -R .
"\t\t\t\t\t\t test"
$ echo "255973 2195" | spm_decode --model codegemma-2b.model --input /dev/stdin --input_format id | jq -R .
"\t\t\t\t\t\ttest"
Note that the input is six tabs followed by "test", i.e. "\t\t\t\t\t\ttest". Take care not to accidentally use spaces when reproducing.
Note that this is not just inserting a stray space before "test": it also breaks the tabs into two sets of 3 instead of a single set of 6.
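For reference, the expected behaviour can also be checked with the sentencepiece Python package (a sketch, assuming the package is installed; it mirrors the spm_encode/spm_decode calls above, and building the input programmatically avoids the tabs-vs-spaces pitfall):

import sentencepiece as spm

# Build the input programmatically so tabs can't silently become spaces.
text = "\t" * 6 + "test"

sp = spm.SentencePieceProcessor(model_file="codegemma-2b.model")
ids = sp.encode(text)
print(ids)                      # [255973, 2195], matching spm_encode above
assert sp.decode(ids) == text   # SentencePiece round-trips the input exactly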
Inputs like this (leading indentation followed by text) happen a lot with code.
There are three issues here (the second and third are checked in the sketch below):
- A mismatch between the tokenization the model was trained on and the one llama.cpp produces. Inserting a space is definitely out-of-distribution, particularly for languages with strong formatting opinions (Go) or significant whitespace (Python).
- The llama.cpp tokenizer doesn't round-trip (it inserts an extraneous space).
- The llama.cpp tokenizer uses more tokens to represent the input.
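To make the second and third points concrete, here is a hypothetical harness (it assumes ./tokenize prints lines of the form "ID -> 'piece'", as in the transcript above, and that id 2 is <bos>):

import re
import subprocess
import sentencepiece as spm

text = "\t" * 6 + "test"

# Reference tokenization from SentencePiece.
sp = spm.SentencePieceProcessor(model_file="codegemma-2b.model")
ref_ids = sp.encode(text)                        # [255973, 2195]

# llama.cpp tokenization, parsed from ./tokenize's "ID -> 'piece'" output.
out = subprocess.run(["./tokenize", "codegemma-2b.gguf", text],
                     capture_output=True, text=True, check=True).stdout
llama_ids = [int(m) for m in re.findall(r"^\s*(\d+) ->", out, re.M)]
llama_ids = [i for i in llama_ids if i != 2]     # drop <bos>

print("token count:", len(ref_ids), "vs", len(llama_ids))  # 2 vs 3
print("round-trips:", sp.decode(llama_ids) == text)        # False: extra space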
Thanks!
Name and Version
$ ./llama-cli --version
version: 3325 (87e25a1d)
(head as of Sat Jul 6 09:22:16 2024 +0200)
What operating system are you seeing the problem on?
Linux, Mac
Relevant log output
No response