Bug: Tokenization adds a space to the first non-special token #8584
Comments
This is a known problem with the SentencePiece tokenizer. Different implementations deal with it in different ways, and within llama.cpp the developers have different opinions on the matter; for example, it can be argued that adding the space is the correct behavior. I proposed making this configurable in #3664, but so far we haven't agreed on a path forward. The space is added here: https://github.com/ggerganov/llama.cpp/blob/3d0e4367d99087892e355ddbeebd232a0b2f40de/src/llama.cpp#L16450-L16453
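For illustration, the behavior at those lines boils down to something like the following sketch (simplified, illustrative names, not the actual llama.cpp symbols):

    #include <string>

    // Simplified sketch of the behavior described above: when the vocabulary
    // requests a space prefix, the first non-special text fragment gets " "
    // prepended before SentencePiece tokenization, so "text" is tokenized as
    // if it were " text".
    std::string apply_space_prefix(const std::string & fragment_text,
                                   bool is_first_non_special_fragment,
                                   bool add_space_prefix) {
        if (is_first_non_special_fragment && add_space_prefix) {
            return " " + fragment_text;  // the extra space this issue is about
        }
        return fragment_text;
    }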
To solve this problem I added an enable_space_prefix flag to the tokenize call:
Then I created a separate tokenizer entry which turns off adding spaces:
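The patch snippets themselves are not reproduced here; as a rough, hypothetical illustration only (made-up names, not the actual llama.cpp or LlamaSharp API), such a flag could be threaded through the tokenize path like this:

    #include <string>
    #include <vector>

    // Hypothetical sketch -- the actual patch is not shown in this thread.
    // The idea: an extra flag on the tokenize entry point that skips the
    // automatic space prefix instead of always applying it.
    std::vector<int> tokenize_text(const std::string & text,
                                   bool add_bos,
                                   bool enable_space_prefix) {
        std::string input = text;
        if (enable_space_prefix) {
            input = " " + input;  // current default behavior
        }
        // ... run the SentencePiece tokenizer on `input`, prepending BOS if
        // add_bos is set, and return the resulting token ids ...
        return {};
    }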
It's annoying to maintain this patch, because it breaks whenever anything is touched in this file.
Very interesting! I presume that adding ...
This has already been fixed for new conversions in #8248 (but there's currently another problem, #7897, which prevents conversion of instruct Gemma models; I'll try to fix this soon). Otherwise, you should be able to use ...
It's interesting that the default value for ...
Unfortunately, the KV override does not seem to work. Using ... tokenizes the prompt as
Token ...
Patching the model by adding ...
Interestingly, the patched model ignores the KV override when setting it to true with ... Is this a bug?
Yes, it's addressed in #8614.
Very good! Thank you!
What happened?
When tokenizing text with LlamaSharp, a C# library directly using llama.cpp, a space is added to the first non-special token.
Unfortunately, this changes the prompt templates and even breaks some of my use cases (e.g. stripping the last tokens and then using anti-prompts in LlamaSharp).
In particular, it would be good if the following invariant always applied:
detokenize(tokenize("text", addBos:false)) == "text"
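As a sketch of what that check would look like (hypothetical wrapper names standing in for whatever binding is used, not the actual LlamaSharp or llama.cpp API):

    #include <cassert>
    #include <string>
    #include <vector>

    // Hypothetical wrappers standing in for the binding in use; the
    // signatures are illustrative only.
    std::vector<int> tokenize(const std::string & text, bool add_bos);
    std::string detokenize(const std::vector<int> & tokens);

    void check_round_trip(const std::string & text) {
        // Desired invariant: tokenizing without BOS and detokenizing again
        // should reproduce the input exactly, with no extra leading space.
        const std::string round_trip = detokenize(tokenize(text, /*add_bos=*/false));
        assert(round_trip == text);
    }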
But this is not the case:
llama-tokenize.exe --no-bos -m "gemma-1.1-2b-it.Q6_K.gguf" -p "<pad>text<eos>"
Token
2793 -> ' text'
should have been
1082 -> 'text'
Is there any possibility to fix this behavior, i.e. to tell llama.cpp not to add any spaces when tokenizing and to just process the raw text?
Name and Version
version: 3412 (3807c3d)
built with MSVC 19.29.30154.0 for x64
What operating system are you seeing the problem on?
Windows 10
Relevant log output