Fix gemma2 tokenizer convert #8244
Conversation
@ngxson this is closer to the HF tokenizer in my tests, however when trying out the CLI / server I've noticed that newline generation seems broken (newlines don't occur at all).
@abetlen Thanks for testing that. I've just tried on my side, and I can confirm that the new line is also broken.
Edit: sorry, I made a mistake. The output new line token is correct (token ID 108); I'm investigating this further.
@abetlen Turns out the new line token and all tokens after ID 108 are marked as control, while they should be normal tokens. I fixed my code and it should work correctly now (the idea is sketched after this comment):
Tokenized (main.log):
I also split the code into
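For context, here is a minimal sketch of the idea behind the fix. The enum mirrors the `SentencePieceTokenTypes` constants used by the convert script, but the function and the vocab size are illustrative assumptions, not the actual patch:

```python
from enum import IntEnum

class SentencePieceTokenTypes(IntEnum):
    # mirrors the constants used by the convert script / gguf
    NORMAL = 1
    UNKNOWN = 2
    CONTROL = 3

def assign_token_types(vocab_size: int, num_reserved: int = 108) -> list[int]:
    # Gemma-2 reserves IDs 0-107 for special tokens (<pad>, <bos>, <eos>,
    # <unusedX>, <start_of_turn>, <end_of_turn>, ...). Everything from
    # ID 108 onward, including the newline token at ID 108, must stay
    # NORMAL, otherwise generation never emits newlines.
    toktypes = [int(SentencePieceTokenTypes.NORMAL)] * vocab_size
    for i in range(num_reserved):
        toktypes[i] = int(SentencePieceTokenTypes.CONTROL)
    return toktypes

toktypes = assign_token_types(256000)
assert toktypes[108] == SentencePieceTokenTypes.NORMAL  # "\n" stays a normal token
```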
this seems like the correct one. It even properly tokenizes the prompt noted in the discussion, splitting "discards" into "disc" and "ards". @ggerganov if you want to take a look
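For reference, a quick way to cross-check this against the upstream tokenizer; the use of `transformers` and the model id are my assumptions, not something given in this thread:

```python
# Hypothetical cross-check (requires `transformers` and access to the
# gated google/gemma-2-9b repo on the Hugging Face Hub).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")
ids = tok.encode("discards", add_special_tokens=False)
print([tok.decode([i]) for i in ids])  # expected per the discussion: ['disc', 'ards']
```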
* fix gemma2 tokenizer convert
* remove scores
* improve code, fix new line issue
```python
for i in range(108):
    # including <unusedX>, <start_of_turn>, <end_of_turn>
    toktypes[i] = SentencePieceTokenTypes.CONTROL
self.gguf_writer.add_tokenizer_model("llama")
```
I know it’s merged, and it’s a nitpick, and it ignores the rule of three, but especially for someone with little understanding of what the code is actually doing (it’s all magic to me), it would benefit from a separate method for this loop and the sequence of calls on lines 582-586 (a rough sketch is below).
While removing a small duplication, it could also serve as a helper for understanding what the code does. And someday someone may fix something in one place and not the other.
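A rough sketch of the suggested refactor; the function name and the exact writer calls are hypothetical, since the real convert script keeps this inline:

```python
# Hypothetical helper extracted from the convert script's Gemma-2 vocab setup.
from gguf import GGUFWriter, TokenType

def add_gemma_vocab(writer: GGUFWriter, tokens: list[bytes], toktypes: list[int]) -> None:
    # Gemma reserves token IDs 0-107 for control tokens,
    # including <unusedX>, <start_of_turn>, <end_of_turn>.
    for i in range(108):
        toktypes[i] = TokenType.CONTROL
    writer.add_tokenizer_model("llama")
    writer.add_token_list(tokens)
    writer.add_token_types(toktypes)
```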
I think you’re new to the code base. If you have a look at the other parts of the file, there are even more duplications. That’s not because we don’t care about this, but because duplication sometimes makes it more visible what the code does.
Ref comment:
The output model is capable of tokenizing special tokens (used in chat templates):
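As a hedged sketch, one way to check this, assuming llama-cpp-python and a local GGUF file (both the package choice and the model filename are my assumptions, not from the PR):

```python
# Sketch only: package (llama-cpp-python) and model filename are assumptions.
from llama_cpp import Llama

llm = Llama(model_path="gemma-2-9b-q8_0.gguf", vocab_only=True)
ids = llm.tokenize(b"<start_of_turn>user\nhello<end_of_turn>",
                   add_bos=False, special=True)
print(ids)  # <start_of_turn> / <end_of_turn> should each map to a single token ID
```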
Perplexity is also improved, from 8.9711 to 7.8952 (I'm using q8_0 because the Colab notebook does not have enough VRAM for f16).