new tokenizer-verifier tool to check gguf tokenizer parameters #6988
base: master
Conversation
Here is what it looks like for a model with a BPE tokenizer that can tokenize anything:
And for another where this fails because of an incomplete (or misconfigured?) tokenizer:
Note that recent changes have made it very slow for llama3 (or maybe it's my gguf file?).
This program verifies that a given gguf model file can tokenize all potential valid characters. Since llama.cpp currently raises an exception when tokenization is not possible[1], this tool helps verify that valid ASCII and UTF-8 input will always be properly tokenized. [1] ggml-org#2580
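For illustration, here is a minimal sketch of what such a verifier loop could look like. It assumes the llama.h C API of this period (llama_backend_init, llama_model_default_params with vocab_only, llama_load_model_from_file, and the seven-argument llama_tokenize); the structure and names are illustrative, not the PR's actual source:

```cpp
// tokenizer verifier sketch (illustrative, not the PR's actual code)
#include <cstdint>
#include <cstdio>
#include <exception>
#include <string>
#include <vector>

#include "llama.h"

// Encode a single Unicode code point as UTF-8
static std::string codepoint_to_utf8(uint32_t cp) {
    std::string s;
    if (cp <= 0x7F) {
        s += (char)  cp;
    } else if (cp <= 0x7FF) {
        s += (char) (0xC0 |  (cp >> 6));
        s += (char) (0x80 |  (cp        & 0x3F));
    } else if (cp <= 0xFFFF) {
        s += (char) (0xE0 |  (cp >> 12));
        s += (char) (0x80 | ((cp >> 6)  & 0x3F));
        s += (char) (0x80 |  (cp        & 0x3F));
    } else {
        s += (char) (0xF0 |  (cp >> 18));
        s += (char) (0x80 | ((cp >> 12) & 0x3F));
        s += (char) (0x80 | ((cp >> 6)  & 0x3F));
        s += (char) (0x80 |  (cp        & 0x3F));
    }
    return s;
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true; // only the tokenizer is needed, skip the tensor data

    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "error: failed to load '%s'\n", argv[1]);
        return 1;
    }

    int n_fail = 0;
    std::vector<llama_token> tokens(16); // a single code point yields at most a few tokens

    // start at U+0001: U+0000 cannot be passed through a C string
    for (uint32_t cp = 1; cp <= 0x10FFFF; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            continue; // UTF-16 surrogate halves are not valid code points
        }
        const std::string text = codepoint_to_utf8(cp);
        try {
            const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                         tokens.data(), (int) tokens.size(),
                                         /*add_special*/ false, /*parse_special*/ false);
            if (n < 0) {
                printf("cannot tokenize U+%04X\n", (unsigned) cp);
                n_fail++;
            }
        } catch (const std::exception & e) {
            // the current tokenizer implementation throws when it cannot
            // tokenize the input (see ggml-org#2580)
            printf("cannot tokenize U+%04X: %s\n", (unsigned) cp, e.what());
            n_fail++;
        }
    }

    printf("%d code point(s) failed\n", n_fail);

    llama_free_model(model);
    llama_backend_free();
    return n_fail == 0 ? 0 : 1;
}
```

Checking each code point individually keeps the test exhaustive, but it also means any regression in tokenizer speed becomes very visible here, which may be related to the llama3 slowness noted above.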
I noticed that the other tools in the examples folder have a README.md; I think we need one for this as well, since it's a tool with a specific purpose.
Also, it doesn't seem to handle missing models well and ends up segfaulting.
This PR might need a bit more adjusting.
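The missing-model segfault suggests an unchecked load. A hedged sketch of the kind of guard that would avoid it, assuming the llama.h load API (the helper name load_vocab_only is hypothetical, not the PR's fix):

```cpp
#include <cstdio>

#include "llama.h"

// Load only the vocabulary/tokenizer, failing gracefully on a missing or corrupt file.
static llama_model * load_vocab_only(const char * path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true;

    llama_model * model = llama_load_model_from_file(path, mparams);
    if (model == NULL) {
        // report and let the caller exit, instead of dereferencing NULL later
        fprintf(stderr, "error: failed to load model '%s'\n", path);
    }
    return model;
}
```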
Thanks a lot for your review @mofosyne, I'll add a README in the next iteration.
I don't think it hung, but it illustrates the issue I talked about earlier:
Something made the tokenization very slow, but I don't know what. I might bisect it to find the issue.
If you look at the code, it loads the model in a very simple and straightforward way, just like the …