Add sentencepiece tokenizer and modify build (Support UTF-8 / Emojis) #66
Conversation
(force-pushed from 342e52f to 3c04dfb)
SentencePiece fails to build on Ubuntu. Of course it does :)
(force-pushed from 39b42fb to 3e2327c)
A simpler approach (maybe?) could be to parse the …
@j-f1 I would caution against that approach, because it is already the approach used here, except that sentencepiece (Python) was used to generate the list of tokens.
There is no tokenization of 篇. It is tokenized by splitting it into UTF-8 chunks. https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=%E7%AF%87&mode=char
So does that mean decomposing it to E7 AF 87? Will you store 255 "code point" tokens? I think that is what happened, but maybe not: token IDs 131-258 all store 0xEFBFBD, which is not 256 distinct entries. Even if we did, how would we determine when to use a normal token vs. the code-point tokens? There is a great deal I don't know about this; maybe I am overlooking something!
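For reference, the three-byte decomposition claimed above is easy to verify; a minimal C++ check (assuming the source file itself is saved as UTF-8):

```cpp
// Print the UTF-8 bytes of 篇 (U+7BC7); expected output: E7 AF 87
#include <cstdio>
#include <string>

int main() {
    const std::string s = "篇"; // stored as UTF-8 in the source file
    for (unsigned char c : s) {
        std::printf("%02X ", c);
    }
    std::printf("\n");
    return 0;
}
```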
SentencePiece stores the following types of tokens (the `Type` enum in its model proto):

- NORMAL pieces learned from the training corpus
- UNKNOWN (the single `<unk>` token)
- CONTROL tokens such as `<s>` and `</s>`
- USER_DEFINED tokens
- BYTE tokens `<0x00>` through `<0xFF>`, used for byte fallback
- UNUSED entries
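Since this PR links against sentencepiece, the type of each vocab entry can be inspected through its C++ API. A rough sketch (the `tokenizer.model` path is an assumption about where the LLaMA tokenizer lives; `IsUnknown`/`IsControl`/`IsByte` are SentencePieceProcessor query methods):

```cpp
// Sketch: dump the id, type, and piece string of every token in the model.
#include <sentencepiece_processor.h>
#include <cstdio>

int main() {
    sentencepiece::SentencePieceProcessor sp;
    if (!sp.Load("tokenizer.model").ok()) { // path is an assumption
        std::fprintf(stderr, "failed to load tokenizer.model\n");
        return 1;
    }
    for (int id = 0; id < sp.GetPieceSize(); ++id) {
        const char * type =
            sp.IsUnknown(id) ? "unknown" :
            sp.IsControl(id) ? "control" :
            sp.IsByte(id)    ? "byte"    : "normal";
        std::printf("%d\t%s\t%s\n", id, type, sp.IdToPiece(id).c_str());
    }
    return 0;
}
```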
I think your changes to buffer output until we have a complete UTF-8 character make sense, but I think we can remove the …
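A minimal sketch of that buffering idea (the helper names here are illustrative, not this PR's actual code): derive the expected sequence length from the lead byte and flush only complete characters.

```cpp
// Sketch of output buffering: hold bytes until they form a complete UTF-8
// sequence, then flush.
#include <cstdio>
#include <string>

// Expected length of a UTF-8 sequence given its lead byte (0 = invalid).
static int utf8_len(unsigned char lead) {
    if (lead < 0x80)         return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // continuation or invalid lead byte
}

// Append a decoded piece, printing only complete characters.
static void emit(std::string & buf, const std::string & piece) {
    buf += piece;
    size_t i = 0;
    while (i < buf.size()) {
        const int len = utf8_len((unsigned char) buf[i]);
        if (len == 0) { i++; continue; }   // drop invalid bytes
        if (i + len > buf.size()) break;   // incomplete: keep buffering
        std::fwrite(buf.data() + i, 1, len, stdout);
        i += len;
    }
    buf.erase(0, i); // keep only the unfinished tail
}
```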
Currently rebuilding the 7B model; will give it a few tests and then open a PR.
Nice @j-f1. I think a general tokenization algorithm would be: attempt a lookup in the token set, and if it doesn't exist, decompose the text into 0xXX tokens. It might make displaying the text more difficult: if you detect a 0xXX token on decode, wait until the next token is not a hex byte and then display it. Hopefully we can tell where to stop, though. Also, this PR has a bug and the prompt crashes, so please don't merge!
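To make the encode-side fallback concrete, here is a rough sketch, assuming a vocab map from piece to id and that the byte tokens `<0x00>`…`<0xFF>` occupy a contiguous id range (both assumptions, not this PR's actual code):

```cpp
// Sketch of byte fallback during encoding: try the vocab first; otherwise
// decompose the piece into one-byte tokens.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<int32_t> encode_piece(
        const std::unordered_map<std::string, int32_t> & vocab,
        int32_t byte_token_base, // id of <0x00>; <0xXX> assumed contiguous
        const std::string & piece) {
    std::vector<int32_t> out;
    auto it = vocab.find(piece);
    if (it != vocab.end()) {
        out.push_back(it->second);              // e.g. "you" -> [512]
    } else {
        for (unsigned char c : piece) {
            out.push_back(byte_token_base + c); // e.g. -> [55, 63, 99]
        }
    }
    return out;
}
```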
For tokenizing, I think that will happen automatically, assuming the …
@j-f1 "which could end up being one of the one-byte tokens" -> which could end up being multiple of the one-byte tokens. Encoding "you" could end up with [512], but if not found, it would wind up as [55, 63, 99] (numbers made up). |
(force-pushed from 6dd03a8 to ce7ebb3)
Once llama.cpp gets Python bindings (#82), things like interaction, tokenization, and even sampling will be easier to develop and test in Python. I'm thinking that making the core of llama.cpp text-agnostic would allow more flexibility with less feature creep.
@Piezoid Great points. I think maybe the Python integration should go up to ggml as a library; wouldn't that make more sense? But yes, I'd agree with you there. This problem has actually been solved in two other PRs, which I believe are cleaner and simpler: the tokens are stored in the model files and plugged in by the scripts that quantize, etc. The cleanest PR, I believe, is here: …
Resolved in #79
Resolves #11
Resolves #63
I think it's best to add sentencepiece, which is small and easy to build. Here is a PR, but we could still hack something together instead. The changes to main.cpp are still relevant in either case.