
Add sentencepiece tokenizer and modify build (Support UTF-8 / Emojis) #66



Closed · wants to merge 10 commits

Conversation

@beiller (Contributor) commented Mar 12, 2023

Resolves #11

Resolves #63

I think it's best to add sentencepiece, which is small and easy to build. Here is a PR, but we could still hack something together instead. The changes to main.cpp are relevant in either case.

@beiller force-pushed the feature/tokenization branch from 342e52f to 3c04dfb on March 12, 2023 at 23:36
@beiller (Contributor, Author) commented Mar 12, 2023

Sentencepiece fails to build on Ubuntu. Of course it does :)

@beiller force-pushed the feature/tokenization branch from 39b42fb to 3e2327c on March 12, 2023 at 23:49
@j-f1 (Collaborator) commented Mar 12, 2023

A simpler approach (maybe?) could be to parse the tokenizer.model file using protobuf in Python and then generate the tokens appropriately. Working on that now!

@beiller (Contributor, Author) commented Mar 13, 2023

@j-f1 I would caution against that approach, because it is essentially the approach already used here, except that sentencepiece (in Python) was used to generate the list of tokens.

import sentencepiece as spm
# assumes the LLaMA tokenizer.model file is in the working directory
tokenizer = spm.SentencePieceProcessor(model_file="tokenizer.model")
result1 = tokenizer.encode("篇")
print(f'token: {result1}')
# prints: token: [29871, 234, 178, 138]

There is no single token for 篇; it is tokenized by splitting it into UTF-8 byte chunks.

https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=%E7%AF%87&mode=char
Info:

Character | 篇
Hex code point | 7BC7
Decimal code point | 31687
Hex UTF-8 bytes | E7 AF 87

So does that mean it decomposes to E7 AF 87? Would you store 256 "byte" tokens, one per value 0-255? I think that is what happened, but maybe not: token IDs 131-258 all stored 0xEFBFBD (the replacement character), and that range is not 256 entries. Even if we did store them, how would we determine whether to use a normal token vs. the byte tokens? There is a great deal I don't know about this; maybe I am overlooking something!
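
For what it's worth, the encode() output above is consistent with a byte-fallback layout in which byte tokens start at ID 3, so byte b maps to token 3 + b. That offset is an assumption here, but the arithmetic checks out:

# quick check, assuming byte tokens occupy IDs 3..258 (<0x00>..<0xFF>)
BYTE_TOKEN_BASE = 3  # hypothetical offset after <unk>, <s>, </s>
for b in "篇".encode("utf-8"):
    print(hex(b), BYTE_TOKEN_BASE + b)
# 0xe7 234
# 0xaf 178
# 0x87 138  -> matches the [234, 178, 138] tail of the encode() output above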

@j-f1 (Collaborator) commented Mar 13, 2023

SentencePiece stores the following types of tokens (see the snippet below for a quick way to inspect them):

  • normal tokens. These appear to be valid UTF-8 text snippets (including both ASCII and non-ASCII text), e.g. reate and ▁end (they use ▁ to represent a space for some reason)
  • the <unk> token (rendered as ⁇). Not sure how this is used.
  • <s> and </s> tokens to represent the start/end of text
  • <0xXX> tokens (e.g. <0xE7><0xAF><0x87> for 篇) for all byte values 0-255
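
A quick way to dump those pieces with the sentencepiece Python package; this assumes tokenizer.model is in the working directory, and the IDs in the comments are what the output above implies rather than something re-verified here:

import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.id_to_piece(0))      # <unk>
print(sp.id_to_piece(1))      # <s>
print(sp.id_to_piece(2))      # </s>
print(sp.id_to_piece(3))      # <0x00>; the byte tokens run through <0xFF>
print(sp.id_to_piece(29871))  # the ▁ space marker seen in encode("篇") above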

@j-f1 (Collaborator) commented Mar 13, 2023

I think your changes to buffer output until we have a complete UTF-8 character make sense, but I think we can remove the sentencepiece dependency entirely by vendoring sentencepiece_model.proto and parsing tokenizer.model in convert-pth-to-ggml.py.
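
A rough sketch of that idea, assuming a sentencepiece_model_pb2 module has been generated from the vendored sentencepiece_model.proto with protoc:

import sentencepiece_model_pb2  # hypothetical module generated by protoc

# read the vocab straight out of tokenizer.model, no sentencepiece runtime needed
m = sentencepiece_model_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())
for i, piece in enumerate(m.pieces):  # each entry carries .piece and .score
    print(i, repr(piece.piece), piece.score)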

@j-f1 (Collaborator) commented Mar 13, 2023

Currently rebuilding the 7B model; I'll give it a few tests and then open a PR.

@beiller (Contributor, Author) commented Mar 13, 2023

Nice @j-f1. I think a general tokenization algorithm would be: attempt to look up the text in the token set, and if it doesn't exist, decompose it into <0xXX> tokens.

It might make displaying the text more difficult. If you detect a <0xXX> token on decode, wait until the next token is not a byte token and then display the accumulated bytes. Hopefully we can tell where to stop, though.
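
A minimal sketch of that buffering idea, where token_to_bytes is a hypothetical stand-in for the real piece/byte lookup:

# minimal sketch: buffer raw bytes until they form valid UTF-8
def stream_decode(token_ids, token_to_bytes):
    buf = b""
    for tok in token_ids:
        buf += token_to_bytes(tok)  # e.g. b"\xe7" for <0xE7>
        try:
            text = buf.decode("utf-8")  # raises while the sequence is incomplete
        except UnicodeDecodeError:
            continue  # wait for more byte tokens
        print(text, end="", flush=True)
        buf = b""

(A real implementation would also need to give up after a few bytes if the buffer can never become valid UTF-8.)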

Also, this PR has a bug and the prompt below crashes; please don't merge!

./main -m ./models/13B/ggml-model-q4_0.bin -t 4 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -n 512 -p 关于爱因斯坦的生平。他出生于

@j-f1 (Collaborator) commented Mar 13, 2023

For tokenizing, I think that will happen automatically, assuming the std::string text is UTF-8 encoded (since it picks the longest possible token matching the current string, which could end up being one of the one-byte tokens). For decoding, I think a version of your code in this PR could work (where you check for an invalid UTF-8 sequence before printing).

@beiller (Contributor, Author) commented Mar 13, 2023

@j-f1 "which could end up being one of the one-byte tokens" -> which could end up being multiple of the one-byte tokens. Encoding "you" could end up with [512], but if not found, it would wind up as [55, 63, 99] (numbers made up).

@beiller force-pushed the feature/tokenization branch from 6dd03a8 to ce7ebb3 on March 13, 2023 at 01:46
@beiller mentioned this pull request on Mar 13, 2023
@beiller changed the title from "Add sentencepiece tokenizer and modify build" to "Add sentencepiece tokenizer and modify build (Support UTF-8 / Emojis)" on Mar 13, 2023
@Piezoid (Contributor) commented Mar 13, 2023

Once llama.cpp gets Python bindings (#82), things like interaction, tokenization, and even sampling become easier to develop and test in Python. I'm thinking that making the core of llama.cpp text-agnostic would allow more flexibility with less feature creep.

@beiller (Contributor, Author) commented Mar 13, 2023

@Piezoid great points. I think maybe the Python integration should live at the level of ggml as a library; wouldn't that make more sense? But yes, I'd agree with you there. This problem has actually been solved in two other PRs, which I believe are cleaner and simpler: the tokens are stored in the model files and plugged in by the scripts that quantize etc.

The cleanest PR, I believe, is here:

#79

@ggerganov (Member) commented

Resolved in #79

Successfully merging this pull request may close these issues:

  • Prompt interrupted before continuation for Unicode UTF-8 emojis
  • Unicode support