
Add sentencepiece tokenizer and modify build (Support UTF-8 / Emojis) #66



Closed · wants to merge 10 commits

Conversation

@beiller (Contributor) commented Mar 12, 2023

Resolves #11

Resolves #63

I think it's best to add sentencepiece, which is small and easy to build. Here is a PR, but we could still hack something together instead. The changes to main.cpp are relevant in either case.

@beiller force-pushed the feature/tokenization branch from 342e52f to 3c04dfb on March 12, 2023 at 23:36
@beiller (Contributor, Author) commented Mar 12, 2023

Sentencepiece fails to build on Ubuntu. Of course it does :)

@beiller force-pushed the feature/tokenization branch from 39b42fb to 3e2327c on March 12, 2023 at 23:49
@j-f1 (Collaborator) commented Mar 12, 2023

A simpler approach (maybe?) could be to parse the tokenizer.model file using protobuf in Python and then generate the tokens appropriately. Working on that now!

@beiller (Contributor, Author) commented Mar 13, 2023

@j-f1 I would caution against that approach, because it is essentially the approach already used here, except that sentencepiece (in Python) was used to generate the list of tokens.

import sentencepiece as spm
# assumes the LLaMA tokenizer.model file is in the working directory
tokenizer = spm.SentencePieceProcessor(model_file="tokenizer.model")
result1 = tokenizer.encode("篇")
print(f'token: {result1}')
# prints: token: [29871, 234, 178, 138]

There is no single token for 篇; it is tokenized by splitting it into UTF-8 byte chunks.

https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=%E7%AF%87&mode=char
Info:

Character | 篇
Hex code point | 7BC7
Decimal code point | 31687
Hex UTF-8 bytes | E7 AF 87

So does that mean it decomposes to E7 AF 87? Would you store 256 "byte" tokens, one per value 0-255? I think that is what happened, but maybe not: token IDs 131-258 all stored 0xEFBFBD (the replacement character), and that range is not 256 entries. Even if we did store them, how would we determine whether to use a normal token vs. the byte tokens? There is a great deal I don't know about this; maybe I am overlooking something!
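
For what it's worth, the encode() output above is consistent with a byte-fallback layout in which byte tokens start at ID 3, so byte b maps to token 3 + b. That offset is an assumption here, but the arithmetic checks out:

# quick check, assuming byte tokens occupy IDs 3..258 (<0x00>..<0xFF>)
BYTE_TOKEN_BASE = 3  # hypothetical offset after <unk>, <s>, </s>
for b in "篇".encode("utf-8"):
    print(hex(b), BYTE_TOKEN_BASE + b)
# 0xe7 234
# 0xaf 178
# 0x87 138  -> matches the [234, 178, 138] tail of the encode() output above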

@j-f1 (Collaborator) commented Mar 13, 2023

SentencePiece stores the following types of tokens (see the snippet below for a quick way to inspect them):

  • normal tokens. These appear to be valid UTF-8 text snippets (including both ASCII and non-ASCII text), e.g. reate and ▁end (they use ▁ to represent a space for some reason)
  • the <unk> token (rendered as ⁇). Not sure how this is used.
  • <s> and </s> tokens to represent the start/end of text
  • <0xXX> tokens (e.g. <0xE7><0xAF><0x87> for 篇) for all byte values 0-255
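
A quick way to dump those pieces with the sentencepiece Python package; this assumes tokenizer.model is in the working directory, and the IDs in the comments are what the output above implies rather than something re-verified here:

import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.id_to_piece(0))      # <unk>
print(sp.id_to_piece(1))      # <s>
print(sp.id_to_piece(2))      # </s>
print(sp.id_to_piece(3))      # <0x00>; the byte tokens run through <0xFF>
print(sp.id_to_piece(29871))  # the ▁ space marker seen in encode("篇") above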

@j-f1 (Collaborator) commented Mar 13, 2023

I think your changes to buffer output until we have a complete UTF-8 character make sense, but I think we can remove the sentencepiece dependency entirely by vendoring sentencepiece_model.proto and parsing tokenizer.model in convert-pth-to-ggml.py.
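
A rough sketch of that idea, assuming a sentencepiece_model_pb2 module has been generated from the vendored sentencepiece_model.proto with protoc:

import sentencepiece_model_pb2  # hypothetical module generated by protoc

# read the vocab straight out of tokenizer.model, no sentencepiece runtime needed
m = sentencepiece_model_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())
for i, piece in enumerate(m.pieces):  # each entry carries .piece and .score
    print(i, repr(piece.piece), piece.score)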

@j-f1 (Collaborator) commented Mar 13, 2023

Currently rebuilding the 7B model; I'll give it a few tests and then open a PR.

@beiller (Contributor, Author) commented Mar 13, 2023

Nice @j-f1. I think a general tokenization algorithm would be: attempt to look up the text in the token set, and if it doesn't exist, decompose it into <0xXX> tokens.

It might make displaying the text more difficult. If you detect a <0xXX> token on decode, wait until the next token is not a byte token and then display the accumulated bytes. Hopefully we can tell where to stop, though.
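
A minimal sketch of that buffering idea, where token_to_bytes is a hypothetical stand-in for the real piece/byte lookup:

# minimal sketch: buffer raw bytes until they form valid UTF-8
def stream_decode(token_ids, token_to_bytes):
    buf = b""
    for tok in token_ids:
        buf += token_to_bytes(tok)  # e.g. b"\xe7" for <0xE7>
        try:
            text = buf.decode("utf-8")  # raises while the sequence is incomplete
        except UnicodeDecodeError:
            continue  # wait for more byte tokens
        print(text, end="", flush=True)
        buf = b""

(A real implementation would also need to give up after a few bytes if the buffer can never become valid UTF-8.)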

Also, this PR has a bug and the prompt below crashes; please don't merge!

./main -m ./models/13B/ggml-model-q4_0.bin -t 4 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -n 512 -p 关于爱因斯坦的生平。他出生于

@j-f1 (Collaborator) commented Mar 13, 2023

For tokenizing, I think that will happen automatically, assuming the std::string text is UTF-8 encoded (since it picks the longest possible token matching the current string, which could end up being one of the one-byte tokens). For decoding, I think a version of your code in this PR could work (where you check for an invalid UTF-8 sequence before printing).

@beiller (Contributor, Author) commented Mar 13, 2023

@j-f1 "which could end up being one of the one-byte tokens" -> which could end up being multiple of the one-byte tokens. Encoding "you" could end up with [512], but if not found, it would wind up as [55, 63, 99] (numbers made up).

@beiller force-pushed the feature/tokenization branch from 6dd03a8 to ce7ebb3 on March 13, 2023 at 01:46
@beiller mentioned this pull request on Mar 13, 2023
@beiller changed the title from "Add sentencepiece tokenizer and modify build" to "Add sentencepiece tokenizer and modify build (Support UTF-8 / Emojis)" on Mar 13, 2023
@Piezoid (Contributor) commented Mar 13, 2023

Once llama.cpp gets Python bindings (#82), things like interaction, tokenization, and even sampling become easier to develop and test in Python. I'm thinking that making the core of llama.cpp text-agnostic would allow more flexibility with less feature creep.

@beiller (Contributor, Author) commented Mar 13, 2023

@Piezoid great points. I think maybe the Python integration should live at the level of ggml as a library; wouldn't that make more sense? But yes, I'd agree with you there. This problem has actually been solved in two other PRs, which I believe are cleaner and simpler: the tokens are stored in the model files and plugged in by the scripts that quantize etc.

The cleanest PR, I believe, is here:

#79

@ggerganov (Member) commented

Resolved in #79

Successfully merging this pull request may close these issues:

  • Prompt interrupted before continuation for Unicode UTF-8 emojis
  • Unicode support