
Fix gemma2 tokenizer convert #8244


Merged: 3 commits into ggml-org:master on Jul 1, 2024
Conversation

ngxson (Collaborator) commented Jul 1, 2024

Ref comment:

The output model is capable of tokenizing special tokens (used in chat templates):

test.txt:

<start_of_turn>user
hello discards HOW are you<end_of_turn>
$ ./llama-tokenize -m ggml-model-f16.gguf -f test.txt

...

     2 -> '<bos>'
   106 -> '<start_of_turn>'
  1645 -> 'user'
   108 -> '
'
 17534 -> 'hello'
  9027 -> ' disc'
  2050 -> 'ards'
 31874 -> ' HOW'
   708 -> ' are'
   692 -> ' you'
   107 -> '<end_of_turn>'
   108 -> '
'
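For comparison, the same prompt can be run through the Hugging Face tokenizer with a short script (a sketch, assuming transformers is installed and the google/gemma-2-9b repo is accessible; not part of this PR):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")
text = "<start_of_turn>user\nhello discards HOW are you<end_of_turn>\n"
for tid in tok.encode(text):
    print(f"{tid:6d} -> '{tok.decode([tid])}'")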

Perplexity is also improved, from 8.9711 to 7.8952 (I'm using q8_0 because the Colab notebook does not have enough VRAM for f16):

$ !./llama.cpp/llama-perplexity -f ./llama.cpp/wikitext-2-raw/wiki.test.raw -ngl 99 -m ./gemma2/ggml-model-q8_0.gguf -c 1024

[277]7.8267,[278]7.8213,[279]7.8372,[280]7.8424,[281]7.8534,[282]7.8573,[283]7.8760,[284]7.8952,
Final estimate: PPL = 7.8952 +/- 0.05648
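
For context, perplexity is the exponential of the mean negative log-likelihood over the evaluated tokens, so lower is better. A minimal illustration (not llama-perplexity's exact chunked evaluation):

import math

def perplexity(nlls):
    # PPL = exp(mean NLL); nlls holds one negative log-likelihood
    # (in nats) per evaluated token.
    return math.exp(sum(nlls) / len(nlls))

print(perplexity([2.0, 2.1, 2.1]))  # mean NLL ~2.07 nats -> PPL ~7.9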

ngxson requested a review from ggerganov (Jul 1, 2024 21:11)
github-actions bot added the python (python script changes) label (Jul 1, 2024)
abetlen (Collaborator) commented Jul 1, 2024

@ngxson this is closer to the HF tokenizer in my tests; however, when trying out the CLI / server, I've noticed that newline generation seems broken (newlines don't occur at all).

[screenshot: CLI/server output showing missing newlines]

ngxson (Collaborator, Author) commented Jul 1, 2024

@abetlen Thanks for testing that. I've just tried on my side, and I can confirm that newline generation is also broken for me.

ngxson (Collaborator, Author) commented Jul 1, 2024

Edit: sorry, I made a mistake. The output newline token is correct (token ID 108); I'm investigating this further.

ngxson (Collaborator, Author) commented Jul 1, 2024

@abetlen Turns out the newline and all tokens after ID 108 were marked as control, while they should be normal tokens. I fixed my code and it should work correctly now (a minimal sketch of the idea follows the tokenized log below):

> list 10 fruits
Here are 10 fruits:

1. Apple
2. Banana
3. Orange
4. Strawberry
5. Grapefruit
6. Mango
7. Pineapple
8. Watermelon
9. Blueberry
10. Raspberry

Tokenized (main.log):

'<start_of_turn>':106, 'user':1645, '':108, 'list':1701, ' ':235248, '1':235274, '0':235276, ' fruits':16803, '<end_of_turn>':107, '':108, '<start_of_turn>':106, 'model':2516, '':108, 'Here':4858, ' are':708, ' ':235248, '1':235274
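
The gist of the fix, as a minimal sketch (the toy token table and special-token set below are illustrative, not the actual conversion code):

from enum import IntEnum

class SentencePieceTokenTypes(IntEnum):  # mirrors the enum used by gguf-py
    NORMAL = 1
    CONTROL = 3

tokens = ["<bos>", "<start_of_turn>", "\n", "hello"]  # toy vocab
special = {"<bos>", "<eos>", "<start_of_turn>", "<end_of_turn>"}

# Only genuinely special tokens are typed CONTROL; "\n" (ID 108 in the
# real vocab) stays NORMAL, which is what un-breaks newline generation.
toktypes = [SentencePieceTokenTypes.CONTROL if t in special
            else SentencePieceTokenTypes.NORMAL
            for t in tokens]
print(toktypes)  # [CONTROL, CONTROL, NORMAL, NORMAL]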

I also split the code into _create_vocab_sentencepiece and _set_vocab_sentencepiece, so the function is easier to reuse; a sketch of the split is below.
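
A minimal sketch of that split (method bodies are simplified stand-ins, not the PR's actual implementation):

class Model:  # simplified; the real class lives in the conversion script
    def _create_vocab_sentencepiece(self):
        # Build (tokens, scores, toktypes) from the SentencePiece model so
        # callers can post-process them before anything is written.
        tokens, scores, toktypes = [], [], []
        ...
        return tokens, scores, toktypes

    def _set_vocab_sentencepiece(self):
        # Thin wrapper: create the vocab, then write it out.
        tokens, scores, toktypes = self._create_vocab_sentencepiece()
        self.gguf_writer.add_tokenizer_model("llama")
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_scores(scores)
        self.gguf_writer.add_token_types(toktypes)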

ngxson requested a review from abetlen (Jul 1, 2024 22:52)
bartowski1182 (Contributor) commented

This seems like the correct one. It even properly tokenizes the prompt containing 'discards' (noted in the discussion) as ' disc' and 'ards'.

@ggerganov if you want to take a look

ngxson merged commit 5fac350 into ggml-org:master on Jul 1, 2024
8 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 2, 2024
* fix gemma2 tokenizer convert

* remove scores

* improve code, fix new line issue
Review thread on the conversion code:

for i in range(108):
    # including <unusedX>, <start_of_turn>, <end_of_turn>
    toktypes[i] = SentencePieceTokenTypes.CONTROL
self.gguf_writer.add_tokenizer_model("llama")

A reviewer commented:
I know it's merged, and it's a nitpick, and it ignores the rule of three, but especially for someone with little understanding of what the code is actually doing (it's all magic to me), it would benefit from a separate method covering this and the sequence of calls on lines 582-586.

While removing a small duplication, it could also serve as a helper for understanding what the code is doing. And someday someone may fix something in one place and not the other.

❤️
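
A sketch of the suggested helper (the name is hypothetical, and 'lines 582-586' refer to the conversion script at review time, not this page):

def _mark_gemma2_control_tokens(self, toktypes):
    # Hypothetical helper, not the merged code: type the first 108 tokens
    # (<unusedX>, <start_of_turn>, <end_of_turn>, ...) as CONTROL.
    for i in range(108):
        toktypes[i] = SentencePieceTokenTypes.CONTROL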

ngxson (Collaborator, Author) replied:

I think you're new to the code base. If you have a look at the other parts of the file, there are even more duplications. Not because we don't care about this, but sometimes duplication makes what the code does more visible.

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Feb 25, 2025
* fix gemma2 tokenizer convert

* remove scores

* improve code, fix new line issue
Labels: python (python script changes)
4 participants