BPE Tokenizer doesn't have model > vocab. #5180


Closed
likejazz opened this issue Jan 29, 2024 · 2 comments

Comments

@likejazz
Contributor

likejazz commented Jan 29, 2024

In a recent patch (https://github.com/ggerganov/llama.cpp/blame/d2f650cb5b04ee2726663e79b47da5efe196ce00/convert.py#L337), you imported the vocab list from self.bpe_tokenizer['model']['vocab'], which was originally taken from the vocab.json file. However, the BPE tokenizer's vocab.json file does not have a model > vocab structure: it contains no other metadata and consists only of a vocabulary list.

So, in my opinion, line 337 should be modified as follows:

self.vocab = self.bpe_tokenizer

I hope this helps. Thanks.

@ggerganov
Member

If I modify it like this, it will stop working for all the other cases that do have ["model"]["vocab"]. You can probably add a check to see which of the two layouts is present and use that one.
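A minimal sketch of such a check, outside the thread's actual patch (the function name and the loader shape here are assumptions for illustration, not the real convert.py code):

```python
def load_bpe_vocab(tokenizer_json: dict) -> dict:
    """Return the token -> id mapping from either tokenizer file layout."""
    # tokenizer.json style: vocabulary nested under ["model"]["vocab"]
    model = tokenizer_json.get("model")
    if isinstance(model, dict) and "vocab" in model:
        return model["vocab"]
    # plain vocab.json style: the file itself is the token -> id mapping
    return tokenizer_json

# Both layouts yield the same mapping:
nested = {"model": {"vocab": {"hello": 0, "world": 1}}}
flat = {"hello": 0, "world": 1}
```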

@likejazz
Contributor Author

OK, I'll send you a little patch for that.
