convert : refactor vocab selection logic #6355

cebtenzzre · 2024-03-27T22:20:12Z

This PR clears up confusion about the purpose of HfVocab by making it explicit that it is only for LLaMA "SPM" vocabularies in tokenizer.json format, not generic HuggingFace fast tokenizer (tokenizer.json) vocabs. (The one exception is its use for WordPiece, which will be corrected in a follow-up PR.)

PR #5821 fixed some of the confusion as to which files map to which tokenizers, but in adding the automatic fallback to HfVocab it unintentionally caused a few issues.

This PR makes it the job of each vocab class to attempt to load the vocab from the appropriate files, and to fail if tokenizer.json represents the wrong vocab type.
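As a rough illustration of a vocab loader that refuses a tokenizer.json of the wrong type, here is a minimal sketch. The helper name `load_llama_tokenizer_json` is hypothetical, and the check itself (model type "BPE" with `byte_fallback` enabled) is an assumed heuristic for a LLaMA-style vocab, not necessarily the exact condition used in convert.py.

```python
import json
from pathlib import Path


def load_llama_tokenizer_json(base_path: Path) -> dict:
    """Hypothetical helper: parse tokenizer.json, rejecting non-LLaMA-style vocabs."""
    fname = base_path / "tokenizer.json"
    # open() itself raises FileNotFoundError when the file is absent.
    with open(fname, encoding="utf-8") as f:
        tokenizer_json = json.load(f)

    model = tokenizer_json.get("model", {})
    # Assumed heuristic: a LLaMA-style vocab is BPE with byte fallback enabled.
    if model.get("type") != "BPE" or not model.get("byte_fallback", False):
        # Same exception type as a missing file, so a caller can treat
        # "wrong kind of tokenizer.json" like "no suitable file found".
        raise FileNotFoundError(f"{fname} does not contain a LLaMA-style vocab")
    return tokenizer_json
```

Raising the same exception type for a missing file and for a mismatched file lets the calling code move on to the next vocab type without special cases.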

I also changed the Vocab Union to a pair of Protocols to make the API a little more explicit.
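To show what replacing a Union with a pair of Protocols can look like, here is a small sketch. The member names (`tokenizer_model`, `name`, `vocab_size`, `all_tokens`) and the `ToyVocab` class are illustrative assumptions and may not match the attributes in convert.py.

```python
from __future__ import annotations

from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class BaseVocab(Protocol):
    # Members shared by every vocab, including the "no vocab" placeholder.
    tokenizer_model: str
    name: str


class NoVocab(BaseVocab):
    # Placeholder for vocab-less conversion; satisfies BaseVocab only.
    tokenizer_model = "no_vocab"
    name = "no_vocab"


@runtime_checkable
class Vocab(BaseVocab, Protocol):
    # Members only a real vocabulary provides.
    vocab_size: int

    def all_tokens(self) -> Iterator[tuple[bytes, float, int]]: ...


class ToyVocab:
    # Illustrative class; it matches Vocab structurally without inheriting from it.
    tokenizer_model = "llama"
    name = "toy"

    def __init__(self) -> None:
        self.vocab_size = 2

    def all_tokens(self) -> Iterator[tuple[bytes, float, int]]:
        yield b"<s>", 0.0, 3
        yield b"a", -1.0, 1
```

With this split, code that needs tokens can require `Vocab`, while code that only reads the name or model type can accept any `BaseVocab`, including `NoVocab`.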

With these changes, converting e.g. deepseek-llm-7b-chat results in this exception with the default --vocab-type:

FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

And converting with --vocab-type bpe --pad-vocab works as expected.
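The fallback behaviour behind the error above can be sketched as a loop that tries each requested vocab type in order and fails only when none of them loads. The function `select_vocab` and the toy loaders below are hypothetical stand-ins, not the code in convert.py; the file names they check are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable


def load_spm(base_path: Path) -> str:
    # Toy loader: only checks that a sentencepiece model file exists.
    if not (base_path / "tokenizer.model").exists():
        raise FileNotFoundError("tokenizer.model not found")
    return "spm"


def load_hfft(base_path: Path) -> str:
    # Toy loader: a real one would also inspect the contents of tokenizer.json.
    if not (base_path / "tokenizer.json").exists():
        raise FileNotFoundError("tokenizer.json not found")
    return "hfft"


LOADERS: dict[str, Callable[[Path], str]] = {"spm": load_spm, "hfft": load_hfft}


def select_vocab(base_path: Path, vocab_types: list[str]) -> str:
    """Return the first vocab that loads; raise if every type fails."""
    for vocab_type in vocab_types:
        try:
            return LOADERS[vocab_type](base_path)
        except FileNotFoundError:
            continue  # this type does not match; try the next one
    raise FileNotFoundError(
        f"Could not find a tokenizer matching any of {vocab_types}"
    )
```

Because each loader decides for itself whether the files match, the selection loop never has to guess a vocab type from file names alone.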

With #5821, the model would appear to convert successfully with the default --vocab-type but fail at runtime, and --vocab-type bpe did not recognize the model.

Prior to #5821, the presence of tokenizer.json caused convert.py to attempt to load it as a sentencepiece model:

RuntimeError: Internal: could not parse ModelProto from /home/jared/dirs/text-ai-models/dl/deepseek-llm-7b-chat/tokenizer.json

Closes #6245
Fixes #6238
Fixes #6216
Fixes #5973

Signed-off-by: Jared Van Bortel <[email protected]>

cebtenzzre added 12 commits March 27, 2024 13:48

convert : remove redundant annotations

dd1a60c

convert : remove unused vocab attributes

72e95e3

convert : vocab inheritance instead of duck typing

9803bb7

convert-persimmon : typing fixup

03f0c2e

convert : do not allow "no_vocab" in --vocab-types

d852c61

convert : use context managers with most file handles

b2b63d1

convert : fix incorrect added token dedup in BpeVocab

d12a63c

convert : use appropriate exception types

8d2ac2c

Signed-off-by: Jared Van Bortel <[email protected]>

convert-hf : fix type of tokens after #3252

2e6fd63

convert : refactor vocab selection logic

79852ab

Fixes #5973 Fixes #6216

convert-hf : HfVocab -> LlamaHfVocab

ebad773

llama : update vocab type descriptions to reflect actual meaning

80e9fc7

cebtenzzre requested a review from ggerganov March 27, 2024 22:20

ggerganov approved these changes Mar 28, 2024

cebtenzzre mentioned this pull request Mar 28, 2024

Allow conversion of Llama / Mistral HF models #6144

Merged

convert : appease flake8

d09e4ac

cebtenzzre merged commit be55134 into master Mar 28, 2024

cebtenzzre deleted the ceb/fix-convert-bpe-hf branch March 28, 2024 15:44

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024

convert : refactor vocab selection logic (ggml-org#6355)

f8c06ad

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024

convert : refactor vocab selection logic (ggml-org#6355)

02c6a83

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024

convert : refactor vocab selection logic (ggml-org#6355)

aacb9ac
