
Conversation

ftgreat
Contributor

@ftgreat ftgreat commented Jul 15, 2023

Our released Aquila models use a BPE tokenizer, so in convert.py we add one branch that preprocesses the BPE tokenizer vocab into sentencepiece format, so that downstream modules such as inference and int4 quantization can be reused. We have verified that all encoding ids are identical and that other modules are unaffected.

Could you please review this PR? Thanks.
Related issue: #2093

@howard0su
Contributor

Can you provide test instruction so that I can verify the change?

@ftgreat
Contributor Author

ftgreat commented Jul 18, 2023

> Can you provide test instruction so that I can verify the change?

Instruction:
python convert.py models/7B --vocab-only --outfile models/aquila-vocab.bin --vocabtype bpe

Requirements:
Put vocab.json in the models dir. vocab.json comes from the Aquila tokenizer: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/Aquila-tokenizer-hf/vocab.json

@klosax
Contributor

klosax commented Jul 19, 2023

Note: Using a llama model with a gpt2 tokenizer will be fully supported in the new ggml file format. ggml-org/ggml#302

@ftgreat
Contributor Author

ftgreat commented Jul 25, 2023

> Note: Using a llama model with a gpt2 tokenizer will be fully supported in the new ggml file format. ggerganov/ggml#302

Could you please share the support schedule?
And how should we add our released models? Thanks.
