convert-hf-to-gguf-update.py breaks #7207
What is the error? Was it trying to download a tokenizer from HF? I know that dbrx fails. The older convert script throws a NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()") when trying to do llama3 refuel.

I get this error from convert-hf-to-gguf-update.py using Python 3.11 when trying to convert llama3 refuel: OSError: models/tokenizers/llama-spm does not appear to have a file named config.json. Checkout 'https://huggingface.co/models/tokenizers/llama-spm/tree/None' for available files.

Yes, the very same error, or also: FileNotFoundError: [Errno 2] No such file or directory: 'models/tokenizers/llama-spm/tokenizer.json'

Tried downloading llama-spm from HF directly, but got a 404 error. I think we can also steal one from another llama SPM-based model.
Why though? a) You said you want to work with llama3, so you can ignore llama-spm for that. b) You do not want the original HF files anyway; you want what the update script will build for you, if it works.

I don't understand all the logic, tbh, but it seems to be pulling configs from HF on the fly. Also, I think llama3 refuel is BPE, so why should I even care about llama-spm?
I just realized: maybe it will work if you just fill out the license form on https://huggingface.co/meta-llama/Llama-2-7b-hf
I just deleted the dbrx and llama-spm entries in the model list below line 61 and it seems to work. But then it also says I need to run a bunch of scripts to build vocabs, which is something the other script would do automagically, I think. Also, your license form above is for llama2, which is SPM, not BPE.

Yes, that is as intended by the devs atm. It sounds more difficult than it is: you only need the one vocab, actually. You can also check out the kaggle script linked above, which does it all on the fly too.
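The workaround above can be sketched like this. The shape of the `models` list is paraphrased from `convert-hf-to-gguf-update.py`; the keys and repo URLs here are illustrative, not the exact file contents:

```python
# Sketch of the workaround: the update script keeps a list of model entries,
# and dropping the gated/404 ones (llama-spm, dbrx) lets the rest run.
# Entry shape is paraphrased; keys and URLs are illustrative.
models = [
    {"name": "llama-spm", "tokt": "SPM", "repo": "https://huggingface.co/meta-llama/Llama-2-7b-hf"},
    {"name": "llama-bpe", "tokt": "BPE", "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B"},
    {"name": "dbrx",      "tokt": "BPE", "repo": "https://huggingface.co/databricks/dbrx-base"},
]

# Entries that require accepting a license on Hugging Face and otherwise fail.
GATED = {"llama-spm", "dbrx"}

models = [m for m in models if m["name"] not in GATED]
print([m["name"] for m in models])  # → ['llama-bpe']
```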
From convert-hf-to-gguf.py line 367:
So the entry for BPE tokenizer presumably needs to be added to the xxx-update.py script |
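For context, the detection that fails with the NotImplementedError works roughly like this (a simplified sketch, not the actual function body; the hash values and probe string are made up). The update script encodes a fixed probe string with each downloaded tokenizer, hashes the resulting token ids, and records the hash in `get_vocab_base_pre()` so the converter can recognize the pre-tokenizer later; a tokenizer missing from that table triggers the error quoted above:

```python
import hashlib

CHKTXT = "example probe text"  # stand-in for the real multi-script probe string

KNOWN_PRE = {
    # sha256 of the token-id list -> pre-tokenizer name (illustrative value)
    "0123456789abcdef": "llama-bpe",
}

def get_vocab_base_pre(tokenizer) -> str:
    # Hash the token ids produced for the probe string and look them up.
    chktok = tokenizer.encode(CHKTXT)
    chkhsh = hashlib.sha256(str(chktok).encode()).hexdigest()
    pre = KNOWN_PRE.get(chkhsh)
    if pre is None:
        raise NotImplementedError(
            "BPE pre-tokenizer was not recognized - update get_vocab_base_pre()"
        )
    return pre

# A tokenizer the table does not know about triggers the error from the thread:
class StubTokenizer:
    def encode(self, text):
        return [ord(c) for c in text]

try:
    get_vocab_base_pre(StubTokenizer())
except NotImplementedError as e:
    print(e)
```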
indeed so |
OK, I just checked it with license access, and that is most probably indeed the cause; the same goes for dbrx. So there are two options atm: either ask for access to both repos, or delete/comment out both lines. But I would rather change the update script so that this does not break it. OK, here is a PR for that.
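Presumably the fix is along these lines: wrap the per-model download in a try/except so one gated or missing repo does not abort the whole run. `download_model()` is a hypothetical stand-in for the script's per-model download step, not its real name:

```python
# Hedged sketch: skip models whose download fails instead of crashing.
def update_all(models, download_model):
    failed = []
    for model in models:
        try:
            download_model(model)
        except Exception as exc:
            print(f"warning: skipping {model}: {exc}")
            failed.append(model)
    return failed

def fake_download(name):
    # Gated repos return errors without accepted license access.
    if name in ("llama-spm", "dbrx"):
        raise OSError(f"{name}: repository requires license acceptance")

print(update_all(["llama-spm", "llama-bpe", "dbrx"], fake_download))
# → ['llama-spm', 'dbrx']
```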
I think the overall intention is to emulate what python AutoTokenizer apply_chat_template() already does - it goes out to hf and pulls down the template automagically |
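To illustrate the comparison: transformers' `apply_chat_template()` reads a Jinja template shipped in the model's `tokenizer_config.json` and renders the conversation with it. This toy version skips Jinja entirely and hard-codes a llama2-style wrapping, just to show where the template comes from; it is not the transformers implementation:

```python
import json

# Toy stand-in for a downloaded tokenizer_config.json; the real file carries
# a full Jinja template in its "chat_template" field.
tokenizer_config = json.loads('{"chat_template": "[INST] {user} [/INST] {assistant}"}')

def apply_chat_template_toy(messages, config):
    # Fill the (simplified) template with the first user/assistant turns.
    template = config["chat_template"]
    user = next(m["content"] for m in messages if m["role"] == "user")
    assistant = next((m["content"] for m in messages if m["role"] == "assistant"), "")
    return template.format(user=user, assistant=assistant).rstrip()

msgs = [{"role": "user", "content": "hi"}]
print(apply_chat_template_toy(msgs, tokenizer_config))  # → [INST] hi [/INST]
```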
the similarity ends after the pulling down though |
You and me both |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Just realized that some recent changes seemingly make the script break on creating the llama-spm contents; it runs through without that line, which is my quick and lazy workaround atm (also in a quickly hacked kaggle script that runs through the steps to fix the pre-tokenizer issue). Sorry, I cannot look into this further. Maybe it is just some intermediate inconsistency that gets resolved as the current edits in the repo land, or maybe you want to look into it.