convert : fix byte tokens for --vocab-type hfft #5084
Conversation
Can I get an ack/nack on this?

I've checked out your branch and compiled it. Command:
Output (without the changes from this PR):
Output (with the changes from this PR):
So I hope it can be merged soon, as it will finally allow us to correctly convert Llama/Mistral models that don't have the

waiting for this too
```python
if re.fullmatch(br"<0x[0-9A-Fa-f]{2}>", token_text):
    toktype = gguf.TokenType.BYTE
else:
    toktype = self.get_token_type(token_id, self.special_ids)
```
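For context, SentencePiece-style vocabularies represent raw bytes as tokens of the form `<0xNN>`, e.g. `<0x0A>` for a newline. A minimal standalone sketch of the matching behaviour above (the helper name `is_byte_token` is made up for illustration; the real script works on `gguf` objects, which are not needed to see the regex in action):

```python
import re

# Byte tokens in SentencePiece-style vocabularies look like b"<0x0A>".
# Note the bytes pattern (br"..."): token_text is bytes, not str.
BYTE_TOKEN_RE = re.compile(br"<0x[0-9A-Fa-f]{2}>")

def is_byte_token(token_text: bytes) -> bool:
    """Return True if the token text encodes a single raw byte."""
    # fullmatch requires the whole token to be exactly "<0xNN>";
    # a partial match (e.g. trailing characters) does not qualify.
    return BYTE_TOKEN_RE.fullmatch(token_text) is not None
```

Using `fullmatch` rather than `match` matters here: a token that merely starts with `<0x0A>` but has trailing characters is an ordinary token, not a byte token.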
Is there any reason not to implement this logic directly in `get_token_type`? It seems out of place here.
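The reviewer's suggestion would fold the byte-token check into the type-classification helper itself. A hypothetical sketch of what that might look like (this is not the actual convert.py code; the `TokenType` values mirror `gguf.TokenType`, and the signature is assumed for illustration):

```python
import re
from enum import IntEnum

class TokenType(IntEnum):
    # Simplified mirror of gguf.TokenType for this sketch.
    NORMAL = 1
    UNKNOWN = 2
    CONTROL = 3
    BYTE = 6

def get_token_type(token_id: int, token_text: bytes, special_ids: set) -> TokenType:
    # Classify byte tokens first, then special/control tokens, then normal ones,
    # so callers need only a single classification call.
    if re.fullmatch(br"<0x[0-9A-Fa-f]{2}>", token_text):
        return TokenType.BYTE
    return TokenType.CONTROL if token_id in special_ids else TokenType.NORMAL
```

This keeps all token classification in one place, which is essentially what the follow-up fix (#5341) ended up doing.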
@Artefact2 Sorry for the delay - I'm somehow only now seeing this PR. I see it is now closed - are you considering a new PR?
No, the bug is fixed as far as I'm concerned. If my patch is not up to the standard required, someone else can take care of it.
Thanks @Artefact2 for your contribution. Was it fixed by other PRs already merged into main, or was it fixed for you and you moved on? (totally understand if it's the latter)
Yes! See #5341. |
Many thanks!
This is inspired by 9f297f8, which got lost during the refactoring in 6efb8eb.
Fixes #5064.