Skip to content

convert : fix byte tokens for --vocab-type hfft #5084

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

convert : fix byte tokens for --vocab-type hfft #5084

wants to merge 1 commit into from

Conversation

Artefact2
Copy link
Collaborator

This is inspired by 9f297f8, which got lost during the refactoring in 6efb8eb.

Fixes #5064.

This is inspired by 9f297f8, which got lost
during the refactoring in 6efb8eb.
@Artefact2
Copy link
Collaborator Author

Can I get a ack/nack on this?

@felladrin
Copy link
Contributor

I've checked out your branch, compiled main, and tested it. It solved the issue for me 🙏

Command:

llama.cpp/main -m model/ggml-model-f32.gguf -p "<|im_start|>user\nHow do I incorporate visual elements into my writing?<|im_end|>\n<|im_start|>assistant" -n 250 -e --repeat_penalty 1.0 -c 0 --temp 0.1 --min-p 0.1

Output (without the changes from this PR):

<|im_start|>user<0x0A>How do I incorporate visual elements into my writing?<|im_end|><0x0A><|im_start|>assistant<0x0A>I do not have access to the specific details of the text material. However, here are some examples of how to use the following:<0x0A><0x0A>

Output (with the changes from this PR):

<|im_start|>user
How do I incorporate visual elements into my writing?<|im_end|>
<|im_start|>assistant
I do not have access to the specific details of the text material. However, I can provide you with a list of the following:

So I hope it can be merged soon, as it will finally allow us to correctly convert Llama/Mistral models that doesn't have the tokenizer.model file along with the model.

@x4080
Copy link

x4080 commented Feb 3, 2024

waiting for this too

if re.fullmatch(br"<0x[0-9A-Fa-f]{2}>", token_text):
toktype = gguf.TokenType.BYTE
else:
toktype = self.get_token_type(token_id, self.special_ids)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to not implement this logic directly in get_token_type? It seems out of place here.

@Artefact2 Artefact2 closed this Feb 4, 2024
@ggerganov
Copy link
Member

@Artefact2 Sorry for the delay - I'm somehow just seeing the PR

I see this is now closed - are you considering a new PR?

@Artefact2
Copy link
Collaborator Author

No, the bug is fixed as far as I'm concerned. If my patch is not up to the standard required, someone else can take care of it.

@maziyarpanahi
Copy link

the bug is fixed

Thanks @Artefact2 for your contribution. Was it fixed by other PRs already merged into the main, or it was fixed for you and you moved on? (totally understand if it's the latter)

@Artefact2
Copy link
Collaborator Author

Was it fixed by other PRs already merged into the main

Yes! See #5341.

@maziyarpanahi
Copy link

Was it fixed by other PRs already merged into the main

Yes! See #5341.

Many thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

convert.py --vocab-type hfft produces <0x0A> instead of new lines
7 participants