convert : fix byte tokens for --vocab-type hfft #5084

Artefact2 · 2024-01-22T18:22:41Z

This is inspired by 9f297f8, which got lost during the refactoring in 6efb8eb.

Fixes #5064.

Artefact2 · 2024-02-03T12:25:59Z

Can I get an ack/nack on this?

felladrin · 2024-02-03T19:58:28Z

I've checked out your branch, compiled main, and tested it. It solved the issue for me 🙏

Command:

llama.cpp/main -m model/ggml-model-f32.gguf -p "<|im_start|>user\nHow do I incorporate visual elements into my writing?<|im_end|>\n<|im_start|>assistant" -n 250 -e --repeat_penalty 1.0 -c 0 --temp 0.1 --min-p 0.1

Output (without the changes from this PR):

<|im_start|>user<0x0A>How do I incorporate visual elements into my writing?<|im_end|><0x0A><|im_start|>assistant<0x0A>I do not have access to the specific details of the text material. However, here are some examples of how to use the following:<0x0A><0x0A>

Output (with the changes from this PR):

<|im_start|>user
How do I incorporate visual elements into my writing?<|im_end|>
<|im_start|>assistant
I do not have access to the specific details of the text material. However, I can provide you with a list of the following:

So I hope it can be merged soon, as it will finally allow us to correctly convert Llama/Mistral models that don't ship a tokenizer.model file alongside the model.
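The difference between the two outputs above comes down to whether `<0x0A>` is treated as a byte token or as ordinary text. Here is a minimal sketch of that distinction. `render_piece` is a hypothetical helper written for illustration only; it is not part of llama.cpp or convert.py.

```python
import re

# Byte tokens are spelled "<0xNN>" with exactly two hex digits.
BYTE_TOKEN_RE = re.compile(r"<0x([0-9A-Fa-f]{2})>")

def render_piece(piece: str, is_byte_token: bool) -> bytes:
    """Hypothetical helper: the bytes a detokenizer would emit for one piece."""
    match = BYTE_TOKEN_RE.fullmatch(piece)
    if is_byte_token and match:
        # A byte token stands for the single raw byte named in its text.
        return bytes([int(match.group(1), 16)])
    # Any other token is emitted as its literal UTF-8 text.
    return piece.encode("utf-8")

print(render_piece("<0x0A>", is_byte_token=False))  # token typed as normal text
print(render_piece("<0x0A>", is_byte_token=True))   # token typed as a byte
```

When the converter does not mark the token as a byte token, the literal six characters end up in the generated text, which matches the first output.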

x4080 · 2024-02-03T21:22:12Z

waiting for this too

slaren · 2024-02-03T21:33:37Z

convert.py

+            if re.fullmatch(br"<0x[0-9A-Fa-f]{2}>", token_text):
+                toktype = gguf.TokenType.BYTE
+            else:
+                toktype = self.get_token_type(token_id, self.special_ids)


Is there any reason to not implement this logic directly in get_token_type? It seems out of place here.
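For illustration, moving the check into `get_token_type` could look roughly like the sketch below. The extra `token_text` parameter, the standalone function form, and the `TokenType` stand-in enum are assumptions made so the example runs without the `gguf` package; this is not the code from this PR or from any merged change.

```python
import re
from enum import IntEnum

class TokenType(IntEnum):
    # Stand-in for gguf.TokenType (assumption), reduced to the members used here.
    NORMAL = 1
    CONTROL = 3
    BYTE = 6

# Same pattern as in the diff above: "<0x" + two hex digits + ">".
BYTE_TOKEN_RE = re.compile(br"<0x[0-9A-Fa-f]{2}>")

def get_token_type(token_id: int, token_text: bytes, special_ids: set[int]) -> TokenType:
    """Sketch: classify a token, checking for byte tokens before special IDs."""
    if BYTE_TOKEN_RE.fullmatch(token_text):
        return TokenType.BYTE
    # Fall back to the special-ID check for everything else.
    return TokenType.CONTROL if token_id in special_ids else TokenType.NORMAL

print(get_token_type(13, b"<0x0A>", special_ids={1, 2}).name)
print(get_token_type(1, b"<s>", special_ids={1, 2}).name)
```

With this shape the caller no longer needs its own `if`/`else`, since every token goes through one classification function.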

ggerganov · 2024-02-05T08:43:34Z

@Artefact2 Sorry for the delay - I'm somehow just seeing the PR

I see this is now closed - are you considering a new PR?

Artefact2 · 2024-02-05T09:27:44Z

No, the bug is fixed as far as I'm concerned. If my patch is not up to the standard required, someone else can take care of it.

maziyarpanahi · 2024-02-11T15:47:24Z

the bug is fixed

Thanks @Artefact2 for your contribution. Was it fixed by other PRs already merged into the main, or was it fixed for you and you moved on? (Totally understandable if it's the latter.)

Artefact2 · 2024-02-11T16:22:09Z

Was it fixed by other PRs already merged into the main

Yes! See #5341.

maziyarpanahi · 2024-02-12T09:14:11Z

Was it fixed by other PRs already merged into the main

Yes! See #5341.

Many thanks!

cebtenzzre reviewed Jan 22, 2024

convert : fix byte tokens for --vocab-type hfft

067ef86

felladrin mentioned this pull request Feb 3, 2024

Architecture "LlamaForCausalLM" not supported #5142

Closed

slaren reviewed Feb 3, 2024

Artefact2 closed this Feb 4, 2024

ggerganov mentioned this pull request Feb 5, 2024

py : handle byte tokens in get_token_type #5341

Merged
