
feature request - disabling tokenizer in conversion / inference #1765


Open
genenwoochoi opened this issue Jun 8, 2023 · 6 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@genenwoochoi

In #1764 I asked if it would be possible to add a Hugging Face tokenizer. But HF tokenizers are quite flexible, and officially supporting them in llama.cpp (or ggml?) might be a lot of hassle.

A much easier workaround would be to allow disabling the tokenizer in both model conversion and inference. Users would then do the encode(text)/decode(ids) steps in their own code around llama.cpp. In my case, for example, I'll be using a Python GUI and a wrapper anyway.

I'd like to work on this, but honestly I don't think I understand the codebase well enough to do it. I'd very much appreciate it if anyone else is interested.

@KerfuffleV2
Collaborator

Do you mean something along the lines of just giving input as token ids separated by spaces? Sounds interesting. Is there actual evidence that something like HF's tokenizer tokenizes significantly differently from the built-in one? You can use --verbose-prompt to see the tokens a prompt parses to.
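For reference, a standalone dump of what the built-in tokenizer does with a prompt (similar to what --verbose-prompt prints) only takes a few lines. This is a rough, untested sketch against the current llama.h C API (llama_init_from_file / llama_tokenize / llama_token_to_str):

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model> <prompt>\n", argv[0]);
        return 1;
    }

    llama_context * ctx = llama_init_from_file(argv[1], llama_context_default_params());

    // worst case is roughly one token per byte, plus BOS
    std::vector<llama_token> tokens(std::strlen(argv[2]) + 1);
    const int n = llama_tokenize(ctx, argv[2], tokens.data(), (int) tokens.size(), /*add_bos=*/true);

    for (int i = 0; i < n; i++) {
        printf("%6d -> '%s'\n", tokens[i], llama_token_to_str(ctx, tokens[i]));
    }

    llama_free(ctx);
    return 0;
}
```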

@genenwoochoi
Author

  • Yes, exactly. Separated by spaces, commas, or anything. llama.cpp would then only handle the core language model: list[int] in, list[int] out (see the sketch after this list).
  • Neither is absolutely better than the other in every respect. (By the way, SentencePiece is not a built-in tokenizer anywhere; it's just one of several choices.) But HuggingFace tokenizers are more versatile to train and to use, and are actively maintained.
    If one wants to train an LM themselves, there is plenty of motivation to train the tokenizer with different strategies: pre-tokenization, normalization, vocab size, special tokens, etc. That's why all the popular, independently developed LLMs have different tokenizers.
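To make the "list[int] in" side concrete, the parsing could be as small as this. It's just a sketch; parse_token_ids is a hypothetical helper, not an existing llama.cpp function:

```cpp
#include <sstream>
#include <string>
#include <vector>

#include "llama.h" // for llama_token (just an integer id)

// Hypothetical helper: turns a string like "1, 15043 2787"
// (spaces and/or commas) into a list of token ids. In this scheme
// llama.cpp never sees the original text at all.
std::vector<llama_token> parse_token_ids(std::string s) {
    for (char & c : s) {
        if (c == ',') {
            c = ' '; // treat commas as whitespace
        }
    }
    std::vector<llama_token> ids;
    std::istringstream iss(s);
    for (llama_token id; iss >> id; ) {
        ids.push_back(id);
    }
    return ids;
}
```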

@KerfuffleV2
Collaborator

It should be a very simple change. I guess the main question is whether something like that would actually get merged, since it's so niche.

I guess one way to handle it might be a compile flag that is disabled by default, to avoid confusing users; a sketch follows below. It also probably wouldn't work too well with features like interactive mode.
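Something like this, roughly. LLAMA_RAW_TOKEN_INPUT is a made-up define for this sketch, disabled by default, so normal builds keep the usual tokenizer behavior:

```cpp
#include <sstream>
#include <string>
#include <vector>

#include "llama.h"

std::vector<llama_token> prompt_to_tokens(llama_context * ctx, const std::string & prompt) {
#ifdef LLAMA_RAW_TOKEN_INPUT
    // raw mode: the "prompt" is whitespace-separated token ids
    std::vector<llama_token> ids;
    std::istringstream iss(prompt);
    for (llama_token id; iss >> id; ) {
        ids.push_back(id);
    }
    return ids;
#else
    // default: run the built-in tokenizer as today
    std::vector<llama_token> ids(prompt.size() + 1);
    const int n = llama_tokenize(ctx, prompt.c_str(), ids.data(), (int) ids.size(), /*add_bos=*/true);
    ids.resize(n > 0 ? n : 0);
    return ids;
#endif
}
```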

@j-f1
Collaborator

j-f1 commented Jun 9, 2023

You can totally do this with the C++ API — all of the underlying inference APIs run on tokens and you typically have to manually convert between strings and tokens.
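For example, a complete token-only loop looks roughly like this (a sketch against the current C API, with a hardcoded greedy sampler and placeholder model path and ids; token ids in, token ids out, no strings anywhere):

```cpp
#include <cstdio>
#include <vector>

#include "llama.h"

int main() {
    llama_context * ctx = llama_init_from_file("model.bin", llama_context_default_params());

    std::vector<llama_token> tokens = {1, 15043}; // example ids (1 = BOS in LLaMA's vocab)
    int n_past = 0;

    for (int i = 0; i < 16; i++) {
        llama_eval(ctx, tokens.data(), (int) tokens.size(), n_past, /*n_threads=*/4);
        n_past += (int) tokens.size();

        // greedy pick over the logits of the last evaluated token
        const float * logits  = llama_get_logits(ctx);
        const int     n_vocab = llama_n_vocab(ctx);
        llama_token best = 0;
        for (llama_token t = 1; t < n_vocab; t++) {
            if (logits[t] > logits[best]) best = t;
        }
        printf("%d ", best); // emit the id; the caller decodes it
        tokens = { best };   // feed it back in
    }
    printf("\n");

    llama_free(ctx);
    return 0;
}
```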

@genenwoochoi
Author

@j-f1 I see, thanks! But I would also need a convert.py that works without a tokenizer. I'll look into the code.

@KerfuffleV2 It would only be niche if LLaMA keeps dominating the (open-source) LLM world. But we already have Falcon, and there will only be more! Surely many of them won't be based on SentencePiece.

Any thoughts, @ggerganov? Or am I missing anything?

@ggerganov
Member

We can extend the llama.h API with a way to pass a user-provided tokenizing function, which can in turn do anything the user wants (see the sketch after the list):

  • read tokens as numbers from a file
  • call Python
  • etc
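A rough sketch of how such an extension could look. All names here (llama_tokenize_fn, llama_set_tokenizer) are hypothetical; nothing like this exists in llama.h yet:

```cpp
#include <stdbool.h>

#include "llama.h"

#ifdef __cplusplus
extern "C" {
#endif

// user-provided tokenizer: writes up to n_max_tokens ids for `text` into
// `tokens` and returns how many were written (negative on error).
// `user_data` is an opaque pointer, so the callback can hold e.g. a Python
// interpreter handle or an open file of pre-computed ids.
typedef int (*llama_tokenize_fn)(
        const char  * text,
        llama_token * tokens,
        int           n_max_tokens,
        bool          add_bos,
        void        * user_data);

// register the callback; passing NULL restores the built-in tokenizer
LLAMA_API void llama_set_tokenizer(
        struct llama_context * ctx,
        llama_tokenize_fn      fn,
        void                 * user_data);

#ifdef __cplusplus
}
#endif
```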

ggerganov added the enhancement, help wanted, and good first issue labels on Jun 10, 2023