
feature request - disabling tokenizer in conversion / inference #1765


Open
genenwoochoi opened this issue Jun 8, 2023 · 6 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@genenwoochoi

In #1764 I asked if it would be possible to add a Hugging Face tokenizer. But HF tokenizers are quite flexible, and officially supporting them in llama.cpp (or ggml?) might be a lot of hassle.

A much easier workaround would be to allow disabling the tokenizer in both model conversion and inference. Users would then do the encode(text)/decode(ids) steps in their own code around llama.cpp. In my case, for example, I'll be using a Python GUI and a wrapper anyway.

I'd like to work on this, but honestly I don't think I understand the codebase well enough to do it. I'd very much appreciate it if anyone else is interested.

@KerfuffleV2
Collaborator

Do you mean something along the lines of just giving input as token ids separated by spaces? Sounds interesting. Is there actual evidence that something like HF's tokenizer tokenizes significantly differently from the built-in one? You can use --verbose-prompt to see the tokens a prompt parses to.
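For reference, a standalone dump of what the built-in tokenizer does with a prompt (similar to what --verbose-prompt prints) only takes a few lines. This is a rough, untested sketch against the current llama.h C API (llama_init_from_file / llama_tokenize / llama_token_to_str):

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model> <prompt>\n", argv[0]);
        return 1;
    }

    llama_context * ctx = llama_init_from_file(argv[1], llama_context_default_params());

    // worst case is roughly one token per byte, plus BOS
    std::vector<llama_token> tokens(std::strlen(argv[2]) + 1);
    const int n = llama_tokenize(ctx, argv[2], tokens.data(), (int) tokens.size(), /*add_bos=*/true);

    for (int i = 0; i < n; i++) {
        printf("%6d -> '%s'\n", tokens[i], llama_token_to_str(ctx, tokens[i]));
    }

    llama_free(ctx);
    return 0;
}
```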

@genenwoochoi
Author

  • Yes, exactly. Separated by spaces, commas, or anything. llama.cpp would then only handle the core language model: list[int] in, list[int] out (see the sketch after this list).
  • Neither is absolutely better than the other in every respect. (By the way, SentencePiece is not a built-in tokenizer anywhere; it's just one of several choices.) But HuggingFace tokenizers are more versatile to train and to use, and are actively maintained.
    If one wants to train an LM themselves, there is plenty of motivation to train the tokenizer with different strategies: pre-tokenization, normalization, vocab size, special tokens, etc. That's why all the popular, independently developed LLMs have different tokenizers.
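To make the "list[int] in" side concrete, the parsing could be as small as this. It's just a sketch; parse_token_ids is a hypothetical helper, not an existing llama.cpp function:

```cpp
#include <sstream>
#include <string>
#include <vector>

#include "llama.h" // for llama_token (just an integer id)

// Hypothetical helper: turns a string like "1, 15043 2787"
// (spaces and/or commas) into a list of token ids. In this scheme
// llama.cpp never sees the original text at all.
std::vector<llama_token> parse_token_ids(std::string s) {
    for (char & c : s) {
        if (c == ',') {
            c = ' '; // treat commas as whitespace
        }
    }
    std::vector<llama_token> ids;
    std::istringstream iss(s);
    for (llama_token id; iss >> id; ) {
        ids.push_back(id);
    }
    return ids;
}
```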

@KerfuffleV2
Collaborator

It should be a very simple change. I guess the main question is whether something like that would actually get merged, since it's so niche.

I guess one way to handle it might be a compile flag that is disabled by default, to avoid confusing users; a sketch follows below. It also probably wouldn't work too well with features like interactive mode.
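Something like this, roughly. LLAMA_RAW_TOKEN_INPUT is a made-up define for this sketch, disabled by default, so normal builds keep the usual tokenizer behavior:

```cpp
#include <sstream>
#include <string>
#include <vector>

#include "llama.h"

std::vector<llama_token> prompt_to_tokens(llama_context * ctx, const std::string & prompt) {
#ifdef LLAMA_RAW_TOKEN_INPUT
    // raw mode: the "prompt" is whitespace-separated token ids
    std::vector<llama_token> ids;
    std::istringstream iss(prompt);
    for (llama_token id; iss >> id; ) {
        ids.push_back(id);
    }
    return ids;
#else
    // default: run the built-in tokenizer as today
    std::vector<llama_token> ids(prompt.size() + 1);
    const int n = llama_tokenize(ctx, prompt.c_str(), ids.data(), (int) ids.size(), /*add_bos=*/true);
    ids.resize(n > 0 ? n : 0);
    return ids;
#endif
}
```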

@j-f1
Collaborator

j-f1 commented Jun 9, 2023

You can totally do this with the C++ API — all of the underlying inference APIs run on tokens and you typically have to manually convert between strings and tokens.
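For example, a complete token-only loop looks roughly like this (a sketch against the current C API, with a hardcoded greedy sampler and placeholder model path and ids; token ids in, token ids out, no strings anywhere):

```cpp
#include <cstdio>
#include <vector>

#include "llama.h"

int main() {
    llama_context * ctx = llama_init_from_file("model.bin", llama_context_default_params());

    std::vector<llama_token> tokens = {1, 15043}; // example ids (1 = BOS in LLaMA's vocab)
    int n_past = 0;

    for (int i = 0; i < 16; i++) {
        llama_eval(ctx, tokens.data(), (int) tokens.size(), n_past, /*n_threads=*/4);
        n_past += (int) tokens.size();

        // greedy pick over the logits of the last evaluated token
        const float * logits  = llama_get_logits(ctx);
        const int     n_vocab = llama_n_vocab(ctx);
        llama_token best = 0;
        for (llama_token t = 1; t < n_vocab; t++) {
            if (logits[t] > logits[best]) best = t;
        }
        printf("%d ", best); // emit the id; the caller decodes it
        tokens = { best };   // feed it back in
    }
    printf("\n");

    llama_free(ctx);
    return 0;
}
```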

@genenwoochoi
Author

@j-f1 I see, thanks! But I would also need a convert.py that works without a tokenizer. I'll look into the code.

@KerfuffleV2 It would only be niche if LLaMA keeps dominating the (open-source) LLM world. But we already have Falcon, and there will only be more! Surely many of them won't be based on SentencePiece.

Any thoughts, @ggerganov? Or am I missing anything?

@ggerganov
Member

We can extend the llama.h API with a way to pass a user-provided tokenizing function, which can in turn do anything the user wants (see the sketch after the list):

  • read tokens as numbers from a file
  • call Python
  • etc
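A rough sketch of how such an extension could look. All names here (llama_tokenize_fn, llama_set_tokenizer) are hypothetical; nothing like this exists in llama.h yet:

```cpp
#include <stdbool.h>

#include "llama.h"

#ifdef __cplusplus
extern "C" {
#endif

// user-provided tokenizer: writes up to n_max_tokens ids for `text` into
// `tokens` and returns how many were written (negative on error).
// `user_data` is an opaque pointer, so the callback can hold e.g. a Python
// interpreter handle or an open file of pre-computed ids.
typedef int (*llama_tokenize_fn)(
        const char  * text,
        llama_token * tokens,
        int           n_max_tokens,
        bool          add_bos,
        void        * user_data);

// register the callback; passing NULL restores the built-in tokenizer
LLAMA_API void llama_set_tokenizer(
        struct llama_context * ctx,
        llama_tokenize_fn      fn,
        void                 * user_data);

#ifdef __cplusplus
}
#endif
```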

ggerganov added the enhancement, help wanted, and good first issue labels on Jun 10, 2023