Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
The UGM tokenizer is noticeably slower than the other tokenizers when embedding text. While debugging the source, I traced the cause to llama_tokenize_internal() in llama-vocab.cpp: a new llm_tokenizer_ugm is constructed on every call, and its constructor runs an expensive for-loop. Is there any way to accelerate this, such as running the loop once and caching the resulting data as long as the vocab does not change? A sketch of the idea follows.
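For illustration only, here is a minimal sketch of the caching idea. The types and names below (llama_vocab_stub, ugm_tokenizer_stub, get_cached_tokenizer) are hypothetical stand-ins, not real llama.cpp symbols; the point is just that the expensive construction runs once per vocab and is reused across tokenize calls.

```cpp
// Hypothetical sketch, not actual llama.cpp code. Assumes the vocab is
// immutable for the lifetime of the model, so one tokenizer per vocab
// can be built once and reused.
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

struct llama_vocab_stub {                 // stand-in for llama.cpp's vocab
    std::vector<std::string> tokens;
};

struct ugm_tokenizer_stub {               // stand-in for llm_tokenizer_ugm
    explicit ugm_tokenizer_stub(const llama_vocab_stub & vocab) {
        // the expensive per-construction loop the issue describes
        for (const auto & tok : vocab.tokens) {
            (void) tok;                   // e.g. build tries / normalization tables
        }
    }
};

// One cached tokenizer per vocab; rebuilt only when a new vocab is seen.
static const ugm_tokenizer_stub & get_cached_tokenizer(const llama_vocab_stub & vocab) {
    static std::mutex mtx;
    static std::unordered_map<const llama_vocab_stub *,
                              std::unique_ptr<ugm_tokenizer_stub>> cache;

    std::lock_guard<std::mutex> lock(mtx);
    auto & slot = cache[&vocab];
    if (!slot) {
        slot = std::make_unique<ugm_tokenizer_stub>(vocab);  // pay the cost once
    }
    return *slot;
}
```

Keying the cache on the vocab's address means a reload of the same model still rebuilds the tokenizer, which keeps the sketch safe if the vocab is ever freed and reallocated.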
Motivation
It will benefit llama.cpp users who run embedding models that use the UGM tokenizer, such as multilingual-e5 or bge-m3.
Possible Implementation
No response