As of today we support 5 tokenizer implementations:
```cpp
LLAMA_VOCAB_TYPE_SPM  = 1, // LLaMA tokenizer based on byte-level BPE with byte fallback
LLAMA_VOCAB_TYPE_BPE  = 2, // GPT-2 tokenizer based on byte-level BPE
LLAMA_VOCAB_TYPE_WPM  = 3, // BERT tokenizer based on WordPiece
LLAMA_VOCAB_TYPE_UGM  = 4, // T5 tokenizer based on Unigram
LLAMA_VOCAB_TYPE_RWKV = 5, // RWKV tokenizer based on greedy tokenization
```
The function `llama_tokenize_internal` in `llama-vocab.cpp` currently constructs a tokenizer instance on every call, which for some of the tokenizers incurs significant overhead. This should be avoided by pre-constructing the tokenizer object upon `llama_vocab` creation and abstracting the objects (e.g. `llm_tokenizer_spm`, `llm_tokenizer_bpe`, etc.) with a common interface.
However, we want `llama_tokenize_internal` to remain thread-safe as it currently is (I think). Therefore, the tokenizer objects would likely need to be split into 2 parts:

- immutable pre-computed data (such as tries and lookup tables)
- mutable work data

The first one will be initialized once upon `llama_vocab` creation. The latter will be created each time within `llama_tokenize_internal` and will be used to store fleeting data while tokenizing.
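A minimal C++ sketch of this split, using the SPM tokenizer as an example (all names and signatures here are illustrative assumptions, not the actual llama.cpp API):

```cpp
#include <cstdint>
#include <string>
#include <vector>

using llama_token = int32_t;

// Immutable part: pre-computed once at vocab creation, shared read-only by all threads.
struct llm_tokenizer {
    virtual ~llm_tokenizer() = default;
};

// Mutable part: constructed per llama_tokenize_internal call, holds scratch state.
struct llm_tokenizer_session {
    virtual ~llm_tokenizer_session() = default;
    virtual void tokenize(const std::string & text, std::vector<llama_token> & output) = 0;
};

// Example pair for the SPM tokenizer.
struct llm_tokenizer_spm : llm_tokenizer {
    // pre-computed tries, token lookup tables, ...
};

struct llm_tokenizer_spm_session : llm_tokenizer_session {
    explicit llm_tokenizer_spm_session(const llm_tokenizer_spm & tok) : tokenizer(tok) {}

    void tokenize(const std::string & text, std::vector<llama_token> & output) override {
        // reads `tokenizer` only; writes exclusively to per-session work buffers
        (void) text;
        (void) output;
    }

private:
    const llm_tokenizer_spm & tokenizer; // shared, never mutated while tokenizing
    // fleeting work data: symbol list, bigram queue, ...
};
```

With this layout, `llama_tokenize_internal` would construct a cheap session object on the stack for each call, while the expensive `llm_tokenizer_spm` is built once and stored alongside the vocab.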
A test that verifies thread safety for all tokenizers via the thread sanitizer would be useful.
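For example, a test along these lines (again a sketch, reusing the hypothetical types from above) could be built with `-fsanitize=thread` to catch races on the shared tokenizer state:

```cpp
#include <thread>
#include <vector>

// Spawn several threads that tokenize concurrently through the same
// immutable tokenizer; the thread sanitizer reports any data race.
void test_tokenizer_thread_safety(const llm_tokenizer_spm & shared_tokenizer) {
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i) {
        workers.emplace_back([&shared_tokenizer]() {
            llm_tokenizer_spm_session session(shared_tokenizer); // per-thread work data
            std::vector<llama_token> out;
            for (int iter = 0; iter < 1000; ++iter) {
                out.clear();
                session.tokenize("Hello, world!", out);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```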
This should resolve #9180 and also help to multi-thread the tokenization process in `llama-server`.
While working on this, `llama-vocab.cpp` can benefit from various simplifications and improvements as well.