
llama : refactor llama_vocab #9369

Description

@ggerganov

As of today we support 5 tokenizer implementations:

        LLAMA_VOCAB_TYPE_SPM  = 1, // LLaMA tokenizer based on byte-level BPE with byte fallback
        LLAMA_VOCAB_TYPE_BPE  = 2, // GPT-2 tokenizer based on byte-level BPE
        LLAMA_VOCAB_TYPE_WPM  = 3, // BERT tokenizer based on WordPiece
        LLAMA_VOCAB_TYPE_UGM  = 4, // T5 tokenizer based on Unigram
        LLAMA_VOCAB_TYPE_RWKV = 5, // RWKV tokenizer based on greedy tokenization

The function llama_tokenize_internal in llama-vocab.cpp currently constructs a tokenizer instance on every call, which for some of the tokenizers incurs significant overhead. This should be avoided by pre-constructing the tokenizer object upon llama-vocab creation and abstracting the objects (e.g. llm_tokenizer_spm, llm_tokenizer_bpe, etc.) behind a common interface.
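
Roughly, the common interface could look something like this (a minimal sketch; the method names and signatures are illustrative, not a final API):

    #include <cstdint>
    #include <string>
    #include <vector>

    typedef int32_t llama_token; // same as in llama.h

    struct llama_vocab; // defined in llama-vocab.h

    // common base class: each tokenizer is constructed once when the
    // llama_vocab is created, instead of on every tokenize call
    struct llm_tokenizer {
        virtual ~llm_tokenizer() = default;

        // convert raw text into token ids, appending to `output`
        virtual void tokenize(const std::string & text, std::vector<llama_token> & output) = 0;
    };

    struct llm_tokenizer_spm : llm_tokenizer {
        explicit llm_tokenizer_spm(const llama_vocab & vocab); // pre-compute data here
        void tokenize(const std::string & text, std::vector<llama_token> & output) override;
    };

    // llm_tokenizer_bpe, llm_tokenizer_wpm, llm_tokenizer_ugm and
    // llm_tokenizer_rwkv would implement the same interface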

However, we want llama_tokenize_internal to remain thread-safe as it currently is (I think). Therefore, the tokenizer objects would likely need to be split into 2 parts:

  • immutable pre-computed data (such as tries and lookup tables)
  • mutable work data

The former will be initialized once upon llama-vocab creation. The latter will be created on each call to llama_tokenize_internal and will store transient data used while tokenizing.
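
For example, the split could look like this (again just a sketch with illustrative names), shown here for the UGM tokenizer:

    #include <cstdint>
    #include <string>
    #include <vector>

    typedef int32_t llama_token; // same as in llama.h

    // mutable work data: cheap to construct, created inside each
    // llama_tokenize_internal call, so concurrent calls never share
    // mutable state
    struct llm_tokenizer_session {
        std::vector<int32_t> work;   // e.g. merge ranks or Viterbi scores
        std::string          buffer; // normalization scratch
    };

    // immutable part: tries and lookup tables built once at llama-vocab
    // creation. tokenize() is const, so a single instance can be shared
    // across threads
    struct llm_tokenizer_ugm {
        // const pre-computed data lives here (token trie, scores, ...)

        void tokenize(const std::string & text,
                      std::vector<llama_token> & output,
                      llm_tokenizer_session & session) const;
    };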

A test that verifies thread safety for all tokenizers via the thread sanitizer would be useful.
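
For instance, something along these lines (using the public llama.h API, with the model path passed as the first argument), built with -fsanitize=thread, should surface any data races between concurrent tokenize calls:

    // build with -fsanitize=thread
    #include "llama.h"

    #include <string>
    #include <thread>
    #include <vector>

    int main(int /*argc*/, char ** argv) {
        llama_backend_init();

        llama_model * model = llama_load_model_from_file(argv[1], llama_model_default_params());

        const std::string text = "Hello, world! The quick brown fox.";

        // tokenize the same text from several threads concurrently; with
        // per-call work data there is no shared mutable state to race on
        std::vector<std::thread> workers;
        for (int i = 0; i < 8; ++i) {
            workers.emplace_back([&]() {
                std::vector<llama_token> tokens(256);
                for (int j = 0; j < 1000; ++j) {
                    llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                                   tokens.data(), (int32_t) tokens.size(),
                                   /*add_special*/ true, /*parse_special*/ false);
                }
            });
        }
        for (auto & w : workers) {
            w.join();
        }

        llama_free_model(model);
        llama_backend_free();
        return 0;
    }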

This should resolve #9180 and also help to multi-thread the tokenization process in llama-server.

While working on this, llama-vocab.cpp could also use various simplifications and improvements.
