new tokenizer-verifier tool to check gguf tokenizer parameters #6988
base: master
Conversation
Here is what it looks like for a model with a BPE tokenizer that can tokenize anything:
And for another where this fails because of an incomplete (or misconfigured?) tokenizer:
Note that recent changes have made it very slow for llama3 (or maybe it's my gguf file?).
This program verifies that a given gguf model file can tokenize all potential valid characters. Since llama.cpp currently raises an exception when tokenization is not possible[1], this tool helps verify that valid ASCII and UTF-8 input will always be properly tokenized. [1] ggml-org#2580
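For illustration, here is a minimal sketch of what such a verifier loop could look like. It assumes the llama.h C API of this period (llama_backend_init, llama_model_default_params with vocab_only, llama_load_model_from_file, and the seven-argument llama_tokenize); the structure and names are illustrative, not the PR's actual source:

```cpp
// tokenizer verifier sketch (illustrative, not the PR's actual code)
#include <cstdint>
#include <cstdio>
#include <exception>
#include <string>
#include <vector>

#include "llama.h"

// Encode a single Unicode code point as UTF-8
static std::string codepoint_to_utf8(uint32_t cp) {
    std::string s;
    if (cp <= 0x7F) {
        s += (char)  cp;
    } else if (cp <= 0x7FF) {
        s += (char) (0xC0 |  (cp >> 6));
        s += (char) (0x80 |  (cp        & 0x3F));
    } else if (cp <= 0xFFFF) {
        s += (char) (0xE0 |  (cp >> 12));
        s += (char) (0x80 | ((cp >> 6)  & 0x3F));
        s += (char) (0x80 |  (cp        & 0x3F));
    } else {
        s += (char) (0xF0 |  (cp >> 18));
        s += (char) (0x80 | ((cp >> 12) & 0x3F));
        s += (char) (0x80 | ((cp >> 6)  & 0x3F));
        s += (char) (0x80 |  (cp        & 0x3F));
    }
    return s;
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true; // only the tokenizer is needed, skip the tensor data

    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "error: failed to load '%s'\n", argv[1]);
        return 1;
    }

    int n_fail = 0;
    std::vector<llama_token> tokens(16); // a single code point yields at most a few tokens

    // start at U+0001: U+0000 cannot be passed through a C string
    for (uint32_t cp = 1; cp <= 0x10FFFF; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            continue; // UTF-16 surrogate halves are not valid code points
        }
        const std::string text = codepoint_to_utf8(cp);
        try {
            const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                         tokens.data(), (int) tokens.size(),
                                         /*add_special*/ false, /*parse_special*/ false);
            if (n < 0) {
                printf("cannot tokenize U+%04X\n", (unsigned) cp);
                n_fail++;
            }
        } catch (const std::exception & e) {
            // the current tokenizer implementation throws when it cannot
            // tokenize the input (see ggml-org#2580)
            printf("cannot tokenize U+%04X: %s\n", (unsigned) cp, e.what());
            n_fail++;
        }
    }

    printf("%d code point(s) failed\n", n_fail);

    llama_free_model(model);
    llama_backend_free();
    return n_fail == 0 ? 0 : 1;
}
```

Checking each code point individually keeps the test exhaustive, but it also means any regression in tokenizer speed becomes very visible here, which may be related to the llama3 slowness noted above.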
I noticed that the other tools in the examples folder have a README.md; I think we need one for this as well, since it's a tool with a specific purpose.
Also, it doesn't seem to handle missing models well and ends up segfaulting.
This PR might need a bit more adjusting.
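The missing-model segfault suggests an unchecked load. A hedged sketch of the kind of guard that would avoid it, assuming the llama.h load API (the helper name load_vocab_only is hypothetical, not the PR's fix):

```cpp
#include <cstdio>

#include "llama.h"

// Load only the vocabulary/tokenizer, failing gracefully on a missing or corrupt file.
static llama_model * load_vocab_only(const char * path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true;

    llama_model * model = llama_load_model_from_file(path, mparams);
    if (model == NULL) {
        // report and let the caller exit, instead of dereferencing NULL later
        fprintf(stderr, "error: failed to load model '%s'\n", path);
    }
    return model;
}
```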
Thanks a lot for your review @mofosyne, I'll add a README in the next iteration.
I don't think it hung, but it illustrates the issue I talked about earlier:
Something made the tokenization very slow, but I don't know what. I might bisect it to find the issue.
If you look at the code, it loads the model in a very simple and straightforward way, just like the …