Description
What happened?
After a chat-interactive response, the CLI terminates with:
Couldn't get number of tokens from ./llama-cli output!
and exit code 1. There's no (other) error message visible, but it prints a lot of debug info (see below).
I'm using:
- https://huggingface.co/chuanli11/Llama-3.2-3B-Instruct-uncensored locally,
- converted to float 16 with
  docker run --rm -v "./models":/repo ghcr.io/ggerganov/llama.cpp:full --convert "/repo" --outtype f16
- the examples/chat-persistent.sh script (invoked roughly as sketched below), and
- a freshly checked-out and CUDA-compiled llama-cli (although there was no difference with the most recent prebuilt binary).
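For reference, this is roughly how I run the script; the model filename is a placeholder for my converted file, and MODEL / CHAT_SAVE_DIR are the variables I believe the script reads (CHAT_SAVE_DIR matches the ./chat/default paths in the log):

# Rough sketch of my invocation -- the model path is a placeholder
MODEL=./models/ggml-model-f16.gguf \
CHAT_SAVE_DIR=./chat/default \
./examples/chat-persistent.sh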
I can continue the session by restarting the script, but it would be a lot more fun if the session just continued...
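As far as I can tell, that message comes from a check in chat-persistent.sh that tails the llama-cli log and greps the timing summary to count the generated tokens. Here is a minimal sketch of that kind of check; the log path and grep pattern are my assumptions, not copied from the script:

#!/usr/bin/env bash
# Sketch only: extract the run count from a line like
#   llama_perf_sampler_print: sampling time = 18,16 ms / 3143 runs (...)
LOG=./chat/default/main.log   # hypothetical log location

if ! n_tokens="$(tail -n 40 "$LOG" \
        | grep -oE 'sampling time =.* / +[0-9]+ runs' \
        | grep -oE '[0-9]+ runs' \
        | grep -oE '[0-9]+')"; then
    echo >&2 "Couldn't get number of tokens from ./llama-cli output!"
    exit 1
fi
echo "tokens this run: ${n_tokens}"

If the wording of that timing line changed between llama-cli versions, a pattern like this would stop matching even though generation itself succeeded, but that is just a guess.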
Thanks for any pointers
Name and Version
$ ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
version: 4050 (d05b312)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
main: saving final output to session file './chat/default/current-cache.bin'
User: Couldn't get number of tokens from ./llama-cli output!
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type f16: 198 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0,7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0,0e+00
llm_load_print_meta: f_norm_rms_eps = 1,0e-05
llm_load_print_meta: f_clamp_kqv = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale = 0,0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3,61 B
llm_load_print_meta: model size = 6,72 GiB (16,00 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 6879,67 MiB
.................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000,0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA_Host KV buffer size = 448,00 MiB
llama_new_context_with_model: KV self size = 448,00 MiB, K (f16): 224,00 MiB, V (f16): 224,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1008,00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 14,01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 313 (with bs=512), 1 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: attempting to load saved session from './chat/default/current-cache.bin'
main: loaded a session with prompt size of 2759 tokens
main: session file has low similarity to prompt (1 / 2784 tokens); will mostly be reevaluated
sampler seed: 2611093520
sampler params:
repeat_last_n = 256, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 4064, n_keep = 1
llama_perf_sampler_print: sampling time = 18,16 ms / 3143 runs ( 0,01 ms per token, 173025,05 tokens per second)
llama_perf_context_print: load time = 1116,17 ms
llama_perf_context_print: prompt eval time = 3798,26 ms / 2783 tokens ( 1,36 ms per token, 732,70 tokens per second)
llama_perf_context_print: eval time = 46191,98 ms / 358 runs ( 129,03 ms per token, 7,75 tokens per second)
llama_perf_context_print: total time = 50788,98 ms / 3141 tokens