Description
What happened?
After a chat-interactive response, the CLI terminates with:
Couldn't get number of tokens from ./llama-cli output!
and exit code 1. There's no (other) error message visible, but it prints a lot of debug info (see below).
I'm using:
- https://huggingface.co/chuanli11/Llama-3.2-3B-Instruct-uncensored locally,
- converted to float 16 with
  docker run --rm -v "./models":/repo ghcr.io/ggerganov/llama.cpp:full --convert "/repo" --outtype f16
- the examples/chat-persistent.sh script (invoked roughly as sketched below), and
- a freshly checked-out and CUDA-compiled llama-cli (although there was no difference with the most recent prebuilt binary).
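For reference, this is roughly how I run the script; the model filename is a placeholder for my converted file, and MODEL / CHAT_SAVE_DIR are the variables I believe the script reads (CHAT_SAVE_DIR matches the ./chat/default paths in the log):

# Rough sketch of my invocation -- the model path is a placeholder
MODEL=./models/ggml-model-f16.gguf \
CHAT_SAVE_DIR=./chat/default \
./examples/chat-persistent.sh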
I can continue the session by restarting the script, but it would be a lot more fun if the session just continued...
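As far as I can tell, that message comes from a check in chat-persistent.sh that tails the llama-cli log and greps the timing summary to count the generated tokens. Here is a minimal sketch of that kind of check; the log path and grep pattern are my assumptions, not copied from the script:

#!/usr/bin/env bash
# Sketch only: extract the run count from a line like
#   llama_perf_sampler_print: sampling time = 18,16 ms / 3143 runs (...)
LOG=./chat/default/main.log   # hypothetical log location

if ! n_tokens="$(tail -n 40 "$LOG" \
        | grep -oE 'sampling time =.* / +[0-9]+ runs' \
        | grep -oE '[0-9]+ runs' \
        | grep -oE '[0-9]+')"; then
    echo >&2 "Couldn't get number of tokens from ./llama-cli output!"
    exit 1
fi
echo "tokens this run: ${n_tokens}"

If the wording of that timing line changed between llama-cli versions, a pattern like this would stop matching even though generation itself succeeded, but that is just a guess.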
Thanks for any pointers
Name and Version
$ ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
version: 4050 (d05b312)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
main: saving final output to session file './chat/default/current-cache.bin'
User: Couldn't get number of tokens from ./llama-cli output!
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type f16: 198 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0,7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0,0e+00
llm_load_print_meta: f_norm_rms_eps = 1,0e-05
llm_load_print_meta: f_clamp_kqv = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale = 0,0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3,61 B
llm_load_print_meta: model size = 6,72 GiB (16,00 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 6879,67 MiB
.................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000,0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA_Host KV buffer size = 448,00 MiB
llama_new_context_with_model: KV self size = 448,00 MiB, K (f16): 224,00 MiB, V (f16): 224,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1008,00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 14,01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 313 (with bs=512), 1 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: attempting to load saved session from './chat/default/current-cache.bin'
main: loaded a session with prompt size of 2759 tokens
main: session file has low similarity to prompt (1 / 2784 tokens); will mostly be reevaluated
sampler seed: 2611093520
sampler params:
repeat_last_n = 256, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 4064, n_keep = 1
llama_perf_sampler_print: sampling time = 18,16 ms / 3143 runs ( 0,01 ms per token, 173025,05 tokens per second)
llama_perf_context_print: load time = 1116,17 ms
llama_perf_context_print: prompt eval time = 3798,26 ms / 2783 tokens ( 1,36 ms per token, 732,70 tokens per second)
llama_perf_context_print: eval time = 46191,98 ms / 358 runs ( 129,03 ms per token, 7,75 tokens per second)
llama_perf_context_print: total time = 50788,98 ms / 3141 tokens