Misc. bug: The inference speed of llama-server is one-third of that of llama-cli #12171

Open
zts9989 opened this issue Mar 4, 2025 · 8 comments
Labels
bug Something isn't working

zts9989 commented Mar 4, 2025

Name and Version

llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
version: b4819
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
version: b4819
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf  -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047  -ngl 160  --host 0.0.0.0

llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf  -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047  -ngl 160 -nkvo --host 0.0.0.0

llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160 -nkvo

llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160

Problem description & steps to reproduce

In the same environment and with the same parameters, llama-server with -nkvo generates tokens at roughly one-third the speed of llama-cli with -nkvo.

The command lines and the speeds they achieve are as follows:
llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -ngl 160 --host 0.0.0.0
This parameter configuration achieves an inference speed of: 25.87 t/s

llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -ngl 160 -nkvo --host 0.0.0.0
This parameter configuration achieves an inference speed of: 5.83 t/s

llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160 -nkvo
This parameter configuration achieves an inference speed of: 18.25 t/s

llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160
This parameter configuration achieves an inference speed of: 24.27 t/s

The -nkvo option leaves the KV cache and its calculations on the CPU, so some slowdown is expected: llama-cli drops from 24.27 t/s to 18.25 t/s. In llama-server, however, enabling -nkvo causes a much larger drop, from 25.87 t/s to 5.83 t/s. On my other computers it even drops below 1 t/s. I would expect llama-server with -nkvo to run at about 18 t/s, similar to llama-cli.

I have reproduced this issue on multiple machines. I suspect there is a bug in how llama-server uses the thread pool. Thank you for this project, which allows me to run LLMs locally. I look forward to a fix.
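To make the comparison concrete, here is a small Python sketch that turns the four measurements above into relative slowdowns. The names `speeds` and `nkvo_drop` are only illustrative and are not part of llama.cpp; the numbers are copied from the runs reported in this issue.

```python
# Throughput figures (tokens per second) copied from the four runs reported above.
speeds = {
    ("llama-cli", "gpu_kv"): 24.27,
    ("llama-cli", "nkvo"): 18.25,
    ("llama-server", "gpu_kv"): 25.87,
    ("llama-server", "nkvo"): 5.83,
}


def nkvo_drop(tool):
    """Fraction of throughput lost when -nkvo is enabled for the given tool."""
    return 1.0 - speeds[(tool, "nkvo")] / speeds[(tool, "gpu_kv")]


cli_drop = nkvo_drop("llama-cli")
server_drop = nkvo_drop("llama-server")
# Ratio of server speed to CLI speed when both run with -nkvo.
server_vs_cli = speeds[("llama-server", "nkvo")] / speeds[("llama-cli", "nkvo")]

print(f"llama-cli    -nkvo drop: {cli_drop:.1%}")      # 24.8%
print(f"llama-server -nkvo drop: {server_drop:.1%}")   # 77.5%
print(f"llama-server / llama-cli with -nkvo: {server_vs_cli:.2f}")  # 0.32
```

So llama-cli loses about a quarter of its speed with -nkvo, while llama-server loses more than three quarters and ends up at about one-third of the llama-cli speed.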

First Bad Commit

No response

Relevant log output

./build/bin/llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160 

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
build: 0 (unknown) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1080 Ti) - 11025 MiB free
llama_model_loader: loaded meta data with 25 key-value pairs and 579 tensors from /data4/qwen2.5-14b-deep-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qw14
llama_model_loader: - kv   3:                         general.size_label str              = 15B
llama_model_loader: - kv   4:                          qwen2.block_count u32              = 48
llama_model_loader: - kv   5:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   6:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   7:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv   8:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv   9:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.37 GiB (4.87 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 48
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 13824
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.77 B
print_info: general.name     = Qw14
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  8148.38 MiB
load_tensors:   CPU_Mapped model buffer size =   417.66 MiB
..........................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 512
llama_init_from_model: n_ctx_per_seq = 512
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    96.00 MiB
llama_init_from_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_init_from_model:      CUDA0 compute buffer size =   307.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_init_from_model: graph nodes  = 1495
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 24
*** User-specified prompt in conversation mode will be ignored, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 610 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 3047
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 512
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> hello
<think>

Hello! How can I assist you today? 😊

> 
llama_perf_sampler_print:    sampling time =       1.69 ms /    18 runs   (    0.09 ms per token, 10663.51 tokens per second)
llama_perf_context_print:        load time =    6771.74 ms
llama_perf_context_print: prompt eval time =     110.77 ms /     5 tokens (   22.15 ms per token,    45.14 tokens per second)
llama_perf_context_print:        eval time =     535.54 ms /    13 runs   (   41.20 ms per token,    24.27 tokens per second)
llama_perf_context_print:       total time =   19145.65 ms /    18 tokens
Interrupted by user


./build/bin/llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160 -nkvo
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
build: 0 (unknown) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1080 Ti) - 11025 MiB free
llama_model_loader: loaded meta data with 25 key-value pairs and 579 tensors from /data4/qwen2.5-14b-deep-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qw14
llama_model_loader: - kv   3:                         general.size_label str              = 15B
llama_model_loader: - kv   4:                          qwen2.block_count u32              = 48
llama_model_loader: - kv   5:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   6:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   7:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv   8:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv   9:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.37 GiB (4.87 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 48
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 13824
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.77 B
print_info: general.name     = Qw14
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  8148.38 MiB
load_tensors:   CPU_Mapped model buffer size =   417.66 MiB
..........................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 512
llama_init_from_model: n_ctx_per_seq = 512
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 512, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_init_from_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_init_from_model:      CUDA0 compute buffer size =   317.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_init_from_model: graph nodes  = 1495
llama_init_from_model: graph splits = 98
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 24
*** User-specified prompt in conversation mode will be ignored, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 610 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 3047
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 512
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> hello
<think>

Hello! How can I assist you today? 😊

> 
llama_perf_sampler_print:    sampling time =       1.66 ms /    18 runs   (    0.09 ms per token, 10849.91 tokens per second)
llama_perf_context_print:        load time =    1985.12 ms
llama_perf_context_print: prompt eval time =     148.26 ms /     5 tokens (   29.65 ms per token,    33.73 tokens per second)
llama_perf_context_print:        eval time =     712.26 ms /    13 runs   (   54.79 ms per token,    18.25 tokens per second)
llama_perf_context_print:       total time =    4656.76 ms /    18 tokens
Interrupted by user



./build/bin/llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf  -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047  -ngl 160 -nkvo --host 0.0.0.0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
build: 0 (unknown) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
system info: n_threads = 24, n_threads_batch = 24, total_threads = 48

system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 610 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 47
main: loading model
srv    load_model: loading model '/data4/qwen2.5-14b-deep-q4_k.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1080 Ti) - 11025 MiB free
llama_model_loader: loaded meta data with 25 key-value pairs and 579 tensors from /data4/qwen2.5-14b-deep-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qw14
llama_model_loader: - kv   3:                         general.size_label str              = 15B
llama_model_loader: - kv   4:                          qwen2.block_count u32              = 48
llama_model_loader: - kv   5:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   6:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   7:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv   8:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv   9:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.37 GiB (4.87 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 48
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 13824
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.77 B
print_info: general.name     = Qw14
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  8148.38 MiB
load_tensors:   CPU_Mapped model buffer size =   417.66 MiB
..........................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 512
llama_init_from_model: n_ctx_per_seq = 512
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 512, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_init_from_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_init_from_model:      CUDA0 compute buffer size =   317.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_init_from_model: graph nodes  = 1495
llama_init_from_model: graph splits = 98
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 512
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool 
%}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 9
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 9, n_tokens = 9, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 9, n_tokens = 9
slot      release: id  0 | task 0 | stop processing: n_past = 23, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     264.49 ms /     9 tokens (   29.39 ms per token,    34.03 tokens per second)
       eval time =    2571.52 ms /    15 tokens (  171.43 ms per token,     **5.83 tokens per second**)
      total time =    2836.01 ms /    24 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 192.168.1.10 200
^Csrv    operator(): operator(): cleaning up before exit...
terminate called without an active exception
Aborted


 ./build/bin/llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf  -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047  -ngl 160  --host 0.0.0.0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
build: 0 (unknown) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
system info: n_threads = 24, n_threads_batch = 24, total_threads = 48

system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 610 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 47
main: loading model
srv    load_model: loading model '/data4/qwen2.5-14b-deep-q4_k.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1080 Ti) - 11025 MiB free
llama_model_loader: loaded meta data with 25 key-value pairs and 579 tensors from /data4/qwen2.5-14b-deep-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qw14
llama_model_loader: - kv   3:                         general.size_label str              = 15B
llama_model_loader: - kv   4:                          qwen2.block_count u32              = 48
llama_model_loader: - kv   5:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   6:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   7:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv   8:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv   9:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.37 GiB (4.87 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 48
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 13824
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.77 B
print_info: general.name     = Qw14
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  8148.38 MiB
load_tensors:   CPU_Mapped model buffer size =   417.66 MiB
..........................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 512
llama_init_from_model: n_ctx_per_seq = 512
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    96.00 MiB
llama_init_from_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_init_from_model:      CUDA0 compute buffer size =   307.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_init_from_model: graph nodes  = 1495
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 512
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool 
%}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 9
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 9, n_tokens = 9, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 9, n_tokens = 9
slot      release: id  0 | task 0 | stop processing: n_past = 23, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =      78.55 ms /     9 tokens (    8.73 ms per token,   114.58 tokens per second)
       eval time =     579.88 ms /    15 tokens (   38.66 ms per token,    **25.87 tokens per second**)
      total time =     658.43 ms /    24 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 192.168.1.10 200
^Cterminate called without an active exception
srv    operator(): operator(): cleaning up before exit...
Aborted
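
The two `print_timing` blocks above can be compared directly. The following is a minimal Python sketch, not part of llama.cpp, that parses the timing lines and recomputes the generation speed. The helper names (`parse_timings`, `tokens_per_second`) are hypothetical, and the sample text is copied from the logs above with the emphasis markers removed.

```python
import re

# Matches the three timing lines that llama-server prints after a request.
TIMING_RE = re.compile(
    r"^\s*(prompt eval|eval|total) time =\s*([\d.]+) ms /\s*(\d+) tokens"
)

def parse_timings(log_text):
    """Return {label: (milliseconds, tokens)} for each timing line found."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_RE.match(line)
        if match:
            label, ms, tokens = match.groups()
            timings[label] = (float(ms), int(tokens))
    return timings

def tokens_per_second(ms, tokens):
    """Convert an (ms, tokens) pair into tokens per second."""
    return tokens / (ms / 1000.0)

# Timing lines from the run with -nkvo (first log above).
NKVO_LOG = """\
prompt eval time =     264.49 ms /     9 tokens (   29.39 ms per token,    34.03 tokens per second)
       eval time =    2571.52 ms /    15 tokens (  171.43 ms per token,     5.83 tokens per second)
      total time =    2836.01 ms /    24 tokens
"""

# Timing lines from the run without -nkvo (second log above).
DEFAULT_LOG = """\
prompt eval time =      78.55 ms /     9 tokens (    8.73 ms per token,   114.58 tokens per second)
       eval time =     579.88 ms /    15 tokens (   38.66 ms per token,    25.87 tokens per second)
      total time =     658.43 ms /    24 tokens
"""

nkvo_speed = tokens_per_second(*parse_timings(NKVO_LOG)["eval"])
default_speed = tokens_per_second(*parse_timings(DEFAULT_LOG)["eval"])
slowdown = default_speed / nkvo_speed

print(f"-nkvo: {nkvo_speed:.2f} t/s, default: {default_speed:.2f} t/s, "
      f"slowdown: {slowdown:.2f}x")
```

On these two logs the recomputed speeds agree with the values printed by the server, and the run with -nkvo comes out roughly 4.4 times slower at generating tokens.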
@ggerganov
Member

Can you post the speed numbers by adding -t 1 to all 4 commands?

@zts9989
Author

zts9989 commented Mar 5, 2025

> Can you post the speed numbers by adding -t 1 to all 4 commands?

Of course! Your suggestion is an excellent debugging method indeed!

llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -ngl 160 --host 0.0.0.0 -t 1
This parameter configuration achieves an inference speed of: 27.12 t/s

llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -ngl 160 --host 0.0.0.0 -t 1 -nkvo
This parameter configuration achieves an inference speed of: 19.73 t/s

llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160 -t 1
This parameter configuration achieves an inference speed of: 26.12 t/s

llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160 -nkvo -t 1
This parameter configuration achieves an inference speed of: 18.99 t/s
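
For anyone who wants to collect the server-side numbers through the HTTP API instead of reading the log, here is a hedged Python sketch. It assumes the `/completion` response contains a `timings` object with a `predicted_per_second` field, which should be checked against the server documentation for the build in use. The function names and the default `n_predict` value are illustrative only.

```python
import json
import urllib.request

def build_completion_request(base_url, prompt, n_predict=16):
    """Build (but do not send) a POST request for the /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generation_speed(response_body):
    """Extract tokens per second from a /completion response body.

    Assumes a "timings" object with "predicted_per_second"; returns None
    if the field is missing.
    """
    timings = json.loads(response_body).get("timings", {})
    return timings.get("predicted_per_second")
```

With a server running, passing the request to `urllib.request.urlopen` and feeding the response body to `generation_speed` should give one speed sample per request. Repeating this for each of the four configurations keeps the prompt and token count identical across runs.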

@zts9989
Author

zts9989 commented Mar 5, 2025

I discovered another key point for reproducing the bug: When I compiled the program, I used:
cmake -B build -DGGML_CUDA=ON -DGGML_BUILD_NUMBER=3 -DGGML_OPENMP=OFF -DGGML_SCHED_MAX_COPIES=1

The GGML_OPENMP=OFF flag does not affect the performance of llama-cli, but it severely impacts the performance of llama-server.
Therefore, another critical factor for reproducing the bug is disabling OPENMP.

@ggerganov
Member

> The GGML_OPENMP=OFF flag does not affect the performance of llama-cli, but it severely impacts the performance of llama-server.

I didn't follow: does using GGML_OPENMP=OFF make llama-server slower or faster?

@zts9989
Author

zts9989 commented Mar 5, 2025

Let me summarize all the test data (note the two lines marked with an asterisk):

When compiling with GGML_OPENMP=OFF:
llama-cli -t 24: 24.27 t/s
llama-cli -t 24 -nkvo: 18.25 t/s
llama-server -t 24: 25.87 t/s
llama-server -t 24 -nkvo: 5.83 t/s (*)
llama-cli -t 1: 26.12 t/s
llama-cli -t 1 -nkvo: 18.99 t/s
llama-server -t 1: 27.12 t/s
llama-server -t 1 -nkvo: 19.73 t/s

Additional test with GGML_OPENMP=ON (24 threads only):
llama-server -t 24 -nkvo: 18.79 t/s (*)

Observations:

llama-cli and llama-server appear to use thread pools differently.
llama-cli shows minimal sensitivity to the GGML_OPENMP flag.
llama-server shows abnormal performance degradation with GGML_OPENMP=OFF when -nkvo and 24 threads are used together.
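
To make the outlier in the summary explicit, here is a small Python sketch that computes the slowdown caused by -nkvo for each program and thread count in the GGML_OPENMP=OFF build. The numbers are copied from the summary above. The 2x threshold is an arbitrary assumption chosen for illustration, and the function names are hypothetical.

```python
# (program, threads, -nkvo used?) -> tokens per second, GGML_OPENMP=OFF build.
RESULTS = {
    ("llama-cli", 24, False): 24.27,
    ("llama-cli", 24, True): 18.25,
    ("llama-server", 24, False): 25.87,
    ("llama-server", 24, True): 5.83,
    ("llama-cli", 1, False): 26.12,
    ("llama-cli", 1, True): 18.99,
    ("llama-server", 1, False): 27.12,
    ("llama-server", 1, True): 19.73,
}

def nkvo_slowdowns(results):
    """Return {(program, threads): speed without -nkvo / speed with -nkvo}."""
    slowdowns = {}
    for (program, threads, nkvo), speed in results.items():
        if nkvo:
            baseline = results[(program, threads, False)]
            slowdowns[(program, threads)] = baseline / speed
    return slowdowns

def outliers(slowdowns, threshold=2.0):
    """List configurations whose -nkvo slowdown exceeds the threshold."""
    return sorted(key for key, factor in slowdowns.items() if factor > threshold)

slowdowns = nkvo_slowdowns(RESULTS)
for (program, threads), factor in sorted(slowdowns.items()):
    print(f"{program} -t {threads}: {factor:.2f}x slower with -nkvo")
print("outliers:", outliers(slowdowns))
```

Three of the four configurations lose roughly a quarter of their speed with -nkvo. Only llama-server with 24 threads exceeds the threshold, which matches the first asterisk-marked line in the summary.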

@ggerganov added the bug (Something isn't working) label and removed the bug-unconfirmed label on Mar 5, 2025
@zts9989
Author

zts9989 commented Mar 5, 2025

While running inference with DeepSeek R1 671B Q4 (#11397 (comment)), I noticed anomalous numbers from llama-server. With llama-cli I initially reached up to 15 t/s, but the speed dropped to **0.1 t/s** when going through the API via llama-server. I then reproduced this degradation with the "-nkvo" parameter in multiple computing environments (all test systems were compiled with GGML_OPENMP=OFF). To demonstrate the issue more efficiently, I'm using the qwen2.5 14B model as the example here.

@jukofyork
Collaborator

Could it be related to the -DGGML_SCHED_MAX_COPIES=1 setting used for the --override-tensor option? Do you still see the slowdown without it?

@zts9989
Author

zts9989 commented Mar 6, 2025

@jukofyork Thanks for your attention and suggestions. I recompiled and ran the program again. It seems to be unrelated to the options -DGGML_SCHED_MAX_COPIES and --override-tensor. It's only related to GGML_OPENMP.

cmake -B build3 -DGGML_CUDA=ON -DGGML_BUILD_NUMBER=3 -DGGML_OPENMP=OFF

   prompt eval time =     238.19 ms /     6 tokens (   39.70 ms per token,    25.19 tokens per second)
   eval time =    2662.75 ms /    16 tokens (  166.42 ms per token,     6.01 tokens per second)
   total time =    2900.94 ms /    22 tokens
