SYCL bug: DeepSeek-V2-Lite-Chat-Q4_K_M does not work as expected #12390

@aubreyli

Description

Name and Version

root@alc-ai:/home/aubrey/work/llama-gpu# ./build/bin/llama-cli --version
version: 4887 (8fcb563)
built with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu

Operating systems

No response

Which llama.cpp modules do you know to be affected?

No response

Command line

 ./build/bin/llama-cli -m /srv/models/DeepSeek-V2-Lite-Chat-Q4_K_M/DeepSeek-V2-Lite-64x1.5B-Chat-Q4_K_M.gguf -ngl 99 -sm none -mg 0 -p "what is your name?" -n 30 -no-cnv
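
For reference, the SYCL build used here is presumably configured along the lines of the upstream instructions in docs/backend/SYCL.md; the exact CMake options on this machine were not recorded, so treat the steps below as a sketch:

 # assumed build steps (oneAPI compilers, SYCL backend enabled)
 source /opt/intel/oneapi/setvars.sh
 cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
 cmake --build build --config Release -j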

Problem description & steps to reproduce

The model loads and runs on the SYCL backend (Arc A770), but the generated text does not answer the prompt and continues with unrelated content instead (see the end of the log below):

root@alc-ai:/home/aubrey/work/llama-gpu# ./build/bin/llama-cli -m /srv/models/DeepSeek-V2-Lite-Chat-Q4_K_M/DeepSeek-V2-Lite-64x1.5B-Chat-Q4_K_M.gguf -ngl 99 -sm none -mg 0 -p "what is your name?" -n 30 -no-cnv
build: 4887 (8fcb563) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 47 key-value pairs and 377 tensors from /srv/models/DeepSeek-V2-Lite-Chat-Q4_K_M/DeepSeek-V2-Lite-64x1.5B-Chat-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V2 Lite Chat
llama_model_loader: - kv 3: general.finetune str = Chat
llama_model_loader: - kv 4: general.basename str = DeepSeek-V2-Lite
llama_model_loader: - kv 5: general.size_label str = 64x1.5B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = deepseek
llama_model_loader: - kv 8: general.license.link str = https://github.com/deepseek-ai/DeepSe...
llama_model_loader: - kv 9: deepseek2.block_count u32 = 27
llama_model_loader: - kv 10: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 2048
llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 10944
llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 16
llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 16
llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 6
llama_model_loader: - kv 18: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 19: deepseek2.vocab_size u32 = 102400
llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 1408
llama_model_loader: - kv 24: deepseek2.expert_count u32 = 64
llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 2
llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = false
llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 1
llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.070700
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-llm
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 100000
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 100001
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 100001
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: general.file_type u32 = 15
llama_model_loader: - type f32: 108 tensors
llama_model_loader: - type q5_0: 14 tensors
llama_model_loader: - type q8_0: 13 tensors
llama_model_loader: - type q4_K: 229 tensors
llama_model_loader: - type q6_K: 13 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 9.65 GiB (5.28 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 2
load: token to piece cache size = 0.6408 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 2048
print_info: n_layer = 27
print_info: n_head = 16
print_info: n_head_kv = 16
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 3072
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 10944
print_info: n_expert = 64
print_info: n_expert_used = 6
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 16B
print_info: model params = 15.71 B
print_info: general.name = DeepSeek V2 Lite Chat
print_info: n_layer_dense_lead = 1
print_info: n_lora_q = 0
print_info: n_lora_kv = 512
print_info: n_ff_exp = 1408
print_info: n_expert_shared = 2
print_info: expert_weights_scale = 1.0
print_info: expert_weights_norm = 0
print_info: expert_gating_func = softmax
print_info: rope_yarn_log_mul = 0.0707
print_info: vocab type = BPE
print_info: n_vocab = 102400
print_info: n_merges = 99757
print_info: BOS token = 100000 '<|begin▁of▁sentence|>'
print_info: EOS token = 100001 '<|end▁of▁sentence|>'
print_info: EOT token = 100001 '<|end▁of▁sentence|>'
print_info: PAD token = 100001 '<|end▁of▁sentence|>'
print_info: LF token = 185 'Ċ'
print_info: EOG token = 100001 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 27 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 28/28 layers to GPU
load_tensors: CPU_Mapped model buffer size = 112.50 MiB
load_tensors: SYCL0 model buffer size = 9767.98 MiB
.....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
Running with Environment Variables:
GGML_SYCL_DEBUG: 0
GGML_SYCL_DISABLE_OPT: 0
Build with Macros:
GGML_SYCL_FORCE_MMQ: no
GGML_SYCL_F16: no
Found 2 SYCL devices:
| ID | Device Type        | Name                    | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
|----|--------------------|-------------------------|---------|-------------------|----------------|---------------|-----------------|----------------|
|  0 | [level_zero:gpu:0] | Intel Arc A770 Graphics |   12.55 |               512 |           1024 |            32 |          16225M |   1.6.32224+14 |
|  1 | [level_zero:gpu:1] | Intel UHD Graphics 770  |    12.2 |                32 |            512 |            32 |          62707M |   1.6.32224+14 |
SYCL Optimization Feature:
| ID | Device Type        | Reorder |
|----|--------------------|---------|
|  0 | [level_zero:gpu:0] | Y       |
|  1 | [level_zero:gpu:1] | N       |
llama_context: SYCL_Host output buffer size = 0.39 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 27, can_shift = 0
init: SYCL0 KV buffer size = 1080.00 MiB
llama_context: KV self size = 1080.00 MiB, K (f16): 648.00 MiB, V (f16): 432.00 MiB
llama_context: SYCL0 compute buffer size = 213.03 MiB
llama_context: SYCL_Host compute buffer size = 12.01 MiB
llama_context: graph nodes = 1924
llama_context: graph splits = 2
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2656463
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 30, n_keep = 1

what is your name? is the difference between a man and a boy?
2 Answers | Add Yours
A man is an adult human male, while a boy is a

llama_perf_sampler_print: sampling time = 1.10 ms / 36 runs ( 0.03 ms per token, 32786.89 tokens per second)
llama_perf_context_print: load time = 3147.22 ms
llama_perf_context_print: prompt eval time = 288.22 ms / 6 tokens ( 48.04 ms per token, 20.82 tokens per second)
llama_perf_context_print: eval time = 1660.91 ms / 29 runs ( 57.27 ms per token, 17.46 tokens per second)
llama_perf_context_print: total time = 1952.91 ms / 35 tokens
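
A useful cross-check is to run the same prompt on the CPU backend and compare the completion; a sketch reusing the model path from the command above (the CPU output was not captured in this report):

 # hypothetical cross-check: same model and prompt with no GPU offload
 ./build/bin/llama-cli -m /srv/models/DeepSeek-V2-Lite-Chat-Q4_K_M/DeepSeek-V2-Lite-64x1.5B-Chat-Q4_K_M.gguf -ngl 0 -p "what is your name?" -n 30 -no-cnv

If the CPU run answers the prompt coherently, the unexpected continuation above is specific to the SYCL path.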

First Bad Commit

No response

Relevant log output

No response

Labels

bug (Something isn't working)
