
Bug: Vulkan backend freezes during its execution #10037

@GrainyTV

Description

What happened?

I was experimenting with the llama.cpp project and LLM inference in general. I made a basic chat application (similar to the main.cpp project from the examples, but much simpler). It worked, but it took about a minute to generate answers for two prompts, so I decided to look into GPU-based inference using Vulkan. After recompiling the project with Vulkan support enabled, I started a chat session with the ./llama-cli tool, and it detected my GPU as a usable device. However, when it started loading the model I have been using, qwen2.5-0.5b-instruct-q8_0.gguf, it froze mid-execution. The last line printed to the console comes from the call that prints the model properties:

llm_load_print_meta: max token length = 256

After that, nothing happens. I can stop the execution with Ctrl + C, but that's pretty much it. I then recompiled the project in debug mode to get a better understanding of what is going on, but I couldn't deduce much. I have included the GDB logs in the section below. I also tried the ./test-backend-ops application, and the same freeze occurs there as well.
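For completeness, this is roughly how I built and ran it (the cmake option name is the one documented for the Vulkan backend at this version; `-ngl 10` matches the `n_gpu_layers=10` visible in the backtrace below):

```shell
# configure with the Vulkan backend enabled, then build in Release mode
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# run against the model that triggers the freeze, offloading 10 layers
./build/bin/llama-cli -m qwen2.5-0.5b-instruct-q8_0.gguf -ngl 10
```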

Name and Version

./llama-cli --version

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon R7 200 Series (AMD open-source driver) | uma: 0 | fp16: 0 | warp size: 64
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (AMD Radeon R7 200 Series)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz)
version: 3965 (ac113a0f)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Temporary breakpoint 1, main (argc=11, argv=0x7fffffffe4b8)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/examples/main/main.cpp:138
138	int main(int argc, char ** argv) {
(gdb) continue
Continuing.
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon R7 200 Series (AMD open-source driver) | uma: 0 | fp16: 0 | warp size: 64
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (AMD Radeon R7 200 Series)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz)
[New Thread 0x7fffee6006c0 (LWP 20010)]
build: 3965 (ac113a0f) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device Vulkan0 (AMD Radeon R7 200 Series) - 768 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from qwen2.5-0.5b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 630M
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 7
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q8_0:  170 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 638.74 MiB (8.50 BPW)
llm_load_print_meta: general.name     = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
[New Thread 0x7fffe7e006c0 (LWP 20012)]
[New Thread 0x7fffe74006c0 (LWP 20013)]
[New Thread 0x7fffe6a006c0 (LWP 20014)]
[New Thread 0x7fffe58006c0 (LWP 20015)]
[Thread 0x7fffe6a006c0 (LWP 20014) exited]
[Thread 0x7fffe74006c0 (LWP 20013) exited]
[New Thread 0x7fffe4e006c0 (LWP 20016)]
[Thread 0x7fffe58006c0 (LWP 20015) exited]
[Thread 0x7fffe7e006c0 (LWP 20012) exited]
[New Thread 0x7fffdfe006c0 (LWP 20017)]
[Thread 0x7fffe4e006c0 (LWP 20016) exited]
[New Thread 0x7fffdf4006c0 (LWP 20018)]
[Thread 0x7fffdfe006c0 (LWP 20017) exited]
[New Thread 0x7fffdea006c0 (LWP 20019)]
[Thread 0x7fffdf4006c0 (LWP 20018) exited]
[New Thread 0x7fffde0006c0 (LWP 20020)]
[Thread 0x7fffdea006c0 (LWP 20019) exited]
[New Thread 0x7fffdd6006c0 (LWP 20021)]
[Thread 0x7fffde0006c0 (LWP 20020) exited]
[New Thread 0x7fffdcc006c0 (LWP 20022)]
[Thread 0x7fffdd6006c0 (LWP 20021) exited]
[New Thread 0x7fffd3e006c0 (LWP 20023)]
[Thread 0x7fffdcc006c0 (LWP 20022) exited]
[New Thread 0x7fffd34006c0 (LWP 20024)]
[Thread 0x7fffd3e006c0 (LWP 20023) exited]
[New Thread 0x7fffd2a006c0 (LWP 20025)]
[Thread 0x7fffd34006c0 (LWP 20024) exited]
[New Thread 0x7fffd20006c0 (LWP 20026)]
[Thread 0x7fffd2a006c0 (LWP 20025) exited]
[New Thread 0x7fffd16006c0 (LWP 20027)]
[Thread 0x7fffd20006c0 (LWP 20026) exited]
[New Thread 0x7fffd0c006c0 (LWP 20028)]
[Thread 0x7fffd16006c0 (LWP 20027) exited]
[New Thread 0x7fffd02006c0 (LWP 20029)]
[Thread 0x7fffd0c006c0 (LWP 20028) exited]
[New Thread 0x7fffcf8006c0 (LWP 20030)]
[Thread 0x7fffd02006c0 (LWP 20029) exited]
[New Thread 0x7fffcee006c0 (LWP 20031)]
[New Thread 0x7fffce4006c0 (LWP 20032)]
[Thread 0x7fffcee006c0 (LWP 20031) exited]
[Thread 0x7fffcf8006c0 (LWP 20030) exited]
[New Thread 0x7fffcda006c0 (LWP 20033)]
[Thread 0x7fffce4006c0 (LWP 20032) exited]
[New Thread 0x7fffcd0006c0 (LWP 20034)]
[Thread 0x7fffcda006c0 (LWP 20033) exited]
[New Thread 0x7fffcc6006c0 (LWP 20035)]
[Thread 0x7fffcd0006c0 (LWP 20034) exited]
[New Thread 0x7fffcbc006c0 (LWP 20036)]
[Thread 0x7fffcc6006c0 (LWP 20035) exited]
[New Thread 0x7fffcb2006c0 (LWP 20037)]
[New Thread 0x7fffca8006c0 (LWP 20038)]
[Thread 0x7fffcb2006c0 (LWP 20037) exited]
[New Thread 0x7fffc9e006c0 (LWP 20039)]
[New Thread 0x7fffc94006c0 (LWP 20040)]
[Thread 0x7fffca8006c0 (LWP 20038) exited]
[New Thread 0x7fffc8a006c0 (LWP 20041)]
[Thread 0x7fffc94006c0 (LWP 20040) exited]
[New Thread 0x7fffc80006c0 (LWP 20042)]
[Thread 0x7fffc8a006c0 (LWP 20041) exited]
[New Thread 0x7fffc76006c0 (LWP 20043)]
[Thread 0x7fffc76006c0 (LWP 20043) exited]
[New Thread 0x7fffc6c006c0 (LWP 20044)]
[Thread 0x7fffc6c006c0 (LWP 20044) exited]
[New Thread 0x7fffc62006c0 (LWP 20045)]
[Thread 0x7fffc9e006c0 (LWP 20039) exited]
[Thread 0x7fffc62006c0 (LWP 20045) exited]
[Thread 0x7fffc80006c0 (LWP 20042) exited]
[Thread 0x7fffcbc006c0 (LWP 20036) exited]

Thread 1 "llama-cli" received signal SIGINT, Interrupt.
0x00007ffff6e9f68e in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff6e9f68e in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff6ea2080 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x00007ffff70d5ef1 in __gthread_cond_wait (__cond=<optimized out>,
    __mutex=<optimized out>)
    at /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:878
#3  std::__condvar::wait (this=<optimized out>, __m=...)
    at /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/std_mutex.h:171
#4  std::condition_variable::wait (this=<optimized out>, __lock=...)
    at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/condition_variable.cc:41
#5  0x00007ffff759f834 in operator() (__closure=0x7fffffff7648,
    device=std::shared_ptr<vk_device_struct> (use count 2, weak count 0) = {...}, pipeline=std::shared_ptr<vk_pipeline_struct> (empty) = {...},
    name="matmul_q4_1_f32_aligned_m", spv_size=10476,
    spv_data=0x7ffff782dcc0 <matmul_q4_1_f32_aligned_fp32_data>,
--Type <RET> for more, q to quit, c to continue without paging--c
    entrypoint="main", parameter_count=3, push_constant_size=56,
    wg_denoms=..., specialization_constants=..., align=64)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/ggml/src/ggml-vulkan.cpp:1249
#6  0x00007ffff75b45e5 in ggml_vk_load_shaders (
    device=std::shared_ptr<vk_device_struct> (use count 2, weak count 0) = {...})
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/ggml/src/ggml-vulkan.cpp:1499
#7  0x00007ffff75de14a in ggml_vk_get_device (idx=0)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/ggml/src/ggml-vulkan.cpp:1959
#8  0x00007ffff7618233 in ggml_backend_vk_host_buffer_type ()
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/ggml/src/ggml-vulkan.cpp:6387
#9  0x00007ffff761e25d in ggml_backend_vk_device_get_host_buffer_type (
    dev=0x55555578bc30)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/ggml/src/ggml-vulkan.cpp:6686
#10 0x00007ffff753a87b in ggml_backend_dev_host_buffer_type (
    device=0x55555578bc30)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/ggml/src/ggml-backend.cpp:486
#11 0x00007ffff7c4974e in llama_default_buffer_type_cpu (model=...,
    host_buffer=true)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/src/llama.cpp:3410
#12 0x00007ffff7c56ebe in llm_load_tensors (ml=..., model=...,
    n_gpu_layers=10, split_mode=LLAMA_SPLIT_MODE_LAYER, main_gpu=0,
    tensor_split=0x7fffffffd204, use_mlock=false,
    progress_callback=0x7ffff7c8b5dd <_FUN(float, void*)>,
    progress_callback_user_data=0x7fffffffc358)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/src/llama.cpp:7032
#13 0x00007ffff7c79900 in llama_model_load (
    fname="qwen2.5-0.5b-instruct-q8_0.gguf", model=..., params=...)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/src/llama.cpp:9135
#14 0x00007ffff7c8bb26 in llama_load_model_from_file (
    path_model=0x5555559b1b40 "qwen2.5-0.5b-instruct-q8_0.gguf",
    params=...)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/src/llama.cpp:19243
#15 0x000055555563989e in common_init_from_params (params=...)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/common/common.cpp:849
#16 0x00005555555ad660 in main (argc=11, argv=0x7fffffffe4b8)
    at /home/personal/Documents/Test/Coding/AI/llama.cpp/examples/main/main.cpp:200

Metadata

    Labels

    bug-unconfirmed, medium severity (used to report medium-severity bugs in llama.cpp, e.g. malfunctioning features that are still usable)
