Name and Version
root@a216981a6379:/app# ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-icelake.so
version: 6571 (5fb5576)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA GeForce RTX 4060 Ti
Models
bartowski/mistralai_Voxtral-Small-24B-2507-GGUF
Problem description & steps to reproduce
I am trying to run the GGUF from bartowski/mistralai_Voxtral-Small-24B-2507-GGUF. The server starts up and handles the first request fine, but crashes on the second, identical request.
docker run --gpus=all -v ./models:/models -p 11434:11434 ghcr.io/ggml-org/llama.cpp:full-cuda --server \
-m /models/mistralai_Voxtral-Small-24B-2507-IQ4_XS.gguf \
--mmproj /models/mmproj-mistralai_Voxtral-Small-24B-2507-f16.gguf \
-c 8192 --host 0.0.0.0 --port 11434 --alias Voxtral-Small-24B
The log output for the second, identical request starts at 10:56:04; there is also a general protection fault entry in dmesg.
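For reference, a minimal sketch of how the two requests can be sent. The prompt text, audio file, and exact payload here are placeholders rather than the original request; it assumes the OpenAI-style input_audio content part and simply sends the same short audio request twice in a row, matching the pattern in the logs below.

# placeholder audio file; any short WAV should do
AUDIO_B64=$(base64 -w0 sample.wav)

# send the identical request twice; the crash happens on the second one
for i in 1 2; do
  curl -s http://localhost:11434/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
          \"model\": \"Voxtral-Small-24B\",
          \"messages\": [{
            \"role\": \"user\",
            \"content\": [
              {\"type\": \"text\", \"text\": \"Transcribe this audio.\"},
              {\"type\": \"input_audio\", \"input_audio\": {\"data\": \"$AUDIO_B64\", \"format\": \"wav\"}}
            ]
          }]
        }"
  echo
done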
First Bad Commit
No response
Relevant log output
2025-09-25T10:52:10.955666594Z main: server is listening on http://0.0.0.0:11434 - starting the main loop
2025-09-25T10:52:10.955669210Z srv update_slots: all slots are idle
2025-09-25T10:55:15.160716004Z srv params_from_: Chat format: Content-only
2025-09-25T10:55:15.160784733Z slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
2025-09-25T10:55:15.160790013Z slot launch_slot_: id 0 | task 0 | processing task
2025-09-25T10:55:15.160793119Z slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 231
2025-09-25T10:55:15.160796044Z slot update_slots: id 0 | task 0 | kv cache rm [0, end)
2025-09-25T10:55:15.160798229Z slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 43, n_tokens = 43, progress = 0.186147
2025-09-25T10:55:15.191839164Z slot update_slots: id 0 | task 0 | kv cache rm [43, end)
2025-09-25T10:55:15.191972264Z srv process_chun: processing audio...
2025-09-25T10:55:15.645839414Z srv process_chun: audio processed in 455 ms
2025-09-25T10:55:15.646268950Z slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 231, n_tokens = 1, progress = 1.000000
2025-09-25T10:55:15.646325687Z slot update_slots: id 0 | task 0 | prompt done, n_past = 231, n_tokens = 1
2025-09-25T10:55:19.897083986Z slot release: id 0 | task 0 | stop processing: n_past = 313, truncated = 0
2025-09-25T10:55:19.897245238Z slot print_timing: id 0 | task 0 |
2025-09-25T10:55:19.897252642Z prompt eval time = 699.04 ms / 231 tokens ( 3.03 ms per token, 330.45 tokens per second)
2025-09-25T10:55:19.897259976Z eval time = 4038.29 ms / 83 tokens ( 48.65 ms per token, 20.55 tokens per second)
2025-09-25T10:55:19.897265166Z total time = 4737.33 ms / 314 tokens
2025-09-25T10:55:19.897268031Z srv update_slots: all slots are idle
2025-09-25T10:55:19.897651090Z srv log_server_r: request: POST /chat/completions 10.170.210.6 200
2025-09-25T10:56:04.770907656Z srv params_from_: Chat format: Content-only
2025-09-25T10:56:04.771023272Z slot get_availabl: id 0 | task 0 | selected slot by lcs similarity, lcs_len = 231, similarity = 0.738 (> 0.100 thold)
2025-09-25T10:56:04.771032369Z slot launch_slot_: id 0 | task 85 | processing task
2025-09-25T10:56:04.771073315Z slot update_slots: id 0 | task 85 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 231
2025-09-25T10:56:04.771075720Z slot update_slots: id 0 | task 85 | need to evaluate at least 1 token for each active slot, n_past = 231, n_prompt_tokens = 231
2025-09-25T10:56:04.771078205Z slot update_slots: id 0 | task 85 | kv cache rm [230, end)
2025-09-25T10:56:04.783625486Z libggml-base.so(+0x183ab)[0x7f0feda723ab]
2025-09-25T10:56:04.783686421Z libggml-base.so(ggml_print_backtrace+0x21f)[0x7f0feda7280f]
2025-09-25T10:56:04.783689747Z libggml-base.so(+0x2b20f)[0x7f0feda8520f]
2025-09-25T10:56:04.783692732Z /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f0fed8da20c]
2025-09-25T10:56:04.783695257Z /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7f0fed8da277]
2025-09-25T10:56:04.783698302Z /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7f0fed8da4d8]
2025-09-25T10:56:04.783700787Z ./llama-server(+0x851f9)[0x55a0162091f9]
2025-09-25T10:56:04.783703583Z ./llama-server(+0xea634)[0x55a01626e634]
2025-09-25T10:56:04.783705957Z ./llama-server(+0x8dc6d)[0x55a016211c6d]
2025-09-25T10:56:04.783708472Z ./llama-server(+0x532d7)[0x55a0161d72d7]
2025-09-25T10:56:04.783711147Z /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f0fed525d90]
2025-09-25T10:56:04.783713983Z /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f0fed525e40]
2025-09-25T10:56:04.783716637Z ./llama-server(+0x54e05)[0x55a0161d8e05]
2025-09-25T10:56:04.802912042Z terminate called after throwing an instance of 'std::runtime_error'
2025-09-25T10:56:04.802966865Z what(): Chunk not found
Also dmesg:
[Thu Sep 25 10:56:04 2025] traps: llama-server[11664] general protection fault ip:7f0fed524898 sp:7fff5b90d230 error:0 in libc.so.6[7f0fed524000+195000]