Eval bug: crashes on the second request. #16247

@engelant

Description

Name and Version

root@a216981a6379:/app# ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-icelake.so
version: 6571 (5fb5576)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA GeForce RTX 4060 Ti

Models

bartowski/mistralai_Voxtral-Small-24B-2507-GGUF

Problem description & steps to reproduce

I'm trying to run the GGUF from bartowski/mistralai_Voxtral-Small-24B-2507-GGUF. It starts up and handles the first request, but crashes on the second request.

docker run --gpus=all -v ./models:/models -p 11434:11434 ghcr.io/ggml-org/llama.cpp:full-cuda --server \
        -m /models/mistralai_Voxtral-Small-24B-2507-IQ4_XS.gguf \
        --mmproj /models/mmproj-mistralai_Voxtral-Small-24B-2507-f16.gguf \
        -c 8192 --host 0.0.0.0 --port 11434 --alias Voxtral-Small-24B

The logs for the second, identical request start at 10:56:04; there is also a trap log in dmesg. A sketch of that kind of request is shown below.
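
The exact request body is not part of this report; the following is only a minimal sketch of sending the same audio chat request twice, which is enough to hit the crash path described above. It assumes the server accepts OpenAI-style input_audio content parts and uses a placeholder sample.wav and the alias from the docker command; the real payload may differ.

# Hypothetical reproduction sketch: send the same audio chat request twice.
# sample.wav is a placeholder, not a file from the original report.
AUDIO_B64=$(base64 -w0 sample.wav)
for i in 1 2; do
  curl -s http://localhost:11434/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Voxtral-Small-24B",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "Transcribe this audio."},
          {"type": "input_audio", "input_audio": {"data": "'"$AUDIO_B64"'", "format": "wav"}}
        ]
      }]
    }'
done

The first iteration should succeed (as in the log at 10:55:15); the second reuses the cached prompt ("selected slot by lcs similarity") and triggers the "Chunk not found" abort.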

First Bad Commit

No response

Relevant log output

2025-09-25T10:52:10.955666594Z main: server is listening on http://0.0.0.0:11434 - starting the main loop
2025-09-25T10:52:10.955669210Z srv  update_slots: all slots are idle
2025-09-25T10:55:15.160716004Z srv  params_from_: Chat format: Content-only
2025-09-25T10:55:15.160784733Z slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
2025-09-25T10:55:15.160790013Z slot launch_slot_: id  0 | task 0 | processing task
2025-09-25T10:55:15.160793119Z slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 231
2025-09-25T10:55:15.160796044Z slot update_slots: id  0 | task 0 | kv cache rm [0, end)
2025-09-25T10:55:15.160798229Z slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 43, n_tokens = 43, progress = 0.186147
2025-09-25T10:55:15.191839164Z slot update_slots: id  0 | task 0 | kv cache rm [43, end)
2025-09-25T10:55:15.191972264Z srv  process_chun: processing audio...
2025-09-25T10:55:15.645839414Z srv  process_chun: audio processed in 455 ms
2025-09-25T10:55:15.646268950Z slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 231, n_tokens = 1, progress = 1.000000
2025-09-25T10:55:15.646325687Z slot update_slots: id  0 | task 0 | prompt done, n_past = 231, n_tokens = 1
2025-09-25T10:55:19.897083986Z slot      release: id  0 | task 0 | stop processing: n_past = 313, truncated = 0
2025-09-25T10:55:19.897245238Z slot print_timing: id  0 | task 0 | 
2025-09-25T10:55:19.897252642Z prompt eval time =     699.04 ms /   231 tokens (    3.03 ms per token,   330.45 tokens per second)
2025-09-25T10:55:19.897259976Z        eval time =    4038.29 ms /    83 tokens (   48.65 ms per token,    20.55 tokens per second)
2025-09-25T10:55:19.897265166Z       total time =    4737.33 ms /   314 tokens
2025-09-25T10:55:19.897268031Z srv  update_slots: all slots are idle
2025-09-25T10:55:19.897651090Z srv  log_server_r: request: POST /chat/completions 10.170.210.6 200
2025-09-25T10:56:04.770907656Z srv  params_from_: Chat format: Content-only
2025-09-25T10:56:04.771023272Z slot get_availabl: id  0 | task 0 | selected slot by lcs similarity, lcs_len = 231, similarity = 0.738 (> 0.100 thold)
2025-09-25T10:56:04.771032369Z slot launch_slot_: id  0 | task 85 | processing task
2025-09-25T10:56:04.771073315Z slot update_slots: id  0 | task 85 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 231
2025-09-25T10:56:04.771075720Z slot update_slots: id  0 | task 85 | need to evaluate at least 1 token for each active slot, n_past = 231, n_prompt_tokens = 231
2025-09-25T10:56:04.771078205Z slot update_slots: id  0 | task 85 | kv cache rm [230, end)
2025-09-25T10:56:04.783625486Z libggml-base.so(+0x183ab)[0x7f0feda723ab]
2025-09-25T10:56:04.783686421Z libggml-base.so(ggml_print_backtrace+0x21f)[0x7f0feda7280f]
2025-09-25T10:56:04.783689747Z libggml-base.so(+0x2b20f)[0x7f0feda8520f]
2025-09-25T10:56:04.783692732Z /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f0fed8da20c]
2025-09-25T10:56:04.783695257Z /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7f0fed8da277]
2025-09-25T10:56:04.783698302Z /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7f0fed8da4d8]
2025-09-25T10:56:04.783700787Z ./llama-server(+0x851f9)[0x55a0162091f9]
2025-09-25T10:56:04.783703583Z ./llama-server(+0xea634)[0x55a01626e634]
2025-09-25T10:56:04.783705957Z ./llama-server(+0x8dc6d)[0x55a016211c6d]
2025-09-25T10:56:04.783708472Z ./llama-server(+0x532d7)[0x55a0161d72d7]
2025-09-25T10:56:04.783711147Z /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f0fed525d90]
2025-09-25T10:56:04.783713983Z /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f0fed525e40]
2025-09-25T10:56:04.783716637Z ./llama-server(+0x54e05)[0x55a0161d8e05]
2025-09-25T10:56:04.802912042Z terminate called after throwing an instance of 'std::runtime_error'
2025-09-25T10:56:04.802966865Z   what():  Chunk not found

Also dmesg:
[Thu Sep 25 10:56:04 2025] traps: llama-server[11664] general protection fault ip:7f0fed524898 sp:7fff5b90d230 error:0 in libc.so.6[7f0fed524000+195000]
