Description
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA A2, compute capability 8.6, VMM: yes
Device 1: NVIDIA A2, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (2 devices)
register_device: registered device CUDA0 (NVIDIA A2)
register_device: registered device CUDA1 (NVIDIA A2)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel Xeon Processor (Icelake))
version: 5572 (7675c55)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
2 x A2 or 3 x A100
Models
SmolLM2-360M-Instruct-BF16
Problem description & steps to reproduce
We are testing inference with 15 threads in a worker pool; during the test, an abort is raised from one of those threads inside a llama_decode call. The process has the following threads:
- the main thread
- 3 threads started by llama.cpp: a host thread plus one extra thread per GPU (as far as I understand)
- 15 threads started by our worker pool
(gdb) info thread
Id Target Id Frame
1 Thread 0x7ffff7629000 (LWP 4059) "bai-test" 0x00007ffff5898d71 in __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=4071, futex_word=0x7fff9cdde2d0) at ./nptl/futex-internal.c:57
2 Thread 0x7fffb5dff000 (LWP 4062) "cuda00001400006" 0x00007ffff591b4cd in __GI___poll (fds=0x555555bda010, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
3 Thread 0x7fffa8dde000 (LWP 4069) "cuda-EvtHandlr" 0x00007ffff591b4cd in __GI___poll (fds=0x7fffa4000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
4 Thread 0x7fffa2dde000 (LWP 4070) "cuda-EvtHandlr" 0x00007ffff591b4cd in __GI___poll (fds=0x7fff98000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
5 Thread 0x7fff9cdde000 (LWP 4071) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555babc0c) at ./nptl/futex-internal.c:103
6 Thread 0x7fff93ff0000 (LWP 4072) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555babc0c) at ./nptl/futex-internal.c:103
7 Thread 0x7fff937ef000 (LWP 4073) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7ec) at ./nptl/futex-internal.c:103
8 Thread 0x7fff92fee000 (LWP 4074) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7e8) at ./nptl/futex-internal.c:103
9 Thread 0x7fff927ed000 (LWP 4075) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7ec) at ./nptl/futex-internal.c:103
10 Thread 0x7fff91fec000 (LWP 4076) "bai-test" 0x00007fffea04d771 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
11 Thread 0x7fff917eb000 (LWP 4077) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555babc0c) at ./nptl/futex-internal.c:103
12 Thread 0x7fff90fea000 (LWP 4078) "bai-test" 0x00007ffff7fc3e36 in ?? ()
13 Thread 0x7fff8bfff000 (LWP 4079) "bai-test" __futex_abstimed_wait_common (cancel=false, private=0, abstime=0x0, clockid=0, expected=2, futex_word=0x555555babc08) at ./nptl/futex-internal.c:103
14 Thread 0x7fff8b7fe000 (LWP 4080) "bai-test" 0x00007fffea94e6d0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
15 Thread 0x7fff8affd000 (LWP 4081) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7ec) at ./nptl/futex-internal.c:103
* 16 Thread 0x7fff8a7fc000 (LWP 4082) "bai-test" __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
17 Thread 0x7fff89ffb000 (LWP 4083) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7e8) at ./nptl/futex-internal.c:103
18 Thread 0x7fff897fa000 (LWP 4084) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555babc0c) at ./nptl/futex-internal.c:103
19 Thread 0x7fff88ff9000 (LWP 4085) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7e8) at ./nptl/futex-internal.c:103
Callstack:
Thread 16 "bai-test" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff8a7fc000 (LWP 4082)]
Download failed: Invalid argument. Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
warning: 44 ./nptl/pthread_kill.c: No such file or directory
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007ffff584527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff58288ff in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007ffff5f0f836 in ggml_abort (file=0x7ffff6613268 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=75, fmt=0x7ffff661325d "CUDA error")
at /home/ubuntu/sources/llama.cpp/ggml/src/ggml.c:221
#6 0x00007ffff60e6c77 in ggml_cuda_error (stmt=0x7ffff66147e2 "cudaGetLastError()", func=0x7ffff66146fa "ggml_cuda_op_mul_mat",
file=0x7ffff6613268 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=1676,
msg=0x7ffff5493f90 "operation failed due to a previous error during capture") at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75
#7 0x00007ffff60ed62d in ggml_cuda_op_mul_mat (ctx=..., src0=0x5555580d3930, src1=0x7fff0d04ab00, dst=0x7fff0d04ade0,
op=0x7ffff6108641 <ggml_cuda_op_mul_mat_vec(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*)>, quantize_src1=0x0) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1676
#8 0x00007ffff60ef1e7 in ggml_cuda_mul_mat (ctx=..., src0=0x5555580d3930, src1=0x7fff0d04ab00, dst=0x7fff0d04ade0)
at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1976
#9 0x00007ffff60f0ac6 in ggml_cuda_compute_forward (ctx=..., dst=0x7fff0d04ade0) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2264
#10 0x00007ffff60f221f in evaluate_and_capture_cuda_graph (cuda_ctx=0x7fff14001410, cgraph=0x7fff140fd1c8, graph_evaluated_or_captured=@0x7fff8a7d75db: false,
use_cuda_graph=@0x7fff8a7d75d9: true, cuda_graph_update_required=@0x7fff8a7d75da: true) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2673
#11 0x00007ffff60f28c8 in ggml_backend_cuda_graph_compute (backend=0x7fff14001950, cgraph=0x7fff140fd1c8)
at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2780
#12 0x00007ffff5f27a02 in ggml_backend_graph_compute_async (backend=0x7fff14001950, cgraph=0x7fff140fd1c8) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:334
#13 0x00007ffff5f2bb6d in ggml_backend_sched_compute_splits (sched=0x7fff1403ab70) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1404
#14 0x00007ffff5f2c809 in ggml_backend_sched_graph_compute_async (sched=0x7fff1403ab70, graph=0x7fff0cdf6030)
at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1596
#15 0x00007ffff7b88691 in llama_context::graph_compute (this=0x7fff14000b70, gf=0x7fff0cdf6030, batched=false) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1381
#16 0x00007ffff7b84f30 in llama_context::process_ubatch (this=0x7fff14000b70, ubatch=..., gtype=LLM_GRAPH_TYPE_DECODER, mstate=0x7fff14178a00,
ret=@0x7fff8a7d7848: -369628841) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:683
#17 0x00007ffff7b868bf in llama_context::decode (this=0x7fff14000b70, inp_batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1018
#18 0x00007ffff7b8d301 in llama_decode (ctx=0x7fff14000b70, batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:2681
All threads share the same model and vocabulary, but each thread has its own llama_context and llama_sampler (a minimal sketch of this setup follows the notes below):
llama_model_params.n_gpu_layers = 99
llama_context_params.n_ctx = 1000
- I understand that inference itself is bottlenecked on the 1+2 llama.cpp threads, but we also need our worker pool for regular business logic, where multithreading fits our needs.
- The same test runs reliably on Metal.
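For illustration, the sharing pattern described above is roughly the following (a minimal sketch, not our exact harness; the model path, prompt, and buffer sizes are placeholders, and error handling is omitted):

#include "llama.h"
#include <string>
#include <thread>
#include <vector>

int main() {
    llama_backend_init();

    // one shared model (and vocab) for the whole process
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;
    llama_model * model = llama_model_load_from_file("SmolLM2-360M-Instruct-BF16.gguf", mparams);
    const llama_vocab * vocab = llama_model_get_vocab(model);

    auto worker = [&]() {
        // each worker owns its own llama_context and llama_sampler
        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx = 1000;
        llama_context * ctx = llama_init_from_model(model, cparams);

        llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
        llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

        std::string prompt = "Hello";
        std::vector<llama_token> tokens(64);
        const int n = llama_tokenize(vocab, prompt.c_str(), (int) prompt.size(),
                                     tokens.data(), (int) tokens.size(),
                                     /*add_special=*/true, /*parse_special=*/true);
        tokens.resize(n > 0 ? n : 0);

        llama_batch batch = llama_batch_get_one(tokens.data(), (int) tokens.size());
        llama_decode(ctx, batch);   // <-- the abort is raised inside this call on CUDA
        llama_token tok = llama_sampler_sample(smpl, ctx, -1);
        (void) tok;

        llama_sampler_free(smpl);
        llama_free(ctx);
    };

    // 15 worker threads, all decoding concurrently
    std::vector<std::thread> pool;
    for (int i = 0; i < 15; ++i) pool.emplace_back(worker);
    for (auto & t : pool) t.join();

    llama_model_free(model);
    llama_backend_free();
    return 0;
}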
First Bad Commit
No response
Relevant log output
ggml_gallocr_needs_realloc: node inp_embd is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [960 128 1 1]
check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [ffn_inp-17] [960 128 1 1]
CUDA error: operation failed due to a previous error during capture
current device: 1, in function ggml_cuda_op_mul_mat at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1676
cudaGetLastError()