Description
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA A2, compute capability 8.6, VMM: yes
Device 1: NVIDIA A2, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (2 devices)
register_device: registered device CUDA0 (NVIDIA A2)
register_device: registered device CUDA1 (NVIDIA A2)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel Xeon Processor (Icelake))
version: 5572 (7675c55)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
2 x A2 or 3 x A100
Models
SmolLM2-360M-Instruct-BF16
Problem description & steps to reproduce
We are testing inference with 15 threads in a worker pool; during the test, an abort is raised from one of those threads inside a llama_decode call. The process has the following threads:
- the main thread
- 3 threads started by llama.cpp: a host thread plus one extra thread per GPU (as far as I understand)
- 15 threads started by our worker pool
(gdb) info thread
Id Target Id Frame
1 Thread 0x7ffff7629000 (LWP 4059) "bai-test" 0x00007ffff5898d71 in __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=4071, futex_word=0x7fff9cdde2d0) at ./nptl/futex-internal.c:57
2 Thread 0x7fffb5dff000 (LWP 4062) "cuda00001400006" 0x00007ffff591b4cd in __GI___poll (fds=0x555555bda010, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
3 Thread 0x7fffa8dde000 (LWP 4069) "cuda-EvtHandlr" 0x00007ffff591b4cd in __GI___poll (fds=0x7fffa4000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
4 Thread 0x7fffa2dde000 (LWP 4070) "cuda-EvtHandlr" 0x00007ffff591b4cd in __GI___poll (fds=0x7fff98000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
5 Thread 0x7fff9cdde000 (LWP 4071) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555babc0c) at ./nptl/futex-internal.c:103
6 Thread 0x7fff93ff0000 (LWP 4072) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555babc0c) at ./nptl/futex-internal.c:103
7 Thread 0x7fff937ef000 (LWP 4073) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7ec) at ./nptl/futex-internal.c:103
8 Thread 0x7fff92fee000 (LWP 4074) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7e8) at ./nptl/futex-internal.c:103
9 Thread 0x7fff927ed000 (LWP 4075) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7ec) at ./nptl/futex-internal.c:103
10 Thread 0x7fff91fec000 (LWP 4076) "bai-test" 0x00007fffea04d771 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
11 Thread 0x7fff917eb000 (LWP 4077) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555babc0c) at ./nptl/futex-internal.c:103
12 Thread 0x7fff90fea000 (LWP 4078) "bai-test" 0x00007ffff7fc3e36 in ?? ()
13 Thread 0x7fff8bfff000 (LWP 4079) "bai-test" __futex_abstimed_wait_common (cancel=false, private=0, abstime=0x0, clockid=0, expected=2, futex_word=0x555555babc08) at ./nptl/futex-internal.c:103
14 Thread 0x7fff8b7fe000 (LWP 4080) "bai-test" 0x00007fffea94e6d0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
15 Thread 0x7fff8affd000 (LWP 4081) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7ec) at ./nptl/futex-internal.c:103
* 16 Thread 0x7fff8a7fc000 (LWP 4082) "bai-test" __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
17 Thread 0x7fff89ffb000 (LWP 4083) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7e8) at ./nptl/futex-internal.c:103
18 Thread 0x7fff897fa000 (LWP 4084) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555babc0c) at ./nptl/futex-internal.c:103
19 Thread 0x7fff88ff9000 (LWP 4085) "bai-test" __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x555555bac7e8) at ./nptl/futex-internal.c:103
Callstack:
Thread 16 "bai-test" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff8a7fc000 (LWP 4082)]
Download failed: Invalid argument. Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
warning: 44 ./nptl/pthread_kill.c: No such file or directory
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007ffff584527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff58288ff in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007ffff5f0f836 in ggml_abort (file=0x7ffff6613268 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=75, fmt=0x7ffff661325d "CUDA error")
at /home/ubuntu/sources/llama.cpp/ggml/src/ggml.c:221
#6 0x00007ffff60e6c77 in ggml_cuda_error (stmt=0x7ffff66147e2 "cudaGetLastError()", func=0x7ffff66146fa "ggml_cuda_op_mul_mat",
file=0x7ffff6613268 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=1676,
msg=0x7ffff5493f90 "operation failed due to a previous error during capture") at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75
#7 0x00007ffff60ed62d in ggml_cuda_op_mul_mat (ctx=..., src0=0x5555580d3930, src1=0x7fff0d04ab00, dst=0x7fff0d04ade0,
op=0x7ffff6108641 <ggml_cuda_op_mul_mat_vec(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*)>, quantize_src1=0x0) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1676
#8 0x00007ffff60ef1e7 in ggml_cuda_mul_mat (ctx=..., src0=0x5555580d3930, src1=0x7fff0d04ab00, dst=0x7fff0d04ade0)
at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1976
#9 0x00007ffff60f0ac6 in ggml_cuda_compute_forward (ctx=..., dst=0x7fff0d04ade0) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2264
#10 0x00007ffff60f221f in evaluate_and_capture_cuda_graph (cuda_ctx=0x7fff14001410, cgraph=0x7fff140fd1c8, graph_evaluated_or_captured=@0x7fff8a7d75db: false,
use_cuda_graph=@0x7fff8a7d75d9: true, cuda_graph_update_required=@0x7fff8a7d75da: true) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2673
#11 0x00007ffff60f28c8 in ggml_backend_cuda_graph_compute (backend=0x7fff14001950, cgraph=0x7fff140fd1c8)
at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2780
#12 0x00007ffff5f27a02 in ggml_backend_graph_compute_async (backend=0x7fff14001950, cgraph=0x7fff140fd1c8) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:334
#13 0x00007ffff5f2bb6d in ggml_backend_sched_compute_splits (sched=0x7fff1403ab70) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1404
#14 0x00007ffff5f2c809 in ggml_backend_sched_graph_compute_async (sched=0x7fff1403ab70, graph=0x7fff0cdf6030)
at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1596
#15 0x00007ffff7b88691 in llama_context::graph_compute (this=0x7fff14000b70, gf=0x7fff0cdf6030, batched=false) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1381
#16 0x00007ffff7b84f30 in llama_context::process_ubatch (this=0x7fff14000b70, ubatch=..., gtype=LLM_GRAPH_TYPE_DECODER, mstate=0x7fff14178a00,
ret=@0x7fff8a7d7848: -369628841) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:683
#17 0x00007ffff7b868bf in llama_context::decode (this=0x7fff14000b70, inp_batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1018
#18 0x00007ffff7b8d301 in llama_decode (ctx=0x7fff14000b70, batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:2681
All threads share the same model and vocabulary, but each thread has its own llama_context and llama_sampler (a minimal sketch of this setup follows the notes below):
llama_model_params.n_gpu_layers = 99
llama_context_params.n_ctx = 1000
- I understand that inference itself is bottlenecked on the 1+2 llama.cpp threads, but we also need our worker pool for regular business logic, where multithreading fits our needs.
- The same test runs reliably on Metal.
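For illustration, the sharing pattern described above is roughly the following (a minimal sketch, not our exact harness; the model path, prompt, and buffer sizes are placeholders, and error handling is omitted):

#include "llama.h"
#include <string>
#include <thread>
#include <vector>

int main() {
    llama_backend_init();

    // one shared model (and vocab) for the whole process
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;
    llama_model * model = llama_model_load_from_file("SmolLM2-360M-Instruct-BF16.gguf", mparams);
    const llama_vocab * vocab = llama_model_get_vocab(model);

    auto worker = [&]() {
        // each worker owns its own llama_context and llama_sampler
        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx = 1000;
        llama_context * ctx = llama_init_from_model(model, cparams);

        llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
        llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

        std::string prompt = "Hello";
        std::vector<llama_token> tokens(64);
        const int n = llama_tokenize(vocab, prompt.c_str(), (int) prompt.size(),
                                     tokens.data(), (int) tokens.size(),
                                     /*add_special=*/true, /*parse_special=*/true);
        tokens.resize(n > 0 ? n : 0);

        llama_batch batch = llama_batch_get_one(tokens.data(), (int) tokens.size());
        llama_decode(ctx, batch);   // <-- the abort is raised inside this call on CUDA
        llama_token tok = llama_sampler_sample(smpl, ctx, -1);
        (void) tok;

        llama_sampler_free(smpl);
        llama_free(ctx);
    };

    // 15 worker threads, all decoding concurrently
    std::vector<std::thread> pool;
    for (int i = 0; i < 15; ++i) pool.emplace_back(worker);
    for (auto & t : pool) t.join();

    llama_model_free(model);
    llama_backend_free();
    return 0;
}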
First Bad Commit
No response
Relevant log output
ggml_gallocr_needs_realloc: node inp_embd is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [960 128 1 1]
check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [ffn_inp-17] [960 128 1 1]
CUDA error: operation failed due to a previous error during capture
current device: 1, in function ggml_cuda_op_mul_mat at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1676
cudaGetLastError()