Misc. bug: Denial of Service (crash) when using verbose output with input tokens that are not in printable range. #12178

@avioligo

Description


Name and Version

~/git/llama.cpp/build/bin/ [tags/b4798] ./llama-cli --version
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Pro)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Pro)
version: 4798 (1782cdf)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0

Operating systems

No response

Which llama.cpp modules do you know to be affected?

llama-server

Command line

# Run the latest tag inside a container
docker run --rm -it ubuntu
apt update
apt install -y wget unzip curl build-essential
wget https://github.com/ggml-org/llama.cpp/releases/download/b4798/llama-b4798-bin-ubuntu-arm64.zip
unzip llama-b4798-bin-ubuntu-arm64.zip

# Run the server
LD_LIBRARY_PATH=$(pwd)/build/bin ./build/bin/llama-server -m /tmp/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -v

# Crash the server
curl -vvvv http://localhost:8080/v1/completions -d '{"prompt":[-1]}'

Problem description & steps to reproduce

Summary

An unhandled exception crashes the server in the "/v1/completions" route when the server is running in verbose mode. An attacker can send a completion request with tokens that are not in the vocabulary (out of range), which crashes the application.

I opened this issue per your request, after closing https://github.com/ggml-org/llama.cpp/security/advisories/GHSA-9fg6-6f9w-fgj3

Details

I recently ran into an unintended crash that anyone can trigger with a single HTTP request, causing denial of service. It affects both Debug and Release builds and depends on the -v flag.

PoC

I verified the PoC on two different operating systems (Linux, macOS) and model architectures (Qwen, Llama). I tested it in an Ubuntu container and on my host (an up-to-date M3), both building from source and using the prebuilt binaries; here I will use the prebuilt binaries.

Download and run the latest llama-server build with the -v flag:

docker run --rm -it ubuntu
apt update
apt install -y wget unzip curl build-essential
wget https://github.com/ggml-org/llama.cpp/releases/download/b4798/llama-b4798-bin-ubuntu-arm64.zip
unzip llama-b4798-bin-ubuntu-arm64.zip

Modify the model path as needed; I copied the model into the container with docker cp:

LD_LIBRARY_PATH=$(pwd)/build/bin ./build/bin/llama-server -m /tmp/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -v

First, the following line infers 2 tokens successfully, to make sure the model works:

curl -vvvv http://localhost:8080/v1/completions -d '{"prompt":[1], "max_tokens": 2}'

The following line crashes the application, demonstrating the denial-of-service vulnerability by triggering an uncaught exception:

curl -vvvv http://localhost:8080/v1/completions -d '{"prompt":[-1]}'

The server crashes due to an out-of-range error (no token is mapped to the value -1) when it tries to print the token to STDOUT:

que          post: new task, id = 56/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 56
slot get_availabl: id  0 | task 0 | selected slot by lru, t_last = 43778350372
slot        reset: id  0 | task 0 |
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151936)
Aborted
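
For context, here is a minimal C++ sketch of the failure mode (illustrative only, not the actual llama.cpp code): the negative token id is implicitly converted to size_t, so a bounds-checked vocabulary lookup such as std::vector::at throws std::out_of_range, and with no handler on the call path the process aborts.

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Stand-in for the vocabulary table (token id -> token string).
    std::vector<std::string> id_to_token(151936, "tok");

    int32_t token = -1; // attacker-controlled token id from the request

    // -1 converts to (size_t)-1 == 18446744073709551615, which fails the
    // range check in at() and throws std::out_of_range. Nothing catches
    // the exception, so std::terminate aborts the process.
    std::printf("%s\n", id_to_token.at(token).c_str());
    return 0;
}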

Impact

Any server running llama.cpp in verbose mode is subject to denial of service (DoS) via a single HTTP request.

Suggested behavior (opinion)

When receiving a prompt as a list of integers, the server should make sure the tokens are supported by the current vocab / tokenizer. Validating every input just for the sake of verbose logging does not make sense and can be I/O expensive, which would be overkill.
On the other hand, the user should get a clear error when invalid or unsupported tokens are provided, whether or not the server runs in verbose mode. Alternatively, the logger could skip unknown tokens when printing them, by checking that each value has a mapping and is within the supported range. A sketch of such a validation helper follows below.
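
As a rough sketch of the suggested validation (the names here are hypothetical, not llama.cpp's actual API), the server could reject out-of-range token ids once at request-parsing time and return an HTTP 400 instead of crashing:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: check that every token id in the prompt is a
// valid index into the vocabulary before the request is accepted.
static bool validate_prompt_tokens(const std::vector<int32_t> & tokens,
                                   int32_t n_vocab,
                                   std::string & error) {
    for (const int32_t tok : tokens) {
        if (tok < 0 || tok >= n_vocab) {
            error = "invalid token id " + std::to_string(tok) +
                    " (vocab size is " + std::to_string(n_vocab) + ")";
            return false; // caller responds with HTTP 400 and the message
        }
    }
    return true;
}

This is a single O(n) scan over the prompt at request time, so it avoids the per-log-line validation cost mentioned above.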

Thanks in advance!

First Bad Commit

No response

Relevant log output

que          post: new task, id = 56/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 56
slot get_availabl: id  0 | task 0 | selected slot by lru, t_last = 43778350372
slot        reset: id  0 | task 0 |
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151936)
Aborted

Labels

bug (Something isn't working), good first issue (Good for newcomers)
