Misc. bug: Denial of Service (crash) when using verbose output with input tokens that are not in printable range. #12178

@avioligo

Description


Name and Version

~/git/llama.cpp/build/bin/ [tags/b4798] ./llama-cli --version
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Pro)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Pro)
version: 4798 (1782cdf)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0

Operating systems

No response

Which llama.cpp modules do you know to be affected?

llama-server

Command line

# Run the latest tag inside a container
docker run --rm -it ubuntu
apt update
apt install -y wget unzip curl build-essential
wget https://github.com/ggml-org/llama.cpp/releases/download/b4798/llama-b4798-bin-ubuntu-arm64.zip
unzip llama-b4798-bin-ubuntu-arm64.zip

# Run the server
LD_LIBRARY_PATH=$(pwd)/build/bin ./build/bin/llama-server -m /tmp/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -v

# Crash the server
curl -vvvv http://localhost:8080/v1/completions -d '{"prompt":[-1]}'

Problem description & steps to reproduce

Summary

An unhandled exception crashes the server in the "/v1/completions" route when the server is running in verbose mode. An attacker can send a completion request with tokens that are not in the vocabulary (out of range), which crashes the application.

I opened this issue per your request, after closing https://github.com/ggml-org/llama.cpp/security/advisories/GHSA-9fg6-6f9w-fgj3

Details

I recently ran into an unintended crash that anyone can trigger with a single HTTP request, causing denial of service. It affects both Debug and Release builds and depends on the -v flag.

PoC

I verified the PoC on two different operating systems (Linux, macOS) and model architectures (Qwen, Llama). I tested it in an Ubuntu container and on my host (an up-to-date M3), both building from source and using the prebuilt binaries; here I will use the prebuilt binaries.

Download and run the latest llama-server build with the -v flag:

docker run --rm -it ubuntu
apt update
apt install -y wget unzip curl build-essential
wget https://github.com/ggml-org/llama.cpp/releases/download/b4798/llama-b4798-bin-ubuntu-arm64.zip
unzip llama-b4798-bin-ubuntu-arm64.zip

Modify the model path as needed; I copied the model into the container with docker cp:

LD_LIBRARY_PATH=$(pwd)/build/bin ./build/bin/llama-server -m /tmp/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -v

First, the following line infers 2 tokens successfully, to make sure the model works:

curl -vvvv http://localhost:8080/v1/completions -d '{"prompt":[1], "max_tokens": 2}'

The following line crashes the application, demonstrating the denial-of-service vulnerability by triggering an uncaught exception:

curl -vvvv http://localhost:8080/v1/completions -d '{"prompt":[-1]}'

The server crashes due to an out-of-range error (no token is mapped to the value -1) when it tries to print the token to STDOUT:

que          post: new task, id = 56/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 56
slot get_availabl: id  0 | task 0 | selected slot by lru, t_last = 43778350372
slot        reset: id  0 | task 0 |
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151936)
Aborted
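
For context, here is a minimal C++ sketch of the failure mode (illustrative only, not the actual llama.cpp code): the negative token id is implicitly converted to size_t, so a bounds-checked vocabulary lookup such as std::vector::at throws std::out_of_range, and with no handler on the call path the process aborts.

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Stand-in for the vocabulary table (token id -> token string).
    std::vector<std::string> id_to_token(151936, "tok");

    int32_t token = -1; // attacker-controlled token id from the request

    // -1 converts to (size_t)-1 == 18446744073709551615, which fails the
    // range check in at() and throws std::out_of_range. Nothing catches
    // the exception, so std::terminate aborts the process.
    std::printf("%s\n", id_to_token.at(token).c_str());
    return 0;
}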

Impact

Any server running llama.cpp in verbose mode is subject to denial of service (DoS) via a single HTTP request.

Suggested behavior (opinion)

When receiving a prompt as a list of integers, the server should make sure the tokens are supported by the current vocab / tokenizer. Validating every input just for the sake of verbose logging does not make sense and can be I/O expensive, which would be overkill.
On the other hand, the user should get a clear error when invalid or unsupported tokens are provided, whether or not the server runs in verbose mode. Alternatively, the logger could skip unknown tokens when printing them, by checking that each value has a mapping and is within the supported range. A sketch of such a validation helper follows below.
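
As a rough sketch of the suggested validation (the names here are hypothetical, not llama.cpp's actual API), the server could reject out-of-range token ids once at request-parsing time and return an HTTP 400 instead of crashing:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: check that every token id in the prompt is a
// valid index into the vocabulary before the request is accepted.
static bool validate_prompt_tokens(const std::vector<int32_t> & tokens,
                                   int32_t n_vocab,
                                   std::string & error) {
    for (const int32_t tok : tokens) {
        if (tok < 0 || tok >= n_vocab) {
            error = "invalid token id " + std::to_string(tok) +
                    " (vocab size is " + std::to_string(n_vocab) + ")";
            return false; // caller responds with HTTP 400 and the message
        }
    }
    return true;
}

This is a single O(n) scan over the prompt at request time, so it avoids the per-log-line validation cost mentioned above.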

Thanks in advance!

First Bad Commit

No response

Relevant log output

que          post: new task, id = 56/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 56
slot get_availabl: id  0 | task 0 | selected slot by lru, t_last = 43778350372
slot        reset: id  0 | task 0 |
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151936)
Aborted

Labels

bug (Something isn't working), good first issue (Good for newcomers)
