Eval bug: Heavy nondeterminism in Qwen3 MoE (CUDA) #13280

@matteoserva

Name and Version

llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
version: 5269 (1d36b36)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 5060 Ti 16GB + RTX 4060 Ti 16GB

Models

Qwen_Qwen3-30B-A3B-Q6_K.gguf by bartowski.
sha256sum: d511d02955714b08ff1b4354d6eae8ea513179a83fa5498466db2731528074dd

Problem description & steps to reproduce

I'm using a grammar to simulate the Qwen no-think prompt format. Sometimes the output is generated correctly; other times the model emits the wrong token, even though the output still satisfies the grammar.

The command I'm using to test:

curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
  "prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
  "grammar": "root ::= \"<think>\\n\\n</think>\\n\\n\" .*",
  "temperature": 0.001,
  "n_predict": 6,
  "seed": 42
}'

Correct output: [151667 198 198 151668 ...] <think>\n\n</think>...
Wrong output: [151667 198 198 27 14 ...] <think>\n\n</...
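To illustrate why both outputs can pass the grammar, here is a toy sketch: the grammar constrains the generated text, not the token IDs, so the single token 151668 and the multi-token spelling that starts with 27, 14 are both valid continuations of the literal. The token pieces for 151667, 198 and 151668 are taken from the outputs above; splitting `[27 14]` into `<` and `/` is an assumption consistent with the `</` shown, and `grammar_allows` is a hypothetical helper, not llama.cpp's grammar engine.

```python
# Token pieces. 151667, 198 and 151668 come from the outputs above;
# 27 -> "<" and 14 -> "/" is an assumption consistent with "[27 14 ...]" rendering as "</".
PIECES = {151667: "<think>", 198: "\n", 151668: "</think>", 27: "<", 14: "/"}

# The literal that the grammar forces before `.*` takes over.
LITERAL = "<think>\n\n</think>\n\n"


def detokenize(tokens):
    """Concatenate the text pieces of a token sequence."""
    return "".join(PIECES[t] for t in tokens)


def grammar_allows(tokens):
    """Hypothetical check: the text must be a prefix of the literal,
    or must already contain the full literal (after which `.*` accepts anything)."""
    text = detokenize(tokens)
    return LITERAL.startswith(text) or text.startswith(LITERAL)


correct_prefix = [151667, 198, 198, 151668]
wrong_prefix = [151667, 198, 198, 27, 14]

print(grammar_allows(correct_prefix), grammar_allows(wrong_prefix))
```

Both prefixes pass this check, so the grammar alone cannot force token 151668; which spelling is chosen depends on the logits at that step.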

Sometimes the model emits the correct sequence; other times it emits the wrong one, and the rest of the generation breaks because the model never sees the </think> token. I'm not restarting llama-server between tests and I'm not changing the seed. I expect the model to always output token 151668.
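A minimal sketch for measuring how often this happens: it sends the same request repeatedly and tallies the two outcomes. It assumes the server accepts `"return_tokens": true` and returns the generated IDs in a `tokens` field; `classify`, `request_tokens` and `run_trials` are hypothetical helper names.

```python
import json
import urllib.request
from collections import Counter

# Expected first four token IDs, taken from the "correct output" above.
EXPECTED_PREFIX = [151667, 198, 198, 151668]

# Same request as the curl command, plus "return_tokens" (an assumption:
# the server is expected to fill the "tokens" field when this is set).
PAYLOAD = {
    "prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
    "grammar": 'root ::= "<think>\\n\\n</think>\\n\\n" .*',
    "temperature": 0.001,
    "n_predict": 6,
    "seed": 42,
    "return_tokens": True,
}


def classify(tokens):
    """Label a run by whether it starts with the expected token IDs."""
    return "correct" if tokens[:len(EXPECTED_PREFIX)] == EXPECTED_PREFIX else "wrong"


def request_tokens(url):
    """POST the payload once and return the generated token IDs."""
    req = urllib.request.Request(
        url,
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("tokens", [])


def run_trials(n, url="http://localhost:8080/completion", fetch=request_tokens):
    """Repeat the identical request n times and count each outcome."""
    return Counter(classify(fetch(url)) for _ in range(n))
```

With the server running, `print(run_trials(20))` would show the split between the two outcomes; a deterministic backend should report only `correct`.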

Command line used to launch llama-server: /llama-server -ngl 175 -t 6 -c 32768 --host 0.0.0.0 -fa -ctk q8_0 -ctv q8_0 --slots -a current --temp 0.6

First Bad Commit

No response

Relevant log output

`{"index":0,"content":"<think>\n\n</think"...`
