Name and Version
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
version: 5269 (1d36b36)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 5060 Ti 16GB + RTX 4060 Ti 16GB
Models
Qwen_Qwen3-30B-A3B-Q6_K.gguf by bartowski.
sha256sum: d511d02955714b08ff1b4354d6eae8ea513179a83fa5498466db2731528074dd
Problem description & steps to reproduce
I'm using a grammar to simulate Qwen's no-think prompt format. Sometimes the output is generated correctly; sometimes the model outputs the wrong token while still conforming to the grammar.
The command I'm using to test:
curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
"prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
"grammar": "root ::= \"<think>\\n\\n</think>\\n\\n\" .*",
"temperature": 0.001,
"n_predict": 6,
"seed": 42
}'
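For convenience, here is a minimal repro loop (a sketch, not part of the original report): it assumes llama-server is listening on localhost:8080 and that jq is installed, and simply repeats the request above while counting how often the full <think>\n\n</think> prefix shows up. Since n_predict is 6, the wrong path gets cut off before the closing tag, so a plain string check distinguishes the two cases.

#!/usr/bin/env bash
# Repro loop sketch: repeat the request above and count how often the closing
# </think> tag is generated. Assumes llama-server on localhost:8080 and jq.
ok=0; bad=0
for i in $(seq 1 20); do
  content=$(curl -s http://localhost:8080/completion -H "Content-Type: application/json" -d '{
    "prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
    "grammar": "root ::= \"<think>\\n\\n</think>\\n\\n\" .*",
    "temperature": 0.001,
    "n_predict": 6,
    "seed": 42
  }' | jq -r '.content')
  case "$content" in
    "<think>"$'\n\n'"</think>"*) ok=$((ok+1)) ;;
    *) bad=$((bad+1)); printf 'run %d: truncated tag: %q\n' "$i" "$content" ;;
  esac
done
echo "correct: $ok  wrong: $bad"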
Correct output: [151667 198 198 151668 ...] <think>\n\n</think>...
Wrong output: [151667 198 198 27 14 ...] <think>\n\n</...
Sometimes the model produces the correct output; other times it produces the wrong output, and the subsequent generation breaks because the model never sees the </think> token. I'm not restarting llama-server between tests and I'm not changing the seed. I expect the model to always output token 151668.
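As a sanity check (a sketch, assuming this build exposes the server's /detokenize endpoint), the token ids from the two runs can be mapped back to text to confirm that 151668 is the special </think> token, while 27 and 14 are presumably the plain-text "<" and "/" pieces:

curl http://localhost:8080/detokenize -H "Content-Type: application/json" -d '{"tokens": [151667, 198, 198, 151668]}'
curl http://localhost:8080/detokenize -H "Content-Type: application/json" -d '{"tokens": [151667, 198, 198, 27, 14]}'

The first request should come back as the full <think>\n\n</think> prefix; the second should show the same tag being spelled out character by character with ordinary tokens instead of the special </think> id.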
Command line used to launch llama-server: /llama-server -ngl 175 -t 6 -c 32768 --host 0.0.0.0 -fa -ctk q8_0 -ctv q8_0 --slots -a current --temp 0.6
First Bad Commit
No response
Relevant log output
`{"index":0,"content":"<think>\n\n</think"...`