Your current environment
The output of python collect_env.py
Your output of `python collect_env.py` here
🐛 Describe the bug
Hi,
vLLM is run from the Docker image.
Version 0.9.0 is much better, but the FlashInfer sampler is still slower than the Python one.
In 0.8.5 the Python sampler was about 2 tokens/sec faster than the FlashInfer sampler; in 0.9.0 the difference is around 0.5 tokens/sec.
Python sampler
```bash
docker run --runtime nvidia --gpus all -d --name vllm-Qwen3-32B-v10 --restart unless-stopped \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_FLASH_ATTN_VERSION=2 \
  -e VLLM_USE_V1=1 \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  -e VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE=1 \
  -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_ENABLE_V1_MULTIPROCESSING=1 \
  -e MAX_JOBS=32 \
  -e VLLM_USE_PRECOMPILED=true \
  -e RAY_ROTATION_MAX_BYTES=0 \
  -e RAY_ROTATION_BACKUP_COUNT=0 \
  -p 8000:8000 vllm/vllm-openai:v0.9.0 \
  --model Qwen/Qwen3-32B-FP8 \
  --served-model-name BSSTelcoChat experimental reasoning llm \
  --max-model-len 26060 --max-seq-len-to-capture 26060 --max-num-batched-tokens 26060 \
  --block-size 32 --gpu-memory-utilization 0.999999 --seed 0 --max-log-len 35 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --tokenizer-pool-size 64 --max-parallel-loading-workers 64 \
  --long-prefill-token-threshold 1024 --max-num-partial-prefills 1 --max-num-seqs 128 \
  --enable-prefix-caching --max-logprobs 0
```
INFO 05-27 23:31:40 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
FlashInfer sampler
```bash
docker run --runtime nvidia --gpus all -d --name vllm-Qwen3-32B-v11 --restart unless-stopped \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_FLASH_ATTN_VERSION=2 \
  -e VLLM_USE_V1=1 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -e VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE=1 \
  -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_ENABLE_V1_MULTIPROCESSING=1 \
  -e MAX_JOBS=32 \
  -e VLLM_USE_PRECOMPILED=true \
  -e RAY_ROTATION_MAX_BYTES=0 \
  -e RAY_ROTATION_BACKUP_COUNT=0 \
  -p 8000:8000 vllm/vllm-openai:v0.9.0 \
  --model Qwen/Qwen3-32B-FP8 \
  --served-model-name BSSTelcoChat experimental reasoning llm \
  --max-model-len 26060 --max-seq-len-to-capture 26060 --max-num-batched-tokens 26060 \
  --block-size 32 --gpu-memory-utilization 0.999999 --seed 0 --max-log-len 35 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --tokenizer-pool-size 64 --max-parallel-loading-workers 64 \
  --long-prefill-token-threshold 1024 --max-num-partial-prefills 1 --max-num-seqs 128 \
  --enable-prefix-caching --max-logprobs 0
```
INFO 05-27 23:27:09 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
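For reference, the same comparison can be reproduced without Docker by flipping `VLLM_USE_FLASHINFER_SAMPLER` around an offline run. The sketch below is only a rough harness, not the serving workload above: the prompt, sampling parameters, and token budget are placeholders, and the wall-clock time includes prefill and warm-up, so only relative numbers between the two runs are meaningful.

```python
# Minimal sketch: run the same generation twice, once per sampler backend,
# by changing VLLM_USE_FLASHINFER_SAMPLER ("0" = Python/torch sampler,
# "1" = FlashInfer sampler). Set it before vLLM builds the engine.
import os
import time

os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "1"  # change to "0" for the second run

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-32B-FP8", max_model_len=26060, seed=0)
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)  # placeholder values

start = time.perf_counter()
outputs = llm.generate(["Explain KV-cache paging in one paragraph."], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```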
As I understand it, choosing the next token in Python code should be slower than doing it on the GPU, so something must be wrong with the FlashInfer sampler.
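To separate the sampling op itself from the rest of the engine, something like the torch-only micro-benchmark below could serve as a baseline for one decode step. It is a generic reference implementation of top-p sampling, not vLLM's actual sampler code, and the batch size, vocabulary size (roughly Qwen3-sized), and top-p value are assumptions; timing FlashInfer's fused sampling kernel with the same CUDA-event harness would show whether the gap comes from the sampling op or from elsewhere in the v1 sampler path.

```python
# Rough micro-benchmark of a torch-based top-p sampling step, as a baseline
# to compare against a fused GPU sampling kernel. Generic reference code only.
import torch

BATCH, VOCAB, TOP_P, ITERS = 1, 151_936, 0.95, 100  # assumed Qwen3-ish vocab size

def top_p_sample(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumsum = sorted_probs.cumsum(dim=-1)
    # Drop tokens once the cumulative probability exceeds top_p,
    # always keeping at least the most likely token.
    mask = cumsum - sorted_probs > top_p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)

logits = torch.randn(BATCH, VOCAB, device="cuda", dtype=torch.float32)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

top_p_sample(logits, TOP_P)  # warm-up
torch.cuda.synchronize()
start.record()
for _ in range(ITERS):
    top_p_sample(logits, TOP_P)
end.record()
torch.cuda.synchronize()
print(f"torch top-p sampling: {start.elapsed_time(end) / ITERS:.3f} ms per step")
```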
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.