Your current environment
The output of python collect_env.py
Your output of `python collect_env.py` here
🐛 Describe the bug
Hi,
vLLM is run from the Docker image.
Version 0.9.0 is much better, but the FlashInfer sampler is still slower than the Python one.
In 0.8.5 the Python sampler was about 2 tokens/sec faster than the FlashInfer sampler; in 0.9.0 the difference is around 0.5 tokens/sec.
Python sampler
```bash
docker run --runtime nvidia --gpus all -d --name vllm-Qwen3-32B-v10 --restart unless-stopped \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_FLASH_ATTN_VERSION=2 \
  -e VLLM_USE_V1=1 \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  -e VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE=1 \
  -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_ENABLE_V1_MULTIPROCESSING=1 \
  -e MAX_JOBS=32 \
  -e VLLM_USE_PRECOMPILED=true \
  -e RAY_ROTATION_MAX_BYTES=0 \
  -e RAY_ROTATION_BACKUP_COUNT=0 \
  -p 8000:8000 vllm/vllm-openai:v0.9.0 \
  --model Qwen/Qwen3-32B-FP8 \
  --served-model-name BSSTelcoChat experimental reasoning llm \
  --max-model-len 26060 --max-seq-len-to-capture 26060 --max-num-batched-tokens 26060 \
  --block-size 32 --gpu-memory-utilization 0.999999 --seed 0 --max-log-len 35 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --tokenizer-pool-size 64 --max-parallel-loading-workers 64 \
  --long-prefill-token-threshold 1024 --max-num-partial-prefills 1 --max-num-seqs 128 \
  --enable-prefix-caching --max-logprobs 0
```
INFO 05-27 23:31:40 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
FlashInfer sampler
```bash
docker run --runtime nvidia --gpus all -d --name vllm-Qwen3-32B-v11 --restart unless-stopped \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_FLASH_ATTN_VERSION=2 \
  -e VLLM_USE_V1=1 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -e VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE=1 \
  -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_ENABLE_V1_MULTIPROCESSING=1 \
  -e MAX_JOBS=32 \
  -e VLLM_USE_PRECOMPILED=true \
  -e RAY_ROTATION_MAX_BYTES=0 \
  -e RAY_ROTATION_BACKUP_COUNT=0 \
  -p 8000:8000 vllm/vllm-openai:v0.9.0 \
  --model Qwen/Qwen3-32B-FP8 \
  --served-model-name BSSTelcoChat experimental reasoning llm \
  --max-model-len 26060 --max-seq-len-to-capture 26060 --max-num-batched-tokens 26060 \
  --block-size 32 --gpu-memory-utilization 0.999999 --seed 0 --max-log-len 35 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --tokenizer-pool-size 64 --max-parallel-loading-workers 64 \
  --long-prefill-token-threshold 1024 --max-num-partial-prefills 1 --max-num-seqs 128 \
  --enable-prefix-caching --max-logprobs 0
```
INFO 05-27 23:27:09 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
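For reference, the same comparison can be reproduced without Docker by flipping `VLLM_USE_FLASHINFER_SAMPLER` around an offline run. The sketch below is only a rough harness, not the serving workload above: the prompt, sampling parameters, and token budget are placeholders, and the wall-clock time includes prefill and warm-up, so only relative numbers between the two runs are meaningful.

```python
# Minimal sketch: run the same generation twice, once per sampler backend,
# by changing VLLM_USE_FLASHINFER_SAMPLER ("0" = Python/torch sampler,
# "1" = FlashInfer sampler). Set it before vLLM builds the engine.
import os
import time

os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "1"  # change to "0" for the second run

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-32B-FP8", max_model_len=26060, seed=0)
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)  # placeholder values

start = time.perf_counter()
outputs = llm.generate(["Explain KV-cache paging in one paragraph."], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```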
As I understand it, choosing the next token in Python code should be slower than doing it on the GPU, so something must be wrong with the FlashInfer sampler.
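To separate the sampling op itself from the rest of the engine, something like the torch-only micro-benchmark below could serve as a baseline for one decode step. It is a generic reference implementation of top-p sampling, not vLLM's actual sampler code, and the batch size, vocabulary size (roughly Qwen3-sized), and top-p value are assumptions; timing FlashInfer's fused sampling kernel with the same CUDA-event harness would show whether the gap comes from the sampling op or from elsewhere in the v1 sampler path.

```python
# Rough micro-benchmark of a torch-based top-p sampling step, as a baseline
# to compare against a fused GPU sampling kernel. Generic reference code only.
import torch

BATCH, VOCAB, TOP_P, ITERS = 1, 151_936, 0.95, 100  # assumed Qwen3-ish vocab size

def top_p_sample(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumsum = sorted_probs.cumsum(dim=-1)
    # Drop tokens once the cumulative probability exceeds top_p,
    # always keeping at least the most likely token.
    mask = cumsum - sorted_probs > top_p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)

logits = torch.randn(BATCH, VOCAB, device="cuda", dtype=torch.float32)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

top_p_sample(logits, TOP_P)  # warm-up
torch.cuda.synchronize()
start.record()
for _ in range(ITERS):
    top_p_sample(logits, TOP_P)
end.record()
torch.cuda.synchronize()
print(f"torch top-p sampling: {start.elapsed_time(end) / ITERS:.3f} ms per step")
```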
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.