Proposal to improve performance
The following statement comes from #10980:
The vLLM v1 engine can exploit APC when a prompt repeats within a batch, even if that prompt was not seen in a previous batch. Therefore, no warmup request is required.
Could you please point me to the PR that implements this feature? I've tested on v0.7.3, and it seems a warmup request is still required in the n>1 case.
Here is a simple command to reproduce the problem:

```bash
VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B -tp 1 --input-len 10 --n 2 --output-len 1 --batch-size 1 --trust-remote-code --num-iters 1 --num-iters-warmup 0 --load-format dummy
```
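For reference, here is a minimal offline-inference sketch of the same scenario (the prompt text and the explicit warmup call are illustrative, not taken from the benchmark script). It issues the prompt once as a warmup and then again with n=2, which is the workaround that should be unnecessary if APC reused the prompt within a single batch:

```python
# Hypothetical repro sketch using vLLM's offline API (not the benchmark script).
# Assumes VLLM_USE_V1=1 is set in the environment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    tensor_parallel_size=1,
    enable_prefix_caching=True,  # APC; enabled by default in the v1 engine
)

prompt = "The quick brown fox jumps over the lazy dog"

# Warmup request: populates the prefix cache for this prompt.
# Per the statement quoted above, this step should not be needed when the same
# prompt repeats within one batch, but on v0.7.3 the n=2 request below still
# recomputes the prefill twice without it.
llm.generate([prompt], SamplingParams(max_tokens=1))

# n=2 request: both sequences share the same prompt, so ideally the prefill
# would be computed once and reused for the second sequence.
outputs = llm.generate([prompt], SamplingParams(n=2, max_tokens=1))
for out in outputs:
    print(len(out.outputs), "completions for one prompt")
```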
You can see that the `input_ids` passed to `LlamaModel.forward` repeat twice, which wastes computation on the prefill tokens.
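One way to observe the duplicated prefill is to wrap the model's forward and log the shape of `input_ids`. A rough sketch is below; the module path, the forward signature, and the monkey-patching approach are assumptions and may need adjusting for the installed vLLM version and its worker/process model:

```python
# Rough instrumentation sketch: log the input_ids shapes seen by the model forward.
# The forward signature varies across vLLM versions, so the wrapper takes
# *args/**kwargs and only inspects what it can find.
import torch
from vllm.model_executor.models.llama import LlamaModel

_orig_forward = LlamaModel.forward

def logged_forward(self, *args, **kwargs):
    input_ids = kwargs.get("input_ids", args[0] if args else None)
    if isinstance(input_ids, torch.Tensor):
        # With n=2 and a 10-token prompt, a duplicated prefill shows up here as
        # roughly 20 prompt tokens instead of 10.
        print("LlamaModel.forward input_ids shape:", tuple(input_ids.shape))
    return _orig_forward(self, *args, **kwargs)

# Apply the patch before the model is loaded so the wrapper is in place.
LlamaModel.forward = logged_forward
```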
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.