
[Performance]: [V1] duplicated prefill tokens for n>1 #14686

@hewr2010

Proposal to improve performance

The following statement comes from #10980

The vLLM v1 engine can exploit APC when a prompt repeats within a batch, even if that prompt was not seen in a previous batch. Therefore, no warmup request is required.

Could you please point me to the PR for this feature? I've tested on v0.7.3, and it seems a warmup request is still required for n>1 cases.
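For context, the quoted claim is about the same prompt appearing more than once within a single batch. A minimal offline-API sketch of that scenario (the prompt text and the explicit `enable_prefix_caching` flag are my own choices for illustration, not from the report):

```python
import os
os.environ["VLLM_USE_V1"] = "1"  # select the V1 engine, as in the command below

from vllm import LLM, SamplingParams

# Prefix caching enabled explicitly here; the exact default may vary by version.
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_prefix_caching=True)
params = SamplingParams(max_tokens=1)

# The same prompt twice in one batch: per the quoted statement, APC should let
# the second copy reuse the first copy's prefill without any warmup request.
prompt = "The quick brown fox jumps over the lazy dog"
outputs = llm.generate([prompt, prompt], params)
```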

Here is a simple command to reproduce this problem.

VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B -tp 1 --input-len 10 --n 2 --output-len 1 --batch-size 1 --trust-remote-code --num-iters 1 --num-iters-warmup 0 --load-format dummy

You can see that the input_ids passed to LlamaModel.forward repeat twice, which wastes computation on prefill tokens.
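For comparison, here is a rough offline-API equivalent of the n=2 case (a sketch, not the benchmark script itself; the prompt is an arbitrary stand-in for --input-len 10):

```python
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# load_format="dummy" mirrors the benchmark flag and skips loading real weights.
llm = LLM(model="meta-llama/Llama-3.1-8B",
          tensor_parallel_size=1,
          load_format="dummy")

# One prompt, n=2, one output token: ideally the prompt is prefilled once and only
# the sampling fans out, but the observation above is that its tokens reach
# LlamaModel.forward twice.
params = SamplingParams(n=2, max_tokens=1)
outputs = llm.generate(["The quick brown fox jumps over the lazy dog"], params)
```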

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Labels

performance (Performance-related issues)