
Commit f783320

jiangpeng36 and Ronald1995 authored and committed
[Perf][V1] Fully overlap model execution (vllm-project#2783)
This PR is based on top of [#23569](vllm-project/vllm#23569) and [#24219](vllm-project/vllm#24219).

### What this PR does / why we need it?

This PR allows the model runner to run asynchronously when async scheduling is used, fully overlapping the CPU operations (including prepare_inputs) with the model forward pass. The change is functional, but it does not yet support speculative decoding, PP, or guided decoding. The expected speedup is 5-10% over the current async scheduling.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

server

```
python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B \
    --trust-remote-code --enforce-eager \
    --distributed-executor-backend=mp \
    -tp=4 \
    --port 8006 \
    --max-model-len 32000 \
    --block-size 128 \
    --gpu-memory-utilization 0.99
```

client

```
python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \
    --dataset-name random --random-input-len 2048 --random-output-len 2048 \
    --ignore-eos \
    --num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \
    --metric-percentiles 90 --base-url http://localhost:8006 --save-result \
    --result-dir $PROFILER_DIR
```

TPOT benchmark results for Qwen3-32B:

| | forward async | scheduler async | sync |
|-|-|-|-|
| avg | 41.73 | 41.86 | 44.20 |
| improve0 | 0.3% | 0 | 0 |
| improve1 | 5.58% | 0 | 0 |

TPOT benchmark results for Qwen2___5-VL-7B-Instruct:

| | forward async | sync |
|-|-|-|
| avg | 23.22 | 29.16 |
| improve | 20.3% | 0 |

- vLLM version: main
- vLLM main: vllm-project/vllm@e93f4cc

Signed-off-by: jiangpeng36 <[email protected]>
Signed-off-by: Ronald1995 <[email protected]>
Co-authored-by: jiangpeng36 <[email protected]>
Co-authored-by: Ronald1995 <[email protected]>
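The improvement rows in the tables above are consistent with a relative TPOT reduction, i.e. (baseline - forward async) / baseline, with `improve0` measured against the scheduler-async column and `improve1` against the sync column. The short Python sketch below reproduces the reported percentages under that assumption (the meaning of the row labels is inferred, not stated in the commit message):

```python
# Reproduce the reported TPOT improvement percentages.
# Assumption: improvement = (baseline_tpot - async_tpot) / baseline_tpot.

def tpot_improvement(async_tpot: float, baseline_tpot: float) -> float:
    """Relative TPOT reduction of the fully async run versus a baseline run."""
    return (baseline_tpot - async_tpot) / baseline_tpot

# Qwen3-32B: forward async (41.73) vs. scheduler async (41.86) and sync (44.20).
print(f"improve0: {tpot_improvement(41.73, 41.86):.2%}")  # ~0.31%, reported as 0.3%
print(f"improve1: {tpot_improvement(41.73, 44.20):.2%}")  # ~5.59%, reported as 5.58%

# Qwen2___5-VL-7B-Instruct: forward async (23.22) vs. sync (29.16).
print(f"improve:  {tpot_improvement(23.22, 29.16):.2%}")  # ~20.37%, reported as 20.3%
```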
1 parent 2900600 commit f783320

File tree

4 files changed: +226 / -75 lines


tests/e2e/singlecard/test_ascend_scheduler.py

Lines changed: 23 additions & 0 deletions
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 import pytest
+from vllm import SamplingParams
 
 from tests.e2e.conftest import VllmRunner
 from tests.e2e.model_utils import check_outputs_equal
@@ -86,3 +87,25 @@ def test_chunked_prefill_with_ascend_scheduler(
         name_0="vllm_output",
         name_1="chunked_prefill_output",
     )
+
+
+def test_async_scheduling() -> None:
+    prompts = [
+        "Hello, my name is",
+        "The president of the United States is",
+        "The capital of France is",
+        "The future of AI is",
+    ] * 10
+    sampling_params = SamplingParams(temperature=0.2,
+                                     max_tokens=10,
+                                     stop_token_ids=None)
+
+    with VllmRunner(
+            "Qwen/Qwen2.5-0.5B-Instruct",
+            max_model_len=4096,
+            max_num_seqs=50,
+            dtype="bfloat16",
+            gpu_memory_utilization=0.9,
+            async_scheduling=True,
+    ) as vllm_model:
+        vllm_model.generate(prompts, sampling_params=sampling_params)
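The new test only exercises the async path end to end; it does not compare outputs against the default synchronous scheduling. A hypothetical follow-up check, not part of this commit and sketched below assuming `VllmRunner.generate_greedy` and `check_outputs_equal` behave as in the existing chunked-prefill test in this file, would assert that greedy outputs match with async scheduling on and off:

```python
# Hypothetical extension, not included in this commit: greedy outputs with
# async scheduling enabled should match the default synchronous run.
from tests.e2e.conftest import VllmRunner
from tests.e2e.model_utils import check_outputs_equal


def test_async_scheduling_matches_sync() -> None:
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ] * 5

    # Run with fully overlapped (async) scheduling.
    with VllmRunner("Qwen/Qwen2.5-0.5B-Instruct",
                    max_model_len=4096,
                    dtype="bfloat16",
                    async_scheduling=True) as async_model:
        async_outputs = async_model.generate_greedy(prompts, max_tokens=10)

    # Run with the default synchronous scheduling as the reference.
    with VllmRunner("Qwen/Qwen2.5-0.5B-Instruct",
                    max_model_len=4096,
                    dtype="bfloat16") as sync_model:
        sync_outputs = sync_model.generate_greedy(prompts, max_tokens=10)

    check_outputs_equal(
        outputs_0_lst=sync_outputs,
        outputs_1_lst=async_outputs,
        name_0="sync_output",
        name_1="async_output",
    )
```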
