
[Performance]: 0.8.1 vs 0.7.4dev122 DeepSeek-R1 H20 benchmark: what explains 0.8.1's 14% throughput (tokens/s) improvement? #15881

@chuanyi-zjc

Description


Proposal to improve performance

No response

Report of performance regression

Perf test with the DeepSeek-R1 model (input/output = 3500/1500) on the same host: vLLM 0.8.1 improves total throughput (tokens/s) by about 14% over 0.7.4dev122. What technical optimizations in 0.8.1 account for this?

python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /data00/models/DeepSeek-R1 \
    --base-url http://127.0.0.1:8000 \
    --endpoint /v1/completions \
    --num-prompts 4 \
    --request-rate 1 \
    --metric_percentiles '50,90,95,99' \
    --goodput ttft:5000 tpot:250 \
    --max-concurrency 4 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --dataset-name random \
    --ignore-eos --trust-remote-code \
    --save-result
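The `--goodput ttft:5000 tpot:250` flag makes the benchmark count a request toward goodput only if its TTFT and TPOT stay within those millisecond budgets. A minimal sketch of that per-request check (the function name `meets_slo` is mine for illustration, not a vLLM API):

```python
def meets_slo(ttft_ms: float, tpot_ms: float,
              ttft_slo_ms: float = 5000.0, tpot_slo_ms: float = 250.0) -> bool:
    """A request counts toward goodput only if all SLO attainments hold."""
    return ttft_ms <= ttft_slo_ms and tpot_ms <= tpot_slo_ms

# Against the measured percentiles below: a median request (TTFT ~1.8 s,
# TPOT ~57 ms) passes, while a P99 TTFT of ~7.1 s fails the 5 s budget.
print(meets_slo(1802.51, 56.94))  # True
print(meets_slo(7141.43, 63.01))  # False
```

This is why goodput (0.22 req/s on 0.7.4dev122) trails raw request throughput (0.26 req/s): the tail of slow-TTFT requests is excluded.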

0.7.4dev122 perf result:

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 23.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 23
100%|██████████| 92/92 [05:55<00:00, 3.86s/it]
============ Serving Benchmark Result ============
Successful requests: 92
Benchmark duration (s): 355.48
Total input tokens: 322000
Total generated tokens: 138000
Request throughput (req/s): 0.26
Request goodput (req/s): 0.22
Output token throughput (tok/s): 388.21
Total Token throughput (tok/s): 1294.03
---------------Time to First Token----------------
Mean TTFT (ms): 2571.20
Median TTFT (ms): 1802.51
P50 TTFT (ms): 1802.51
P90 TTFT (ms): 6756.42
P95 TTFT (ms): 7032.22
P99 TTFT (ms): 7141.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 56.90
Median TPOT (ms): 56.94
P50 TPOT (ms): 56.94
P90 TPOT (ms): 59.67
P95 TPOT (ms): 60.27
P99 TPOT (ms): 63.01
---------------Inter-token Latency----------------
Mean ITL (ms): 112.90
Median ITL (ms): 104.94
P50 ITL (ms): 104.94
P90 ITL (ms): 110.60
P95 ITL (ms): 112.36
P99 ITL (ms): 121.95

0.8.1 perf result:

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 23.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 23
100%|██████████| 92/92 [05:11<00:00, 3.38s/it]
============ Serving Benchmark Result ============
Successful requests: 92
Benchmark duration (s): 311.27
Total input tokens: 322000
Total generated tokens: 138000
Request throughput (req/s): 0.30
Request goodput (req/s): 0.25
Output token throughput (tok/s): 443.35
Total Token throughput (tok/s): 1477.82
---------------Time to First Token----------------
Mean TTFT (ms): 2186.95
Median TTFT (ms): 1826.80
P50 TTFT (ms): 1826.80
P90 TTFT (ms): 5747.68
P95 TTFT (ms): 5931.91
P99 TTFT (ms): 6068.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 49.50
Median TPOT (ms): 49.74
P50 TPOT (ms): 49.74
P90 TPOT (ms): 51.92
P95 TPOT (ms): 52.42
P99 TPOT (ms): 55.33
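As a sanity check on the headline 14%, the ratio falls straight out of the totals printed in the two result blocks (both runs processed the same 460,000 tokens, so the improvement is just the duration ratio):

```python
# Both runs: 322,000 input + 138,000 generated tokens.
total_tokens = 322_000 + 138_000

old_tput = total_tokens / 355.48   # 0.7.4dev122 duration (s) -> ~1294 tok/s
new_tput = total_tokens / 311.27   # 0.8.1 duration (s)       -> ~1478 tok/s

improvement = new_tput / old_tput - 1
print(f"{improvement:.1%}")  # ~14.2%
```

The mean TPOT drop (56.90 ms to 49.50 ms) accounts for most of this, consistent with decode-phase optimizations rather than a TTFT/prefill change (median TTFT is essentially unchanged).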

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels: performance (Performance-related issues)