Report of performance regression
Perf test of the DeepSeek-R1 model (input/output = 3500/1500 tokens), both runs on the same host: vLLM v0.8.1 improves total token throughput by about 14% over v0.7.4.dev122. Why? What technical optimizations in v0.8.1 account for this?
python3 /root/vllm/benchmarks/benchmark_serving.py --backend vllm \
    --model /data00/models/DeepSeek-R1 \
    --base-url http://127.0.0.1:8000 \
    --endpoint /v1/completions \
    --num-prompts 4 \
    --request-rate 1 \
    --metric_percentiles '50,90,95,99' \
    --goodput ttft:5000 tpot:250 \
    --max-concurrency 4 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --dataset-name random \
    --ignore-eos --trust-remote-code \
    --save-result
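For readers unfamiliar with the `--goodput ttft:5000 tpot:250` flag: it counts a request toward goodput only if that request met both latency SLOs. The sketch below is a simplified re-implementation of that filter for illustration, not vLLM's actual code; the function and field names are my own.

```python
# Minimal sketch of the goodput filter implied by `--goodput ttft:5000 tpot:250`:
# a request is "good" only if its TTFT <= 5000 ms AND its time-per-output-token
# (TPOT) <= 250 ms; goodput is good requests divided by benchmark duration.
# (Illustrative only -- not vLLM's implementation.)

def goodput(requests, duration_s, ttft_slo_ms=5000.0, tpot_slo_ms=250.0):
    """requests: list of dicts with per-request 'ttft_ms' and 'tpot_ms'."""
    good = sum(
        1
        for r in requests
        if r["ttft_ms"] <= ttft_slo_ms and r["tpot_ms"] <= tpot_slo_ms
    )
    return good / duration_s

# Example: 3 of 4 requests meet both SLOs over a 10 s window.
reqs = [
    {"ttft_ms": 1800, "tpot_ms": 50},   # good
    {"ttft_ms": 7000, "tpot_ms": 50},   # TTFT SLO violated
    {"ttft_ms": 1900, "tpot_ms": 57},   # good
    {"ttft_ms": 2000, "tpot_ms": 60},   # good
]
print(goodput(reqs, duration_s=10.0))  # 0.3 req/s
```

This is why goodput (0.22 and 0.25 req/s below) is lower than raw request throughput (0.26 and 0.30 req/s): requests in the P90+ TTFT tail exceed the 5000 ms SLO and are excluded.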
v0.7.4.dev122 perf result:
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 23.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 23
100%|██████████| 92/92 [05:55<00:00, 3.86s/it]
============ Serving Benchmark Result ============
Successful requests: 92
Benchmark duration (s): 355.48
Total input tokens: 322000
Total generated tokens: 138000
Request throughput (req/s): 0.26
Request goodput (req/s): 0.22
Output token throughput (tok/s): 388.21
Total Token throughput (tok/s): 1294.03
---------------Time to First Token----------------
Mean TTFT (ms): 2571.20
Median TTFT (ms): 1802.51
P50 TTFT (ms): 1802.51
P90 TTFT (ms): 6756.42
P95 TTFT (ms): 7032.22
P99 TTFT (ms): 7141.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 56.90
Median TPOT (ms): 56.94
P50 TPOT (ms): 56.94
P90 TPOT (ms): 59.67
P95 TPOT (ms): 60.27
P99 TPOT (ms): 63.01
---------------Inter-token Latency----------------
Mean ITL (ms): 112.90
Median ITL (ms): 104.94
P50 ITL (ms): 104.94
P90 ITL (ms): 110.60
P95 ITL (ms): 112.36
P99 ITL (ms): 121.95
v0.8.1 perf result:
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 23.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 23
100%|██████████| 92/92 [05:11<00:00, 3.38s/it]
============ Serving Benchmark Result ============
Successful requests: 92
Benchmark duration (s): 311.27
Total input tokens: 322000
Total generated tokens: 138000
Request throughput (req/s): 0.30
Request goodput (req/s): 0.25
Output token throughput (tok/s): 443.35
Total Token throughput (tok/s): 1477.82
---------------Time to First Token----------------
Mean TTFT (ms): 2186.95
Median TTFT (ms): 1826.80
P50 TTFT (ms): 1826.80
P90 TTFT (ms): 5747.68
P95 TTFT (ms): 5931.91
P99 TTFT (ms): 6068.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 49.50
Median TPOT (ms): 49.74
P50 TPOT (ms): 49.74
P90 TPOT (ms): 51.92
P95 TPOT (ms): 52.42
P99 TPOT (ms): 55.33
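For reference, the ~14% figure quoted above can be reproduced directly from the two printouts (token counts and durations are identical to those reported):

```python
# Recomputing the headline throughput numbers from the two benchmark runs
# to show where the ~14% improvement comes from.
total_tokens = 322_000 + 138_000         # input + generated, same workload in both runs

dur_v074 = 355.48                        # benchmark duration (s), v0.7.4.dev122
dur_v081 = 311.27                        # benchmark duration (s), v0.8.1

tput_v074 = total_tokens / dur_v074      # ~1294 tok/s
tput_v081 = total_tokens / dur_v081      # ~1478 tok/s
improvement = tput_v081 / tput_v074 - 1  # ~0.142, i.e. ~14% faster
print(f"{tput_v074:.2f} -> {tput_v081:.2f} tok/s ({improvement:.1%})")
```

Equivalently, the speedup is just the ratio of durations (355.48 / 311.27), since the workload is fixed; the per-token gain shows up mainly in TPOT (56.90 ms vs 49.50 ms mean).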