
[Performance]: 0.8.1 vs 0.7.4dev122 DeepSeek-R1 H20 benchmark: what explains 0.8.1's 14% throughput (tokens/s) improvement? #15881

@chuanyi-zjc

Description


Proposal to improve performance

No response

Report of performance regression

Perf test with the DeepSeek-R1 model (input/output = 3500/1500) on the same host: vLLM 0.8.1 improves total throughput (tokens/s) by about 14% over 0.7.4dev122. What technical optimizations in 0.8.1 account for this?

python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /data00/models/DeepSeek-R1 \
    --base-url http://127.0.0.1:8000 \
    --endpoint /v1/completions \
    --num-prompts 4 \
    --request-rate 1 \
    --metric_percentiles '50,90,95,99' \
    --goodput ttft:5000 tpot:250 \
    --max-concurrency 4 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --dataset-name random \
    --ignore-eos --trust-remote-code \
    --save-result
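The `--goodput ttft:5000 tpot:250` flag makes the benchmark count a request toward goodput only if its TTFT and TPOT stay within those millisecond budgets. A minimal sketch of that per-request check (the function name `meets_slo` is mine for illustration, not a vLLM API):

```python
def meets_slo(ttft_ms: float, tpot_ms: float,
              ttft_slo_ms: float = 5000.0, tpot_slo_ms: float = 250.0) -> bool:
    """A request counts toward goodput only if all SLO attainments hold."""
    return ttft_ms <= ttft_slo_ms and tpot_ms <= tpot_slo_ms

# Against the measured percentiles below: a median request (TTFT ~1.8 s,
# TPOT ~57 ms) passes, while a P99 TTFT of ~7.1 s fails the 5 s budget.
print(meets_slo(1802.51, 56.94))  # True
print(meets_slo(7141.43, 63.01))  # False
```

This is why goodput (0.22 req/s on 0.7.4dev122) trails raw request throughput (0.26 req/s): the tail of slow-TTFT requests is excluded.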

0.7.4dev122 perf result:

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 23.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 23
100%|██████████| 92/92 [05:55<00:00, 3.86s/it]
============ Serving Benchmark Result ============
Successful requests: 92
Benchmark duration (s): 355.48
Total input tokens: 322000
Total generated tokens: 138000
Request throughput (req/s): 0.26
Request goodput (req/s): 0.22
Output token throughput (tok/s): 388.21
Total Token throughput (tok/s): 1294.03
---------------Time to First Token----------------
Mean TTFT (ms): 2571.20
Median TTFT (ms): 1802.51
P50 TTFT (ms): 1802.51
P90 TTFT (ms): 6756.42
P95 TTFT (ms): 7032.22
P99 TTFT (ms): 7141.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 56.90
Median TPOT (ms): 56.94
P50 TPOT (ms): 56.94
P90 TPOT (ms): 59.67
P95 TPOT (ms): 60.27
P99 TPOT (ms): 63.01
---------------Inter-token Latency----------------
Mean ITL (ms): 112.90
Median ITL (ms): 104.94
P50 ITL (ms): 104.94
P90 ITL (ms): 110.60
P95 ITL (ms): 112.36
P99 ITL (ms): 121.95

0.8.1 perf result:

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 23.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 23
100%|██████████| 92/92 [05:11<00:00, 3.38s/it]
============ Serving Benchmark Result ============
Successful requests: 92
Benchmark duration (s): 311.27
Total input tokens: 322000
Total generated tokens: 138000
Request throughput (req/s): 0.30
Request goodput (req/s): 0.25
Output token throughput (tok/s): 443.35
Total Token throughput (tok/s): 1477.82
---------------Time to First Token----------------
Mean TTFT (ms): 2186.95
Median TTFT (ms): 1826.80
P50 TTFT (ms): 1826.80
P90 TTFT (ms): 5747.68
P95 TTFT (ms): 5931.91
P99 TTFT (ms): 6068.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 49.50
Median TPOT (ms): 49.74
P50 TPOT (ms): 49.74
P90 TPOT (ms): 51.92
P95 TPOT (ms): 52.42
P99 TPOT (ms): 55.33
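As a sanity check on the headline 14%, the ratio falls straight out of the totals printed in the two result blocks (both runs processed the same 460,000 tokens, so the improvement is just the duration ratio):

```python
# Both runs: 322,000 input + 138,000 generated tokens.
total_tokens = 322_000 + 138_000

old_tput = total_tokens / 355.48   # 0.7.4dev122 duration (s) -> ~1294 tok/s
new_tput = total_tokens / 311.27   # 0.8.1 duration (s)       -> ~1478 tok/s

improvement = new_tput / old_tput - 1
print(f"{improvement:.1%}")  # ~14.2%
```

The mean TPOT drop (56.90 ms to 49.50 ms) accounts for most of this, consistent with decode-phase optimizations rather than a TTFT/prefill change (median TTFT is essentially unchanged).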

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels: performance (Performance-related issues)