[benchmark] add peak throughput metrics and plot #23867
Conversation
"on the benchmark arguments.", | ||
stacklevel=2) | ||
|
||
# Calculate max output tokens per second metric |
This is the main change. The formatting in this file is pretty sad.
Code Review
This pull request introduces valuable new metrics for peak throughput and concurrency to the serving benchmark, along with terminal plots for visualization. The implementation is mostly solid, but I've identified a couple of issues related to the correctness of the peak concurrency calculation and its presentation in the output. Specifically, the peak concurrency is calculated independently of the peak throughput, which seems to deviate from the intended metric of 'concurrency at peak throughput'. Additionally, the unit for concurrent requests in the summary table is misleading. Addressing these points will improve the accuracy and clarity of the new benchmark metrics.
```python
max_output_tokens_per_s = float(np.max(tokens_per_second))
max_concurrent_requests = int(
    np.max(concurrent_requests_per_second))
```
The calculation of `max_concurrent_requests` is independent of `max_output_tokens_per_s`. It finds the maximum concurrent requests over the entire benchmark, not the concurrency at the time of peak token throughput. To align with the goal of finding "concurrent requests at that peak", you should find the index of the maximum token throughput and use that to get the corresponding concurrent requests.
```diff
-max_output_tokens_per_s = float(np.max(tokens_per_second))
-max_concurrent_requests = int(
-    np.max(concurrent_requests_per_second))
+peak_idx = np.argmax(tokens_per_second)
+max_output_tokens_per_s = float(tokens_per_second[peak_idx])
+max_concurrent_requests = int(concurrent_requests_per_second[peak_idx])
```
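To make the reviewer's distinction concrete, here is a small standalone sketch (the array names mirror the PR, but the values are invented): taking two independent maxima can pair a throughput peak with a concurrency figure from a different moment of the run, while indexing both series at the `np.argmax` of the throughput keeps them aligned.

```python
import numpy as np

# Hypothetical per-second series; values invented for illustration.
tokens_per_second = np.array([120.0, 450.0, 380.0, 90.0])
concurrent_requests_per_second = np.array([4, 8, 16, 2])

# Independent maxima: 450.0 tok/s (index 1) and 16 requests (index 2)
# come from different seconds of the run.
print(np.max(tokens_per_second))               # 450.0
print(np.max(concurrent_requests_per_second))  # 16

# Aligned via argmax: concurrency at the moment of peak throughput.
peak_idx = np.argmax(tokens_per_second)
print(float(tokens_per_second[peak_idx]))             # 450.0
print(int(concurrent_requests_per_second[peak_idx]))  # 8
```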
vllm/benchmarks/serve.py (Outdated)
print("{:<40} {:<10.2f}".format("Peak concurrent requests (req/s):", | ||
metrics.max_concurrent_requests)) |
The metric `max_concurrent_requests` is a count of concurrent requests, not a rate. The unit `(req/s)` in the printed output is misleading and should be removed. Also, since it's an integer count, an integer format specifier would be more appropriate than a float one.
print("{:<40} {:<10.2f}".format("Peak concurrent requests (req/s):", | |
metrics.max_concurrent_requests)) | |
print("{:<40} {:<10}".format("Peak concurrent requests:", | |
metrics.max_concurrent_requests)) |
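For illustration, a quick sketch of how the two format specifiers render (the value 8 is made up):

```python
max_concurrent_requests = 8

# Float specifier with the misleading unit: renders the count as "8.00".
print("{:<40} {:<10.2f}".format("Peak concurrent requests (req/s):",
                                max_concurrent_requests))
# Integer-friendly specifier without the unit: renders it as "8".
print("{:<40} {:<10}".format("Peak concurrent requests:",
                             max_concurrent_requests))
```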
Already started using this PR. Very helpful! Thanks!
This pull request enhances the benchmarking capabilities in `vllm/benchmarks/serve.py` and related files by adding new performance metrics, improving code formatting, and making minor functional improvements. The most significant update is the calculation and reporting of peak output token throughput and peak concurrent requests during benchmarking runs. Additionally, the code now conditionally displays terminal plots if the required dependencies are available, and several formatting and style improvements have been made for better readability.

Benchmarking metrics and reporting:
- Added peak output token throughput (`max_output_tokens_per_s`) and peak concurrent requests (`max_concurrent_requests`) to the `BenchmarkMetrics` dataclass, and included these metrics in the output and printed summaries.
- Added a `start_time` for each request in `RequestFuncOutput` and ensured it is set in all relevant request functions, enabling accurate time-based metric calculations.

Visualization improvements:
- Throughput is now plotted in the terminal using `termplotlib` and `gnuplot` if available.

Code formatting and style:
- Various formatting and style improvements for better readability.
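For readers who want the end-to-end idea without digging into serve.py, below is a minimal, self-contained sketch of how per-second token throughput and concurrency series could be derived from request start/end times and reduced to the peak metrics, with an optional terminal plot. The helper name `peak_metrics`, the even spreading of tokens across a request's active seconds, and the exact binning are assumptions for illustration, not the PR's literal implementation.

```python
import numpy as np


def peak_metrics(start_times, end_times, output_tokens, plot=False):
    """Sketch (not the PR's code): bin requests into 1-second buckets,
    then report peak output tokens/s and the concurrency at that peak."""
    t0 = min(start_times)
    duration = max(int(np.ceil(max(end_times) - t0)), 1)
    tokens_per_second = np.zeros(duration)
    concurrent_requests_per_second = np.zeros(duration, dtype=int)
    for start, end, n_tokens in zip(start_times, end_times, output_tokens):
        first = min(int(start - t0), duration - 1)
        last = min(int(end - t0), duration - 1)
        active = np.arange(first, last + 1)
        # Assumption: spread a request's tokens evenly over its active seconds.
        tokens_per_second[active] += n_tokens / len(active)
        concurrent_requests_per_second[active] += 1

    peak_idx = np.argmax(tokens_per_second)
    if plot:
        # Guarded like the PR's conditional display; needs termplotlib
        # installed and a gnuplot binary on PATH.
        try:
            import termplotlib as tpl
            fig = tpl.figure()
            fig.plot(np.arange(duration), tokens_per_second,
                     label="output tokens/s", width=60, height=15)
            fig.show()
        except ImportError:
            pass
    return (float(tokens_per_second[peak_idx]),
            int(concurrent_requests_per_second[peak_idx]))


if __name__ == "__main__":
    peak_tps, peak_conc = peak_metrics(
        start_times=[0.0, 0.5, 1.2],
        end_times=[2.0, 3.1, 2.8],
        output_tokens=[200, 350, 180])
    print(f"Peak output tokens/s: {peak_tps:.2f} "
          f"({peak_conc} concurrent requests at that peak)")
```

Binning by wall-clock second matches how peak throughput is usually reported; finer bins would sharpen the peak at the cost of a noisier series.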