
Conversation

@simon-mo (Collaborator) commented Aug 28, 2025

This pull request enhances the benchmarking capabilities in vllm/benchmarks/serve.py and related files by adding new performance metrics and making minor functional and formatting improvements. The most significant update is the calculation and reporting of peak output token throughput and peak concurrent requests during benchmarking runs. Additionally, the code now conditionally displays terminal plots when the required dependencies are available.

Benchmarking metrics and reporting:

  • Added calculation of peak output token throughput (max_output_tokens_per_s) and peak concurrent requests (max_concurrent_requests) to the BenchmarkMetrics dataclass, and included these metrics in the output and printed summaries.
  • Tracked the start_time for each request in RequestFuncOutput and ensured it is set in all relevant request functions, enabling accurate time-based metric calculations (a sketch of this computation follows the list).
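
As a rough illustration of what these metrics capture, here is a minimal sketch of deriving per-second peaks from request timestamps. It is not the PR's exact code: the function name and the choice to credit a request's output tokens to the second in which it finished are hypothetical simplifications.

    # Hypothetical sketch, not the PR's implementation. Assumes each request
    # exposes a start timestamp, an end timestamp, and an output token count
    # (which tracking start_time in RequestFuncOutput makes possible).
    import numpy as np

    def peak_metrics(start_times, end_times, output_lens, bucket_s=1.0):
        t0 = min(start_times)
        n_buckets = int((max(end_times) - t0) / bucket_s) + 1
        tokens_per_second = np.zeros(n_buckets)
        concurrent_requests = np.zeros(n_buckets)
        for start, end, n_tokens in zip(start_times, end_times, output_lens):
            # Simplification: credit all of a request's output tokens to
            # the bucket in which the request finished.
            tokens_per_second[int((end - t0) / bucket_s)] += n_tokens
            # Count the request as active in every bucket it overlaps.
            first = int((start - t0) / bucket_s)
            last = int((end - t0) / bucket_s)
            concurrent_requests[first:last + 1] += 1
        return (float(np.max(tokens_per_second)),
                int(np.max(concurrent_requests)))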

Visualization improvements:

  • Added conditional support for terminal-based plotting of output token throughput and concurrent requests per second using termplotlib and gnuplot if available (see the sketch below).
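
As a sketch of how such a conditional path can be guarded (the PR's actual check may differ), note that termplotlib delegates rendering to gnuplot, so both the Python package and the gnuplot binary need to be present:

    # Optional terminal plotting; degrade gracefully to the plain summary
    # table when either termplotlib or gnuplot is unavailable.
    import shutil

    try:
        import termplotlib as tpl
        HAS_TERMPLOT = shutil.which("gnuplot") is not None
    except ImportError:
        HAS_TERMPLOT = False

    def maybe_plot(x, y, title):
        # Silently skip plotting when optional dependencies are missing.
        if not HAS_TERMPLOT:
            return
        print(title)
        fig = tpl.figure()
        fig.plot(x, y, width=70, height=20)
        fig.show()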

Code formatting and style:

  • Improved code formatting for better readability, including consistent line breaks, indentation, and argument formatting in function calls and assertions.
$ vllm bench serve --model 'Qwen/Qwen3-0.6B' 
INFO 08-28 22:41:28 [__init__.py:241] Automatically detected platform cuda.
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7fa6fd6a8360>, seed=0, num_prompts=1000, dataset_name='random', no_stream=False, dataset_path=None, custom_output_len=256, custom_skip_chat_template=False, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 0}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, endpoint_type='openai', label=None, backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', max_concurrency=None, model='Qwen/Qwen3-0.6B', tokenizer=None, use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, request_id_prefix='benchmark-serving', top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=600)
INFO 08-28 22:41:42 [datasets.py:509] Sampling input_len from [1024, 1024] and output_len from [128, 128]
Starting initial single prompt test run...
Waiting for endpoint to become up in 600 seconds
 |                                                              | 00:00 elapsed, 4:54:29 remaining
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████| 1000/1000 [00:12<00:00, 78.25it/s]

                              Output tokens per second
  18000 +-------------------------------------------------------------------+
        |                                                    *              |
  16000 |                                                  ** *             |
        |                            ******               *    *            |
  14000 |                           *      *             *      *           |
        |                           *       *           *       *           |
  12000 |                          *         *        **         *          |
        |                         *           *      *            *         |
  10000 |                        *             ******             *         |
        |                        *                                 *        |
        |                       *                                  *        |
   8000 |                     **                                    *       |
        |              **   **                                      *       |
   6000 |             *  ***                                         *      |
        |            *                                               *      |
   4000 |           *                                                *      |
        |          *                                                  *     |
   2000 |        **                                                   *     |
        |      **                                                      *    |
      0 +-------------------------------------------------------------------+
        0         2        4         6         8         10       12        14

                          Concurrent requests per second
  1000 +--------------------------------------------------------------------+
       |                 *******************                                |
       |                                    ***                             |
       |                                       **                           |
   800 |                                         **                         |
       |                                           **                       |
       |                                             **                     |
       |                                               **                   |
   600 |                                                 *                  |
       |                                                  **                |
       |                                                    *               |
   400 |                                                     **             |
       |                                                       **           |
       |                                                         **         |
       |                                                           *        |
   200 |                                                            *       |
       |                                                             *      |
       |                                                             *      |
       |                                                              *     |
     0 +--------------------------------------------------------------------+
       0         2         4         6        8         10        12        14
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  12.78     
Total input tokens:                      1021255   
Total generated tokens:                  117355    
Request throughput (req/s):              78.25     
Output token throughput (tok/s):         9182.55   
Total Token throughput (tok/s):          89091.56  
Max output token throughput (tok/s):     16614.00  
Max concurrent requests (req/s):         1000.00   
---------------Time to First Token----------------
Mean TTFT (ms):                          5654.81   
Median TTFT (ms):                        5127.15   
P99 TTFT (ms):                           10142.20  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.70     
Median TPOT (ms):                        43.34     
P99 TPOT (ms):                           49.51     
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.07     
Median ITL (ms):                         35.23     
P99 ITL (ms):                            55.36     
==================================================

"on the benchmark arguments.",
stacklevel=2)

# Calculate max output tokens per second metric
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the main change. the formatting in this file is pretty sad

@mergify mergify bot added the performance Performance-related issues label Aug 28, 2025
@gemini-code-assist (bot, Contributor) left a comment:
Code Review

This pull request introduces valuable new metrics for peak throughput and concurrency to the serving benchmark, along with terminal plots for visualization. The implementation is mostly solid, but I've identified a couple of issues related to the correctness of the peak concurrency calculation and its presentation in the output. Specifically, the peak concurrency is calculated independently of the peak throughput, which seems to deviate from the intended metric of 'concurrency at peak throughput'. Additionally, the unit for concurrent requests in the summary table is misleading. Addressing these points will improve the accuracy and clarity of the new benchmark metrics.

Comment on lines +386 to +388
    max_output_tokens_per_s = float(np.max(tokens_per_second))
    max_concurrent_requests = int(
        np.max(concurrent_requests_per_second))
gemini-code-assist (bot) commented [severity: high]:

The calculation of max_concurrent_requests is independent of max_output_tokens_per_s. It finds the maximum concurrent requests over the entire benchmark, not the concurrency at the time of peak token throughput. To align with the goal of finding "concurrent requests at that peak", you should find the index of the maximum token throughput and use that to get the corresponding concurrent requests.

Suggested change:

    - max_output_tokens_per_s = float(np.max(tokens_per_second))
    - max_concurrent_requests = int(
    -     np.max(concurrent_requests_per_second))
    + peak_idx = np.argmax(tokens_per_second)
    + max_output_tokens_per_s = float(tokens_per_second[peak_idx])
    + max_concurrent_requests = int(concurrent_requests_per_second[peak_idx])

Comment on lines 686 to 687
    print("{:<40} {:<10.2f}".format("Peak concurrent requests (req/s):",
                                    metrics.max_concurrent_requests))
gemini-code-assist (bot) commented [severity: high]:

The metric max_concurrent_requests is a count of concurrent requests, not a rate. The unit (req/s) in the printed output is misleading and should be removed. Also, since it's an integer count, using an integer format specifier would be more appropriate than a float one.

Suggested change:

    - print("{:<40} {:<10.2f}".format("Peak concurrent requests (req/s):",
    -                                 metrics.max_concurrent_requests))
    + print("{:<40} {:<10}".format("Peak concurrent requests:",
    +                              metrics.max_concurrent_requests))

@minosfuture (Contributor) left a comment:
Already started using this PR. Very helpful! Thanks!

@simon-mo simon-mo added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 17, 2025
Signed-off-by: simon-mo <[email protected]>
@simon-mo simon-mo merged commit a904ea7 into vllm-project:main Sep 18, 2025
41 checks passed
845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 18, 2025
…litPR into model_register

* 'model_register' of https://github.com/dsxsteven/vllm_splitPR: (138 commits)
  Retrieve `sliding_window` from text config in Gemma3 MM (vllm-project#25085)
  [Docs] Fix API Reference (vllm-project#25140)
  [Kernel] Better inf handling for grouped topk cu (vllm-project#24886)
  [CLI] Use streaming in CLI chat and completion commands (vllm-project#23769)
  [benchmark] add peak throughput metrics and plot (vllm-project#23867)
  [Spec Decode] Efficient padded speculation (vllm-project#24539)
  [V0 Deprecation] Remove more V0 tests (vllm-project#25117)
  [EPLB] Add EPLB support for hunyuan_v1 (vllm-project#23078)
  [XPU] Whisper model support on XPU Platform (vllm-project#25123)
  Mark prompt logprobs as incompatible with prompt embeds at API level (vllm-project#25077)
  [Model] enable data parallel for InternVL vision encoder (vllm-project#23909)
  [Kernels] Overlap shared experts with combine instead of dispatch (vllm-project#24254)
  [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models (vllm-project#24960)
  [Core][MM] Cleanup `MultiModalCache` (vllm-project#25006)
  [Docs] Clean up the contributing README (vllm-project#25099)
  [MM Encoder] Apply DP ViT for Qwen3-VL model series (vllm-project#24955)
  [Kernels] Enable DeepGEMM by default (vllm-project#24462)
  [V0 Deprecation] Skip PP test (vllm-project#25128)
  [V0 Deprecation] Remove misc V0 tests (vllm-project#25118)
  [V0 Deprecation] Remove V0 Tracing & Metrics tests (vllm-project#25115)
  ...
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025