[benchmark] add peak throughput metrics and plot #23867
Conversation
"on the benchmark arguments.", | ||
stacklevel=2) | ||
|
||
# Calculate max output tokens per second metric |
This is the main change. The formatting in this file is pretty sad.
Code Review
This pull request introduces valuable new metrics for peak throughput and concurrency to the serving benchmark, along with terminal plots for visualization. The implementation is mostly solid, but I've identified a couple of issues related to the correctness of the peak concurrency calculation and its presentation in the output. Specifically, the peak concurrency is calculated independently of the peak throughput, which seems to deviate from the intended metric of 'concurrency at peak throughput'. Additionally, the unit for concurrent requests in the summary table is misleading. Addressing these points will improve the accuracy and clarity of the new benchmark metrics.
```python
max_output_tokens_per_s = float(np.max(tokens_per_second))
max_concurrent_requests = int(
    np.max(concurrent_requests_per_second))
```
The calculation of `max_concurrent_requests` is independent of `max_output_tokens_per_s`. It finds the maximum concurrent requests over the entire benchmark, not the concurrency at the time of peak token throughput. To align with the goal of finding "concurrent requests at that peak", you should find the index of the maximum token throughput and use that to get the corresponding concurrent requests.
```diff
-max_output_tokens_per_s = float(np.max(tokens_per_second))
-max_concurrent_requests = int(
-    np.max(concurrent_requests_per_second))
+peak_idx = np.argmax(tokens_per_second)
+max_output_tokens_per_s = float(tokens_per_second[peak_idx])
+max_concurrent_requests = int(concurrent_requests_per_second[peak_idx])
```
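To make the reviewer's distinction concrete, here is a small standalone sketch (the array names mirror the PR, but the values are invented): taking two independent maxima can pair a throughput peak with a concurrency figure from a different moment of the run, while indexing both series at the `np.argmax` of the throughput keeps them aligned.

```python
import numpy as np

# Hypothetical per-second series; values invented for illustration.
tokens_per_second = np.array([120.0, 450.0, 380.0, 90.0])
concurrent_requests_per_second = np.array([4, 8, 16, 2])

# Independent maxima: 450.0 tok/s (index 1) and 16 requests (index 2)
# come from different seconds of the run.
print(np.max(tokens_per_second))               # 450.0
print(np.max(concurrent_requests_per_second))  # 16

# Aligned via argmax: concurrency at the moment of peak throughput.
peak_idx = np.argmax(tokens_per_second)
print(float(tokens_per_second[peak_idx]))             # 450.0
print(int(concurrent_requests_per_second[peak_idx]))  # 8
```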
vllm/benchmarks/serve.py (Outdated)
print("{:<40} {:<10.2f}".format("Peak concurrent requests (req/s):", | ||
metrics.max_concurrent_requests)) |
The metric `max_concurrent_requests` is a count of concurrent requests, not a rate. The unit `(req/s)` in the printed output is misleading and should be removed. Also, since it's an integer count, an integer format specifier would be more appropriate than a float one.
print("{:<40} {:<10.2f}".format("Peak concurrent requests (req/s):", | |
metrics.max_concurrent_requests)) | |
print("{:<40} {:<10}".format("Peak concurrent requests:", | |
metrics.max_concurrent_requests)) |
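For illustration, a quick sketch of how the two format specifiers render (the value 8 is made up):

```python
max_concurrent_requests = 8

# Float specifier with the misleading unit: renders the count as "8.00".
print("{:<40} {:<10.2f}".format("Peak concurrent requests (req/s):",
                                max_concurrent_requests))
# Integer-friendly specifier without the unit: renders it as "8".
print("{:<40} {:<10}".format("Peak concurrent requests:",
                             max_concurrent_requests))
```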
Already started using this PR. Very helpful! Thanks!
This pull request enhances the benchmarking capabilities in `vllm/benchmarks/serve.py` and related files by adding new performance metrics, improving code formatting, and making minor functional improvements. The most significant update is the calculation and reporting of peak output token throughput and peak concurrent requests during benchmarking runs. Additionally, the code now conditionally displays terminal plots if the required dependencies are available, and several formatting and style improvements have been made for better readability.

Benchmarking metrics and reporting:
- Added peak output token throughput (`max_output_tokens_per_s`) and peak concurrent requests (`max_concurrent_requests`) to the `BenchmarkMetrics` dataclass, and included these metrics in the output and printed summaries.
- Added a `start_time` for each request in `RequestFuncOutput` and ensured it is set in all relevant request functions, enabling accurate time-based metric calculations.

Visualization improvements:
- Throughput is now plotted in the terminal using `termplotlib` and `gnuplot` if available.

Code formatting and style:
- Various formatting and style improvements for better readability.
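For readers who want the end-to-end idea without digging into serve.py, below is a minimal, self-contained sketch of how per-second token throughput and concurrency series could be derived from request start/end times and reduced to the peak metrics, with an optional terminal plot. The helper name `peak_metrics`, the even spreading of tokens across a request's active seconds, and the exact binning are assumptions for illustration, not the PR's literal implementation.

```python
import numpy as np


def peak_metrics(start_times, end_times, output_tokens, plot=False):
    """Sketch (not the PR's code): bin requests into 1-second buckets,
    then report peak output tokens/s and the concurrency at that peak."""
    t0 = min(start_times)
    duration = max(int(np.ceil(max(end_times) - t0)), 1)
    tokens_per_second = np.zeros(duration)
    concurrent_requests_per_second = np.zeros(duration, dtype=int)
    for start, end, n_tokens in zip(start_times, end_times, output_tokens):
        first = min(int(start - t0), duration - 1)
        last = min(int(end - t0), duration - 1)
        active = np.arange(first, last + 1)
        # Assumption: spread a request's tokens evenly over its active seconds.
        tokens_per_second[active] += n_tokens / len(active)
        concurrent_requests_per_second[active] += 1

    peak_idx = np.argmax(tokens_per_second)
    if plot:
        # Guarded like the PR's conditional display; needs termplotlib
        # installed and a gnuplot binary on PATH.
        try:
            import termplotlib as tpl
            fig = tpl.figure()
            fig.plot(np.arange(duration), tokens_per_second,
                     label="output tokens/s", width=60, height=15)
            fig.show()
        except ImportError:
            pass
    return (float(tokens_per_second[peak_idx]),
            int(concurrent_requests_per_second[peak_idx]))


if __name__ == "__main__":
    peak_tps, peak_conc = peak_metrics(
        start_times=[0.0, 0.5, 1.2],
        end_times=[2.0, 3.1, 2.8],
        output_tokens=[200, 350, 180])
    print(f"Peak output tokens/s: {peak_tps:.2f} "
          f"({peak_conc} concurrent requests at that peak)")
```

Binning by wall-clock second matches how peak throughput is usually reported; finer bins would sharpen the peak at the cost of a noisier series.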