-
Notifications
You must be signed in to change notification settings - Fork 629
Closed
Labels
Description
How does this result test
Please refer to vllm-ascend benchmark
Rules
online serving test
- Input length: randomly sample 200 prompts ShareGPT and lmarena-ai/vision-arena-bench-v0.1(multi-modal) dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- Average QPS (query per second): 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput, TTFT (median time to the first token ), ITL (median inter-token latency) TPOT(median time per output token).
latency test
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: end-to-end latency.
throughput test
- Input length: randomly sample 200 prompts from ShareGPT and lmarena-ai/vision-arena-bench-v0.1(multi-modal) dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput.
test parameters
serving parameters
[
{
"test_name": "serving_qwen2_5vl_7B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"trust_remote_code": "",
"max_model_len": 16384
},
"client_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"backend": "openai-chat",
"dataset_name": "hf",
"hf_split": "train",
"endpoint": "/v1/chat/completions",
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200
}
},
{
"test_name": "serving_qwen2_5_7B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
}
]
throughput parameters
[
{
"test_name": "throughput_qwen2_5_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
},
{
"test_name": "throughput_qwen2_5vl_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"tensor_parallel_size": 1,
"backend": "vllm-chat",
"dataset_name": "hf",
"hf_split": "train",
"max_model_len": 16384,
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200
}
}
]
latency parameters
[
{
"test_name": "latency_qwen2_5vl_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"tensor_parallel_size": 1,
"max_model_len": 16384,
"num_iters_warmup": 5,
"num_iters": 15
}
},
{
"test_name": "latency_qwen2_5_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"max_model_len": 16384,
"num_iters_warmup": 5,
"num_iters": 15
}
}
]
vllm-ascend v0.7.3 + MindIE Trubo
Online serving tests
Qwen2.5-7B-Instruct
| Test name | Request rate (req/s) | Tput (req/s) | Output Tput (tok/s) | TTFT (ms) | TPOT (ms) | ITL (ms) |
|---|---|---|---|---|---|---|
| serving_qwen2_5_7B_tp1_qps_1 | 1 | 0.95543 | 213.524 | 69.5214 | 23.351 | 22.5373 |
| serving_qwen2_5_7B_tp1_qps_4 | 4 | 3.15729 | 705.608 | 72.2935 | 30.3927 | 26.7608 |
| serving_qwen2_5_7B_tp1_qps_16 | 16 | 5.64525 | 1261.63 | 85.6201 | 55.3979 | 36.701 |
| serving_qwen2_5_7B_tp1_qps_inf | inf | 6.647 | 1485.51 | 3198.48 | 50.8949 | 40.9818 |
Qwen2.5-VL-7B-Instruct
| Test name | Request rate (req/s) | Tput (req/s) | Output Tput (tok/s) | TTFT (ms) | TPOT (ms) | ITL (ms) |
|---|---|---|---|---|---|---|
| serving_qwen2_5vl_7B_tp1_qps_1 | 1 | 0.997964 | 109.247 | 374.364 | 30.8042 | 22.887 |
| serving_qwen2_5vl_7B_tp1_qps_4 | 4 | 3.7565 | 411.938 | 301.918 | 67.4147 | 31.0699 |
| serving_qwen2_5vl_7B_tp1_qps_16 | 16 | 6.2011 | 673.533 | 8744.6 | 158.046 | 64.3575 |
| serving_qwen2_5vl_7B_tp1_qps_inf | inf | 5.11639 | 557.508 | 17186.9 | 195.048 | 64.0484 |
vllm-ascend v0.7.3
Online serving
Qwen2.5-7B-Instruct
| Test name | Request rate (req/s) | Tput (req/s) | Output Tput (tok/s) | TTFT (ms) | TPOT (ms) | ITL (ms) |
|---|---|---|---|---|---|---|
| serving_qwen2_5_7B_tp1_qps_1 | 1 | 0.922267 | 206.113 | 97.8581 | 36.5128 | 35.3481 |
| serving_qwen2_5_7B_tp1_qps_4 | 4 | 2.79137 | 623.829 | 104.405 | 48.9373 | 41.9421 |
| serving_qwen2_5_7B_tp1_qps_16 | 16 | 4.37615 | 978.005 | 114.836 | 75.88 | 51.2339 |
| serving_qwen2_5_7B_tp1_qps_inf | inf | 5.06881 | 1132.8 | 2845.75 | 62.8063 | 53.0942 |
Qwen2.5-VL-7B-Instruct
| Test name | Request rate (req/s) | Tput (req/s) | Output Tput (tok/s) | TTFT (ms) | TPOT (ms) | ITL (ms) |
|---|---|---|---|---|---|---|
| serving_qwen2_5vl_7B_tp1_qps_1 | 1 | 0.993584 | 109.046 | 345.717 | 40.9481 | 32.1247 |
| serving_qwen2_5vl_7B_tp1_qps_4 | 4 | 3.66915 | 402.579 | 321.537 | 86.7003 | 42.4031 |
| serving_qwen2_5vl_7B_tp1_qps_16 | 16 | 6.07012 | 668.412 | 8580.35 | 164.858 | 71.5467 |
| serving_qwen2_5vl_7B_tp1_qps_inf | inf | 5.26736 | 579.831 | 14326.2 | 208.966 | 73.1931 |
Offline tests
Latency tests
| Test name | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|
| latency_qwen2_5vl_7B_tp1 | 4128.36 | 4126.04 | 4200.24 |
| latency_qwen2_5_7B_tp1 | 4274.96 | 4284.9 | 4335.71 |
Throughput tests
| Test name | Num of reqs | Total num of tokens | Elapsed time (s) | Tput (req/s) | Tput (tok/s) |
|---|---|---|---|---|---|
| throughput_qwen2_5_7B_tp1 | 200 | 88257 | 40.8233 | 4.89917 | 2161.93 |
| throughput_qwen2_5vl_7B_tp1 | 200 | 203673 | 69.9566 | 2.85892 | 2911.42 |
Conclusion
First, we compared the online serving results with and without mindie_turbo integration
Qwen2.5-7B-Instruct
| QPS | Metric | vllm-ascend+ mindie_turbo | vllm-ascend | Optimization |
|---|---|---|---|---|
| 1 | Tput | 0.95543 | 0.922267 | +3.60% |
| Output Tput | 213.524 | 206.113 | +3.59% | |
| TTFT | 69.5214 | 97.8581 | +28.98% | |
| TPOT | 23.351 | 36.5128 | +36.05% | |
| ITL | 22.5373 | 35.3481 | +36.24% | |
| 4 | Tput | 3.15729 | 2.79137 | +13.11% |
| Output Tput | 705.608 | 623.829 | +13.09% | |
| TTFT | 72.2935 | 104.405 | +30.78% | |
| TPOT | 30.3927 | 48.9373 | +37.91% | |
| ITL | 26.7608 | 41.9421 | +36.19% | |
| 16 | Tput | 5.64525 | 4.37615 | +28.96% |
| Output Tput | 1261.63 | 978.005 | +28.98% | |
| TTFT | 85.6201 | 114.836 | +25.46% | |
| TPOT | 55.3979 | 75.88 | +26.98% | |
| ITL | 36.701 | 51.2339 | +28.37% | |
| inf | Tput | 6.647 | 5.06881 | +31.12% |
| Output Tput | 1485.51 | 1132.8 | +31.21% | |
| TTFT | 3198.48 | 2845.75 | -12.41% | |
| TPOT | 50.8949 | 62.8063 | +18.96% | |
| ITL | 40.9818 | 53.0942 | +22.80% |
Qwen2.5-VL-7B-Instruct
| QPS | Metric | vllm-ascend+ mindie_turbo | vllm-ascend | Optimization |
|---|---|---|---|---|
| 1 | Tput | 0.997964 | 0.993584 | +0.44% |
| Output Tput | 109.247 | 109.046 | +0.18% | |
| TTFT | 374.364 | 345.717 | -8.27% | |
| TPOT | 30.8042 | 40.9481 | +24.78% | |
| ITL | 22.887 | 32.1247 | +28.74% | |
| 4 | Tput | 3.7565 | 3.66915 | +2.38% |
| Output Tput | 411.938 | 402.579 | +2.33% | |
| TTFT | 301.918 | 321.537 | +6.10% | |
| TPOT | 67.4147 | 86.7003 | +22.22% | |
| ITL | 31.0699 | 42.4031 | +26.71% | |
| 16 | Tput | 6.2011 | 6.07012 | +2.16% |
| Output Tput | 673.533 | 668.412 | +0.77% | |
| TTFT | 8744.6 | 8580.35 | -1.91% | |
| TPOT | 158.046 | 164.858 | +4.14% | |
| ITL | 64.3575 | 71.5467 | +10.06% | |
| inf | Tput | 5.11639 | 5.26736 | -2.87% |
| Output Tput | 557.508 | 579.831 | -3.85% | |
| TTFT | 17186.9 | 14326.2 | -20.03% | |
| TPOT | 195.048 | 208.966 | +6.63% | |
| ITL | 64.0484 | 73.1931 | +12.50% |