Description
Motivation.
With the recent addition of the KV Connector API (PR #15960) and the growing adoption of KV cache offloading strategies, there is an increasing need for benchmarking tools that simulate realistic multi-turn chat interactions.
KV cache offloading pays off when cached data is reused, but blocks that are no longer resident in GPU memory must first be retrieved from the offloading backend.
This approach is particularly beneficial for multi-turn conversations, which rely on KV cache reuse.
However, even with automatic prefix caching (APC) enabled, long pauses between conversation turns often cause the necessary KV blocks to be evicted from GPU memory.
Currently, the vLLM library lacks a dedicated benchmarking suite that emulates realistic full-session conversations, including system prompts and chat history.
This RFC proposes a benchmark tool that simulates real-world, multi-client behavior using REST API calls (OpenAI API).
It measures key performance metrics such as:
- Time to First Token (TTFT).
- Time Per Output Token (TPOT).
- End-to-End Latency.
- Throughput (requests per second).
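For reference, the sketch below shows one way these per-request metrics can be derived from a streaming response. It is a simplified illustration (every non-empty SSE line is counted as one output chunk), not the script's actual implementation.

```python
import time
import requests

def measure_streaming_request(url: str, payload: dict) -> dict:
    """Send one streaming chat request and derive TTFT, TPOT and end-to-end latency."""
    start = time.perf_counter()
    first_chunk_time = None
    num_chunks = 0

    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or line == b"data: [DONE]":
                continue
            if first_chunk_time is None:
                first_chunk_time = time.perf_counter()  # first token arrived
            num_chunks += 1

    end = time.perf_counter()
    ttft = first_chunk_time - start
    # TPOT excludes the first token: remaining time divided by remaining chunks.
    tpot = (end - first_chunk_time) / max(num_chunks - 1, 1)
    return {
        "ttft_ms": ttft * 1000.0,
        "tpot_ms": tpot * 1000.0,
        "latency_ms": (end - start) * 1000.0,
    }
```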
This tool is designed to:
- Identify performance bottlenecks in cache-heavy deployments.
- Evaluate the impact of cache-aware routing strategies (e.g., user-to-node affinity).
- Stress-test server behavior under concurrent, realistic workloads.
Proposed Change.
Introduce a new script, benchmark_serving_multi_turn.py, with the following core features:
✅ Support for ShareGPT or Synthetic Conversations
- Input from ShareGPT-style JSON datasets.
- Generation of synthetic conversations with control over the following:
- Number of turns in the conversations.
- Input/output token count per turn.
- Prefix token count (shared or unique, prepended to the first user prompt of every conversation).
- Multiple random distributions (uniform, lognormal, etc.) for the parameters mentioned above.
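A minimal sketch of how such distribution specs could be sampled and turned into prompt text. The helper names are assumptions for illustration, not the script's exact internals.

```python
import random

def sample_int(spec: dict) -> int:
    """Draw one integer from a distribution spec like those in generate_multi_turn.json."""
    kind = spec["distribution"]
    if kind == "constant":
        return int(spec["value"])
    if kind == "uniform":
        return random.randint(spec["min"], spec["max"])
    if kind == "lognormal":
        value = int(random.lognormvariate(spec["mean"], spec["sigma"]))
        return min(value, spec["max"]) if "max" in spec else value
    raise ValueError(f"unknown distribution: {kind}")

def random_token_window(tokenizer, text: str, num_tokens: int) -> str:
    """Cut a random window of num_tokens tokens out of a source text file."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    start = random.randrange(0, max(len(token_ids) - num_tokens, 1))
    return tokenizer.decode(token_ids[start:start + num_tokens])
```

A user turn could then be built by sampling num_tokens and calling random_token_window with a tokenizer such as AutoTokenizer.from_pretrained(MODEL_NAME).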
🧠 Full Chat Context Handling
- Submits each request with full chat history (including all previous user/assistant turns), mimicking actual multi-turn chat interactions.
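Conceptually, each turn is sent as a standard OpenAI chat-completions request that carries the whole conversation so far. A sketch (field names follow the OpenAI API; the helper itself is hypothetical):

```python
def build_chat_request(model: str, system_prompt: str,
                       history: list[dict], next_user_message: str,
                       stream: bool = True) -> dict:
    """Build the JSON body for /v1/chat/completions with the full history."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history  # all previous {"role": "user"/"assistant", ...} turns
    messages.append({"role": "user", "content": next_user_message})
    return {"model": model, "messages": messages, "stream": stream}
```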
🧪 Parallel Requests Support
- Parallel clients (--num-clients) using multiprocessing.
- Each client sends one request at a time.
- Option to enable/disable API streaming mode.
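A rough sketch of the multiprocessing setup, under the assumption that clients communicate through shared queues (the client_worker function is sketched further below, in the task-queue discussion):

```python
import multiprocessing as mp

def run_benchmark(conversations: list, num_clients: int, max_active: int) -> list:
    """Spawn num_clients processes; each client issues one request at a time."""
    task_queue: mp.Queue = mp.Queue()
    result_queue: mp.Queue = mp.Queue()
    for conv in conversations:
        task_queue.put(conv)

    clients = [
        mp.Process(target=client_worker, args=(task_queue, result_queue, max_active))
        for _ in range(num_clients)
    ]
    for client in clients:
        client.start()

    # Drain results before joining so the workers' queue feeder threads can exit.
    results = [result_queue.get() for _ in range(len(conversations))]
    for client in clients:
        client.join()
    return results
```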
🔥 Warm-Up Support
- Optional one-time warmup step to exclude cold-start effects from the measured stats (--warmup-step).
- Sends the first user turn of every conversation as the warmup.
🎯 KV Cache Offloading Benchmarking
- Each client can alternate between multiple conversations (--max-active-conversations), simulating natural delays between conversation turns (e.g., messages minutes or hours apart).
- Useful for benchmarking KV cache retrieval from offloading backends.
📊 Output & Debugging
- Outputs an optional Excel summary (--excel-output).
- Real-time benchmark metrics: estimated RPS, progress %, ETA, etc.
- Optionally prints every prompt/response (--print-content).
- Saves the updated conversations, including model completions (--output-file).
- Optional answer verification against the expected dataset responses (--verify-output); should be used with a temperature of 0 for deterministic results.
Example usage:
Input JSON file generate_multi_turn.json (for generation of synthetic conversations):
{
"filetype": "generate_conversations",
"num_conversations": 24,
"text_files": ["pg1184.txt"],
"print_stats": false,
"prompt_input": {
"num_turns": {
"distribution": "uniform",
"min": 12,
"max": 18
},
"common_prefix_num_tokens": {
"distribution": "constant",
"value": 500
},
"prefix_num_tokens": {
"distribution": "lognormal",
"mean": 6,
"sigma": 4,
"max": 1500
},
"num_tokens": {
"distribution": "uniform",
"min": 120,
"max": 160
}
},
"prompt_output": {
"num_tokens": {
"distribution": "uniform",
"min": 80,
"max": 120
}
}
}
The prompt_input section controls the requests (input tokens and number of turns).
The prompt_output section controls the output tokens of the assistant answers.
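One possible way (a sketch, not necessarily what the script does) to make the assistant answers match the sampled prompt_output lengths is to pass the sampled value as max_tokens; vLLM's OpenAI-compatible server also accepts an ignore_eos extension that lets the answer reach the requested length:

```python
def output_sampling_params(num_output_tokens: int) -> dict:
    """Request fields derived from a sampled prompt_output token count."""
    return {
        "max_tokens": num_output_tokens,
        "ignore_eos": True,   # vLLM extension: keep generating up to max_tokens
        "temperature": 0.0,   # deterministic; recommended when using --verify-output
    }
```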
Start the vLLM server (vllm serve) before running the benchmark.
Example benchmark command:
export MODEL_NAME=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/
python benchmark_serving_multi_turn.py --model $MODEL_NAME --input-file generate_multi_turn.json \
--num-clients 2 --max-active-conversations 6
Note: the model path is required because the tool uses the model tokenizer.
With the input file (generate_multi_turn.json) given above, the tool will generate 24 conversations ("num_conversations": 24).
In the example command, --num-clients 2 and --max-active-conversations 6 are used, which means up to 3 active conversations per client (6 / 2 = 3).
This scenario can be pictured as 24 conversations (numbered 1 to 24) distributed between the two clients.
The tool has a task queue and a result queue (process/thread-safe queues).
At the start of the benchmark, the task queue is filled with the 24 conversations.
Each client will handle up to 3 conversations and will alternate between them (round robin).
Increasing --max-active-conversations will increase the delay between subsequent turns of each conversation (because of the round robin between the client's active conversations).
When a conversation is complete (has no more turns), the client will insert the conversation (with assistant answers and performance measurements) into the result queue.
If the task queue is not empty, the client will pull more conversations from it (up to 3 active per client).
The benchmark ends when at least one client has finished and there are no more conversations in the task queue.
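A sketch of the per-client loop described above; send_next_turn and is_finished are hypothetical helpers standing in for the actual request logic and bookkeeping:

```python
import multiprocessing as mp

def client_worker(task_queue: mp.Queue, result_queue: mp.Queue, max_active: int) -> None:
    """One client: round-robin over up to max_active conversations at a time."""
    active: list = []
    while True:
        # Refill the active set from the task queue (queue.empty() is approximate,
        # which is good enough for a sketch).
        while len(active) < max_active and not task_queue.empty():
            active.append(task_queue.get())
        if not active:
            break  # task queue drained and no active conversations left

        # Round robin: send exactly one turn of each active conversation.
        for conv in list(active):
            send_next_turn(conv)        # hypothetical: POST the next user turn, record metrics
            if is_finished(conv):       # hypothetical: no more turns left
                active.remove(conv)
                result_queue.put(conv)  # completed conversation + measurements
```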
Example output (summary of the benchmark):
Conversations Generation Parameters:
text_files=pg1184.txt
input_num_turns=UniformDistribution[12, 18]
input_common_prefix_num_tokens=Constant[500]
input_prefix_num_tokens=LognormalDistribution[6, 4]
input_num_tokens=UniformDistribution[120, 160]
output_num_tokens=UniformDistribution[80, 120]
----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 215.810
requests_per_sec = 0.769
----------------------------------------------------------------------------------------------------
count mean std min 25% 50% 75% 90% 99% max
ttft_ms 166.0 78.22 67.63 45.91 59.94 62.26 64.43 69.66 353.18 567.54
tpot_ms 166.0 25.37 0.57 24.40 25.07 25.31 25.50 25.84 27.50 28.05
latency_ms 166.0 2591.07 326.90 1998.53 2341.62 2573.01 2860.10 3003.50 3268.46 3862.94
input_num_turns 166.0 7.43 4.57 1.00 3.00 7.00 11.00 13.00 17.00 17.00
input_num_tokens 166.0 2006.20 893.56 522.00 1247.75 2019.00 2718.00 3233.00 3736.45 3899.00
output_num_tokens 166.0 100.01 11.80 80.00 91.00 99.00 109.75 116.00 120.00 120.00
output_num_chunks 166.0 99.01 11.80 79.00 90.00 98.00 108.75 115.00 119.00 119.00
----------------------------------------------------------------------------------------------------
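A summary table of this shape can be produced from the per-request measurements with pandas; a sketch under the assumption that each request yielded a dict of metrics:

```python
import pandas as pd

def summarize(per_request_metrics: list[dict]) -> pd.DataFrame:
    """Percentile summary in the same shape as the table above."""
    df = pd.DataFrame(per_request_metrics)  # ttft_ms, tpot_ms, latency_ms, ...
    summary = df.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99]).transpose()
    cols = ["count", "mean", "std", "min", "25%", "50%", "75%", "90%", "99%", "max"]
    return summary[cols].round(2)
```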
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response