
[RFC]: Online inference benchmark tool for multi-turn conversations #20265

@pliops-daniels

Description


Motivation.

With the recent addition of the KV Connector API (PR #15960) and the growing adoption of KV cache offloading strategies, there is an increasing need for benchmarking tools that simulate realistic multi-turn chat interactions.

KV cache offloading works best when cached data can be reused, but that data still needs to be retrieved from the offloading backend.
The approach is particularly beneficial for multi-turn conversations, which rely heavily on KV cache reuse.

However, even with automatic prefix caching (APC) enabled, long pauses between conversation turns often cause the necessary KV blocks to be evicted from GPU memory.

Currently, the vLLM library lacks a dedicated benchmarking suite that emulates realistic full-session conversations, including system prompts and chat history.

This RFC proposes a benchmark tool that simulates real-world, multi-client behavior using REST API calls against the server's OpenAI-compatible API.

It measures key performance metrics such as:

  • Time to First Token (TTFT).
  • Time Per Output Token (TPOT).
  • End-to-End Latency.
  • Throughput (requests per second).
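
A minimal sketch of how these metrics can be derived from a streaming response (the iterable passed in and the variable names are illustrative, not the tool's actual implementation):

import time

def measure_streaming_request(stream_chunks):
    """Collect per-request metrics from a streaming chat completion.

    stream_chunks is assumed to be an iterable that yields one decoded
    text chunk per streamed token/chunk (hypothetical helper).
    """
    start = time.perf_counter()
    first_token_time = None
    num_chunks = 0

    for _chunk in stream_chunks:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now  # time of the first streamed token (TTFT)
        num_chunks += 1

    end = time.perf_counter()
    ttft_ms = (first_token_time - start) * 1000.0
    latency_ms = (end - start) * 1000.0
    # TPOT: average time per output token after the first one
    tpot_ms = ((end - first_token_time) / max(num_chunks - 1, 1)) * 1000.0
    return {"ttft_ms": ttft_ms, "tpot_ms": tpot_ms, "latency_ms": latency_ms}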

This tool is designed to:

  • Identify performance bottlenecks in cache-heavy deployments.
  • Evaluate the impact of cache-aware routing strategies (e.g., user-to-node affinity).
  • Stress-test server behavior under concurrent, realistic workloads.

Proposed Change.

Introduce a new script, benchmark_serving_multi_turn.py, with the following core features:

✅ Support for ShareGPT or Synthetic Conversations

  • Input from ShareGPT-style JSON datasets.
  • Generation of synthetic conversations with control over the following:
    • Number of turns per conversation.
    • Input/output token count per turn.
    • Prefix token count (shared or unique, prepended to the first user prompt of every conversation).
    • Multiple random distributions (uniform, lognormal, etc.) for the parameters mentioned above.

🧠 Full Chat Context Handling

  • Submits each request with full chat history (including all previous user/assistant turns), mimicking actual multi-turn chat interactions.
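
As a rough illustration of what "full chat history" means per request, a single turn against the server's OpenAI-compatible endpoint could look like the sketch below (the URL, model name, and helper are assumptions, not the script's actual code):

import requests

def send_turn(base_url, model, history, user_message):
    """Send one turn carrying the full conversation history (illustrative)."""
    history.append({"role": "user", "content": user_message})
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={"model": model, "messages": history, "stream": False},
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    # Keep the assistant answer so the next turn includes it as context
    history.append({"role": "assistant", "content": answer})
    return answer

# Example: a system prompt followed by two user turns of the same conversation
history = [{"role": "system", "content": "You are a helpful assistant."}]
send_turn("http://localhost:8000", "meta-llama/Meta-Llama-3.1-8B-Instruct", history, "First question")
send_turn("http://localhost:8000", "meta-llama/Meta-Llama-3.1-8B-Instruct", history, "Follow-up question")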

🧪 Parallel Requests Support

  • Parallel clients (--num-clients) using multiprocessing.
  • Each client sends one request at a time.
  • Option to enable/disable API streaming mode.

🔥 Warm-Up Support

  • Optional one-time warmup step to exclude cold-start effects from measured stats (--warmup-step).
  • Sends the first user turn of every conversation as the warmup.
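
A possible shape of that warm-up phase, under the assumption that each conversation stores its turns as a list of messages and that a request helper is available (both hypothetical):

def run_warmup(conversations, send_request):
    """Send the first user turn of every conversation once, untimed.

    Responses and timings are discarded so cold-start effects (model
    warm-up, cache population) do not enter the measured statistics.
    """
    for conv in conversations:
        first_user_message = conv["messages"][0]["content"]  # assumed layout
        send_request(first_user_message)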

🎯 KV Cache Offloading Benchmarking

  • Each client can alternate between multiple conversations (--max-active-conversations), simulating natural delays between conversation turns (e.g., messages minutes or hours apart).
  • Useful for benchmarking KV cache retrieval from offloading backends.

📊 Output & Debugging

  • Outputs optional Excel summary (--excel-output).
  • Real-time benchmark metrics: estimated RPS, progress %, ETA, etc.
  • Optionally print every prompt/response (--print-content).
  • Saves updated conversations with model completions (--output-file).
  • Optional answer verification against expected dataset responses (--verify-output).
    (Should be used with temperature of 0 for deterministic results).

Example usage:
Input JSON file generate_multi_turn.json (for generation of synthetic conversations):

{
    "filetype": "generate_conversations",
    "num_conversations": 24,
    "text_files": ["pg1184.txt"],
    "print_stats": false,
    "prompt_input": {
        "num_turns": {
            "distribution": "uniform",
            "min": 12,
            "max": 18
        },
        "common_prefix_num_tokens": {
            "distribution": "constant",
            "value": 500
        },
        "prefix_num_tokens": {
            "distribution": "lognormal",
            "mean": 6,
            "sigma": 4,
            "max": 1500
        },
        "num_tokens": {
            "distribution": "uniform",
            "min": 120,
            "max": 160
        }
    },
    "prompt_output": {
        "num_tokens": {
            "distribution": "uniform",
            "min": 80,
            "max": 120
        }
    }
}

The prompt_input section controls the request (input tokens and number of turns).
The prompt_output section controls the output tokens in the assistant answers.
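
As an illustration of how such distribution entries could be sampled, here is a sketch under the assumption that each JSON entry maps directly to a sampler (not necessarily the tool's internal representation):

import random

def sample(dist_cfg):
    """Draw one value from a distribution config like the entries above."""
    kind = dist_cfg["distribution"]
    if kind == "constant":
        return dist_cfg["value"]
    if kind == "uniform":
        return random.randint(dist_cfg["min"], dist_cfg["max"])
    if kind == "lognormal":
        value = int(random.lognormvariate(dist_cfg["mean"], dist_cfg["sigma"]))
        return min(value, dist_cfg.get("max", value))  # optional cap, as in prefix_num_tokens
    raise ValueError(f"unknown distribution: {kind}")

# Example: per-conversation parameters for one synthetic conversation
num_turns = sample({"distribution": "uniform", "min": 12, "max": 18})
input_tokens_per_turn = sample({"distribution": "uniform", "min": 120, "max": 160})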

Start the vLLM server (vllm serve) before running the benchmark.

Example benchmark command:

export MODEL_NAME=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

python benchmark_serving_multi_turn.py --model $MODEL_NAME --input-file generate_multi_turn.json \
--num-clients 2 --max-active-conversations 6

Note: the model path is required because the tool uses the model's tokenizer.
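
For example, input/output token counts can be computed with the Hugging Face tokenizer loaded from that path (a sketch; the tool's actual usage may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/models/meta-llama/Meta-Llama-3.1-8B-Instruct/")

def count_tokens(text: str) -> int:
    """Count tokens exactly as the target model's tokenizer would."""
    return len(tokenizer.encode(text, add_special_tokens=False))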

With the input file given above (generate_multi_turn.json), the tool will generate 24 conversations ("num_conversations": 24).
The example command uses --num-clients 2 and --max-active-conversations 6.
That means up to 3 active conversations per client (6 / 2 = 3).

This scenario can be represented by the following diagram:

[Diagram: the 24 conversations distributed between the 2 clients, each handling up to 3 active conversations at a time.]

In the picture above we can see 24 conversations (the orange blocks, numbered 1 to 24).

The tool uses a task queue and a result queue (both process/thread-safe).

At the start of the benchmark, the task queue is filled with the 24 conversations.

Each client will handle up to 3 conversations and will alternate between them (round robin).
Increasing --max-active-conversations increases the delay between subsequent turns of each conversation (because of the round robin between the client's active conversations).
When a conversation is complete (has no more turns), the client inserts it (with the assistant answers and performance measurements) into the result queue.
If the task queue is not empty, the client pulls more conversations from it (up to 3 active per client).

The benchmark ends when at least one client has finished (and there are no more conversations left in the task queue).
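
A condensed sketch of that client loop (queue handling and conversation bookkeeping are simplified assumptions, not the script's actual code):

import queue

def client_worker(task_queue, result_queue, max_active, send_turn):
    """Round-robin over up to max_active conversations pulled from the task queue.

    send_turn is a caller-provided function (hypothetical) that sends the next
    user turn with full history and returns that turn's measurements.
    """
    active = []
    while True:
        # Refill the active set from the shared (process-safe) task queue
        while len(active) < max_active:
            try:
                active.append(task_queue.get_nowait())
            except queue.Empty:
                break
        if not active:
            break  # no more work for this client

        conv = active.pop(0)                 # round robin: oldest active conversation first
        conv["metrics"].append(send_turn(conv))
        conv["remaining_turns"] -= 1
        if conv["remaining_turns"] > 0:
            active.append(conv)              # rotate to the back; next turn comes later
        else:
            result_queue.put(conv)           # finished: report answers + measurements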

Example output (summary of the benchmark):

Conversations Generation Parameters:
text_files=pg1184.txt
input_num_turns=UniformDistribution[12, 18]
input_common_prefix_num_tokens=Constant[500]
input_prefix_num_tokens=LognormalDistribution[6, 4]
input_num_tokens=UniformDistribution[120, 160]
output_num_tokens=UniformDistribution[80, 120]
----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 215.810
requests_per_sec = 0.769
----------------------------------------------------------------------------------------------------
                   count     mean     std      min      25%      50%      75%      90%      99%      max
ttft_ms            166.0    78.22   67.63    45.91    59.94    62.26    64.43    69.66   353.18   567.54
tpot_ms            166.0    25.37    0.57    24.40    25.07    25.31    25.50    25.84    27.50    28.05
latency_ms         166.0  2591.07  326.90  1998.53  2341.62  2573.01  2860.10  3003.50  3268.46  3862.94
input_num_turns    166.0     7.43    4.57     1.00     3.00     7.00    11.00    13.00    17.00    17.00
input_num_tokens   166.0  2006.20  893.56   522.00  1247.75  2019.00  2718.00  3233.00  3736.45  3899.00
output_num_tokens  166.0   100.01   11.80    80.00    91.00    99.00   109.75   116.00   120.00   120.00
output_num_chunks  166.0    99.01   11.80    79.00    90.00    98.00   108.75   115.00   119.00   119.00
----------------------------------------------------------------------------------------------------

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

