Calculate "peak token throughput" number that is closer to vLLM generation throughput #92

Closed
dagrayvid opened this issue Mar 6, 2025 · 2 comments
Labels
enhancement New feature or request metrics Metrics workstream

Comments

@dagrayvid
Collaborator

dagrayvid commented Mar 6, 2025

When running a GuideLLM sweep, the token throughput numbers reported for each RPS test are lower than a user would expect given vLLM's own generation throughput metric.

I think there are two main reasons for this.

  1. The server is under-utilized at the beginning of the run while the number of requests in flight ramps up.
  2. GuideLLM undercounts the tokens generated toward the end of the test duration because cancelled requests are not counted (see Record partially completed request metrics #77).

I propose that we add an additional performance metric; we could call it something along the lines of "peak token generation throughput", hopefully in fewer words.

I have a couple of ideas on how this could be calculated:

  1. Calculate token throughput in a sliding window of 30 s or 1 minute and take the maximum, counting decode time across all in-flight requests in that window.
  2. Calculate token throughput over the interval from the time the first completed request finished to the time the last completed request started. This doesn't perfectly eliminate the under-utilization at the beginning and end of the test, but it gets pretty close. (A rough sketch of both calculations follows below.)
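
A minimal sketch of both ideas, assuming we have per-token completion timestamps and per-request start/end/token-count records. The function names and data shapes here are hypothetical, not GuideLLM APIs:

```python
from bisect import bisect_left


def peak_sliding_window_throughput(token_timestamps, window_s=30.0):
    """Idea 1: max tokens/sec observed in any `window_s`-second window.

    `token_timestamps` is a list of wall-clock times (seconds) at which
    individual output tokens were produced, across all requests.
    """
    if not token_timestamps:
        return 0.0
    times = sorted(token_timestamps)
    peak = 0.0
    for start in times:
        # Count tokens whose timestamp falls in [start, start + window_s);
        # windows anchored at token times are enough to find the maximum.
        count = bisect_left(times, start + window_s) - bisect_left(times, start)
        peak = max(peak, count / window_s)
    return peak


def steady_state_throughput(completed_requests):
    """Idea 2 (rough): total output tokens divided by the span from the
    first completed request's finish to the last completed request's start.

    `completed_requests` is a list of (start_time, end_time, output_tokens)
    tuples; tokens produced outside the span are not excluded here, so this
    is only an approximation.
    """
    first_finish = min(end for _, end, _ in completed_requests)
    last_start = max(start for start, _, _ in completed_requests)
    span = last_start - first_finish
    if span <= 0:
        return 0.0
    return sum(tokens for _, _, tokens in completed_requests) / span
```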

WDYT?

@dagrayvid added the enhancement (New feature or request) and metrics (Metrics workstream) labels on Mar 6, 2025
@markurtz
Member

#96 now enables us to grab the metrics for any responses that errored (though it assumes iterations == tokens, since we don't have usage stats from the server in this case).

Since I need to rework the report generation, I'll include these fixes within that work.
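
For illustration only (not the actual GuideLLM code), the iterations == tokens fallback mentioned above might look something like this; the parameter names are hypothetical:

```python
def estimate_output_tokens(usage_completion_tokens, stream_iterations):
    """Prefer the server-reported usage stats; for errored or cancelled
    responses with no usage stats, assume one output token per streamed
    iteration (the iterations == tokens assumption)."""
    if usage_completion_tokens is not None:
        return usage_completion_tokens
    return stream_iterations
```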

markurtz added a commit that referenced this issue Apr 11, 2025
…ation Refactor (#96)

Full refactor of GuideLLM that improves overall performance and keeps
benchmarking overhead minimal via a new multiprocess and threaded
scheduler, along with significant updates to the output formats for
better analysis, visibility, and clarity.

<img width="668" alt="Screenshot 2025-04-11 at 2 26 13 PM"
src="https://github.com/user-attachments/assets/a723854a-7fe0-4eb2-9408-f632e747c3c2"
/>

Fixes:
- #92 
- #77 
- #47 
- #79

---------

Co-authored-by: Alexandre Marques <[email protected]>
Co-authored-by: Samuel Monson <[email protected]>
Co-authored-by: David Gray <[email protected]>
@markurtz
Member

Closing out now that #96 is on main.
