When doing a GuideLLM sweep, the token throughput numbers for each RPS test are lower than a user would expect given the vLLM generation throughput metric.
I think there are two main reasons for this:
- The server is under-utilized at the beginning of the run while the number of requests in flight is still ramping up.
- The server is similarly under-utilized at the end of the run while the remaining in-flight requests drain.
I propose that we add an additional performance metric; we could call it something along the lines of "peak token generation throughput", ideally in fewer words.
I have a couple of ideas on how this could be calculated (see the sketch after this list):
1. Compute token throughput over a sliding window of 30 s or 1 min and report the maximum, counting decode time across all in-flight requests within that window.
2. Compute token throughput over the interval from the time the first completed request finished to the time the last completed request started. This doesn't perfectly eliminate the under-utilization at the beginning and end of the test, but it gets pretty close.
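As a rough illustration of both ideas, here is a minimal Python sketch. It assumes each completed request records its start time, end time, and output token count, and that tokens are generated uniformly over the request's lifetime; the `CompletedRequest` type and the function names are hypothetical, not GuideLLM's actual internals.

```python
from dataclasses import dataclass

@dataclass
class CompletedRequest:
    start: float         # request start time (seconds)
    end: float           # request end time (seconds)
    output_tokens: int   # tokens generated for this request

def tokens_in_interval(req: CompletedRequest, lo: float, hi: float) -> float:
    """Tokens attributed to [lo, hi], assuming generation is uniform over [start, end]."""
    overlap = max(0.0, min(req.end, hi) - max(req.start, lo))
    duration = req.end - req.start
    return req.output_tokens * (overlap / duration) if duration > 0 else 0.0

def peak_window_throughput(reqs, window=30.0, step=1.0) -> float:
    """Idea 1: maximum tokens/sec observed in any sliding window."""
    t0 = min(r.start for r in reqs)
    t1 = max(r.end for r in reqs)
    window = min(window, t1 - t0)  # clamp for very short runs
    best, lo = 0.0, t0
    while lo + window <= t1:
        tokens = sum(tokens_in_interval(r, lo, lo + window) for r in reqs)
        best = max(best, tokens / window)
        lo += step
    return best

def steady_state_throughput(reqs) -> float:
    """Idea 2: throughput from the first completion to the last request start."""
    lo = min(r.end for r in reqs)    # time the first completed request finished
    hi = max(r.start for r in reqs)  # time the last completed request started
    if hi <= lo:
        return 0.0
    tokens = sum(tokens_in_interval(r, lo, hi) for r in reqs)
    return tokens / (hi - lo)
```

The uniform apportionment in `tokens_in_interval` is a simplification; with per-token timestamps the window counts could be exact.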
WDYT?
#96 now enables us to grab the metrics for any responses that errored (though it assumes iterations == tokens, since we don't have usage stats from the server in that case).
Since I need to rework the report generation, I'll include these fixes within that rework.
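A minimal sketch of the fallback described above, assuming usage stats arrive as a dict when the server returns them; the function name and field name are illustrative, not GuideLLM's actual API:

```python
def output_token_count(usage: dict | None, iterations: int) -> int:
    """Prefer server-reported usage stats; fall back to the iteration count."""
    if usage and "completion_tokens" in usage:
        return usage["completion_tokens"]
    # Errored responses carry no usage stats from the server, so assume
    # one output token per streamed iteration (the iterations == tokens
    # assumption mentioned above).
    return iterations
```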
…ation Refactor (#96)
Full refactor of GuideLLM for better overall performance and minimal
benchmarking overhead, built on a new multiprocess, threaded scheduler,
along with significant updates to the output formats for better
analysis, visibility, and clarity.
Fixes:
- #92
- #77
- #47
- #79
---------
Co-authored-by: Alexandre Marques <[email protected]>
Co-authored-by: Samuel Monson <[email protected]>
Co-authored-by: David Gray <[email protected]>