Calculate "peak token throughput" number that is closer to vLLM generation throughput #92

Closed
dagrayvid opened this issue Mar 6, 2025 · 2 comments
Labels
enhancement New feature or request metrics Metrics workstream

Comments

@dagrayvid
Collaborator

dagrayvid commented Mar 6, 2025

When running a GuideLLM sweep, the token throughput numbers reported for each RPS test are lower than a user would expect given vLLM's own generation throughput metric.

I think there are two main reasons for this.

  1. The server is under-utilized at the beginning of the run while the number of requests in flight ramps up.
  2. GuideLLM undercounts the tokens generated toward the end of the test duration because cancelled requests are not counted (see Record partially completed request metrics #77).

I propose that we add an additional performance metric; we could call it something along the lines of "peak token generation throughput", hopefully in fewer words.

I have a couple of ideas on how this could be calculated:

  1. Calculate token throughput in a sliding window of 30 s or 1 minute and take the maximum, counting decode time across all in-flight requests in that window.
  2. Calculate token throughput over the interval from the time the first completed request finished to the time the last completed request started. This doesn't perfectly eliminate the under-utilization at the beginning and end of the test, but it gets pretty close. (A rough sketch of both calculations follows below.)
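
A minimal sketch of both ideas, assuming we have per-token completion timestamps and per-request start/end/token-count records. The function names and data shapes here are hypothetical, not GuideLLM APIs:

```python
from bisect import bisect_left


def peak_sliding_window_throughput(token_timestamps, window_s=30.0):
    """Idea 1: max tokens/sec observed in any `window_s`-second window.

    `token_timestamps` is a list of wall-clock times (seconds) at which
    individual output tokens were produced, across all requests.
    """
    if not token_timestamps:
        return 0.0
    times = sorted(token_timestamps)
    peak = 0.0
    for start in times:
        # Count tokens whose timestamp falls in [start, start + window_s);
        # windows anchored at token times are enough to find the maximum.
        count = bisect_left(times, start + window_s) - bisect_left(times, start)
        peak = max(peak, count / window_s)
    return peak


def steady_state_throughput(completed_requests):
    """Idea 2 (rough): total output tokens divided by the span from the
    first completed request's finish to the last completed request's start.

    `completed_requests` is a list of (start_time, end_time, output_tokens)
    tuples; tokens produced outside the span are not excluded here, so this
    is only an approximation.
    """
    first_finish = min(end for _, end, _ in completed_requests)
    last_start = max(start for start, _, _ in completed_requests)
    span = last_start - first_finish
    if span <= 0:
        return 0.0
    return sum(tokens for _, _, tokens in completed_requests) / span
```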

WDYT?

@dagrayvid added the enhancement (New feature or request) and metrics (Metrics workstream) labels on Mar 6, 2025
@markurtz
Member

#96 now enables us to grab the metrics for any responses that errored (though it assumes iterations == tokens, since we don't have usage stats from the server in this case).

Since I need to rework the report generation, I'll include these fixes within that work.
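
For illustration only (not the actual GuideLLM code), the iterations == tokens fallback mentioned above might look something like this; the parameter names are hypothetical:

```python
def estimate_output_tokens(usage_completion_tokens, stream_iterations):
    """Prefer the server-reported usage stats; for errored or cancelled
    responses with no usage stats, assume one output token per streamed
    iteration (the iterations == tokens assumption)."""
    if usage_completion_tokens is not None:
        return usage_completion_tokens
    return stream_iterations
```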

markurtz added a commit that referenced this issue Apr 11, 2025
…ation Refactor (#96)

Full refactor of GuideLLM that improves overall performance and keeps
benchmarking overhead minimal via a new multiprocess and threaded
scheduler, along with significant updates to the output formats for
better analysis, visibility, and clarity.

<img width="668" alt="Screenshot 2025-04-11 at 2 26 13 PM"
src="https://github.com/user-attachments/assets/a723854a-7fe0-4eb2-9408-f632e747c3c2"
/>

Fixes:
- #92 
- #77 
- #47 
- #79

---------

Co-authored-by: Alexandre Marques <[email protected]>
Co-authored-by: Samuel Monson <[email protected]>
Co-authored-by: David Gray <[email protected]>
@markurtz
Member

Closing out now that #96 is on main.
