Max RPS in sweep mode could be higher #93
Thanks @dagrayvid! I think this is definitely an issue for the constant or Poisson async pathways. Throughput, though affected by this, I think would be less affected, primarily because it's not waiting on one request to complete -- as many requests as possible (under the concurrency limit) are sent as fast as possible, so the server should reach saturation very quickly. Having run some tests recently, the throughput scenario has the first bulk of requests (up to the concurrency limit) all finishing around the same time, at least with the latest multi-process scheduler, so I don't think throwing away the first request will shift things much.

Along these lines, though, I was thinking about this for the non-throughput async pathways and how they take some time to converge to the desired request rate due to slow start times. To fix both, I think it would make sense to add "warmup" and "cooldown" inputs, or something along those lines, which let the user specify throwing away the first N and last M requests. That way it's explicit exactly what's happening, and it can be tuned per engine, server, and scenario.
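For illustration, a minimal sketch of that kind of first-N / last-M trimming, assuming a plain list of completed requests; the function and parameter names here are hypothetical, not guidellm's actual API:

```python
# Minimal sketch of first-N / last-M trimming over a list of completed
# requests. Names are illustrative, not guidellm's API.
def trim_requests(completed_requests, warmup_count=0, cooldown_count=0):
    """Drop the first `warmup_count` and last `cooldown_count` requests."""
    end = len(completed_requests) - cooldown_count
    return completed_requests[warmup_count:end] if end > warmup_count else []
```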
Are you referring to tests where the input and output sequence lengths are constant? This idea originally came after I analyzed results from a run where the dataset was normally distributed around 512 input tokens and 2048 output tokens. In that case there was about 1 minute before any requests completed, but after that requests were completed continuously. So the RPS for the first minute is ~0, and for a 5-minute run that first minute really drags down the total RPS. As we increase the test duration, the total RPS would approach the "max" throughput I am describing in the issue. I am assuming here that if we continued to run this test for longer, requests would keep completing at approximately the rate we see during the last 3 minutes of this 5-minute test.
Ah, yep, thanks @dagrayvid, that makes more sense with the dynamic lengths. I was just looking at a single length, which resulted in the behavior I mentioned. With #96 landing, we now have the ability to set warmup and cooldown periods, either by number of requests or by duration, which will ignore any requests within that range for the final results. Does this work for what you'd like to enable here?
@markurtz AFAIU, I don't think the warmup period has the desired effect, at least in sweep mode. It seems like the upper-bound RPS value for the sweep, which is determined during the throughput test, is calculated the same way whether or not a warmup period is specified. For the desired effect, we would need to calculate RPS based on the number of requests completed after the warmup, even if they were started before the warmup. I have only skimmed the relevant code, so I might be missing something here.
@dagrayvid I haven't thoroughly tested it, but the underlying request set that the throughput RPS is calculated from should be different. Anything under the warmup and cooldown periods is discarded from the aggregator and not added to the collected requests, specifically here: https://github.com/neuralmagic/guidellm/blob/main/src/guidellm/benchmark/aggregator.py#L411

The RPS calculation isn't done until after the benchmark has been compiled, so it won't include the requests in warmup/cooldown, provided that logic is discarding them properly: https://github.com/neuralmagic/guidellm/blob/main/src/guidellm/benchmark/benchmarker.py#L242

Let me know if you're seeing something different here, though. A quick way to check would be whether the scheduler stats show a different number of requests than the stored requests, and whether that number matches how many should have been discarded.
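For reference, a simplified sketch of the kind of start-time filtering being described; this is not the actual aggregator code, and the argument names are assumed:

```python
# Simplified illustration of warmup/cooldown filtering by request start time.
# Not the actual guidellm aggregator code; names are assumed.
def keep_for_results(start_time, run_start, run_end, warmup_s, cooldown_s):
    """Return True if a request's start time falls outside warmup/cooldown."""
    in_warmup = start_time < run_start + warmup_s
    in_cooldown = start_time > run_end - cooldown_s
    return not (in_warmup or in_cooldown)
```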
@dagrayvid can you confirm whether, after specifying the warmup and cooldown periods, the measured RPS during the throughput scenario approaches the true peak sustained RPS?
Following up on this after the conversation we had in the GuideLLM call. Because the warmup filters out requests by start time, I don't think it accomplishes what we need here. The prefill time at the beginning of throughput mode (which is sometimes exaggerated by preemption and the large number of requests hitting the server simultaneously), plus the tail at the end from incomplete requests that are not included in the RPS calculation, mean the average RPS often underestimates the RPS the server could actually sustain.

I think for the sweep, the peak RPS should be something like the peak request-completion rate sustained over some period of time during the test, perhaps calculated with a sliding window. The right window duration will depend on the sequence lengths in the dataset and the speed of the SUT.
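As a rough sketch of the sliding-window idea, assuming only a list of completion timestamps and a user-chosen window length (both names are illustrative):

```python
# Rough sketch: peak request-completion rate sustained over a sliding window.
# `end_times` is assumed to be a list of completion timestamps in seconds;
# the window length is a free parameter that depends on sequence lengths.
from bisect import bisect_left

def peak_windowed_rps(end_times, window_s=30.0):
    end_times = sorted(end_times)
    best = 0.0
    for i, t in enumerate(end_times):
        # Count completions that fall inside the window [t - window_s, t].
        first = bisect_left(end_times, t - window_s)
        best = max(best, (i - first + 1) / window_s)
    return best
```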
Currently, in sweep mode, we perform N tests that are linearly spaced between the RPS values measured during a sequential test and a throughput test. However, in certain cases (e.g., long output sequences of >1500 tokens or very short test durations), throughput mode may underestimate the maximum throughput the server can handle.
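For context, a minimal sketch of that linear spacing, assuming the measured sequential and throughput RPS values as the bounds (not guidellm internals):

```python
# Sketch of linearly spacing sweep rates between the measured sequential RPS
# (lower bound) and throughput RPS (upper bound), for n_benchmarks >= 2.
def sweep_rates(sequential_rps, throughput_rps, n_benchmarks):
    step = (throughput_rps - sequential_rps) / (n_benchmarks - 1)
    return [sequential_rps + i * step for i in range(n_benchmarks)]
```

If the throughput test underestimates the upper bound, every rate in this list is pulled down with it, which is why the rest of this issue focuses on how that upper bound is measured.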
In throughput mode, if we think of the model server as a black box that receives an overload of requests and, after an initial delay (finishing the first request), completes requests at some ~constant rate, the goal is to determine the server’s output rate. Currently, we calculate the average RPS over the entire test duration. As the test duration increases, this average RPS will approach the server's maximum sustained output rate. However, for shorter test durations or long output sequences, the initial delay (before the server reaches its steady state) significantly impacts the total test time. This leads to an average RPS that is much lower than the true maximum rate the server can achieve once the "pipeline is full."
I propose that we calculate the upper bound for the RPS sweep using the time window from when the first request completes to when the last request completes. Specifically, the RPS should be calculated as:
RPS = (len(requests) - 1) / (max(requests["end_time"]) - min(requests["end_time"]))
I think this approach would more accurately reflect the server's true maximum throughput in terms of RPS. What do you think?
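As a concrete sketch of the proposed calculation, assuming `requests` is a list of dicts with an `end_time` timestamp in seconds:

```python
# Sketch of the proposed upper-bound RPS: completions per unit time measured
# from the first completion to the last completion, rather than over the
# whole test duration.
def proposed_max_rps(requests):
    end_times = [r["end_time"] for r in requests]
    span = max(end_times) - min(end_times)
    return (len(requests) - 1) / span if span > 0 else float("nan")
```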