Max RPS in sweep mode could be higher #93
Thanks @dagrayvid! I think this is definitely an issue for the constant or Poisson async pathways. Throughput, though affected by this, I think would be less affected, primarily because it's not waiting on one request to complete -- as many requests as possible (under the concurrency limit) are sent as fast as possible, so the server should reach saturation very quickly. Having run some tests recently, the throughput scenario has the first bulk of requests (up to the concurrency limit) all finishing around the same time, at least with the latest multi-process scheduler, so I don't think throwing away the first request will shift things much.

Along these lines, though, I was thinking about this for the non-throughput async pathways and how they take some time to converge to the desired request rate due to slow start times. To fix both, I think it would make sense to add "warmup" and "cooldown" inputs, or something along those lines, which let the user specify throwing away the first N and last M requests. That way it's explicit exactly what's happening, and it can be tuned per engine, server, and scenario.
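For illustration, a minimal sketch of that kind of first-N / last-M trimming, assuming a plain list of completed requests; the function and parameter names here are hypothetical, not guidellm's actual API:

```python
# Minimal sketch of first-N / last-M trimming over a list of completed
# requests. Names are illustrative, not guidellm's API.
def trim_requests(completed_requests, warmup_count=0, cooldown_count=0):
    """Drop the first `warmup_count` and last `cooldown_count` requests."""
    end = len(completed_requests) - cooldown_count
    return completed_requests[warmup_count:end] if end > warmup_count else []
```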
Are you referring to tests where the input and output sequence lengths are constant? This idea originally came after I analyzed results from a run where the dataset was normally distributed around 512 input tokens and 2048 output tokens. In that case there was about 1 minute before any requests completed, but after that requests were completed continuously. So the RPS for the first minute is ~0, and for a 5-minute run that first minute really drags down the total RPS. As we increase the test duration, the total RPS would approach the "max" throughput I am describing in the issue. I am assuming here that if we continued to run this test for longer, requests would keep completing at approximately the rate we see during the last 3 minutes of this 5-minute test.
Ah, yep, thanks @dagrayvid, that makes more sense with the dynamic lengths. I was just looking at a single length, which resulted in the behavior I mentioned. With #96 landing, we now have the ability to set warmup and cooldown periods, either by number of requests or by duration, which will ignore any requests within that range for the final results. Does this work for what you'd like to enable here?
@markurtz AFAIU, I don't think the warmup period has the desired effect, at least in sweep mode. It seems like the upper-bound RPS value for the sweep, which is determined during the throughput test, is calculated the same way whether or not a warmup period is specified. For the desired effect, we would need to calculate RPS based on the number of requests completed after the warmup, even if they were started before the warmup. I have only skimmed the relevant code, so I might be missing something here.
@dagrayvid I haven't thoroughly tested it, but the underlying request set that the throughput RPS is calculated from should be different. Anything under the warmup and cooldown periods is discarded from the aggregator and not added to the collected requests, specifically here: https://github.com/neuralmagic/guidellm/blob/main/src/guidellm/benchmark/aggregator.py#L411

The RPS calculation isn't done until after the benchmark has been compiled, so it won't include the requests in warmup/cooldown, provided that logic is discarding them properly: https://github.com/neuralmagic/guidellm/blob/main/src/guidellm/benchmark/benchmarker.py#L242

Let me know if you're seeing something different here, though. A quick way to check would be whether the scheduler stats show a different number of requests than the stored requests, and whether that number matches how many should have been discarded.
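For reference, a simplified sketch of the kind of start-time filtering being described; this is not the actual aggregator code, and the argument names are assumed:

```python
# Simplified illustration of warmup/cooldown filtering by request start time.
# Not the actual guidellm aggregator code; names are assumed.
def keep_for_results(start_time, run_start, run_end, warmup_s, cooldown_s):
    """Return True if a request's start time falls outside warmup/cooldown."""
    in_warmup = start_time < run_start + warmup_s
    in_cooldown = start_time > run_end - cooldown_s
    return not (in_warmup or in_cooldown)
```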
@dagrayvid can you confirm whether, after specifying the warmup and cooldown periods, the measured RPS during the throughput scenario approaches the true peak sustained RPS?
Following up on this after the conversation we had in the GuideLLM call. Because the warmup filters out requests by start time, I don't think it accomplishes what we need here. The prefill time at the beginning of throughput mode (which is sometimes exaggerated by preemption and the large number of requests hitting the server simultaneously), plus the tail at the end from incomplete requests that are not included in the RPS calculation, mean the average RPS often underestimates the RPS the server could actually sustain.

I think for the sweep, the peak RPS should be something like the peak request-completion rate sustained over some period of time during the test, perhaps calculated with a sliding window. The right window duration will depend on the sequence lengths in the dataset and the speed of the SUT.
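As a rough sketch of the sliding-window idea, assuming only a list of completion timestamps and a user-chosen window length (both names are illustrative):

```python
# Rough sketch: peak request-completion rate sustained over a sliding window.
# `end_times` is assumed to be a list of completion timestamps in seconds;
# the window length is a free parameter that depends on sequence lengths.
from bisect import bisect_left

def peak_windowed_rps(end_times, window_s=30.0):
    end_times = sorted(end_times)
    best = 0.0
    for i, t in enumerate(end_times):
        # Count completions that fall inside the window [t - window_s, t].
        first = bisect_left(end_times, t - window_s)
        best = max(best, (i - first + 1) / window_s)
    return best
```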
Currently, in sweep mode, we perform N tests that are linearly spaced between the RPS values measured during a sequential test and a throughput test. However, in certain cases (e.g., long output sequences of >1500 tokens or very short test durations), throughput mode may underestimate the maximum throughput the server can handle.
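For context, a minimal sketch of that linear spacing, assuming the measured sequential and throughput RPS values as the bounds (not guidellm internals):

```python
# Sketch of linearly spacing sweep rates between the measured sequential RPS
# (lower bound) and throughput RPS (upper bound), for n_benchmarks >= 2.
def sweep_rates(sequential_rps, throughput_rps, n_benchmarks):
    step = (throughput_rps - sequential_rps) / (n_benchmarks - 1)
    return [sequential_rps + i * step for i in range(n_benchmarks)]
```

If the throughput test underestimates the upper bound, every rate in this list is pulled down with it, which is why the rest of this issue focuses on how that upper bound is measured.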
In throughput mode, if we think of the model server as a black box that receives an overload of requests and, after an initial delay (finishing the first request), completes requests at some ~constant rate, the goal is to determine the server’s output rate. Currently, we calculate the average RPS over the entire test duration. As the test duration increases, this average RPS will approach the server's maximum sustained output rate. However, for shorter test durations or long output sequences, the initial delay (before the server reaches its steady state) significantly impacts the total test time. This leads to an average RPS that is much lower than the true maximum rate the server can achieve once the "pipeline is full."
I propose that we calculate the upper bound for the RPS sweep using the time window from when the first request completes to when the last request completes. Specifically, the RPS should be calculated as:
RPS = (len(requests) - 1) / (max(requests["end_time"]) - min(requests["end_time"]))
I think this approach would more accurately reflect the server's true maximum throughput in terms of RPS. What do you think?
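As a concrete sketch of the proposed calculation, assuming `requests` is a list of dicts with an `end_time` timestamp in seconds:

```python
# Sketch of the proposed upper-bound RPS: completions per unit time measured
# from the first completion to the last completion, rather than over the
# whole test duration.
def proposed_max_rps(requests):
    end_times = [r["end_time"] for r in requests]
    span = max(end_times) - min(end_times)
    return (len(requests) - 1) / span if span > 0 else float("nan")
```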