Feel free to dive in, but I wanted to call out that there will be some restructuring along these lines for the output standardization, in case there is duplicate work. I will share something out a bit later.
Let's also make sure that the output JSON stores all of the metadata we may want to know (a sketch of what this could look like follows the list):
- Model_name
- Quantized (None, INT4, INT8, FP8)
- Hardware
- Inference Scenario
- vllm version
- vllm-config file (we need the file itself)
- GuideLLM results:
  - Tokens per Second
  - Time to First Token (TTFT)
  - Inter-token Latency (ITL)
  - End-to-End Request Latency (e2e_latency)
  - Requests Per Second (RPS) Profiles/Sweeps
  - Cost to generate a million output tokens (Internal) - Future
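A minimal sketch of what such a metadata block could look like; the field names, nesting under a `guidellm_results` key, and all values are placeholders for illustration, not a final schema:

```json
{
  "model_name": "meta-llama/Llama-3.1-8B-Instruct",
  "quantized": "FP8",
  "hardware": "1xA100-80GB",
  "inference_scenario": "chat",
  "vllm_version": "0.6.3",
  "vllm_config_file": "configs/vllm-config.yaml",
  "guidellm_results": {
    "tokens_per_second": 1250.4,
    "ttft_ms": 182.5,
    "itl_ms": 11.3,
    "e2e_latency_ms": 2140.8,
    "rps_profiles": ["synchronous", "throughput", "sweep"]
  }
}
```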
There are a lot of minor improvements that can be made to the JSON output of guidellm (a combined sketch of the cleaned-up shape follows). For example:

- `"decode_times": { "data": [] }` -> `"decode_times": []`
- `"request_latency_percentiles": [ 1, 2, ... ]` -> `"request_latency_percentiles": { "p01": 1, "p05": 5, ... }`