-
-
Notifications
You must be signed in to change notification settings - Fork 11.9k
[NIXL][Misc] Expose metrics from NIXL for logging to CLI #25388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces telemetry for NIXL, exposing various metrics for logging to the CLI. The changes include updating the nixl dependency, adding logic to collect transfer statistics when telemetry is enabled via an environment variable, and implementing aggregation and reduction of these stats for display. The implementation is well-structured, but I've identified a potential ZeroDivisionError in the metrics reduction logic that could cause a crash. My feedback includes a suggestion to prevent this.
8d4dd5c to
8f3e9d3
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
0321679 to
0e0661b
Compare
|
Addressed changes here @markmc , thanks! |
| "P90 post time (ms)": 0, | ||
| "Avg MB per transfer": 0, | ||
| "Throughput (MB/s)": 0, | ||
| "Avg number of descriptors": 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This string template thingy is repeated twice in the same function. Very minor nit
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]> Signed-off-by: Rahul Tuli <[email protected]>
Signed-off-by: NickLucche <[email protected]> Signed-off-by: yewentao256 <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]> Signed-off-by: Tomer Asida <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]> Signed-off-by: Karan Goel <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]>
…t#25388) Signed-off-by: NickLucche <[email protected]>
This PR exposes the telemetry which is available in NIXL starting with v0.6.0 for logging to CLI.
Telemetry can be enabled in NIXL by setting:
Output example in vLLM:
My plan is to match this behavior for the current release, and then switch to a default "telemetry-on" mode starting with the next one (@markmc looking for feedback on this), so that a single on/off logging toggle can be maintained throughout vLLM.
The upcoming NIXL release will also allow setting telemetry through config (thanks @mkhazraee ) so I would propose we get rid of the env var handling altogether with the next upgrade.
Current metrics being tracked:
I would appreciate feedback here on any other metrics you see fit (cc @robertgshaw2-redhat @tlrmchlsmth and other llm-d power-users) or any other derived ones (see
reduce()) .In particular regarding failures, I will sync with @njhill on upcoming PRs to handle per-request/per-block failures.
At this moment, from the nixl interface perspective, we just crash on recognized failed transfers
vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
Lines 1119 to 1124 in 3d2c56b
Prometheus
The small set of current metrics exposed is for cli-only, but will follow-up with a separate PR to expose them to Prometheus.
Current design though should be very much compatible with Prometheus as is, with all stats being represented by an "Histogram" object (as per @markmc guidance).
Anything inside
def reduce()are summary stats for cli that I do not expect to need in Prometheus. All other "raw" metrics collected should just be ingested and be available for custom aggregation by the logging engine.