Conversation

@NickLucche
Collaborator

@NickLucche NickLucche commented Sep 22, 2025

This PR exposes the telemetry that NIXL provides starting with v0.6.0, so that it can be logged to the CLI.

Telemetry can be enabled in NIXL by setting:

NIXL_TELEMETRY_ENABLE=1/y/yes/on
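
As a rough sketch of how such a toggle could be read on the Python side: the helper name `nixl_telemetry_enabled`, the whitespace stripping, and the case-insensitive matching are assumptions made for illustration, not something taken from this PR or from NIXL itself.

```python
import os

# Values listed above as enabling telemetry.
_TRUTHY = {"1", "y", "yes", "on"}


def nixl_telemetry_enabled(env=None):
    """Return True if NIXL_TELEMETRY_ENABLE holds an enabling value.

    `env` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if env is None else env
    value = env.get("NIXL_TELEMETRY_ENABLE", "")
    # Assumption: surrounding whitespace and letter case are ignored.
    return value.strip().lower() in _TRUTHY
```

Taking the environment as a parameter keeps the check free of global state, which is convenient if the env var handling is later replaced by a config option.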

Output example in vLLM:

(APIServer pid=205253) INFO 09-22 13:28:53 [metrics.py:98] KV Transfer metrics: Num successful transfers=1, Avg xfer time (ms)=9.862, P90 xfer time (ms)=9.862, Avg post time (ms)=5.693, P90 post time (ms)=5.693, Avg MB per transfer=70.0, Throughput (MB/s)=7097.952, Avg number of descriptors=56.0

My plan is to match this behavior for the current release, and then switch to a default "telemetry-on" mode starting with the next one (@markmc looking for feedback on this), so that a single on/off logging toggle can be maintained throughout vLLM.
The upcoming NIXL release will also allow enabling telemetry through config (thanks @mkhazraee), so I propose we drop the env var handling altogether with the next upgrade.

Current metrics being tracked:

  • num_successful_transfers
  • transfer_duration
  • post_duration
  • bytes_transferred
  • num_descriptors

I would appreciate feedback on any other metrics you think are worth tracking (cc @robertgshaw2-redhat @tlrmchlsmth and other llm-d power-users), as well as on any other derived ones (see reduce()).
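
To make the raw-versus-derived split concrete, here is a minimal sketch of how the raw metrics listed above could be reduced into the summary line shown in the output example. The class name `TransferStats`, the `record()` helper, the use of seconds for durations, the nearest-rank P90, and treating 1 MB as 2**20 bytes are all assumptions for illustration; the actual implementation in this PR may differ.

```python
import math
from dataclasses import dataclass, field

MB = 2**20  # assumption: "MB" in the log line means 2**20 bytes

SUMMARY_KEYS = (
    "Num successful transfers",
    "Avg xfer time (ms)",
    "P90 xfer time (ms)",
    "Avg post time (ms)",
    "P90 post time (ms)",
    "Avg MB per transfer",
    "Throughput (MB/s)",
    "Avg number of descriptors",
)


def p90(values):
    """Nearest-rank 90th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(0.9 * len(ordered)) - 1]


@dataclass
class TransferStats:
    """Hypothetical container: one entry per successful transfer."""
    transfer_duration: list = field(default_factory=list)  # seconds
    post_duration: list = field(default_factory=list)      # seconds
    bytes_transferred: list = field(default_factory=list)
    num_descriptors: list = field(default_factory=list)

    def record(self, xfer_s, post_s, nbytes, ndesc):
        self.transfer_duration.append(xfer_s)
        self.post_duration.append(post_s)
        self.bytes_transferred.append(nbytes)
        self.num_descriptors.append(ndesc)

    def reduce(self):
        n = len(self.transfer_duration)
        if n == 0:
            # Nothing recorded: report zeros instead of dividing by zero.
            return dict.fromkeys(SUMMARY_KEYS, 0)
        total_mb = sum(self.bytes_transferred) / MB
        total_s = sum(self.transfer_duration)
        return {
            "Num successful transfers": n,
            "Avg xfer time (ms)": total_s / n * 1e3,
            "P90 xfer time (ms)": p90(self.transfer_duration) * 1e3,
            "Avg post time (ms)": sum(self.post_duration) / n * 1e3,
            "P90 post time (ms)": p90(self.post_duration) * 1e3,
            "Avg MB per transfer": total_mb / n,
            # Guard against a zero total duration as well.
            "Throughput (MB/s)": total_mb / total_s if total_s > 0 else 0.0,
            "Avg number of descriptors": sum(self.num_descriptors) / n,
        }
```

Under these assumptions, recording a single 70 MB transfer that took 9.862 ms gives a throughput of roughly 7098 MB/s, which is consistent with the log line above (70 / 0.009862).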

In particular regarding failures, I will sync with @njhill on upcoming PRs to handle per-request/per-block failures.
At the moment, from the NIXL interface perspective, we simply crash on any transfer state that is recognized as a failure:

    elif xfer_state == "PROC":
        in_progress = True
        continue
    else:
        raise RuntimeError("Transfer failed with state %s",
                           xfer_state)
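
For reference, the polling pattern can be sketched in isolation as follows. `check_state` is a hypothetical callable standing in for the NIXL transfer-state query, and the "DONE" state string is an assumption (only "PROC" appears in the snippet above). The sketch builds the error message with an f-string, because Python exception constructors do not apply %-style formatting to their arguments the way logging calls do.

```python
def poll_transfers(handles, check_state):
    """Split handles into (done, in_progress); raise on any other state.

    `check_state` is a hypothetical callable: handle -> state string.
    """
    done, in_progress = [], []
    for handle in handles:
        state = check_state(handle)
        if state == "DONE":  # assumption: name of the completed state
            done.append(handle)
        elif state == "PROC":
            in_progress.append(handle)
        else:
            # Interpolate explicitly so the state shows up in the message.
            raise RuntimeError(f"Transfer failed with state {state}")
    return done, in_progress
```

Per-request or per-block failure handling would replace the `raise` with bookkeeping for the failed handle, which is where a failure counter could be added to the tracked metrics.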

Prometheus

The small set of metrics currently exposed is CLI-only; I will follow up with a separate PR to expose them to Prometheus.
The current design should nonetheless be compatible with Prometheus as is, since every stat is represented by a "Histogram" object (as per @markmc's guidance).

Everything computed inside def reduce() is a summary stat for the CLI that I do not expect to need in Prometheus. All other "raw" metrics collected should simply be ingested and made available for custom aggregation by the logging engine.
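
To illustrate why raw per-transfer observations map naturally onto a histogram, here is a stdlib-only sketch of Prometheus-style cumulative buckets. It does not use the `prometheus_client` library, and the class name `BucketHistogram` and the bucket bounds are made up for the example.

```python
import bisect


class BucketHistogram:
    """Stdlib-only illustration of cumulative ("le") histogram buckets."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot for the implicit +Inf bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.num_observations = 0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value,
        # i.e. the smallest bucket whose upper bound includes the value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.counts[idx] += 1
        self.total += value
        self.num_observations += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending at +Inf."""
        pairs, running = [], 0
        bounds = self.upper_bounds + [float("inf")]
        for bound, count in zip(bounds, self.counts):
            running += count
            pairs.append((bound, running))
        return pairs
```

With the sum and count kept alongside the buckets, averages and approximate percentiles can be derived on the aggregation side, so summary stats such as those in reduce() do not need to be exported.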

@mergify

mergify bot commented Sep 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces telemetry for NIXL, exposing various metrics for logging to the CLI. The changes include updating the nixl dependency, adding logic to collect transfer statistics when telemetry is enabled via an environment variable, and implementing aggregation and reduction of these stats for display. The implementation is well-structured, but I've identified a potential ZeroDivisionError in the metrics reduction logic that could cause a crash. My feedback includes a suggestion to prevent this.

@mergify

mergify bot commented Sep 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 23, 2025
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
@NickLucche
Collaborator Author

Addressed the requested changes here, @markmc, thanks!

@mergify mergify bot removed the needs-rebase label Sep 30, 2025
"P90 post time (ms)": 0,
"Avg MB per transfer": 0,
"Throughput (MB/s)": 0,
"Avg number of descriptors": 0,
Member

This string template thingy is repeated twice in the same function. Very minor nit

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 3, 2025 08:40
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 3, 2025
@DarkLight1337 DarkLight1337 merged commit 48f3090 into vllm-project:main Oct 3, 2025
50 checks passed
Labels

ci/build kv-connector ready ONLY add when PR is ready to merge/full CI is needed v1

4 participants