[NIXL][Misc] Expose metrics from NIXL for logging to CLI #25388

NickLucche · 2025-09-22T14:39:07Z

This PR exposes the telemetry available in NIXL since v0.6.0 so that it can be logged to the CLI.

Telemetry can be enabled in NIXL by setting:

NIXL_TELEMETRY_ENABLE=1/y/yes/on
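To make the accepted values concrete, here is a minimal sketch of how a caller might check the variable. It is not NIXL's or vLLM's actual parsing: the helper name `nixl_telemetry_enabled` is hypothetical, and case-insensitive matching is an assumption.

```python
import os

# Truthy spellings listed in the PR description.
_TRUTHY = {"1", "y", "yes", "on"}


def nixl_telemetry_enabled(environ=None) -> bool:
    """Return True if NIXL_TELEMETRY_ENABLE holds one of the truthy values.

    Hypothetical helper; case-insensitive matching is an assumption.
    """
    env = os.environ if environ is None else environ
    value = env.get("NIXL_TELEMETRY_ENABLE", "")
    return value.strip().lower() in _TRUTHY
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise in isolation.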

Output example in vLLM:

(APIServer pid=205253) INFO 09-22 13:28:53 [metrics.py:98] KV Transfer metrics: Num successful transfers=1, Avg xfer time (ms)=9.862, P90 xfer time (ms)=9.862, Avg post time (ms)=5.693, P90 post time (ms)=5.693, Avg MB per transfer=70.0, Throughput (MB/s)=7097.952, Avg number of descriptors=56.0

My plan is to match NIXL's opt-in behavior for the current release, then switch to a default "telemetry-on" mode starting with the next one (@markmc, I'm looking for feedback on this), so that vLLM keeps a single on/off logging toggle throughout.
The upcoming NIXL release will also allow setting telemetry through config (thanks @mkhazraee), so I propose we drop the env var handling altogether with the next upgrade.

Current metrics being tracked:

num_successful_transfers
transfer_duration
post_duration
bytes_transferred
num_descriptors

I would appreciate feedback on any other metrics you see fit (cc @robertgshaw2-redhat @tlrmchlsmth and other llm-d power-users), or on any other derived ones (see reduce()).
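As a rough illustration of how the raw metrics could be turned into the summary line shown in the output example, here is a sketch of a reduce() step. It is not the PR's implementation: the class name `TransferStats`, the `record` method, durations in seconds, and 1 MB = 2**20 bytes are all assumptions. Only the metric names and the log labels come from this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def percentile(values: List[float], q: float) -> float:
    """Linearly interpolated percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


@dataclass
class TransferStats:
    # Raw per-transfer samples. Durations are assumed to be in seconds.
    transfer_duration: List[float] = field(default_factory=list)
    post_duration: List[float] = field(default_factory=list)
    bytes_transferred: List[int] = field(default_factory=list)
    num_descriptors: List[int] = field(default_factory=list)

    def record(self, xfer_s: float, post_s: float, nbytes: int, ndesc: int) -> None:
        self.transfer_duration.append(xfer_s)
        self.post_duration.append(post_s)
        self.bytes_transferred.append(nbytes)
        self.num_descriptors.append(ndesc)

    def reduce(self) -> Dict[str, float]:
        # The transfer count is derived from the samples, not tracked separately.
        n = len(self.transfer_duration)
        if n == 0:
            # Return zeros instead of dividing by zero when nothing was recorded.
            return {
                "Num successful transfers": 0,
                "Avg xfer time (ms)": 0,
                "P90 xfer time (ms)": 0,
                "Avg post time (ms)": 0,
                "P90 post time (ms)": 0,
                "Avg MB per transfer": 0,
                "Throughput (MB/s)": 0,
                "Avg number of descriptors": 0,
            }
        xfer_ms = [t * 1e3 for t in self.transfer_duration]
        post_ms = [t * 1e3 for t in self.post_duration]
        total_mb = sum(self.bytes_transferred) / 2**20  # assumed MB definition
        total_s = sum(self.transfer_duration)
        return {
            "Num successful transfers": n,
            "Avg xfer time (ms)": sum(xfer_ms) / n,
            "P90 xfer time (ms)": percentile(xfer_ms, 90),
            "Avg post time (ms)": sum(post_ms) / n,
            "P90 post time (ms)": percentile(post_ms, 90),
            "Avg MB per transfer": total_mb / n,
            "Throughput (MB/s)": total_mb / total_s if total_s > 0 else 0,
            "Avg number of descriptors": sum(self.num_descriptors) / n,
        }
```

Under these assumptions, a single 70 MB transfer taking 9.862 ms reduces to a throughput of about 7097.95 MB/s, which is consistent with the output example above.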

Regarding failures in particular, I will sync with @njhill on upcoming PRs to handle per-request/per-block failures.
At the moment, from the NIXL interface perspective, we simply crash on recognized failed transfers:

vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

Lines 1119 to 1124 in 3d2c56b

    
           elif xfer_state == "PROC": 
        
               in_progress = True 
        
               continue 
        
           else: 
        
               raise RuntimeError("Transfer failed with state %s", 
        
                                  xfer_state)

.
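One detail of the quoted snippet is worth noting: RuntimeError does not apply logging-style %s formatting, so the state ends up as a second element of the exception's args rather than inside the message. A minimal sketch of the same state check with the state interpolated into the message follows. The helper name `check_xfer_state` is hypothetical, and the "DONE" state name is an assumption, since only "PROC" appears in the snippet.

```python
def check_xfer_state(xfer_state: str) -> bool:
    """Return True when done, False while in progress, raise otherwise.

    Hypothetical helper; the "DONE" state name is an assumption.
    """
    if xfer_state == "DONE":
        return True
    if xfer_state == "PROC":
        return False
    # The f-string puts the state in the message itself. Passing it as a
    # second positional argument would only store it in the exception's args.
    raise RuntimeError(f"Transfer failed with state {xfer_state}")
```

A caller that wants to record failures instead of crashing could catch this exception per transfer, which is the kind of handling deferred to the follow-up PRs mentioned above.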

Prometheus

The small set of metrics currently exposed is CLI-only; I will follow up with a separate PR to expose them to Prometheus.
The current design should nonetheless be compatible with Prometheus as is, with all stats represented by a "Histogram" object (as per @markmc's guidance).

Anything inside def reduce() is a summary stat for the CLI that I do not expect to need in Prometheus. All other "raw" metrics collected should simply be ingested and made available for custom aggregation by the logging engine.
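To illustrate why keeping raw observations is a good fit for a histogram, here is a small pure-Python sketch of cumulative "less than or equal" buckets in the style Prometheus histograms use. It is neither the Histogram object referred to above nor the prometheus_client API: the class name `BucketHistogram` and the bucket bounds are hypothetical.

```python
import bisect
from typing import List, Tuple


class BucketHistogram:
    """Counts observations into buckets with inclusive upper bounds."""

    def __init__(self, upper_bounds: List[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.n = 0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound that is >= value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.counts[idx] += 1
        self.total += value
        self.n += 1

    def cumulative(self) -> List[Tuple[float, int]]:
        """Return (upper_bound, count of observations <= upper_bound) pairs."""
        out, running = [], 0
        bounds = self.upper_bounds + [float("inf")]
        for bound, count in zip(bounds, self.counts):
            running += count
            out.append((bound, running))
        return out
```

Averages can be recovered from `total / n`, and approximate percentiles from the cumulative counts, so summaries like those in reduce() would not need to be exported separately.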

mergify · 2025-09-22T14:39:46Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

This pull request introduces telemetry for NIXL, exposing various metrics for logging to the CLI. The changes include updating the nixl dependency, adding logic to collect transfer statistics when telemetry is enabled via an environment variable, and implementing aggregation and reduction of these stats for display. The implementation is well-structured, but I've identified a potential ZeroDivisionError in the metrics reduction logic that could cause a crash. My feedback includes a suggestion to prevent this.

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

mergify · 2025-09-23T22:53:11Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


NickLucche · 2025-09-30T14:46:28Z

Addressed changes here @markmc, thanks!

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

markmc · 2025-09-30T16:25:40Z

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

+                "P90 post time (ms)": 0,
+                "Avg MB per transfer": 0,
+                "Throughput (MB/s)": 0,
+                "Avg number of descriptors": 0,


This string template thingy is repeated twice in the same function. Very minor nit


NickLucche requested review from ApostaC, WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners September 22, 2025 14:39

mergify bot added ci/build v1 labels Sep 22, 2025

mergify bot added needs-rebase kv-connector labels Sep 22, 2025

gemini-code-assist bot reviewed Sep 22, 2025


NickLucche force-pushed the nixl-telemetry branch from 8d4dd5c to 8f3e9d3 Compare September 22, 2025 15:59

mergify bot removed the needs-rebase label Sep 22, 2025

robertgshaw2-redhat reviewed Sep 22, 2025


mergify bot added the needs-rebase label Sep 23, 2025

markmc requested changes Sep 24, 2025


NickLucche added 4 commits September 30, 2025 14:21

expose metrics from nixl

655ef23

Signed-off-by: NickLucche <[email protected]>

review

e9401bb

Signed-off-by: NickLucche <[email protected]>

telemetry on by default

3cbb4c4

Signed-off-by: NickLucche <[email protected]>

test new xferstats changes

0e0661b

Signed-off-by: NickLucche <[email protected]>

NickLucche force-pushed the nixl-telemetry branch from 0321679 to 0e0661b Compare September 30, 2025 14:45

mergify bot removed the needs-rebase label Sep 30, 2025

markmc approved these changes Sep 30, 2025


NickLucche added 2 commits October 1, 2025 12:23

do not track num_successful_transfers explicitely

9dffc32

Signed-off-by: NickLucche <[email protected]>

precommit

32b22d9

Signed-off-by: NickLucche <[email protected]>

DarkLight1337 approved these changes Oct 3, 2025


DarkLight1337 enabled auto-merge (squash) October 3, 2025 08:40

github-actions bot added the ready label Oct 3, 2025

Merge branch 'main' into nixl-telemetry

2cc02d2

DarkLight1337 merged commit 48f3090 into vllm-project:main Oct 3, 2025
50 checks passed


NickLucche mentioned this pull request Oct 6, 2025

[Metrics] [KVConnector] Add connector prefix cache hit rate stats #26245

Merged


NickLucche mentioned this pull request Nov 4, 2025

[Feature]: [P/D] Expose kv_transfer metrics (print to console, and to promethus) #21784

Closed

1 task

