Enhanced Load Balancing Metrics for Model and Endpoint Performance Tracking #227

@rootfs

Description

Is your feature request related to a problem? Please describe.
The semantic router currently prioritizes accuracy-based model selection but has no load-balancing metrics that track endpoint and model performance over time. As a result, high-accuracy models and endpoints can become bottlenecks. We need time-windowed metrics so the router can trade accuracy for latency and distribute load effectively.

Describe the solution you'd like

Time-Windowed Performance Tracking

Multiple time horizons: 1m, 5m, 15m, 1h, and 24h windows (configurable)

Key metrics per window:

  • Request rates and completion rates
  • Latency distributions (P50, P95, P99)
  • Token throughput (prompt/completion tokens)
  • Error rates and timeout frequencies
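
To make the per-window metrics concrete, here is a minimal Python sketch of a sliding-window tracker. It is only an illustration under assumptions, not the router's implementation: the names `Sample`, `WindowedStats`, and `snapshot` are hypothetical, percentiles use the simple nearest-rank method, and raw samples are kept in memory, which a production version would likely replace with bucketed histograms.

```python
import math
from collections import deque
from dataclasses import dataclass

# Window lengths in seconds, mirroring the horizons proposed above.
WINDOWS = {"1m": 60, "5m": 300, "15m": 900, "1h": 3600, "24h": 86400}


@dataclass
class Sample:
    ts: float        # completion timestamp, in seconds
    latency: float   # request latency, in seconds
    tokens: int      # prompt + completion tokens
    ok: bool         # False for errors and timeouts


def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted list (None if empty)."""
    if not sorted_vals:
        return None
    rank = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


class WindowedStats:
    """Hypothetical per-endpoint tracker; one instance per (endpoint, model)."""

    def __init__(self, max_age=max(WINDOWS.values())):
        self.max_age = max_age
        self.samples = deque()

    def record(self, sample):
        """Append a sample and evict anything older than the largest window."""
        self.samples.append(sample)
        cutoff = sample.ts - self.max_age
        while self.samples and self.samples[0].ts < cutoff:
            self.samples.popleft()

    def snapshot(self, window, now):
        """Aggregate all samples with now - span < ts <= now."""
        span = WINDOWS[window]
        recent = [s for s in self.samples if now - span < s.ts <= now]
        latencies = sorted(s.latency for s in recent)
        n = len(recent)
        return {
            "requests": n,
            "request_rate": n / span,  # requests per second
            "error_rate": sum(not s.ok for s in recent) / n if n else 0.0,
            "tokens": sum(s.tokens for s in recent),
            "p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
            "p99": percentile(latencies, 99),
        }
```

A caller would `record()` one `Sample` per completed request and call `snapshot("5m", now)` whenever the load balancer needs fresh numbers. Passing `now` explicitly keeps the sketch deterministic and easy to test.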

Sample Endpoint-Level Metrics

llm_endpoint_latency_windowed_seconds{endpoint, model, time_window}
llm_endpoint_requests_windowed_total{endpoint, model, time_window}  
llm_endpoint_tokens_windowed_total{endpoint, model, token_type, time_window}
llm_endpoint_utilization_percentage{endpoint, time_window}
llm_endpoint_queue_depth_estimated{endpoint, model}
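
As a rough illustration of how these series might look on the wire, the Python sketch below renders one sample line in the Prometheus text exposition format and derives a queue-depth estimate. The helper names, the label values (`ep-a`, `model-x`), and the use of Little's law for `llm_endpoint_queue_depth_estimated` are assumptions for the example; the issue does not specify how queue depth should be estimated.

```python
def escape_label(value):
    """Escape backslash, double quote, and newline for a Prometheus label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name, labels, value):
    """Render one line as: name{label="value",...} value"""
    body = ",".join(f'{k}="{escape_label(v)}"' for k, v in labels.items())
    return f"{name}{{{body}}} {value}"


def estimate_queue_depth(arrival_rate, mean_latency):
    """Assumed estimator (Little's law): in-flight requests = rate * time in system.

    arrival_rate is in requests per second, mean_latency in seconds.
    """
    return arrival_rate * mean_latency


# Hypothetical endpoint and model names, for illustration only.
labels = {"endpoint": "ep-a", "model": "model-x", "time_window": "5m"}
print(format_sample("llm_endpoint_requests_windowed_total", labels, 1200))
print(format_sample(
    "llm_endpoint_queue_depth_estimated",
    {"endpoint": "ep-a", "model": "model-x"},
    estimate_queue_depth(arrival_rate=4.0, mean_latency=2.5),
))
```

Keeping `time_window` as a label, as the proposed names do, lets a single series name cover all horizons, at the cost of multiplying label cardinality by the number of configured windows.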
