Open
Labels: area/networking, area/observability, help wanted, priority/P1
Description
Is your feature request related to a problem? Please describe.
The semantic router currently prioritizes accuracy-based model selection but has no load-balancing metrics that track endpoint and model performance over different time intervals. As a result, high-accuracy models and endpoints can become bottlenecks while other capacity sits idle. We need time-windowed metrics so that load balancing can trade accuracy for latency and distribute load effectively.
Describe the solution you'd like
Time-Windowed Performance Tracking
Multiple time horizons: 1m, 5m, 15m, 1h, 24h windows (configurable)
Key metrics per window:
- Request rates and completion rates
- Latency distributions (P50, P95, P99)
- Token throughput (prompt/completion tokens)
- Error rates and timeout frequencies
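To make the per-window metrics above concrete, here is a minimal Python sketch of a sliding-window tracker for a single (endpoint, model) pair. It is an illustration only, not the router's implementation: the names `Sample`, `WindowedTracker`, `percentile`, and `WINDOWS` are hypothetical, samples are assumed to arrive in timestamp order, and percentiles use a simple nearest-rank method over raw samples rather than histograms.

```python
import time
from collections import deque
from dataclasses import dataclass

# Hypothetical window names mapped to their length in seconds.
WINDOWS = {"1m": 60, "5m": 300, "15m": 900, "1h": 3600, "24h": 86400}


@dataclass(frozen=True)
class Sample:
    ts: float               # completion time, in seconds
    latency_s: float        # end-to-end latency, in seconds
    prompt_tokens: int
    completion_tokens: int
    error: bool = False
    timeout: bool = False


def percentile(sorted_values, q):
    """Nearest-rank percentile of a pre-sorted list; q is an integer in 1..100."""
    n = len(sorted_values)
    if n == 0:
        return None
    rank = max(1, (q * n + 99) // 100)  # integer ceil(q * n / 100)
    return sorted_values[rank - 1]


class WindowedTracker:
    """Keeps raw samples no older than the longest configured window."""

    def __init__(self, windows=WINDOWS):
        self.windows = dict(windows)
        self.max_age = max(self.windows.values())
        self.samples = deque()

    def record(self, sample):
        # Assumes samples are recorded in non-decreasing timestamp order.
        self.samples.append(sample)
        cutoff = sample.ts - self.max_age
        while self.samples and self.samples[0].ts < cutoff:
            self.samples.popleft()

    def snapshot(self, window, now=None):
        """Aggregate the samples whose timestamp falls in (now - window, now]."""
        now = time.time() if now is None else now
        span = self.windows[window]
        recent = [s for s in self.samples if now - span < s.ts <= now]
        n = len(recent)
        latencies = sorted(s.latency_s for s in recent)
        return {
            "requests": n,
            "request_rate_per_s": n / span,
            "p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
            "p99": percentile(latencies, 99),
            "prompt_tokens": sum(s.prompt_tokens for s in recent),
            "completion_tokens": sum(s.completion_tokens for s in recent),
            "error_rate": sum(s.error for s in recent) / n if n else 0.0,
            "timeout_rate": sum(s.timeout for s in recent) / n if n else 0.0,
        }
```

Storing raw samples keeps the sketch short but costs memory proportional to traffic over the longest window; a real implementation would more likely use bucketed counters or histograms per window.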
Sample Endpoint-Level Metrics
llm_endpoint_latency_windowed_seconds{endpoint, model, time_window}
llm_endpoint_requests_windowed_total{endpoint, model, time_window}
llm_endpoint_tokens_windowed_total{endpoint, model, token_type, time_window}
llm_endpoint_utilization_percentage{endpoint, time_window}
llm_endpoint_queue_depth_estimated{endpoint, model}
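As a rough illustration of how the sample metrics above might look once label values are filled in, the following Python sketch renders one sample line in the Prometheus text exposition format. The helper `format_sample` and the label values (`ep-a`, `model-x`) are hypothetical; only the metric and label names come from the list above.

```python
def format_sample(name, labels, value):
    """Render one sample as `name{label="value",...} value`."""

    def escape(v):
        # Label values escape backslash, double quote, and newline.
        return str(v).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    body = ",".join(f'{k}="{escape(v)}"' for k, v in labels.items())
    return f"{name}{{{body}}} {value}"


line = format_sample(
    "llm_endpoint_requests_windowed_total",
    {"endpoint": "ep-a", "model": "model-x", "time_window": "5m"},  # assumed values
    42,
)
print(line)
```

Using `time_window` as a label keeps one metric name per quantity, at the cost of one extra series per window for each endpoint and model combination.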