Hi,
My name is Nir Rozenbaum, and I am part of a team at IBM Research working on routing of inference workloads.
We believe the Gateway API inference extension could be a strong fit for our needs, and we would like to discuss some key requirements based on our customer use cases.
Some of the requirements we would like to raise include:
- Serving Request Priority (SLA-based Routing): In our use case, request prioritization is determined by customer SLAs, ensuring that higher-paying customers receive priority over lower-tier or internal users. While the current design allows prioritization based on model criticality, we need a mechanism to prioritize requests dynamically based on SLA tiers. For example, inference requests from IBMers using IBM Cloud for internal purposes (free tier) should be deprioritized in favor of paying customers. Similarly, customers on a premium plan should experience lower wait times than those on a standard plan. (A rough sketch of what we have in mind is included below the list.)
- Session Affinity & Cache-Aware Routing: We need the ability to route requests to a specific vLLM pod based on session headers. This ensures that inference requests within the same session are consistently directed to the same vLLM instance, improving efficiency and caching performance. (See the second sketch below the list.)
- Maintaining Existing Routing Logic: These capabilities should be additive, complementing the current model-based and LoRA model-aware routing mechanisms rather than replacing them.
- Additionally, we are looking into addressing scalability and fault tolerance challenges in the current reference implementation.
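To make the SLA-based prioritization requirement more concrete, here is a minimal, hypothetical sketch in Go of how an SLA tier carried in a request header could drive queue ordering under load. The header name (x-sla-tier), the tier names, and the types are placeholders we invented for illustration; this is not tied to the extension's actual API or scheduling code.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Hypothetical SLA tiers mapped to priorities (lower value = served first).
// Tier names and the "x-sla-tier" header they would come from are assumptions.
var tierPriority = map[string]int{
	"premium":  0,
	"standard": 1,
	"free":     2, // e.g. internal/free-tier traffic, served last
}

type pendingRequest struct {
	id      string
	slaTier string    // taken from an assumed "x-sla-tier" request header
	arrived time.Time // arrival time, used to break ties
}

// orderBySLA sorts queued requests by SLA priority, breaking ties by arrival
// time, so higher-paying tiers experience lower wait times when pods are busy.
func orderBySLA(queue []pendingRequest) {
	sort.SliceStable(queue, func(i, j int) bool {
		pi, pj := tierPriority[queue[i].slaTier], tierPriority[queue[j].slaTier]
		if pi != pj {
			return pi < pj
		}
		return queue[i].arrived.Before(queue[j].arrived)
	})
}

func main() {
	now := time.Now()
	queue := []pendingRequest{
		{id: "r1", slaTier: "free", arrived: now},
		{id: "r2", slaTier: "premium", arrived: now.Add(10 * time.Millisecond)},
		{id: "r3", slaTier: "standard", arrived: now.Add(5 * time.Millisecond)},
	}
	orderBySLA(queue)
	for _, r := range queue {
		fmt.Println(r.id, r.slaTier) // r2 premium, r3 standard, r1 free
	}
}
```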
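For the session-affinity requirement, a similarly rough sketch of one possible approach: rendezvous hashing on an assumed x-session-id header, so repeated requests in the same session land on the same vLLM pod and benefit from cache reuse. The header name, pod addresses, and the choice of hashing scheme are all illustrative, not a proposal for the final design.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickPod returns the same pod for a given session ID as long as the pod set
// is unchanged, using rendezvous (highest-random-weight) hashing.
func pickPod(sessionID string, pods []string) string {
	var best string
	var bestScore uint64
	for _, pod := range pods {
		h := fnv.New64a()
		h.Write([]byte(sessionID))
		h.Write([]byte(pod))
		if score := h.Sum64(); best == "" || score > bestScore {
			best, bestScore = pod, score
		}
	}
	return best
}

func main() {
	pods := []string{"vllm-0:8000", "vllm-1:8000", "vllm-2:8000"}
	// Requests carrying the same assumed x-session-id header map to one pod.
	fmt.Println(pickPod("session-abc", pods))
	fmt.Println(pickPod("session-abc", pods)) // same pod as above
	fmt.Println(pickPod("session-xyz", pods)) // may differ
}
```

One reason we lean toward a rendezvous-style scheme in this sketch is that when a pod is added or removed, only the sessions mapped to the affected pod move, which limits cache churn; but we are open to other affinity mechanisms.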
We would love to explore how best to collaborate on these requirements and contribute to the project.
Our team has extensive experience contributing to multiple CNCF projects, working with Go, and implementing real-world routing of inference workloads for enterprise customers.
What would be the best way to engage in discussions and contribute to this effort?
Looking forward to your thoughts.