Commit 812758d

modify
Signed-off-by: Rui Zhang <[email protected]>
1 parent 2a326e0 commit 812758d

File tree

1 file changed: +4 −4 lines changed


website/docs/proposals/production-stack-integration.md

Lines changed: 4 additions & 4 deletions
@@ -3,12 +3,12 @@
 ## 1. Overview

-The goal of this document is to outline a comprehensive integration strategy between **vLLM Semantic Router** and the **vLLM Production Stack**. The vLLM Production Stack is a cloud‑native reference system for deploying vLLM at scale. It provides several deployment options that spin up vLLM servers, a request router and an observability stack. The request router can direct traffic to different models, provide service discovery and fault tolerance through the Kubernetes API, and support round‑robin, session‑based, prefix‑aware, KVCache‑aware and disaggregated‑prefill routing. The Semantic Router adds a **system‑intelligence layer** that classifies each user request, selects the most suitable model from a pool, injects domain‑specific system prompts, performs semantic caching and enforces enterprise‑grade security checks such as PII and jailbreak detection.
+The goal of this document is to outline a comprehensive integration strategy between **vLLM Semantic Router** and the **vLLM Production Stack**. The vLLM Production Stack is a cloud‑native reference system for deploying vLLM at scale. It provides several deployment options that spin up vLLM servers, a request router and an observability stack. The request router can direct traffic to different models, provide service discovery and fault tolerance through the Kubernetes API, and support round‑robin, session‑based, prefix‑aware, KVCache‑aware and disaggregated‑prefill routing with LMCache native support. The Semantic Router adds a **system‑intelligence layer** that classifies each user request, selects the most suitable model from a pool, injects domain‑specific system prompts, performs semantic caching and enforces enterprise‑grade security checks such as PII and jailbreak detection.

 By combining these two systems we obtain a unified inference stack. Semantic routing ensures that each request is answered by the best possible model; Production‑Stack routing maximizes infrastructure and inference efficiency and exposes rich metrics. Together they provide:

 * **System‑level intelligence** — understand the user’s intent, choose the right model, inject appropriate system prompts and pre‑filter tools.
-* **Infrastructure efficiency** — scale from a single instance to a distributed vLLM deployment without changing application code, routing traffic across multiple models with token‑level optimization.
+* **Infrastructure efficiency** — scale from a single instance to a distributed vLLM deployment without changing application code, routing traffic across multiple models with token‑level optimization and LMCache native support.
 * **Security and compliance** — block PII and jailbreak prompts before they reach the model.
 * **Observability** — monitor requests, latency and GPU usage through the Production‑Stack’s Grafana dashboard and trace semantic‑router decisions.
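The semantic‑routing flow this section describes (classify the request, screen it for jailbreaks, select a model from the pool, inject a system prompt) could be sketched roughly as follows. This is a minimal illustration only; every class, function, and model name here is hypothetical and not the Semantic Router's actual API:

```python
# Hypothetical sketch of a semantic-routing decision: security pre-filter,
# intent classification, model selection, system-prompt injection.
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    model: str
    system_prompt: str

# Toy intent -> (model, system prompt) pool; a real deployment would load
# this from configuration and use a trained classifier, not keyword rules.
MODEL_POOL = {
    "code": ("qwen-coder", "You are an expert programming assistant."),
    "math": ("deepseek-math", "Reason step by step and show your work."),
    "general": ("llama-3-8b", "You are a helpful assistant."),
}

def classify_intent(prompt: str) -> str:
    # Stand-in for the semantic classifier (e.g. an embedding model).
    if "def " in prompt or "class " in prompt:
        return "code"
    if any(tok in prompt for tok in ("integral", "solve", "equation")):
        return "math"
    return "general"

def route(prompt: str) -> RoutingDecision:
    # Security check runs before any model sees the request.
    if "ignore previous instructions" in prompt.lower():
        raise ValueError("jailbreak attempt blocked")
    model, system_prompt = MODEL_POOL[classify_intent(prompt)]
    return RoutingDecision(model=model, system_prompt=system_prompt)

print(route("solve the equation x^2 = 4").model)  # deepseek-math
```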

@@ -22,7 +22,7 @@ The vLLM Production Stack provides the building blocks for serving large langua
 | Capability | Description |
 | --- | --- |
-| **Distributed deployment** | Deploy multiple vLLM instances and scale from single‑instance to multi‑instance clusters without changing application code. |
+| **Distributed deployment** | Deploy multiple vLLM instances with LMCache native support and scale from single‑instance to multi‑instance clusters without changing application code. |
 | **Request router** | Routes requests to different models and instances, supporting routing strategies including disaggregated‑prefill, KVCache‑aware, prefix‑aware, session‑based and round‑robin routing. |
 | **Service discovery & fault tolerance** | Uses the Kubernetes API for automatic discovery and removes failed nodes from the pool. |
 | **Observability** | Provides a Grafana dashboard displaying latency distributions, time‑to‑first‑token, the number of running or pending requests and GPU KV‑cache usage. |
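Two of the router strategies in the table above, round‑robin and session‑based routing, can be illustrated with a small sketch. This is not the Production Stack's actual router code; the class, endpoint names, and hashing choice are assumptions for illustration:

```python
# Illustrative router: round-robin spreads load evenly, while session-based
# routing pins a session to one instance so its KV cache stays warm there.
import hashlib
import itertools

class Router:
    def __init__(self, endpoints):
        self.endpoints = endpoints
        self._rr = itertools.cycle(endpoints)

    def round_robin(self) -> str:
        # Cycle through healthy vLLM instances in order.
        return next(self._rr)

    def session_based(self, session_id: str) -> str:
        # Hash the session id so the same session always maps to the
        # same instance (stable as long as the endpoint list is stable).
        digest = hashlib.sha256(session_id.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.endpoints)
        return self.endpoints[index]

router = Router(["vllm-0:8000", "vllm-1:8000", "vllm-2:8000"])
# A session is always routed to the same backend:
assert router.session_based("user-42") == router.session_based("user-42")
```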
@@ -62,7 +62,7 @@ The two systems target different layers of the inference stack:
 #### Production Stack – Infrastructure Optimization Layer

-* Improves inference efficiency using round‑robin, session‑based, prefix‑aware, KVCache‑aware and disaggregated‑prefill routing.
+* Improves inference efficiency, with LMCache native support, using round‑robin, session‑based, prefix‑aware, KVCache‑aware and disaggregated‑prefill routing.
 * Offloads KV‑cache to CPU memory and remote storage (via LMCache) and supports KV‑cache‑aware routing strategies.
 * Scales horizontally via Kubernetes and exposes metrics and traces for monitoring.
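The prefix‑aware routing mentioned in the bullets above can be sketched as picking the instance whose cached prompt prefix overlaps most with the incoming request, so prefill work is reused. This is a toy illustration under assumed names, not the Production Stack's scoring logic:

```python
# Hypothetical prefix-aware routing: send the request to the instance
# whose cached prefix shares the longest common prefix with the prompt.
def longest_common_prefix(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_instance(prompt: str, cached_prefixes: dict) -> str:
    # cached_prefixes maps instance name -> a prompt prefix it has cached.
    return max(
        cached_prefixes,
        key=lambda inst: longest_common_prefix(prompt, cached_prefixes[inst]),
    )

caches = {
    "vllm-0": "You are a helpful assistant.",
    "vllm-1": "Translate the following",
}
print(pick_instance("Translate the following text to French", caches))  # vllm-1
```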

0 commit comments