
Commit a1587ff

Merge branch 'kubernetes-sigs:main' into flow-control

2 parents: 28444d6 + 4318d95
File tree: 4 files changed, +105 −33 lines

mkdocs.yml

Lines changed: 3 additions & 2 deletions
````diff
@@ -12,6 +12,7 @@ theme:
   logo: images/logo/logo-text-large-horizontal-white.png
   favicon: images/favicon-64.png
   features:
+  - content.code.annotate
   - search.highlight
   - navigation.tabs
   - navigation.top
@@ -55,7 +56,7 @@ nav:
   - Design Principles: concepts/design-principles.md
   - Conformance: concepts/conformance.md
   - Roles and Personas: concepts/roles-and-personas.md
-  - Implementations:
+  - Implementations:
   - Gateways: implementations/gateways.md
   - Model Servers: implementations/model-servers.md
   - FAQ: faq.md
@@ -70,7 +71,7 @@ nav:
   - InferencePool Rollout: guides/inferencepool-rollout.md
   - Metrics and Observability: guides/metrics-and-observability.md
   - Configuration Guide:
-    - Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md
+    - Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md
   - Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md
   - Troubleshooting Guide: guides/troubleshooting.md
   - Implementer Guides:
````

site-src/guides/index.md

Lines changed: 3 additions & 2 deletions
````diff
@@ -86,8 +86,9 @@ A cluster with:

 === "GKE"

-    1. Enable the Gateway API and configure proxy-only subnets when necessary. See [Deploy Gateways](https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways)
-       for detailed instructions.
+    1. Enable the Google Kubernetes Engine API, Compute Engine API, the Network Services API and configure proxy-only subnets when necessary.
+       See [Deploy Inference Gateways](https://cloud.google.com/kubernetes-engine/docs/how-to/deploy-gke-inference-gateway)
+       for detailed instructions.

 2. Deploy Inference Gateway:
````
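The APIs named in step 1 of the new text can typically be enabled from the command line as well; a minimal sketch, assuming the `gcloud` CLI is installed and authenticated against the target project:

```shell
# Enable the GKE, Compute Engine, and Network Services APIs for the
# active project (assumes gcloud is already authenticated/configured).
gcloud services enable \
  container.googleapis.com \
  compute.googleapis.com \
  networkservices.googleapis.com
```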

Lines changed: 98 additions & 28 deletions
````diff
@@ -1,18 +1,53 @@
 # Serve multiple generative AI models
-A company wants to deploy multiple large language models (LLMs) to serve different workloads.
-For example, they might want to deploy a Gemma3 model for a chatbot interface and a Deepseek model for a recommendation application.
+
+A company wants to deploy multiple large language models (LLMs) to a cluster to serve different workloads.
+For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application.
 The company needs to ensure optimal serving performance for these LLMs.
-By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
-You can then route requests based on the model name (such as "chatbot" and "recommender") and the `Criticality` property.
+By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
+You can then route requests based on the model name (such as `chatbot` and `recommender`) and the `Criticality` property.

 ## How
+
 The following diagram illustrates how an Inference Gateway routes requests to different models based on the model name.
-The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
+The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) (BBR)
 from the request body to the header. The header is then matched to dispatch
 requests to different `InferencePool` (and their EPPs) instances.
 ![Serving multiple generative AI models](../images/serve-mul-gen-AI-models.png)
````
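The body-to-header extraction described in the "How" text above can be sketched in a few lines of Python. This is an illustrative stand-in for what the BBR ExtProc server does, not its actual implementation; the `X-Gateway-Model-Name` header key matches the one used in the HTTPRoute in this commit.

```python
import json

def extract_model_header(body: bytes) -> dict:
    """Illustrative sketch of body-based routing: read the `model` field
    from the JSON request body and surface it as the X-Gateway-Model-Name
    header so the gateway can match on it."""
    try:
        model = json.loads(body).get("model")
    except (ValueError, AttributeError):
        # Not JSON, or not a JSON object: nothing to extract.
        return {}
    return {"X-Gateway-Model-Name": model} if isinstance(model, str) else {}

print(extract_model_header(b'{"model": "chatbot", "prompt": "hi"}'))
# {'X-Gateway-Model-Name': 'chatbot'}
```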

````diff
+### Deploy Body-Based Routing
+
+To enable body-based routing, you need to deploy the Body-Based Routing ExtProc server using Helm. Depending on your Gateway provider, you can use one of the following commands:
+
+=== "GKE"
+
+    ```bash
+    helm install body-based-router \
+      --set provider.name=gke \
+      --version v0.5.1 \
+      oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
+    ```
+
+=== "Istio"
+
+    ```bash
+    helm install body-based-router \
+      --set provider.name=istio \
+      --version v0.5.1 \
+      oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
+    ```
+
+=== "Other"
+
+    ```bash
+    helm install body-based-router \
+      --version v0.5.1 \
+      oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
+    ```
+
````
````diff
+### Configure HTTPRoute
+
 This conceptual example illustrates how to use the `HTTPRoute` object to route requests, based on a model name like `chatbot` or `recommender`, to an `InferencePool`.
+
 ```yaml
 apiVersion: gateway.networking.k8s.io/v1
 kind: HTTPRoute
@@ -25,8 +60,7 @@ spec:
 - matches:
   - headers:
     - type: Exact
-      #Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
-      name: X-Gateway-Model-Name
+      name: X-Gateway-Model-Name # (1)!
       value: chatbot
     path:
       type: PathPrefix
@@ -37,38 +71,74 @@ spec:
 - matches:
   - headers:
     - type: Exact
-      #Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
-      name: X-Gateway-Model-Name
+      name: X-Gateway-Model-Name # (2)!
      value: recommender
    path:
      type: PathPrefix
      value: /
   backendRefs:
   - name: deepseek-r1
-    kind: InferencePool
+    kind: InferencePool
 ```

+1. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.
+2. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.
+
````
````diff
 ## Try it out

 1. Get the gateway IP:
    ```bash
    IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
    ```
-2. Send a few requests to model "chatbot" as follows:
-   ```bash
-   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
-   "model": "chatbot",
-   "prompt": "What is the color of the sky",
-   "max_tokens": 100,
-   "temperature": 0
-   }'
-   ```
-3. Send a few requests to model "recommender" as follows:
-   ```bash
-   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
-   "model": "recommender",
-   "prompt": "Give me restaurant recommendations in Paris",
-   "max_tokens": 100,
-   "temperature": 0
-   }'
-   ```
+
+=== "Chat Completions API"
+
+    1. Send a few requests to model `chatbot` as follows:
+       ```bash
+       curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
+         -H "Content-Type: application/json" \
+         -d '{
+           "model": "chatbot",
+           "messages": [{"role": "user", "content": "What is the color of the sky?"}],
+           "max_tokens": 100,
+           "temperature": 0
+         }'
+       ```
+
+    2. Send a few requests to model `recommender` as follows:
+       ```bash
+       curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
+         -H "Content-Type: application/json" \
+         -d '{
+           "model": "recommender",
+           "messages": [{"role": "user", "content": "Give me restaurant recommendations in Paris"}],
+           "max_tokens": 100,
+           "temperature": 0
+         }'
+       ```
+
+=== "Completions API"
+
+    1. Send a few requests to model `chatbot` as follows:
+       ```bash
+       curl -X POST -i ${IP}:${PORT}/v1/completions \
+         -H 'Content-Type: application/json' \
+         -d '{
+           "model": "chatbot",
+           "prompt": "What is the color of the sky",
+           "max_tokens": 100,
+           "temperature": 0
+         }'
+       ```
+
+    2. Send a few requests to model `recommender` as follows:
+       ```bash
+       curl -X POST -i ${IP}:${PORT}/v1/completions \
+         -H 'Content-Type: application/json' \
+         -d '{
+           "model": "recommender",
+           "prompt": "Give me restaurant recommendations in Paris",
+           "max_tokens": 100,
+           "temperature": 0
+         }'
+       ```
````

site-src/index.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -29,7 +29,7 @@ The following specific terms to this project:
   from [Model Serving](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol/README.md).
 - **Metrics and Capabilities**: Data provided by model serving platforms about
   performance, availability and capabilities to optimize routing. Includes
-  things like [Prefix Cache] status or [LoRA Adapters] availability.
+  things like [Prefix Cache](https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html) status or [LoRA Adapters](https://docs.vllm.ai/en/stable/features/lora.html) availability.
 - **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).

 [Inference Gateway]:#concepts-and-definitions
````
