
Commit ed32a43

Authored by capri-xiyueliu, with co-authors Cong Liu and Rob Scott

docs: added examples to address various generative AI application scenarios by using gateway api inference extension (#812)

* added common cases
* added more details
* fixed comments
* changed file location
* fixed typo
* Update site-src/guides/serve-multiple-lora-adapters.md (Co-authored-by: Cong Liu)
* Update site-src/guides/serve-multiple-lora-adapters.md (Co-authored-by: Cong Liu)
* Update mkdocs.yml (Co-authored-by: Rob Scott)
* Update site-src/guides/serve-multiple-lora-adapters.md (Co-authored-by: Rob Scott)
* Update site-src/guides/serve-multiple-genai-models.md (Co-authored-by: Rob Scott)
* added subsession
* fixed wording

Signed-off-by: Xiyue Yu <[email protected]>
Co-authored-by: Cong Liu <[email protected]>
Co-authored-by: Rob Scott <[email protected]>
1 parent d55ead7 commit ed32a43

File tree

5 files changed: +176 -2 lines changed

mkdocs.yml

Lines changed: 5 additions & 2 deletions

@@ -61,9 +61,12 @@ nav:
   - Guides:
     - User Guides:
       - Getting started: guides/index.md
+      - Use Cases:
+        - Serve Multiple GenAI models: guides/serve-multiple-genai-models.md
+        - Serve Multiple LoRA adapters: guides/serve-multiple-lora-adapters.md
       - Rollout:
-      - Adapter Rollout: guides/adapter-rollout.md
-      - InferencePool Rollout: guides/inferencepool-rollout.md
+        - Adapter Rollout: guides/adapter-rollout.md
+        - InferencePool Rollout: guides/inferencepool-rollout.md
       - Metrics: guides/metrics.md
       - Implementer's Guide: guides/implementers.md
   - Performance:
site-src/guides/serve-multiple-genai-models.md

Lines changed: 71 additions & 0 deletions (new file)
# Serve multiple generative AI models

A company wants to deploy multiple large language models (LLMs) to serve different workloads.
For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application.
The company needs to ensure optimal serving performance for these LLMs.
Using Gateway API Inference Extension, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
You can then route requests based on the model name (such as "chatbot" and "recommender") and the `Criticality` property.

## How

The following diagram illustrates how Gateway API Inference Extension routes requests to different models based on the model name.

![Serving multiple generative AI models](../images/serve-mul-gen-AI-models.png)

The following conceptual example shows how to use the `HTTPRoute` object to route requests to an `InferencePool` based on a model name such as "chatbot" or "recommender".
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llms
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - headers:
      - type: Exact
        # Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
        # is used to copy the model name from the request body to this header.
        name: X-Gateway-Model-Name
        value: chatbot
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: gemma3
      kind: InferencePool
  - matches:
    - headers:
      - type: Exact
        # Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
        # is used to copy the model name from the request body to this header.
        name: X-Gateway-Model-Name
        value: recommender
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: deepseek-r1
      kind: InferencePool
```
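The intro above also calls out the `Criticality` property. As a minimal sketch (the criticality values chosen here are illustrative assumptions, not part of the original example), each model name can additionally be registered as an `InferenceModel` that references its `InferencePool`, mirroring the pattern used in the LoRA adapter guide, so that `Critical` traffic can be prioritized over `Standard` traffic when capacity is constrained.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot
  criticality: Critical   # assumption: interactive chat is latency-sensitive
  poolRef:
    name: gemma3
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: recommender
spec:
  modelName: recommender
  criticality: Standard   # assumption: recommendations tolerate queueing
  poolRef:
    name: deepseek-r1
```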
## Try it out

1. Get the gateway IP:

    ```bash
    IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
    ```

2. Send a few requests to model "chatbot" as follows:

    ```bash
    curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
      "model": "chatbot",
      "prompt": "What is the color of the sky",
      "max_tokens": 100,
      "temperature": 0
    }'
    ```

3. Send a few requests to model "recommender" as follows:

    ```bash
    curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
      "model": "recommender",
      "prompt": "Give me restaurant recommendations in Paris",
      "max_tokens": 100,
      "temperature": 0
    }'
    ```
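If Body-Based Routing is not yet running in front of your backends, you can still smoke-test the `HTTPRoute` matches above by setting the `X-Gateway-Model-Name` header yourself. This is only a hedged sketch under that assumption about your setup; it is not a substitute for BBR, which derives the header from the request body.

```bash
# Hedged smoke test: set the routing header directly instead of relying on
# Body-Based Routing to copy "model" from the JSON body into the header.
curl -i ${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Gateway-Model-Name: recommender' \
  -d '{"model": "recommender", "prompt": "Give me restaurant recommendations in Paris", "max_tokens": 100, "temperature": 0}'
```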
site-src/guides/serve-multiple-lora-adapters.md

Lines changed: 100 additions & 0 deletions (new file)
# Serve LoRA adapters on a shared pool

A company wants to serve LLMs for document analysis and focuses on audiences in multiple languages, such as English and Spanish.
They have a fine-tuned LoRA adapter for each language, but need to use their GPU and TPU capacity efficiently.
You can use Gateway API Inference Extension to deploy dynamic LoRA fine-tuned adapters for each language (for example, `english-bot` and `spanish-bot`) on a common base model and accelerator.
This lets you reduce the number of required accelerators by densely packing multiple models into a shared pool.

## How

The following diagram illustrates how Gateway API Inference Extension serves multiple LoRA adapters on a shared pool.

![Serving LoRA adapters on a shared pool](../images/serve-LoRA-adapters.png)

This example shows how you can densely serve multiple LoRA adapters, each with distinct workload performance objectives, on a common `InferencePool`.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: gemma3
spec:
  selector:
    pool: gemma3
```
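The `InferencePool` above selects model server Pods by their `pool: gemma3` label. As a hedged sketch only (the container image, base model ID, and adapter paths below are illustrative assumptions, and any OpenAI-compatible server with LoRA support would work), a vLLM Deployment backing this pool with both adapters statically registered could look like the following; the Adapter Rollout guide covers loading adapters dynamically.

```yaml
# Hedged sketch: a minimal vLLM Deployment behind the "gemma3" InferencePool.
# Image tag, model ID, and adapter paths are illustrative assumptions; adapters
# would typically be mounted from a volume or downloaded at startup.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma3-vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      pool: gemma3
  template:
    metadata:
      labels:
        pool: gemma3   # matches the InferencePool selector above
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model=google/gemma-3-1b-it"
        - "--enable-lora"
        - "--lora-modules"
        - "english-bot=/adapters/english-bot"
        - "spanish-bot=/adapters/spanish-bot"
        ports:
        - containerPort: 8000
```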
Let's say we have two LoRA adapters, named "english-bot" and "spanish-bot", for the Gemma3 base model.
You can create an `InferenceModel` resource for each adapter and associate it with the relevant `InferencePool`.
In this case, we associate these LoRA adapters with the gemma3 `InferencePool` created above.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: english-bot
spec:
  modelName: english-bot
  criticality: Standard
  poolRef:
    name: gemma3
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: spanish-bot
spec:
  modelName: spanish-bot
  criticality: Critical
  poolRef:
    name: gemma3
```
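To confirm both adapters are registered against the pool (assuming the Inference Extension CRDs are installed in your cluster), you can list the `InferenceModel` resources:

```bash
# Both adapters should appear, each referencing the gemma3 pool.
kubectl get inferencemodels
```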
Now, you can route requests from the gateway by using the `HTTPRoute` object.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  listeners:
  - protocol: HTTP
    port: 80
    name: http
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: routes-to-llms
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: gemma3
      kind: InferencePool
```
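Before sending traffic, it can help to confirm that the gateway has accepted the route. This check uses standard Gateway API status fields rather than anything specific to the Inference Extension:

```bash
# The Accepted and ResolvedRefs conditions should report status "True".
kubectl get httproute routes-to-llms -o jsonpath='{.status.parents[0].conditions}'
```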
77+
## Try it out
78+
79+
1. Get the gateway IP:
80+
```bash
81+
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
82+
```
83+
2. Send a few requests to model "english-bot" as follows:
84+
```bash
85+
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
86+
"model": "english-bot",
87+
"prompt": "What is the color of the sky",
88+
"max_tokens": 100,
89+
"temperature": 0
90+
}'
91+
```
92+
3. Send a few requests to model "spanish-bot" as follows:
93+
```bash
94+
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
95+
"model": "spanish-bot",
96+
"prompt": "¿De qué color es...?",
97+
"max_tokens": 100,
98+
"temperature": 0
99+
}'
100+
```
site-src/images/serve-LoRA-adapters.png and site-src/images/serve-mul-gen-AI-models.png

Two binary image files added (371 KB and 403 KB); previews not shown.

0 commit comments
