# Serve LoRA adapters on a shared pool
A company wants to serve LLMs for document analysis to audiences in multiple languages, such as English and Spanish.
They have a fine-tuned LoRA adapter for each language, but need to use their GPU and TPU capacity efficiently.
You can use Gateway API Inference Extension to deploy dynamic LoRA fine-tuned adapters for each language (for example, `english-bot` and `spanish-bot`) on a common base model and accelerator.
This lets you reduce the number of required accelerators by densely packing multiple models into a shared pool.

## How
The following example shows how you can densely serve multiple LoRA adapters with distinct workload performance objectives on a common `InferencePool`.

First, create an `InferencePool` whose selector matches the Pods of the model server Deployment that hosts the Gemma3 base model:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: gemma3
spec:
  selector:
    pool: gemma3
```
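The `spec.selector` matches the labels on the model server Pods that back the pool. As a point of reference, a minimal sketch of such a Deployment is shown below; the image, base model name, adapter paths, and flags are illustrative assumptions for a vLLM-based setup and are not part of the original example.
```yaml
# Illustrative sketch only: Pods carry the `pool: gemma3` label selected by
# the InferencePool above. Image, model name, and adapter paths are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma3-model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      pool: gemma3
  template:
    metadata:
      labels:
        pool: gemma3
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=google/gemma-3-4b-it       # base model (assumed)
        - --enable-lora                      # serve LoRA adapters on the base model
        - --lora-modules
        - english-bot=/adapters/english-bot  # adapter paths are assumptions
        - spanish-bot=/adapters/spanish-bot
        ports:
        - containerPort: 8000
```
Any model server that exposes the adapters by name behind a single endpoint fits the same pattern; the key point is that all replicas in the pool can serve both adapters on the shared base model.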
Suppose you have two LoRA adapters, `english-bot` and `spanish-bot`, for the Gemma3 base model.
You can create an `InferenceModel` resource for each adapter and associate it with the relevant `InferencePool` resource.
In this case, both adapters are associated with the `gemma3` InferencePool created above.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: english-bot
spec:
  modelName: english-bot
  criticality: Standard
  poolRef:
    name: gemma3
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: spanish-bot
spec:
  modelName: spanish-bot
  criticality: Critical
  poolRef:
    name: gemma3
```
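Once the pool and models are defined, you can apply them and confirm the resources exist. This is a convenience sketch; the manifest file name is a placeholder for wherever you saved the YAML above.
```bash
# Apply the InferencePool and InferenceModel manifests
# (lora-adapters.yaml is a placeholder file name).
kubectl apply -f lora-adapters.yaml

# List the resources to confirm they were created
kubectl get inferencepools
kubectl get inferencemodels
```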
Now, you can route requests from the gateway to the pool by using an `HTTPRoute` object.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway # placeholder: use the GatewayClass installed in your cluster
  listeners:
  - protocol: HTTP
    port: 80
    name: http
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: routes-to-llms
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: gemma3
      kind: InferencePool
      group: inference.networking.x-k8s.io
```
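Before sending traffic, you can check that the Gateway has been programmed and that the route was accepted. The resource names below match the manifests above.
```bash
# Confirm the Gateway has been assigned an address
kubectl get gateway inference-gateway

# Inspect the HTTPRoute status conditions (for example, Accepted and ResolvedRefs)
kubectl describe httproute routes-to-llms
```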

## Try it out

1. Get the gateway IP:
```bash
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
```
2. Send a few requests to model "english-bot" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "english-bot",
"prompt": "What is the color of the sky",
"max_tokens": 100,
"temperature": 0
}'
```
3. Send a few requests to model "spanish-bot" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "spanish-bot",
"prompt": "¿De qué color es...?",
"max_tokens": 100,
"temperature": 0
}'
```
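To see both adapters being served from the same pool, you can also send several requests in a row. This loop is only a convenience sketch that reuses the endpoint and model names from the steps above; the prompt is arbitrary.
```bash
# Send five requests to each adapter in turn (sketch; same endpoint as above)
for model in english-bot spanish-bot; do
  for i in $(seq 1 5); do
    curl -s ${IP}:${PORT}/v1/completions \
      -H 'Content-Type: application/json' \
      -d "{\"model\": \"${model}\", \"prompt\": \"Hello\", \"max_tokens\": 20, \"temperature\": 0}" \
      > /dev/null && echo "sent request ${i} to ${model}"
  done
done
```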