## Detailed Steps & Explanation

1. Create the namespace:

```bash
kubectl apply -f vllm-namespace.yaml
```

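The `vllm-namespace.yaml` manifest ships with the example and is not reproduced in this guide. A minimal sketch, assuming it only declares the `vllm-example` namespace used by every later command, would be:

```yaml
# Hypothetical minimal vllm-namespace.yaml; the example's actual file may add
# labels or annotations.
apiVersion: v1
kind: Namespace
metadata:
  name: vllm-example
```
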
2. Ensure Hugging Face permissions to retrieve the model:

```bash
# Env var HF_TOKEN contains your Hugging Face account token
kubectl create secret generic hf-secret -n vllm-example \
  --from-literal=hf_token=$HF_TOKEN
```

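If you prefer a declarative setup, the same Secret can be written as a manifest and applied with `kubectl apply`. This is an equivalent sketch rather than a file shipped with the example, and the token value is a placeholder:

```yaml
# Equivalent (hypothetical) Secret manifest; substitute your real token.
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: vllm-example
type: Opaque
stringData:
  hf_token: <your-hugging-face-token>
```
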
3. Apply the vLLM server:

```bash
kubectl apply -f vllm-deployment.yaml -n vllm-example
```

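The full `vllm-deployment.yaml` is not reproduced here. Based only on the names the commands in this guide rely on (deployment `vllm-gemma-deployment`, label `app=gemma-server`, secret `hf-secret`, model `google/gemma-3-1b-it`), a rough sketch might look like the following; the image, arguments, ports, and GPU request are assumptions, not the example's actual manifest:

```yaml
# Hypothetical sketch of vllm-deployment.yaml; only the names come from this
# guide, everything else is an assumption.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest        # assumed serving image
          args: ["--model", "google/gemma-3-1b-it"]
          ports:
            - containerPort: 8000               # assumed vLLM serving port
          env:
            - name: HUGGING_FACE_HUB_TOKEN      # lets the server pull the gated model
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_token
          resources:
            limits:
              nvidia.com/gpu: "1"               # assumed single-GPU request
```
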
- Wait for the deployment to reconcile, creating the vLLM pod(s):

```bash
kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment -n vllm-example
kubectl get pods -l app=gemma-server -w -n vllm-example
```

- View the vLLM pod logs:

```bash
kubectl logs -f -l app=gemma-server -n vllm-example
```

Expected output:

```
...
```

4. Create the service:

```bash
# ClusterIP service on port 8080 in front of the vLLM deployment
kubectl apply -f vllm-service.yaml -n vllm-example
```

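`vllm-service.yaml` is likewise not shown in this guide. Given the comment above (a ClusterIP service on port 8080 in front of the vLLM deployment), a plausible sketch is below; the target port is an assumption rather than a value taken from the example:

```yaml
# Hypothetical sketch of vllm-service.yaml; port 8080 matches the comment above,
# targetPort is an assumed container port.
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  type: ClusterIP
  selector:
    app: gemma-server
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8000
```
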
## Verification / Seeing it Work

1. Port forward the service:

```bash
# Forward a local port (e.g., 8080) to the service port (e.g., 8080)
kubectl port-forward service/vllm-service 8080:8080 -n vllm-example
```

2. Send a request to the local forwarding port:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
    "max_tokens": 100
  }'
```

## Cleanup

```bash
kubectl delete -f vllm-service.yaml -n vllm-example
kubectl delete -f vllm-deployment.yaml -n vllm-example
kubectl delete secret hf-secret -n vllm-example
kubectl delete -f vllm-namespace.yaml
```

---