From 4b58e5cf21ef4a97d16d78069baa0ce8c1d2e087 Mon Sep 17 00:00:00 2001
From: Kunjan
Date: Mon, 10 Feb 2025 18:07:15 -0800
Subject: [PATCH 01/13] Integrate dynamic-lora-sidecar into main guide and add
 makefile, cloudbuild to build and publish lora-syncer image

Signed-off-by: Kunjan
---
 site-src/guides/index.md | 68 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/site-src/guides/index.md b/site-src/guides/index.md
index e4cbec6f6..a0d368122 100644
--- a/site-src/guides/index.md
+++ b/site-src/guides/index.md
@@ -19,6 +19,74 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
    kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/vllm/deployment.yaml
    ```
+   **OPTIONALLY**: Enable Dynamic loading of Lora adapters.
+
+   [Deploy sample vllm deployment with Dynamic lora adapter enabled and Lora syncer sidecar](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/tools/dynamic-lora-sidecar/deployment.yaml)
+
+   ***Safely rollout v2 adapter***
+
+   1. Update lora configmap
+
+      ``` yaml
+
+      apiVersion: v1
+      kind: ConfigMap
+      metadata:
+        name: dynamic-lora-config
+      data:
+        configmap.yaml: |
+          vLLMLoRAConfig:
+            ensureExist:
+              models:
+              - id: chatbot-v1
+                source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v1
+              - id: chatbot-v2
+                source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v2
+      ```
+
+   2. Configure a canary rollout with traffic split using LLMService. In this example, 10% of traffic to the chatbot model will be sent to v2.
+
+      ``` yaml
+      model:
+        name: chatbot
+        targetModels:
+        targetModelName: chatbot-v1
+        weight: 90
+        targetModelName: chatbot-v2
+        weight: 10
+      ```
+
+   3. Finish rollout by setting the traffic to the new version 100%.
+ ```yaml + model: + name: chatbot + targetModels: + targetModelName: chatbot-v2 + weight: 100 + ``` + + 4. Remove v1 from dynamic lora configmap. + ```yaml + apiVersion: v1 + kind: ConfigMap + metadata: + name: dynamic-lora-config + data: + configmap.yaml: | + vLLMLoRAConfig: + ensureExist: + models: + - id: chatbot-v2 + source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v2 + ensureNotExist: # Explicitly unregisters the adapter from model servers + models: + - id: chatbot-v1 + source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v1 + ``` + + + + 1. **Install the Inference Extension CRDs:** ```sh From 985ed8ef8ae2cc33d7e152e1244e625501806f32 Mon Sep 17 00:00:00 2001 From: Kunjan Date: Mon, 10 Feb 2025 18:07:15 -0800 Subject: [PATCH 02/13] Add makefile and cloudbuild file to build and push lora-syncer Signed-off-by: Kunjan --- .../vllm/deployment-with-syncer.yaml | 158 ++++++++++++++++++ pkg/manifests/vllm/deployment.yaml | 47 ++---- site-src/guides/dynamic-lora.md | 79 +++++++++ site-src/guides/index.md | 64 ------- tools/dynamic-lora-sidecar/Makefile | 59 +++++++ tools/dynamic-lora-sidecar/cloudbuild.yaml | 17 ++ 6 files changed, 325 insertions(+), 99 deletions(-) create mode 100644 pkg/manifests/vllm/deployment-with-syncer.yaml create mode 100644 site-src/guides/dynamic-lora.md create mode 100644 tools/dynamic-lora-sidecar/Makefile create mode 100644 tools/dynamic-lora-sidecar/cloudbuild.yaml diff --git a/pkg/manifests/vllm/deployment-with-syncer.yaml b/pkg/manifests/vllm/deployment-with-syncer.yaml new file mode 100644 index 000000000..9359123dd --- /dev/null +++ b/pkg/manifests/vllm/deployment-with-syncer.yaml @@ -0,0 +1,158 @@ +apiVersion: v1 +kind: Service +metadata: + name: vllm-llama2-7b-pool +spec: + selector: + app: vllm-llama2-7b-pool + ports: + - protocol: TCP + port: 8000 + targetPort: 8000 + type: ClusterIP +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: vllm-llama2-7b-pool +spec: + replicas: 3 + selector: + matchLabels: + app: vllm-llama2-7b-pool + 
template: + metadata: + labels: + app: vllm-llama2-7b-pool + spec: + containers: + - name: lora + image: "vllm/vllm-openai:latest" + imagePullPolicy: Always + command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] + args: + - "--model" + - "meta-llama/Llama-2-7b-hf" + - "--tensor-parallel-size" + - "1" + - "--port" + - "8000" + - "--enable-lora" + - "--max-loras" + - "4" + - "--max-cpu-loras" + - "12" + - "--lora-modules" + - '{"name": "sql-lora-0", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "sql-lora-1", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "sql-lora-2", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "sql-lora-3", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "sql-lora-4", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary-2", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary-3", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary-4", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + - '{"name": "sql-lora", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + env: + - name: PORT + value: "8000" + - name: HUGGING_FACE_HUB_TOKEN + valueFrom: + secretKeyRef: + name: hf-token + key: token + - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING + value: "true" + ports: + - 
containerPort: 8000 + name: http + protocol: TCP + livenessProbe: + failureThreshold: 240 + httpGet: + path: /health + port: http + scheme: HTTP + initialDelaySeconds: 5 + periodSeconds: 5 + successThreshold: 1 + timeoutSeconds: 1 + readinessProbe: + failureThreshold: 600 + httpGet: + path: /health + port: http + scheme: HTTP + initialDelaySeconds: 5 + periodSeconds: 5 + successThreshold: 1 + timeoutSeconds: 1 + resources: + limits: + nvidia.com/gpu: 1 + requests: + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /data + name: data + - mountPath: /dev/shm + name: shm + - name: adapters + mountPath: "/adapters" + initContainers: + - name: lora-adapter-syncer + tty: true + stdin: true + image: #Replace image + restartPolicy: Always + imagePullPolicy: Always + env: + - name: DYNAMIC_LORA_ROLLOUT_CONFIG + value: "/config/configmap.yaml" + volumeMounts: # DO NOT USE subPath + - name: config-volume + mountPath: /config + restartPolicy: Always + schedulerName: default-scheduler + terminationGracePeriodSeconds: 30 + volumes: + - name: data + emptyDir: {} + - name: shm + emptyDir: + medium: Memory + - name: adapters + emptyDir: {} + - name: config-volume + configMap: + name: dynamic-lora-config + +--- + +apiVersion: v1 +kind: ConfigMap +metadata: + name: dynamic-lora-config +data: + configmap.yaml: | + vLLMLoRAConfig: + name: sql-loras-llama + port: 8000 + ensureExist: + models: + - base-model: meta-llama/Llama-2-7b-hf + id: sql-lora-v1 + source: yard1/llama-2-7b-sql-lora-test + - base-model: meta-llama/Llama-2-7b-hf + id: sql-lora-v3 + source: yard1/llama-2-7b-sql-lora-test + - base-model: meta-llama/Llama-2-7b-hf + id: sql-lora-v4 + source: yard1/llama-2-7b-sql-lora-test + ensureNotExist: + models: + - base-model: meta-llama/Llama-2-7b-hf + id: sql-lora-v2 + source: yard1/llama-2-7b-sql-lora-test \ No newline at end of file diff --git a/pkg/manifests/vllm/deployment.yaml b/pkg/manifests/vllm/deployment.yaml index 4af0891d7..8ea95365b 100644 --- 
a/pkg/manifests/vllm/deployment.yaml +++ b/pkg/manifests/vllm/deployment.yaml @@ -43,18 +43,18 @@ spec: - "--max-cpu-loras" - "12" - "--lora-modules" - - "sql-lora=/adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/" - - "tweet-summary=/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403" - - 'sql-lora-0=/adapters/yard1/llama-2-7b-sql-lora-test_0' - - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1' - - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2' - - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3' - - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4' - - 'tweet-summary-0=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0' - - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1' - - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2' - - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3' - - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4' + - '{"name": "sql-lora-0", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "sql-lora-1", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "sql-lora-2", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "sql-lora-3", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "sql-lora-4", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary-2", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", 
"base_model_name": "llama-2"}' + - '{"name": "tweet-summary-3", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary-4", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' + - '{"name": "sql-lora", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' + - '{"name": "tweet-summary", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' env: - name: PORT value: "8000" @@ -99,29 +99,6 @@ spec: name: shm - name: adapters mountPath: "/adapters" - initContainers: - - name: adapter-loader - image: ghcr.io/tomatillo-and-multiverse/adapter-puller:demo - command: ["python"] - args: - - ./pull_adapters.py - - --adapter - - yard1/llama-2-7b-sql-lora-test - - --adapter - - vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm - - --duplicate-count - - "5" - env: - - name: HF_TOKEN - valueFrom: - secretKeyRef: - name: hf-token - key: token - - name: HF_HOME - value: /adapters - volumeMounts: - - name: adapters - mountPath: "/adapters" restartPolicy: Always schedulerName: default-scheduler terminationGracePeriodSeconds: 30 diff --git a/site-src/guides/dynamic-lora.md b/site-src/guides/dynamic-lora.md new file mode 100644 index 000000000..5356c7e73 --- /dev/null +++ b/site-src/guides/dynamic-lora.md @@ -0,0 +1,79 @@ +# Getting started with Gateway API Inference Extension with Dynamic lora updates on vllm + +The goal of this guide is to get a single InferencePool running with VLLM and demonstrate use of dynamic lora updating ! + +### Requirements + - Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher + - A cluster with: + - Support for Services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running). For example, with Kind, + you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer). 
+  - 3 GPUs to run the sample model server. Adjust the number of replicas in `./manifests/vllm/deployment.yaml` as needed.
+
+### Steps
+
+1. **Deploy a sample vLLM model server with dynamic LoRA updates enabled and the LoRA syncer sidecar**
+   [Deploy the sample vLLM deployment with dynamic LoRA adapters enabled, plus the LoRA syncer sidecar and configmap](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/manifests/vllm/dynamic-lora-sidecar/deployment.yaml)
+
+The rest of the steps are the same as in the [general setup](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/site-src/guides/index.md).
+
+
+### Safely roll out the v2 adapter
+
+1. Update the LoRA configmap.
+
+``` yaml
+
+    apiVersion: v1
+    kind: ConfigMap
+    metadata:
+      name: dynamic-lora-config
+    data:
+      configmap.yaml: |
+        vLLMLoRAConfig:
+          ensureExist:
+            models:
+            - id: tweet-summary-v1
+              source: tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1
+            - id: tweet-summary-v2
+              source: tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2
+```
+
+2. Configure a canary rollout with traffic split using LLMService. In this example, 10% of traffic to the chatbot model will be sent to v2.
+
+``` yaml
+model:
+  name: chatbot
+  targetModels:
+  targetModelName: chatbot-v1
+  weight: 90
+  targetModelName: chatbot-v2
+  weight: 10
+```
+
+3. Finish rollout by setting the traffic to the new version 100%.
+```yaml
+model:
+  name: chatbot
+  targetModels:
+  targetModelName: chatbot-v2
+  weight: 100
+```
+
+4. Remove v1 from the dynamic LoRA configmap.
+```yaml + apiVersion: v1 + kind: ConfigMap + metadata: + name: dynamic-lora-config + data: + configmap.yaml: | + vLLMLoRAConfig: + ensureExist: + models: + - id: chatbot-v2 + source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v2 + ensureNotExist: # Explicitly unregisters the adapter from model servers + models: + - id: chatbot-v1 + source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v1 +``` diff --git a/site-src/guides/index.md b/site-src/guides/index.md index a0d368122..2cc971c61 100644 --- a/site-src/guides/index.md +++ b/site-src/guides/index.md @@ -19,70 +19,6 @@ This quickstart guide is intended for engineers familiar with k8s and model serv kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2 kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/vllm/deployment.yaml ``` - **OPTIONALLY**: Enable Dynamic loading of Lora adapters. - - [Deploy sample vllm deployment with Dynamic lora adapter enabled and Lora syncer sidecar](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/tools/dynamic-lora-sidecar/deployment.yaml) - - ***Safely rollout v2 adapter*** - - 1. Update lora configmap - - ``` yaml - - apiVersion: v1 - kind: ConfigMap - metadata: - name: dynamic-lora-config - data: - configmap.yaml: | - vLLMLoRAConfig: - ensureExist: - models: - - id: chatbot-v1 - source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v1 - - id: chatbot-v2 - source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v2 - ``` - - 2. Configure a canary rollout with traffic split using LLMService. In this example, 10% of traffic to the chatbot model will be sent to v2. - - ``` yaml - model: - name: chatbot - targetModels: - targetModelName: chatbot-v1 - weight: 90 - targetModelName: chatbot-v2 - weight: 10 - ``` - - 3. Finish rollout by setting the traffic to the new version 100%. 
- ```yaml - model: - name: chatbot - targetModels: - targetModelName: chatbot-v2 - weight: 100 - ``` - - 4. Remove v1 from dynamic lora configmap. - ```yaml - apiVersion: v1 - kind: ConfigMap - metadata: - name: dynamic-lora-config - data: - configmap.yaml: | - vLLMLoRAConfig: - ensureExist: - models: - - id: chatbot-v2 - source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v2 - ensureNotExist: # Explicitly unregisters the adapter from model servers - models: - - id: chatbot-v1 - source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v1 - ``` diff --git a/tools/dynamic-lora-sidecar/Makefile b/tools/dynamic-lora-sidecar/Makefile new file mode 100644 index 000000000..93f7672d2 --- /dev/null +++ b/tools/dynamic-lora-sidecar/Makefile @@ -0,0 +1,59 @@ +IMAGE_NAME := lora-syncer +IMAGE_REGISTRY ?= us-central1-docker.pkg.dev/k8s-staging-images/llm-instance-gateway +IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME) + +GIT_TAG ?= $(shell git describe --tags --dirty --always) +EXTRA_TAG ?= $(if $(_PULL_BASE_REF),$(_PULL_BASE_REF),main) +IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG) +EXTRA_IMAGE_TAG ?= $(IMAGE_REPO):$(EXTRA_TAG) + + +PLATFORMS ?= linux/amd64 + + +DOCKER_BUILDX_CMD ?= docker buildx +IMAGE_BUILD_CMD ?= $(DOCKER_BUILDX_CMD) build +IMAGE_BUILD_EXTRA_OPTS ?= + +# --- Targets --- +.PHONY: image-local-build +image-local-build: + BUILDER=$(shell $(DOCKER_BUILDX_CMD) create --use) + $(MAKE) image-build PUSH=$(PUSH) + $(DOCKER_BUILDX_CMD) rm $$BUILDER + +.PHONY: image-local-push +image-local-push: PUSH=--push +image-local-push: image-local-build + +.PHONY: image-build +image-build: + $(IMAGE_BUILD_CMD) -t $(IMAGE_TAG) \ + --platform=$(PLATFORMS) \ + --build-arg BASE_IMAGE=$(BASE_IMAGE) \ + --build-arg BUILDER_IMAGE=$(BUILDER_IMAGE) \ + $(PUSH) \ + $(IMAGE_BUILD_EXTRA_OPTS) ./ + +.PHONY: image-push +image-push: PUSH=--push +image-push: image-build + +.PHONY: run +run: + docker run -v $(CURDIR)/config:/config -u appuser $(IMAGE_TAG) # Use the user name + +.PHONY: clean +clean: + docker rmi $(IMAGE_TAG) 
$(EXTRA_IMAGE_TAG) 2>/dev/null || true + +.PHONY: clean-dangling +clean-dangling: + docker rmi $(docker images -f "dangling=true" -q) 2>/dev/null || true + +.PHONY: test +test: + python -m unittest discover + +.PHONY: all +all: test image-build \ No newline at end of file diff --git a/tools/dynamic-lora-sidecar/cloudbuild.yaml b/tools/dynamic-lora-sidecar/cloudbuild.yaml new file mode 100644 index 000000000..e91a238a6 --- /dev/null +++ b/tools/dynamic-lora-sidecar/cloudbuild.yaml @@ -0,0 +1,17 @@ +# See https://cloud.google.com/cloud-build/docs/build-config +timeout: 3000s + +steps: + - name: gcr.io/k8s-testimages/gcb-docker-gcloud:v20220830-45cbff55bc + entrypoint: make + args: + - image-push + env: + - GIT_TAG=$_GIT_TAG + - EXTRA_TAG=$_PULL_BASE_REF + - DOCKER_BUILDX_CMD=/buildx-entrypoint + +substitutions: + _GIT_TAG: '0.0.0' # Default value for Git tag + _PULL_BASE_REF: 'main' # Default value for branch/tag +# No options needed! \ No newline at end of file From 03b274136525bcdd138984d104840f0dbd49f85d Mon Sep 17 00:00:00 2001 From: Kunjan Date: Mon, 10 Feb 2025 18:07:15 -0800 Subject: [PATCH 03/13] Add makefile and cloudbuild file to build and push lora-syncer Signed-off-by: Kunjan --- Makefile | 29 ++++++++++++++++++++++ cloudbuild.yaml | 8 ++++++ tools/dynamic-lora-sidecar/cloudbuild.yaml | 17 ------------- 3 files changed, 37 insertions(+), 17 deletions(-) delete mode 100644 tools/dynamic-lora-sidecar/cloudbuild.yaml diff --git a/Makefile b/Makefile index b7654ed71..f2198844c 100644 --- a/Makefile +++ b/Makefile @@ -31,6 +31,10 @@ IMAGE_NAME := epp IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME) IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG) +SYNCER_IMAGE_NAME := lora-syncer +SYNCER_IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME) +SYNCER_IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG) + BASE_IMAGE ?= gcr.io/distroless/base-debian10 BUILDER_IMAGE ?= golang:1.23-alpine ifdef GO_VERSION @@ -163,6 +167,31 @@ image-build: ## Build the EPP image using Docker Buildx. 
image-push: PUSH=--push ## Build the EPP image and push it to $IMAGE_REPO. image-push: image-build +##@ Lora Syncer + +.PHONY: syncer-image-local-build +syncer-image-local-build: + BUILDER=$(shell $(DOCKER_BUILDX_CMD) create --use) + $(MAKE) image-build PUSH=$(PUSH) + $(DOCKER_BUILDX_CMD) rm $$BUILDER + +.PHONY: syncer-image-local-push +syncer-image-local-push: PUSH=--push +syncer-image-local-push: syncer-image-local-build + +.PHONY: syncer-image-build +syncer-image-build: + $ cd $(CURDIR)/tools/dynamic-lora-sidecar && $(IMAGE_BUILD_CMD) -t $(SYNCER_IMAGE_TAG) \ + --platform=$(PLATFORMS) \ + --build-arg BASE_IMAGE=$(BASE_IMAGE) \ + --build-arg BUILDER_IMAGE=$(BUILDER_IMAGE) \ + $(PUSH) \ + $(IMAGE_BUILD_EXTRA_OPTS) ./ + +.PHONY: syncer-image-push +syncer-image-push: PUSH=--push +syncer-image-push: syncer-image-build + .PHONY: image-load image-load: LOAD=--load ## Build the EPP image and load it in the local Docker registry. image-load: image-build diff --git a/cloudbuild.yaml b/cloudbuild.yaml index 2da147f4a..40e45923e 100644 --- a/cloudbuild.yaml +++ b/cloudbuild.yaml @@ -12,6 +12,14 @@ steps: - GIT_TAG=$_GIT_TAG - EXTRA_TAG=$_PULL_BASE_REF - DOCKER_BUILDX_CMD=/buildx-entrypoint + - name: lora-adapter-syncer + entrypoint: make + args: + - syncer-image-push + env: + - GIT_TAG=$_GIT_TAG + - EXTRA_TAG=$_PULL_BASE_REF + - DOCKER_BUILDX_CMD=/buildx-entrypoint substitutions: # _GIT_TAG will be filled with a git-based tag for the image, of the form vYYYYMMDD-hash, and # can be used as a substitution diff --git a/tools/dynamic-lora-sidecar/cloudbuild.yaml b/tools/dynamic-lora-sidecar/cloudbuild.yaml deleted file mode 100644 index e91a238a6..000000000 --- a/tools/dynamic-lora-sidecar/cloudbuild.yaml +++ /dev/null @@ -1,17 +0,0 @@ -# See https://cloud.google.com/cloud-build/docs/build-config -timeout: 3000s - -steps: - - name: gcr.io/k8s-testimages/gcb-docker-gcloud:v20220830-45cbff55bc - entrypoint: make - args: - - image-push - env: - - GIT_TAG=$_GIT_TAG - - 
EXTRA_TAG=$_PULL_BASE_REF - - DOCKER_BUILDX_CMD=/buildx-entrypoint - -substitutions: - _GIT_TAG: '0.0.0' # Default value for Git tag - _PULL_BASE_REF: 'main' # Default value for branch/tag -# No options needed! \ No newline at end of file From 3271c3f9dc878e8a0a0d666ea9c77f189944fce8 Mon Sep 17 00:00:00 2001 From: Kunjan Date: Thu, 13 Feb 2025 15:08:47 -0800 Subject: [PATCH 04/13] Update site-src/guides/dynamic-lora.md Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com> --- site-src/guides/dynamic-lora.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site-src/guides/dynamic-lora.md b/site-src/guides/dynamic-lora.md index 5356c7e73..0cfd514a3 100644 --- a/site-src/guides/dynamic-lora.md +++ b/site-src/guides/dynamic-lora.md @@ -38,7 +38,7 @@ Rest of the steps are same as [general setup](https://github.com/kubernetes-sigs source: tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2 ``` -2. Configure a canary rollout with traffic split using LLMService. In this example, 10% of traffic to the chatbot model will be sent to v2. +2. Configure a canary rollout with traffic split using InferenceModel. In this example, 10% of traffic to the chatbot model will be sent to `tweet-summary-3`. 
``` yaml model: From 62adbb1ee57708beed65b0245ad12fcecfa5c723 Mon Sep 17 00:00:00 2001 From: Kunjan Date: Thu, 13 Feb 2025 15:12:32 -0800 Subject: [PATCH 05/13] Update site-src/guides/dynamic-lora.md Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com> --- site-src/guides/dynamic-lora.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site-src/guides/dynamic-lora.md b/site-src/guides/dynamic-lora.md index 0cfd514a3..a842ebd5d 100644 --- a/site-src/guides/dynamic-lora.md +++ b/site-src/guides/dynamic-lora.md @@ -1,6 +1,6 @@ # Getting started with Gateway API Inference Extension with Dynamic lora updates on vllm -The goal of this guide is to get a single InferencePool running with VLLM and demonstrate use of dynamic lora updating ! +The goal of this guide is to get a single InferencePool running with vLLM and demonstrate use of dynamic lora updating! ### Requirements - Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher From 6f5b9e71fa09b1bedb6203c75b5921aad301e78f Mon Sep 17 00:00:00 2001 From: Kunjan Date: Mon, 10 Feb 2025 18:07:15 -0800 Subject: [PATCH 06/13] Add makefile and cloudbuild file to build and push lora-syncer Signed-off-by: Kunjan --- .../vllm/deployment-with-syncer.yaml | 25 ++------ pkg/manifests/vllm/deployment.yaml | 10 ---- site-src/guides/dynamic-lora.md | 58 ++++++++++++------- 3 files changed, 42 insertions(+), 51 deletions(-) diff --git a/pkg/manifests/vllm/deployment-with-syncer.yaml b/pkg/manifests/vllm/deployment-with-syncer.yaml index 9359123dd..b32d3eb14 100644 --- a/pkg/manifests/vllm/deployment-with-syncer.yaml +++ b/pkg/manifests/vllm/deployment-with-syncer.yaml @@ -43,18 +43,8 @@ spec: - "--max-cpu-loras" - "12" - "--lora-modules" - - '{"name": "sql-lora-0", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - '{"name": "sql-lora-1", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - 
'{"name": "sql-lora-2", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - '{"name": "sql-lora-3", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - '{"name": "sql-lora-4", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - - '{"name": "tweet-summary-2", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - - '{"name": "tweet-summary-3", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - - '{"name": "tweet-summary-4", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - - '{"name": "sql-lora", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - '{"name": "tweet-summary", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' env: - name: PORT value: "8000" @@ -143,16 +133,13 @@ data: ensureExist: models: - base-model: meta-llama/Llama-2-7b-hf - id: sql-lora-v1 - source: yard1/llama-2-7b-sql-lora-test + id: tweet-summary-0 + source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm - base-model: meta-llama/Llama-2-7b-hf - id: sql-lora-v3 - source: yard1/llama-2-7b-sql-lora-test - - base-model: meta-llama/Llama-2-7b-hf - id: sql-lora-v4 - source: yard1/llama-2-7b-sql-lora-test + id: tweet-summary-1 + source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm ensureNotExist: models: - base-model: meta-llama/Llama-2-7b-hf - id: sql-lora-v2 - source: yard1/llama-2-7b-sql-lora-test \ No newline at end of file + id: tweet-summary-2 + source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm \ No newline at end of file diff --git a/pkg/manifests/vllm/deployment.yaml 
b/pkg/manifests/vllm/deployment.yaml index 8ea95365b..1d115f4d4 100644 --- a/pkg/manifests/vllm/deployment.yaml +++ b/pkg/manifests/vllm/deployment.yaml @@ -43,18 +43,8 @@ spec: - "--max-cpu-loras" - "12" - "--lora-modules" - - '{"name": "sql-lora-0", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - '{"name": "sql-lora-1", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - '{"name": "sql-lora-2", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - '{"name": "sql-lora-3", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - '{"name": "sql-lora-4", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - - '{"name": "tweet-summary-2", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - - '{"name": "tweet-summary-3", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - - '{"name": "tweet-summary-4", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' - - '{"name": "sql-lora", "path": "yard1/llama-2-7b-sql-lora-test", "base_model_name": "llama-2"}' - - '{"name": "tweet-summary", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' env: - name: PORT value: "8000" diff --git a/site-src/guides/dynamic-lora.md b/site-src/guides/dynamic-lora.md index a842ebd5d..a4f8ba0b9 100644 --- a/site-src/guides/dynamic-lora.md +++ b/site-src/guides/dynamic-lora.md @@ -29,33 +29,40 @@ Rest of the steps are same as [general setup](https://github.com/kubernetes-sigs name: dynamic-lora-config data: configmap.yaml: | - vLLMLoRAConfig: - ensureExist: - models: - 
- id: tweet-summary-v1 - source: tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1 - - id: tweet-summary-v2 - source: tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2 + vLLMLoRAConfig: + name: sql-loras-llama + port: 8000 + ensureExist: + models: + - base-model: meta-llama/Llama-2-7b-hf + id: tweet-summary-0 + source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm + - base-model: meta-llama/Llama-2-7b-hf + id: tweet-summary-1 + source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm ``` -2. Configure a canary rollout with traffic split using InferenceModel. In this example, 10% of traffic to the chatbot model will be sent to `tweet-summary-3`. +2. Configure a canary rollout with traffic split using LLMService. In this example, 40% of traffic for tweet-summary model will be sent to the ***tweet-summary-2*** adapter . ``` yaml model: - name: chatbot + name: tweet-summary targetModels: - targetModelName: chatbot-v1 - weight: 90 - targetModelName: chatbot-v2 + targetModelName: tweet-summary-0 weight: 10 + targetModelName: tweet-summary-1 + weight: 40 + targetModelName: tweet-summary-2 + weight: 40 + ``` 3. Finish rollout by setting the traffic to the new version 100%. 
```yaml model: - name: chatbot + name: tweet-summary targetModels: - targetModelName: chatbot-v2 + targetModelName: tweet-summary-2 weight: 100 ``` @@ -68,12 +75,19 @@ model: data: configmap.yaml: | vLLMLoRAConfig: - ensureExist: - models: - - id: chatbot-v2 - source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v2 - ensureNotExist: # Explicitly unregisters the adapter from model servers - models: - - id: chatbot-v1 - source: gs://[TEAM-A-MODELS-BUCKET]/chatbot-v1 + name: sql-loras-llama + port: 8000 + ensureExist: + models: + - base-model: meta-llama/Llama-2-7b-hf + id: tweet-summary-2 + source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm + ensureNotExist: + models: + - base-model: meta-llama/Llama-2-7b-hf + id: tweet-summary-1 + source: gs://[HUGGING FACE PATH] + - base-model: meta-llama/Llama-2-7b-hf + id: tweet-summary-0 + source: gs://[HUGGING FACE PATH] ``` From 78b9bfecf2939430ffd70367f2acd979e0cdd5b4 Mon Sep 17 00:00:00 2001 From: Daneyon Hansen Date: Thu, 13 Feb 2025 18:20:20 -0500 Subject: [PATCH 07/13] Adds image-load and kind-load Make targets (#288) Signed-off-by: Daneyon Hansen --- Makefile | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/Makefile b/Makefile index f2198844c..f6edb6300 100644 --- a/Makefile +++ b/Makefile @@ -167,6 +167,14 @@ image-build: ## Build the EPP image using Docker Buildx. image-push: PUSH=--push ## Build the EPP image and push it to $IMAGE_REPO. image-push: image-build +.PHONY: image-load +image-load: LOAD=--load ## Build the EPP image and load it in the local Docker registry. +image-load: image-build + +.PHONY: image-kind +image-kind: image-build ## Build the EPP image and load it to kind cluster $KIND_CLUSTER ("kind" by default). 
+ kind load docker-image $(IMAGE_TAG) --name $(KIND_CLUSTER) + ##@ Lora Syncer .PHONY: syncer-image-local-build From 9c367f9bc75587f1c829b091f9297bb8417093a7 Mon Sep 17 00:00:00 2001 From: Kunjan Date: Mon, 10 Feb 2025 18:07:15 -0800 Subject: [PATCH 08/13] Add makefile and cloudbuild file to build and push lora-syncer Signed-off-by: Kunjan --- Makefile | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/Makefile b/Makefile index f6edb6300..312be9e7c 100644 --- a/Makefile +++ b/Makefile @@ -167,6 +167,31 @@ image-build: ## Build the EPP image using Docker Buildx. image-push: PUSH=--push ## Build the EPP image and push it to $IMAGE_REPO. image-push: image-build +##@ Lora Syncer + +.PHONY: syncer-image-local-build +syncer-image-local-build: + BUILDER=$(shell $(DOCKER_BUILDX_CMD) create --use) + $(MAKE) image-build PUSH=$(PUSH) + $(DOCKER_BUILDX_CMD) rm $$BUILDER + +.PHONY: syncer-image-local-push +syncer-image-local-push: PUSH=--push +syncer-image-local-push: syncer-image-local-build + +.PHONY: syncer-image-build +syncer-image-build: + $ cd $(CURDIR)/tools/dynamic-lora-sidecar && $(IMAGE_BUILD_CMD) -t $(SYNCER_IMAGE_TAG) \ + --platform=$(PLATFORMS) \ + --build-arg BASE_IMAGE=$(BASE_IMAGE) \ + --build-arg BUILDER_IMAGE=$(BUILDER_IMAGE) \ + $(PUSH) \ + $(IMAGE_BUILD_EXTRA_OPTS) ./ + +.PHONY: syncer-image-push +syncer-image-push: PUSH=--push +syncer-image-push: syncer-image-build + .PHONY: image-load image-load: LOAD=--load ## Build the EPP image and load it in the local Docker registry. 
image-load: image-build From 5b31a4cd4d98f38431a805c1204c14ac1e4991f4 Mon Sep 17 00:00:00 2001 From: Kunjan Date: Thu, 13 Feb 2025 16:00:28 -0800 Subject: [PATCH 09/13] Add build targets for lora syncer Signed-off-by: Kunjan --- Makefile | 38 ++----------------- site-src/guides/dynamic-lora.md | 5 ++- tools/dynamic-lora-sidecar/Makefile | 59 ----------------------------- 3 files changed, 8 insertions(+), 94 deletions(-) delete mode 100644 tools/dynamic-lora-sidecar/Makefile diff --git a/Makefile b/Makefile index 312be9e7c..348bdd1f5 100644 --- a/Makefile +++ b/Makefile @@ -26,6 +26,7 @@ PLATFORMS ?= linux/amd64 DOCKER_BUILDX_CMD ?= docker buildx IMAGE_BUILD_CMD ?= $(DOCKER_BUILDX_CMD) build IMAGE_BUILD_EXTRA_OPTS ?= +SYNCER_IMAGE_BUILD_EXTRA_OPTS ?= IMAGE_REGISTRY ?= us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension IMAGE_NAME := epp IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME) @@ -43,9 +44,11 @@ endif ifdef EXTRA_TAG IMAGE_EXTRA_TAG ?= $(IMAGE_REPO):$(EXTRA_TAG) +SYNCER_IMAGE_EXTRA_TAG ?= $(SYNCER_IMAGE_REPO):$(EXTRA_TAG) endif ifdef IMAGE_EXTRA_TAG IMAGE_BUILD_EXTRA_OPTS += -t $(IMAGE_EXTRA_TAG) +SYNCER_IMAGE_BUILD_EXTRA_OPTS += -t $(SYNCER_IMAGE_EXTRA_TAG) endif # The name of the kind cluster to use for the "kind-load" target. @@ -167,31 +170,6 @@ image-build: ## Build the EPP image using Docker Buildx. image-push: PUSH=--push ## Build the EPP image and push it to $IMAGE_REPO. 
image-push: image-build -##@ Lora Syncer - -.PHONY: syncer-image-local-build -syncer-image-local-build: - BUILDER=$(shell $(DOCKER_BUILDX_CMD) create --use) - $(MAKE) image-build PUSH=$(PUSH) - $(DOCKER_BUILDX_CMD) rm $$BUILDER - -.PHONY: syncer-image-local-push -syncer-image-local-push: PUSH=--push -syncer-image-local-push: syncer-image-local-build - -.PHONY: syncer-image-build -syncer-image-build: - $ cd $(CURDIR)/tools/dynamic-lora-sidecar && $(IMAGE_BUILD_CMD) -t $(SYNCER_IMAGE_TAG) \ - --platform=$(PLATFORMS) \ - --build-arg BASE_IMAGE=$(BASE_IMAGE) \ - --build-arg BUILDER_IMAGE=$(BUILDER_IMAGE) \ - $(PUSH) \ - $(IMAGE_BUILD_EXTRA_OPTS) ./ - -.PHONY: syncer-image-push -syncer-image-push: PUSH=--push -syncer-image-push: syncer-image-build - .PHONY: image-load image-load: LOAD=--load ## Build the EPP image and load it in the local Docker registry. image-load: image-build @@ -219,20 +197,12 @@ syncer-image-build: --build-arg BASE_IMAGE=$(BASE_IMAGE) \ --build-arg BUILDER_IMAGE=$(BUILDER_IMAGE) \ $(PUSH) \ - $(IMAGE_BUILD_EXTRA_OPTS) ./ + $(SYNCER_IMAGE_BUILD_EXTRA_OPTS) ./ .PHONY: syncer-image-push syncer-image-push: PUSH=--push syncer-image-push: syncer-image-build -.PHONY: image-load -image-load: LOAD=--load ## Build the EPP image and load it in the local Docker registry. -image-load: image-build - -.PHONY: image-kind -image-kind: image-build ## Build the EPP image and load it to kind cluster $KIND_CLUSTER ("kind" by default). 
- kind load docker-image $(IMAGE_TAG) --name $(KIND_CLUSTER) - ##@ Docs .PHONY: build-docs diff --git a/site-src/guides/dynamic-lora.md b/site-src/guides/dynamic-lora.md index a4f8ba0b9..948c2d365 100644 --- a/site-src/guides/dynamic-lora.md +++ b/site-src/guides/dynamic-lora.md @@ -40,6 +40,9 @@ Rest of the steps are same as [general setup](https://github.com/kubernetes-sigs - base-model: meta-llama/Llama-2-7b-hf id: tweet-summary-1 source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm + - base-model: meta-llama/Llama-2-7b-hf + id: tweet-summary-2 + source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm ``` 2. Configure a canary rollout with traffic split using LLMService. In this example, 40% of traffic for tweet-summary model will be sent to the ***tweet-summary-2*** adapter . @@ -49,7 +52,7 @@ model: name: tweet-summary targetModels: targetModelName: tweet-summary-0 - weight: 10 + weight: 20 targetModelName: tweet-summary-1 weight: 40 targetModelName: tweet-summary-2 diff --git a/tools/dynamic-lora-sidecar/Makefile b/tools/dynamic-lora-sidecar/Makefile deleted file mode 100644 index 93f7672d2..000000000 --- a/tools/dynamic-lora-sidecar/Makefile +++ /dev/null @@ -1,59 +0,0 @@ -IMAGE_NAME := lora-syncer -IMAGE_REGISTRY ?= us-central1-docker.pkg.dev/k8s-staging-images/llm-instance-gateway -IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME) - -GIT_TAG ?= $(shell git describe --tags --dirty --always) -EXTRA_TAG ?= $(if $(_PULL_BASE_REF),$(_PULL_BASE_REF),main) -IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG) -EXTRA_IMAGE_TAG ?= $(IMAGE_REPO):$(EXTRA_TAG) - - -PLATFORMS ?= linux/amd64 - - -DOCKER_BUILDX_CMD ?= docker buildx -IMAGE_BUILD_CMD ?= $(DOCKER_BUILDX_CMD) build -IMAGE_BUILD_EXTRA_OPTS ?= - -# --- Targets --- -.PHONY: image-local-build -image-local-build: - BUILDER=$(shell $(DOCKER_BUILDX_CMD) create --use) - $(MAKE) image-build PUSH=$(PUSH) - $(DOCKER_BUILDX_CMD) rm $$BUILDER - -.PHONY: image-local-push -image-local-push: PUSH=--push -image-local-push: 
image-local-build - -.PHONY: image-build -image-build: - $(IMAGE_BUILD_CMD) -t $(IMAGE_TAG) \ - --platform=$(PLATFORMS) \ - --build-arg BASE_IMAGE=$(BASE_IMAGE) \ - --build-arg BUILDER_IMAGE=$(BUILDER_IMAGE) \ - $(PUSH) \ - $(IMAGE_BUILD_EXTRA_OPTS) ./ - -.PHONY: image-push -image-push: PUSH=--push -image-push: image-build - -.PHONY: run -run: - docker run -v $(CURDIR)/config:/config -u appuser $(IMAGE_TAG) # Use the user name - -.PHONY: clean -clean: - docker rmi $(IMAGE_TAG) $(EXTRA_IMAGE_TAG) 2>/dev/null || true - -.PHONY: clean-dangling -clean-dangling: - docker rmi $(docker images -f "dangling=true" -q) 2>/dev/null || true - -.PHONY: test -test: - python -m unittest discover - -.PHONY: all -all: test image-build \ No newline at end of file From 2846d6a3b24c3245d025bebe2a25170cd4e89ab4 Mon Sep 17 00:00:00 2001 From: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com> Date: Fri, 14 Feb 2025 13:06:58 -0800 Subject: [PATCH 10/13] Apply suggestions from code review --- Makefile | 4 ++-- pkg/manifests/vllm/deployment-with-syncer.yaml | 2 +- site-src/guides/dynamic-lora.md | 1 - 3 files changed, 3 insertions(+), 4 deletions(-) diff --git a/Makefile b/Makefile index 348bdd1f5..1d8fc531c 100644 --- a/Makefile +++ b/Makefile @@ -33,8 +33,8 @@ IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME) IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG) SYNCER_IMAGE_NAME := lora-syncer -SYNCER_IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME) -SYNCER_IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG) +SYNCER_IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(SYNCER_IMAGE_NAME) +SYNCER_IMAGE_TAG ?= $(SYNCER_IMAGE_REPO):$(GIT_TAG) BASE_IMAGE ?= gcr.io/distroless/base-debian10 BUILDER_IMAGE ?= golang:1.23-alpine diff --git a/pkg/manifests/vllm/deployment-with-syncer.yaml b/pkg/manifests/vllm/deployment-with-syncer.yaml index b32d3eb14..d6110f4b1 100644 --- a/pkg/manifests/vllm/deployment-with-syncer.yaml +++ b/pkg/manifests/vllm/deployment-with-syncer.yaml @@ -95,7 +95,7 @@ spec: - name: lora-adapter-syncer tty: true stdin: 
true - image: #Replace image + image: us-central1-docker.pkg.dev/ahg-gke-dev/jobset2/lora-syncer:6dc97be restartPolicy: Always imagePullPolicy: Always env: diff --git a/site-src/guides/dynamic-lora.md b/site-src/guides/dynamic-lora.md index 948c2d365..e2396d69b 100644 --- a/site-src/guides/dynamic-lora.md +++ b/site-src/guides/dynamic-lora.md @@ -22,7 +22,6 @@ Rest of the steps are same as [general setup](https://github.com/kubernetes-sigs 1. Update lora configmap ``` yaml - apiVersion: v1 kind: ConfigMap metadata: From 6bbbacb9585b2d52c38276f76b8f17b57f68f947 Mon Sep 17 00:00:00 2001 From: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com> Date: Fri, 14 Feb 2025 13:09:10 -0800 Subject: [PATCH 11/13] Apply suggestions from code review --- site-src/guides/dynamic-lora.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/site-src/guides/dynamic-lora.md b/site-src/guides/dynamic-lora.md index e2396d69b..f10bb47f8 100644 --- a/site-src/guides/dynamic-lora.md +++ b/site-src/guides/dynamic-lora.md @@ -12,16 +12,16 @@ The goal of this guide is to get a single InferencePool running with vLLM and de ### Steps 1. **Deploy Sample VLLM Model Server with dynamic lora update enabled and dynamic lora syncer sidecar ** - [Deploy sample vllm deployment with Dynamic lora adapter enabled and Lora syncer sidecar and configmap](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/manifests/vllm/dynamic-lora-sidecar/deployment.yaml) + [Redeploy the vLLM deployment with Dynamic lora adapter enabled and Lora syncer sidecar and configmap](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/manifests/vllm/dynamic-lora-sidecar/deployment.yaml) Rest of the steps are same as [general setup](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/site-src/guides/index.md) ### Safely rollout v2 adapter -1. Update lora configmap +1. 
Update the LoRA syncer ConfigMap to make the new adapter version available on the model servers. -``` yaml +```yaml apiVersion: v1 kind: ConfigMap metadata: @@ -46,7 +46,7 @@ Rest of the steps are same as [general setup](https://github.com/kubernetes-sigs 2. Configure a canary rollout with traffic split using LLMService. In this example, 40% of traffic for tweet-summary model will be sent to the ***tweet-summary-2*** adapter . -``` yaml +```yaml model: name: tweet-summary targetModels: From ebfaa6ef2c6d500ffe18300eb07901cb2efc3049 Mon Sep 17 00:00:00 2001 From: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com> Date: Fri, 14 Feb 2025 13:10:25 -0800 Subject: [PATCH 12/13] Apply suggestions from code review --- site-src/guides/dynamic-lora.md | 1 - 1 file changed, 1 deletion(-) diff --git a/site-src/guides/dynamic-lora.md b/site-src/guides/dynamic-lora.md index f10bb47f8..0f9c31893 100644 --- a/site-src/guides/dynamic-lora.md +++ b/site-src/guides/dynamic-lora.md @@ -42,7 +42,6 @@ Rest of the steps are same as [general setup](https://github.com/kubernetes-sigs - base-model: meta-llama/Llama-2-7b-hf id: tweet-summary-2 source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm - ``` 2. Configure a canary rollout with traffic split using LLMService. In this example, 40% of traffic for tweet-summary model will be sent to the ***tweet-summary-2*** adapter . 
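The canary examples edited throughout this series split traffic for the `tweet-summary` model across adapter versions by relative `weight` (e.g. 90/10 for a 10% canary, then a single entry at 100 to finish the rollout). As a hypothetical illustration only — `pick_target` and its weight-normalization behavior are a sketch of the intended semantics, not code from this series or from the LLMService implementation — the split can be modeled as:

```python
import random

def pick_target(target_models, r=None):
    """Pick a target model name given (name, weight) pairs.

    Weights are relative, as in the LLMService examples in this series
    (e.g. tweet-summary-1: 90, tweet-summary-2: 10 for a 10% canary).
    `r` is a number in [0, 1); if omitted, one is drawn at random.
    """
    if r is None:
        r = random.random()
    total = sum(weight for _, weight in target_models)
    cumulative = 0.0
    for name, weight in target_models:
        cumulative += weight / total
        if r < cumulative:
            return name
    return target_models[-1][0]  # guard against floating-point rounding

canary = [("tweet-summary-1", 90), ("tweet-summary-2", 10)]
# r below 0.9 falls in the stable adapter's share; above it, the canary's.
print(pick_target(canary, r=0.50))  # tweet-summary-1
print(pick_target(canary, r=0.95))  # tweet-summary-2
```

Under this model, the final rollout step in the guide corresponds to collapsing the list to a single entry (`tweet-summary-2` at weight 100), after which the old adapter can be moved to `ensureNotExist` in the syncer ConfigMap.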
From 277125c4b72f531f266f6350f030db7650192c38 Mon Sep 17 00:00:00 2001 From: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com> Date: Fri, 14 Feb 2025 13:11:38 -0800 Subject: [PATCH 13/13] Apply suggestions from code review --- site-src/guides/dynamic-lora.md | 1 - 1 file changed, 1 deletion(-) diff --git a/site-src/guides/dynamic-lora.md b/site-src/guides/dynamic-lora.md index 0f9c31893..ef3c2b0f8 100644 --- a/site-src/guides/dynamic-lora.md +++ b/site-src/guides/dynamic-lora.md @@ -42,7 +42,6 @@ Rest of the steps are same as [general setup](https://github.com/kubernetes-sigs - base-model: meta-llama/Llama-2-7b-hf id: tweet-summary-2 source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm - 2. Configure a canary rollout with traffic split using LLMService. In this example, 40% of traffic for tweet-summary model will be sent to the ***tweet-summary-2*** adapter . ```yaml