# Enhancements to LLM Instance Gateway: Scheduling Logic, and Documentation Updates #78
```diff
@@ -121,4 +121,4 @@ spec:
       emptyDir:
         medium: Memory
     - name: adapters
-      emptyDir: {}
+      emptyDir: {}
```
---

````diff
@@ -7,9 +7,14 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
 1. **Deploy Sample vLLM Application**
 
-   A sample vLLM deployment with the proper protocol to work with LLM Instance Gateway can be found [here](https://github.com/kubernetes-sigs/llm-instance-gateway/blob/6f9869d6595d2d0f8e6febcbec0f348cb44a3012/examples/poc/manifests/samples/vllm-lora-deployment.yaml#L18).
+   A sample vLLM deployment with the proper protocol to work with LLM Instance Gateway can be found [here](https://github.com/kubernetes-sigs/llm-instance-gateway/tree/main/examples/poc/manifests/vllm/vllm-lora-deployment.yaml#L18).
 
-1. **Update Envoy Gateway Config to enable Patch Policy**
+2. **Deploy LLM Service and LLMServerPool**
+
+   You can find a sample LLM service and LLMServerPool configuration, based on the vLLM deployments mentioned above, [here](https://github.com/kubernetes-sigs/llm-instance-gateway/tree/main/examples/poc/manifests/llmservice.yaml).
+
+3. **Update Envoy Gateway Config to enable Patch Policy**
 
    Our custom LLM Gateway ext-proc is patched into the existing envoy gateway via `EnvoyPatchPolicy`. To enable this feature, we must extend the Envoy Gateway config map. To do this, simply run:
    ```bash
@@ -20,26 +25,25 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
    Additionally, if you would like to enable the admin interface, you can uncomment the admin lines and run this again.
 
-1. **Deploy Gateway**
+4. **Deploy Gateway**
 
    ```bash
    kubectl apply -f ./manifests/gateway.yaml
   ```
 
-1. **Deploy Ext-Proc**
+5. **Deploy Ext-Proc**
 
   ```bash
   kubectl apply -f ./manifests/ext_proc.yaml
   kubectl apply -f ./manifests/patch_policy.yaml
   ```
   **NOTE**: Ensure the `instance-gateway-ext-proc` deployment is updated with the pod names and internal IP addresses of the vLLM replicas. This step is crucial for the correct routing of requests based on headers. This won't be needed once we make ext proc dynamically read the pods.
 
-1. **Try it out**
+6. **Try it out**
 
   Wait until the gateway is ready.
 
   ```bash
-  IP=$(kubectl get gateway/llm-gateway -o jsonpath='{.status.addresses[0].value}')
+  IP=$(kubectl get gateway/instance-gateway -o jsonpath='{.status.addresses[0].value}')
   PORT=8081
 
   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
@@ -48,4 +52,11 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
   "max_tokens": 100,
   "temperature": 0
   }'
-  ```
+  ```
+
+
+## Scheduling Package in Ext Proc
+The scheduling package implements request scheduling algorithms for load balancing requests across backend pods in an inference gateway. The scheduler ensures efficient resource utilization while maintaining low latency and prioritizing critical requests. It applies a series of filters based on metrics and heuristics to select the best pod for a given request.
+
+# Flowchart
+<img src="../docs/schedular-flowchart.png" alt="Scheduling Algorithm" width="400" />
````
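The filters referenced throughout this PR form a decision tree: each node prunes the candidate pod list and hands control to one of its branches depending on whether any pods survived. The sketch below illustrates that structure. The field names (`name`, `filter`, `nextOnSuccess`, `nextOnFailure`, `nextOnSuccessOrFailure`) come from this diff; the simplified types and the `Filter` traversal method are assumptions about the wiring, not the PR's actual code.

```go
package scheduling

// Local, simplified stand-ins for the real request and pod-metrics types;
// only the fields used by the filters in this PR are sketched.
type LLMRequest struct {
	ResolvedTargetModel string
	Critical            bool
}

type PodMetrics struct {
	Name             string
	WaitingQueueSize int
	KVCacheUsage     float64
	ActiveModels     map[string]int
	MaxActiveModels  int
}

// filterFunc narrows a candidate pod list for a request.
type filterFunc func(req *LLMRequest, pods []*PodMetrics) ([]*PodMetrics, error)

// filter is one node of the decision tree; exactly one branch is followed per request.
type filter struct {
	name                   string
	filter                 filterFunc
	nextOnSuccess          *filter // followed when at least one pod survives
	nextOnFailure          *filter // followed when no pod survives (or on error)
	nextOnSuccessOrFailure *filter // shorthand when both branches are identical
}

// Filter runs this node, then recurses into the branch selected by the outcome.
func (f *filter) Filter(req *LLMRequest, pods []*PodMetrics) ([]*PodMetrics, error) {
	filtered, err := f.filter(req, pods)
	success := err == nil && len(filtered) > 0

	next := f.nextOnSuccessOrFailure
	if next == nil {
		if success {
			next = f.nextOnSuccess
		} else {
			next = f.nextOnFailure
		}
	}
	if next == nil { // leaf node: return whatever survived
		return filtered, err
	}
	if !success {
		filtered = pods // a failure branch restarts from the full candidate list
	}
	return next.Filter(req, filtered)
}
```

`nextOnSuccessOrFailure` is a convenience for nodes whose two branches coincide, such as the queue/KV-cache chains defined later in this diff.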
---

```diff
@@ -12,24 +12,53 @@ func (s *Server) HandleResponseHeaders(reqCtx *RequestContext, req *extProcPb.Pr
 	h := req.Request.(*extProcPb.ProcessingRequest_ResponseHeaders)
 	klog.V(3).Infof("Headers before: %+v\n", h)
 
-	resp := &extProcPb.ProcessingResponse{
-		Response: &extProcPb.ProcessingResponse_ResponseHeaders{
-			ResponseHeaders: &extProcPb.HeadersResponse{
-				Response: &extProcPb.CommonResponse{
-					HeaderMutation: &extProcPb.HeaderMutation{
-						SetHeaders: []*configPb.HeaderValueOption{
-							{
-								Header: &configPb.HeaderValue{
-									// This is for debugging purpose only.
-									Key:      "x-went-into-resp-headers",
-									RawValue: []byte("true"),
-								},
-							},
-						},
-					},
-				},
-			},
-		},
-	}
+	var resp *extProcPb.ProcessingResponse
+	if reqCtx.TargetPod != nil {
+		resp = &extProcPb.ProcessingResponse{
+			Response: &extProcPb.ProcessingResponse_ResponseHeaders{
+				ResponseHeaders: &extProcPb.HeadersResponse{
+					Response: &extProcPb.CommonResponse{
+						HeaderMutation: &extProcPb.HeaderMutation{
+							SetHeaders: []*configPb.HeaderValueOption{
+								{
+									Header: &configPb.HeaderValue{
+										// This is for debugging purpose only.
+										Key:      "x-went-into-resp-headers",
+										RawValue: []byte("true"),
+									},
+								},
+								{
+									Header: &configPb.HeaderValue{
+										Key:      "target-pod",
+										RawValue: []byte(reqCtx.TargetPod.Address),
+									},
+								},
+							},
+						},
+					},
+				},
+			},
+		}
+	} else {
+		resp = &extProcPb.ProcessingResponse{
+			Response: &extProcPb.ProcessingResponse_ResponseHeaders{
+				ResponseHeaders: &extProcPb.HeadersResponse{
+					Response: &extProcPb.CommonResponse{
+						HeaderMutation: &extProcPb.HeaderMutation{
+							SetHeaders: []*configPb.HeaderValueOption{
+								{
+									Header: &configPb.HeaderValue{
+										// This is for debugging purpose only.
+										Key:      "x-went-into-resp-headers",
+										RawValue: []byte("true"),
+									},
+								},
+							},
+						},
+					},
+				},
+			},
+		}
+	}
 	return resp, nil
 }
```
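The two branches above build the same response envelope and differ only in the extra `target-pod` header. If the duplication grows, the header list could be assembled first and wrapped once. A sketch of that refactoring follows; `buildResponseHeaders` is a hypothetical helper, not code from this PR, though it uses only the `extProcPb` and `configPb` types the file already imports.

```go
// buildResponseHeaders is a hypothetical helper (not in this PR) that collapses
// the if/else above: append headers conditionally, then wrap the list once.
func buildResponseHeaders(reqCtx *RequestContext) *extProcPb.ProcessingResponse {
	headers := []*configPb.HeaderValueOption{
		{
			Header: &configPb.HeaderValue{
				// This is for debugging purpose only.
				Key:      "x-went-into-resp-headers",
				RawValue: []byte("true"),
			},
		},
	}
	// Only advertise the target pod when the scheduler actually picked one.
	if reqCtx.TargetPod != nil {
		headers = append(headers, &configPb.HeaderValueOption{
			Header: &configPb.HeaderValue{
				Key:      "target-pod",
				RawValue: []byte(reqCtx.TargetPod.Address),
			},
		})
	}
	return &extProcPb.ProcessingResponse{
		Response: &extProcPb.ProcessingResponse_ResponseHeaders{
			ResponseHeaders: &extProcPb.HeadersResponse{
				Response: &extProcPb.CommonResponse{
					HeaderMutation: &extProcPb.HeaderMutation{SetHeaders: headers},
				},
			},
		},
	}
}
```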
---

```diff
@@ -121,6 +121,11 @@ func leastQueuingFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*bac
 	return filtered, nil
 }
 
+// lowQueueingPodPredicate checks whether a pod's waiting queue is below the LoRA queueing threshold.
+func lowQueueingPodPredicate(_ *LLMRequest, pod *backend.PodMetrics) bool {
+	return pod.WaitingQueueSize < queueingThresholdLoRA
+}
+
 // leastKVCacheFilterFunc finds the max and min KV cache of all pods, divides the whole range
 // (max-min) by the number of pods, and finds the pods that fall into the first range.
 // The intuition is that if there are multiple pods that share similar KV cache in the low range, we
@@ -159,6 +164,17 @@ func lowLoRACostPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
 	return ok || len(pod.ActiveModels) < pod.MaxActiveModels
 }
 
+// loRAAffinityPredicate is a filter function to check whether a pod has affinity to the lora requested.
+func loRAAffinityPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
+	_, ok := pod.ActiveModels[req.ResolvedTargetModel]
+	return ok
+}
+
+// canAcceptNewLoraPredicate is a filter function to check whether a pod has room to load the adapter.
+func canAcceptNewLoraPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
+	return len(pod.ActiveModels) < pod.MaxActiveModels
+}
+
 func criticalRequestPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
 	return req.Critical
 }
```
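To make the predicate semantics concrete, here is a small self-contained example with hypothetical values. The types are local stand-ins for `LLMRequest` and `backend.PodMetrics` carrying only the fields the predicates read, and the adapter names are made up.

```go
package main

import "fmt"

type LLMRequest struct {
	ResolvedTargetModel string
	Critical            bool
}

type PodMetrics struct {
	WaitingQueueSize int
	ActiveModels     map[string]int
	MaxActiveModels  int
}

const queueingThresholdLoRA = 50 // same heuristic threshold as in the PR

func lowQueueingPodPredicate(_ *LLMRequest, pod *PodMetrics) bool {
	return pod.WaitingQueueSize < queueingThresholdLoRA
}

func loRAAffinityPredicate(req *LLMRequest, pod *PodMetrics) bool {
	_, ok := pod.ActiveModels[req.ResolvedTargetModel]
	return ok
}

func canAcceptNewLoraPredicate(_ *LLMRequest, pod *PodMetrics) bool {
	return len(pod.ActiveModels) < pod.MaxActiveModels
}

func main() {
	req := &LLMRequest{ResolvedTargetModel: "tweet-summary-lora"}
	pod := &PodMetrics{
		WaitingQueueSize: 12,
		ActiveModels:     map[string]int{"sql-lora": 1},
		MaxActiveModels:  4,
	}
	fmt.Println(lowQueueingPodPredicate(req, pod))   // true: queue of 12 is below 50
	fmt.Println(loRAAffinityPredicate(req, pod))     // false: requested adapter not loaded
	fmt.Println(canAcceptNewLoraPredicate(req, pod)) // true: 1 of 4 adapter slots used
}
```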
---

```diff
@@ -16,7 +16,11 @@ const (
 	// TODO(https://github.com/kubernetes-sigs/llm-instance-gateway/issues/16) Make this configurable.
 	kvCacheThreshold = 0.8
 	// TODO(https://github.com/kubernetes-sigs/llm-instance-gateway/issues/16) Make this configurable.
-	queueThreshold = 5
+	queueThresholdCritical = 5
+	// TODO(https://github.com/kubernetes-sigs/llm-instance-gateway/issues/16) Make this configurable.
+	// The threshold below which the number of queued requests is considered low, in which case
+	// we can prioritize LoRA affinity. The value of 50 was arrived at heuristically based on experiments.
+	queueingThresholdLoRA = 50
 )
 
 var (
@@ -29,7 +33,7 @@ var (
 	// queueLoRAAndKVCacheFilter tries to minimize the latency. The heuristic is to pick a server
 	// with lower cost to load an adapter and low KV cache, which typically yields lower latency.
-	lowLatencyFilter = &filter{
+	queueLoRAAndKVCacheFilter = &filter{
 		name:   "least queuing",
 		filter: leastQueuingFilterFunc,
 		nextOnSuccessOrFailure: &filter{
```

> **Reviewer** (on `name: "least queuing"`): update the name?
>
> **Author:** I changed the filter names to be more descriptive.

```diff
@@ -42,13 +46,40 @@
 		},
 	}
 
+	// queueAndKVCacheFilter is the same as queueLoRAAndKVCacheFilter but without the LoRA cost filter.
+	queueAndKVCacheFilter = &filter{
+		name:   "least queuing",
+		filter: leastQueuingFilterFunc,
+		nextOnSuccessOrFailure: &filter{
+			name:   "least KV cache percent",
+			filter: leastKVCacheFilterFunc,
+		},
+	}
+
+	// lowLatencyFilter defaults to queueLoRAAndKVCacheFilter above a certain queueing threshold.
+	// Below that threshold, LoRA affinity takes precedence.
+	lowLatencyFilter = &filter{
+		name:   "low queueing filter",
+		filter: toFilterFunc(lowQueueingPodPredicate),
+		nextOnSuccess: &filter{
+			name:          "affinity LoRA",
+			filter:        toFilterFunc(loRAAffinityPredicate),
+			nextOnSuccess: queueAndKVCacheFilter,
+			nextOnFailure: &filter{
+				name:                   "min cost LoRA",
+				filter:                 toFilterFunc(canAcceptNewLoraPredicate),
+				nextOnSuccessOrFailure: queueAndKVCacheFilter,
+			},
+		},
+		nextOnFailure: queueLoRAAndKVCacheFilter,
+	}
+
 	sheddableRequestFilter = &filter{
 		// When there is at least one model server that's not queuing requests, and still has KV
 		// cache below a certain threshold, we consider this model server has capacity to handle
 		// a sheddable request without impacting critical requests.
 		name:          "has capacity for sheddable requests",
-		filter:        toFilterFunc(noQueueAndLessThanKVCacheThresholdPredicate(queueThreshold, kvCacheThreshold)),
-		nextOnSuccess: lowLatencyFilter,
+		filter:        toFilterFunc(noQueueAndLessThanKVCacheThresholdPredicate(queueThresholdCritical, kvCacheThreshold)),
+		nextOnSuccess: queueLoRAAndKVCacheFilter,
 		// If all pods are queuing or running above the KVCache threshold, we drop the sheddable
 		// request to make room for critical requests.
 		nextOnFailure: &filter{
```

> **Reviewer** (on `filter: toFilterFunc(loRAAffinityPredicate)`): why not use `lowLoRACostPredicate`?
>
> **Author:** `lowLoRACostPredicate` picks both pods that satisfy `canAcceptNewLoraPredicate` and pods that satisfy `loRAAffinityPredicate`. For stronger affinity we want to pick only pods with `loRAAffinityPredicate`, and only if no such pod is present fall back to `canAcceptNewLoraPredicate`.
>
> **Reviewer:** Why not do that for the other branch too, then?
>
> **Author:** `lowLoRACostPredicate` ensures weak affinity by spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to a single pod. This gave good performance in our initial benchmarking results in the scenario where the number of LoRA slots exceeds the number of LoRA adapters. `loRAAffinityPredicate`, on the other hand, ensures strong affinity, i.e. it pins requests to a single pod with that adapter. Depending on the scenario, one or the other might be better.
>
> **Reviewer:** Can we document this reasoning please?
>
> **Author:** I added a comment to `lowLoRACostPredicate` with the reasoning, like we have in `leastKVCacheFilterFunc`.

> **Reviewer** (on `nextOnSuccessOrFailure: queueAndKVCacheFilter`): I think if we replace […] this will simplify the code, however at the cost of potentially more confusion with the noop step. It's up to you.
>
> **Author:** I agree, but I also think this would make it more confusing. Also, `queueAndKVCacheFilter` is something we might need in the future. For example, when a request does not need a LoRA adapter, we can apply `queueAndKVCacheFilter` directly instead of checking for LoRA affinity.

```diff
@@ -62,6 +93,7 @@
 )
 
 func NewScheduler(pmp PodMetricsProvider) *Scheduler {
+
 	return &Scheduler{
 		podMetricsProvider: pmp,
 		filter:             defaultFilter,
```
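`toFilterFunc` appears throughout these definitions but is not part of the diff. Reusing the sketch types from the scheduling overview earlier (`LLMRequest`, `PodMetrics`, `filterFunc`), a plausible shape for it is shown below; this is an assumption about its behavior, not the PR's implementation.

```go
// podPredicate reports whether a single pod passes a per-pod check.
type podPredicate func(req *LLMRequest, pod *PodMetrics) bool

// toFilterFunc lifts a per-pod predicate into a filterFunc over the whole
// candidate list. (Assumed shape; the real implementation lives elsewhere
// in the scheduling package.)
func toFilterFunc(pp podPredicate) filterFunc {
	return func(req *LLMRequest, pods []*PodMetrics) ([]*PodMetrics, error) {
		filtered := []*PodMetrics{}
		for _, pod := range pods {
			if pp(req, pod) {
				filtered = append(filtered, pod)
			}
		}
		// An empty result is treated as a failure by the Filter traversal,
		// steering the decision tree into its nextOnFailure branch.
		return filtered, nil
	}
}
```

Read together, `lowLatencyFilter` then behaves as follows: if no pod's queue is below `queueingThresholdLoRA` (50), fall back to `queueLoRAAndKVCacheFilter`; otherwise prefer pods that already have the requested adapter loaded (`loRAAffinityPredicate`), then pods with a free adapter slot (`canAcceptNewLoraPredicate`), and finally rank the survivors by queue length and KV-cache usage.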