Repo structure #3
Merged
Commits (10, all by kfswain):

- `442a38b` Reapply "proposed repo structure + copy of initial proposal"
- `0267839` adding placeholder so dirs show up in GH UI
- `56025bf` More placeholder files so dir structure shows up in GH UI
- `e19ee80` Adding status to proposal doc
- `8b247c5` formatting cleanup
- `f3f70b0` removing initial proposal
- `0c7b6a2` removing charter, as it would have high overlap with the root dir README
- `ae75ee5` adjusting structure based on reviewer feedback
- `5abaabf` deleting dir
- `b5e9eb3` typo
Files changed:

New placeholder file (charter):

> TODO: describe LLM Instance Gateway charter. And determine if this repo is the best place for this charter
docs/proposals/001-initial-proposal/001-initial-proposal.md (209 additions):
# LLM Instance Gateway

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Gateway](#gateway)
    - [CRDs](#crds)
    - [Envoy Solution](#envoy-solution)
    - [Model Server Protocol](#model-server-protocol)
- [PoC Design Details](#poc-design-details)
  - [Overview](#overview)
  - [Request Flow](#request-flow)
  - [Pod selection algorithm in PoC](#pod-selection-algorithm-in-poc)
  - [Artifacts](#artifacts)
<!-- /toc -->
## Summary

As presented in the [demo](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458) and building further upon the [joint proposal](https://docs.google.com/document/d/1BkwDlgFxSKKPHhM9kS28CdDIyJ3Xkdue3Iw1INaUkGw/edit?tab=t.0#heading=h.ajlsibmfh8wr), we propose that a gateway focused on multiplexing use cases upon shared hardware has distinct advantages in enabling efficient and fair use of multiple use cases over a shared pool of compute.
## Motivation

Novel advancements in fine-tuning like [LoRA](https://arxiv.org/abs/2106.09685) and [Multi-LoRA](https://arxiv.org/abs/2310.18547) have enabled multiple distinct use cases to share accelerators. As this new technique is adopted, Day 1/Day 2 operational concerns quickly become necessary to address.

Kubernetes has long been a standard for easing and automating the operational tasks of workloads. A mechanism (a gateway) within the K8s ecosystem is a reasonable and expected way for a user to support multiple LLM use cases on shared accelerators.
### Goals

#### Proposal Goals

- Create an Inference Gateway project group for wg-serving collaboration, including a chat channel and a dedicated repo (sponsored by sig-network)

#### Gateway Goals

- Fast reconfiguration - New use cases (including LoRA adapters or client configuration) can be rolled out or back in seconds to clients, without waiting for a new model server to start.
- Efficient accelerator sharing - Use cases can use less than a full accelerator or temporarily burst without needing to start a new model server, leading to fewer wasted accelerators and better pooling of shared capacity.
- Operational resilience - Use cases share available accelerators fairly and can have distinct priorities, latency objectives, and failure policies.
- Standardized LoRA - Simple, recommended patterns for deploying and loading LoRA adapters into model servers across a wide range of Kubernetes environments.
- Composability - The approach should be composable with:
  - the K8s Gateway API
  - other gateway features and projects, including high-level LLM gateways
  - existing deployment tools like kserve or kaito
  - different model servers
### Non-Goals

#### Proposal Non-Goals

- Creation of a fully realized KEP

#### Gateway Non-Goals

- Replacing the features of pre-existing Gateways
- Defining how serving workloads must be deployed
## Proposal

### Gateway

#### CRD(s)

To adequately achieve the above goals, we propose the addition of one or more CRDs to express:

- The boundaries of a compute pool that shares a base model
  - Including the deployment of a routing solution (PoC details below)
- A specific use case upon one or more backend pools
  - The objectives that this use case needs to achieve

The example API we showed in our demo looked like:
```
kind: LLMRoute
apiVersion: inference.x-k8s.io/v1alpha1
metadata:
  name: assistant
spec:
  parentRefs:
  - name: ai-gw
  backendRefs:
  - name: assistant
  adapter:
    name: sentiment
  priority: 100
  objectives:
  - type: OutputTokenLatency
    latency:
      value: 2s
      quantile:
        numerator: 99
    metrics:
      format: Prometheus
```
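
Purely as an illustration of how the demo API above might be expressed as Kubernetes Go types, here is a minimal sketch; every type and field name below is inferred from the example YAML and is hypothetical, not an API this proposal defines:

```go
// Hypothetical Go types mirroring the demo LLMRoute example above.
// Illustrative only; no such API exists yet.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// LLMRoute binds a use case (a LoRA adapter) to one or more backend pools
// behind a parent gateway, along with its priority and objectives.
type LLMRoute struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec LLMRouteSpec `json:"spec,omitempty"`
}

type LLMRouteSpec struct {
	ParentRefs  []ParentRef  `json:"parentRefs,omitempty"`
	BackendRefs []BackendRef `json:"backendRefs,omitempty"`
	Adapter     Adapter      `json:"adapter,omitempty"`
	Priority    int32        `json:"priority,omitempty"`
	Objectives  []Objective  `json:"objectives,omitempty"`
}

type ParentRef struct {
	Name string `json:"name"`
}

type BackendRef struct {
	Name string `json:"name"`
}

type Adapter struct {
	Name string `json:"name"`
}

// Objective captures a latency target for the use case, e.g. a p99
// output-token latency of 2s, surfaced via Prometheus-format metrics.
type Objective struct {
	Type    string  `json:"type"`
	Latency Latency `json:"latency,omitempty"`
	Metrics Metrics `json:"metrics,omitempty"`
}

type Latency struct {
	Value    string   `json:"value"`
	Quantile Quantile `json:"quantile,omitempty"`
}

type Quantile struct {
	Numerator int32 `json:"numerator"`
}

type Metrics struct {
	Format string `json:"format"`
}
```
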
#### Envoy Solution

Any gateway solution *must* be compatible with Envoy Proxy and have a plan for how to integrate these features into the Envoy ecosystem over the long term.
#### Model Server Protocol

In the PoC investigation we discovered the need for certain control and data to be exposed by the model server. For a model server to work properly with this LLM Instance Gateway, it would need to implement this protocol.

Key requirements would roughly look like:

- A method, or set of methods, to dynamically update the available LoRA catalog on a model server
- Metrics for data like the following, shared in a networking-friendly way (as a header on response data or some other lightweight mechanism, just not in the body; a sketch follows below):
  - Adapter state
  - Available catalog
  - Queue data (per adapter)
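
To make the header-based metric sharing concrete, here is a minimal sketch in Go; the header format ("adapter=pending" pairs), field names, and values are assumptions for illustration, not part of the protocol:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// AdapterStats is a hypothetical view of one adapter's state on a model
// server, as reported in a response header rather than the response body.
type AdapterStats struct {
	Name            string
	PendingRequests int
}

// parseAdapterHeader parses a hypothetical header value of the form
// "sentiment=3,summarizer=0", i.e. comma-separated adapter=pending pairs.
func parseAdapterHeader(value string) ([]AdapterStats, error) {
	var stats []AdapterStats
	for _, pair := range strings.Split(value, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed adapter entry %q", pair)
		}
		pending, err := strconv.Atoi(parts[1])
		if err != nil {
			return nil, fmt.Errorf("bad pending count in %q: %w", pair, err)
		}
		stats = append(stats, AdapterStats{Name: parts[0], PendingRequests: pending})
	}
	return stats, nil
}

func main() {
	// Example: a model server reports two loaded adapters and their queues.
	stats, err := parseAdapterHeader("sentiment=3,summarizer=0")
	if err != nil {
		panic(err)
	}
	for _, s := range stats {
		fmt.Printf("adapter=%s pending=%d\n", s.Name, s.PendingRequests)
	}
}
```
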
## PoC Design Details

From the proof of concept, we believe the following architecture is a starting point for this proposal:

- Envoy Proxy
  - An OSS starting point that is generally accepted and used
- Ext proc
  - A necessary tool to extend the capabilities of Envoy to allow for routing based on the OpenAI model field (within the body); a small sketch of this body-based routing follows after this list
  - An agile tool for development of novel LLM Instance Gateway features
- CRD/K8s API interface
- Model server modifications
  - Necessary to extend existing tooling to provide the proper routing data to Envoy
  - Potentially extend further to support [ORCA](https://github.com/envoyproxy/envoy/issues/6614) headers as a method of metrics transfer
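
As a hedged illustration of the body-based routing mentioned in the Ext proc bullet above, the sketch below extracts the OpenAI-style `model` field from a request body; it is a simplified standalone example, not the PoC's ext proc code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// completionRequest captures just the OpenAI-style field the gateway cares
// about for routing: the model, which in this design names the use case /
// LoRA adapter. All other request fields are ignored here.
type completionRequest struct {
	Model string `json:"model"`
}

// adapterFromBody extracts the model field from an OpenAI-style request body.
// In the PoC this kind of logic lives inside the Envoy ext proc, which is what
// makes body-based routing possible; this standalone function is illustrative.
func adapterFromBody(body []byte) (string, error) {
	var req completionRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", fmt.Errorf("parsing request body: %w", err)
	}
	if req.Model == "" {
		return "", fmt.Errorf("request body has no model field")
	}
	return req.Model, nil
}

func main() {
	body := []byte(`{"model": "sentiment", "prompt": "The service was great!"}`)
	adapter, err := adapterFromBody(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("route on adapter:", adapter) // prints: route on adapter: sentiment
}
```
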
### Overview

Our very high-level diagram of how this looked:

*(architecture diagram image not rendered in this view)*

To briefly describe how the components work together:

- When an `LLMRoute` is defined, our gateway recognizes this new service and allows traffic for the specified adapter to be admitted to the backend pool.
  - We support and expect the OpenAI API spec as the default when reading the adapter.
- Incoming traffic for a validated service is then routed to ExtProc, where routing and fairness decisions are made.
  - We attempt to route to a model server that has the adapter already loaded, so long as there is batch capacity.
### Request Flow

Below is an example of the life of a request using the described design:

*(request flow diagram image not rendered in this view)*

> Notes:
>
> 1. Ext Proc: External processing calls an external gRPC service to
> process HTTP requests and responses.
>
> 2. Original Dst: An original destination cluster can be used when incoming
> connections are redirected to Envoy either via an iptables REDIRECT or
> TPROXY target or with Proxy Protocol. In these cases, requests routed to an
> original destination cluster are forwarded to upstream hosts as addressed by
> the redirection metadata, without any explicit host configuration or
> upstream host discovery. We implemented this using the bootstrap feature of
> Envoy Gateway.
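
As a small, hedged illustration of how an ext proc can hand its chosen upstream back to an original destination cluster: when the cluster is configured with Envoy's `use_http_header` option, the ext proc sets the upstream address in a request header (`x-envoy-original-dst-host` by default). The pod address below is made up:

```go
package main

import "fmt"

// originalDstHeader returns the header an Envoy Original Destination cluster
// reads the upstream address from when `use_http_header: true` is set
// (x-envoy-original-dst-host is Envoy's default; it can be renamed via
// http_header_name). The pod address passed in here is a made-up example.
func originalDstHeader(podAddr string) (key, value string) {
	return "x-envoy-original-dst-host", podAddr
}

func main() {
	k, v := originalDstHeader("10.0.0.12:8000") // hypothetical pod IP:port
	fmt.Printf("%s: %s\n", k, v)
}
```
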
### Pod selection algorithm in PoC

Metrics stored in the Ext Proc cache:

- Active adapters in each pod
- Number of pending requests for each adapter in each pod

Given a request, read the relevant metrics from the cache and find which pods have that LoRA adapter loaded. Out of the set of pods that (1) have the LoRA adapter loaded and (2) have a number of pending requests for that adapter below a threshold, pick the one with the most pending requests (we pick the most to prevent flopping). If no pod satisfies both 1 and 2, then pick a pod with (in the following priority):

1. Least number of active adapters
2. Least total pending requests

A sketch of this selection logic in code follows below.
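
Here is a minimal Go sketch of the selection logic described above; the types, the threshold, and the use of the pending-request map keys as the set of active adapters are simplifying assumptions, not the PoC's actual implementation:

```go
package main

import "fmt"

// PodMetrics is a hypothetical snapshot of one pod from the ext proc cache.
// In this sketch the keys of Pending double as the set of active adapters.
type PodMetrics struct {
	Address string
	Pending map[string]int // pending requests keyed by adapter name
}

func (p PodMetrics) totalPending() int {
	total := 0
	for _, n := range p.Pending {
		total += n
	}
	return total
}

// pickPod implements the PoC heuristic: among pods that already have the
// adapter loaded and whose per-adapter queue is below the threshold, pick the
// one with the MOST pending requests (to prevent flopping between pods).
// Otherwise fall back to the pod with the fewest active adapters, breaking
// ties by least total pending requests.
func pickPod(pods []PodMetrics, adapter string, threshold int) (PodMetrics, bool) {
	var best PodMetrics
	found := false
	for _, p := range pods {
		pending, loaded := p.Pending[adapter]
		if !loaded || pending >= threshold {
			continue
		}
		if !found || pending > best.Pending[adapter] {
			best, found = p, true
		}
	}
	if found {
		return best, true
	}
	// Fallback: least active adapters, then least total pending requests.
	for _, p := range pods {
		if !found ||
			len(p.Pending) < len(best.Pending) ||
			(len(p.Pending) == len(best.Pending) && p.totalPending() < best.totalPending()) {
			best, found = p, true
		}
	}
	return best, found
}

func main() {
	pods := []PodMetrics{
		{Address: "10.0.0.11:8000", Pending: map[string]int{"sentiment": 3}},
		{Address: "10.0.0.12:8000", Pending: map[string]int{"sentiment": 1, "summarizer": 4}},
		{Address: "10.0.0.13:8000", Pending: map[string]int{"summarizer": 2}},
	}
	if pod, ok := pickPod(pods, "sentiment", 5); ok {
		fmt.Println("selected pod:", pod.Address) // expects 10.0.0.11:8000
	}
}
```
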
### Artifacts

- [Ext-proc/Envoy/Benchmarking repo](https://github.com/tomatillo-and-multiverse/lora-inference-gateway)
  - Repo we used to develop the ext proc image used in the PoC
  - Also contains the manifests required to deploy the gateway
- [vLLM fork](https://github.com/kaushikmitr/vllm)
- Presentation:
  - [Slides](https://docs.google.com/presentation/d/1I1XDf6fQQEtHxJtZxFdIaUcUA3lLBC7neW823diWS78/edit?usp=sharing)
  - [Recording](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458)
- [PoC Design & Experimentation data](https://docs.google.com/document/d/17wB0BgeV8JrGtccxZqkOqFyNC4gPBNqdKg8Oe9xMkio/edit#heading=h.eeeqp85g68qy)
New placeholder file (model server protocol):

> TO-DO: describe the model server protocol here that should be implemented if a model server would like to integrate with LLM Instance Gateway
A number of additional empty placeholder files were added so that the directory structure shows up in the GitHub UI.