# Populating api-types & concepts #254
# API Overview

## Background
The API design is based on these axioms:

- Pools of shared compute should be *discrete* for scheduling to properly work (see the sketch after this list)
- Pod-level scheduling should not be handled by a high-level gateway
- Simple services should be simple to define (or are implicitly defined via reasonable defaults)
- This solution should be composable with other Gateway solutions and flexible to fit customer needs
- The MVP will heavily assume requests are made using the OpenAI spec, but remain open to extension in the future
- The Gateway should route in a way that does not generate a queue of requests at the model server level
- Model serving differs from web serving in critical ways. One of these is the existence of multiple models for the same service, which can materially impact behavior depending on the model served, unlike a web service, which has mechanisms to render implementation changes invisible to the end user
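
To ground the first axiom, here is a minimal sketch of how a discrete pool of shared compute might be declared as a Kubernetes custom resource. The kind name `InferencePool`, the API group, and all fields below are illustrative assumptions for this document, not a confirmed API surface:

```yaml
# Hypothetical sketch only: the kind, group, and fields are assumptions.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: llama-pool
spec:
  # Selects the model-server pods that make up this one discrete unit of
  # shared compute, giving the scheduler a well-defined boundary.
  selector:
    app: llama-server
  # Port on which the selected model servers accept OpenAI-spec requests.
  targetPort: 8000
```

Keeping the pool discrete means the gateway schedules across a known set of equivalent model-server endpoints rather than reaching into individual pods.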
# Roles and Personas

Before diving into the details of the API, descriptions of the personas these APIs were designed for will help convey the thought process behind the API design.

## Inference Platform Admin

The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for:
- Hardware
- Model Server
- Base Model
- Resource Allocation for Workloads
- Gateway configuration (see the sketch after this list)
- etc.
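
As one concrete illustration of the "Gateway configuration" item above, the admin might own a standard Kubernetes Gateway API `Gateway` resource. This sketch uses the upstream `gateway.networking.k8s.io` API; the class name is a placeholder for whatever implementation the platform actually runs:

```yaml
# Standard Gateway API resource; gatewayClassName is a placeholder.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: example-gateway-class  # supplied by the chosen implementation
  listeners:
    - name: http
      protocol: HTTP
      port: 80
```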

## Inference Workload Owner

An Inference Workload Owner persona owns and manages one or many Generative AI workloads (LLM-focused *currently*). This includes:
- Defining criticality
- Managing fine-tunes (see the sketch after this list)
  - LoRA Adapters
  - System Prompts
  - Prompt Cache
  - etc.
- Managing rollout of adapters
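
To illustrate how these responsibilities might translate into configuration, below is a hypothetical sketch of a workload-owner-facing resource. The kind name `InferenceModel` and every field are assumptions made for illustration, showing criticality plus a weighted rollout of two LoRA adapter versions against the admin-owned pool sketched earlier:

```yaml
# Hypothetical sketch only: kind, group, and fields are assumptions.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot          # name clients send in OpenAI-spec requests
  criticality: Critical       # assumed enum, e.g. Critical | Standard | Sheddable
  poolRef:
    name: llama-pool          # the discrete pool owned by the platform admin
  # Assumed mechanism for rolling out fine-tuned adapters: traffic for
  # "chatbot" is split by weight across two LoRA adapter versions.
  targetModels:
    - name: chatbot-lora-v1
      weight: 90
    - name: chatbot-lora-v2   # canary for the newer adapter
      weight: 10
```

Because the rollout happens entirely inside this resource, the workload owner can canary a new adapter without touching the admin-owned pool or gateway.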