Commit 9c65063

Update docs about InferenceModel

1 parent f82ed07, commit 9c65063

1 file changed: +42 −30 lines

docs/proposals/002-api-proposal/README.md (42 additions, 30 deletions)
````diff
@@ -228,6 +228,7 @@ type InferenceModel struct {
 	metav1.TypeMeta

 	Spec   InferenceModelSpec
+	Status InferenceModelStatus
 }

 type InferenceModelSpec struct {
````
````diff
@@ -253,22 +254,39 @@ type InferenceModelSpec struct {
 	// If not specified, the target model name is defaulted to the ModelName parameter.
 	// ModelName is often in reference to a LoRA adapter.
 	TargetModels []TargetModel
-	// Reference to the InferencePool that the model registers to. It must exist in the same namespace.
-	PoolReference *LocalObjectReference
+	// PoolRef is a reference to the inference pool, the pool must exist in the same namespace.
+	PoolRef PoolObjectReference
+}
+
+// PoolObjectReference identifies an API object within the namespace of the
+// referrer.
+type PoolObjectReference struct {
+	// Group is the group of the referent.
+	Group Group
+
+	// Kind is kind of the referent. For example "InferencePool".
+	Kind Kind
+
+	// Name is the name of the referent.
+	Name ObjectName
 }

 // Defines how important it is to serve the model compared to other models.
 // Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field should ALWAYS be optional(use a pointer), and set no default.
 // This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior.
 type Criticality string
 const (
-	// Most important. Requests to this band will be shed last.
-	Critical Criticality = "Critical"
-	// More important than Sheddable, less important than Critical.
-	// Requests in this band will be shed before critical traffic.
-	Default Criticality = "Default"
-	// Least important. Requests to this band will be shed before all other bands.
-	Sheddable Criticality = "Sheddable"
+	// Critical defines the highest level of criticality. Requests to this band will be shed last.
+	Critical Criticality = "Critical"
+
+	// Standard defines the base criticality level and is more important than Sheddable but less
+	// important than Critical. Requests in this band will be shed before critical traffic.
+	// Most models are expected to fall within this band.
+	Standard Criticality = "Standard"
+
+	// Sheddable defines the lowest level of criticality. Requests to this band will be shed before
+	// all other bands.
+	Sheddable Criticality = "Sheddable"
 )

 // TargetModel represents a deployed model or a LoRA adapter. The
````
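To make the new band semantics concrete, here is a minimal, self-contained sketch (not part of the proposal) of the shedding order the comments describe: Sheddable is shed before all other bands, Critical is shed last, and Standard sits between them. The `shedPriority` helper and its nil handling are illustrative assumptions; the API itself deliberately sets no default for an unset Criticality.

```go
package main

import "fmt"

// Criticality mirrors the bounded enum from the proposal.
type Criticality string

const (
	Critical  Criticality = "Critical"
	Standard  Criticality = "Standard"
	Sheddable Criticality = "Sheddable"
)

// shedPriority is a hypothetical helper: a lower value means the band is shed
// earlier under load. Treating nil/unknown like Standard is an assumption of
// this sketch, since the field is optional and the API sets no default.
func shedPriority(c *Criticality) int {
	if c == nil {
		return 1 // illustrative choice only
	}
	switch *c {
	case Sheddable:
		return 0 // shed before all other bands
	case Critical:
		return 2 // shed last
	default:
		return 1 // Standard: shed before critical traffic
	}
}

func main() {
	crit := Critical
	shed := Sheddable
	fmt.Println(shedPriority(&shed) < shedPriority(nil)) // Sheddable goes first
	fmt.Println(shedPriority(&crit) > shedPriority(nil)) // Critical outlives the rest
}
```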
````diff
@@ -281,24 +299,16 @@ const (
 type TargetModel struct {
 	// The name of the adapter as expected by the ModelServer.
 	Name string
-	// Weight is used to determine the percentage of traffic that should be
+	// Weight is used to determine the percentage of traffic that should be
 	// sent to this target model when multiple versions of the model are specified.
-	Weight *int
+	Weight *int32
 }

-// LocalObjectReference identifies an API object within the namespace of the
-// referrer.
-type LocalObjectReference struct {
-	// Group is the group of the referent.
-	Group Group
-
-	// Kind is kind of the referent. For example "InferencePool".
-	Kind Kind
-
-	// Name is the name of the referent.
-	Name ObjectName
+// InferenceModelStatus defines the observed state of InferenceModel
+type InferenceModelStatus struct {
+	// Conditions track the state of the InferenceModel.
+	Conditions []metav1.Condition
 }
-
 ```

 ### Yaml Examples
````
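As a rough illustration of how the `Weight` field above might be consumed, here is a self-contained sketch of cumulative-weight selection over `TargetModels`. The `pickTarget` function is a hypothetical consumer, not part of the proposal; treating a nil weight as 0 is an assumption of this sketch.

```go
package main

import "fmt"

// TargetModel mirrors the proposal's type: Weight is a *int32 used to split
// traffic across versions of a model.
type TargetModel struct {
	Name   string
	Weight *int32
}

// pickTarget is a hypothetical helper: given a roll in [0, totalWeight), it
// returns the model owning that slice of the cumulative weight range.
func pickTarget(models []TargetModel, roll int32) string {
	var cumulative int32
	for _, m := range models {
		if m.Weight != nil {
			cumulative += *m.Weight
		}
		if roll < cumulative {
			return m.Name
		}
	}
	return "" // roll out of range, or no weighted models
}

func main() {
	w := int32(50)
	models := []TargetModel{
		{Name: "npc-bot-v1", Weight: &w},
		{Name: "npc-bot-v2", Weight: &w},
	}
	fmt.Println(pickTarget(models, 10)) // lands in npc-bot-v1's [0,50) slice
	fmt.Println(pickTarget(models, 75)) // lands in npc-bot-v2's [50,100) slice
}
```

With two weights of 50, each model receives half of the [0,100) range, matching the 50/50 split in the `npc-bot` example below.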
````diff
@@ -322,27 +332,29 @@ spec:

 Here we consume the pool with two InferenceModels. Where `sql-code-assist` is both the name of the model and the name of the LoRA adapter on the model server. And `npc-bot` has a layer of indirection for those names, as well as a specified criticality. Both `sql-code-assist` and `npc-bot` have available LoRA adapters on the InferencePool and routing to each InferencePool happens earlier (at the K8s Gateway).
 ```yaml
-apiVersion: inference.x-k8s.io/v1alpha1
+apiVersion: inference.x-k8s.io/v1alpha2
 kind: InferenceModel
 metadata:
   name: sql-code-assist
 spec:
   modelName: sql-code-assist
-  poolRef: base-model-pool
+  poolRef:
+    name: base-model-pool
 ---
-apiVersion: inference.x-k8s.io/v1alpha1
+apiVersion: inference.x-k8s.io/v1alpha2
 kind: InferenceModel
 metadata:
   name: npc-bot
 spec:
   modelName: npc-bot
   criticality: Critical
   targetModels:
-  - targetModelName: npc-bot-v1
+  - name: npc-bot-v1
+    weight: 50
+  - name: npc-bot-v2
     weight: 50
-  - targetModelName: npc-bot-v2
-    weight: 50
-  poolRef: base-model-pool
+  poolRef:
+    name: base-model-pool
 ```

````
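The `sql-code-assist` example above relies on the defaulting rule stated in the API comments: "If not specified, the target model name is defaulted to the ModelName parameter." A minimal sketch of that rule, with hypothetical local mirrors of the proposal's types (the `effectiveTargets` helper is illustrative, not part of the API):

```go
package main

import "fmt"

// Minimal mirrors of the proposal's types, for illustration only.
type TargetModel struct {
	Name   string
	Weight *int32
}

type InferenceModelSpec struct {
	ModelName    string
	TargetModels []TargetModel
}

// effectiveTargets is a hypothetical helper applying the documented default:
// when no target models are specified, the target model name falls back to
// the ModelName parameter.
func effectiveTargets(spec InferenceModelSpec) []TargetModel {
	if len(spec.TargetModels) == 0 {
		return []TargetModel{{Name: spec.ModelName}}
	}
	return spec.TargetModels
}

func main() {
	// sql-code-assist declares no targetModels, so the model name is reused.
	spec := InferenceModelSpec{ModelName: "sql-code-assist"}
	fmt.Println(effectiveTargets(spec)[0].Name)
}
```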