Enhanced Error-handling config
Current State
See https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
The NVIDIA GPU Device Plugin
We register for NVML Events of type nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError
We treat the following XIDs as non-fatal errors:
XID | Description |
---|---|
13 | Graphics Engine Exception |
31 | GPU memory page fault |
43 | GPU stopped processing |
45 | Preemptive cleanup, due to previous errors |
68 | Video processor exception |
109 | Context Switch Timeout Error |
We allow additional XIDs to be specified in the `DP_DISABLE_HEALTHCHECKS` envvar with the following logic:
- If the value is `xids` or `all`, we disable healthchecks entirely.
- Otherwise, the value is a comma-separated list of numeric XIDs to ignore, e.g. `109,68`.
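The envvar logic above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the `healthcheckOverrides` type and the `parseDisableHealthchecks` function are hypothetical names, and skipping non-numeric list entries is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// healthcheckOverrides is a hypothetical holder for the parsed envvar value.
type healthcheckOverrides struct {
	DisableAll  bool     // true when the value is "xids" or "all"
	IgnoredXIDs []uint64 // additional XIDs to treat as non-fatal
}

// parseDisableHealthchecks interprets a DP_DISABLE_HEALTHCHECKS value
// according to the two rules listed above.
func parseDisableHealthchecks(value string) healthcheckOverrides {
	value = strings.TrimSpace(value)
	if value == "xids" || value == "all" {
		return healthcheckOverrides{DisableAll: true}
	}

	var out healthcheckOverrides
	for _, field := range strings.Split(value, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			// Assumption: entries that are not numeric are skipped.
			continue
		}
		out.IgnoredXIDs = append(out.IgnoredXIDs, xid)
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", parseDisableHealthchecks("109,68"))
	fmt.Printf("%+v\n", parseDisableHealthchecks("all"))
}
```

A single envvar carrying both an on/off switch and a list is part of what the proposed config section would separate into distinct fields.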
The GKE Device Plugin
By default the following error is checked:
XID | Description |
---|---|
48 | Double-bit ECC Error |
The `XID_CONFIG` envvar is used to specify a comma-separated list of additional XIDs to treat as critical.
Proposal
Add the following config section:
version: v1
health:
  disabled: false
  eventTypes: [EventTypeXidCriticalError, EventTypeDoubleBitEccError, EventTypeSingleBitEccError]
  ignoredXIDs: [13, 31, 43, 45, 68]
  criticalXIDs: all
GKE defaults:
version: v1
health:
  disabled: false
  eventTypes: [EventTypeXidCriticalError]
  ignoredXIDs: []
  criticalXIDs: [48]
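To make the intended semantics concrete, here is a rough Go sketch of how the two sets of defaults might be evaluated. The `HealthConfig` type, its field names, and `MarksUnhealthy` are hypothetical. The proposal does not state how `ignoredXIDs` and `criticalXIDs` interact, so the sketch assumes that an ignored XID always wins, which seems to be the only reading under which `criticalXIDs: all` and a non-empty `ignoredXIDs` are both meaningful.

```go
package main

import "fmt"

// HealthConfig mirrors the proposed `health` section.
// Field names and types are illustrative, not an existing API.
type HealthConfig struct {
	Disabled     bool
	EventTypes   []string
	IgnoredXIDs  []uint64
	CriticalAll  bool     // stands in for `criticalXIDs: all`
	CriticalXIDs []uint64 // explicit list, used when CriticalAll is false
}

// MarksUnhealthy reports whether an XID should mark a device unhealthy.
// Assumption: ignoredXIDs takes precedence over criticalXIDs.
func (c HealthConfig) MarksUnhealthy(xid uint64) bool {
	if c.Disabled || contains(c.IgnoredXIDs, xid) {
		return false
	}
	return c.CriticalAll || contains(c.CriticalXIDs, xid)
}

// contains reports whether xid appears in xids.
func contains(xids []uint64, xid uint64) bool {
	for _, x := range xids {
		if x == xid {
			return true
		}
	}
	return false
}

// nvidiaDefaults encodes the first config block of the proposal.
func nvidiaDefaults() HealthConfig {
	return HealthConfig{
		EventTypes: []string{
			"EventTypeXidCriticalError",
			"EventTypeDoubleBitEccError",
			"EventTypeSingleBitEccError",
		},
		IgnoredXIDs: []uint64{13, 31, 43, 45, 68},
		CriticalAll: true,
	}
}

// gkeDefaults encodes the "GKE defaults" block of the proposal.
func gkeDefaults() HealthConfig {
	return HealthConfig{
		EventTypes:   []string{"EventTypeXidCriticalError"},
		CriticalXIDs: []uint64{48},
	}
}

func main() {
	nvidia, gke := nvidiaDefaults(), gkeDefaults()
	for _, xid := range []uint64{31, 48, 79} {
		fmt.Printf("XID %d: nvidia=%t gke=%t\n",
			xid, nvidia.MarksUnhealthy(xid), gke.MarksUnhealthy(xid))
	}
}
```

Under this reading the NVIDIA defaults act as a denylist of benign XIDs while the GKE defaults act as an allowlist of critical ones, and both fit the same schema.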