Add advanced health check configuration to config file #1340

@elezar

Enhanced Error-handling config

Current State

See https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing

The NVIDIA GPU Device Plugin

We register for NVML Events of type nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError
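As a rough illustration of how the three event types combine into a single registration mask, here is a minimal sketch. The constants are local stand-ins intended to mirror the nvmlEventType* bit flags from nvml.h, and eventMask and eventTypesByName are hypothetical names introduced here; real code would use the constants exported by the NVML Go bindings.

```go
package main

import "fmt"

// Local stand-ins for the NVML event-type bit flags. The values are intended
// to mirror nvmlEventType* in nvml.h; real code should use the constants
// exported by the NVML Go bindings instead of redefining them.
const (
	EventTypeSingleBitEccError uint64 = 0x1
	EventTypeDoubleBitEccError uint64 = 0x2
	EventTypeXidCriticalError  uint64 = 0x8
)

// eventTypesByName maps the short event-type names to their bit flags.
var eventTypesByName = map[string]uint64{
	"EventTypeSingleBitEccError": EventTypeSingleBitEccError,
	"EventTypeDoubleBitEccError": EventTypeDoubleBitEccError,
	"EventTypeXidCriticalError":  EventTypeXidCriticalError,
}

// eventMask is a hypothetical helper that ORs the named event types into one
// mask. Unknown names are reported as an error instead of being ignored.
func eventMask(names []string) (uint64, error) {
	var mask uint64
	for _, name := range names {
		bit, ok := eventTypesByName[name]
		if !ok {
			return 0, fmt.Errorf("unknown event type %q", name)
		}
		mask |= bit
	}
	return mask, nil
}
```

Calling eventMask with all three names yields the same value as OR-ing the three constants directly, which is the mask that would be passed when registering for events.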

We treat the following XIDs as non-fatal errors:

XID  Description
13   Graphics Engine Exception
31   GPU memory page fault
43   GPU stopped processing
45   Preemptive cleanup, due to previous errors
68   Video processor exception
109  Context Switch Timeout Error

We allow additional XIDs to be specified in the DP_DISABLE_HEALTHCHECKS environment variable, with the following logic:

  • If the value is xids or all, we disable health checks entirely.
  • Otherwise, the value is treated as a comma-separated list of numeric XIDs to ignore in addition to those listed above, e.g. 109,68.
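The logic above could be sketched roughly as follows. This is not the plugin's actual code: parseDisableHealthChecks and defaultIgnoredXIDs are hypothetical names. The sketch also assumes that malformed entries are skipped silently and that an xids or all entry anywhere in the list disables the checks.

```go
package main

import (
	"strconv"
	"strings"
)

// defaultIgnoredXIDs mirrors the table of XIDs treated as non-fatal.
var defaultIgnoredXIDs = []uint64{13, 31, 43, 45, 68, 109}

// parseDisableHealthChecks is a hypothetical helper that interprets a
// DP_DISABLE_HEALTHCHECKS value. It returns whether health checks are disabled
// entirely and, if not, the full set of XIDs to ignore (defaults plus extras).
func parseDisableHealthChecks(value string) (disabled bool, ignored map[uint64]bool) {
	ignored = make(map[uint64]bool)
	for _, xid := range defaultIgnoredXIDs {
		ignored[xid] = true
	}
	for _, field := range strings.Split(value, ",") {
		field = strings.TrimSpace(field)
		switch field {
		case "":
			continue
		case "all", "xids":
			// Assumption: either keyword disables health checks even when it
			// appears alongside numeric entries.
			return true, nil
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			// Assumption: malformed entries are skipped, not treated as fatal.
			continue
		}
		ignored[xid] = true
	}
	return false, ignored
}
```

For example, a value of 79 leaves health checks enabled and adds XID 79 to the six defaults, while a value of all returns disabled = true and no ignore set.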

The GKE Device Plugin

See https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/0509b1f9f4b9a357b44ba65e7b508ded8bd5ecf0/pkg/gpu/nvidia/health_check/health_checker.go#L41

By default, only the following XID is checked:

XID  Description
48   Double-bit ECC Error

The XID_CONFIG environment variable is used to specify a comma-separated list of additional XIDs to treat as critical.

Proposal

Add the following config section:

version: v1
health:
  disabled: false
  eventTypes: [EventTypeXidCriticalError, EventTypeDoubleBitEccError, EventTypeSingleBitEccError]
  ignoredXIDs: [13, 31, 43, 45, 68]
  criticalXIDs: all

GKE defaults:

version: v1
health:
  disabled: false
  eventTypes: [EventTypeXidCriticalError]
  ignoredXIDs: []
  criticalXIDs: [48]
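To make the intended semantics of the two snippets concrete, here is a minimal sketch of how the health section might be modelled and evaluated. All type and function names (XIDSet, HealthConfig, MarksUnhealthy, proposedDefaults, gkeDefaults) are hypothetical. The rule that ignoredXIDs takes precedence over criticalXIDs is an assumption that the proposal does not state; it is needed for criticalXIDs: all to coexist with a non-empty ignore list. YAML decoding is left out.

```go
package main

// XIDSet is a hypothetical type for a field that is either the keyword "all"
// or an explicit list of XIDs, as criticalXIDs is in the proposal.
type XIDSet struct {
	All  bool
	XIDs []uint64
}

// Contains reports whether xid is a member of the set.
func (s XIDSet) Contains(xid uint64) bool {
	if s.All {
		return true
	}
	for _, x := range s.XIDs {
		if x == xid {
			return true
		}
	}
	return false
}

// HealthConfig mirrors the fields of the proposed health section.
type HealthConfig struct {
	Disabled     bool
	EventTypes   []string
	IgnoredXIDs  []uint64
	CriticalXIDs XIDSet
}

// MarksUnhealthy reports whether an XID event should mark the device unhealthy.
// Assumption: ignoredXIDs takes precedence over criticalXIDs.
func (c HealthConfig) MarksUnhealthy(xid uint64) bool {
	if c.Disabled {
		return false
	}
	for _, x := range c.IgnoredXIDs {
		if x == xid {
			return false
		}
	}
	return c.CriticalXIDs.Contains(xid)
}

// proposedDefaults copies the values from the first config snippet.
func proposedDefaults() HealthConfig {
	return HealthConfig{
		EventTypes:   []string{"EventTypeXidCriticalError", "EventTypeDoubleBitEccError", "EventTypeSingleBitEccError"},
		IgnoredXIDs:  []uint64{13, 31, 43, 45, 68},
		CriticalXIDs: XIDSet{All: true},
	}
}

// gkeDefaults copies the values from the GKE defaults snippet.
func gkeDefaults() HealthConfig {
	return HealthConfig{
		EventTypes:   []string{"EventTypeXidCriticalError"},
		CriticalXIDs: XIDSet{XIDs: []uint64{48}},
	}
}
```

Under these assumptions, proposedDefaults().MarksUnhealthy(48) and gkeDefaults().MarksUnhealthy(48) both return true, whereas XID 13 is ignored by the first config and is simply not critical in the GKE one.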
