Commit a7a169c

gpu: doc: monitoring resource notes

Also align xelink-sidecar deployment with the new files in the xpu manager project.

Signed-off-by: Tuomas Katila <[email protected]>

1 parent 6334fca

File tree: 5 files changed, +34 −7 lines
cmd/gpu_plugin/README.md (+1 −1)

```diff
@@ -53,7 +53,7 @@ For workloads on different KMDs, see [KMD and UMD](#kmd-and-umd).
 | Flag | Argument | Default | Meaning |
 |:---- |:-------- |:------- |:------- |
-| -enable-monitoring | - | disabled | Enable 'i915_monitoring' resource that provides access to all Intel GPU devices on the node |
+| -enable-monitoring | - | disabled | Enable '*_monitoring' resource that provides access to all Intel GPU devices on the node, [see use](./monitoring.md) |
 | -resource-manager | - | disabled | Enable fractional resource management, [see use](./fractional.md) |
 | -shared-dev-num | int | 1 | Number of containers that can share the same GPU device |
 | -allocation-policy | string | none | 3 possible values: balanced, packed, none. For shared-dev-num > 1: _balanced_ mode spreads workloads among GPU devices, _packed_ mode fills one GPU fully before moving to next, and _none_ selects first available device from kubelet. Default is _none_. Allocation policy does not have an effect when resource manager is enabled. |
```
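The `-enable-monitoring` flag is passed to the plugin binary in its DaemonSet spec. A minimal sketch of what that might look like, assuming a container named `intel-gpu-plugin` and an illustrative image tag (both hypothetical, not taken from this commit):

```yaml
# Sketch only: container name, image, and surrounding DaemonSet fields
# are assumptions for illustration.
spec:
  containers:
    - name: intel-gpu-plugin
      image: intel/intel-gpu-plugin:devel
      args:
        - "-enable-monitoring"   # registers the *_monitoring resource
        - "-shared-dev-num=1"
```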

cmd/gpu_plugin/monitoring.md (+32, new file)

````diff
@@ -0,0 +1,32 @@
+# Monitoring GPUs
+
+## i915_monitoring resource
+
+The GPU plugin can be configured to register a monitoring resource on nodes that have Intel GPUs. `gpu.intel.com/i915_monitoring` (or `gpu.intel.com/xe_monitoring`) is a singular resource on each node. A container requesting it gets access to _all_ the Intel GPUs (`i915` or `xe` KMD device files) on the node. The idea behind this resource is to allow the container to _monitor_ the GPUs. A container requesting the `i915_monitoring` resource would typically export data to a metrics consumer, such as [Prometheus](https://prometheus.io/).
+
+<figure>
+<img src="monitoring.png"/>
+<figcaption>Monitoring Pod listening to all GPUs while one Pod is using a GPU.</figcaption>
+</figure>
+
+For the monitoring application, there are two options: [Intel XPU Manager](https://github.com/intel/xpumanager/) and [collectd](https://github.com/collectd/collectd/tree/collectd-6.0). Intel XPU Manager is readily available as a container with a deployment yaml. collectd has Intel GPU support in its 6.0 branch, but no public containers are available for it.
+
+To deploy XPU Manager to a cluster, run the following kubectl command:
+```
+$ kubectl apply -k https://github.com/intel/xpumanager/deployment/kubernetes/daemonset/base
+```
+
+This deploys an XPU Manager daemonset to run on all the nodes having the `i915_monitoring` resource.
+
+## Prometheus integration with XPU Manager
+
+To deploy Prometheus to a cluster, see [this page](https://prometheus-operator.dev/docs/user-guides/getting-started/). One can also use Prometheus' [helm chart](https://github.com/prometheus-community/helm-charts).
+
+Prometheus requires additional Kubernetes configuration so it can fetch GPU metrics. The following steps add a Kubernetes Service and a ServiceMonitor component, which instruct Prometheus how and from where to retrieve the metrics.
+
+```
+$ kubectl apply -f https://raw.githubusercontent.com/intel/xpumanager/master/deployment/kubernetes/monitoring/service-intel-xpum.yaml
+$ kubectl apply -f https://raw.githubusercontent.com/intel/xpumanager/master/deployment/kubernetes/monitoring/servicemonitor-intel-xpum.yaml
+```
+
+With those components in place, Intel GPU metrics can be queried from Prometheus using the `xpum_` prefix.
````
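A monitoring workload consumes the singular resource the same way any extended resource is requested, via a resource limit. A minimal sketch, where the Pod name and image are hypothetical placeholders and only the `gpu.intel.com/i915_monitoring` resource name comes from the docs above:

```yaml
# Sketch only: metadata.name and image are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-metrics-exporter
spec:
  containers:
    - name: exporter
      image: example.registry/gpu-metrics-exporter:latest
      resources:
        limits:
          # Grants access to all i915 GPU device files on the node.
          gpu.intel.com/i915_monitoring: 1
```

Because the resource is singular per node, at most one such container per node can hold it at a time, which suits a DaemonSet-style monitoring deployment.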

cmd/gpu_plugin/monitoring.png (48.5 KB, new binary file)

deployments/xpumanager_sidecar/kustom/kustom_xpumanager.yaml (−5)

```diff
@@ -27,8 +27,3 @@ spec:
             - ALL
           readOnlyRootFilesystem: true
           runAsUser: 0
-      - name: xpumd
-        resources:
-          limits:
-            $patch: replace
-            gpu.intel.com/i915_monitoring: 1
```

deployments/xpumanager_sidecar/kustomization.yaml (+1 −1)

```diff
@@ -1,5 +1,5 @@
 resources:
-  - https://raw.githubusercontent.com/intel/xpumanager/V1.2.18/deployment/kubernetes/daemonset-intel-xpum.yaml
+  - https://raw.githubusercontent.com/intel/xpumanager/V1.2.29/deployment/kubernetes/daemonset/base/daemonset-intel-xpum.yaml
 namespace: monitoring
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
```
