@@ -34,21 +34,68 @@ class DeviceStatsMonitor(Callback):
r"""Automatically monitors and logs device stats during training, validation and testing stage.
``DeviceStatsMonitor`` is a special callback, as it requires a ``logger`` to be passed as an argument to the ``Trainer``.

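Because the callback only emits metrics through the ``Trainer``'s logger, a minimal usage sketch looks like the following. This is illustrative only: it assumes the ``lightning.pytorch`` import path (older releases expose the same classes under ``pytorch_lightning``), and ``CSVLogger`` is just one possible logger choice.

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import DeviceStatsMonitor
from lightning.pytorch.loggers import CSVLogger

# DeviceStatsMonitor logs through the Trainer's logger, so a logger must be configured.
trainer = Trainer(
    callbacks=[DeviceStatsMonitor(cpu_stats=True)],  # cpu_stats=True also logs CPU stats on non-CPU accelerators
    logger=CSVLogger(save_dir="logs"),
    max_epochs=1,
)
# trainer.fit(model)  # `model` is a placeholder for your LightningModule
```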
- Device statistics are logged with keys prefixed as ``DeviceStatsMonitor.{hook_name}/{base_metric_name}`` (e.g.,
- ``DeviceStatsMonitor.on_train_batch_start/cpu_percent``).
- The source of these metrics depends on the ``cpu_stats`` flag and the active accelerator.
- CPU (via ``psutil``): Logs ``cpu_percent``, ``cpu_vm_percent``, ``cpu_swap_percent``.
- All are percentages (%).
- CUDA GPU (via :func:`torch.cuda.memory_stats`): Logs detailed memory statistics from
- PyTorch's allocator (e.g., ``allocated_bytes.all.current``, ``num_ooms``; all in Bytes).
- GPU compute utilization is not logged by default.
- Other Accelerators (e.g., TPU, MPS): Logs device-specific stats:
+ **Logged Metrics**
- - TPU example: ``avg. free memory (MB)``.
- - MPS example: ``mps.current_allocated_bytes``.
+ Logs device statistics with keys prefixed as ``DeviceStatsMonitor.{hook_name}/{base_metric_name}``.
- Observe logs or check accelerator documentation for details.
+ The actual metrics depend on the active accelerator and the ``cpu_stats`` flag.
+
+ **CPU (via ``psutil``)**
+
+ - ``cpu_percent``: System-wide CPU utilization (%)
+ - ``cpu_vm_percent``: System-wide virtual memory (RAM) utilization (%)
+ - ``cpu_swap_percent``: System-wide swap memory utilization (%)
+
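The three CPU percentages above are plain system-wide readings. A minimal sketch of the corresponding ``psutil`` calls follows, for orientation only; the helper name ``read_cpu_stats`` is illustrative and the callback's own implementation may differ in details such as sampling interval.

```python
import psutil


def read_cpu_stats() -> dict:
    """System-wide CPU, RAM and swap utilization, mirroring the metric names above."""
    return {
        "cpu_percent": psutil.cpu_percent(),                # CPU utilization since last call (%)
        "cpu_vm_percent": psutil.virtual_memory().percent,  # virtual memory (RAM) in use (%)
        "cpu_swap_percent": psutil.swap_memory().percent,   # swap space in use (%)
    }


print(read_cpu_stats())
```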
+ **CUDA GPU (via ``torch.cuda.memory_stats``)**
+
+ Logs memory statistics from the PyTorch caching allocator (all in bytes).
+ GPU compute utilization is not logged by default.
+
+ *General Memory Usage:*
+
+ - ``allocated_bytes.all.current``: Current allocated GPU memory
+ - ``allocated_bytes.all.peak``: Peak allocated GPU memory
+ - ``reserved_bytes.all.current``: Current reserved GPU memory (allocated + cached)
+ - ``reserved_bytes.all.peak``: Peak reserved GPU memory
+ - ``active_bytes.all.current``: Current GPU memory in active use
+ - ``active_bytes.all.peak``: Peak GPU memory in active use
+ - ``inactive_split_bytes.all.current``: Memory in inactive, splittable blocks
+
+ *Allocator Pool Statistics* (for ``small_pool`` and ``large_pool``):
+
+ - ``allocated_bytes.{pool_type}.current`` / ``.peak``
+ - ``reserved_bytes.{pool_type}.current`` / ``.peak``
+ - ``active_bytes.{pool_type}.current`` / ``.peak``
+
+ *Allocator Events:*
+
+ - ``num_ooms``: Cumulative out-of-memory errors
+ - ``num_alloc_retries``: Number of allocation retries
+ - ``num_device_alloc``: Number of device allocations
+ - ``num_device_free``: Number of device deallocations
+
+ For a full list of CUDA memory stats, see:
+ https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html
+
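All of the dotted names above are keys of the flat dictionary returned by :func:`torch.cuda.memory_stats`. A short sketch of reading a few of them directly, guarded so it only runs where CUDA is available:

```python
import torch

if torch.cuda.is_available():
    stats = torch.cuda.memory_stats(device=0)  # flat dict: {"allocated_bytes.all.current": int, ...}
    print("allocated (current):", stats["allocated_bytes.all.current"])
    print("reserved (peak):    ", stats["reserved_bytes.all.peak"])
    print("OOM events:         ", stats["num_ooms"])

    # Pool-specific statistics follow the same key pattern.
    for pool in ("small_pool", "large_pool"):
        print(f"{pool} active (current):", stats[f"active_bytes.{pool}.current"])
```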
+ **TPU (via ``torch_xla``)**
+
+ *Memory Metrics* (per device, e.g. ``xla:0``):
+
+ - ``memory.free.xla:0``: Free HBM memory (MB)
+ - ``memory.used.xla:0``: Used HBM memory (MB)
+ - ``memory.percent.xla:0``: Percentage of HBM memory used (%)
+
+ *XLA Operation Counters:*
+
+ - ``CachedCompile.xla``
+ - ``CreateXlaTensor.xla``
+ - ``DeviceDataCacheMiss.xla``
+ - ``UncachedCompile.xla``
+ - ``xla::add.xla``, ``xla::addmm.xla``, etc.
+
+ These counters can be retrieved using:
+ ``torch_xla.debug.metrics.counter_names()``
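The XLA counters are only meaningful inside a TPU/XLA environment. The sketch below assumes ``torch_xla`` is installed and simply dumps whatever counters the runtime has recorded so far; the counter names and values vary with the workload.

```python
import torch_xla.debug.metrics as met

# Enumerate the counters recorded so far and read each value.
for name in met.counter_names():
    print(name, met.counter_value(name))

# A full human-readable report (compile times, transfer counts, counters) is also available.
print(met.metrics_report())
```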
Args:
cpu_stats: if ``None``, it will log CPU stats only if the accelerator is CPU.