
Commit fd323ed

Borda and MrAnayDongre authored and committed
update
1 parent 8dee98a commit fd323ed

File tree

1 file changed: +59 -12 lines


src/lightning/pytorch/callbacks/device_stats_monitor.py

Lines changed: 59 additions & 12 deletions
@@ -34,21 +34,68 @@ class DeviceStatsMonitor(Callback):
     r"""Automatically monitors and logs device stats during training, validation and testing stage.

     ``DeviceStatsMonitor`` is a special callback as it requires a ``logger`` to be passed as argument to the ``Trainer``.

-    Device statistics are logged with keys prefixed as ``DeviceStatsMonitor.{hook_name}/{base_metric_name}`` (e.g.,
-    ``DeviceStatsMonitor.on_train_batch_start/cpu_percent``).
-    The source of these metrics depends on the ``cpu_stats`` flag and the active accelerator.
-
-    CPU (via ``psutil``): Logs ``cpu_percent``, ``cpu_vm_percent``, ``cpu_swap_percent``.
-    All are percentages (%).
-    CUDA GPU (via :func:`torch.cuda.memory_stats`): Logs detailed memory statistics from
-    PyTorch's allocator (e.g., ``allocated_bytes.all.current``, ``num_ooms``; all in Bytes).
-    GPU compute utilization is not logged by default.
-    Other Accelerators (e.g., TPU, MPS): Logs device-specific stats:
-
-    - TPU example: ``avg. free memory (MB)``.
-    - MPS example: ``mps.current_allocated_bytes``.
-
-    Observe logs or check accelerator documentation for details.
+    **Logged Metrics**
+
+    Logs device statistics with keys prefixed as ``DeviceStatsMonitor.{hook_name}/{base_metric_name}``.
+
+    The actual metrics depend on the active accelerator and the ``cpu_stats`` flag.
+
+    **CPU (via `psutil`)**
+
+    - ``cpu_percent``: System-wide CPU utilization (%)
+    - ``cpu_vm_percent``: System-wide virtual memory (RAM) utilization (%)
+    - ``cpu_swap_percent``: System-wide swap memory utilization (%)
+
+    **CUDA GPU (via `torch.cuda.memory_stats`)**
+
+    Logs memory statistics from PyTorch's caching allocator (all in bytes).
+    GPU compute utilization is not logged by default.
+
+    *General Memory Usage:*
+
+    - ``allocated_bytes.all.current``: Current allocated GPU memory
+    - ``allocated_bytes.all.peak``: Peak allocated GPU memory
+    - ``reserved_bytes.all.current``: Current reserved GPU memory (allocated + cached)
+    - ``reserved_bytes.all.peak``: Peak reserved GPU memory
+    - ``active_bytes.all.current``: Current GPU memory in active use
+    - ``active_bytes.all.peak``: Peak GPU memory in active use
+    - ``inactive_split_bytes.all.current``: Memory in inactive, splittable blocks
+
+    *Allocator Pool Statistics* (for ``small_pool`` and ``large_pool``):
+
+    - ``allocated_bytes.{pool_type}.current`` / ``.peak``
+    - ``reserved_bytes.{pool_type}.current`` / ``.peak``
+    - ``active_bytes.{pool_type}.current`` / ``.peak``
+
+    *Allocator Events:*
+
+    - ``num_ooms``: Cumulative count of out-of-memory errors
+    - ``num_alloc_retries``: Number of allocation retries
+    - ``num_device_alloc``: Number of device allocations
+    - ``num_device_free``: Number of device deallocations
+
+    For a full list of CUDA memory stats, see:
+    https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html
+
+    **TPU (via `torch_xla`)**
+
+    *Memory Metrics* (per device, e.g. ``xla:0``):
+
+    - ``memory.free.xla:0``: Free HBM memory (MB)
+    - ``memory.used.xla:0``: Used HBM memory (MB)
+    - ``memory.percent.xla:0``: Percentage of HBM memory used (%)
+
+    *XLA Operation Counters:*
+
+    - ``CachedCompile.xla``
+    - ``CreateXlaTensor.xla``
+    - ``DeviceDataCacheMiss.xla``
+    - ``UncachedCompile.xla``
+    - ``xla::add.xla``, ``xla::addmm.xla``, etc.
+
+    These counters can be retrieved using ``torch_xla.debug.metrics.counter_names()``.

     Args:
         cpu_stats: if ``None``, it will log CPU stats only if the accelerator is CPU.
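
For orientation, here is a minimal usage sketch of the callback this docstring describes; the ``CSVLogger`` choice, the ``cpu_stats=True`` flag, and the logger directory are illustrative assumptions, not part of the commit:

    import lightning.pytorch as pl
    from lightning.pytorch.callbacks import DeviceStatsMonitor
    from lightning.pytorch.loggers import CSVLogger

    # DeviceStatsMonitor requires a logger to be passed to the Trainer; metrics
    # are logged under keys such as
    # "DeviceStatsMonitor.on_train_batch_start/cpu_percent".
    trainer = pl.Trainer(
        callbacks=[DeviceStatsMonitor(cpu_stats=True)],  # log CPU stats even on GPU/TPU
        logger=CSVLogger("logs", name="device_stats"),
        max_epochs=1,
    )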

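The three CPU metrics listed in the diff map directly onto ``psutil`` calls; a standalone sketch, with dictionary keys chosen here to mirror the documented metric names:

    import psutil

    # All values are system-wide percentages, as in the docstring.
    cpu_stats = {
        "cpu_percent": psutil.cpu_percent(interval=0.5),    # CPU utilization (%)
        "cpu_vm_percent": psutil.virtual_memory().percent,  # RAM utilization (%)
        "cpu_swap_percent": psutil.swap_memory().percent,   # swap utilization (%)
    }
    print(cpu_stats)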

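The CUDA metric names are the flat keys of the dictionary returned by ``torch.cuda.memory_stats``; a quick way to inspect them on a CUDA machine (the three keys shown are arbitrary examples):

    import torch

    if torch.cuda.is_available():
        stats = torch.cuda.memory_stats(torch.device("cuda:0"))
        # Keys follow the "{stat}.{pool}.{metric}" pattern described above.
        for key in ("allocated_bytes.all.current", "reserved_bytes.all.peak", "num_ooms"):
            print(key, stats[key])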
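Similarly, the XLA operation counters can be listed and read with ``torch_xla.debug.metrics``; a sketch that assumes a working TPU/XLA environment:

    # Requires torch_xla and an attached XLA device (e.g. a TPU VM).
    import torch_xla.debug.metrics as met

    for name in met.counter_names():  # e.g. "CachedCompile", "CreateXlaTensor"
        print(name, met.counter_value(name))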