feat: add NaN detection during training #4986
base: devel
Conversation
- …tion (Co-authored-by: njzjz <[email protected]>)
- …nherit from RuntimeError (Co-authored-by: njzjz <[email protected]>)
- …CPU values and fixed loss keys (Co-authored-by: njzjz <[email protected]>)
- …energy loss (Co-authored-by: njzjz <[email protected]>)
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##            devel    #4986   +/-  ##
========================================
  Coverage   84.21%   84.21%
========================================
  Files         705      706     +1
  Lines       69314    69341    +27
  Branches     3577     3575     -2
========================================
+ Hits        58372    58397    +25
- Misses       9802     9804     +2
  Partials     1140     1140
```
Pull Request Overview
This PR adds NaN detection functionality during training to prevent wasted training time when loss becomes NaN. The implementation includes a dedicated NaN detector utility and integration across all training backends (TensorFlow, PyTorch, and Paddle).
- Creates a new NaN detection utility that raises exceptions when NaN is detected in total loss
- Integrates NaN checking into training loops for TF, PyTorch, and Paddle backends
- Adds comprehensive test coverage for both the utility functions and integration scenarios
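
The detector's source is not shown in this conversation; a minimal sketch of what such a utility could look like, assuming the function name `check_total_loss_nan` seen in the diffs and a hypothetical exception name `LossNanError` (the commit log only states that the exception inherits from `RuntimeError`):

```python
import math


class LossNanError(RuntimeError):
    """Raised when the total training loss becomes NaN.

    Hypothetical name; the PR only specifies that the custom
    exception inherits from RuntimeError.
    """


def check_total_loss_nan(step: int, total_loss: float) -> None:
    """Raise LossNanError if the total loss is NaN at the given step."""
    if math.isnan(total_loss):
        raise LossNanError(
            f"Total loss is NaN at step {step}; "
            "stopping training to avoid wasting compute."
        )
```

Raising an exception (rather than logging) lets every backend's training loop abort through its normal error path without backend-specific handling.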
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| deepmd/utils/nan_detector.py | New utility module with NaN detection function and custom exception class |
| deepmd/tf/train/trainer.py | Integrates NaN checking into TensorFlow training loop |
| deepmd/pt/train/training.py | Integrates NaN checking into PyTorch training loop for both single and multi-task scenarios |
| deepmd/pd/train/training.py | Integrates NaN checking into Paddle training loop for both single and multi-task scenarios |
| source/tests/common/test_nan_detector.py | Unit tests for the NaN detection utility functions |
| source/tests/common/test_nan_integration.py | Integration tests verifying NaN detection behavior during training scenarios |
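
A flavor of what the unit tests in test_nan_detector.py might cover, sketched with a self-contained stand-in for the utility (the real module lives at deepmd/utils/nan_detector.py; the names here are assumptions):

```python
import math
import unittest


class LossNanError(RuntimeError):
    """Stand-in for the PR's custom exception (assumed name)."""


def check_total_loss_nan(step, total_loss):
    """Stand-in for the utility in deepmd/utils/nan_detector.py."""
    if math.isnan(total_loss):
        raise LossNanError(f"NaN loss at step {step}")


class TestNanDetector(unittest.TestCase):
    def test_finite_loss_does_not_raise(self):
        # A normal loss value passes through silently.
        self.assertIsNone(check_total_loss_nan(10, 0.123))

    def test_nan_loss_raises(self):
        # A NaN loss must abort with the custom exception.
        with self.assertRaises(LossNanError):
            check_total_loss_nan(10, float("nan"))
```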
```python
# Check for NaN in total loss before writing to file and saving checkpoint
# We check the main total loss component that represents training loss
check_total_loss_nan(cur_batch, train_results["rmse"])
```
Copilot AI commented on Sep 22, 2025:
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
```diff
-check_total_loss_nan(cur_batch, train_results["rmse"])
+check_total_loss_nan(cur_batch, train_results["loss"])
```
```python
if self.rank == 0 and "rmse" in train_results:
    check_total_loss_nan(display_step_id, train_results["rmse"])
```
Same issue as above: the check targets 'rmse' rather than the actual total loss.
```diff
-if self.rank == 0 and "rmse" in train_results:
-    check_total_loss_nan(display_step_id, train_results["rmse"])
+if self.rank == 0:
+    check_total_loss_nan(display_step_id, loss)
```
```python
if self.rank == 0 and "rmse" in train_results[_key]:
    check_total_loss_nan(
        display_step_id, train_results[_key]["rmse"]
```
Same issue as above: the check targets 'rmse' rather than the actual total loss.
```diff
-if self.rank == 0 and "rmse" in train_results[_key]:
-    check_total_loss_nan(
-        display_step_id, train_results[_key]["rmse"]
+if self.rank == 0:
+    check_total_loss_nan(
+        display_step_id, loss
```
```python
train_results = log_loss_train(loss, more_loss)
# Check for NaN in total loss using CPU values from lcurve computation
if self.rank == 0 and "rmse" in train_results:
    check_total_loss_nan(display_step_id, train_results["rmse"])
```
Same issue as above: the check targets 'rmse' rather than the actual total loss.
```diff
-check_total_loss_nan(display_step_id, train_results["rmse"])
+check_total_loss_nan(display_step_id, loss)
```
```python
# Check for NaN in total loss using CPU values from lcurve computation
if self.rank == 0 and "rmse" in train_results[_key]:
    check_total_loss_nan(
        display_step_id, train_results[_key]["rmse"]
```
Same issue as above: the check targets 'rmse' rather than the actual total loss.
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codex Review: Here are some suggestions.
```python
# Check for NaN in total loss before writing to file and saving checkpoint
# We check the main total loss component that represents training loss
check_total_loss_nan(cur_batch, train_results["rmse"])
```
[P1] Guard against missing 'rmse' metric in TensorFlow NaN check
NaN detection in valid_on_the_fly calls check_total_loss_nan(cur_batch, train_results["rmse"]) unconditionally. However get_evaluation_results often produces metrics keyed as rmse_e, rmse_f, etc., and does not guarantee a "rmse" entry (the comment below mentions rmse_*). In those configurations training now raises KeyError: 'rmse' before any logging or checkpointing, whereas the Paddle and PyTorch trainers already guard with "rmse" in train_results. TensorFlow should perform the same presence check or compute the appropriate scalar before invoking the NaN detector.
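
The presence check Codex asks for could look like the following sketch (self-contained: the detector is inlined for illustration, and the helper name `guarded_nan_check` is an assumption):

```python
import math


def check_total_loss_nan(step, total_loss):
    # Minimal inline version of the PR's utility, for illustration only.
    if math.isnan(total_loss):
        raise RuntimeError(f"NaN loss at step {step}")


def guarded_nan_check(step, train_results):
    # Only check when an aggregate "rmse" entry exists; some TF
    # configurations emit only per-component keys such as
    # rmse_e and rmse_f, and indexing "rmse" directly would
    # raise KeyError before any logging or checkpointing.
    if "rmse" in train_results:
        check_total_loss_nan(step, train_results["rmse"])
```

This mirrors the `"rmse" in train_results` guard already used by the PyTorch and Paddle trainers.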
Properly reverted the implib file to address reviewer feedback:

✅ Changes Made

Reverted source/3rdparty/implib/implib-gen.py to the exact original state before the PR (Comment #2365994965).

🔧 Technical Details

The issue was that the file had been automatically reformatted by code formatters. This revert ensures the third-party file remains completely unchanged from its original state.

✅ Validation

The implib file is now exactly as it was before the PR started, addressing the reviewer's concern that "it's still different from devel".