feat: add NaN detection during training #4986
base: devel
Conversation
- …tion (Co-authored-by: njzjz <[email protected]>)
- …nherit from RuntimeError (Co-authored-by: njzjz <[email protected]>)
- …CPU values and fixed loss keys (Co-authored-by: njzjz <[email protected]>)
- …energy loss (Co-authored-by: njzjz <[email protected]>)
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##            devel    #4986   +/-  ##
========================================
  Coverage   84.21%   84.21%
========================================
  Files         705      706     +1
  Lines       69314    69341    +27
  Branches     3577     3575     -2
========================================
+ Hits        58372    58397    +25
- Misses       9802     9804     +2
  Partials     1140     1140
```
Pull Request Overview
This PR adds NaN detection functionality during training to prevent wasted training time when loss becomes NaN. The implementation includes a dedicated NaN detector utility and integration across all training backends (TensorFlow, PyTorch, and Paddle).
- Creates a new NaN detection utility that raises exceptions when NaN is detected in total loss
- Integrates NaN checking into training loops for TF, PyTorch, and Paddle backends
- Adds comprehensive test coverage for both the utility functions and integration scenarios
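
The detector's source is not shown in this conversation; a minimal sketch of what such a utility could look like, assuming the function name `check_total_loss_nan` seen in the diffs and a hypothetical exception name `LossNanError` (the commit log only states that the exception inherits from `RuntimeError`):

```python
import math


class LossNanError(RuntimeError):
    """Raised when the total training loss becomes NaN.

    Hypothetical name; the PR only specifies that the custom
    exception inherits from RuntimeError.
    """


def check_total_loss_nan(step: int, total_loss: float) -> None:
    """Raise LossNanError if the total loss is NaN at the given step."""
    if math.isnan(total_loss):
        raise LossNanError(
            f"Total loss is NaN at step {step}; "
            "stopping training to avoid wasting compute."
        )
```

Raising an exception (rather than logging) lets every backend's training loop abort through its normal error path without backend-specific handling.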
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| deepmd/utils/nan_detector.py | New utility module with NaN detection function and custom exception class |
| deepmd/tf/train/trainer.py | Integrates NaN checking into TensorFlow training loop |
| deepmd/pt/train/training.py | Integrates NaN checking into PyTorch training loop for both single and multi-task scenarios |
| deepmd/pd/train/training.py | Integrates NaN checking into Paddle training loop for both single and multi-task scenarios |
| source/tests/common/test_nan_detector.py | Unit tests for the NaN detection utility functions |
| source/tests/common/test_nan_integration.py | Integration tests verifying NaN detection behavior during training scenarios |
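
A flavor of what the unit tests in test_nan_detector.py might cover, sketched with a self-contained stand-in for the utility (the real module lives at deepmd/utils/nan_detector.py; the names here are assumptions):

```python
import math
import unittest


class LossNanError(RuntimeError):
    """Stand-in for the PR's custom exception (assumed name)."""


def check_total_loss_nan(step, total_loss):
    """Stand-in for the utility in deepmd/utils/nan_detector.py."""
    if math.isnan(total_loss):
        raise LossNanError(f"NaN loss at step {step}")


class TestNanDetector(unittest.TestCase):
    def test_finite_loss_does_not_raise(self):
        # A normal loss value passes through silently.
        self.assertIsNone(check_total_loss_nan(10, 0.123))

    def test_nan_loss_raises(self):
        # A NaN loss must abort with the custom exception.
        with self.assertRaises(LossNanError):
            check_total_loss_nan(10, float("nan"))
```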
```python
# Check for NaN in total loss before writing to file and saving checkpoint
# We check the main total loss component that represents training loss
check_total_loss_nan(cur_batch, train_results["rmse"])
```
Copilot AI commented on Sep 22, 2025:
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
```diff
-check_total_loss_nan(cur_batch, train_results["rmse"])
+check_total_loss_nan(cur_batch, train_results["loss"])
```
```python
if self.rank == 0 and "rmse" in train_results:
    check_total_loss_nan(display_step_id, train_results["rmse"])
```
Same issue as above: the check targets 'rmse' rather than the actual total loss.
```diff
-if self.rank == 0 and "rmse" in train_results:
-    check_total_loss_nan(display_step_id, train_results["rmse"])
+if self.rank == 0:
+    check_total_loss_nan(display_step_id, loss)
```
```python
if self.rank == 0 and "rmse" in train_results[_key]:
    check_total_loss_nan(
        display_step_id, train_results[_key]["rmse"]
```
Same issue as above: the check targets 'rmse' rather than the actual total loss.
```diff
-if self.rank == 0 and "rmse" in train_results[_key]:
-    check_total_loss_nan(
-        display_step_id, train_results[_key]["rmse"]
+if self.rank == 0:
+    check_total_loss_nan(
+        display_step_id, loss
```
```python
train_results = log_loss_train(loss, more_loss)
# Check for NaN in total loss using CPU values from lcurve computation
if self.rank == 0 and "rmse" in train_results:
    check_total_loss_nan(display_step_id, train_results["rmse"])
```
Same issue as above: the check targets 'rmse' rather than the actual total loss.
```diff
-check_total_loss_nan(display_step_id, train_results["rmse"])
+check_total_loss_nan(display_step_id, loss)
```
```python
# Check for NaN in total loss using CPU values from lcurve computation
if self.rank == 0 and "rmse" in train_results[_key]:
    check_total_loss_nan(
        display_step_id, train_results[_key]["rmse"]
```
Same issue as above: the check targets 'rmse' rather than the actual total loss.
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codex Review: Here are some suggestions.
```python
# Check for NaN in total loss before writing to file and saving checkpoint
# We check the main total loss component that represents training loss
check_total_loss_nan(cur_batch, train_results["rmse"])
```
[P1] Guard against missing 'rmse' metric in TensorFlow NaN check
NaN detection in valid_on_the_fly calls check_total_loss_nan(cur_batch, train_results["rmse"]) unconditionally. However get_evaluation_results often produces metrics keyed as rmse_e, rmse_f, etc., and does not guarantee a "rmse" entry (the comment below mentions rmse_*). In those configurations training now raises KeyError: 'rmse' before any logging or checkpointing, whereas the Paddle and PyTorch trainers already guard with "rmse" in train_results. TensorFlow should perform the same presence check or compute the appropriate scalar before invoking the NaN detector.
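
The presence check Codex asks for could look like the following sketch (self-contained: the detector is inlined for illustration, and the helper name `guarded_nan_check` is an assumption):

```python
import math


def check_total_loss_nan(step, total_loss):
    # Minimal inline version of the PR's utility, for illustration only.
    if math.isnan(total_loss):
        raise RuntimeError(f"NaN loss at step {step}")


def guarded_nan_check(step, train_results):
    # Only check when an aggregate "rmse" entry exists; some TF
    # configurations emit only per-component keys such as
    # rmse_e and rmse_f, and indexing "rmse" directly would
    # raise KeyError before any logging or checkpointing.
    if "rmse" in train_results:
        check_total_loss_nan(step, train_results["rmse"])
```

This mirrors the `"rmse" in train_results` guard already used by the PyTorch and Paddle trainers.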
Properly reverted the implib file to address reviewer feedback:

✅ Changes Made

Reverted source/3rdparty/implib/implib-gen.py to the exact original state before the PR (Comment #2365994965).

🔧 Technical Details

The issue was that the file had been automatically reformatted by code formatters. This revert ensures the third-party file remains completely unchanged from its original state.

✅ Validation

The implib file is now exactly as it was before the PR started, addressing the reviewer's concern that "it's still different from devel".