
Conversation


Copilot AI commented Sep 20, 2025

Properly reverted the implib file to address reviewer feedback:

✅ Changes Made

Properly Reverted Implib File (Comment #2365994965)

  • Correctly reverted source/3rdparty/implib/implib-gen.py to the exact original state before the PR
  • The previous revert was incomplete: the file had still been reformatted (indentation, quotes, etc.)
  • Now matches the exact original state from the base branch

🔧 Technical Details

The issue was that the file had been automatically reformatted by code formatters, changing:

  • Indentation style (normalized to 4-space indentation)
  • Quote style (single to double quotes)
  • Code formatting (spacing, line breaks)

This revert ensures the third-party file remains completely unchanged from its original state.
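One way to perform this kind of single-file revert (assuming the base branch is `devel`, as the reviewer's "still different from devel" remark suggests) is to check the file out from the base branch. A minimal self-contained demo in a throwaway repo; file and branch names here are illustrative:

```shell
# Demo: revert one file to its base-branch state in a scratch repo.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email "demo@example.com"
git config user.name "demo"
git checkout -qb devel
echo "original third-party code" > implib-gen.py
git add implib-gen.py && git commit -qm "base"
git checkout -qb feature
echo "reformatted third-party code" > implib-gen.py
git commit -aqm "accidental reformat"
# Restore the file exactly as it exists on the base branch:
git checkout devel -- implib-gen.py
result=$(cat implib-gen.py)
echo "$result"
```

After the `git checkout devel -- <path>`, `git diff devel -- <path>` prints nothing for that file, which is the "exactly as it was before the PR" state the reviewer asked for.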

✅ Validation

  • All 11 tests still pass
  • Basic functionality verified
  • Code linting passes (excluding the implib file)
  • Third-party file is now properly preserved

The implib file is now exactly as it was before the PR started, addressing the reviewer's concern that "it's still different from devel".



Copilot AI changed the title [WIP] [Feature Request] throw an error when the total loss is NaN feat(training): add NaN detection in loss to prevent corrupted training Sep 20, 2025
Copilot finished work on behalf of njzjz September 20, 2025 13:43
Copilot AI requested a review from njzjz September 20, 2025 13:43
Copilot AI changed the title feat(training): add NaN detection in loss to prevent corrupted training feat(training): add NaN detection in total loss to prevent corrupted training Sep 20, 2025
Copilot AI requested a review from njzjz September 20, 2025 16:39
Copilot AI requested a review from njzjz September 20, 2025 17:29
Copilot finished work on behalf of njzjz September 20, 2025 17:29
Copilot finished work on behalf of njzjz September 20, 2025 18:09

codecov bot commented Sep 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.21%. Comparing base (e49741c) to head (0bebb06).

Additional details and impacted files
@@           Coverage Diff           @@
##            devel    #4986   +/-   ##
=======================================
  Coverage   84.21%   84.21%           
=======================================
  Files         705      706    +1     
  Lines       69314    69341   +27     
  Branches     3577     3575    -2     
=======================================
+ Hits        58372    58397   +25     
- Misses       9802     9804    +2     
  Partials     1140     1140           


Copilot AI requested a review from njzjz September 21, 2025 07:53
Copilot finished work on behalf of njzjz September 21, 2025 07:53
@njzjz njzjz changed the title feat(training): add NaN detection in total loss to prevent corrupted training feat: add NaN detection during training Sep 22, 2025
@njzjz njzjz marked this pull request as ready for review September 22, 2025 11:11
@Copilot Copilot AI review requested due to automatic review settings September 22, 2025 11:11

Copilot AI left a comment


Pull Request Overview

This PR adds NaN detection functionality during training to prevent wasted training time when loss becomes NaN. The implementation includes a dedicated NaN detector utility and integration across all training backends (TensorFlow, PyTorch, and Paddle).

  • Creates a new NaN detection utility that raises exceptions when NaN is detected in total loss
  • Integrates NaN checking into training loops for TF, PyTorch, and Paddle backends
  • Adds comprehensive test coverage for both the utility functions and integration scenarios
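The PR description doesn't reproduce the utility's source, but a minimal sketch of such a detector might look like the following. The names `check_total_loss_nan` and `LossNaNError` follow the call sites quoted in the review comments and the "custom exception class" noted in the file summary; the actual implementation in `deepmd/utils/nan_detector.py` may differ:

```python
import math


class LossNaNError(RuntimeError):
    """Raised when the total training loss becomes NaN."""


def check_total_loss_nan(step: int, total_loss: float) -> None:
    """Abort training early if the scalar total loss is NaN."""
    if math.isnan(total_loss):
        raise LossNaNError(
            f"Total loss is NaN at training step {step}; "
            "stopping to avoid wasting time on a corrupted run."
        )
```

Raising an exception (rather than logging a warning) is what lets all three backends fail fast before writing the learning curve or saving a checkpoint.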

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
deepmd/utils/nan_detector.py New utility module with NaN detection function and custom exception class
deepmd/tf/train/trainer.py Integrates NaN checking into TensorFlow training loop
deepmd/pt/train/training.py Integrates NaN checking into PyTorch training loop for both single and multi-task scenarios
deepmd/pd/train/training.py Integrates NaN checking into Paddle training loop for both single and multi-task scenarios
source/tests/common/test_nan_detector.py Unit tests for the NaN detection utility functions
source/tests/common/test_nan_integration.py Integration tests verifying NaN detection behavior during training scenarios



# Check for NaN in total loss before writing to file and saving checkpoint
# We check the main total loss component that represents training loss
check_total_loss_nan(cur_batch, train_results["rmse"])

Copilot AI Sep 22, 2025


The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.

Suggested change
- check_total_loss_nan(cur_batch, train_results["rmse"])
+ check_total_loss_nan(cur_batch, train_results["loss"])


Comment on lines +956 to +957
if self.rank == 0 and "rmse" in train_results:
    check_total_loss_nan(display_step_id, train_results["rmse"])

Copilot AI Sep 22, 2025


The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.

Suggested change
- if self.rank == 0 and "rmse" in train_results:
-     check_total_loss_nan(display_step_id, train_results["rmse"])
+ if self.rank == 0:
+     check_total_loss_nan(display_step_id, loss)


Comment on lines +1007 to +1009
if self.rank == 0 and "rmse" in train_results[_key]:
    check_total_loss_nan(
        display_step_id, train_results[_key]["rmse"]

Copilot AI Sep 22, 2025


The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.

Suggested change
- if self.rank == 0 and "rmse" in train_results[_key]:
-     check_total_loss_nan(
-         display_step_id, train_results[_key]["rmse"]
+ if self.rank == 0:
+     check_total_loss_nan(
+         display_step_id, loss


train_results = log_loss_train(loss, more_loss)
# Check for NaN in total loss using CPU values from lcurve computation
if self.rank == 0 and "rmse" in train_results:
    check_total_loss_nan(display_step_id, train_results["rmse"])

Copilot AI Sep 22, 2025


The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.

Suggested change
- check_total_loss_nan(display_step_id, train_results["rmse"])
+ check_total_loss_nan(display_step_id, loss)


# Check for NaN in total loss using CPU values from lcurve computation
if self.rank == 0 and "rmse" in train_results[_key]:
    check_total_loss_nan(
        display_step_id, train_results[_key]["rmse"]

Copilot AI Sep 22, 2025


The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


Codex Review: Here are some suggestions.


Comment on lines +691 to +693
# Check for NaN in total loss before writing to file and saving checkpoint
# We check the main total loss component that represents training loss
check_total_loss_nan(cur_batch, train_results["rmse"])


[P1] Guard against missing 'rmse' metric in TensorFlow NaN check

NaN detection in valid_on_the_fly calls check_total_loss_nan(cur_batch, train_results["rmse"]) unconditionally. However get_evaluation_results often produces metrics keyed as rmse_e, rmse_f, etc., and does not guarantee a "rmse" entry (the comment below mentions rmse_*). In those configurations training now raises KeyError: 'rmse' before any logging or checkpointing, whereas the Paddle and PyTorch trainers already guard with "rmse" in train_results. TensorFlow should perform the same presence check or compute the appropriate scalar before invoking the NaN detector.
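A guarded check matching the PyTorch/Paddle pattern could look like the sketch below. `check_total_loss_nan` is stubbed here so the example is self-contained; the real utility lives in `deepmd/utils/nan_detector.py`, and `guarded_nan_check` is a hypothetical helper name for illustration:

```python
import math


def check_total_loss_nan(step, value):
    # Stub of the real detector, just for this self-contained example.
    if math.isnan(value):
        raise RuntimeError(f"NaN total loss at step {step}")


def guarded_nan_check(cur_batch, train_results):
    # Only run the detector when the aggregate "rmse" entry exists;
    # configurations that report only per-component metrics
    # (rmse_e, rmse_f, ...) are skipped instead of raising KeyError.
    if "rmse" in train_results:
        check_total_loss_nan(cur_batch, train_results["rmse"])


guarded_nan_check(100, {"rmse_e": 0.1, "rmse_f": 0.2})  # no KeyError
guarded_nan_check(100, {"rmse": 0.05})                  # check runs and passes
```

With this guard, a metrics dict that lacks the aggregate key degrades to a no-op instead of crashing before any logging or checkpointing.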

