Skip to content

Conversation

@wesleytruong
Copy link
Contributor

If validation and checkpoint occur on the same training step, do checkpointing first

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 6, 2025
If validation and checkpoint occur on the same training step, do checkpointing first
@wesleytruong wesleytruong force-pushed the wesleytruong-patch-1 branch from cdd27af to 16f6670 Compare August 6, 2025 23:34
Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you say more about why?

@wesleytruong
Copy link
Contributor Author

could you say more about why?

It may be a minor use case in real world where validation and checkpoint don't often occur on same step, but I think it could happen, especially since some people may set the behavior to occur this way.

I think it would be a good change since it makes training more robust against errors in validation. If validation fails on the same step then we would lose all our pre-checkpoint progress. Additionally, hidden problems could arise in validation that may interfere with checkpointing, such as I had a problem before where validation and checkpoint on the same step would cause a missing fqn in state dict error during save, but saving on a different step would work. This was likely because validation has to change pp schedule to only do forward and may not always correctly reset pp schedule or model to its original state. Moving checkpoint.save to before validate could help to isolate against errors such as this and make one less thing to worry about.

Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. That's very thoughtful!

If this is the case, it's less necessary to manually call reshard on FSDP modules, as it was originally there to make checkpoint saving happy. But we can keep it there for now.

@tianyu-l tianyu-l merged commit 2c8b594 into main Aug 8, 2025
7 checks passed
@tianyu-l tianyu-l deleted the wesleytruong-patch-1 branch August 8, 2025 23:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants