[feat] Support time-based checkpointing during training #7613
Conversation
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>
Codecov Report
@@ Coverage Diff @@
##           master   #7613   +/-  ##
======================================
- Coverage      92%     92%    -0%
======================================
  Files         198     198
  Lines       12912   12930    +18
======================================
+ Hits        11912   11914     +2
- Misses       1000    1016    +16
Does this replace #7515? What's the difference?
There's no difference. This follows up on our Slack discussion about whether the GPU tests were failing because I was working off a branch in my fork rather than in the Lightning repo directly. Since the base repo changed, I needed to create a new PR.
What does this PR do?
Fixes #6286
To discuss: The PR currently enforces that the triggers (every_n_train_steps, train_time_interval, every_n_val_epochs) are mutually exclusive. This means that if someone wants to checkpoint every N hours and every M train batches, they would need to create two callbacks.

Before submitting
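The mutual-exclusion rule under discussion could look roughly like the following. This is a minimal standalone sketch for illustration, not the actual implementation in this PR; the helper name `validate_triggers` is hypothetical, while the three parameter names come from the PR description.

```python
from datetime import timedelta
from typing import Optional


def validate_triggers(
    every_n_train_steps: Optional[int] = None,
    train_time_interval: Optional[timedelta] = None,
    every_n_val_epochs: Optional[int] = None,
) -> None:
    # Count how many of the three checkpoint triggers are enabled.
    enabled = sum(
        trigger is not None
        for trigger in (every_n_train_steps, train_time_interval, every_n_val_epochs)
    )
    # Enforce mutual exclusivity: at most one trigger may be set.
    if enabled > 1:
        raise ValueError(
            "every_n_train_steps, train_time_interval, and every_n_val_epochs "
            "are mutually exclusive; set at most one of them."
        )
```

Under this rule, checkpointing both every N hours and every M train batches would fail validation, so a user would register two separate callbacks, one per trigger.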
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃