Commit 2c8b594

Reorder validate and checkpoint in train (#1542)
If validation and checkpointing occur on the same training step, do the checkpointing first.
1 parent 23e4dfc commit 2c8b594

1 file changed: +4 -4 lines changed


torchtitan/train.py

Lines changed: 4 additions & 4 deletions
@@ -575,17 +575,17 @@ def train(self):
         logger.warning("Ran out of data; last step was canceled.")
         break
 
+    self.checkpointer.save(
+        self.step, last_step=(self.step == job_config.training.steps)
+    )
+
     # Run validation if validator is available
     if (
         self.job_config.validation.enabled
         and self.validator.should_validate(self.step)
     ):
         self.validator.validate(self.model_parts, self.step)
 
-    self.checkpointer.save(
-        self.step, last_step=(self.step == job_config.training.steps)
-    )
-
     # signal the profiler that the next profiling step has started
     if torch_profiler:
         torch_profiler.step()
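
The effect of the reordering can be sketched with a small stand-alone loop. This is only an illustration under stated assumptions: `StubCheckpointer`, `StubValidator`, `run`, and the interval-based save policy are hypothetical stand-ins, not torchtitan's classes. Only the call order inside the loop (save first, then validate) mirrors the diff.

```python
from dataclasses import dataclass, field


@dataclass
class StubCheckpointer:
    """Hypothetical stand-in for the trainer's checkpointer; it only records calls."""

    interval: int
    log: list = field(default_factory=list)

    def save(self, step: int, last_step: bool = False) -> None:
        # Assumed policy for this sketch: save on interval boundaries and on the last step.
        if last_step or step % self.interval == 0:
            self.log.append(("checkpoint", step))


@dataclass
class StubValidator:
    """Hypothetical stand-in for the validator; it only records calls."""

    freq: int
    log: list = field(default_factory=list)

    def should_validate(self, step: int) -> bool:
        return step % self.freq == 0

    def validate(self, step: int) -> None:
        self.log.append(("validate", step))


def run(total_steps: int, ckpt_interval: int, val_freq: int) -> list:
    """Run a dummy loop and return the ordered list of (event, step) pairs."""
    events: list = []
    checkpointer = StubCheckpointer(ckpt_interval, events)
    validator = StubValidator(val_freq, events)
    for step in range(1, total_steps + 1):
        # (the real training step would run here)

        # Order after this commit: checkpoint first ...
        checkpointer.save(step, last_step=(step == total_steps))
        # ... then validation, if it is due on this step.
        if validator.should_validate(step):
            validator.validate(step)
    return events


if __name__ == "__main__":
    for event in run(total_steps=4, ckpt_interval=2, val_freq=2):
        print(event)
```

With both intervals set to 2, every step where both events are due records the checkpoint entry before the validation entry.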
