Commit c8ebd7a
Add a loss comparison script (#2029)
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2049
* __->__ #2029
## Summary
This PR adds `scripts/loss_compare.py` for comparing training losses
between different git commits and/or training configurations.
## Key Features
- Commit Comparison: Compare losses between two different git commits
with deterministic training
- Configuration Comparison: Compare different training configurations on
the same commit
- Reproducibility: Automatically enables deterministic mode and seed
checkpointing for reproducible comparisons
- Real-time Output: Streams training output to both console and log
files during execution
- Statistical Analysis: Generates step-by-step loss comparisons and
summary statistics
- CI Testing: Includes an `--assert-equal` flag that lets automated tests
verify the two runs produce identical losses
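
To make the step-by-step comparison and summary statistics concrete, here is a minimal sketch of the idea. It is not the code in `scripts/loss_compare.py`: the log line shape (`step: N  loss: X`), the regex, and the names `extract_losses` and `compare_losses` are all assumptions made for illustration.

```python
import re
from statistics import mean

# ASSUMPTION: training logs contain lines shaped like "... step:  3  loss:  7.1234 ...".
# The real log format and parser in scripts/loss_compare.py may differ.
STEP_LOSS_RE = re.compile(
    r"step:\s*(\d+)\s+loss:\s*([0-9]*\.?[0-9]+(?:[eE][-+]?\d+)?)"
)


def extract_losses(log_text: str) -> dict[int, float]:
    """Map each training step to the loss reported for it (hypothetical helper)."""
    losses: dict[int, float] = {}
    for line in log_text.splitlines():
        match = STEP_LOSS_RE.search(line)
        if match:
            losses[int(match.group(1))] = float(match.group(2))
    return losses


def compare_losses(baseline: dict[int, float], test: dict[int, float]):
    """Return per-step rows and summary statistics over the steps both runs share."""
    shared_steps = sorted(baseline.keys() & test.keys())
    # Each row: (step, baseline loss, test loss, absolute difference).
    rows = [
        (step, baseline[step], test[step], abs(baseline[step] - test[step]))
        for step in shared_steps
    ]
    diffs = [row[3] for row in rows]
    summary = {
        "steps_compared": len(rows),
        "max_abs_diff": max(diffs, default=0.0),
        "mean_abs_diff": mean(diffs) if diffs else 0.0,
    }
    return rows, summary
```

Restricting the comparison to steps present in both logs keeps the statistics meaningful when one run stops early; whether the actual script does the same is not stated in this PR.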
## Usage Examples
#### Compare two commits
```
python3 ./scripts/loss_compare.py main my_branch
```
#### Compare two commits with custom configuration
```
python3 ./scripts/loss_compare.py main my_branch \
--baseline-config="./custom.toml" \
--baseline-options="--parallelism.tensor_parallel_degree=2"
```
#### Compare different parallelization strategies on same commit
```
python3 ./scripts/loss_compare.py . . \
--baseline-config="./llama3_8b.toml" \
--baseline-options="--parallelism.tensor_parallel_degree=2" \
--test-options="--parallelism.tensor_parallel_degree=1"
```
#### Assert equality for CI testing
```
python3 ./scripts/loss_compare.py main my_branch --assert-equal
```
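
A rough sketch of what an equality assertion of this kind could look like, built on the standard `unittest` module. The function and class names mirror the test name that appears in the log output in this PR, but the body is an assumption and not the script's actual implementation; the `{step: loss}` input shape is also hypothetical.

```python
import unittest


def assert_losses_equal(baseline: dict[int, float], test: dict[int, float]) -> bool:
    """Run a one-test suite that passes only if both runs report identical losses.

    ASSUMPTION: losses arrive as {step: loss} dicts; the real script may
    represent and load them differently.
    """

    class LossEqualityTest(unittest.TestCase):
        def test_losses_equal(self):
            # Both runs must cover exactly the same steps.
            self.assertEqual(
                sorted(baseline), sorted(test), "runs cover different steps"
            )
            # Exact equality is intentional: deterministic runs should match bit for bit.
            for step in sorted(baseline):
                self.assertEqual(
                    baseline[step], test[step], f"loss differs at step {step}"
                )

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LossEqualityTest)
    result = unittest.TextTestRunner(verbosity=2).run(suite)
    return result.wasSuccessful()
```

A CI wrapper could turn the boolean into a process exit code, for example `sys.exit(0 if assert_losses_equal(a, b) else 1)`.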
## Real Use Cases
Compare full-DTensor SimpleFSDP with FSDP2:
```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none"' \
--test-train-file='torchtitan.experiments.full_dtensor.train' \
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none"' \
--assert-equal --no-seed-checkpoint
[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok
```
1 parent 4a5fa99 commit c8ebd7a
2 files changed (+896, -0), under `.github/workflows` and `scripts`.