[Scheduler] Add support for cosine and wsd scheduler #938
Conversation
Here are running examples for the three schedulers.

bash run_train.sh --optimizer.scheduler=linear \
    --training.steps=50 \
    --training.warmup_steps=5 \
    --optimizer.min_lr_ratio=0.1

bash run_train.sh --optimizer.scheduler=cosine \
    --training.steps=50 \
    --training.warmup_steps=5 \
    --optimizer.min_lr_ratio=0.1

bash run_train.sh --optimizer.scheduler=wsd \
    --training.steps=50 \
    --training.warmup_steps=5 \
    --optimizer.min_lr_ratio=0.1
tianyu-l
left a comment
In fact, I think the three can be unified.
We only need to define warmup_ratio, decay_ratio, lr_decay_type (one of linear, sqrt, cosine), and maybe lr_min (calling it a ratio again would sound confusing; we can explain in the helper message that it's a ratio) to achieve everything.
We can explain in the helper message, or in a doc, how to use them to achieve various combinations, including the three you explicitly wrote today.
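For illustration, here is a hedged sketch of how those four knobs could reproduce the three schedulers run above. The option names (warmup_ratio, decay_ratio, lr_decay_type, lr_min) follow the proposal in this comment, not necessarily the merged API.

```python
# Hypothetical mapping only: names and semantics are assumptions based on this comment.
unified_configs = {
    # linear: decay over the entire post-warmup run
    "linear": dict(warmup_ratio=0.1, decay_ratio=0.9, lr_decay_type="linear", lr_min=0.1),
    # cosine: same shape, but with cosine decay
    "cosine": dict(warmup_ratio=0.1, decay_ratio=0.9, lr_decay_type="cosine", lr_min=0.1),
    # wsd: hold the LR stable, then decay only over the last 20% of steps
    "wsd": dict(warmup_ratio=0.1, decay_ratio=0.2, lr_decay_type="sqrt", lr_min=0.1),
}
```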
Hi @yzhangcs - thanks for the PR. That said, I believe the WSD scheduler may be useful in some cases (MSFT had a paper on this a long time ago, and I used it back in AI competitions), and the 1-sqrt decay seems to be newer work than the above. Also, we used to display the lr but it was removed b/c most people didn't need it in the display... thus I would recommend that this display be optional/configurable.
@tianyu-l Hi, just updated the things you mentioned.
@lessw2020 Thank you for your comments. I'm wondering if it's OK to add some hints and the paper link https://arxiv.org/abs/2310.07831 in config_manager?
Agreed. I just added the display to make sure the decay is right.
I left some inline comments. Below are some general comments.
Regarding file length/complexity:
I think we should separate things into a new file called lr_scheduler.py. This can be in a separate PR.
Regarding warmup steps:
Since warmup behavior is closer to lr_scheduler, should we move training.warmup_steps also to the optimizer section (or even consider creating an lr_scheduler section)?
Regarding logging the lr:
I think we should still log the lr to TensorBoard / WandB, maybe after #945.
It can be called from a new get_lr function of LRSchedulersContainer, so that people can modify/inherit it to adapt to desired behaviors.
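As a rough sketch of the suggested hook (the container internals and method name here are assumptions based on this thread, not the merged torchtitan code):

```python
# Hypothetical sketch of a get_lr hook on the schedulers container so the
# metrics logger can query current learning rates; attribute names are assumed.
class LRSchedulersContainer:
    def __init__(self, schedulers):
        # schedulers: list of torch.optim.lr_scheduler.LambdaLR (or similar)
        self.schedulers = schedulers

    def get_lr(self) -> list[float]:
        # get_last_lr() returns one LR per param group; flatten across schedulers.
        # Subclasses can override this to adapt the reporting to their needs.
        return [lr for sched in self.schedulers for lr in sched.get_last_lr()]
```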
@tianyu-l Hello, I've just updated this PR based on your suggestions. Could you please review it again to see if there's anything I might have missed?
Checks for:

bash run_train.sh --scheduler.warmup_steps=4 --scheduler.decay_ratio=0.9 --scheduler.decay_type=linear --training.steps=40
bash run_train.sh --scheduler.warmup_steps=4 --scheduler.decay_ratio=0.9 --scheduler.decay_type=cosine --training.steps=40
bash run_train.sh --scheduler.warmup_steps=4 --scheduler.decay_ratio=0.9 --scheduler.decay_type=sqrt --training.steps=40

Checks for:

bash run_train.sh --scheduler.warmup_steps=4 --scheduler.decay_ratio=0.2 --scheduler.decay_type=linear --training.steps=40
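For orientation, assuming the decay phase covers the last `decay_ratio` fraction of training (an inference from these flags, not necessarily the exact merged formula), the phase lengths work out roughly as follows:

```python
# Assumed phase arithmetic; treat as a reading aid for the commands above.
steps, warmup_steps = 40, 4
for decay_ratio in (0.9, 0.2):
    warmup_stable_steps = round(steps * (1 - decay_ratio))  # decay assumed to start here
    print(decay_ratio, warmup_stable_steps, steps - warmup_stable_steps)
# decay_ratio=0.9 -> stable phase ends at step 4 (decay over the last 36 steps)
# decay_ratio=0.2 -> stable phase ends at step 32 (decay over the last 8 steps)
```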
tianyu-l
left a comment
Looks good to me! I suggest renaming scheduler to lr_scheduler. Also, please do a final rebase.
@tianyu-l Hi, thank you for the feedback. Just fixed the issues.
tianyu-l
left a comment
Thank you!
Could you rebase onto the latest torchtitan main? I'm seeing changes from the already-merged #945 as part of this PR (e.g., train.py).
…ts (#936) Two very minor changes required by Meta legal as part of adding two new datasets: 1) license verbiage update in the readme; 2) copyright header change in BSD-License.
…940) * People ask about the FSDP2 equivalent of no_sync; that's `set_requires_gradient_sync`. * ignored_params was recently implemented and people have started using it already; update the doc.
This is similar in spirit to [PR_944](#944) (cc @lkhphuc) but takes a slightly different approach. Problem - users that turn on PP training by default will get -1 for their loss. This is b/c by default, rank 0 is the only one logged. However, for *most* PP schedules, the loss is output on the last rank. Thus, users see -1 for loss and it's a bad/confusing experience. This PR adds a check to review the current PP schedule (b/c for VBlocks, loss is returned on rank 0) and, if it is a last-rank-loss schedule, it then checks that the first rank of the last stage is visible in the LOG_RANK environment variable. If not, it warns the user, using red for the warning if color is enabled, and highlights the rank they should add in yellow: <img width="1236" alt="Screenshot 2025-03-07 at 11 51 46 AM" src="https://github.com/user-attachments/assets/02b18870-90bb-4cfb-89c1-3e92d2fb9bfb" /> Note that I attempted to then modify LOG_RANK to add the missing last rank... but it has no effect. This is b/c the --log_rank_filter passed into torchrun is fixed, and thus the env has no effect. We can fix this by moving to our own filtering via Python log filtering (thanks to @d4l3k for this idea), and then it would auto-update. The tradeoff is that we have to init distributed first (to understand the ranks), meaning that at launch there's a bit of delay before the first logging. From there, NCCL warnings are not suppressed b/c they are emitted from a .cpp file whereas torchrun filtering controls that... so we get some additional console spam. This PR thus sticks to a simple warning with a red highlight (assuming color is on) and tells the user how to fix it.
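A minimal, hedged sketch of the kind of check described above (the rank layout and function name are assumptions; the real implementation lives in torchtitan's training entry point):

```python
import logging
import os

logger = logging.getLogger(__name__)

def warn_if_loss_rank_not_logged(world_size: int, pp_degree: int) -> None:
    # Assumption: PP is the outermost mesh dim, so the last stage starts at this rank.
    first_rank_of_last_stage = world_size - world_size // pp_degree
    logged = {int(r) for r in os.environ.get("LOG_RANK", "0").split(",") if r}
    if first_rank_of_last_stage not in logged:
        logger.warning(
            "Loss is reported on rank %d, which is not in LOG_RANK=%s; "
            "add it to see the real loss instead of -1.",
            first_rank_of_last_stage,
            sorted(logged),
        )
```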
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #945 MetricsLogger should be a component, as its role is similar to CheckpointManager, which provides some functions and has its own states. More importantly, users may want to customize the metrics. Make it a component that can be customized through TrainSpec. Change the name of `MetricsLogger` to `MetricsProcessor`, as it not only logs metrics but also processes them.
@tianyu-l Hey, just wanted to check if everything looks good on your end. I'm still getting familiar with rebase, so I want to make sure I did it correctly.
@yzhangcs Could you help resolve it, e.g. by setting … Also, the warning is called on every single LR scheduler step. Let's move the check outside of this function and do it only once when setting up the scheduler function.
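A minimal sketch of the suggested restructuring, with hypothetical names (the actual check and scheduler builder live in torchtitan's lr_scheduler code):

```python
import warnings
from torch.optim.lr_scheduler import LambdaLR

def build_lr_scheduler(optimizer, warmup_steps, warmup_stable_steps, total_steps):
    # One-time sanity check at setup, instead of inside the per-step lambda,
    # so any warning fires once rather than on every scheduler step.
    if warmup_stable_steps <= warmup_steps:
        warnings.warn("warmup_stable_steps should be greater than warmup_steps")

    def lr_lambda(step):  # called every step; keep it free of checks/warnings
        if step < warmup_steps:
            return (step + 1) / max(warmup_steps, 1)
        if step < warmup_stable_steps:
            return 1.0
        progress = (step - warmup_stable_steps) / max(total_steps - warmup_stable_steps, 1)
        return max(1.0 - progress, 0.0)  # placeholder linear decay

    return LambdaLR(optimizer, lr_lambda)
```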
@tianyu-l Sure, I will create a new PR to fix it.
### What does this PR do?
Fix some minor issues in PR #938:
1. Fix the `decay_ratio` in [debug_model.toml](https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/train_configs/debug_model.toml), ensuring that `warmup_stable_steps` > `warmup_steps`.
2. Make sure `warmup_stable_steps` is rounded to an integer.
3. Move the lr check into `JobConfig`.
### What does this PR do?
This PR introduces support for cosine and WSD schedulers. The cosine scheduler is widely used in LLM training (e.g., Pythia, OLMo, Llama), while the WSD scheduler, introduced by [MiniCPM](https://arxiv.org/abs/2404.06395), features a three-stage learning rate schedule: warmup, stable, and decay. The stable stage keeps the learning rate constant, beneficial for continual pretraining and flexible training budget adjustments.

### Why this PR is necessary
The cosine scheduler is a standard for LLM training, and the WSD scheduler addresses specific needs like stable learning rates during training. This PR also supports three decay types: linear (MiniCPM's original approach), cosine (used in [hf transformers](https://github.com/huggingface/transformers/blob/main/src/transformers/optimization.py#L428)), and 1-sqrt (found optimal in [this paper](https://arxiv.org/html/2408.11029v1)). These additions provide flexibility and improved performance for diverse training scenarios.

---------

Co-authored-by: Less Wright <[email protected]>
Co-authored-by: Wei (Will) Feng <[email protected]>
Co-authored-by: Chien-Chin Huang <[email protected]>
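For readers, here is a rough, self-contained sketch of the schedule shape described above. This is a hedged illustration in plain Python; parameter names and exact boundary handling are assumptions, not the merged torchtitan implementation.

```python
import math

def wsd_lr_factor(step, warmup_steps, warmup_stable_steps, total_steps,
                  decay_type="linear", min_lr=0.0):
    """Multiplicative LR factor for a warmup-stable-decay schedule."""
    if step < warmup_steps:                       # 1) linear warmup
        return (step + 1) / max(warmup_steps, 1)
    if step < warmup_stable_steps:                # 2) stable plateau at peak LR
        return 1.0
    # 3) decay from 1.0 toward min_lr over the remaining steps
    progress = min((step - warmup_stable_steps)
                   / max(total_steps - warmup_stable_steps, 1), 1.0)
    if decay_type == "linear":                    # MiniCPM's original choice
        factor = 1.0 - progress
    elif decay_type == "cosine":                  # the cosine decay variant
        factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    elif decay_type == "sqrt":                    # the "1-sqrt" decay
        factor = 1.0 - math.sqrt(progress)
    else:
        raise ValueError(f"unknown decay_type: {decay_type}")
    return min_lr + (1.0 - min_lr) * factor
```

Setting warmup_stable_steps equal to warmup_steps recovers a plain warmup-then-decay schedule, which is how the cosine and linear variants can be expressed in the same shape.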
This PR adds learning rate logging. There was a previous attempt to implement this in an [earlier PR](#937), but that one was ultimately **closed**. This version ensures that LR logging works properly; I verified it using the WSD scheduler that was recently added in [another PR](#938). <img width="1842" height="730" alt="image" src="https://github.com/user-attachments/assets/8f23674a-d689-4cc2-9d9b-30bff4e63f3b" />
One design consideration here is that torchtitan supports multiple optimizers and learning rate schedules, each potentially having its own LR. However, in practice, I believe that 99.9999% of use cases will use a single LR. Given that, the logging works as follows:
- If there is only one learning rate, it gets logged directly under the main charts as `lr`.
- If there are multiple learning rates, they are logged under a separate section, each with its corresponding label.
Alternatively, we could have ignored the multi-LR case and always logged a single LR, but I prefer this approach since it handles both scenarios robustly with minimal extra code. Happy to adjust if others have a strong preference for simplicity over robustness.
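A hedged sketch of the branching described above (the metric keys and section naming are assumptions, not the merged code):

```python
def lr_metrics(lrs: list[float]) -> dict[str, float]:
    # Single LR: log it under the main charts as "lr".
    if len(lrs) == 1:
        return {"lr": lrs[0]}
    # Multiple LRs: log each under a separate, labeled entry.
    return {f"lr/group_{i}": lr for i, lr in enumerate(lrs)}
```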







