
Conversation

@rakkit (Contributor) commented Nov 19, 2025

This is a draft PR for:

  1. Make the MoE's load_balance_coeff configurable.
  2. Add batch-wise and sequence-wise auxiliary losses for load balancing (ref: DeepSeek-V3, Eqs. 17-20); a minimal sketch follows below.

For now, it only applies to the DeepSeek model, but I can add it to all other MoE models at the end.
(Also, we don't log the aux loss yet; I can add an optimizer hook to do this if you want.)

The main concern is that the aux loss does not work well with PP. From what I have tested, it works correctly only with 1F1B; it is broken with ZBV and interleaved 1F1B.
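For reference, here is a minimal sketch of the sequence-wise auxiliary balance loss from DeepSeek-V3 (Eqs. 17-20). The function name, arguments, and return convention are illustrative only and are not taken from this PR's diff:

```python
import torch

def sequence_aux_loss(
    router_scores: torch.Tensor,   # [T, num_experts] raw affinity scores s_{i,t}
    top_k_indices: torch.Tensor,   # [T, top_k] experts selected for each token
    alpha: float = 0.001,
) -> torch.Tensor:
    T, num_experts = router_scores.shape
    top_k = top_k_indices.shape[1]

    # Eqs. 19-20: normalize affinities per token, then average over the sequence -> P_i
    probs = router_scores / router_scores.sum(dim=-1, keepdim=True)
    P = probs.mean(dim=0)                                    # [num_experts]

    # Eq. 18: f_i = (N_r / (K_r * T)) * number of tokens routed to expert i
    routed = torch.zeros(T, num_experts, device=router_scores.device)
    routed.scatter_(1, top_k_indices, 1.0)
    f = routed.sum(dim=0) * num_experts / (top_k * T)        # [num_experts]

    # Eq. 17: L_Bal = alpha * sum_i f_i * P_i
    return alpha * (f * P).sum()
```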

To test it:
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --model.extra_losses.load_balance_loss_weight=0.001

Inline review comment from @rakkit (Contributor, Author) on the loss-function setup:

    job_config, parallel_dims=parallel_dims, ft_manager=self.ft_manager
    )

    self.loss_fn = functools.partial(

We can add a condition here to wrap the loss (or not) for MoE. For now, all models in torchtitan only return a single output, so this is fine as-is.
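As an illustration of what that condition might look like, here is a hypothetical sketch; `moe_enabled`, `aux_loss_weight`, and the assumption that an MoE forward returns an (output, aux_loss) tuple are mine, not part of this PR:

```python
def maybe_wrap_moe_loss(base_loss_fn, moe_enabled: bool, aux_loss_weight: float):
    # Hypothetical helper: wrap the base loss only for MoE models that return
    # (output, aux_loss); dense models keep the unwrapped loss function.
    if not moe_enabled:
        return base_loss_fn

    def wrapped(pred, labels):
        output, aux_loss = pred  # assumed MoE forward convention
        return base_loss_fn(output, labels) + aux_loss_weight * aux_loss

    return wrapped
```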
