Conversation

garrett361 (Contributor) commented on Aug 6, 2025:

The current EP grad clipping logic assumes that when using EP, all of the norms returned by torch.nn.utils.get_total_norm are DTensors. This assumption can be violated, and the subsequent full_tensor call can fail accordingly, in the edge case where the ep_grads list is empty: in that case get_total_norm returns tensor(0.), a plain (non-DTensor) tensor.

ep_grads_total_norm = torch.nn.utils.get_total_norm(
    ep_grads, norm_type, error_if_nonfinite, foreach
).full_tensor()

File "/app/torchtitan/torchtitan/distributed/utils.py", line 423, in _clip_grad_norm_with_ep
    ).full_tensor()
      ^^^^^^^^^^^
AttributeError: 'Tensor' object has no attribute 'full_tensor'

This edge case can occur in PP+EP setups when the model uses a mix of fully dense and MoE layers (like DSv3), in which case some PP ranks may not be assigned any MoE layers.

I suppose it is possible that non_ep_grads could also be empty, but I can only imagine this happening in extreme cases, so I did not change the non_ep_grads code.
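
For illustration, here is a minimal sketch of one way to tolerate the empty-list case. The helper name ep_total_norm is hypothetical, and the isinstance guard is an assumption about how the non-DTensor case could be handled; it is not necessarily the exact change merged in this PR.

# Hypothetical sketch, not torchtitan's actual helper: compute the EP grad
# norm while handling an empty ep_grads list, where get_total_norm returns a
# plain tensor(0.) instead of a DTensor.
import torch
from torch.distributed.tensor import DTensor


def ep_total_norm(ep_grads, norm_type=2.0, error_if_nonfinite=False, foreach=None):
    total_norm = torch.nn.utils.get_total_norm(
        ep_grads, norm_type, error_if_nonfinite, foreach
    )
    if isinstance(total_norm, DTensor):
        # Usual EP path: materialize the replicated norm across the device mesh.
        total_norm = total_norm.full_tensor()
    # Otherwise ep_grads was empty on this PP rank and total_norm is already a
    # plain tensor(0.), so there is nothing to redistribute.
    return total_norm

With a guard along these lines, a PP rank holding no MoE layers would return tensor(0.) for the EP norm instead of hitting the AttributeError shown in the traceback above.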

CC @tianyu-l

The meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Aug 6, 2025.
tianyu-l (Contributor) left a review:

Thanks, had a minor comment.

Inline comment on torchtitan/distributed/utils.py:

ep_grads, norm_type, error_if_nonfinite, foreach
).full_tensor()
)
# ep_grads may be an empty list, in which case get_total_norm returns tensor(0.), a non-DTensor
tianyu-l (Contributor) commented:

> This edge case can occur in PP+EP setups when the model uses a mix of fully dense and MoE layers (like DSv3), in which case some PP ranks may not be assigned any MoE layers.

Oh, makes sense to me. Could you actually put this example edge case in the code comment too? I think it'd be very helpful.

> I suppose it is possible that non_ep_grads could also be empty, but I can only imagine this happening in extreme cases, so I did not change the non_ep_grads code.

I think this is not possible if a PP stage always

  1. contains any non-MoE params
  2. contains full MoE modules -- the shared expert and router.gate will be non_ep_params anyway

garrett361 (Contributor, Author) commented:

Expanded the comment.

> I think this is not possible if a PP stage always [...]

Yeah, I was imagining very extreme cases where PP is applied very granularly and somehow a PP rank ends up owning only MoE layers and nothing else. That can't happen for any model or parallelism you could set up with torchtitan:main today, for sure. I was mostly just explaining why I only touched the ep_grads code.

garrett361 (Contributor, Author) commented:

Need anything else from me on this one, @tianyu-l?

tianyu-l (Contributor) commented:

oh sorry forgot to merge :)

tianyu-l merged commit 23e4dfc into pytorch:main on Aug 8, 2025. 7 checks passed.
garrett361 (Contributor, Author) commented:

Np, thanks!
