
Conversation

@ruisizhang123 (Contributor) commented Jun 1, 2025

As titled, this PR adds support for DDP+TP under SimpleFSDP's replicate mode.

  1. Profile trace for DDP. As seen in the trace below, the DDP backward communication is all-reduce (a sketch of how such a trace can be collected follows after this description).
     [screenshot: profiler trace, 2025-06-01, 1:10 PM]
  2. Numerical convergence. The loss-convergence discrepancy is within ~1e-3 for [ddp:2, tp:2] vs. [fsdp:2, tp:2] (with mixed-precision training).
     [screenshot: loss curves, 2025-06-01, 11:39 PM]

The loss convergence is identical for [ddp:2, tp:2] and [fsdp:2, tp:2] (without mixed-precision training).
[screenshot: loss curves, 2025-06-02, 11:59 AM]
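For context (not part of the PR itself), here is a minimal sketch of how such a profiler trace can be collected; the model, optimizer, and data below are stand-ins for the actual torchtitan DDP+TP run.

```python
# Minimal sketch (stand-in model/data; in the real run this is the DDP+TP
# parallelized Llama model from torchtitan): collect a trace for a few steps
# and export it for inspection in chrome://tracing or Perfetto.
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

model = nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
batches = [torch.randn(8, 1024) for _ in range(3)]

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for batch in batches:
        loss = model(batch).sum()
        loss.backward()           # with DDP, the gradient all-reduce is issued here
        optimizer.step()
        optimizer.zero_grad()

prof.export_chrome_trace("ddp_tp_trace.json")
# In the exported trace, DDP's backward should show all-reduce collectives,
# whereas FSDP's backward would show reduce-scatter (plus forward all-gather).
```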

@facebook-github-bot added the CLA Signed label on Jun 1, 2025
@ruisizhang123 requested a review from tianyu-l on June 1, 2025
@tianyu-l (Contributor) left a comment


Makes sense to me!

Two more comments:

  1. For the comment on
     https://github.com/pytorch/torchtitan/pull/1250/files#diff-02c09227aed7868aae47b1b0b6cb3b5105b84f2543cc2dea9c5f3a7cb265eeadR180
     I think we need to update it, because:
     for FSDP, it's all-gather in forward and reduce-scatter in backward;
     for DDP, it's all-reduce in backward.
     Note these are in addition to the mixed-precision dtype conversions.
     Let's actually verify this behavior with a trace in the PR summary, as we haven't verified it before (see the sketch after this comment).
  2. Let's also verify the numerics by comparing "FSDP 2" vs. "DDP2+TP2" (where we assume FSDP is the ground truth).
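For reference, a minimal sketch (illustrative only, not code from this PR) of the collective patterns item 1 describes, written with plain torch.distributed calls; it assumes the process group is already initialized and the shapes are arbitrary.

```python
# Minimal sketch of the collectives described above (illustrative only;
# assumes torch.distributed is already initialized on `world_size` GPUs).
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
shard = torch.randn(1024, device="cuda")                # this rank's parameter shard
grad = torch.randn(1024 * world_size, device="cuda")    # full gradient on this rank

# FSDP-style: all-gather parameters in forward ...
full_param = torch.empty(1024 * world_size, device="cuda")
dist.all_gather_into_tensor(full_param, shard)

# ... and reduce-scatter gradients in backward.
grad_shard = torch.empty(1024, device="cuda")
dist.reduce_scatter_tensor(grad_shard, grad, op=dist.ReduceOp.AVG)

# DDP-style: parameters stay replicated; backward all-reduces the full gradient.
dist.all_reduce(grad, op=dist.ReduceOp.AVG)
```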

@ruisizhang123 force-pushed the ruisi/ddp+tp branch 2 times, most recently from d144900 to a201f90, on June 1, 2025
@ruisizhang123 (Contributor, Author)

> Makes sense to me!
>
> Two more comments:
>
> 1. For the comment on https://github.com/pytorch/torchtitan/pull/1250/files#diff-02c09227aed7868aae47b1b0b6cb3b5105b84f2543cc2dea9c5f3a7cb265eeadR180 I think we need to update it, because: for FSDP, it's all-gather in forward and reduce-scatter in backward; for DDP, it's all-reduce in backward. Note these are in addition to the mixed-precision dtype conversions. Let's actually verify this behavior with a trace in the PR summary, as we haven't verified it before.
> 2. Let's also verify the numerics by comparing "FSDP 2" vs. "DDP2+TP2" (where we assume FSDP is the ground truth).

Updated. Thank you!

@tianyu-l (Contributor) left a comment


> Numerical convergence: As seen, the loss convergence is close for [ddp:2, tp:2] and [fsdp:2, tp:2].

This actually looks concerning. I would expect the loss to be exactly the same between the two, if random seed, determinism, and the same initialization of parameters are used.

Thinking about the possible reasons, I think parameter init is not controlled: FSDP would init a sharded tensor on the dp mesh, whereas DDP would init a replicated tensor across the dp mesh.

To remove this factor, let's init a seed checkpoint first, and then kick off two separate runs loading the same checkpoint.
https://github.com/pytorch/torchtitan/blob/main/docs/checkpoint.md#how-to-create-a-seed-checkpoint
(Note that you may have to copy/move/remove checkpoints so that both runs actually load from the step-0 seed checkpoint.)
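As a side note, here is a minimal sketch (not torchtitan's actual code path, which is driven by its config and the seed-checkpoint flow linked above) of the determinism setup that makes such an A/B comparison meaningful once both runs load the same checkpoint:

```python
# Minimal sketch of determinism setup for an apples-to-apples DDP-vs-FSDP run.
import torch

def make_deterministic(seed: int = 0) -> None:
    torch.manual_seed(seed)                   # seed CPU RNG
    torch.cuda.manual_seed_all(seed)          # seed CUDA RNGs on all devices
    torch.use_deterministic_algorithms(True)  # error out on nondeterministic ops
    torch.backends.cudnn.benchmark = False    # avoid autotuned-kernel variance

make_deterministic(42)
# Both runs then load identical parameters from the step-0 seed checkpoint,
# so any remaining loss gap comes from the parallelism/communication path.
```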

@ruisizhang123 (Contributor, Author)

> > Numerical convergence: As seen, the loss convergence is close for [ddp:2, tp:2] and [fsdp:2, tp:2].
>
> This actually looks concerning. I would expect the loss to be exactly the same between the two, if random seed, determinism, and the same initialization of parameters are used.
>
> Thinking about the possible reasons, I think parameter init is not controlled: FSDP would init a sharded tensor on the dp mesh, whereas DDP would init a replicated tensor across the dp mesh.
>
> To remove this factor, let's init a seed checkpoint first, and then kick off two separate runs loading the same checkpoint. https://github.com/pytorch/torchtitan/blob/main/docs/checkpoint.md#how-to-create-a-seed-checkpoint (Note that you may have to copy/move/remove checkpoints so that both runs actually load from the step-0 seed checkpoint.)

It seems I forgot to set the same seed for both runs. With the newly updated plot, the discrepancy between DDP+TP and FSDP+TP is much smaller. Sorry for the confusion here.

@fegin (Contributor) commented Jun 2, 2025

DDP + TP is twice as fast as FSDP + TP. Is this expected? Does this mean the all-gathers are exposed? Or are there performance optimizations that are not turned on yet?

@ruisizhang123 (Contributor, Author) commented Jun 2, 2025

> DDP + TP is twice as fast as FSDP + TP. Is this expected? Does this mean the all-gathers are exposed? Or are there performance optimizations that are not turned on yet?

Yes, with only the front-end, SimpleFSDP exposes all of its communication. The optimizations (pre-fetching & bucketing) are performed in the compiler backend, which has not been turned on here.
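To illustrate what turning on the compiler backend would look like (an assumption about usage, not something enabled in this PR): the SimpleFSDP-wrapped module would go through torch.compile so Inductor can bucket and pre-fetch the exposed collectives.

```python
# Illustrative only: eager-mode SimpleFSDP leaves every collective exposed;
# compiling the wrapped model is what lets the Inductor backend bucket and
# reorder (pre-fetch) those collectives. `parallelized_model` is a placeholder
# for the SimpleFSDP(+TP)-wrapped module.
import torch

compiled_model = torch.compile(parallelized_model)
# Training then uses `compiled_model`; the comm/compute overlap comes from
# compiler passes rather than from the front-end wrapper.
```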

@ruisizhang123 (Contributor, Author) commented Jun 2, 2025

> It seems I forgot to set the same seed for both runs. With the newly updated plot, the discrepancy between DDP+TP and FSDP+TP is much smaller. Sorry for the confusion here.

As shown in the PR description, there are still some minor differences between SimpleFSDP+TP's replicate and fully_shard modes. After turning off TP, the gap still exists and stays in a similar range of ~1e-3.

[loss curves with MPT]

Tianyu suggested we could turn off MPT. After turning it off, the discrepancy shrinks to ~1e-4. We need to look into DTensor redistribute to see whether it handles DTensor precision differently in replicate (all-reduce) and fully_shard (reduce-scatter) modes.

[loss curves without MPT]

I also tested FSDP2 vs. DDP loss. The discrepancy is within ~1e-4, which is similar to the SimpleFSDP replicate vs. fully_shard comparison above (without MPT). We should be good after fixing SimpleFSDP's MPT bug here.

[screenshot: FSDP2 vs. DDP loss curves, 2025-06-02, 12:25 AM]
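To illustrate the magnitude involved (a standalone numeric example, not taken from this PR): performing the same averaging in bf16 vs. fp32 already produces differences on roughly the scale discussed above, which is why a dtype mismatch in the communication path shows up as a small but visible loss gap.

```python
# Standalone illustration: averaging the same values in bf16 vs. fp32 differs
# at roughly the 1e-3 scale, so a precision mismatch between the all-reduce
# (replicate) and reduce-scatter (fully_shard) paths is visible in the loss.
import torch

torch.manual_seed(0)
grads = torch.randn(8, 4096)                               # pretend per-rank gradients
avg_fp32 = grads.mean(dim=0)                               # reduce in fp32
avg_bf16 = grads.to(torch.bfloat16).mean(dim=0).float()    # reduce in bf16
print((avg_fp32 - avg_bf16).abs().max())                   # roughly 1e-3
```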

@tianyu-l (Contributor) left a comment


Nice job! Thank you for doing all the tests & verifications!
I agree we've isolated the issue to DDP+MPT. Let's follow up in a separate PR.

@tianyu-l merged commit 768cde1 into main on Jun 2, 2025
8 checks passed
@tianyu-l deleted the ruisi/ddp+tp branch on June 2, 2025
@vadimkantorov

Is SimpleFSDP also supported in torchtune? Hoping that both projects share more code and don't spend time twice reimplementing the same distributed features...

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jun 3, 2025
This is a follow-up on the previous dtensor redistribute PR: #150740, which enables SimpleFSDP's mixed-precision training.

In the most recent integration in TorchTitan (pytorch/torchtitan#1250), we found some discrepancies between SimpleFSDP's `fully_shard` and `replicate` modes when MPT is enabled. After debugging, I found the problem is in dtensor redistribute: `local_tensor` is taken out again from the original `input`, so the dtensor used for communication keeps its original precision instead of using `forward_dtype`.

This PR fixes this issue and corrects previously added test cases.

After fixing the bug, the loss curves of `fully_shard` and `replicate` mode match perfectly.

![loss](https://github.com/user-attachments/assets/a8faddae-a476-48c0-a411-3fe04d2233bd)

Pull Request resolved: #154975
Approved by: https://github.com/tianyu-l
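Conceptually, the fix amounts to something like the sketch below (a paraphrase with hypothetical helper names, not the actual torch.distributed.tensor diff): the local shard must already be cast to `forward_dtype` before it is handed to the collective, rather than being re-read at its original precision from `input`.

```python
# Hypothetical paraphrase of the fix, not the real PyTorch code: make sure the
# tensor the collective sees has already been cast to forward_dtype.
import torch

def local_tensor_for_comm(input_dtensor, forward_dtype):
    local = input_dtensor.to_local()                 # local shard of the DTensor
    if forward_dtype is not None and local.dtype != forward_dtype:
        local = local.to(forward_dtype)              # the buggy path skipped this cast
    return local                                     # what all-reduce / reduce-scatter operates on
```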
@ruisizhang123 (Contributor, Author) commented Jun 3, 2025

> Is SimpleFSDP also supported in torchtune? Hoping that both projects share more code and don't spend time twice reimplementing the same distributed features...

SimpleFSDP is not supported in torchtune yet. SimpleFSDP is more of a style of FSDP that users can apply on top of their model.

For the front-end wrapping, all users need to do is call into simple_fsdp.py for FSDP; the rest of the parallelism definitions are unchanged (a rough sketch follows below).
SimpleFSDP is still experimental; we can explore a better way of sharing simple_fsdp.py across repos, but it seems to me we won't be reinventing the wheel for such an integration. (Maybe @tianyu-l can confirm this.)
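For illustration, a rough sketch of the front-end wrapping mentioned above; the import path, function name, and signature are assumptions based on simple_fsdp.py rather than an exact copy of torchtitan's code.

```python
# Rough sketch of SimpleFSDP's front-end wrapping (names/signature assumed):
# only the data-parallel wrapping changes; TP/PP application stays the same.
from torchtitan.experiments.simple_fsdp.simple_fsdp import data_parallel  # assumed path

# `model` and `dp_mesh` come from the usual torchtitan setup.
model = data_parallel(model, dp_mesh, mode="replicate")      # DDP-style (this PR)
# model = data_parallel(model, dp_mesh, mode="fully_shard")  # FSDP-style
```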

The FSDP optimizations (bucketing & reordering) are done in the TorchInductor backend. I agree the optimal operator bucketing strategy may differ between pre-training and post-training, but since the bucketing & reordering happen in TorchInductor, they should be independent of torchtitan, torchtune, or any other repo.

iupaikov-amd pushed a commit to ROCm/pytorch that referenced this pull request Jun 4, 2025
angelayi pushed a commit to angelayi/pytorch that referenced this pull request Jun 5, 2025