Commit 768cde1
[SimpleFSDP] Add support for ddp+tp (#1250)
As titled, this PR adds support for DDP+TP under SimpleFSDP's
`replicate` mode.
1. Profile trace for DDP: as the trace shows, the DDP backward-pass
communication is `all-reduce`.
<img width="1109" alt="Screenshot 2025-06-01 at 1 10 07 PM"
src="https://github.com/user-attachments/assets/91ca56f4-c116-433d-98bf-96869a72de0c"
/>
2. Numerical convergence: with mixed-precision training, the loss curves
for [ddp:2, tp:2] and [fsdp:2, tp:2] differ by about 1e-3.
<img width="1568" alt="Screenshot 2025-06-01 at 11 39 49 PM"
src="https://github.com/user-attachments/assets/ef429276-da2b-41cd-bed3-fa880cd1efa6"
/>
Without mixed-precision training, the loss curves for [ddp:2, tp:2] and
[fsdp:2, tp:2] are the same.
<img width="1541" alt="Screenshot 2025-06-02 at 11 59 09 AM"
src="https://github.com/user-attachments/assets/2a18ef51-3ebf-4d5f-a27f-70fd15ee59d6"
/>
1 parent 7a6ab08 commit 768cde1
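To illustrate the `all-reduce` noted in item 1 of the description, here is a minimal pure-Python sketch of how the backward collective differs between replicated (DDP-style) and sharded (FSDP-style) parameters. It does not call `torch.distributed` or any SimpleFSDP code; the function names, the gradient values, and the assumption that gradients are averaged across ranks are all hypothetical and chosen only for illustration.

```python
# Simulated collectives over plain lists; one inner list per data-parallel rank.
# Nothing here is taken from the SimpleFSDP implementation.

def all_reduce_mean(per_rank_grads):
    """DDP-style: every rank ends up with the full averaged gradient."""
    world_size = len(per_rank_grads)
    mean = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    return [list(mean) for _ in range(world_size)]


def reduce_scatter_mean(per_rank_grads):
    """FSDP-style: each rank keeps only its own shard of the averaged gradient."""
    world_size = len(per_rank_grads)
    mean = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    shard = len(mean) // world_size  # assumes the length divides evenly
    return [mean[r * shard:(r + 1) * shard] for r in range(world_size)]


# Made-up local gradients on two data-parallel ranks.
grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]

ddp_out = all_reduce_mean(grads)       # both ranks hold all four averaged values
fsdp_out = reduce_scatter_mean(grads)  # each rank holds two of the four values

print(ddp_out)
print(fsdp_out)
```

In the replicated case every rank needs the whole gradient to update its full parameter copy, which is why the trace is expected to show `all-reduce` rather than a reduce-scatter.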
File tree
2 files changed, +36 -27 lines changed
- torchtitan/experiments/simple_fsdp
- tests
Lines changed: 24 additions & 15 deletions
| Original file lines | Diff lines | Diff line change |
|---|---|---|
| 80-89 | 80-99 | 10 lines removed, 20 lines added |
| 157-160 | 167-168 | 4 lines removed, 2 lines added |
| 173 | 181-182 | 1 line removed, 2 lines added |
Lines changed: 12 additions & 12 deletions
| Original file lines | Diff lines | Diff line change |
|---|---|---|
| 134 | 134 | 1 line removed, 1 line added |
| 147-157 | 147-157 | 11 lines removed, 11 lines added |