Commit 3c84ce0
Refactor PP splitting (#1416)
This refactors the PP (pipeline parallel) splitting logic to consolidate
around setting the FQNs (fully qualified names) of the modules in each
model chunk. For example:
```
[
['tok_embeddings', 'layers.0'], # stage0
['layers.1', 'layers.2'], # stage1
['layers.3', 'layers.4'], # stage2
... # so on...
]
```
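As a rough illustration of how such a list could be produced, here is a minimal sketch that splits an ordered list of module FQNs into contiguous, nearly equal chunks. The helper name `split_module_names` and the even-split policy are assumptions made for this example; they are not taken from the torchtitan code in this commit.

```python
def split_module_names(module_names, num_chunks):
    """Split an ordered list of module FQNs into contiguous chunks.

    Hypothetical helper for illustration only. Earlier chunks receive
    one extra module when the list does not divide evenly.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    if num_chunks > len(module_names):
        raise ValueError("more chunks requested than modules available")

    base, extra = divmod(len(module_names), num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(module_names[start:start + size])
        start += size
    return chunks


# Same shape as the example above: an embedding plus five layers, three stages.
names = ["tok_embeddings"] + [f"layers.{i}" for i in range(5)]
chunks = split_module_names(names, 3)
for stage, fqns in enumerate(chunks):
    print(f"stage{stage}: {fqns}")
```

An even split is only one possible policy; a real setup may weight the first or last stage differently to balance embedding and output layers.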
This is better because it can generally be applied to all models, and
the code can be reused for cases that don't explicitly require
pipelined execution (for example, streaming DiLoCo needs to communicate
model chunks).
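To illustrate the reuse outside pipelined execution, the sketch below groups parameter names by the chunk whose module FQNs own them, which is the kind of bookkeeping needed when chunks are communicated separately. The helper names and the parameter names (such as `attention.wq.weight`) are hypothetical and chosen for the example; this is not the torchtitan or torchft implementation.

```python
def owned_by(param_name, module_fqn):
    """Return True if param_name is the module itself or lives under it.

    Matching on "fqn." avoids confusing 'layers.1' with 'layers.10'.
    """
    return param_name == module_fqn or param_name.startswith(module_fqn + ".")


def group_params_by_chunk(param_names, module_names_per_chunk):
    """Assign each parameter name to the first chunk that owns it.

    Hypothetical helper: parameters not covered by any chunk are skipped.
    """
    groups = [[] for _ in module_names_per_chunk]
    for name in param_names:
        for idx, fqns in enumerate(module_names_per_chunk):
            if any(owned_by(name, fqn) for fqn in fqns):
                groups[idx].append(name)
                break
    return groups


# Illustrative parameter names; real models will differ.
param_names = [
    "tok_embeddings.weight",
    "layers.0.attention.wq.weight",
    "layers.1.attention.wq.weight",
    "layers.10.attention.wq.weight",
]
per_chunk = [["tok_embeddings", "layers.0"], ["layers.1"], ["layers.10"]]
groups = group_params_by_chunk(param_names, per_chunk)
```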
Changes:
- Refactor DeepSeek-V3 and Llama to share the same pipeline utility
functions
- Add the `module_names_per_model_chunk` config and deprecate
`pipeline_parallel_split_points`
TODO (follow up PRs):
- `pipeline_module_split` will be upstreamed to PyTorch as a
`torch.distributed.pipelining` utility since it contains no model
specific code.
- Additional changes are needed to get this to work for torchft
streaming DiLoCo, including updating the training loop to not execute if
the pipeline schedule isn't set and making sure the `pipelining_fn`
returns the correct model chunks.
cc @tushar00jain

1 parent f1c8c2c, commit 3c84ce0
File tree
6 files changed, +400 -550 lines changed
- tests/unit_tests
- torchtitan
  - config
  - distributed
  - models
    - deepseek_v3
      - infra
    - llama3/infra