Currently, due to compilation issues, we only enable sequence parallelism (and the dependent AsyncTP) for static compile sizes, and not by default. That's because sequence parallelism splits the residual tensor into smaller pieces, which breaks with piecewise compilation and dynamic shapes. #21031 addressed this but got stuck on an Inductor bug that caused extreme memory pressure; that Inductor bug has since been resolved in PyTorch 2.9.
The course of action:
- Pick up Enable sequence parallelism for full cuda graph without specifying compile sizes #21031 and verify that torch==2.9 resolves the memory issue (compare end-to-end activation memory with the pass disabled and enabled).
- Test Enable sequence parallelism for full cuda graph without specifying compile sizes #21031 together with the changes in [torch.compile] CUDAGraph Inductor partition integration #24281 and `-O.use_inductor_graph_partition=True`, to check that full compilation with Inductor graph partitioning works with sequence parallelism (see the sketch after this list).
- Check end-to-end performance of sequence parallelism alone as well as async TP on a dense unquantized and a dense quantized model (both Hopper and Blackwell). Make sure to use full cudagraphs where available.
- Merge the PR (guarded on torch 2.9) and enable sequence parallelism and async TP by default, so the performance lands on day 0 of the torch 2.9 release.
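For context, a minimal sketch of how these knobs could be exercised together from the offline Python API. The `pass_config` key names (`enable_sequence_parallelism`, `enable_async_tp`) and the model name are assumptions for illustration, not something prescribed by this issue:

```python
# Sketch only: enable sequence parallelism + async TP alongside Inductor
# graph partitioning. Config key names and the model are illustrative
# assumptions; adjust to the actual vLLM config at the time of testing.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder dense model
    tensor_parallel_size=2,
    compilation_config={
        # CLI equivalent: -O.use_inductor_graph_partition=True
        "use_inductor_graph_partition": True,
        "pass_config": {
            "enable_sequence_parallelism": True,
            "enable_async_tp": True,
        },
    },
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```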
Additionally, the padding requirement should be re-evaluated: we should benchmark the performance cost of padding `num_tokens`, and compare it to simply padding by `-num_tokens % tp_size` around the sequence-parallel section, or to doing uneven work across TP ranks by manipulating the sizes returned by `reduce_scatter` (likely too complicated).
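A minimal sketch of the `-num_tokens % tp_size` padding arithmetic mentioned above; the function name and example values are purely illustrative:

```python
# Illustrative only: shows how -num_tokens % tp_size rounds the token count
# up to the next multiple of tp_size, so the residual tensor can be evenly
# reduce-scattered across TP ranks.
def padded_num_tokens(num_tokens: int, tp_size: int) -> int:
    # -num_tokens % tp_size is the smallest pad that makes the total
    # divisible by tp_size (0 when it already is).
    return num_tokens + (-num_tokens % tp_size)

if __name__ == "__main__":
    tp_size = 4
    for num_tokens in (13, 16, 17):
        padded = padded_num_tokens(num_tokens, tp_size)
        print(f"num_tokens={num_tokens} -> padded={padded}, "
              f"{padded // tp_size} tokens per rank")
```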