
[Bug]: Sequence Parallelism and Async TP disabled by default #25277

@ProExpertProg

Description

Currently, due to compilation issues, we only enable sequence parallelism (and the dependent AsyncTP) for static compile sizes (and not by default). That's because sequence parallelism splits the residual tensor into smaller pieces, which breaks with piecewise compilation and dynamic shapes. #21031 addressed this but got stuck on an Inductor bug that caused extreme memory pressure. That Inductor bug has since been resolved with PyTorch 2.9.
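
For context, here is a minimal sketch (illustrative only, not vLLM's actual implementation) of why the residual shape changes under sequence parallelism: the all-reduce after a TP layer becomes a reduce-scatter, so each rank holds only num_tokens // tp_size rows of the residual, and with dynamic num_tokens that sliced shape is what piecewise compilation has to handle.

```python
# Illustrative sketch: sequence parallelism shards the residual along the
# token dimension via reduce-scatter instead of keeping the full tensor
# replicated after an all-reduce.
import torch
import torch.distributed as dist


def sp_residual_shard(hidden: torch.Tensor, tp_group) -> torch.Tensor:
    tp_size = dist.get_world_size(tp_group)
    num_tokens, hidden_dim = hidden.shape
    # This is the padding requirement discussed further below.
    assert num_tokens % tp_size == 0, "num_tokens must be divisible by tp_size"
    out = torch.empty(num_tokens // tp_size, hidden_dim,
                      dtype=hidden.dtype, device=hidden.device)
    # Each rank ends up with a 1/tp_size slice of the token dimension.
    dist.reduce_scatter_tensor(out, hidden, group=tp_group)
    return out
```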

The course of action:

  1. Pick up #21031 (Enable sequence parallelism for full cuda graph without specifying compile sizes) and verify that torch==2.9 resolves the memory issue (compare end-to-end activation memory with the pass disabled and enabled).
  2. Test #21031 (Enable sequence parallelism for full cuda graph without specifying compile sizes) together with the changes from #24281 ([torch.compile] CUDAGraph Inductor partition integration) and -O.use_inductor_graph_partition=True, to verify that full compilation with Inductor graph partitioning works with sequence parallelism (see the config sketch after this list).
  3. Check end-to-end performance of sequence parallelism alone, as well as with async TP, on a dense unquantized model and a dense quantized model (on both Hopper and Blackwell). Make sure to use full cudagraphs where available.
  4. Merge the PR, gated on torch 2.9, and enable sequence parallelism and async TP by default so the performance gains land on day 0 of the torch 2.9 release.
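
For steps 2 and 3, a minimal Python sketch of how the relevant knobs might be combined for an end-to-end comparison is below. It assumes the LLM entrypoint accepts a compilation_config dict and that the pass-config field names match the flags referenced above; the model name and TP size are placeholders, so check the current CompilationConfig before relying on any of these names.

```python
# Sketch only: enable the sequence-parallelism / async-TP passes together with
# Inductor graph partitioning for an end-to-end comparison. Field names are
# assumptions based on the flags referenced in this issue and may differ.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",    # placeholder dense model
    tensor_parallel_size=2,
    compilation_config={
        "use_inductor_graph_partition": True,     # from #24281 (-O.use_inductor_graph_partition=True)
        "pass_config": {
            "enable_sequence_parallelism": True,  # assumed PassConfig field name
            "enable_async_tp": True,              # assumed PassConfig field name
        },
    },
)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
# Repeat with the passes disabled and compare peak activation memory and latency.
```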

Additionally, the padding requirement should be re-evaluated: we should benchmark the performance cost of padding num_tokens globally and compare it to manually padding with -num_tokens % tp_size only around the sequence-parallel section, or to doing uneven work across TP ranks by manipulating the sizes returned by reduce_scatter (likely too complicated).
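
As a concrete illustration of the manual-padding option, the -num_tokens % tp_size arithmetic works out as in this small sketch (the function name is made up for the example):

```python
# Example of the -num_tokens % tp_size padding arithmetic: pad the token
# dimension up to the next multiple of tp_size before the sequence-parallel
# region, then slice the padding back off afterwards.
def pad_for_sequence_parallel(num_tokens: int, tp_size: int) -> int:
    pad = -num_tokens % tp_size          # tokens to add, 0 if already aligned
    return num_tokens + pad

assert pad_for_sequence_parallel(33, 4) == 36   # pad = -33 % 4 = 3
assert pad_for_sequence_parallel(32, 4) == 32   # already a multiple, pad = 0
```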

Status: In progress