
Conversation

@fmassa (Member) commented Aug 8, 2025:

This PR makes the bucket sizes for all-gather and reduce-scatter the same for 1D FSDP.

@fmassa fmassa requested a review from wconstab August 8, 2025 09:15
@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Aug 8, 2025
```python
        (global_batch_size, job_config.training.seq_len),
        device=torch.device("cuda"),
    ),
    return (
```
@fmassa (Member, Author) commented:
Sorry for the unrelated lint changes, my editor decided to annoy me here

```python
assert parallel_dims.pp_enabled is False, "PP not supported yet"

torch._inductor.config.bucket_all_gathers_fx_bucket_size_determinator = (
    lambda bucket_idx: 500 / parallel_dims.tp
)
```
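The determinator in the snippet above returns a bucket size scaled inversely by the tensor-parallel degree, so the volume communicated per bucket stays constant as TP grows. A minimal, torch-free sketch of that logic (the base size of 500 and its unit are taken from the diff; the helper name `make_determinator` is illustrative, not part of torchtitan or Inductor):

```python
# Sketch of the bucket-size determinator used in the diff above.
# Assumption: 500 is the target bucket size (in MB) before dividing
# by the tensor-parallel degree `tp_degree`.
BASE_BUCKET_SIZE_MB = 500


def make_determinator(tp_degree: int):
    """Return a function mapping bucket index -> bucket size (MB).

    Registering the same determinator for both all-gather and
    reduce-scatter is what keeps their bucket sizes equal for 1D FSDP.
    """
    return lambda bucket_idx: BASE_BUCKET_SIZE_MB / tp_degree


# With tp=1 every bucket is 500 MB; with tp=2 each is 250 MB,
# regardless of the bucket index.
determinator = make_determinator(2)
print(determinator(0))  # 250.0
```

In the actual PR this callable is assigned to `torch._inductor.config.bucket_all_gathers_fx_bucket_size_determinator`; the equivalent reduce-scatter hook would be configured the same way so both collectives bucket identically.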
A Contributor commented:
We may want to consolidate at some point: the other Inductor configs that control which passes run, and in which modes, live in titan today, are CLI-driven, and could even change which bucketing pass is used.

The same Contributor followed up:
Oops, just realized this is torchtitan...

@fmassa fmassa merged commit 4712163 into autoparallel Aug 8, 2025
2 checks passed
@fmassa fmassa deleted the fmassa/fix_bucket_size branch August 8, 2025 13:29
