
Conversation

@danielvegamyhre (Contributor) commented Nov 3, 2025

Stacked PRs:


[mxfp8 moe training] compute prefix sum of group sizes inside kernel instead of precomputing

Context

Currently, when converting scales to the blocked swizzled format, we precompute the new start index of each padded group. However, this creates a dependency on torch.compile, which we rely on to codegen fast Triton kernels for these prefix sums; otherwise we inject slow eager-mode ops into the hot path of the quantized grouped GEMMs, resulting in a net slowdown in eager mode.

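For illustration only, a minimal sketch of what that eager-mode precomputation looks like; the function name, block size, and rounding scheme here are assumptions, not the actual torchao helpers:

```python
import torch

# Hypothetical sketch of the precomputed prefix sum this PR removes (names and
# block size are assumed, not the real torchao code). Each group's row count is
# rounded up to a block multiple, and an exclusive cumsum gives the start index
# of each padded group. Without torch.compile fusing them, these tiny ops run
# as separate eager kernels in the hot path of the grouped GEMM.
def padded_group_start_offsets(group_sizes: torch.Tensor, block_rows: int = 128) -> torch.Tensor:
    padded = ((group_sizes + block_rows - 1) // block_rows) * block_rows
    return torch.cumsum(padded, dim=0) - padded  # exclusive prefix sum
```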
Given that some users don't want to use torch.compile, and that bugs sometimes block its use, we should support eager mode execution with good performance.

Changes

  • Do what we do in the mxfp8 scaled grouped GEMM kernels in fbgemm and compute these group offsets inside the kernels themselves (see the sketch below). The work is duplicated, but it is such a tiny amount that the perf impact is negligible: an O(num_groups) prefix sum, where num_groups is the local number of experts after EP is applied, so <= 8.

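A minimal Triton sketch of the in-kernel approach described above; the kernel and argument names are hypothetical, and the padding block size is assumed rather than taken from the actual torchao/fbgemm kernels:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def group_start_offsets_kernel(group_sizes_ptr, out_ptr,
                               NUM_GROUPS: tl.constexpr, BLOCK_ROWS: tl.constexpr):
    # Hypothetical sketch: each program re-derives its group's padded start
    # offset with a tiny O(NUM_GROUPS) prefix sum instead of reading a
    # precomputed offsets tensor. With <= 8 local experts after EP, the
    # duplicated work per program is negligible.
    group_id = tl.program_id(0)
    offset = 0
    for g in range(NUM_GROUPS):
        size = tl.load(group_sizes_ptr + g)
        padded = tl.cdiv(size, BLOCK_ROWS) * BLOCK_ROWS  # round group up to a block multiple
        offset += tl.where(g < group_id, padded, 0)      # only count preceding groups
    tl.store(out_ptr + group_id, offset)

# Usage: one program per group. In the real kernels this logic would run at the
# start of the quantization/grouped-GEMM kernel rather than writing offsets out.
group_sizes = torch.tensor([100, 37, 250, 5], device="cuda", dtype=torch.int32)
out = torch.empty_like(group_sizes)
group_start_offsets_kernel[(group_sizes.numel(),)](
    group_sizes, out, NUM_GROUPS=group_sizes.numel(), BLOCK_ROWS=128
)
```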
Testing

  • pytest test/prototype/moe_training/test_kernels.py
  • pytest test/prototype/moe_training/test_scaled_grouped_mm.py -k mxfp8 -s

pytorch-bot bot commented Nov 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3285

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 1cfb0b3 with merge base f856d36:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre added a commit that referenced this pull request Nov 3, 2025
…instead of precomputing

stack-info: PR: #3285, branch: danielvegamyhre/stack/82
@danielvegamyhre force-pushed the danielvegamyhre/stack/82 branch from 421191e to 7cd1a74 on November 3, 2025 23:15
@meta-cla bot added the CLA Signed label Nov 3, 2025
@danielvegamyhre added the mx, moe, and topic: improvement labels Nov 3, 2025
@danielvegamyhre force-pushed the danielvegamyhre/stack/82 branch from 7cd1a74 to 1d8aad3 on November 3, 2025 23:37
@danielvegamyhre force-pushed the danielvegamyhre/stack/82 branch from 1d8aad3 to 1e97f00 on November 3, 2025 23:59
@danielvegamyhre force-pushed the danielvegamyhre/stack/82 branch from 1e97f00 to 1cfb0b3 on November 4, 2025 00:01
@danielvegamyhre merged commit 01374eb into main Nov 4, 2025
18 checks passed

Labels

CLA Signed, moe, mx, topic: improvement
