[mxfp8 moe training] compute prefix sum of group sizes inside kernel instead of precomputing #3285

danielvegamyhre · 2025-11-03T23:15:12Z

Stacked PRs:

[mxfp8 moe training] update benchmarks and tests; simplify per group blocked swizzle ref function #3286
->[mxfp8 moe training] compute prefix sum of group sizes inside kernel instead of precomputing #3285


Context

Currently, when converting scales to the blocked swizzled format, we precompute the new start index of each padded group. However, this creates a dependency on torch.compile: we rely on it to codegen fast Triton kernels for these prefix sums. Without it, we inject slow eager-mode ops into the hot path of the quantized grouped gemms, resulting in a net slowdown in eager.
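For readers unfamiliar with the precomputation described above, here is a minimal plain-Python sketch of what "the new start index of each padded group" means. It is not the torchao implementation: the function name `padded_group_starts`, the input convention (cumulative group end offsets), and the 128-row padding granularity are all assumptions made for illustration.

```python
# Illustrative sketch only; names and the 128-row tile size are assumptions,
# not taken from the torchao source.
BLOCK_ROWS = 128  # assumed row granularity each group is padded up to


def padded_group_starts(group_end_offsets, block_rows=BLOCK_ROWS):
    """Return (start row of each group after padding, total padded rows).

    group_end_offsets holds the cumulative end row of each group in the
    unpadded tensor, e.g. [100, 300] means group 0 is rows 0..99 and
    group 1 is rows 100..299.
    """
    starts = []
    prev_end = 0
    running = 0  # exclusive prefix sum of the padded group sizes
    for end in group_end_offsets:
        starts.append(running)
        size = end - prev_end
        # Round the group size up to a multiple of block_rows.
        running += -(-size // block_rows) * block_rows
        prev_end = end
    return starts, running


starts, total_rows = padded_group_starts([100, 300, 428])
print(starts, total_rows)
```

In eager mode, each step of such a computation (difference, round-up, cumulative sum) would typically be a separate small op, which is the overhead the paragraph above refers to.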

Some users don't want to use torch.compile, and bugs sometimes block its use, so we should support eager-mode execution with good perf.

Changes

Do what the mxfp8 scaled grouped gemm kernels in fbgemm do, and compute these group offsets inside the kernels themselves. The work is duplicated, but it is so small that the perf impact is negligible: an O(group_size) prefix sum, where group_size is the local number of experts after EP is applied (so <= 8).
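As a rough illustration of the idea, the sketch below emulates in plain Python what a single kernel program might do: instead of reading a precomputed start index, it walks the group end offsets that precede its own group and accumulates the padded sizes. This is not the Triton kernel from this PR; `in_kernel_group_start`, the cumulative-offset input, and the 128-row padding are hypothetical stand-ins.

```python
# Illustrative emulation only; the real change lives in Triton kernels.
# The function name and the 128-row padding granularity are assumptions.
def in_kernel_group_start(group_id, group_end_offsets, block_rows=128):
    """Recompute the padded start row for one group, as one program would.

    The loop runs group_id times, so the cost is bounded by the number of
    local groups (experts), which the PR description expects to be small.
    """
    padded_start = 0
    prev_end = 0
    for g in range(group_id):
        end = group_end_offsets[g]
        # Add the size of group g rounded up to a multiple of block_rows.
        padded_start += -(-(end - prev_end) // block_rows) * block_rows
        prev_end = end
    return padded_start


# Each "program" handles one group and derives its own offset independently.
offsets = [100, 300, 428, 428]  # the last group is empty
per_program_starts = [
    in_kernel_group_start(g, offsets) for g in range(len(offsets))
]
print(per_program_starts)
```

Every program repeats part of the same loop, which is the duplicated work mentioned above; in exchange, no separate prefix-sum op has to run before the kernel launches.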

Testing

pytest test/prototype/moe_training/test_kernels.py
pytest test/prototype/moe_training/test_scaled_grouped_mm.py -k mxfp8 -s

pytorch-bot · 2025-11-03T23:15:16Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3285

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

ROCm failures during provisioning step due to network issues

✅ No Failures

As of commit 1cfb0b3 with merge base f856d36:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


danielvegamyhre added a commit that referenced this pull request Nov 3, 2025

[mxfp8 moe training] compute prefix sum of group sizes inside kernel …

7cd1a74

…intead of precomputing stack-info: PR: #3285, branch: danielvegamyhre/stack/82

danielvegamyhre force-pushed the danielvegamyhre/stack/82 branch from 421191e to 7cd1a74 Compare November 3, 2025 23:15

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 3, 2025

danielvegamyhre added mx topic: improvement Use this tag if this PR is an improvement (doesn't fit into any of the other categories) moe labels Nov 3, 2025

danielvegamyhre requested review from drisspg and vkuzo November 3, 2025 23:21

danielvegamyhre added a commit that referenced this pull request Nov 3, 2025

[mxfp8 moe training] compute prefix sum of group sizes inside kernel …

1d8aad3

…intead of precomputing stack-info: PR: #3285, branch: danielvegamyhre/stack/82

danielvegamyhre force-pushed the danielvegamyhre/stack/82 branch from 7cd1a74 to 1d8aad3 Compare November 3, 2025 23:37

drisspg approved these changes Nov 3, 2025

View reviewed changes

danielvegamyhre added a commit that referenced this pull request Nov 3, 2025

[mxfp8 moe training] compute prefix sum of group sizes inside kernel …

1e97f00

…intead of precomputing stack-info: PR: #3285, branch: danielvegamyhre/stack/82

danielvegamyhre force-pushed the danielvegamyhre/stack/82 branch from 1d8aad3 to 1e97f00 Compare November 3, 2025 23:59

[mxfp8 moe training] compute prefix sum of group sizes inside kernel …

1cfb0b3

…intead of precomputing stack-info: PR: #3285, branch: danielvegamyhre/stack/82

danielvegamyhre force-pushed the danielvegamyhre/stack/82 branch from 1e97f00 to 1cfb0b3 Compare November 4, 2025 00:01

danielvegamyhre mentioned this pull request Nov 4, 2025

[mxfp8 moe training] update benchmarks and tests; simplify per group blocked swizzle ref function #3286

Open

danielvegamyhre merged commit 01374eb into main Nov 4, 2025
18 checks passed
