add option to save profiling traces in inference roofline script
Summary:
convenient for analyzing differences between roofline estimates and observed performance
tl;dr of findings:
mxfp8
1. need to pre-swizzle weights (see the sketch after this list)
2. torch.compile gives us two kernels; will repurpose the manual
training kernel for this and add pre-swizzling to it. Longer term,
can see if the fbgemm_gpu one is faster.
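
a rough sketch of the pre-swizzling idea, assuming the swizzle in question is the blocked scale layout the scaled gemm expects and that a `to_blocked`-style helper is available (both are assumptions, not the exact torchao API); the point is to pay the layout transform once at weight conversion time instead of on every forward:

```python
import torch
from torchao.prototype.mx_formats.utils import to_blocked  # assumed helper/location

def convert_mx_weight_for_inference(qdata: torch.Tensor, scale: torch.Tensor):
    # swizzle the scales into the blocked layout the gemm expects, once at
    # conversion time, so the per-forward path skips this work
    return qdata, to_blocked(scale)
```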
mxfp4
1. need a faster gemm (can use fbgemm_gpu)
2. need a fused activation quant kernel (can use fbgemm_gpu; see the sketch below)
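
for context, a minimal pure-PyTorch sketch of the mx-style activation quant step (per-block amax, power-of-two scale, scaled values); this is not the fbgemm_gpu kernel and the names/defaults are made up. Written eagerly like this it launches several kernels per forward, which is why a fused kernel matters:

```python
import torch

MX_BLOCK = 32  # elements sharing one scale

def mx_activation_quant_reference(x: torch.Tensor, target_max: float = 6.0):
    # target_max = 6.0 is the largest fp4 (e2m1) magnitude
    xb = x.reshape(-1, MX_BLOCK)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # e8m0-style power-of-two scale
    scale = torch.exp2(torch.ceil(torch.log2(amax / target_max)))
    xq = xb / scale  # a real kernel would also cast/pack to fp4 here
    return xq.reshape(x.shape), scale
```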
nvfp4
1. need to speed up the existing triton activation quant kernel; currently
it doesn't autotune anything, so there are probably some easy wins here
(see the sketch after this list). Longer term, can also benchmark vs
fbgemm_gpu.
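
a sketch of the kind of autotuning that could be added, assuming the kernel is roughly a blockwise load/scale/store; the kernel and its arguments here are placeholders, not the actual torchao kernel:

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": bs}, num_warps=w)
        for bs in (512, 1024, 2048)
        for w in (4, 8)
    ],
    key=["n_elements"],
)
@triton.jit
def scale_and_cast_kernel(x_ptr, scale_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    scale = tl.load(scale_ptr)  # single global scale, for simplicity
    tl.store(out_ptr + offsets, x * scale, mask=mask)

def scale_and_cast(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    scale_and_cast_kernel[grid](x, scale, out, n_elements)
    return out
```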
Test Plan:
```bash
CUDA_VISIBLE_DEVICES=5 python benchmarks/float8/float8_inference_roofline.py ~/local/tmp/20251016_inference_nvfp4.csv --recipe_name nvfp4 --save_profile_traces True
```
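
the trace saving itself is standard torch.profiler usage; roughly (the wrapper name and output path are placeholders, not the script's API):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def maybe_save_profile_trace(fn, out_path: str, save_profile_traces: bool):
    if not save_profile_traces:
        return fn()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        result = fn()
    prof.export_chrome_trace(out_path)  # viewable in chrome://tracing or Perfetto
    return result
```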
Reviewers:
Subscribers:
Tasks:
Tags:
ghstack-source-id: c6e2f95
ghstack-comment-id: 3413384438
Pull-Request: #3196