Commit b8e48cc
add option to save profiling traces in inference roofline script
Summary:
Saved traces make it convenient to analyze differences between the roofline estimate and observed performance.
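As a rough illustration of the comparison being made, the sketch below computes a simple roofline time for a GEMM and the ratio of an observed time to it. This is not the script's actual model: `Device`, `gemm_roofline_seconds`, `gap_vs_roofline`, and the device numbers are hypothetical, and the byte counts assume each operand is read once and the output written once.

```python
from dataclasses import dataclass


@dataclass
class Device:
    peak_flops: float  # peak compute throughput in FLOP/s (placeholder value)
    peak_bw: float     # peak memory bandwidth in bytes/s (placeholder value)


def gemm_roofline_seconds(M, K, N, bytes_in, bytes_out, dev):
    """Lower-bound time for an (M, K) @ (K, N) GEMM under a basic roofline."""
    flops = 2 * M * K * N  # one multiply and one add per inner-product term
    bytes_moved = (M * K + K * N) * bytes_in + M * N * bytes_out
    compute_s = flops / dev.peak_flops
    memory_s = bytes_moved / dev.peak_bw
    # The slower of the two limits bounds the kernel time from below.
    return max(compute_s, memory_s)


def gap_vs_roofline(observed_s, roofline_s):
    """How many times slower the observed time is than the roofline bound."""
    return observed_s / roofline_s


# Hypothetical device: 1 TFLOP/s and 100 GB/s, chosen only for easy arithmetic.
dev = Device(peak_flops=1e12, peak_bw=1e11)
compute_bound = gemm_roofline_seconds(1024, 1024, 1024, 1, 2, dev)
memory_bound = gemm_roofline_seconds(1, 4096, 4096, 1, 2, dev)
```

A large gap for a given shape is the cue to open the corresponding trace and see which kernels account for the extra time.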
tl;dr of findings:
mxfp8
1. need to pre-swizzle weights
2. torch.compile gives us two kernels; will repurpose the manual
training kernel for this and add pre-swizzling to it. Longer term,
can check whether the fbgemm_gpu one is faster.
mxfp4
1. need a faster gemm (can use fbgemm_gpu)
2. need a fused activation quant kernel (can use fbgemm_gpu)
nvfp4
1. need to speed up the existing Triton activation quant kernel; it
currently doesn't autotune anything, so there are probably some easy
wins here. Longer term, can also benchmark against fbgemm_gpu.
Test Plan:
```bash
CUDA_VISIBLE_DEVICES=5 python benchmarks/float8/float8_inference_roofline.py ~/local/tmp/20251016_inference_nvfp4.csv --recipe_name nvfp4 --save_profile_traces True
```
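For readers unfamiliar with how such traces are produced, here is a minimal sketch of capturing a `torch.profiler` trace around a callable and exporting it in Chrome trace format. It is an assumption about the general approach, not the code added in this commit: `trace_path`, `profile_to_file`, and the file naming scheme are hypothetical.

```python
import os
from typing import Callable


def trace_path(csv_path: str, recipe_name: str, label: str) -> str:
    """Derive a trace filename next to the output CSV (naming is hypothetical)."""
    base, _ = os.path.splitext(os.path.expanduser(csv_path))
    return f"{base}_{recipe_name}_{label}.json"


def profile_to_file(fn: Callable[[], object], path: str, n_iter: int = 3) -> None:
    """Run fn under torch.profiler and write a Chrome trace to path."""
    # Imported lazily so the path helper above works without PyTorch installed.
    import torch
    from torch.profiler import ProfilerActivity, profile

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        for _ in range(n_iter):
            fn()
        if torch.cuda.is_available():
            # Wait for queued GPU work so its kernels land inside the trace.
            torch.cuda.synchronize()

    prof.export_chrome_trace(path)


example = trace_path("/tmp/out.csv", "nvfp4", "M1_K2_N3")
```

The resulting `.json` file can be opened in a Chrome-trace-compatible viewer such as Perfetto to inspect individual kernels.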
Reviewers:
Subscribers:
Tasks:
Tags:
ghstack-source-id: c6e2f95
ghstack-comment-id: 3413384438
Pull-Request: #31961
Parent: d1a7fbc
1 file changed: +19 -3
Changed hunks (new-file line numbers): 63, 75-80, 170, 178, 300-304, 340-344