
Conversation

@yf225 yf225 commented Nov 6, 2025

This should make timing on Nvidia hardware more accurate.

@yf225 yf225 requested review from jansel and oulgen November 6, 2025 00:26
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 6, 2025
@yf225 yf225 force-pushed the autotuner_cudagraph branch from 2f48f68 to ec44631 Compare November 6, 2025 00:42
@njriasan njriasan left a comment


LGTM! Thanks!

@yf225 yf225 marked this pull request as draft November 6, 2025 00:58
@yf225 yf225 force-pushed the autotuner_cudagraph branch 3 times, most recently from 13e5356 to ac88d91 Compare November 6, 2025 03:56
Chillee commented Nov 6, 2025

doesn't this make autotuning time much slower?

@yf225 yf225 force-pushed the autotuner_cudagraph branch 3 times, most recently from e612b6f to 9b3e31d Compare November 6, 2025 05:59

yf225 commented Nov 6, 2025

> doesn't this make autotuning time much slower?

yeah I think with the original n_retries = 10 in the PR, the autotuning time was not good. I've just updated the PR to effectively set n_retries = 1 (since we already do cuda_graph_total_time / n_repeat to get the average per-iteration runtime), and that brings the autotuning time back on par with the original.
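To make the arithmetic concrete, here is a minimal sketch of the averaging logic described above. The function name and all numbers are illustrative only, not taken from the actual PR:

```python
# Sketch of the per-iteration timing math discussed above.
# per_iteration_time_ms and the numbers below are hypothetical.

def per_iteration_time_ms(cuda_graph_total_time_ms: float, n_repeat: int) -> float:
    """One graph replay runs n_repeat kernel iterations back-to-back,
    so the average per-iteration runtime is total / n_repeat."""
    return cuda_graph_total_time_ms / n_repeat

# One timed replay (n_retries = 1) already averages over n_repeat launches:
assert per_iteration_time_ms(12.0, 10) == 1.2

# Repeating the whole measurement n_retries = 10 times multiplies the
# autotuning cost ~10x without changing the averaged estimate much:
def total_kernel_launches(n_retries: int, n_repeat: int) -> int:
    return n_retries * n_repeat

assert total_kernel_launches(10, 10) == 100  # original: ~10x more work
assert total_kernel_launches(1, 10) == 10    # updated: on par with before
```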

@yf225 yf225 marked this pull request as ready for review November 6, 2025 06:13
@yf225 yf225 force-pushed the autotuner_cudagraph branch 3 times, most recently from 873f9d6 to 5001a53 Compare November 6, 2025 06:21
@jansel jansel left a comment


How are you testing that this is giving more stable measurements?

Can you share some data?

```python
else:
    res = do_bench(
        **kwargs,
        return_mode="median",
```

Does the above also use median? Median tends to remove outliers.
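The point about medians can be seen with a quick standalone example (the timing values below are made up, purely to illustrate the outlier behavior):

```python
import statistics

# Hypothetical per-iteration timings (ms); one sample hit an outlier
# (e.g. clock ramp-up or an unrelated kernel running on the device).
timings = [1.01, 0.99, 1.00, 1.02, 5.00]

mean = statistics.mean(timings)      # pulled up by the outlier
median = statistics.median(timings)  # unaffected by the single outlier

assert median == 1.01
assert mean > 1.5  # the one bad sample dominates the mean
```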

```python
if not self._in_ref_eager_mode:
    return

import pytest
```

Why?


jansel commented Nov 7, 2025

Do the numbers from this match those from the interleaved_bench we are using for more precise measurements? If not it could cause some issues with the rebench logic.

This benchmarking function isn't the main one we use, we use it to get a rough initial time then do interleaved benchmarking on the best configs.

@yf225 yf225 force-pushed the autotuner_cudagraph branch from 9449ed8 to 51ca2fe Compare November 7, 2025 05:36

yf225 commented Nov 7, 2025

> This benchmarking function isn't the main one we use, we use it to get a rough initial time then do interleaved benchmarking on the best configs.

ah I see, then I think the current interleaved_bench approach is still more robust. The most precise approach would be interleaved + cudagraph, but cudagraph requires running several iterations within the same graph to amortize the overhead (N kernels * K iters/kernel, with each timing region, i.e. one graph, containing K iterations), whereas interleaved_bench currently does per-iteration CUDA event timing (N kernels * K iters/kernel, each timing region containing only 1 iteration).

We could split the K further into K1 * K2, and cudagraph over K2 iterations as one timing region. But I am slightly unsure whether this is worth the added complexity.
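For intuition, a plain-Python sketch of the proposed K = K1 * K2 split (all numbers are hypothetical, chosen only to show how the count of timing regions changes):

```python
# Illustrative counts for the K = K1 * K2 split described above.
N, K = 4, 32    # hypothetical: 4 candidate kernels, K iterations each
K1, K2 = 8, 4   # K1 timed graph replays per kernel, K2 iters per graph
assert K1 * K2 == K

# Current interleaved_bench: one CUDA-event timing region per iteration.
regions_current = N * K

# Interleaved + cudagraph: one timing region per graph replay, with the
# per-launch overhead amortized over the K2 iterations inside each graph.
regions_with_graphs = N * K1

assert regions_current == 128
assert regions_with_graphs == 32  # 4x fewer timing regions here
```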
