
Conversation

@yf225 yf225 commented Nov 6, 2025

This should make timing on Nvidia hardware more accurate.

@yf225 yf225 requested review from jansel and oulgen November 6, 2025 00:26
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 6, 2025
@yf225 yf225 force-pushed the autotuner_cudagraph branch from 2f48f68 to ec44631 Compare November 6, 2025 00:42
@njriasan njriasan left a comment


LGTM! Thanks!

@yf225 yf225 marked this pull request as draft November 6, 2025 00:58
@yf225 yf225 force-pushed the autotuner_cudagraph branch 3 times, most recently from 13e5356 to ac88d91 Compare November 6, 2025 03:56
Chillee commented Nov 6, 2025

doesn't this make autotuning time much slower?

@yf225 yf225 force-pushed the autotuner_cudagraph branch 3 times, most recently from e612b6f to 9b3e31d Compare November 6, 2025 05:59

yf225 commented Nov 6, 2025

> doesn't this make autotuning time much slower?

yeah I think with the original n_retries = 10 in the PR, the autotuning time was not good. I've just updated the PR to effectively set n_retries = 1 (since we already do cuda_graph_total_time / n_repeat to get the average per-iteration runtime), and that brings the autotuning time back on par with the original.
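To make the arithmetic concrete, here is a minimal sketch of the averaging logic described above. The function name and all numbers are illustrative only, not taken from the actual PR:

```python
# Sketch of the per-iteration timing math discussed above.
# per_iteration_time_ms and the numbers below are hypothetical.

def per_iteration_time_ms(cuda_graph_total_time_ms: float, n_repeat: int) -> float:
    """One graph replay runs n_repeat kernel iterations back-to-back,
    so the average per-iteration runtime is total / n_repeat."""
    return cuda_graph_total_time_ms / n_repeat

# One timed replay (n_retries = 1) already averages over n_repeat launches:
assert per_iteration_time_ms(12.0, 10) == 1.2

# Repeating the whole measurement n_retries = 10 times multiplies the
# autotuning cost ~10x without changing the averaged estimate much:
def total_kernel_launches(n_retries: int, n_repeat: int) -> int:
    return n_retries * n_repeat

assert total_kernel_launches(10, 10) == 100  # original: ~10x more work
assert total_kernel_launches(1, 10) == 10    # updated: on par with before
```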

@yf225 yf225 marked this pull request as ready for review November 6, 2025 06:13
@yf225 yf225 force-pushed the autotuner_cudagraph branch 3 times, most recently from 873f9d6 to 5001a53 Compare November 6, 2025 06:21
@jansel jansel left a comment


How are you testing that this is giving more stable measurements?

Can you share some data?

```python
else:
    res = do_bench(
        **kwargs,
        return_mode="median",
```

Does the above also use median? Median tends to remove outliers.
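The point about medians can be seen with a quick standalone example (the timing values below are made up, purely to illustrate the outlier behavior):

```python
import statistics

# Hypothetical per-iteration timings (ms); one sample hit an outlier
# (e.g. clock ramp-up or an unrelated kernel running on the device).
timings = [1.01, 0.99, 1.00, 1.02, 5.00]

mean = statistics.mean(timings)      # pulled up by the outlier
median = statistics.median(timings)  # unaffected by the single outlier

assert median == 1.01
assert mean > 1.5  # the one bad sample dominates the mean
```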

```python
if not self._in_ref_eager_mode:
    return

import pytest
```

Why?


jansel commented Nov 7, 2025

Do the numbers from this match those from the interleaved_bench we are using for more precise measurements? If not it could cause some issues with the rebench logic.

This benchmarking function isn't the main one we use, we use it to get a rough initial time then do interleaved benchmarking on the best configs.

@yf225 yf225 force-pushed the autotuner_cudagraph branch from 9449ed8 to 51ca2fe Compare November 7, 2025 05:36

yf225 commented Nov 7, 2025

> This benchmarking function isn't the main one we use, we use it to get a rough initial time then do interleaved benchmarking on the best configs.

ah I see, then I think the current interleaved_bench approach is still more robust. The most precise approach would be interleaved + cudagraph, but cudagraph requires running several iterations within the same graph to amortize the overhead (N kernels * K iters/kernel, with each timing region, i.e. one graph, containing K iterations), whereas interleaved_bench currently does per-iteration CUDA event timing (N kernels * K iters/kernel, each timing region containing only 1 iteration).

We could split the K further into K1 * K2, and cudagraph over K2 iterations as one timing region. But I am slightly unsure whether this is worth the added complexity.
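For intuition, a plain-Python sketch of the proposed K = K1 * K2 split (all numbers are hypothetical, chosen only to show how the count of timing regions changes):

```python
# Illustrative counts for the K = K1 * K2 split described above.
N, K = 4, 32    # hypothetical: 4 candidate kernels, K iterations each
K1, K2 = 8, 4   # K1 timed graph replays per kernel, K2 iters per graph
assert K1 * K2 == K

# Current interleaved_bench: one CUDA-event timing region per iteration.
regions_current = N * K

# Interleaved + cudagraph: one timing region per graph replay, with the
# per-launch overhead amortized over the K2 iterations inside each graph.
regions_with_graphs = N * K1

assert regions_current == 128
assert regions_with_graphs == 32  # 4x fewer timing regions here
```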
