Use a higher tier-up threshold for JIT code #126795

Closed
brandtbucher opened this issue Nov 13, 2024 · 9 comments
Assignees
Labels: `3.14` (bugs and security fixes), `interpreter-core` (Objects, Python, Grammar, and Parser dirs), `performance` (Performance or resource usage), `topic-JIT`

Comments

@brandtbucher
Member

brandtbucher commented Nov 13, 2024

Our current tier-up threshold is 16, which was chosen a while ago because:

  • in theory, it gives some of our 16-bit branch counters time to stabilize
  • it seemed to work fine in practice
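Conceptually, the tier-up mechanism amounts to a per-code counter that triggers trace creation once it crosses the warmup threshold. A minimal sketch (invented names, not CPython's actual internals):

```python
# Hypothetical sketch of a tier-up counter. Names here are invented for
# illustration; CPython's real counters live in the interpreter internals.

TIER_UP_THRESHOLD = 16  # the old value discussed in this issue

class WarmupCounter:
    def __init__(self, threshold=TIER_UP_THRESHOLD):
        self.threshold = threshold
        self.count = 0

    def record_execution(self):
        """Return True once the code is warm enough to attempt tier-up."""
        self.count += 1
        return self.count >= self.threshold

counter = WarmupCounter()
hits = [counter.record_execution() for _ in range(16)]
print(hits[-1])  # True: the 16th execution crosses the threshold
```

Raising the threshold simply delays that first `True`, so fewer (but hotter) code objects get traced.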

It turns out that we're leaving significant performance and memory improvements on the table by not using higher thresholds. Here are the results of some experiments I ran:

| warmup | speedup | memory | traces created | traces executed | uops executed |
|-------:|--------:|-------:|---------------:|----------------:|--------------:|
| 64     | +0.3%   | -1.2%  | -8.0%          | -0.1%           | +0.2%         |
| 256    | +1.0%   | -2.6%  | -22.0%         | -0.7%           | -1.3%         |
| 1024   | +1.2%   | -3.2%  | -38.6%         | -3.0%           | -1.5%         |
| 2048   | +1.1%   | -3.3%  | -44.9%         | -12.4%          | -3.8%         |
| 4096   | +2.1%   | -3.6%  | -52.2%         | -11.2%          | -3.1%         |
| 8192*  | +2.0%   | -3.4%  | -59.2%         | -12.8%          | -3.1%         |
| 16384* | +2.0%   | -3.6%  | -65.2%         | -14.5%          | -4.7%         |
| 32768* | +1.8%   | -3.8%  | -73.1%         | -18.3%          | -7.1%         |
| 65536* | +1.4%   | -3.9%  | -79.7%         | -21.9%          | -9.2%         |

* For warmups above 4096, exponential backoff is disabled.
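For illustration, the interaction with exponential backoff can be sketched like this (a hypothetical helper, not CPython's exact implementation — the assumption is that each failed trace attempt doubles the effective threshold, and a fixed-width counter caps how far the doubling can go):

```python
# Sketch of exponential backoff on the warmup threshold. The cap of 4096 here
# is an assumption chosen to mirror the footnote above, where warmups beyond
# that point run with backoff effectively disabled.

def backoff_schedule(initial, max_value=4096, attempts=5):
    """Yield the warmup threshold used for each successive trace attempt."""
    threshold = initial
    for _ in range(attempts):
        yield threshold
        threshold = min(threshold * 2, max_value)

print(list(backoff_schedule(16)))    # [16, 32, 64, 128, 256]
print(list(backoff_schedule(4096)))  # already at the cap: [4096, 4096, 4096, 4096, 4096]
```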

Based on these numbers, I think 4096 as a new threshold makes sense (2% faster and 3% less memory without significant hits to the amount of work we actually do in JIT code). I'll open a PR.

My next steps are to run similar experiments with higher side-exit warmup values, and finally with different `JIT_CLEANUP_THRESHOLD` values.

Linked PRs

@brandtbucher brandtbucher added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.14 bugs and security fixes topic-JIT labels Nov 13, 2024
@brandtbucher brandtbucher self-assigned this Nov 13, 2024
@terryjreedy
Member

Is 'speedup' comparing to result for 16?

I have no idea how likely significant correlations between parameters are for this problem, but in general, when optimizing multiple dimensions one at a time, I would recheck after doing all dimensions that the earlier settings are still optimal.

@brandtbucher
Member Author

> Is 'speedup' comparing to result for 16?

Yes! Sorry if that wasn't clear.

> I have no idea how likely significant correlations between parameters are for this problem, but in general, when doing multidimensional optimization 1 dimension at a time, I would recheck after doing all dimensions that earlier settings are still optimal.

Yeah, that's a good idea. I don't know if I'll do another full sweep, but spot-checking the "neighbors" of the current value over time seems useful.

@alonme
Contributor

alonme commented Nov 15, 2024

@brandtbucher I assume that powers of 2 are used because of the exponential backoff?
If so, and given that disabling the exponential backoff doesn't seem to have a large negative effect, I think it might be worth searching the space between 2048 and 8192 for the maximum point?

@brandtbucher
Member Author

> I assume that powers of 2 are used because of the exponential backoff?

It's nice, but they aren't needed (our exponential backoff works fine with non-power-of-two initial values). I mainly chose powers of two because it's a pretty efficient way to search a half-open range of possible values. ;)
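That search strategy can be sketched as follows (a hypothetical helper written for this illustration, not anything in CPython):

```python
# Probe a half-open range [start, stop) at exponentially spaced points.
# Starting from the side-exit default of 64, this reproduces exactly the
# warmup values tried in the experiments above.

def geometric_probe(start, stop):
    value = start
    while value < stop:
        yield value
        value *= 2

print(list(geometric_probe(64, 100_000)))  # 64, 128, 256, ..., 32768, 65536
```

Each step doubles the candidate, so a handful of benchmark runs covers several orders of magnitude.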

> If so, and given that disabling the exponential backoff doesn't seem to have a large negative effect, I think it might be worth searching the space between 2048 and 8192 for the maximum point?

I'd like to avoid overfitting to the benchmarks. There's not going to be some "best" number, just a range of values that work well in practice. Being in the right order of magnitude is probably good enough, especially since good warmup values are very sensitive to different workloads and platforms (and as Terry mentioned, we'll probably want to continue tweaking the values over time).

@brandtbucher
Member Author

(Plus each benchmarking run takes several hours, and there's a clear plateau near the current chosen value.)

@brandtbucher
Member Author

brandtbucher commented Nov 22, 2024

The results of similar experiments with the threshold for warming up side-exits (currently set at 64):

| warmup | speedup | memory | traces created | traces executed | uops executed |
|-------:|--------:|-------:|---------------:|----------------:|--------------:|
| 256    | -0.3%   | -0.1%  | -26.6%         | -0.3%           | -0.1%         |
| 1024   | +0.1%   | -0.4%  | -50.6%         | -2.2%           | -1.7%         |
| 2048   | -0.0%   | -0.5%  | -58.9%         | -1.7%           | -0.6%         |
| 4096   | +0.4%   | -0.5%  | -63.6%         | +3.2%           | -0.3%         |
| 8192*  | +0.2%   | -0.6%  | -70.6%         | -2.0%           | -2.0%         |
| 16384* | +0.1%   | -0.7%  | -75.3%         | -4.9%           | -3.6%         |
| 65536* | -0.0%   | -0.8%  | -80.3%         | -12.2%          | -7.1%         |

The results are less dramatic here, but it does seem like switching to 4096 would also bring small performance improvements and memory savings, with no real hit to uops executed.

Note that these new measurements were taken after the other threshold change to 4096 landed, so they accurately depict the improvements we'd see with the new values.

@brandtbucher
Member Author

Last one to tweak, the "cold executor" invalidation threshold (currently set at 100000):

| threshold | speedup | memory | traces created | traces executed | uops executed |
|----------:|--------:|-------:|---------------:|----------------:|--------------:|
| 16384     | -0.6%   | -0.3%  | +28.6%         | -29.0%          | -23.7%        |
| 32768     | -0.2%   | +0.0%  | +20.6%         | -4.3%           | -1.5%         |
| 65536     | -0.5%   | +0.1%  | +7.4%          | -2.4%           | -0.7%         |
| 131072    | -0.2%   | +0.5%  | -5.9%          | +0.5%           | +0.4%         |
| 262144    | -0.6%   | +0.4%  | -24.0%         | +2.9%           | +1.2%         |
| 524288    | -0.3%   | +0.5%  | -36.4%         | +5.8%           | +2.2%         |
| 1048576   | -0.3%   | -0.1%  | -49.0%         | +10.9%          | +4.3%         |

This seems like it's in a good place, though we might consider higher values in the future.
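As a rough illustration of the kind of policy being tuned here (names and mechanism are invented for this sketch; the real logic lives in CPython's JIT internals), a periodic sweep can invalidate executors that haven't run since the last sweep so their memory can be reclaimed:

```python
# Hypothetical sketch of a "cold executor" cleanup sweep. Every
# JIT_CLEANUP_THRESHOLD events, executors that never ran since the previous
# sweep are dropped; the survivors have their flag reset for the next round.

JIT_CLEANUP_THRESHOLD = 100_000  # current value mentioned above

def sweep_cold_executors(executors):
    """Drop executors not run since the last sweep; reset the rest."""
    survivors = []
    for ex in executors:
        if ex["ran_since_last_sweep"]:
            ex["ran_since_last_sweep"] = False
            survivors.append(ex)
    return survivors

hot = {"ran_since_last_sweep": True}
cold = {"ran_since_last_sweep": False}
print(len(sweep_cold_executors([hot, cold])))  # 1: only the hot executor survives
```

A lower threshold sweeps more often (reclaiming memory sooner but re-tracing more), which matches the "traces created" column growing at the small-threshold end of the table.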

@alonme
Contributor

alonme commented Dec 12, 2024

@brandtbucher for which platform are these results?
I'm asking because we saw big differences between platforms for the other parameters.
What is your opinion on having different thresholds for different platforms?

@brandtbucher
Member Author

> @brandtbucher for which platform are these results?

`x86_64-unknown-linux-gnu`

> What is your opinion for having different thresholds for different platforms?

Maybe in the future, but things are changing frequently enough that it's probably fine to stick with a simpler set of "ballpark" numbers for now and do per-platform fine-tuning later.
