doc: figure out, document amd64 minimum requirements #19593
(Related to #18616)
I think I jumped the gun with the popcount CL; I didn't realize it was not in the same class as BSF/BSR. I think the right requirement is that we assume SSE2 but nothing beyond that.
You're right. Looks like it was introduced along with SSE 4.2 in the Core i7 965. @randall77, any thoughts on using the CPUID flag to decide whether or not to use the instruction? Do you think it would be worthwhile to stash this in a global during runtime init and switch on it? Or dynamically generate a code page at runtime init?
We do stash CPUID results in globals at runtime init, e.g. for deciding whether we have AES or AVX2 instructions. These are all used at a fairly high level to decide between assembly algorithms with 100+ instruction bodies.
I did a simple but, I think, representative micro-benchmark of different ways to have a fallback and it seems like the global check is quite cheap:
I skimmed over the generated code and it seems reasonable. It's possible we could speed up indirect and codegen by making assumptions about what registers they clobber rather than treating them as typical Go functions that follow the Go calling convention. Presumably the fact that the global check benchmarks as faster than no check at all is some microarchitectural quirk, but, nevertheless.
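For readers following along, the "global check" variant discussed above can be pictured roughly as follows. This is only a minimal sketch and not the benchmark code from this thread: `havePopcnt`, `popcountFallback`, and `popcountGuarded` are hypothetical names, the flag is set by hand rather than from CPUID, and `bits.OnesCount64` merely stands in for the path that would use the hardware instruction.

```go
package main

import "math/bits"

// havePopcnt is a hypothetical stand-in for a flag that the runtime
// would fill in from CPUID during init. Here it is set by hand.
var havePopcnt = false

// popcountFallback counts set bits in software by clearing the lowest
// set bit until none remain. It needs nothing beyond baseline amd64.
func popcountFallback(x uint64) int {
	n := 0
	for x != 0 {
		x &= x - 1 // clear the lowest set bit
		n++
	}
	return n
}

// popcountGuarded loads the global flag and branches on it. The
// bits.OnesCount64 call is an assumption standing in for the fast path
// that would be compiled down to a single POPCNT instruction.
func popcountGuarded(x uint64) int {
	if havePopcnt {
		return bits.OnesCount64(x)
	}
	return popcountFallback(x)
}
```

Calling `popcountGuarded(0xFF)` gives 8 on either path; the cost under discussion is the extra load and well-predicted branch on `havePopcnt` in front of the fast path.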
I'm not sure that benchmarks tell the whole story, in particular I'm afraid that we are running in:
What is "nondestructive FP math"? Is that the same as 3-operand instructions? I think we'll get to a GOAMD64 at some point, but at this point the case for it isn't compelling. POPCNT and 3-operand floating point just aren't important enough to justify the split IMO.
Interesting. That might explain why BenchmarkGlobal is faster than BenchmarkUnguarded. BenchmarkGlobal happens to generate a MOV to the target register of the POPCNT, which should break the dependency chain. Amusingly, if you lift the load of havePopcount out of the loop so the compiler keeps it in a register, then BenchmarkGlobal loses this clobber and drops to the same 0.9 ns/op. Still, the benchmarks are showing 2 to 3 cycles per loop iteration. That's not far from optimal whether or not we do the global load and I suspect a little more cleverness from the compiler could drop the global check even lower.
Yes, nondestructive means 3-operand (the source is not destroyed).
Another instruction set of interest is FMA.
Decided stock amd64 is the minimum requirement. I've updated the wiki. |
In https://golang.org/cl/38320, @khr adds POPCOUNT intrinsics for amd64.
I realized that our https://golang.org/wiki/MinimumRequirements doesn't say anything about GOARCH=amd64 (only 386).
Which version of amd64 do we assume? Decide and document.
/cc @randall77 @ianlancetaylor @griesemer @rsc @mdempsky @josharian @minux @cherrymui @aclements