Skip to content

cmd/compile: no automatic use of fused multiply-add on amd64 even with GOAMD64=v3 #71204

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dominikh opened this issue Jan 9, 2025 · 8 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@dominikh
Copy link
Member

dominikh commented Jan 9, 2025

Go version

go version go1.23.4 linux/amd64

Output of go env in your module/workspace:

-

What did you do?

Compile the following program with GOARCH=amd64 GOAMD64=v3 go build -gcflags=-S

package pkg

import "math"

func fooImplicit(x, y, z float64) float64 {
	return x*y + z
}

func fooExplicit(x, y, z float64) float64 {
	return math.FMA(x, y, z)
}

What did you see happen?

command-line-arguments.fooImplicit STEXT nosplit size=9 args=0x18 locals=0x0 funcid=0x0 align=0x0
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:5)	TEXT	command-line-arguments.fooImplicit(SB), NOSPLIT|NOFRAME|ABIInternal, $0-24
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:5)	FUNCDATA	$0, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:5)	FUNCDATA	$1, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:5)	FUNCDATA	$5, command-line-arguments.fooImplicit.arginfo1(SB)
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:5)	FUNCDATA	$6, command-line-arguments.fooImplicit.argliveinfo(SB)
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:5)	PCDATA	$3, $1
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:6)	MULSD	X1, X0
	0x0004 00004 (/home/dominikh/prj/src/example.com/bar.go:6)	ADDSD	X2, X0
	0x0008 00008 (/home/dominikh/prj/src/example.com/bar.go:6)	RET
	0x0000 f2 0f 59 c1 f2 0f 58 c2 c3                       ..Y...X..
command-line-arguments.fooExplicit STEXT nosplit size=9 args=0x18 locals=0x0 funcid=0x0 align=0x0
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:9)	TEXT	command-line-arguments.fooExplicit(SB), NOSPLIT|NOFRAME|ABIInternal, $0-24
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:9)	FUNCDATA	$0, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:9)	FUNCDATA	$1, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:9)	FUNCDATA	$5, command-line-arguments.fooExplicit.arginfo1(SB)
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:9)	FUNCDATA	$6, command-line-arguments.fooExplicit.argliveinfo(SB)
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:9)	PCDATA	$3, $1
	0x0000 00000 (/home/dominikh/prj/src/example.com/bar.go:10)	VFMADD231SD	X1, X0, X2
	0x0005 00005 (/home/dominikh/prj/src/example.com/bar.go:10)	MOVUPS	X2, X0
	0x0008 00008 (/home/dominikh/prj/src/example.com/bar.go:10)	RET
	0x0000 c4 e2 f9 b9 d1 0f 10 c2 c3                       .........

What did you expect to see?

I expected fooImplicit and fooExplicit to generate identical code when setting GOAMD64=v3.

On arm64, the compiler detects the x*y + z pattern and automatically uses FMA. On amd64, math.FMA uses runtime feature detection unless the GOAMD64 environment variable is set to v3 or higher, in which case calls to math.FMA compile directly to VFMADD231SD. However, x*y + z isn't detected, regardless of the value of GOAMD64.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 9, 2025
@randall77 randall77 added this to the Unplanned milestone Jan 9, 2025
@prattmic prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 10, 2025
@prattmic
Copy link
Member

cc @golang/compiler

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/646335 mentions this issue: cmd/compile: lower x*z + y to FMA on AMD64 >= 3

@jake-ciolek
Copy link
Member

Hi @dominikh

I did the initial implementation in CL 646335 and it shows some nice gains on GOAMD=v3:

name                    old time/op  new time/op  delta
Acos-16                 4.58ns ± 0%  3.36ns ± 0%  -26.68%  (p=0.008 n=5+5)
Acosh-16                8.04ns ± 1%  6.44ns ± 0%  -19.95%  (p=0.008 n=5+5)
Asin-16                 4.28ns ± 0%  3.32ns ± 0%  -22.24%  (p=0.008 n=5+5)
Asinh-16                9.92ns ± 0%  8.62ns ± 0%  -13.13%  (p=0.008 n=5+5)
Atan-16                 2.31ns ± 0%  1.84ns ± 0%  -20.02%  (p=0.008 n=5+5)
Atanh-16                7.79ns ± 0%  7.03ns ± 0%   -9.67%  (p=0.008 n=5+5)
Atan2-16                3.93ns ± 0%  3.52ns ± 0%  -10.35%  (p=0.000 n=5+4)
Cbrt-16                 4.62ns ± 0%  4.41ns ± 0%   -4.57%  (p=0.016
Erf-16                  2.71ns ± 0%  2.25ns ± 0%  -16.69%  (p=0.008 n=5+5)
Erfc-16                 3.06ns ± 0%  2.67ns ± 0%  -13.00%  (p=0.016 n=5+4)
Erfinv-16               3.88ns ± 0%  2.84ns ± 3%  -26.83%  (p=0.008 n=5+5)
Erfcinv-16              4.08ns ± 0%  3.01ns ± 1%  -26.27%  (p=0.008 n=5+5)
Exp-16                  3.29ns ± 0%  3.37ns ± 2%   +2.64%  (p=0.016 n=4+5)
ExpGo-16                8.44ns ± 0%  7.48ns ± 1%  -11.37%  (p=0.008 n=5+5)
Expm1-16                4.46ns ± 0%  3.69ns ± 2%  -17.26%  (p=0.016 n=4+5)
Exp2-16                 8.20ns ± 0%  7.39ns ± 2%   -9.94%  (p=0.008 n=5+5)
Exp2Go-16               8.26ns ± 0%  7.23ns ± 0%  -12.49%  (p=0.016 n=4+5)
Abs-16                  0.26ns ± 3%  0.22ns ± 1%  -16.34%  (p=0.008 n=5+5)
Log1p-16                5.00ns ± 0%  3.99ns ± 0%  -20.14%  (p=0.008 n=5+5)

It fails on this particular test case. The results from FMA are slightly different than MULS(S|D)/ADDS(S|D). From my reading, FMA is supposed to be more precise. Do we make another set of tests for when GOAMD64 >= v3?

Could you chime in here @randall77 ? Thanks!

--- FAIL: TestCode/FP (0.00s)
    ssa_test.go:174: Failed:
        multiplyAdd: float32(0.6046603 * 0.9405091) + 0.6645601, expected 1.2332485, got 1.2332486
        multiplyAdd: 0.6645601 += float32(0.6046603 * 0.9405091), expected 1.2332485, got 1.2332486
        multiplyAdd: float32(0.67908466 * 0.21855305) + 0.20318687, expected 0.3516029, got 0.35160288
        multiplyAdd: 0.20318687 += float32(0.67908466 * 0.21855305), expected 0.3516029, got 0.35160288
        multiplyAdd: float32(0.29311424 * 0.29708257) + 0.752573, expected 0.8396522, got 0.8396521
        multiplyAdd: 0.752573 += float32(0.29311424 * 0.29708257), expected 0.8396522, got 0.8396521
        multiplyAdd: float32(0.5305857 * 0.2535405) + 0.282081, expected 0.41660595, got 0.41660598
        multiplyAdd: 0.282081 += float32(0.5305857 * 0.2535405), expected 0.41660595, got 0.41660598
        multiplyAdd: float32(0.29711226 * 0.89436173) + 0.097454615, expected 0.36318043, got 0.36318046
        multiplyAdd: 0.097454615 += float32(0.29711226 * 0.89436173), expected 0.36318043, got 0.36318046
        multiplyAdd: float32(0.6810783 * 0.24151509) + 0.31152245, expected 0.47601312, got 0.47601315
        multiplyAdd: 0.31152245 += float32(0.6810783 * 0.24151509), expected 0.47601312, got 0.47601315
        multiplyAdd: float32(0.73023146 * 0.18292491) + 0.4283571, expected 0.5619346, got 0.56193465
        multiplyAdd: 0.4283571 += float32(0.73023146 * 0.18292491), expected 0.5619346, got 0.56193465
        multiplyAdd: float32(0.89634174 * 0.32208398) + 0.7211478, expected 1.009845, got 1.0098451
        multiplyAdd: 0.7211478 += float32(0.89634174 * 0.32208398), expected 1.009845, got 1.0098451
        multiplyAdd: float32(0.6280982 * 0.12675293) + 0.2813303, expected 0.36094356, got 0.3609436
        multiplyAdd: 0.2813303 += float32(0.6280982 * 0.12675293), expected 0.36094356, got 0.3609436
        multiplyAdd: float32(0.29400632 * 0.75316125) + 0.15096405, expected 0.3723982, got 0.37239823
        multiplyAdd: 0.15096405 += float32(0.29400632 * 0.75316125), expected 0.3723982, got 0.37239823
        multiplyAdd: float64(0.4688898449024232 * 0.28303415118044517) + 0.29310185733681576, expected 0.42581369658590373, got 0.4258136965859037
        multiplyAdd: 0.29310185733681576 += float64(0.4688898449024232 * 0.28303415118044517), expected 0.42581369658590373, got 0.4258136965859037
        multiplyAdd: float64(0.7886049150193449 * 0.3618054804803169) + 0.8805431227416171, expected 1.1658647029293308, got 1.1658647029293305
        multiplyAdd: 0.8805431227416171 += float64(0.7886049150193449 * 0.3618054804803169), expected 1.1658647029293308, got 1.1658647029293305
        multiplyAdd: float64(0.7302314772948083 * 0.18292491645390843) + 0.4283570818068078, expected 0.5619346137829748, got 0.5619346137829747
        multiplyAdd: 0.4283570818068078 += float64(0.7302314772948083 * 0.18292491645390843), expected 0.5619346137829748, got 0.5619346137829747
        multiplyAdd: float64(0.6908388315056789 * 0.7109071952999951) + 0.5637795958152644, expected 1.0549018919252924, got 1.0549018919252926
        multiplyAdd: 0.5637795958152644 += float64(0.6908388315056789 * 0.7109071952999951), expected 1.0549018919252924, got 1.0549018919252926
        multiplyAdd: float64(0.4584424785756506 * 0.6001655953233308) + 0.02626515060968944, expected 0.3014065536855481, got 0.30140655368554814
        multiplyAdd: 0.02626515060968944 += float64(0.4584424785756506 * 0.6001655953233308), expected 0.3014065536855481, got 0.30140655368554814
        multiplyAdd: float64(0.539210105890946 * 0.9756748149873165) + 0.7507630564795985, expected 1.2768567767840384, got 1.2768567767840386
        multiplyAdd: 0.7507630564795985 += float64(0.539210105890946 * 0.9756748149873165), expected 1.2768567767840384, got 1.2768567767840386
        multiplyAdd: float64(0.7830349733960021 * 0.3932509992288867) + 0.1304138461737918, expected 0.4383431318929343, got 0.43834313189293433
        multiplyAdd: 0.1304138461737918 += float64(0.7830349733960021 * 0.3932509992288867), expected 0.4383431318929343, got 0.43834313189293433
        multiplyAdd: float64(0.6841751300974551 * 0.6530402051353608) + 0.524499759549865, expected 0.9712936268572192, got 0.9712936268572193
        multiplyAdd: 0.524499759549865 += float64(0.6841751300974551 * 0.6530402051353608), expected 0.9712936268572192, got 0.9712936268572193
        multiplyAdd: float64(0.3691117091643448 * 0.826454125634742) + 0.34768170859156955, expected 0.6527356034505334, got 0.6527356034505333
        multiplyAdd: 0.34768170859156955 += float64(0.3691117091643448 * 0.826454125634742), expected 0.6527356034505334, got 0.6527356034505333
        multiplyAdd: float64(0.16867966833433606 * 0.33136826030698385) + 0.8279280961505588, expected 0.8838231843956668, got 0.8838231843956669
        multiplyAdd: 0.8279280961505588 += float64(0.16867966833433606 * 0.33136826030698385), expected 0.8838231843956668, got 0.8838231843956669

@jake-ciolek
Copy link
Member

Weirdly, it didn't fail on gerrit (besides on the recently buggy arm64 runners).

@randall77
Copy link
Contributor

That test should not fail. It uses the float64(x*y)+z format to force intermediate rounding. I think what is happening is that the float64->float64 conversion is being thrown away.

On arm64, we don't throw it away.

On amd64, we do.

We do need to run those tests in AMD64=v1 and AMD64=v3 modes now. I think we already do, we have a goamd64v3 builder on build.golang.org. Maybe that config is not in the default trybot set.

@jake-ciolek
Copy link
Member

Thanks.

So, if I understand correctly, the fix is to implement zerowidth LoweredRound(32|64)F Ops on AMD64 and use them when lowering from Round(32|64)F?

@randall77
Copy link
Contributor

Yes, something like that. I'm not sure where the LoweredRound ops end up, but following arm64 and doing the same thing should work.

@dmitshur dmitshur modified the milestones: Unplanned, Go1.25 Feb 13, 2025
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

7 participants