
cmd/compile: play better with perf #73753

Open
randall77 opened this issue May 16, 2025 · 4 comments
Labels: compiler/runtime, Implementation

@randall77
Contributor

randall77 commented May 16, 2025

perf is a sampling-based analysis tool on Linux. It's something of a Swiss Army knife, but in basic usage it just samples PCs periodically and reports CPU usage by function.

For this issue, I'm interested in how perf gets call stacks, which is what the -g option to perf record enables. Currently perf defaults to --call-graph=fp, which means it uses frame pointers to unwind stacks.

Example program:

package main

import (
	"os"
	"runtime/pprof"
)

type T struct{ a, b, c, d, e, f, g, h int }

//go:noinline
func leaf() {
	a = b
}

var a, b T

type U struct{ a, b, c, d, e, f, g, h, i, j, k, l int }

//go:noinline
func duff() {
	c = d
}

var c, d U

//go:noinline
func work() {
	for i := 0; i < 1000000000; i++ {
		leaf()
		duff()
	}
}

//go:noinline
func main() {
	if len(os.Args) >= 2 {
		f, _ := os.Create(os.Args[1])
		defer f.Close()
		pprof.StartCPUProfile(f)
	}
	work()
	if len(os.Args) >= 2 {
		pprof.StopCPUProfile()
	}
}

Example usage:

> go build example.go
> ./example cpu.prof       // use Go's pprof
> perf record -g ./example // use perf
> perf report -g

Go's pprof seems to always get call stacks perfectly correct.
perf, on the other hand, has some issues. Because perf uses frame pointers, it can sometimes get stack backtraces wrong. In particular, currently it has the following problems:

  1. On amd64, if a sample point is in (some parts of) the prolog or epilog, it incorrectly skips the parent frame. It appears as if the grandparent directly called the sampled function.
  2. On amd64, if the sample point is in a frameless leaf function, the same thing happens.

Both of these problems relate to the fact that perf uses frame pointers to unwind the stack. In both situations the sampled function does not have its own frame pointer set up at the sample point, so perf unwinds incorrectly. To get the parent frame, it does pc = *(fp+8); fp = *fp. When fp still points at the parent's frame, *(fp+8) is the parent's return address, which lies in the grandparent. A pc inside the parent itself is never found: after the sampled pc, the next pc reported is from the grandparent.

It seems that this is not a problem on arm64, although I'm not sure exactly how perf avoids it there. TODO: how about other architectures? Is this related to link-register vs stack push of the return address?

We have a hack to solve this problem (CL 7728) when the callee is runtime.duffzero or runtime.duffcopy. The caller sets up a dummy frame pointer before calling either of those functions. When perf samples inside those two functions, it correctly finds the parent frame. This hack was added because in perf profiles we see a fair amount of these two functions, and it helps to see the immediate caller (these functions are called from lots of places, unlike a typical frameless leaf function). But for all the other cases in 1 and 2, we are out of luck.

The runtime.duffzero/runtime.duffcopy hack was also ported to arm64, but probably that was not needed. It is also causing problems, see #73748. Probably we should remove it, although I don't yet understand how perf solves this problem on arm64.

So, with all that said, how might we proceed here?

  1. perf is not important. Remove the hack above, and just live with the fact that perf backtraces might be missing the parent. Not the end of the world.
  2. perf is really important. We should add frame pointer setup and teardown to frameless leaf functions.
  3. Do nothing. The duff functions are the only frameless leaf functions that get proper parents.
  4. Convince perf to do stack walking without using frame pointers. Modern perf has some other ways of finding stacks, including --call-graph=lbr (last branch record) and --call-graph=dwarf (using dwarf info in a .eh_frame section).

Only option 4 would in principle handle the prolog/epilog problem. Just adding frame pointers everywhere (option 2) would not.

As mentioned above, maybe this only matters for amd64?

@randall77 randall77 self-assigned this May 16, 2025
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 16, 2025
@randall77 randall77 added this to the Go1.26 milestone May 16, 2025
@randall77
Contributor Author

@golang/compiler @golang/runtime

@randall77
Contributor Author

Why arm64 works: the unwinder parses the function prolog (and epilog?) instructions and decides at what point a new entry in the frame pointer linked list has been set up. If that has not happened yet, it knows it is in a leaf function and that X30 holds a return address into the parent.

At least, if I'm reading the code right. Gory details at https://github.com/libunwind/libunwind/blob/master/src/aarch64/Gstep.c
See in particular the comment above get_frame_state.

@gabyhelp gabyhelp added the Implementation Issues describing a semantics-preserving change to the Go implementation. label May 16, 2025
@randall77
Contributor Author

Ok, so arm64 actually has the opposite problem. The detector in libunwind for the "made a frame record" case doesn't trigger for the code our compiler generates. It is looking for

stp x29, x30, [sp+N]!

but we generate two separate store instructions to do this:

str x30, [sp+N]!
str x29, [sp-8]

Since the detector never sees a frame record setup, it treats every function as a leaf. So when the frame record is actually set up, we get a duplicate frame during backtracing. Since everything is thought to be a leaf, after the sampled pc we use the contents of x30, and then walk the frame pointer list. But when the function has indeed set up its frame record, x30 can be junk. Typically it will either be the function's return address, in which case the parent gets reported twice, or a return address into the sampled function (from the last call that the sampled function made), in which case the sampled function will be reported twice. Atypically, x30 will be completely junk. Not sure what happens then.

I was wondering why I was seeing duplicate entries in tracebacks using perf. Now I know.
