cmd/compile: play better with perf #73753
Labels
compiler/runtime
Implementation
`perf` is a sampling-based analysis tool on Linux. It's kind of a Swiss-army-knife tool, but the basic usage just samples PCs periodically and reports CPU usage by function.

For this issue, I'm interested in how `perf` gets call stacks, which is the `-g` option to `perf record`. Currently the default for `perf` is `--call-graph=fp`, which means it uses frame pointers to unwind stacks.

Example program:
Example usage:
Go's `pprof` seems to always get call stacks perfectly correct. `perf`, on the other hand, has some issues. Because `perf` uses frame pointers, it can sometimes get stack backtraces wrong. In particular, currently it has the following problems:

Both of these problems relate to the fact that `perf` uses frame pointers to unwind the stack. Because the frame pointer has not been set up in either of the above situations, `perf` unwinds incorrectly. To get the parent frame, it does `pc = *(fp+8); fp = *fp`. When `fp` is from the parent frame, a pc from the parent frame itself is never found; after the current sample point, the next pc is from the grandparent.

It seems that this is not a problem on `arm64`. Not sure how exactly, but it does not suffer from this problem. TODO: how about other architectures? Is this related to link-register vs stack push of the return address?

We have a hack to solve this problem (CL 7728) when the callee is `runtime.duffzero` or `runtime.duffcopy`. The caller sets up a dummy frame pointer before calling either of those functions. When `perf` samples inside those two functions, it correctly finds the parent frame. This hack was added because in `perf` profiles we see a fair amount of these two functions, and it helps to see the immediate caller (these functions are called from lots of places, unlike a typical frameless leaf function). But for all the other cases in 1 and 2, we are out of luck.

The `runtime.duffzero`/`runtime.duffcopy` hack was also ported to `arm64`, but probably that was not needed. It is also causing problems, see #73748. Probably we should remove it, although I don't yet understand how `perf` solves this problem on `arm64`.

So, with all that said, how might we proceed here?
- `perf` is not important. Remove the hack above, and just live with the fact that `perf` backtraces might be missing the parent. Not the end of the world.
- `perf` is really important. We should add frame pointer setup and teardown to frameless leaf functions.
- Get `perf` to do stack walking without using frame pointers. Modern `perf` has some other ways of finding stacks, including `--call-graph=lbr` (last branch record) and `--call-graph=dwarf` (using dwarf info in a `.eh_frame` section).

Only 4 would in principle handle the prolog/epilog problem. Just adding frame pointers everywhere would not.
As mentioned above, maybe this only matters for `amd64`?