Reproduced on linux/arm, go1.10.2, while investigating #24260
The relevant code snippets are as follows:
```go
func (p *cpuProfile) add(gp *g, stk []uintptr) {
	// Simple cas-lock to coordinate with setcpuprofilerate.
	for !atomic.Cas(&prof.signalLock, 0, 1) { // [[ this loops forever ]]
		osyield()
	}
	...
```
and
```go
func setcpuprofilerate(hz int32) {
	...
	// Stop profiler on this thread so that it is safe to lock prof.
	// if a profiling signal came in while we had prof locked,
	// it would deadlock.
	setThreadCPUProfiler(0)
	for !atomic.Cas(&prof.signalLock, 0, 1) {
		osyield()
	}
	...
```
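Taken together, my reconstruction of the interleaving that produces the hang is:

```
thread T: setcpuprofilerate: setThreadCPUProfiler(0)      // disarm the itimer
          (a SIGPROF generated before the disarm can still be pending
           for the process; see below)
thread T: setcpuprofilerate: Cas(&prof.signalLock, 0, 1)  // lock acquired
thread T: the pending SIGPROF is delivered, on this same thread
thread T: sigprof -> cpuProfile.add: Cas(&prof.signalLock, 0, 1) fails
          forever; the holder is the very frame the handler interrupted,
          and it cannot run again until the handler returns
```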
The trouble is that `setThreadCPUProfiler(0)` doesn't actually do what the comment above it says it should. On Linux, SIGPROF is delivered by arming a timer with setitimer(2), and `setThreadCPUProfiler` disarms the timer by calling `setitimer(ITIMER_PROF, &it, NULL)` with an empty `struct itimerval`. But setitimer(2) delivers signals to the process, not the thread: disarming the timer on thread 0 doesn't mean that thread 1 hasn't already queued a SIGPROF that may or may not be delivered to thread 0. See #14434. (Also see the BUGS section of the man page, which points out that signal generation and signal delivery are distinct events. From that we might conclude that it's possible to receive at least one SIGPROF delivery after the timer has been disarmed; I'm currently spelunking in the Linux source tree to see whether that's actually true.)
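For concreteness, here is a user-space sketch of what that disarm amounts to (my own illustration built on the raw syscall, not the runtime's code). Note that the call takes no thread argument at all:

```go
package main

import (
	"syscall"
	"unsafe"
)

// itimerval mirrors C's struct itimerval from <sys/time.h>.
type itimerval struct {
	Interval syscall.Timeval // period between signals
	Value    syscall.Timeval // time until the next signal; all zero disarms
}

// disarmProfTimer is roughly what setThreadCPUProfiler(0) boils down to:
// setitimer(ITIMER_PROF, &zero, NULL). There is no way to name which
// thread should stop receiving SIGPROF; the timer is a per-process resource.
func disarmProfTimer() {
	const _ITIMER_PROF = 2 // from <sys/time.h>
	var zero itimerval
	syscall.Syscall(syscall.SYS_SETITIMER, _ITIMER_PROF,
		uintptr(unsafe.Pointer(&zero)), 0)
}

func main() { disarmProfTimer() }
```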
A confounding issue is that `atomic.Cas` on arm (up through go1.10.2, but not at tip) calls a kernel helper that leads the kernel to call sched_yield, so the loop in question actually yields two time slices per iteration, not one. I don't know whether this is necessarily related to the issue, but it may exacerbate it, since it means that each iteration of the loop could sleep longer than the SIGPROF timer interval.
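To put a rough number on that, here is a hypothetical little benchmark (not part of my repro) that times sched_yield(2), the same call the runtime's osyield makes on Linux, for comparison against the SIGPROF interval (10ms at the default 100 Hz rate):

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	const iters = 10000
	start := time.Now()
	for i := 0; i < iters; i++ {
		// sched_yield(2): what osyield does on Linux; the spin loops
		// above pay this cost (twice, on arm) per failed Cas.
		syscall.Syscall(syscall.SYS_SCHED_YIELD, 0, 0, 0)
	}
	avg := time.Since(start) / iters
	// If this approaches 10ms on a loaded box, a single loop iteration
	// can outlast the default 100 Hz profiling interval.
	fmt.Printf("avg sched_yield: %v\n", avg)
}
```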
I'm trying to reproduce this at tip. I'm also going to capture the output of `perf record` so that there's a little more evidence that this is actually what's happening. Frustratingly, the fastest repro I have still takes 20+ minutes to hit, so debugging this issue is slow going.
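In case it's useful to anyone else chasing this, the plan is to watch the generation/delivery split directly via the kernel's signal tracepoints, something along these lines (hypothetical invocation; `<pid>` is the hung process, and this assumes the tracepoints are available on the box):

```
# 27 == SIGPROF on Linux
perf record -e signal:signal_generate --filter 'sig == 27' \
            -e signal:signal_deliver  --filter 'sig == 27' -p <pid>
perf script
```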