runtime/pprof: livelock in cpuProfile.add #25785

@philhofer

Reproduced on linux/arm, go1.10.2, while investigating #24260

The relevant code snippets are as follows:

func (p *cpuProfile) add(gp *g, stk []uintptr) {
	// Simple cas-lock to coordinate with setcpuprofilerate.
	for !atomic.Cas(&prof.signalLock, 0, 1) {
		// [[ this loops forever ]]
		osyield()
	}
...

and

func setcpuprofilerate(hz int32) {

	...

	// Stop profiler on this thread so that it is safe to lock prof.
	// if a profiling signal came in while we had prof locked,
	// it would deadlock.
	setThreadCPUProfiler(0)

	for !atomic.Cas(&prof.signalLock, 0, 1) {
		osyield()
	}

...

The trouble is that setThreadCPUProfiler(0) doesn't actually do what the comment above it says it does. On Linux, SIGPROF is delivered by arming a timer with setitimer(2), and setThreadCPUProfiler disarms the timer by calling setitimer(ITIMER_PROF, &it, NULL) with a zeroed struct itimerval. But setitimer(2) delivers signals to the process, not to a particular thread, so disarming the timer on thread 0 doesn't guarantee that thread 1 hasn't already queued a SIGPROF that may or may not be delivered to thread 0 (see #14434). If such a signal arrives on the thread that holds prof.signalLock, the handler's loop in cpuProfile.add can never succeed. Also see the BUGS section of the man page, which points out that signal generation and signal delivery are distinct events. That suggests it's possible to receive at least one SIGPROF after the timer has been disarmed; I'm currently spelunking in the Linux source tree to see if that's actually true.
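To make the failure mode concrete, here is a minimal user-space sketch of the same locking pattern. It is not runtime code: signalLock, tryLock, unlock and lateSignalAcquires are hypothetical names, sync/atomic stands in for the runtime-internal atomic.Cas, runtime.Gosched stands in for osyield, and the spin is bounded so the program terminates instead of hanging. The "signal handler" is modeled as a plain function call made while the lock is still held by the same thread.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// signalLock stands in for prof.signalLock: 0 = free, 1 = held.
var signalLock uint32

// tryLock spins on a compare-and-swap like the loops in the snippets above,
// but gives up after maxSpins attempts so this demonstration terminates.
func tryLock(maxSpins int) bool {
	for i := 0; i < maxSpins; i++ {
		if atomic.CompareAndSwapUint32(&signalLock, 0, 1) {
			return true
		}
		runtime.Gosched() // stand-in for osyield()
	}
	return false
}

// unlock releases the lock.
func unlock() { atomic.StoreUint32(&signalLock, 0) }

// lateSignalAcquires models a SIGPROF handler that runs on the thread
// already holding the lock (the setcpuprofilerate side). It reports
// whether the handler ever managed to take the lock.
func lateSignalAcquires(maxSpins int) bool {
	if !tryLock(1) {
		panic("lock unexpectedly held at start")
	}
	defer unlock()

	// The "handler" runs here, before the holder can reach its unlock.
	// No amount of yielding helps: the only code that can release the
	// lock is suspended underneath the handler.
	if tryLock(maxSpins) {
		return true
	}
	return false
}

func main() {
	fmt.Println("handler acquired lock:", lateSignalAcquires(1000))
}
```

In the real runtime the spin is unbounded, so the equivalent of the inner tryLock never returns; that matches the loop marked as looping forever in cpuProfile.add.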

A confounding issue is that atomic.Cas on arm (up through go1.10.2, but not at tip) calls a kernel helper that leads the kernel to call sched_yield, so each iteration of the loop in question actually yields two time slices rather than one. I don't know whether this is related to the issue, but it may exacerbate it, since it means the loop could sleep longer in each iteration than the SIGPROF timer interval.

I'm trying to reproduce this at tip. I'm also going to capture the output of perf record so that there's a little more evidence that this is actually what's happening. Frustratingly, the fastest repro I have still takes 20+ minutes to hit, so debugging this issue is slow going.

Labels: FrozenDueToAge, NeedsFix (The path to resolution is known, but the work has not been done.)
