
runtime: redesign GC synchronization #12041

Closed

aclements opened this issue Aug 5, 2015 · 5 comments

Comments

@aclements
Member

The barrier mechanism for transitioning from concurrent mark to mark termination has grown to be incredibly complex and fragile. It evolved incrementally from the getfull barrier used by the 1.4 STW GC, but while this barrier was acceptable for STW GC (and for STW mark termination in concurrent GC), its use during concurrent marking has well exceeded its original design criteria.

There are several difficulties:

  1. The barrier needs to exit when there's no more work to be found, but unlike in STW GC, in concurrent GC this is a transient condition since write barriers can produce more work at any time during concurrent execution. As a result, one worker can think that GC is done, signal completion, and exit; then more work is created by write barriers; then another worker can wake up from sleep and continue working. This draws out transitions between phases.
  2. Unlike in STW GC, participants come and go. There are a fixed number of dedicated mark workers, but fractional and idle mark workers and mutator assists are transient and may or may not be present when we run out of work. This, combined with 1 and the current timed polling loop in getfull, makes mark completion susceptible to live-lock: it's possible for the program to be in some transient worker each time one of the dedicated workers wakes up to check if it can exit the getfull barrier. It also complicates reasoning about the system because dedicated workers participate directly in the getfull barrier, but transient workers must use trygetfull instead because they have exit conditions that aren't captured by getfull (e.g., fractional workers exit when preempted). The complexity of implementing these exit conditions contributed to #11677 (runtime: concurrent GC leads to 5x memory consumption).
  3. The GC needs to know when it's almost out of work during concurrent marking so it can enable more aggressive (but expensive) marking, rescan roots, and flush all work caches. Currently we define "almost out of work" as "out of work in the global queues, but there may be per-P cached work". Determining this and synchronizing the workers forced us to layer other synchronization mechanisms on top of the getfull barrier (work.bgMark1 and work.bgMark2) that complicate the coordinator and the worker scheduler (e.g., #11694, runtime: sporadic huge GC pause).
  4. As a result of the additional barriers introduced to solve 3, there is a sleep/wakeup race in the coordinator that is difficult to solve cleanly. When transitioning from mark 1 to mark 2, we're producing new work via scanning while simultaneously consuming work in mutator assists. This means that when we actually enter mark 2, the workers may already have consumed all of the new work that was just created; no new workers will be started unless more work is created by write barriers (which usually happens, but isn't guaranteed), so there will be nothing to signal completion. We fix this by checking for this condition when we consider scheduling a worker, but this can cause a delay of up to the scheduler quantum.
  5. There are timed polling loops in multiple places (a simplified sketch of this pattern follows the list). The loop in getfull delays completion of the mark phase and, while the sleep in this loop doesn't consume CPU, it also doesn't let the mutator make use of the CPU while it's waiting. This delay also contributed to #11677 (runtime: concurrent GC leads to 5x memory consumption). Likewise, there's a loop in assists that can delay execution of a single mutator.
  6. The background mark worker goroutines are effectively higher priority than the coordinator goroutine. If a worker is the first to signal completion, it will put the coordinator on the worker's P's run queue, but if the scheduler decides it needs to continue scheduling the worker (which is possible because of 1), it will delay execution of the coordinator.
  7. When transitioning from mark 1 to mark 2, the strict barrier means the coordinator has to do a non-trivial amount of work during which background workers can do very little and assists may simply spin. It would be much better if this were handled by a global state machine that could distribute this work just like we distribute the mark work.
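
For context, the pattern criticized in point 5 looks roughly like the sketch below. This is a simplified, hypothetical model of a getfull-style timed polling barrier (the names are invented, not the runtime's code): a worker that finds no work sleeps for a fixed interval and re-checks, so the P does nothing useful while sleeping and phase completion can lag by up to one polling period.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type workQueue struct {
	mu   sync.Mutex
	bufs [][]byte // full work buffers waiting to be scanned
}

func (q *workQueue) tryGet() ([]byte, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.bufs) == 0 {
		return nil, false
	}
	b := q.bufs[len(q.bufs)-1]
	q.bufs = q.bufs[:len(q.bufs)-1]
	return b, true
}

// getFull blocks until a buffer is available or the phase is flagged done,
// re-checking on a 100µs timer. The sleeping worker holds its P without
// doing useful work, and completion is only noticed on the next poll.
func getFull(q *workQueue, done *int32) ([]byte, bool) {
	for {
		if b, ok := q.tryGet(); ok {
			return b, true
		}
		if atomic.LoadInt32(done) != 0 {
			return nil, false
		}
		time.Sleep(100 * time.Microsecond) // timed poll
	}
}

func main() {
	q := &workQueue{}
	var done int32
	go func() {
		// Model a write barrier producing work while the worker is polling.
		q.mu.Lock()
		q.bufs = append(q.bufs, make([]byte, 64))
		q.mu.Unlock()
		time.Sleep(time.Millisecond)
		atomic.StoreInt32(&done, 1)
	}()
	for {
		b, ok := getFull(q, &done)
		if !ok {
			fmt.Println("phase complete")
			return
		}
		fmt.Println("scanned buffer of", len(b), "bytes")
	}
}
```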

For all of these reasons, I think we need to redesign the GC barrier for 1.6.

This isn't a concrete proposal, but there are some properties I think the synchronization should have:

  1. Replace polling loops with sleep and wakeup. Ideally, find a way to unify the various wakeup conditions different parts of the system have. Dedicated workers need to block until there's a work buffer available or the phase terminates. Fractional and idle workers need to block until there's a work buffer available, the preemption flag is set, or the phase terminates. Assists need to block until there's a work buffer available, credit to be stolen, or the phase terminates. Also, ideally, the coordinator itself would symmetrically participate in the barrier and block until the phase terminates (if there is a coordinator at all; see #11970, runtime: replace GC coordinator with state machine).
  2. Rather than focusing on the number of goroutines doing work, focus on how many work buffers are "checked out" (see the sketch after this list). GC cannot enter mark termination if there is any work on the global queues or any work buffers checked out. Implementation-wise, this may be very similar to what we do with nproc/nwait right now, except that write barriers don't interact with nproc/nwait because they're not "doing work".
  3. Make completion monotonic: when a phase is completed, all workers should stop, even if more work shows up.
  4. Distribute work currently done on the coordinator, such as rescanning, to the workers and assists.
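
As a rough illustration of how properties 1-3 might fit together (a sketch with invented names, not a proposal for the actual runtime data structures): completion is defined as "no queued work and no checked-out buffers", waiters block on a condition variable instead of polling, and completion is latched so it stays monotonic even if more work shows up afterwards.

```go
package main

import (
	"fmt"
	"sync"
)

type gcWork struct {
	mu         sync.Mutex
	cond       *sync.Cond
	queued     [][]byte // full buffers on the global queue
	checkedOut int      // buffers currently held by workers or assists
	done       bool     // latched: once true, the phase never "un-completes"
}

func newGCWork() *gcWork {
	w := &gcWork{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// get blocks until a buffer is available or the phase has completed.
// It returns (nil, false) once completion is latched.
func (w *gcWork) get() ([]byte, bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for len(w.queued) == 0 && !w.done {
		if w.checkedOut == 0 {
			// No queued work and nothing checked out: latch completion
			// and wake every waiter.
			w.done = true
			w.cond.Broadcast()
			return nil, false
		}
		w.cond.Wait() // sleep until put() or completion wakes us
	}
	if w.done {
		return nil, false
	}
	b := w.queued[len(w.queued)-1]
	w.queued = w.queued[:len(w.queued)-1]
	w.checkedOut++
	return b, true
}

// put returns a drained buffer. If it carries new work it goes back on the
// queue; either way the checked-out count drops and waiters are woken so
// they can re-evaluate the completion condition.
func (w *gcWork) put(b []byte, hasWork bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.checkedOut--
	if hasWork && !w.done {
		w.queued = append(w.queued, b)
	}
	w.cond.Broadcast()
}

func main() {
	w := newGCWork()
	w.queued = [][]byte{make([]byte, 64), make([]byte, 64)}

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				b, ok := w.get()
				if !ok {
					return // phase complete
				}
				fmt.Printf("worker %d scanned %d bytes\n", id, len(b))
				w.put(b, false) // this buffer produced no new work
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("all workers exited; phase complete")
}
```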

Most likely the solution to this is deeply tied to the solution to #11970.

@RLH @rsc

@gopherbot
Contributor

CL https://golang.org/cl/15890 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/16297 mentions this issue.

aclements added a commit that referenced this issue Nov 3, 2015
GC assists must block until the assist can be satisfied (either
through stealing credit or doing work) or the GC cycle ends.
Currently, this is implemented as a retry loop with a 100 µs delay.
This obviously isn't ideal, as it wastes CPU and delays mutator
execution. It also has the somewhat peculiar downside that sleeping a
G requires allocation, and this requires working around recursive
allocation.

Replace this timed delay with a proper scheduling queue. When an
assist can't be satisfied immediately, it adds the allocating G to a
queue and parks it. Any time background scan credit is flushed, it
consults this queue, directly satisfies the debt of queued assists,
and wakes up satisfied assists before flushing any remaining credit to
the background credit pool.

No effect on the go1 benchmarks. Slightly speeds up the garbage
benchmark.

name              old time/op  new time/op  delta
XBenchGarbage-12  5.81ms ± 1%  5.72ms ± 4%  -1.65%  (p=0.011 n=20+20)

Updates #12041.

Change-Id: I8ee3b6274dd097b12b10a8030796a958a4b0e7b7
Reviewed-on: https://go-review.googlesource.com/15890
Reviewed-by: Rick Hudson <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
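
The queueing scheme this commit describes can be modeled roughly as follows, using an ordinary mutex and channels rather than the runtime's actual data structures (all names here are invented for the sketch): a parked assist first steals from the banked credit, and each flush pays queued assists before banking the remainder.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// assist is a goroutine waiting for its scan-work debt to be paid.
type assist struct {
	debt int64
	done chan struct{} // closed once the debt is fully satisfied
}

type assistQueue struct {
	mu     sync.Mutex
	queue  []*assist
	credit int64 // background credit pool
}

// park blocks the caller until its debt is covered, first stealing from the
// banked background credit and then waiting for future flushes.
func (q *assistQueue) park(debt int64) {
	q.mu.Lock()
	if steal := q.credit; steal > 0 {
		if steal > debt {
			steal = debt
		}
		q.credit -= steal
		debt -= steal
	}
	if debt == 0 {
		q.mu.Unlock()
		return // satisfied entirely from banked credit
	}
	a := &assist{debt: debt, done: make(chan struct{})}
	q.queue = append(q.queue, a)
	q.mu.Unlock()
	<-a.done // "park" until a flush wakes us
}

// flush adds newly earned scan credit: it pays queued assists first, waking
// any whose debt is fully covered, and banks whatever is left over.
func (q *assistQueue) flush(credit int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.queue) > 0 && credit > 0 {
		a := q.queue[0]
		pay := a.debt
		if credit < pay {
			pay = credit
		}
		a.debt -= pay
		credit -= pay
		if a.debt == 0 {
			close(a.done) // wake the satisfied assist
			q.queue = q.queue[1:]
		}
	}
	q.credit += credit // remainder goes to the background pool
}

func main() {
	q := &assistQueue{}
	go func() {
		// Model a background worker flushing credit as it scans.
		time.Sleep(time.Millisecond)
		q.flush(40) // partial: the assist stays parked
		q.flush(80) // covers the rest; 20 left over is banked
	}()
	q.park(100) // an assist that can't be satisfied immediately

	q.mu.Lock()
	banked := q.credit
	q.mu.Unlock()
	fmt.Println("assist satisfied; banked credit:", banked)
}
```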
aclements added a commit that referenced this issue Nov 4, 2015
Currently dedicated mark workers participate in the getfull barrier
during concurrent mark. However, the getfull barrier wasn't designed
for concurrent work and this causes no end of headaches.

In the concurrent setting, participants come and go. This makes mark
completion susceptible to live-lock: since dedicated workers are only
periodically polling for completion, it's possible for the program to
be in some transient worker each time one of the dedicated workers
wakes up to check if it can exit the getfull barrier. It also
complicates reasoning about the system because dedicated workers
participate directly in the getfull barrier, but transient workers
must instead use trygetfull because they have exit conditions that
aren't captured by getfull (e.g., fractional workers exit when
preempted). The complexity of implementing these exit conditions
contributed to #11677. Furthermore, the getfull barrier is inefficient
because we could be running user code instead of spinning on a P. In
effect, we're dedicating 25% of the CPU to marking even if that means
we have to spin to make that 25%. It also causes issues on Windows
because we can't actually sleep for 100µs (#8687).

Fix this by making dedicated workers no longer participate in the
getfull barrier. Instead, dedicated workers simply return to the
scheduler when they fail to get more work, regardless of what other
workers are doing, and the scheduler only starts new dedicated workers
if there's work available. Everything that needs to be handled by this
barrier is already handled by detection of mark completion.

This makes the system much more symmetric because all workers and
assists now use trygetfull during concurrent mark. It also loosens the
25% CPU target so that we can give some of that 25% back to user code
if there isn't enough work to keep the mark worker busy. And it
eliminates the problematic 100µs sleep on Windows during concurrent
mark (though not during mark termination).

The downside of this is that if we hit a bottleneck in the heap graph
that then expands back out, the system may shut down dedicated workers
and take a while to start them back up. We'll address this in the next
commit.

Updates #12041 and #8687.

No effect on the go1 benchmarks. This slows down the garbage benchmark
by 9%, but we'll more than make it up in the next commit.

name              old time/op  new time/op  delta
XBenchGarbage-12  5.80ms ± 2%  6.32ms ± 4%  +9.03%  (p=0.000 n=20+20)

Change-Id: I65100a9ba005a8b5cf97940798918672ea9dd09b
Reviewed-on: https://go-review.googlesource.com/16297
Reviewed-by: Rick Hudson <[email protected]>
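
A toy model of the change this commit describes (invented names, not the runtime's code): the dedicated worker drains work with a non-blocking tryGetFull and returns to its caller as soon as it runs dry, and the "scheduler" only runs a worker while work is available instead of keeping one spinning in a barrier.

```go
package main

import (
	"fmt"
	"sync"
)

type workQueue struct {
	mu   sync.Mutex
	bufs [][]byte
}

// tryGetFull is non-blocking: it either hands back a full buffer or reports
// that there is currently no work, without sleeping or spinning.
func (q *workQueue) tryGetFull() ([]byte, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.bufs) == 0 {
		return nil, false
	}
	b := q.bufs[len(q.bufs)-1]
	q.bufs = q.bufs[:len(q.bufs)-1]
	return b, true
}

func (q *workQueue) put(b []byte) {
	q.mu.Lock()
	q.bufs = append(q.bufs, b)
	q.mu.Unlock()
}

func (q *workQueue) hasWork() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.bufs) > 0
}

// dedicatedWorker scans until tryGetFull fails, then returns to the
// "scheduler" instead of blocking in a getfull-style barrier.
func dedicatedWorker(id int, q *workQueue) {
	for {
		b, ok := q.tryGetFull()
		if !ok {
			fmt.Printf("worker %d: no work, returning to scheduler\n", id)
			return
		}
		fmt.Printf("worker %d: scanned %d bytes\n", id, len(b))
	}
}

func main() {
	q := &workQueue{}
	for i := 0; i < 3; i++ {
		q.put(make([]byte, 64))
	}
	// The scheduler only starts a dedicated worker while there is work;
	// once the queue is empty, no worker is kept alive just to poll.
	for round := 1; q.hasWork(); round++ {
		dedicatedWorker(round, q)
	}
	fmt.Println("queue empty; the P is free to run user goroutines")
}
```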
@rsc
Contributor

rsc commented Nov 24, 2015

Is this done?

@RLH
Contributor

RLH commented Nov 24, 2015

I believe so, Austin did the work, I just reviewed it.

On Tue, Nov 24, 2015 at 11:42 AM, Russ Cox [email protected] wrote:

Is this done?



@aclements
Member Author

There is still a timed polling loop in getfull (point 5), but it only happens during STW mark termination and doesn't seem to be causing problems at this point (it may extend mark termination time, but I don't have concrete evidence for that). I believe every other point is resolved, so I'll close this issue.

golang locked and limited conversation to collaborators Nov 27, 2016