
runtime: redesign GC synchronization #12041

Closed

aclements opened this issue Aug 5, 2015 · 5 comments

Comments

@aclements
Member

The barrier mechanism for transitioning from concurrent mark to mark termination has grown to be incredibly complex and fragile. It evolved incrementally from the getfull barrier used by the 1.4 STW GC, but while this barrier was acceptable for STW GC (and for STW mark termination in concurrent GC), its use during concurrent marking has well exceeded its original design criteria.

There are several difficulties:

  1. The barrier needs to exit when there's no more work to be found, but unlike in STW GC, in concurrent GC this is a transient condition since write barriers can produce more work at any time during concurrent execution. As a result, one worker can think that GC is done, signal completion, and exit; then more work is created by write barriers; then another worker can wake up from sleep and continue working. This draws out transitions between phases.
  2. Unlike in STW GC, participants come and go. There are a fixed number of dedicated mark workers, but fractional and idle mark workers and mutator assists are transient and may or may not be present when we run out of work. This, combined with 1 and the current timed polling loop in getfull, makes mark completion susceptible to live-lock: it's possible for the program to be in some transient worker each time one of the dedicated workers wakes up to check if it can exit the getfull barrier. It also complicates reasoning about the system because dedicated workers participate directly in the getfull barrier, but transient workers must use trygetfull instead because they have exit conditions that aren't captured by getfull (e.g., fractional workers exit when preempted). The complexity of implementing these exit conditions contributed to #11677 (runtime: concurrent GC leads to 5x memory consumption).
  3. The GC needs to know when it's almost out of work during concurrent marking so it can enable more aggressive (but expensive) marking, rescan roots, and flush all work caches. Currently we define "almost out of work" as "out of work in the global queues, but there may be per-P cached work". Determining this and synchronizing the workers forced us to layer other synchronization mechanisms on top of the getfull barrier (work.bgMark1 and work.bgMark2) that complicate the coordinator and the worker scheduler (e.g., #11694, runtime: sporadic huge GC pause).
  4. As a result of the additional barriers introduced to solve 3, there is a sleep/wakeup race in the coordinator that is difficult to solve cleanly. When transitioning from mark 1 to mark 2, we're producing new work via scanning while simultaneously consuming work in mutator assists. This means that when we actually enter mark 2, the workers may already have consumed all of the new work that was just created; no new workers will be started unless more work is created by write barriers (which usually happens, but isn't guaranteed), so there will be nothing to signal completion. We fix this by checking for this condition when we consider scheduling a worker, but this can cause a delay of up to the scheduler quantum.
  5. There are timed polling loops in multiple places (a simplified sketch of this pattern follows the list). The loop in getfull delays completion of the mark phase and, while the sleep in this loop doesn't consume CPU, it also doesn't let the mutator make use of the CPU while it's waiting. This delay also contributed to #11677 (runtime: concurrent GC leads to 5x memory consumption). Likewise, there's a loop in assists that can delay execution of a single mutator.
  6. The background mark worker goroutines are effectively higher priority than the coordinator goroutine. If a worker is the first to signal completion, it will put the coordinator on the worker's P's run queue, but if the scheduler decides it needs to continue scheduling the worker (which is possible because of 1), it will delay execution of the coordinator.
  7. When transitioning from mark 1 to mark 2, the strict barrier means the coordinator has to do a non-trivial amount of work during which background workers can do very little and assists may simply spin. It would be much better if this were handled by a global state machine that could distribute this work just like we distribute the mark work.
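
For context, the pattern criticized in point 5 looks roughly like the sketch below. This is a simplified, hypothetical model of a getfull-style timed polling barrier (the names are invented, not the runtime's code): a worker that finds no work sleeps for a fixed interval and re-checks, so the P does nothing useful while sleeping and phase completion can lag by up to one polling period.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type workQueue struct {
	mu   sync.Mutex
	bufs [][]byte // full work buffers waiting to be scanned
}

func (q *workQueue) tryGet() ([]byte, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.bufs) == 0 {
		return nil, false
	}
	b := q.bufs[len(q.bufs)-1]
	q.bufs = q.bufs[:len(q.bufs)-1]
	return b, true
}

// getFull blocks until a buffer is available or the phase is flagged done,
// re-checking on a 100µs timer. The sleeping worker holds its P without
// doing useful work, and completion is only noticed on the next poll.
func getFull(q *workQueue, done *int32) ([]byte, bool) {
	for {
		if b, ok := q.tryGet(); ok {
			return b, true
		}
		if atomic.LoadInt32(done) != 0 {
			return nil, false
		}
		time.Sleep(100 * time.Microsecond) // timed poll
	}
}

func main() {
	q := &workQueue{}
	var done int32
	go func() {
		// Model a write barrier producing work while the worker is polling.
		q.mu.Lock()
		q.bufs = append(q.bufs, make([]byte, 64))
		q.mu.Unlock()
		time.Sleep(time.Millisecond)
		atomic.StoreInt32(&done, 1)
	}()
	for {
		b, ok := getFull(q, &done)
		if !ok {
			fmt.Println("phase complete")
			return
		}
		fmt.Println("scanned buffer of", len(b), "bytes")
	}
}
```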

For all of these reasons, I think we need to redesign the GC barrier for 1.6.

This isn't a concrete proposal, but there are some properties I think the synchronization should have:

  1. Replace polling loops with sleep and wakeup. Ideally, find a way to unify the various wakeup conditions different parts of the system have. Dedicated workers need to block until there's a work buffer available or the phase terminates. Fractional and idle workers need to block until there's a work buffer available, the preemption flag is set, or the phase terminates. Assists need to block until there's a work buffer available, credit to be stolen, or the phase terminates. Also, ideally, the coordinator itself would symmetrically participate in the barrier and block until the phase terminates (if there is a coordinator at all; see #11970, runtime: replace GC coordinator with state machine).
  2. Rather than focusing on the number of goroutines doing work, focus on how many work buffers are "checked out" (see the sketch after this list). GC cannot enter mark termination if there is any work on the global queues or any work buffers checked out. Implementation-wise, this may be very similar to what we do with nproc/nwait right now, except that write barriers don't interact with nproc/nwait because they're not "doing work".
  3. Make completion monotonic: when a phase is completed, all workers should stop, even if more work shows up.
  4. Distribute work currently done on the coordinator, such as rescanning, to the workers and assists.
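
As a rough illustration of how properties 1-3 might fit together (a sketch with invented names, not a proposal for the actual runtime data structures): completion is defined as "no queued work and no checked-out buffers", waiters block on a condition variable instead of polling, and completion is latched so it stays monotonic even if more work shows up afterwards.

```go
package main

import (
	"fmt"
	"sync"
)

type gcWork struct {
	mu         sync.Mutex
	cond       *sync.Cond
	queued     [][]byte // full buffers on the global queue
	checkedOut int      // buffers currently held by workers or assists
	done       bool     // latched: once true, the phase never "un-completes"
}

func newGCWork() *gcWork {
	w := &gcWork{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// get blocks until a buffer is available or the phase has completed.
// It returns (nil, false) once completion is latched.
func (w *gcWork) get() ([]byte, bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for len(w.queued) == 0 && !w.done {
		if w.checkedOut == 0 {
			// No queued work and nothing checked out: latch completion
			// and wake every waiter.
			w.done = true
			w.cond.Broadcast()
			return nil, false
		}
		w.cond.Wait() // sleep until put() or completion wakes us
	}
	if w.done {
		return nil, false
	}
	b := w.queued[len(w.queued)-1]
	w.queued = w.queued[:len(w.queued)-1]
	w.checkedOut++
	return b, true
}

// put returns a drained buffer. If it carries new work it goes back on the
// queue; either way the checked-out count drops and waiters are woken so
// they can re-evaluate the completion condition.
func (w *gcWork) put(b []byte, hasWork bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.checkedOut--
	if hasWork && !w.done {
		w.queued = append(w.queued, b)
	}
	w.cond.Broadcast()
}

func main() {
	w := newGCWork()
	w.queued = [][]byte{make([]byte, 64), make([]byte, 64)}

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				b, ok := w.get()
				if !ok {
					return // phase complete
				}
				fmt.Printf("worker %d scanned %d bytes\n", id, len(b))
				w.put(b, false) // this buffer produced no new work
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("all workers exited; phase complete")
}
```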

Most likely the solution to this is deeply tied to the solution to #11970.

@RLH @rsc

@gopherbot
Contributor

CL https://golang.org/cl/15890 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/16297 mentions this issue.

aclements added a commit that referenced this issue Nov 3, 2015
GC assists must block until the assist can be satisfied (either
through stealing credit or doing work) or the GC cycle ends.
Currently, this is implemented as a retry loop with a 100 µs delay.
This obviously isn't ideal, as it wastes CPU and delays mutator
execution. It also has the somewhat peculiar downside that sleeping a
G requires allocation, and this requires working around recursive
allocation.

Replace this timed delay with a proper scheduling queue. When an
assist can't be satisfied immediately, it adds the allocating G to a
queue and parks it. Any time background scan credit is flushed, it
consults this queue, directly satisfies the debt of queued assists,
and wakes up satisfied assists before flushing any remaining credit to
the background credit pool.

No effect on the go1 benchmarks. Slightly speeds up the garbage
benchmark.

name              old time/op  new time/op  delta
XBenchGarbage-12  5.81ms ± 1%  5.72ms ± 4%  -1.65%  (p=0.011 n=20+20)

Updates #12041.

Change-Id: I8ee3b6274dd097b12b10a8030796a958a4b0e7b7
Reviewed-on: https://go-review.googlesource.com/15890
Reviewed-by: Rick Hudson <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
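
The queueing scheme this commit describes can be modeled roughly as follows, using an ordinary mutex and channels rather than the runtime's actual data structures (all names here are invented for the sketch): a parked assist first steals from the banked credit, and each flush pays queued assists before banking the remainder.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// assist is a goroutine waiting for its scan-work debt to be paid.
type assist struct {
	debt int64
	done chan struct{} // closed once the debt is fully satisfied
}

type assistQueue struct {
	mu     sync.Mutex
	queue  []*assist
	credit int64 // background credit pool
}

// park blocks the caller until its debt is covered, first stealing from the
// banked background credit and then waiting for future flushes.
func (q *assistQueue) park(debt int64) {
	q.mu.Lock()
	if steal := q.credit; steal > 0 {
		if steal > debt {
			steal = debt
		}
		q.credit -= steal
		debt -= steal
	}
	if debt == 0 {
		q.mu.Unlock()
		return // satisfied entirely from banked credit
	}
	a := &assist{debt: debt, done: make(chan struct{})}
	q.queue = append(q.queue, a)
	q.mu.Unlock()
	<-a.done // "park" until a flush wakes us
}

// flush adds newly earned scan credit: it pays queued assists first, waking
// any whose debt is fully covered, and banks whatever is left over.
func (q *assistQueue) flush(credit int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.queue) > 0 && credit > 0 {
		a := q.queue[0]
		pay := a.debt
		if credit < pay {
			pay = credit
		}
		a.debt -= pay
		credit -= pay
		if a.debt == 0 {
			close(a.done) // wake the satisfied assist
			q.queue = q.queue[1:]
		}
	}
	q.credit += credit // remainder goes to the background pool
}

func main() {
	q := &assistQueue{}
	go func() {
		// Model a background worker flushing credit as it scans.
		time.Sleep(time.Millisecond)
		q.flush(40) // partial: the assist stays parked
		q.flush(80) // covers the rest; 20 left over is banked
	}()
	q.park(100) // an assist that can't be satisfied immediately

	q.mu.Lock()
	banked := q.credit
	q.mu.Unlock()
	fmt.Println("assist satisfied; banked credit:", banked)
}
```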
aclements added a commit that referenced this issue Nov 4, 2015
Currently dedicated mark workers participate in the getfull barrier
during concurrent mark. However, the getfull barrier wasn't designed
for concurrent work and this causes no end of headaches.

In the concurrent setting, participants come and go. This makes mark
completion susceptible to live-lock: since dedicated workers are only
periodically polling for completion, it's possible for the program to
be in some transient worker each time one of the dedicated workers
wakes up to check if it can exit the getfull barrier. It also
complicates reasoning about the system because dedicated workers
participate directly in the getfull barrier, but transient workers
must instead use trygetfull because they have exit conditions that
aren't captured by getfull (e.g., fractional workers exit when
preempted). The complexity of implementing these exit conditions
contributed to #11677. Furthermore, the getfull barrier is inefficient
because we could be running user code instead of spinning on a P. In
effect, we're dedicating 25% of the CPU to marking even if that means
we have to spin to make that 25%. It also causes issues on Windows
because we can't actually sleep for 100µs (#8687).

Fix this by making dedicated workers no longer participate in the
getfull barrier. Instead, dedicated workers simply return to the
scheduler when they fail to get more work, regardless of what other
workers are doing, and the scheduler only starts new dedicated workers
if there's work available. Everything that needs to be handled by this
barrier is already handled by detection of mark completion.

This makes the system much more symmetric because all workers and
assists now use trygetfull during concurrent mark. It also loosens the
25% CPU target so that we can give some of that 25% back to user code
if there isn't enough work to keep the mark worker busy. And it
eliminates the problematic 100µs sleep on Windows during concurrent
mark (though not during mark termination).

The downside of this is that if we hit a bottleneck in the heap graph
that then expands back out, the system may shut down dedicated workers
and take a while to start them back up. We'll address this in the next
commit.

Updates #12041 and #8687.

No effect on the go1 benchmarks. This slows down the garbage benchmark
by 9%, but we'll more than make it up in the next commit.

name              old time/op  new time/op  delta
XBenchGarbage-12  5.80ms ± 2%  6.32ms ± 4%  +9.03%  (p=0.000 n=20+20)

Change-Id: I65100a9ba005a8b5cf97940798918672ea9dd09b
Reviewed-on: https://go-review.googlesource.com/16297
Reviewed-by: Rick Hudson <[email protected]>
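
A toy model of the change this commit describes (invented names, not the runtime's code): the dedicated worker drains work with a non-blocking tryGetFull and returns to its caller as soon as it runs dry, and the "scheduler" only runs a worker while work is available instead of keeping one spinning in a barrier.

```go
package main

import (
	"fmt"
	"sync"
)

type workQueue struct {
	mu   sync.Mutex
	bufs [][]byte
}

// tryGetFull is non-blocking: it either hands back a full buffer or reports
// that there is currently no work, without sleeping or spinning.
func (q *workQueue) tryGetFull() ([]byte, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.bufs) == 0 {
		return nil, false
	}
	b := q.bufs[len(q.bufs)-1]
	q.bufs = q.bufs[:len(q.bufs)-1]
	return b, true
}

func (q *workQueue) put(b []byte) {
	q.mu.Lock()
	q.bufs = append(q.bufs, b)
	q.mu.Unlock()
}

func (q *workQueue) hasWork() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.bufs) > 0
}

// dedicatedWorker scans until tryGetFull fails, then returns to the
// "scheduler" instead of blocking in a getfull-style barrier.
func dedicatedWorker(id int, q *workQueue) {
	for {
		b, ok := q.tryGetFull()
		if !ok {
			fmt.Printf("worker %d: no work, returning to scheduler\n", id)
			return
		}
		fmt.Printf("worker %d: scanned %d bytes\n", id, len(b))
	}
}

func main() {
	q := &workQueue{}
	for i := 0; i < 3; i++ {
		q.put(make([]byte, 64))
	}
	// The scheduler only starts a dedicated worker while there is work;
	// once the queue is empty, no worker is kept alive just to poll.
	for round := 1; q.hasWork(); round++ {
		dedicatedWorker(round, q)
	}
	fmt.Println("queue empty; the P is free to run user goroutines")
}
```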
@rsc
Contributor

rsc commented Nov 24, 2015

Is this done?

@RLH
Contributor

RLH commented Nov 24, 2015

I believe so, Austin did the work, I just reviewed it.

On Tue, Nov 24, 2015 at 11:42 AM, Russ Cox [email protected] wrote:

Is this done?



@aclements
Member Author

There is still a timed polling loop in getfull (point 5), but it only happens during STW mark termination and doesn't seem to be causing problems at this point (it may extend mark termination time, but I don't have concrete evidence for that). I believe every other point is resolved, so I'll close this issue.

golang locked and limited conversation to collaborators Nov 27, 2016