
Commit 03eb948

runtime: separate soft and hard heap limits
Currently, GC pacing is based on a single hard heap limit computed from GOGC. In order to achieve this hard limit, assist pacing makes the conservative assumption that the entire heap is live. However, in the steady state (with GOGC=100), only half of the heap is live. As a result, the garbage collector works twice as hard as necessary and finishes half way between the trigger and the goal. Since this is a stable state for the trigger controller, this repeats from cycle to cycle. Matters are even worse if GOGC is higher. For example, if GOGC=200, only a third of the heap is live in steady state, so the GC will work three times harder than necessary and finish only a third of the way between the trigger and the goal.

Since this causes the garbage collector to consume ~50% of the available CPU during marking instead of the intended 25%, about 25% of the CPU goes to mutator assists. This high mutator assist cost causes high mutator latency variability.

This commit improves the situation by separating the heap goal into two goals: a soft goal and a hard goal. The soft goal is set based on GOGC, just like the current goal is, and the hard goal is set at a 10% larger heap than the soft goal. Prior to the soft goal, assist pacing assumes the heap is in steady state (e.g., only half of it is live). Between the soft goal and the hard goal, assist pacing switches to the current conservative assumption that the entire heap is live.

In benchmarks, this nearly eliminates mutator assists. However, since background marking is fixed at 25% CPU, this causes the trigger controller to saturate, which leads to somewhat higher variability in heap size. The next commit will address this.

The lower CPU usage of course leads to longer mark cycles, though really it means the mark cycles are as long as they should have been in the first place. This does, however, lead to two potential down-sides compared to the current pacing policy:

1. the total overhead of the write barrier is higher because it's enabled more of the time, and

2. the heap size may be larger because there's more floating garbage.

We addressed 1 by significantly improving the performance of the write barrier in the preceding commits. 2 can be demonstrated in intense GC benchmarks, but doesn't seem to be a problem in any real applications.

Updates #14951.
Updates #14812 (fixes?).
Fixes #18534.

This has no significant effect on the throughput of the github.com/dr2chase/bent benchmarks-50.
This has little overall throughput effect on the go1 benchmarks:

name                      old time/op    new time/op    delta
BinaryTree17-12             2.41s ± 0%     2.40s ± 0%   -0.22%  (p=0.007 n=20+18)
Fannkuch11-12               2.95s ± 0%     2.95s ± 0%   +0.07%  (p=0.003 n=17+18)
FmtFprintfEmpty-12         41.7ns ± 3%    42.2ns ± 0%   +1.17%  (p=0.002 n=20+15)
FmtFprintfString-12        66.5ns ± 0%    67.9ns ± 2%   +2.16%  (p=0.000 n=16+20)
FmtFprintfInt-12           77.6ns ± 2%    75.6ns ± 3%   -2.55%  (p=0.000 n=19+19)
FmtFprintfIntInt-12         124ns ± 1%     123ns ± 1%   -0.98%  (p=0.000 n=18+17)
FmtFprintfPrefixedInt-12    151ns ± 1%     148ns ± 1%   -1.75%  (p=0.000 n=19+20)
FmtFprintfFloat-12          210ns ± 1%     212ns ± 0%   +0.75%  (p=0.000 n=19+16)
FmtManyArgs-12              501ns ± 1%     499ns ± 1%   -0.30%  (p=0.041 n=17+19)
GobDecode-12               6.50ms ± 1%    6.49ms ± 1%     ~     (p=0.234 n=19+19)
GobEncode-12               5.43ms ± 0%    5.47ms ± 0%   +0.75%  (p=0.000 n=20+19)
Gzip-12                     216ms ± 1%     220ms ± 1%   +1.71%  (p=0.000 n=19+20)
Gunzip-12                  38.6ms ± 0%    38.8ms ± 0%   +0.66%  (p=0.000 n=18+19)
HTTPClientServer-12        78.1µs ± 1%    78.5µs ± 1%   +0.49%  (p=0.035 n=20+20)
JSONEncode-12              12.1ms ± 0%    12.2ms ± 0%   +1.05%  (p=0.000 n=18+17)
JSONDecode-12              53.0ms ± 0%    52.3ms ± 0%   -1.27%  (p=0.000 n=19+19)
Mandelbrot200-12           3.74ms ± 0%    3.69ms ± 0%   -1.17%  (p=0.000 n=18+19)
GoParse-12                 3.17ms ± 1%    3.17ms ± 1%     ~     (p=0.569 n=19+20)
RegexpMatchEasy0_32-12     73.2ns ± 1%    73.7ns ± 0%   +0.76%  (p=0.000 n=18+17)
RegexpMatchEasy0_1K-12      239ns ± 0%     238ns ± 0%   -0.27%  (p=0.000 n=13+17)
RegexpMatchEasy1_32-12     69.0ns ± 2%    69.1ns ± 1%     ~     (p=0.404 n=19+19)
RegexpMatchEasy1_1K-12      367ns ± 1%     365ns ± 1%   -0.60%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12     105ns ± 1%     104ns ± 1%   -1.24%  (p=0.000 n=19+16)
RegexpMatchMedium_1K-12    34.1µs ± 2%    33.6µs ± 3%   -1.60%  (p=0.000 n=20+20)
RegexpMatchHard_32-12      1.62µs ± 1%    1.67µs ± 1%   +2.75%  (p=0.000 n=18+18)
RegexpMatchHard_1K-12      48.8µs ± 1%    50.3µs ± 2%   +3.07%  (p=0.000 n=20+19)
Revcomp-12                  386ms ± 0%     384ms ± 0%   -0.57%  (p=0.000 n=20+19)
Template-12                59.9ms ± 1%    61.1ms ± 1%   +2.01%  (p=0.000 n=20+19)
TimeParse-12                301ns ± 2%     307ns ± 0%   +2.11%  (p=0.000 n=20+19)
TimeFormat-12               323ns ± 0%     323ns ± 0%     ~     (all samples are equal)
[Geo mean]                 47.0µs         47.1µs        +0.23%

https://perf.golang.org/search?q=upload:20171030.1

Likewise, the throughput effect on the x/benchmarks is minimal (and reasonably positive on the garbage benchmark with a large heap):

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.40ms ± 4%  2.29ms ± 3%  -4.57%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12    2.23ms ± 1%  2.24ms ± 2%  +0.59%  (p=0.016 n=19+18)
HTTP-12                      12.5µs ± 1%  12.6µs ± 1%    ~     (p=0.326 n=20+19)
JSON-12                      11.1ms ± 1%  11.3ms ± 2%  +2.15%  (p=0.000 n=16+17)

It does increase the heap size of the garbage benchmarks, but seems to have relatively little impact on more realistic programs. Also, we'll gain some of this back with the next commit.

name                         old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12  1.21G ± 1%          1.88G ± 2%          +55.59%  (p=0.000 n=19+20)
Garbage/benchmem-MB=64-12     168M ± 3%           248M ± 8%          +48.08%  (p=0.000 n=18+20)
HTTP-12                      45.6M ± 9%          47.0M ±27%            ~      (p=0.925 n=20+20)
JSON-12                       193M ±11%           206M ±11%           +7.06%  (p=0.001 n=20+20)

https://perf.golang.org/search?q=upload:20171030.2

Change-Id: Ic78904135f832b4d64056cbe734ab979f5ad9736
Reviewed-on: https://go-review.googlesource.com/59970
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Rick Hudson <[email protected]>
1 parent bc723cf commit 03eb948
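For orientation before reading the diff, here is a minimal standalone sketch of the pacing split described in the commit message. It is an illustration, not the runtime code: the names paceTargets, softGoal, scannableHeap, and gcPercent are stand-ins chosen for this example, while the real logic lives in gcControllerState.revise and uses the memstats fields shown in the diff below.

package main

import "fmt"

// paceTargets models the revised assist pacing: under the soft goal it
// assumes the heap is in steady state; past the soft goal it assumes the
// whole scannable heap is live and allows up to 10% overshoot (the hard
// goal).
func paceTargets(liveHeap, scannableHeap, softGoal uint64, gcPercent int) (heapGoal, scanWorkExpected int64) {
	if liveHeap <= softGoal {
		// Soft regime: only 100/(100+GOGC) of the scannable heap is
		// expected to be live in steady state.
		heapGoal = int64(softGoal)
		scanWorkExpected = int64(float64(scannableHeap) * 100 / float64(100+gcPercent))
	} else {
		// Hard regime: conservative assumption that every scannable
		// byte is live, with a 10% larger heap goal.
		const maxOvershoot = 1.1
		heapGoal = int64(float64(softGoal) * maxOvershoot)
		scanWorkExpected = int64(scannableHeap)
	}
	return
}

func main() {
	// Steady state with GOGC=100: a 512 MB scannable heap is assumed to
	// be only half live, so the expected remaining scan work is halved.
	goal, work := paceTargets(400<<20, 512<<20, 512<<20, 100)
	fmt.Println(goal, work) // 536870912 268435456

	// Past the soft goal: the hard goal is 10% above the soft goal and
	// all scannable bytes are assumed live.
	goal, work = paceTargets(520<<20, 512<<20, 512<<20, 100)
	fmt.Println(goal, work) // 590558003 536870912
}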


src/runtime/mgc.go

Lines changed: 48 additions & 22 deletions
@@ -500,47 +500,73 @@ func (c *gcControllerState) startCycle() {
 // is when assists are enabled and the necessary statistics are
 // available).
 func (c *gcControllerState) revise() {
-	// Compute the expected scan work remaining.
+	gcpercent := gcpercent
+	if gcpercent < 0 {
+		// If GC is disabled but we're running a forced GC,
+		// act like GOGC is huge for the below calculations.
+		gcpercent = 100000
+	}
+	live := atomic.Load64(&memstats.heap_live)
+
+	var heapGoal, scanWorkExpected int64
+	if live <= memstats.next_gc {
+		// We're under the soft goal. Pace GC to complete at
+		// next_gc assuming the heap is in steady-state.
+		heapGoal = int64(memstats.next_gc)
+
+		// Compute the expected scan work remaining.
+		//
+		// This is estimated based on the expected
+		// steady-state scannable heap. For example, with
+		// GOGC=100, only half of the scannable heap is
+		// expected to be live, so that's what we target.
+		//
+		// (This is a float calculation to avoid overflowing on
+		// 100*heap_scan.)
+		scanWorkExpected = int64(float64(memstats.heap_scan) * 100 / float64(100+gcpercent))
+	} else {
+		// We're past the soft goal. Pace GC so that in the
+		// worst case it will complete by the hard goal.
+		const maxOvershoot = 1.1
+		heapGoal = int64(float64(memstats.next_gc) * maxOvershoot)
+
+		// Compute the upper bound on the scan work remaining.
+		scanWorkExpected = int64(memstats.heap_scan)
+	}
+
+	// Compute the remaining scan work estimate.
 	//
 	// Note that we currently count allocations during GC as both
 	// scannable heap (heap_scan) and scan work completed
-	// (scanWork), so this difference won't be changed by
-	// allocations during GC.
-	//
-	// This particular estimate is a strict upper bound on the
-	// possible remaining scan work for the current heap.
-	// You might consider dividing this by 2 (or by
-	// (100+GOGC)/100) to counter this over-estimation, but
-	// benchmarks show that this has almost no effect on mean
-	// mutator utilization, heap size, or assist time and it
-	// introduces the danger of under-estimating and letting the
-	// mutator outpace the garbage collector.
-	scanWorkExpected := int64(memstats.heap_scan) - c.scanWork
-	if scanWorkExpected < 1000 {
+	// (scanWork), so allocation will change this difference will
+	// slowly in the soft regime and not at all in the hard
+	// regime.
+	scanWorkRemaining := scanWorkExpected - c.scanWork
+	if scanWorkRemaining < 1000 {
 		// We set a somewhat arbitrary lower bound on
 		// remaining scan work since if we aim a little high,
 		// we can miss by a little.
 		//
 		// We *do* need to enforce that this is at least 1,
 		// since marking is racy and double-scanning objects
-		// may legitimately make the expected scan work
-		// negative.
-		scanWorkExpected = 1000
+		// may legitimately make the remaining scan work
+		// negative, even in the hard goal regime.
+		scanWorkRemaining = 1000
 	}
 
 	// Compute the heap distance remaining.
-	heapDistance := int64(memstats.next_gc) - int64(atomic.Load64(&memstats.heap_live))
-	if heapDistance <= 0 {
+	heapRemaining := heapGoal - int64(live)
+	if heapRemaining <= 0 {
 		// This shouldn't happen, but if it does, avoid
 		// dividing by zero or setting the assist negative.
-		heapDistance = 1
+		heapRemaining = 1
 	}
 
 	// Compute the mutator assist ratio so by the time the mutator
 	// allocates the remaining heap bytes up to next_gc, it will
 	// have done (or stolen) the remaining amount of scan work.
-	c.assistWorkPerByte = float64(scanWorkExpected) / float64(heapDistance)
-	c.assistBytesPerWork = float64(heapDistance) / float64(scanWorkExpected)
+	c.assistWorkPerByte = float64(scanWorkRemaining) / float64(heapRemaining)
+	c.assistBytesPerWork = float64(heapRemaining) / float64(scanWorkRemaining)
 }
 
 // endCycle computes the trigger ratio for the next cycle.
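To make the effect of the soft regime concrete, the following sketch redoes the revise arithmetic with made-up numbers (a 100 MB soft goal, 80 MB live, 60 MB scannable, no scan work done yet). The values are illustrative stand-ins for the memstats fields, not measurements from the benchmarks above.

package main

import "fmt"

func main() {
	const (
		softGoal  = 100 << 20 // stand-in for memstats.next_gc
		live      = 80 << 20  // stand-in for memstats.heap_live
		heapScan  = 60 << 20  // stand-in for memstats.heap_scan
		scanWork  = 0         // scan work already done this cycle
		gcPercent = 100
	)

	// Old pacing: assume every scannable byte is live.
	oldRemaining := int64(heapScan) - scanWork
	oldRatio := float64(oldRemaining) / float64(softGoal-live)

	// New pacing (soft regime): in steady state only 100/(100+GOGC)
	// of the scannable heap is expected to be live.
	newExpected := int64(float64(heapScan) * 100 / float64(100+gcPercent))
	newRatio := float64(newExpected-scanWork) / float64(softGoal-live)

	fmt.Printf("old assistWorkPerByte: %.2f\n", oldRatio) // 3.00
	fmt.Printf("new assistWorkPerByte: %.2f\n", newRatio) // 1.50
}

At GOGC=100 the steady-state assumption roughly halves the assist ratio, which is where the near-elimination of mutator assists described in the commit message comes from; past the soft goal the code falls back to the old conservative ratio against the 10% larger hard goal.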
