runtime: GC causes latency spikes with single processor #18534
@RLH @aclements @dvyukov Is this on your radar for go1.9? Have you had a chance to try the example code yet? I tried the patches that Austin posted for #14812, but those did not resolve the issue in the example that I posted. I also tried playing around with more scheduler options in the runtime, but I'm not familiar with the GC and scheduler and none of the basic "reschedule now" changes that I made seemed to help.
I think you should be able to get closer by setting …
On Linux, the default NumCPU and GOMAXPROCS are determined with sched_getaffinity, so I think the taskset command does emulate a one-CPU system.
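For what it's worth, this is easy to confirm with a couple of lines (a minimal sketch; the binary name is arbitrary), run as taskset -c 0 ./cpucheck:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println("NumCPU:", runtime.NumCPU())          // CPUs in this process's affinity mask on Linux
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // current value; passing 0 only queries it
}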
I am also able to reproduce this issue on darwin amd64 with go1.8 and GOMAXPROCS=1.
The scheduler/GC seems to need those extra cores to keep latency down.
Some more data points (the full measurements are quoted in the reply below):

My understanding is that setting GOMAXPROCS will force the runtime to create more threads. Even when those threads are running on the same core and have to be scheduled by the OS (Linux in this case), that seems to lead to lower latency than just letting the Go runtime schedule a single thread. Here are some measurements without taskset, but with different values for GOMAXPROCS.

It's interesting that there is a consistent change in the 99% latency between GOMAXPROCS=4 and =5. I suspect that is an artifact of some decision within the runtime about the number of idle GC workers, but I have not looked at a trace. A similar-looking change can also be seen in the taskset samples when going from GOMAXPROCS=4 to 5.

To me, these samples point to a deficiency in the Go runtime's handling of scheduling/preemption during GC. That issue is probably hidden on systems with more than 4 cores, which many server systems have these days.
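The percentile tables in this thread can be produced from the recorded deviations with something along these lines (a sketch; the actual reproducer may compute them differently):

package main

import (
    "fmt"
    "sort"
    "time"
)

// printPercentiles sorts the measured tick deviations and prints the requested quantiles.
func printPercentiles(samples []time.Duration, quantiles []float64) {
    sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
    for _, q := range quantiles {
        idx := int(q * float64(len(samples)-1))
        fmt.Printf("%5.1f%% %v\n", q*100, samples[idx])
    }
}

func main() {
    // Hypothetical deviations standing in for real measurements.
    samples := []time.Duration{-2 * time.Millisecond, -5 * time.Microsecond, 0, 3 * time.Microsecond, 40 * time.Millisecond}
    printPercentiles(samples, []float64{0, 0.01, 0.05, 0.5, 0.95, 0.99, 1})
}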
The delays do not correlate with the GCs. I lowered the frequency of the GCs by setting GOGC=4000, thus trading RAM for fewer GCs. I also printed when the Tick was delayed so we could correlate with the gctrace time indicating when the GC ran. The resulting output is below and shows that the GC is not running when the tick delays are detected. Furthermore, the gctrace seems to indicate that STW portions of the GC are well below 1 millisecond.

My guess is that you have 11 goroutines running and occasionally the scheduler doesn't get around to the goroutine that services the tick.

This is on tip.
rlh@rlh:~/work/code/bugs/latencyBug/src$ GOGC=4000 GODEBUG=gctrace=1 GOMAXPROCS=1 ./latencyBug
gc 1 @2.147s 0%: 0.011+77+0.009 ms clock, 0.011+0.46/20/21+0.009 ms cpu, 160->171->50 MB, 3498 MB goal, 1 P
Higher latency @8.096342913s than expected at iteration 79. Tick was delayed 34.711117ms
gc 2 @17.000s 0%: 0.047+121+0.010 ms clock, 0.047+5.6/32/28+0.010 ms cpu, 1078->1084->45 MB, 2068 MB goal, 1 P
gc 3 @37.513s 0%: 0.011+86+0.004 ms clock, 0.011+18/25/12+0.004 ms cpu, 1410->1417->42 MB, 1858 MB goal, 1 P
Higher latency @45.28716699s than expected at iteration 451. Tick was delayed 25.805553ms
gc 4 @56.186s 0%: 0.015+91+0.010 ms clock, 0.015+43/22/0+0.010 ms cpu, 1534->1540->46 MB, 1752 MB goal, 1 P
Higher latency @57.60534838s than expected at iteration 574. Tick was delayed 23.092542ms
Higher latency @1m1.582175184s than expected at iteration 614. Tick was delayed 20.844583ms
Higher latency @1m2.407811598s than expected at iteration 622. Tick was delayed 21.701094ms
gc 5 @78.309s 0%: 0.008+67+0.007 ms clock, 0.008+33/20/2.2+0.007 ms cpu, 1773->1776->44 MB, 1899 MB goal, 1 P
0.0% -34.534406ms
1.0% -17.807413ms
5.0% -11.569303ms
50.0% -1.282µs
95.0% 11.480867ms
99.0% 17.951876ms
100.0% 34.711117ms
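Setting GOGC=4000 as above can also be done from inside the program, which is convenient when re-running the reproducer (a sketch using runtime/debug.SetGCPercent, equivalent to the environment variable):

package main

import "runtime/debug"

func main() {
    // Equivalent to GOGC=4000: let the heap grow by ~40x the live set
    // before the next collection, trading RAM for fewer GC cycles.
    prev := debug.SetGCPercent(4000)
    _ = prev // previous setting, normally 100
    // ... run the workload ...
}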
…On Fri, May 19, 2017 at 1:48 AM, tarm ***@***.***> wrote:
Some more data points:
$ taskset -c 0 ./issue18534
0.0% -70.533562ms
1.0% -52.267319ms
5.0% -35.006899ms
50.0% -2.676µs
95.0% 36.313112ms
99.0% 57.591539ms
100.0% 86.679896ms
GOMAXPROCS=2 taskset -c 0 ./issue18534
0.0% -61.117609ms
1.0% -31.714017ms
5.0% -21.586722ms
50.0% 949ns
95.0% 20.895062ms
99.0% 33.908325ms
100.0% 62.686614ms
$ GOMAXPROCS=4 taskset -c 0 ./issue18534
0.0% -33.988405ms
1.0% -26.603924ms
5.0% -11.102534ms
50.0% 1.049µs
95.0% 10.910585ms
99.0% 23.701329ms
100.0% 34.930726ms
$ GOMAXPROCS=5 taskset -c 0 ./issue18534
0.0% -41.173521ms
1.0% -15.659458ms
5.0% -7.67135ms
50.0% 70ns
95.0% 7.62678ms
99.0% 15.607855ms
100.0% 41.169155ms
$ GOMAXPROCS=8 taskset -c 0 ./issue18534
0.0% -17.46882ms
1.0% -10.194657ms
5.0% -5.408951ms
50.0% -315ns
95.0% 5.410479ms
99.0% 10.139459ms
100.0% 16.585539ms
$ GOMAXPROCS=16 taskset -c 0 ./issue18534
0.0% -21.709796ms
1.0% -8.175178ms
5.0% -4.094512ms
50.0% -173ns
95.0% 4.09989ms
99.0% 8.111605ms
100.0% 21.653956ms
(top indicates that taskset is constraining the process to 1 core as
expected.)
My understanding is that setting GOMAXPROCS will force the runtime to
create more threads. Even when those threads are running on the same core
and have to be scheduled by the OS (linux in this case), that seems to lead
to lower latency than just letting the Go runtime schedule a single thread.
Here are some measurements without taskset, but with different values for
GOMAXPROCS:
$ ./issue18534 # 40 logical cores, GOMAXPROCS set by default
0.0% -8.913432ms
1.0% -103.472µs
5.0% -51.566µs
50.0% -56ns
95.0% 50.329µs
99.0% 107.945µs
100.0% 8.917999ms
$ GOMAXPROCS=4 ./issue18534
0.0% -16.824243ms
1.0% -2.714634ms
5.0% -80.119µs
50.0% -358ns
95.0% 82.118µs
99.0% 2.521557ms
100.0% 16.79671ms
$ GOMAXPROCS=5 ./issue18534
0.0% -11.703387ms
1.0% -205.105µs
5.0% -72.006µs
50.0% -1.152µs
95.0% 72.07µs
99.0% 181.71µs
100.0% 11.744029ms
$ GOMAXPROCS=4 ./issue18534
0.0% -13.447356ms
1.0% -5.392172ms
5.0% -89.006µs
50.0% -401ns
95.0% 90.569µs
99.0% 5.325731ms
100.0% 13.443492ms
$ GOMAXPROCS=5 ./issue18534
0.0% -10.005695ms
1.0% -205.612µs
5.0% -68.098µs
50.0% 241ns
95.0% 64.149µs
99.0% 186.699µs
100.0% 10.021666ms
$ GOMAXPROCS=8 ./issue18534
0.0% -180.467µs
1.0% -106.693µs
5.0% -56.36µs
50.0% -228ns
95.0% 56.368µs
99.0% 91.327µs
100.0% 213.66µs
It's interesting that there is a consistent change in the 99% latency
between GOMAXPROCS=4 and =5. I suspect that is an artifact of some decision
within the runtime about the number of idle GC workers, but have not looked
at a trace. A similar looking change can also be seen in the taskset
samples when going from GOMAXPROCS=4->5.
To me, these samples point to a deficiency in the Go runtime's handling of
scheduling/preemption during GC. That issue is probably hidden on systems
with more than 4 cores, which many server systems have these days.
I re-ran your reproducer with current master, which annotates mark assists in the execution trace, and added a …

This does appear to be caused by GC and scheduling, though several things have to go wrong simultaneously. Here's the new trace I'm analyzing with tip. The worst delay is a 57ms delay that gets reported at 1,571.288 ms into the trace.

First, a fractional mark worker starts at 1,473.412 ms. Since there's just one P, this necessarily delays all other work. This is supposed to run for at most 10ms, but for reasons I don't fully understand it runs for 20ms. This delays the timer goroutine, so six wakeups all happen at 1,493.612 ms when the fractional worker finishes. It so happens these are all childLoops that happened to randomly pick short sleeps. These childLoops each run for ~20ms, partly because they're doing a lot of assist work (#14951) and partly because they take a long time to run even when not assisting. Again, I'm not sure why this is 20ms instead of 10ms. Together, these consume the next 60ms of CPU time. The mainLoop wakeup is supposed to happen fairly early in this 60ms window. Because these are saturating the CPU, again, the timer goroutine doesn't get to run, so another three timer wakeups pile up until 1,557.327 ms. One of these (it happens to be the last one serviced) is mainLoop.

I need to dig in to why all of these preemptions are happening at 20ms. My guess is that sysmon has fallen back to a 10ms tick and has to observe the same goroutine running for two ticks in a row before preempting, which seems a bit bogus. I suspect fixing that and fixing the over-assist problem (#14951) would fix this issue.
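For anyone who wants to reproduce this kind of analysis, the execution trace can be captured with runtime/trace and inspected with go tool trace (a minimal sketch; the file name and runWorkload are placeholders):

package main

import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("issue18534.trace")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil { // begin writing the execution trace
        log.Fatal(err)
    }
    defer trace.Stop()

    runWorkload() // the reproducer's mainLoop/childLoops would run here
}

func runWorkload() {}

Then view it with: go tool trace issue18534.trace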
Change https://golang.org/cl/59970 mentions this issue: |
Change https://golang.org/cl/59971 mentions this issue: |
@aclements Thanks for your work on this! I cherry-picked your 2 patches onto tip and ran our actual application (not just the test reproduction case in this ticket).

Go v1.9.0:

Go tip (d77d4f5) + patches applied:

The 99.9% latencies moved from about 10ms -> 6ms, but the worst-case latencies (out of 10k samples) went from ~120ms -> ~15ms. That is a huge improvement in the worst-case mutator latencies and would be enough to resolve this issue for us. The PRs are still marked as DO NOT SUBMIT until benchmarking is complete. I am eager to see them merged, so let me know if there is more I can help with to move them along. (My real application does not create that much garbage, and so is not useful for benchmarking odd GC assist interactions.)
@tarm, that's great! Could you also check what effect the patches have on heap size and throughput? You're running with GOMAXPROCS=1, right? There's still an issue with fractional workers running for a long time without preemption, which I suspect is responsible for the 15ms worst case you saw. I just filed an issue focused on that problem (#21698).
For the recent benchmarking, I have been running with …

This application is a tick-based application instead of a run-to-completion type application, so throughput is more difficult to measure. Over about 12 hours of running, the accumulated CPU time (as reported through …

The application avoids allocation and is not a heavy user of RAM, so it is also not a great stressor of the heap size, and we have not specifically instrumented for heap usage. That said, the memory usage reported through …
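If instrumenting the application ever becomes worthwhile, heap statistics can be sampled with runtime.ReadMemStats (a sketch; GODEBUG=gctrace=1 reports similar numbers without code changes):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    var m runtime.MemStats
    for range time.Tick(10 * time.Second) {
        runtime.ReadMemStats(&m) // note: this briefly stops the world
        fmt.Printf("HeapAlloc=%dMB HeapSys=%dMB NumGC=%d\n",
            m.HeapAlloc>>20, m.HeapSys>>20, m.NumGC)
    }
}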
Change https://golang.org/cl/60790 mentions this issue: |
Change https://golang.org/cl/68573 mentions this issue: |
Currently fractional workers run until preempted by the scheduler, which means they typically run for 20ms. During this time, all other goroutines on that P are blocked, which can introduce significant latency variance.

This modifies fractional workers to self-preempt shortly after achieving the fractional utilization goal. In practice this means they preempt much sooner, and the scale of their preemption is on the order of how often the user goroutines block (so, if the application is compute-bound, the fractional workers will also run for long times, but if the application blocks frequently, the fractional workers will also preempt quickly).

Fixes #21698.
Updates #18534.

Change-Id: I03a5ab195dae93154a46c32083c4bb52415d2017
Reviewed-on: https://go-review.googlesource.com/68573
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Rick Hudson <[email protected]>
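To make the policy concrete, here is a toy illustration of a worker that yields once it has met a fractional utilization goal. This is not the runtime's code; the structure and names are invented, and the real implementation lives inside the scheduler:

package main

import (
    "runtime"
    "time"
)

// fractionalWorker does small increments of work and hands its P back to
// user goroutines as soon as it has met its utilization goal for the
// current accounting window.
func fractionalWorker(goal float64, window time.Duration, work func()) {
    windowStart := time.Now()
    var used time.Duration
    for {
        start := time.Now()
        work() // one small increment of background work
        used += time.Since(start)

        if elapsed := time.Since(windowStart); elapsed >= window {
            windowStart, used = time.Now(), 0 // start a new accounting window
        } else if float64(used) >= goal*float64(elapsed) {
            runtime.Gosched() // goal met: yield instead of monopolizing the P
        }
    }
}

func main() {
    go fractionalWorker(0.05, 10*time.Millisecond, func() { time.Sleep(50 * time.Microsecond) })
    time.Sleep(100 * time.Millisecond) // let it run briefly
}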
Currently, the background mark worker and the goal GC CPU are both fixed at 25%. The trigger controller's goal is to achieve the goal CPU usage, and with the previous commit it can actually achieve this. But this means there are *no* assists, which sounds ideal but actually causes problems for the trigger controller. Since the controller can't lower CPU usage below the background mark worker CPU, it saturates at the CPU goal and no longer gets feedback, which translates into higher variability in heap growth.

This commit fixes this by allowing assists 5% CPU beyond the 25% fixed background mark. This avoids saturating the trigger controller, since it can now get feedback from both sides of the CPU goal. This leads to low variability in both CPU usage and heap growth, at the cost of reintroducing a low rate of mark assists. We also experimented with 20% background plus 5% assist, but 25%+5% clearly performed better in benchmarks.

Updates #14951.
Updates #14812.
Updates #18534.

Combined with the previous CL, this significantly improves tail mutator utilization in the x/benchmarks garbage benchmark. On a sample trace, it increased the 99.9%ile mutator utilization at 10ms from 26% to 59%, and at 5ms from 17% to 52%. It reduced the 99.9%ile zero utilization window from 2ms to 700µs. It also helps the mean mutator utilization: it increased the 10s mutator utilization from 83% to 94%. The minimum mutator utilization is also somewhat improved, though there is still some unknown artifact that causes a minuscule fraction of mutator assists to take 5-10ms (in fact, there was exactly one 10ms mutator assist in my sample trace).

This has no significant effect on the throughput of the github.com/dr2chase/bent benchmarks-50.

This has little effect on the go1 benchmarks (and the slight overall improvement makes up for the slight overall slowdown from the previous commit):

name                      old time/op  new time/op  delta
BinaryTree17-12           2.40s ± 0%   2.41s ± 1%   +0.26%  (p=0.010 n=18+19)
Fannkuch11-12             2.95s ± 0%   2.93s ± 0%   -0.62%  (p=0.000 n=18+15)
FmtFprintfEmpty-12        42.2ns ± 0%  42.3ns ± 1%  +0.37%  (p=0.001 n=15+14)
FmtFprintfString-12       67.9ns ± 2%  67.2ns ± 3%  -1.03%  (p=0.002 n=20+18)
FmtFprintfInt-12          75.6ns ± 3%  76.8ns ± 2%  +1.59%  (p=0.000 n=19+17)
FmtFprintfIntInt-12       123ns ± 1%   124ns ± 1%   +0.77%  (p=0.000 n=17+14)
FmtFprintfPrefixedInt-12  148ns ± 1%   150ns ± 1%   +1.28%  (p=0.000 n=20+20)
FmtFprintfFloat-12        212ns ± 0%   211ns ± 1%   -0.67%  (p=0.000 n=16+17)
FmtManyArgs-12            499ns ± 1%   500ns ± 0%   +0.23%  (p=0.004 n=19+16)
GobDecode-12              6.49ms ± 1%  6.51ms ± 1%  +0.32%  (p=0.008 n=19+19)
GobEncode-12              5.47ms ± 0%  5.43ms ± 1%  -0.68%  (p=0.000 n=19+20)
Gzip-12                   220ms ± 1%   216ms ± 1%   -1.66%  (p=0.000 n=20+19)
Gunzip-12                 38.8ms ± 0%  38.5ms ± 0%  -0.80%  (p=0.000 n=19+20)
HTTPClientServer-12       78.5µs ± 1%  78.1µs ± 1%  -0.53%  (p=0.008 n=20+19)
JSONEncode-12             12.2ms ± 0%  11.9ms ± 0%  -2.38%  (p=0.000 n=17+19)
JSONDecode-12             52.3ms ± 0%  53.3ms ± 0%  +1.84%  (p=0.000 n=19+20)
Mandelbrot200-12          3.69ms ± 0%  3.69ms ± 0%  -0.19%  (p=0.000 n=19+19)
GoParse-12                3.17ms ± 1%  3.19ms ± 1%  +0.61%  (p=0.000 n=20+20)
RegexpMatchEasy0_32-12    73.7ns ± 0%  73.2ns ± 1%  -0.66%  (p=0.000 n=17+20)
RegexpMatchEasy0_1K-12    238ns ± 0%   239ns ± 0%   +0.32%  (p=0.000 n=17+16)
RegexpMatchEasy1_32-12    69.1ns ± 1%  69.2ns ± 1%    ~     (p=0.669 n=19+13)
RegexpMatchEasy1_1K-12    365ns ± 1%   367ns ± 1%   +0.49%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12   104ns ± 1%   105ns ± 1%   +1.33%  (p=0.000 n=16+20)
RegexpMatchMedium_1K-12   33.6µs ± 3%  34.1µs ± 4%  +1.67%  (p=0.001 n=20+20)
RegexpMatchHard_32-12     1.67µs ± 1%  1.62µs ± 1%  -2.78%  (p=0.000 n=18+17)
RegexpMatchHard_1K-12     50.3µs ± 2%  48.7µs ± 1%  -3.09%  (p=0.000 n=19+18)
Revcomp-12                384ms ± 0%   386ms ± 0%   +0.59%  (p=0.000 n=19+19)
Template-12               61.1ms ± 1%  60.5ms ± 1%  -1.02%  (p=0.000 n=19+20)
TimeParse-12              307ns ± 0%   303ns ± 1%   -1.23%  (p=0.000 n=19+15)
TimeFormat-12             323ns ± 0%   323ns ± 0%   -0.12%  (p=0.011 n=15+20)
[Geo mean]                47.1µs       47.0µs       -0.20%

https://perf.golang.org/search?q=upload:20171030.4

It slightly improves the performance of the x/benchmarks:

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.29ms ± 3%  2.22ms ± 2%  -2.97%  (p=0.000 n=18+18)
Garbage/benchmem-MB=64-12    2.24ms ± 2%  2.21ms ± 2%  -1.64%  (p=0.000 n=18+18)
HTTP-12                      12.6µs ± 1%  12.6µs ± 1%    ~     (p=0.690 n=19+17)
JSON-12                      11.3ms ± 2%  11.3ms ± 1%    ~     (p=0.163 n=17+18)

and fixes some of the heap size bloat caused by the previous commit:

name                         old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12  1.88G ± 2%          1.77G ± 2%          -5.52%  (p=0.000 n=20+18)
Garbage/benchmem-MB=64-12    248M ± 8%           226M ± 5%           -8.93%  (p=0.000 n=20+20)
HTTP-12                      47.0M ±27%          47.2M ±12%            ~     (p=0.512 n=20+20)
JSON-12                      206M ±11%           206M ±10%             ~     (p=0.841 n=20+20)

https://perf.golang.org/search?q=upload:20171030.5

Combined with the change to add a soft goal in the previous commit, this achieves a decent performance improvement on the garbage benchmark:

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.40ms ± 4%  2.22ms ± 2%  -7.40%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12    2.23ms ± 1%  2.21ms ± 2%  -1.06%  (p=0.000 n=19+18)
HTTP-12                      12.5µs ± 1%  12.6µs ± 1%    ~     (p=0.330 n=20+17)
JSON-12                      11.1ms ± 1%  11.3ms ± 1%  +1.87%  (p=0.000 n=16+18)

https://perf.golang.org/search?q=upload:20171030.6

Change-Id: If04ddb57e1e58ef2fb9eec54c290eb4ae4bea121
Reviewed-on: https://go-review.googlesource.com/59971
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Rick Hudson <[email protected]>
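The resulting CPU budget works out roughly as in the sketch below; the constant names are illustrative rather than the runtime's, and the dedicated/fractional split is approximate:

package main

import "fmt"

const (
    backgroundUtil = 0.25 // fixed background mark workers
    assistUtil     = 0.05 // extra head-room allowed for mutator assists
    goalUtil       = backgroundUtil + assistUtil
)

func main() {
    for _, procs := range []int{1, 4, 12} {
        dedicated := int(backgroundUtil * float64(procs)) // whole Ps running dedicated workers
        fmt.Printf("GOMAXPROCS=%-2d CPU goal=%.0f%%, dedicated mark workers=%d (remainder covered by a fractional worker)\n",
            procs, goalUtil*100, dedicated)
    }
}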
(This may be the same as #14812, but the discussion for that issue has veered towards sync.Pool and there are many processors available in the traces people have posted to that issue.)
What version of Go are you using (go version)?

go1.7.4. Testing with the latest master branch does not make a difference for my real application.

What operating system and processor architecture are you using (go env)?

Linux on an x64 single-core 1.4GHz Atom processor
What did you do?
It does not run on the playground (it takes too long and cannot write a trace file), but here's a minimal test case that reproduces the issue:
https://play.golang.org/p/aJmmumzKpD
(original example that was slightly more complicated: https://play.golang.org/p/_tQTfBR33g)
Specifically, that reproduces the approximate amount of memory, GC frequency, and GC irregularity that I see in our application. Concretely, this reproduction test case has GCs that collect 50-100MB but also sometimes cause application delays with long tails on the order of 100ms.
To mimic the single core processor from my laptop, I was running the test with CPU affinity like this: "GODEBUG=gcpacertrace=1,gctrace=1 taskset -c 0 ./main"
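The linked program is not reproduced here, but its shape is roughly the following (a simplified sketch based on the goroutine names mentioned later in this thread; allocation sizes and sleep ranges are made up):

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// childLoop continuously allocates short-lived garbage with small random pauses.
func childLoop() {
    for {
        _ = make([]byte, 1<<20) // throwaway allocation
        time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
    }
}

// mainLoop wakes on a 100ms tick and reports how late each tick fired.
func mainLoop() {
    const period = 100 * time.Millisecond
    expected := time.Now().Add(period)
    for range time.Tick(period) {
        if d := time.Since(expected); d > 20*time.Millisecond {
            fmt.Printf("tick delayed by %v\n", d)
        }
        expected = expected.Add(period)
    }
}

func main() {
    for i := 0; i < 10; i++ {
        go childLoop()
    }
    mainLoop()
}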
What did you expect to see?
The main goroutine is driven off of a 100ms tick. Ideally that tick would arrive every 100ms, and the main goroutine would run immediately after the tick and be done.
What did you see instead?
Sometimes the tick gets delayed and sometimes the main goroutine gets delayed even after the tick fires. Measuring the deviation away from the expected 100ms interval, I see a latency distribution like this:
The mean is as expected, but the 75ms deviation is much higher than we want to see. (The negative values are the ticker catching up after previous delays).
The gctrace output before the 75ms delay looks like this:
My lay-person reading of the trace is that the "child" garbage-generating goroutines are asked to assist with the GC, but then they end up blocking execution of both the runtime.timerproc goroutine and the main goroutine.
This seems like a scheduler issue: goroutines that generate minimal garbage should be prioritized over goroutines assisting with the GC.
gctrace=1,gcpacertrace=1 output

runtime trace

The 74ms delay occurs around 41 seconds into the trace. The tick should have fired at 41003ms but did not fire until 41079ms, and the G1 goroutine did not run until 41140ms. The complete trace is split up into multiple chunks, but this GC can be found 12.6s into its chunk.