runtime: assisting mark and sweep make GC uncontrollable in the critical path #21107
I've updated the title to "runtime: assisting mark and sweep make GC uncontrollable in the critical path"; anyone, please correct it if needed. I'll also mark this for Go 1.10 so that we can take a look at it when the tree opens.
@zhaozhiqianghw Can you provide a runnable test case that shows the problem? Right now I don't see a clear bug here. If you just want to discuss the GC, that should be done on the golang-dev mailing list, not in the issue tracker. Thanks.
@ianlancetaylor I think this binarytrees example (http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=go&id=4) shows the case clearly. STW1 and STW2 are really short, but the sweep stage and the assist stage occupy all of the Ps, which has the same effect on the mutator as a STW, only hidden from the statistics. So we can still say Go's GC has long STW-like pauses, mostly in the sweeping stage and sometimes in the marking stage as well. Just enable the trace on this example and the behavior is visible. I ran the example with parameter 8 on 4 cores and found periods where all four cores do GC sweep for 1ms, which means the effective "STW" is at least 1ms. I used go1.9rc1 so that the GC assist work is shown in the trace.
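For concreteness, here is a minimal sketch (my own, not the actual shootout source) of the allocation pattern that benchmark exercises: it builds and discards complete binary trees, producing a flood of small, identically-sized allocations that keeps the GC busy.

```go
package main

// node mirrors the binarytrees benchmark's tree node: two pointers, nothing else.
type node struct {
	left, right *node
}

// bottomUp allocates a complete binary tree of the given depth.
func bottomUp(depth int) *node {
	if depth <= 0 {
		return &node{}
	}
	return &node{left: bottomUp(depth - 1), right: bottomUp(depth - 1)}
}

// count walks the tree so the allocations are actually used.
func (n *node) count() int {
	if n.left == nil {
		return 1
	}
	return 1 + n.left.count() + n.right.count()
}

func main() {
	for i := 0; i < 50; i++ {
		_ = bottomUp(18).count() // build a whole tree, use it once, drop it
	}
}
```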
@zhaozhiqianghw, can you describe a specific problem you're seeing with binarytree? (Bear in mind that microbenchmarks are important only as far as they model the behavior of things people actually care about.) This may be a dup of #14951, but I can't really say without more detail.
What I want to show is the screenshot above (https://user-images.githubusercontent.com/25518559/28785759-e07e9706-7649-11e7-9623-b8cf6377fedc.PNG). After GC marking, almost all Ps "stop" to do assist sweeping. This is effectively a STW for the mutator. For a program that cares about latency, this sweeping "STW" introduces about 1ms of latency on the critical path. And we really have come across this situation; I am just using a common benchmark to show it. So, am I clear? I think this is a problem, and I want to know your opinion. @aclements
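In case it helps others reproduce screenshots like this, here is a sketch (using the standard runtime/trace API; the file name is arbitrary) of how such a trace can be captured and then opened with go tool trace, where the per-P mark assist and sweep work becomes visible:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	f, err := os.Create("trace.out") // output file name is arbitrary
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... run the allocation-heavy workload here, then inspect the
	// result with: go tool trace trace.out
}
```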
What are you seeing with 1.9 rc1? There was a corner case where sweeping could take O(n) time in 1.8.x, but in 1.9 it now takes O(log n). The corner case involves heaps with lots of objects larger than 1MB and a heap size of 50GB. Is that the case here?
Fixed with this CL: https://go-review.googlesource.com/c/38732
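For illustration only, a sketch (mine, not taken from the thread) of the heap shape that corner case involves: many objects larger than 1MB, some of which die, so free large spans repeatedly have to be found. The real corner case involves a far bigger heap (around 50GB); this only shows the shape of the pattern.

```go
package main

var sink [][]byte

func main() {
	const large = 2 << 20 // 2MB, above the 1MB threshold mentioned above
	for i := 0; i < 512; i++ {
		sink = append(sink, make([]byte, large))
		if i%2 == 0 {
			sink[i] = nil // drop half of them so large spans are freed and must be found again
		}
	}
}
```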
Actually I can see this in both Go 1.8.x and 1.9 rc1. My case is not the same as the one you mentioned, @RLH. I display the gctrace below. As I analyzed the code, after GC marking the mutators are pulled into sweeping, because lots of spans of the same kind need to be swept.
Thanks for running this on 1.9 rc. Can you confirm that this is running on an 8 HW-thread machine, probably a 4-core AMD64 box? Is anything else running on the machine? I'm trying to rule out the possibility that the OS is not making GOMAXPROCS worth of HW threads available to the Go application. Continuing on that note, is the code that demonstrates this issue the same as the one mentioned above at http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=go&id=4 ?
@zhaozhiqianghw, how are you running binarytree? When I run it without arguments, it finishes in 0.02 seconds and does no GCs. If I run it with the argument "21" suggested by the shootout, the heap is significantly larger than the 4MB shown in your gctrace and retains significantly more than the 0MB shown in your gctrace.
Let's make the details clearer. @RLH, to reduce the influence of the OS, I removed the over-subscription.
@aclements The reason is that I had only posted the gctrace of the final stage. I re-collected the gctrace from the beginning, and I think it is the same as yours. From the result above, we can still see the problem, and the sweep "STW" is sometimes even longer (2.3ms), so I think this can reduce the time slice of OS @RLH. From the image, we can also see that the problem does not occur every cycle, but we can always find it somewhere. @aclements I'm pretty sure you can reproduce it, no matter whether you run with 18 or 21. Besides, the current credit system in the GC can also introduce uncontrollable latency; we ran into this in Go 1.8.0. The reason I ran this benchmark is that Go 1.9 makes the GC assist work visible. From the result above, during GC, alongside the dedicated marking Ps, assist marking introduces uncontrollable latency into the critical path of the mutator. For a real-time program, leaving 25% of CPU time for GC is perfectly acceptable, but injecting uncontrollable GC assist marking and assist sweeping into the mutators has a big effect.
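Since the credit system keeps coming up, here is a simplified, self-contained sketch (not the runtime's actual code or data structures; the names and the ratio are made up) of the assist-credit idea: while a GC cycle is running, an allocating goroutine must first pay off scan "debt" proportional to the bytes it allocates by doing mark work itself, and that pay-off is the latency the mutator sees.

```go
package main

import "fmt"

// goroutineState stands in for the per-goroutine bookkeeping; the real
// runtime tracks a signed assist credit, this sketch only tracks debt.
type goroutineState struct {
	assistDebt int64 // bytes of scan work owed before allocating
}

const assistRatio = 0.5 // assumed: scan work owed per byte allocated

var gcRunning = true

// scanSomeObjects stands in for a batch of real mark work and reports how
// much scan debt it paid off.
func scanSomeObjects() int64 { return 4096 }

// allocWithAssist models the mechanism: during a GC cycle, an allocating
// goroutine first pays off its scan debt by doing mark work itself. The
// time spent in this loop is the assist latency the mutator sees.
func allocWithAssist(g *goroutineState, size int64) {
	if gcRunning {
		g.assistDebt += int64(float64(size) * assistRatio)
		for g.assistDebt > 0 {
			g.assistDebt -= scanSomeObjects()
		}
	}
	// ... the actual allocation would happen here ...
}

func main() {
	g := &goroutineState{}
	allocWithAssist(g, 1<<20)
	fmt.Println("allocation proceeded after paying its assist debt")
}
```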
Thanks for the additional details @zhaozhiqianghw. I can now confirm that I can reproduce similar behavior.
I did the same and found that the original over-subscription had a significant effect on the trace because the OS was much more likely to context-switch at an unfortunate time. But this is just a tracing artifact, not a real problem, so we should focus on the version without over-subscription.
I still don't see any issues in the GC trace you posted. Am I missing something?
I'm not sure what you mean by "this can reduce the time slice of OS". However, you're right that this simultaneous sweep is an issue. When I reproduce this locally, I can see from the event details that these are all very large sweeps that reclaim very little. This is issue #18155. Thanks for finding another reproducer. :) That said, this only happens because this benchmark has an extremely specific allocation pattern that I think is unlikely to happen outside a micro-benchmark. This long sweep is because the entire heap is packed with identically-sized objects that are almost all still live, so the sweeper has to walk a large amount of memory before it finds any space to reclaim.
The marking you've highlighted is 0.23ms, which is well within the realm of noise from other parts of the system, like the OS. I do think we should reduce this (#14951), but it's not at all obvious to me that 0.23ms is a problem.
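To make the long-simultaneous-sweep pattern described above concrete, here is a simplified, self-contained sketch (not the runtime's real sweeper; the types and numbers are invented): right after marking finishes, an allocating goroutine may have to sweep span after span looking for a free slot, and if almost everything is still live it sweeps a lot of memory for very little usable space.

```go
package main

import "fmt"

// span models a run of fixed-size slots; live is how many survived marking.
type span struct {
	slots, live int
}

// sweep reclaims the dead slots in one span and returns how many were freed.
func sweep(s *span) int { return s.slots - s.live }

// allocWithSweepAssist models a mutator that, right after the mark phase,
// must sweep spans until it finds a free slot for its allocation.
func allocWithSweepAssist(heap []*span) (swept int, ok bool) {
	for _, s := range heap {
		swept++
		if sweep(s) > 0 {
			return swept, true // the time spent sweeping is mutator latency
		}
	}
	return swept, false
}

func main() {
	// A heap flooded with nearly full, identically-sized spans, like the
	// pattern this benchmark produces.
	heap := make([]*span, 1000)
	for i := range heap {
		heap[i] = &span{slots: 64, live: 64}
	}
	heap[len(heap)-1].live = 63 // only the very last span has a free slot
	swept, _ := allocWithSweepAssist(heap)
	fmt.Println("spans swept before one allocation could proceed:", swept)
}
```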
I think the effect of the assist sweeping stage is quite significant. And about your question:
Yes, I observed this in my application and found long sweeps in the mutator. But it's hard to show my application, and I found that @rhysh also reported this problem, so I used this micro-benchmark to show it to you. Do you have any plan to reduce the influence of this simultaneous sweeping and of the long sweeps in the mutator?
@aclements Let me try to clarify; this is the problem I mentioned to you in my July 13 email. If you look at @zhaozhiqianghw's MARK screenshot carefully, even though each time the mutator only does 230us worth of marking, the same mutator has to do this repeatedly, a large number of times, because it performs a large number of allocations. The effect of this is that application requests see a 50x latency increase when they happen to run within a GC cycle. From the application's perspective, it might as well have been a 12ms STW pause. We account for the dedicated GC goroutine(s) by using more cores than are necessary for the application workload, but the mutators still must do excessive marking during a GC cycle. Outside of a GC cycle, the request latencies are much more reasonable, except for the simultaneous sweeping just after a GC cycle. Again, when this happens, there is an effective pause in the application, and the latencies of all requests during that time suffer as a result. Our application requires very consistent latency, and we would very much appreciate any suggestion from you on reducing or eliminating such wild latency fluctuations.
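One way to confirm that the latency spikes line up with GC cycles, as described above, is to record the GC count around each request. A sketch (the function names and the 1ms threshold are my own) using runtime/debug.ReadGCStats:

```go
package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

// timedRequest runs one request and reports whether it overlapped a GC cycle.
func timedRequest(work func()) {
	var before, after debug.GCStats
	debug.ReadGCStats(&before)
	start := time.Now()
	work()
	elapsed := time.Since(start)
	debug.ReadGCStats(&after)
	if elapsed > time.Millisecond {
		fmt.Printf("slow request: %v (overlapped %d GC cycle(s))\n",
			elapsed, after.NumGC-before.NumGC)
	}
}

func main() {
	timedRequest(func() {
		// stand-in for request work that allocates on the critical path
		buf := make([][]byte, 0, 1<<15)
		for i := 0; i < 1<<15; i++ {
			buf = append(buf, make([]byte, 512))
		}
		_ = buf
	})
}
```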
I've reported similar behavior, but I haven't seen it in a production application. (I filed #18155 for someone's blog post about a micro-benchmark.) It would certainly be good to handle worst-case scenarios like this one—demonstrating that it happens in a production application would likely increase the priority of a fix. Are the execution traces and gctrace lines in this issue all for the binarytree benchmark, or are some from the (likely closed-source) application where you first encountered this issue?

The execution traces in this issue show two different GC-related contributions to application latency: some in the mark phase, some in the sweep phase. I'm going to lay out my current understanding of the problems so we're all on the same page. (I don't know that it'll be the right page on the first try—@aclements and others will hopefully correct it if I'm mistaken.)

First, the mark phase. The way mark assists are currently tuned means that the application code (mutator) ends up doing around the same amount of work as the dedicated GC worker goroutines. You can see this in the gctrace lines. The "gc 22" line ends with "193->223->100 MB, 274 MB goal" and the "gc 23" line ends with "151->175->100 MB, 200 MB goal". At the end of gc 22, there's 100 MB of live heap memory. The default setting of GOGC means that the memory usage is allowed to double before the end of the next GC. This becomes the "200 MB goal" for gc 23. At the end of gc 22, the GC "pacer" sets the goal of 200 MB, and based on the allocation rate of the program and a bunch of other inputs I don't yet understand, calculates that it will need to begin the 23rd GC cycle when the heap size is 151 MB. However, the live heap size at the end of that cycle remains 100 MB. Since the mark process only involves inspecting live memory, it happened twice as fast as the worst case. If all of the heap were still reachable, the mark phase would have taken longer, so the heap would have grown all the way to 200 MB instead of just to 175 MB. That's exactly the kind of equilibrium the current pacer algorithm is designed to find.

One thing that means is that the pacer is finishing the GC cycle early. The other, shown in the "0.12+168/139/0+0.21 ms cpu" portion of the gctrace lines, is that the background workers are only doing around half of the work (139+0 ms on that line), with the other half coming from mutator assists (168 ms on that line). Issue #14951 gets pretty deep pretty fast, but is there to track progress on (among other things) improving the pacer algorithm so it doesn't slow down applications too much via assists, when the background workers could do the work instead. Ideally, I think, mark assists would only activate when a program's behavior changed, and would rarely occur for applications that were maintaining a steady heap size and allocation rate/pattern. I don't know of a general workaround for this source of mutator latency, beyond "don't make large allocations in the program's critical path".

Next, delays during the sweep phase. After the mark phase completes, the runtime prefers to hand out memory allocations from newly-reclaimed memory rather than getting new regions from the OS. Although it's clear from looking at a piece of memory whether it's available for reuse (the mark bit will not be set), it's not clear where these pieces of memory might be found. The work to find them is called "sweep". In order to allocate memory right after the mark phase finishes, a goroutine needs two things. First, if not much sweep work has been done, the goroutine may need to do some sweep work to build up credit to offset the allocation it's about to make. Second, there has to be a piece of memory available to hold the allocation.

From the discussion in #18155, it sounds like this isn't a problem with the credit so much as actually finding the memory to use for the allocation. Simple "microbenchmarks" like those discussed in this issue and in #18155 can flood large sections of memory with identically-sized allocations that all remain live, making it hard to find a free piece of memory to use for new allocations. I haven't seen this problem in production applications, and it seems to require a memory allocation pattern so consistent and so simple that I don't (yet) know how it could happen outside of a microbenchmark. It would be very interesting to learn of other sources of this issue, if they exist. Without seeing it in a production application, I don't know how to offer a workaround for that case.
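For the GOGC arithmetic walked through above, a tiny illustration (my own, not runtime code): the heap goal for the next cycle is the live heap at the end of the previous cycle grown by GOGC percent, so 100 MB live with the default GOGC=100 gives the "200 MB goal" seen on the gc 23 line.

```go
package main

import "fmt"

// heapGoal returns the next cycle's heap goal given the live heap at the
// end of the previous cycle, per the relationship described above.
func heapGoal(liveMB, gogc int) int {
	return liveMB + liveMB*gogc/100
}

func main() {
	fmt.Println(heapGoal(100, 100), "MB goal") // 100 MB live, GOGC=100 -> 200 MB goal
}
```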
Change https://golang.org/cl/60790 mentions this issue: |
@rhysh, thanks for the detailed analysis, as always. :) I verified that the mark phase issues are now fixed on master (to be released in Go 1.10). The sweep phase issues aren't, but as @rhysh said, we've only ever observed these in microbenchmarks with incredibly simple allocation and retention patterns. If people are observing this in real applications, that would be very useful to know. At any rate, this part is a dup of #18155. Thanks for all the traces and analyses!
@aclements Could you please point me to the commits for the mark phase fix? I can only find this commit: f10d99f. I really care about this topic.
@zhaozhiqianghw: there were two major fixes, plus one dependency:
Change https://golang.org/cl/138076 mentions this issue: |
This adds an endpoint to the trace tool that plots the minimum mutator utilization curve using information on mark assists and GC pauses from the trace. This commit implements a fairly straightforward O(nm) algorithm for computing the MMU (and tests against an even more direct but slower algorithm). Future commits will extend and optimize this algorithm. This should be useful for debugging and understanding mutator utilization issues like #14951, #14812, #18155, #18534, #21107, particularly once follow-up CLs add trace cross-referencing.

Change-Id: Ic2866869e7da1e6c56ba3e809abbcb2eb9c4923a
Reviewed-on: https://go-review.googlesource.com/c/60790
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Hyang-Ah Hana Kim <[email protected]>
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)? Go 1.8
@rhysh gave a talk (https://github.com/gophercon/2017-talks/blob/master/RhysHiltner-AnIntroductionToGoToolTrace/Rhys-Hiltner-go-tool-trace-GopherCon-2017.pdf) at GopherCon 2017 about using the trace tool to show the influence of assist marking and assist sweeping. These two stages can cause a long stop in any mutator goroutine. Our team has also run into this bad effect. Although these mechanisms are meant to throttle the mutators and keep the heap from growing too fast, these strategies really make the GC uncontrollable on the critical path.
So does Go have any plan to improve this? @aclements
/cc @rhysh @aclements