-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Description
I've been seeing the sync tests timeout quite a lot recently (on my 8 and 18 core linux/s390x VMs) and the cause appears to be TestWaitGroupMisuse2. I've also recreated the problem, albeit with more difficulty, on my 2-core darwin/amd64 laptop (note that by default this test doesn't run unless you have 5+ cores).
CL 36841 (83f95b8) added some new code that uses atomic operations to try and sync up the goroutines in order to make the test more reliable (to fix #11443). Basically all 3 goroutines atomically increment the uint32 here
and then poll it until it has the value 3. Unfortunately there is nothing to stop one of the goroutines being preempted in preparation for a GC 'stop the world' phase while the others are left spinning on here
. This is a deadlock because here
will never be incremented by the preempted goroutine and the spinning goroutines aren't preemptible.
The root cause of the problem is essentially #10958 and this issue can be fixed with either GOEXPERIMENT=preemptibleloops or GOGC=off. Although even with these fixes, at least on linux/s390x, the test still often takes a long time (up to 75s) and sometimes burns through all 1e6 iterations without triggering a panic. This probably explains why I see this a lot, since the longer the test runs the more likely a badly timed GC is.
Is it worth modifying this test or should we just wait until #10958 is fixed?
Side note: while recreating this on my 2-core darwin/amd64 laptop I noticed that repeatedly running TestWaitGroupMisuse2 leaks goroutines. I think this is because it doesn't necessarily wake up the goroutine that calls Wait. Not sure if we should care about that?