runtime: GC corrupting live objects in cmd/link #11617
Here's a failure on linux/amd64, suggesting it's not OpenBSD-specific: http://build.golang.org/log/735c2d636a8cc1f5303994c593c7b5d18b789912
Here's a failure on openbsd/386 for a non-go.string symbol: http://build.golang.org/log/f57efb3a7916e4742eaac8a7288e95f65db4b83c
Here's another failure on openbsd/386 that doesn't fit the 173/200 pattern: http://build.golang.org/log/5baea92cb737e12a38f631a199b657c182def33b
CC @rsc @RLH @aclements This is an intermittent problem that is still happening periodically on the build dashboard. The values that the linker is reporting are completely bogus. I cannot think of any plausible way this could occur other than a GC problem in the compiler or linker. Or, I suppose, a file system problem. @mdempsky: are you able to reproduce this at all outside of the build dashboard? I have not been able to, but I also don't have an OpenBSD system.
@ianlancetaylor I spent a while trying to repro it on my openbsd/amd64 machine, but haven't succeeded yet. I was going to try an openbsd/386 VM next since it seems to repro there more often with the trybots.
BTW, I noticed that the code in cmd/link/internal/ld/objfile.go that decodes relocation fields converts the values it reads straight to int32/uint8.
Is it reasonable to change this code to report an error if the encoded int values are out of range for an int32/uint8 rather than silently truncating them? That might help narrow down whether the corruption is occurring in the compiler or the linker.
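Something along these lines, say (just a sketch: the reader type and rdint64 below are stand-ins for the linker's actual object-file reader, not its real API):

```go
package main

import "log"

// reader stands in for the linker's object-file reader; rdint64 is assumed
// to decode the next integer from the stream.
type reader struct {
	ints []int64
	pos  int
}

func (r *reader) rdint64() int64 {
	v := r.ints[r.pos]
	r.pos++
	return v
}

// rdint32 reads an integer and fails loudly if it does not fit in an int32,
// rather than silently truncating it the way int32(r.rdint64()) would.
func (r *reader) rdint32() int32 {
	n := r.rdint64()
	if int64(int32(n)) != n {
		log.Panicf("%d out of range for int32", n)
	}
	return int32(n)
}

// rduint8 is the analogous check for one-byte fields such as Reloc.Siz.
func (r *reader) rduint8() uint8 {
	n := r.rdint64()
	if int64(uint8(n)) != n {
		log.Panicf("%d out of range for uint8", n)
	}
	return uint8(n)
}

func main() {
	r := &reader{ints: []int64{123, 0xc820123450}}
	log.Println(r.rdint32()) // fine: 123
	log.Println(r.rduint8()) // panics: clearly corruption, not a real size
}
```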
Sounds like a good idea to me.
CL https://golang.org/cl/12050 mentions this issue.
Instead of silently truncating integers to their expected range, check that they're within range and emit errors if not. Intended to help narrow down the cause of issue #11617.

Change-Id: Ia7b577270f8438ca7479262702371e26277f1ea7
Reviewed-on: https://go-review.googlesource.com/12050
Reviewed-by: Ian Lance Taylor <[email protected]>
The same "invalid relocation" error occurred a few more times on build.golang.org over the weekend: openbsd/386 http://build.golang.org/log/e6b76c598ae8232a031d8494e724fd395ba784ef
openbsd/386 http://build.golang.org/log/8a864cdb337aad199245615a5e9e74c62f3e6424
openbsd/386 http://build.golang.org/log/f079859ada15356800d9dcba0b1b838901dbd6ed
openbsd/amd64 http://build.golang.org/log/6d93047a2121ed1c02ca23338a71a566cd48c196
No rdint panics, so it's seemingly not file corruption/desync at least. It also occurs to me now that symbols like "type..alg.runtime.bgMarkSignal" are only defined in runtime.a, which is only compiled once and reused for all builds/tests. If the corruption were in the compiler, every link should fail. So it seems like something must be going wrong in cmd/link.
Ignoring the small ones, I think these are pointers gone rogue. They're all multiples of 16. 386 and amd64 each show a distinct pattern: 386 is 0x18xxxxxx and amd64 is 0x20xxxxxx (on both openbsd and linux). Since Off is an int32, this would be only the bottom 32 bits of a 64-bit pointer, and Siz would be the next 8 bits, which gives us 0xc820xxxxxx, and that definitely looks like a 64-bit Go heap pointer.
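Just to make that byte arithmetic concrete (the address below is made up, and the layout assumed is the one described above: Off int32 at the start of Reloc, followed by Siz uint8):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// A made-up address in the usual 0xc8... Go heap range.
	p := uint64(0xc820123450)

	// Reinterpret its little-endian bytes the way a Reloc starting with
	// Off int32 (bytes 0-3) and Siz uint8 (byte 4) would see them.
	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], p)
	off := int32(binary.LittleEndian.Uint32(b[0:4]))
	siz := b[4]

	fmt.Printf("Off = %#x, Siz = %#x\n", off, siz) // Off = 0x20123450, Siz = 0xc8
}
```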
@aclements Is it still useful for me to work on repro'ing this on openbsd/386?
It would be very helpful to have a reproducible (or reasonably reproducible) case. Most convenient would be on linux, but any repro is good. I have no idea what's going on; I was merely adding a fact to our debugging database. :)
Managed two repros so far with "./run func7.go".
Unfortunately it seems exceedingly rare. Any ideas for improving the repro rate, or useful debugging to apply once corruption is detected, are welcome.
I started sprinkling sanity checks for the relocation values throughout cmd/link, and I noticed that the value of Done (the next byte after Siz) is getting set to 222. Together, 173 and 222 are the bytes 0xad and 0xde, i.e. 0xdead in little-endian order, so I suspect this is the 0xdeaddeaddeaddead in mgcsweep.go. To further support that theory, the handful of cases I've witnessed so far have always been the first Reloc in the s.R slice.
As an experiment, I changed 0xdeaddeaddeaddead to 0xdeafdeafdeafdeaf and the 173 value changed to 175 (0xaf). So it's definitely that.
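For reference, the byte-level arithmetic behind that deduction:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	var b [8]byte

	binary.LittleEndian.PutUint64(b[:], 0xdeaddeaddeaddead)
	fmt.Println(b[0], b[1]) // 173 222: adjacent little-endian bytes of 0xdead

	binary.LittleEndian.PutUint64(b[:], 0xdeafdeafdeafdeaf)
	fmt.Println(b[0]) // 175: the value the corrupted field changed to
}
```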
Ah, there you go. So we're prematurely collecting the slice backing store and overwriting the first word of the first Reloc with a free list pointer and the second word with 0xdeaddeaddeaddead. I don't think there are any channels in the linker (and grep agrees), so this isn't #11643. Can you run your repro with GODEBUG=gccheckmark=1? That may give us a better idea of what's going on.
I suspect this is also the cause of the markflood faults that we see from time to time. These happen in a loop over s.R that attempts to dereference the s.R[i].Sym pointer, so if s.R's backing array is being reclaimed, this could be overwritten with a non-pointer. Here are some recent examples:
2015-07-05T03:52:56-cc3f031/openbsd-386-gce56
I'll try with GODEBUG=gccheckmark=1 today. For what it's worth, my current test is that I added a function that sanity-checks every symbol's relocations, and then instrumented Ldmain to call it after each phase, both before and after an explicit runtime.GC(). I got maybe a dozen crashes running this overnight on openbsd/386. One was "sanity check failed after loadlib allsym", but the rest were all "sanity check failed after loadlib allsym after GC". (Before I added the runtime.GC() call and second pass, the failures were more sporadically distributed.) I also ran a similar test overnight on linux/amd64 (tweaking the sanity check heuristics accordingly), but didn't get any failures.
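A rough sketch of that kind of instrumentation (the function name, the allsyms helper, and the thresholds here are my own placeholders, assuming the linker's LSym/Reloc types; not the code actually used):

```go
// sanityCheck walks every symbol's relocations and reports values that
// cannot be legitimate: an offset beyond the symbol's data, an absurd size,
// or a Done flag that is neither 0 nor 1.
func sanityCheck(where string, syms []*LSym) {
	for _, s := range syms {
		for i := range s.R {
			r := &s.R[i]
			if int64(r.Off) > s.Size || r.Siz > 8 || r.Done > 1 {
				log.Printf("sanity check failed %s: sym %s reloc %d: %+v",
					where, s.Name, i, r)
			}
		}
	}
}

// In Ldmain, roughly: check after each phase, then force a collection and
// check again, so a failure can be pinned to a specific GC.
func instrumentedPhase() {
	loadlib()
	sanityCheck("after loadlib allsym", allsyms())
	runtime.GC()
	sanityCheck("after loadlib allsym after GC", allsyms())
}
```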
Thanks!
Interesting. Since runtime.GC is stop-the-world, this suggests that it's not a concurrency issue or a missed write barrier, but we're simply tracing the Allsym list wrong. That could indicate a problem with the heap bitmap. If that's the case, gccheckmark=1 won't panic with a check mark failure, and we'll continue to see your "sanity check failed".
BTW, I've been able to reproduce this twice in a cluster of 5 openbsd-386 VMs over the past few hours. I've tried it with and without gccheckmark=1 and have only been able to reproduce it without gccheckmark=1, so it could be that check mark somehow inhibits this failure, or it could be random chance since it takes so long to reproduce.
I just saw a repro with GODEBUG=gccheckmark=1, but I didn't see any additional runtime/GC debugging output. Just my own instrumentation prints.
Okay, this definitely sounds like a bitmap problem, then. I'm working on some debugging code, but in the meantime, could you try setting doubleCheck in mbitmap.go to true?
Can do. Should I continue with GODEBUG=gccheckmark=1 or not?
For now, yes, I'd run with GODEBUG=gccheckmark=1 just to get more samples of failures with that enabled. If you get a few more failures where the runtime doesn't panic, you can stop using checkmark since it slows things down.
Just saw another "after loadlib allsym after GC" failure with GODEBUG=gccheckmark=1 and doubleCheck=true, but still no extra runtime output.
@mdempsky, do you have any more examples of your sanity check failure message? In the one you posted, the LSym object is very aligned (so aligned that the corrupted bitmap byte is also very aligned) and I'd like to know if that's just coincidence or if it's likely to be relevant.
I haven't figured out much more since last night, unfortunately. I dug around in the neighboring memory for clues. The preceding span (which neighbors the corrupted byte in the bitmap) is also full of LSyms. Based on their location in Ctxt.Allsym, they were allocated long before the LSym with the corrupted bitmap (over 10,000 LSyms were allocated in-between), so it wasn't the initialization of that span that corrupted the bitmap byte, or the unrolling of the bitmap for those LSyms. Also, since the LSym is 176 bytes, the preceding span has 96 bytes of padding at the end, which leaves quite a bit of space between the bitmap for the preceding LSym and the corrupted bitmap byte.
The LSym with the corrupted bitmap byte was one of the more recently allocated LSyms, with only 2,027 more recently allocated LSyms (out of 17,335 on Ctxt.Allsym). It's also the only LSym with a corrupted bitmap. In the core dump you sent me, the majority of the LSyms are not yet marked (it's in the middle of a GC, so that's not entirely surprising).
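(The padding figure is easy to check: assuming the 176-byte size class uses a single 8192-byte span, ⌊8192/176⌋ = 46 objects occupy 46 × 176 = 8096 bytes, leaving exactly the 96 bytes of tail padding mentioned above.)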
Ah, sorry, I stopped running the tests after I got that coredump. I'll resume running to try to get you some more.
Just got a crash of my own on an openbsd-386 VM.
The LSym is still the first object on a span, but that's as aligned as it gets, and the bitmap doesn't have any stronger alignment either. It's still a fairly recently allocated LSym, but the preceding span is something else entirely (it's a 16K span of 3072-byte slots, again with some padding at the end).
Does that crash show the same bitmap corruption as the first one?
Yes, exactly the same bitmap corruption (sorry, should have said). Also, as in the first one, this is the only LSym that's corrupted.
Still no progress on a solution. At @mdempsky's request, my notes on my hypotheses and what tests I've tried are below, and the code for these tests is at https://go-review.googlesource.com/12277. I'll be away until Monday, so I probably won't make much more progress on this before then.
The overwritten byte contains the mark bit. Are we screwing up a mark bit write?
@mdempsky, if you're still plugging away at this (I can definitely understand if you're not :), I just uploaded https://go-review.googlesource.com/12249, which records an in-memory log of all intentional bitmap writes that may be to the overwritten byte. It would be great to get a core dump with that CL. I'm going to try, but I'm on a plane and it's hard enough to juggle a bunch of OpenBSD VMs with a wired connection to the corp network, much less a flaky connection from 33,576 feet. :)
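The general shape of that kind of debugging aid, sketched standalone with made-up names and record layout (the real CL keeps its log inside the runtime), is a fixed-size ring buffer of write records plus caller PCs that survives into a core dump:

```go
package main

import "runtime"

// writeRec records one suspect write: where it went, what was written, and
// a few frames of the call stack that did it.
type writeRec struct {
	addr uintptr
	val  byte
	pcs  [8]uintptr
}

// writeLog is a ring buffer; keeping it in a global means it is still there
// in a core dump and can be read back with a debugger after a crash.
var (
	writeLog [1024]writeRec
	writePos int
)

func logWrite(addr uintptr, val byte) {
	r := &writeLog[writePos%len(writeLog)]
	writePos++
	r.addr, r.val = addr, val
	n := runtime.Callers(2, r.pcs[:]) // skip Callers and logWrite itself
	for i := n; i < len(r.pcs); i++ {
		r.pcs[i] = 0
	}
}

func main() {
	logWrite(0xdeadbeef, 0xad) // example call at a suspect write site
}
```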
As another data point, I've been running the tests in a loop on my linux-amd64 workstation for the past few days. Out of ~1000 runs, I saw 8 failures. 2 were the markflood crash, which I believe is this bug. 1 was a runtime test timeout (maybe bad). 4 were tests that can fail with some probability (fine). 1 was #11640, but I was running off a commit before my possible fix for that went in.
If one wanted to increase the number of first bytes of spans, where the corruption is showing up, one could reduce the size of a span.
@RLH, what do you mean by "reduce the size of a span"? Do you mean decrease the 8K page size? |
Yes, make it 1k.
Now we're getting somewhere! Using @mdempsky's link5 core dump with PS3 of my CL, I think I've found the interleaving that leads to this. I don't know why this happens yet.
My guess about the global interleaving (I should probably add timestamps to the log...):
At this point we're toast. Note that the above four events probably happened in quick succession, given their event indexes, and appear to be interleaved.
I'm almost out of battery and distance to my flight's destination. Here's my current brain dump:
Guess: initSpan called from mSpan_Sweep for large span
I don't know what this is, but I think it's key. How are we doing two inits of the same span at the same time?
Guess: We grabbed this span because we needed to allocate an LSym; now we're allocating it.
Guess: heapBitsSweepSpan in mSpan_Sweep finishing.
A possible next step would be to call runtime.Callers (or something) from logWrite and dump the whole darn call stack into the log record so we don't have to guess.
I'll try to get another coredump with timestamps and caller PCs.
A couple of peanut gallery questions I've been wondering about: do we have any idea why this happens more often on openbsd? The only thing I know that's odd about it in this context is that there is no platform support for thread-local storage, but I can't see how that matters with an internally linked program like cmd/link.
The other question is, does this happen with GOMAXPROCS=1?
Here is the interleaving:
1. P0 calls mSpan_Sweep on a large span with an unmarked object on it. mSpan_Sweep calls heapBitsSweepSpan, which invokes the callback for the one object on the span. Because this is a large span, the callback calls initSpan, which clears the bitmap, and releases the span back to the heap.
2. P1 grabs this span from the heap to use for 176-byte objects and allocates an LSym in the first slot.
3. P0 returns from the callback to heapBitsSweepSpan, which clears the byte containing the mark, even though this span has been taken over by P1 at this point.
I can think of a few solutions.
I like solution 1 best.
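A plausible reading of that solution, consistent with the eventual fix described in the commit message further down ("we freed large spans early. Now we free large spans later"), is to hold the large span back from the heap until heapBitsSweepSpan is done with its bitmap. A self-contained pseudocode sketch of the reordering (span, sweepBitmap, and returnSpanToHeap are placeholders for the runtime's mspan, heapBitsSweepSpan, and span-freeing machinery, not the real code):

```go
package main

// span stands in for the runtime's mspan.
type span struct{ freed bool }

func clearMarkByte(s *span)    {}
func returnSpanToHeap(s *span) { s.freed = true }

// sweepBitmap invokes the callback for the dead object, then keeps writing
// to the span's bitmap (e.g. clearing mark bits) after the callback returns.
func sweepBitmap(s *span, onDead func()) {
	onDead()
	clearMarkByte(s)
}

// sweepLargeSpan shows the reordering: instead of freeing the span inside
// the callback (which let another P grab and reuse it while sweepBitmap was
// still writing to its bitmap), remember the decision and free the span
// only after sweepBitmap has returned.
func sweepLargeSpan(s *span) {
	freeIt := false
	sweepBitmap(s, func() {
		freeIt = true // previously the span was returned to the heap here
	})
	if freeIt {
		returnSpanToHeap(s) // safe: all bitmap writes for this sweep are done
	}
}

func main() {
	sweepLargeSpan(&span{})
}
```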
@mdempsky, mind testing https://go-review.googlesource.com/12320?
@aclements Testing. Also, an hour is too short to say anything definitive, but after adding cputicks() and Callers() instrumentation to writeLog(), I didn't get any more repros. I'm worried the probe effect (especially from Callers) is too much, so I've reverted those changes for now. I.e., I'm just testing 12249 PS3 + 12320 PS1 for now to see if I get any more crashes.
CL https://golang.org/cl/12320 mentions this issue.
It's a good question. My guess is that it has to do with thread scheduling; I bet we could trigger this much more (and on other platforms) by adding a well-placed delay.
I don't know. It would be good to confirm it doesn't happen with GOMAXPROCS=1.
Three hours and no failures of any sort so far. Not even the occasional segfaults or runaway CPU usage issues I was sporadically seeing while trying to reproduce the "invalid relocation" errors. I'll continue running overnight (and then through the weekend too), but so far it seems promising.
Another 16 hours of testing, and still no failures. At this point, I think it's reasonable to assume @aclements's patch fixed the issue. I'll keep the tests going over the weekend, but unless I see any new failures, I won't spam with further updates.
Great! Thanks to all for the help and for seeing it through!
CL https://golang.org/cl/12451 mentions this issue.
This is a cleanup following cc8f544, which was a minimal change to fix issue golang#11617. This consolidates the two places in mSpan_Sweep that update sweepgen. Previously this was necessary because sweepgen must be updated before freeing the span, but we freed large spans early. Now we free large spans later, so there's no need to duplicate the sweepgen update. This also means large spans can take advantage of the sweepgen sanity checking performed for other spans.

Change-Id: I23b79dbd9ec81d08575cd307cdc0fa6b20831768
Reviewed-on: https://go-review.googlesource.com/12451
Reviewed-by: Rick Hudson <[email protected]>
http://build.golang.org/log/dbd04ef919b4bba31c79a2882fda1ba41008cace (386)
http://build.golang.org/log/0b184b4b886b6bf86634e81ca10fadbc6fd5d0f4 (386)
http://build.golang.org/log/a9e3dca50638f63391106fa63af858506eba811a (386)
http://build.golang.org/log/7c9e9ef6f2e9f8b568fc5efaa40368f480ce6818 (amd64)
http://build.golang.org/log/3f96c7856ce7548618d41419892da7fa7d8ab1e5 (amd64)
See also issue #10567 where TestCgoExternalThread was failing with "invalid relocation" errors (among others).
Somewhat small sample set, but notably the invalid offset seems to always be 173 on openbsd/386 and 200 on openbsd/amd64.