runtime: GC corrupting live objects in cmd/link #11617


Closed
mdempsky opened this issue Jul 6, 2015 · 62 comments

@mdempsky
Contributor

mdempsky commented Jul 6, 2015

http://build.golang.org/log/dbd04ef919b4bba31c79a2882fda1ba41008cace (386)

# go run run.go -- fixedbugs/bug484.go
exit status 2
# command-line-arguments
go.string."npcdata": go.string."npcdata": invalid relocation 414675216+173 not in [0,16)

http://build.golang.org/log/0b184b4b886b6bf86634e81ca10fadbc6fd5d0f4 (386)

# go run run.go -- nosplit.go
incorrect output
BUG
rejected incorrectly:
    main 128

    linker output:
    # _/tmp/go-test-nosplit187529840
    go.string.hdr."eodSent": go.string.hdr."eodSent": invalid relocation 414647280+173 not in [0,8)

http://build.golang.org/log/a9e3dca50638f63391106fa63af858506eba811a (386)

# go run run.go -- fixedbugs/bug290.go
exit status 2
# command-line-arguments
go.string."runtime.tforkt": go.string."runtime.tforkt": invalid relocation 410364976+173 not in [0,23)

http://build.golang.org/log/7c9e9ef6f2e9f8b568fc5efaa40368f480ce6818 (amd64)

# go run run.go -- fixedbugs/issue6055.go
exit status 2
# command-line-arguments
go.string."bad frame layout": go.string."bad frame layout": invalid relocation 544120224+200 not in [0,33)

http://build.golang.org/log/3f96c7856ce7548618d41419892da7fa7d8ab1e5 (amd64)

# go run run.go -- fixedbugs/bug242.go
exit status 2
# command-line-arguments
go.string."runtime.hchan": go.string."runtime.hchan": invalid relocation 544858352+200 not in [0,30)

See also issue #10567 where TestCgoExternalThread was failing with "invalid relocation" errors (among others).

Somewhat small sample set, but notably the second value (the relocation size) seems to always be 173 on openbsd/386 and 200 on openbsd/amd64.

@mdempsky
Contributor Author

mdempsky commented Jul 7, 2015

Here's a failure on linux/amd64, suggesting it's not OpenBSD-specific: http://build.golang.org/log/735c2d636a8cc1f5303994c593c7b5d18b789912

# go run run.go -- varinit.go
exit status 2
# command-line-arguments
go.string."NextGC": go.string."NextGC": invalid relocation 540185808+200 not in [0,23)

Here's a failure on openbsd/386 for a non-go.string symbol: http://build.golang.org/log/f57efb3a7916e4742eaac8a7288e95f65db4b83c

# go run run.go -- fixedbugs/bug082.go
exit status 2
# command-line-arguments
type..eqfunc.runtime.modulehash: type..eqfunc.runtime.modulehash: invalid relocation 415126928+173 not in [0,4)

Here's another failure on openbsd/386 that doesn't fit the 173/200 pattern: http://build.golang.org/log/5baea92cb737e12a38f631a199b657c182def33b

# testmain
type.[]*uint8: type.[]*uint8: invalid relocation 43+4 not in [0,44)
type.[]*uint8: type.[]*uint8: invalid relocation 59+4 not in [0,44)
type.[]*uint8: type.[]*uint8: invalid relocation 76+4 not in [0,44)
type.[]*uint8: type.[]*uint8: invalid relocation 93+4 not in [0,44)
type.[]*uint8: type.[]*uint8: invalid relocation 120+4 not in [0,44)
FAIL    encoding/base64 [build failed]

@mdempsky changed the title from “cmd/link: "invalid relocation" failures on openbsd” to “cmd/link: "invalid relocation" failures” on Jul 7, 2015
@ianlancetaylor added this to the Go1.5 milestone on Jul 10, 2015
@ianlancetaylor
Contributor

CC @rsc @RLH @aclements

This is an intermittent problem that is still happening periodically on the build dashboard. The values that the linker is reporting are completely bogus. I can not think of any plausible way this could occur other than a GC problem in the compiler or linker. Or, I suppose, a file system problem.

@mdempsky: are you able to repeat this at all outside of the build dashboard? I have not been able to, but I also don't have an OpenBSD system.

@mdempsky
Contributor Author

@ianlancetaylor I spent a while trying to repro it on my openbsd/amd64 machine, but haven't succeeded yet. I was going to try an openbsd/386 VM next since it seems to repro there more often with the trybots.

@mdempsky
Contributor Author

BTW, I noticed this code in cmd/link/internal/ld/objfile.go:

                    r.Off = int32(rdint(f))
                    r.Siz = uint8(rdint(f))
                    r.Type = int32(rdint(f))
                    r.Add = rdint(f)

Is it reasonable to change this code to error if the encoded int values are out of range for an int32/uint8 rather than silently truncate them? That might help narrow down if the corruption is occurring in the compiler or linker.
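
For illustration, here is a standalone sketch of the range-checked narrowing being proposed (the rdint stub and helper names are made up for the example, not the linker's actual reader):

    package main

    import (
        "fmt"
        "log"
    )

    // rdint stands in for the object-file varint reader; it returns a canned
    // out-of-range value so the example is self-contained.
    func rdint() int64 { return 1 << 40 }

    // toInt32 and toUint8 narrow an int64 but fail loudly instead of silently
    // truncating, which is the change being proposed above.
    func toInt32(v int64) int32 {
        if int64(int32(v)) != v {
            log.Fatalf("value %d out of range for int32", v)
        }
        return int32(v)
    }

    func toUint8(v int64) uint8 {
        if int64(uint8(v)) != v {
            log.Fatalf("value %d out of range for uint8", v)
        }
        return uint8(v)
    }

    func main() {
        fmt.Println(toUint8(173))     // fine: fits in a uint8
        fmt.Println(toInt32(rdint())) // dies: 2^40 does not fit in an int32
    }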

@ianlancetaylor
Contributor

Sounds like a good idea to me.

@gopherbot
Contributor

CL https://golang.org/cl/12050 mentions this issue.

mdempsky added a commit that referenced this issue Jul 10, 2015
Instead of silently truncating integers to their expected range, check
that they're within range and emit errors if not.  Intended to help
narrow down the cause of issue #11617.

Change-Id: Ia7b577270f8438ca7479262702371e26277f1ea7
Reviewed-on: https://go-review.googlesource.com/12050
Reviewed-by: Ian Lance Taylor <[email protected]>
@mdempsky
Contributor Author

The same "invalid relocation" error occurred a few more times on build.golang.org over the weekend:

openbsd/386 http://build.golang.org/log/e6b76c598ae8232a031d8494e724fd395ba784ef

go.string.hdr."func(*runtime.wbufptr) *runtime.workbuf": go.string.hdr."func(*runtime.wbufptr) *runtime.workbuf": invalid relocation 415022496+173 not in [0,8)

openbsd/386 http://build.golang.org/log/8a864cdb337aad199245615a5e9e74c62f3e6424

go.string.hdr."logput": go.string.hdr."logput": invalid relocation 414345424+173 not in [0,8)

openbsd/386 http://build.golang.org/log/f079859ada15356800d9dcba0b1b838901dbd6ed

type..alg.runtime.bgMarkSignal: type..alg.runtime.bgMarkSignal: invalid relocation 413976688+173 not in [0,8)

openbsd/amd64 http://build.golang.org/log/6d93047a2121ed1c02ca23338a71a566cd48c196

type..alg24: type..alg24: invalid relocation 48+0 not in [0,16)

No rdint panics, so it's seemingly not file corruption/desync at least.

It also occurs to me now that symbols like "type..alg.runtime.bgMarkSignal" are only defined in runtime.a, which is only compiled once and reused for all builds/tests. If the corruption was in the compiler, every link should fail. So it seems like something must be going wrong in cmd/link.

@aclements
Member

Ignoring the small ones, I think these are pointers gone rogue.

They're all multiples of 16. 386 and amd64 each show a distinct, consistent pattern: 386 is 0x18xxxxxx and amd64 is 0x20xxxxxx (on both openbsd and linux). Since Off is an int32, it would hold only the bottom 32 bits of a 64-bit pointer, and Siz would hold the next 8 bits, which gives us 0xc820xxxxxx, and that definitely looks like a 64-bit Go heap pointer.
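
For example, the varinit.go report above decomposes exactly that way (a standalone sketch; the pointer value is reconstructed from that report, and the little-endian layout of Off and Siz within Reloc is assumed):

    package main

    import "fmt"

    func main() {
        // Hypothetical amd64 heap pointer overwriting the start of a Reloc.
        p := uint64(0xc8203294d0)

        off := int32(uint32(p)) // first 4 bytes (Off) receive the low 32 bits
        siz := uint8(p >> 32)   // 5th byte (Siz) receives bits 32..39

        fmt.Printf("Off=%d Siz=%d\n", off, siz)
        // Prints Off=540185808 Siz=200, the same values as the varinit.go failure above.
    }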

@mdempsky
Contributor Author

@aclements Is it still useful for me to work on repro'ing this on openbsd/386?

@aclements
Member

It would be very helpful to have a reproducible (or reasonably reproducible) case. Most convenient would be on linux, but any repro is good. I have no idea what's going on; I was merely adding a fact to our debugging database. :)

@mdempsky
Contributor Author

Managed two repros so far with "./run func7.go":

go.string.hdr."align": go.string.hdr."align": invalid relocation 414699840+173 not in [0,8)

go.string.hdr."struct { F uintptr; gp *runtime.g; p uintptr }": unknown reloc -1961808867
2015/07/13 14:06:06 unexpected relocation variant

Seems exceedingly rare unfortunately. Any ideas for improving the repro rate or useful debugging to apply once corruption is detected are welcome.

@mdempsky
Contributor Author

I started sprinkling sanity checks for the relocation values throughout cmd/link, and I noticed that the value of Done (the next byte after Siz) is getting set to 222. Together, 173 and 222 would be 0xdead in little endian, so I'm suspecting this is the 0xdeaddeaddeaddead in mgcsweep.go?

To further support that theory, the handful of cases I've witnessed so far have also always been the first Reloc in the s.R slice.
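
A quick standalone check of that byte-order reading (0xAD = 173, 0xDE = 222):

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func main() {
        // Siz and Done are adjacent bytes in the Reloc; these are the values
        // observed in the corrupted symbols.
        b := []byte{173, 222}
        fmt.Printf("%#x\n", binary.LittleEndian.Uint16(b)) // prints 0xdead
    }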

@mdempsky changed the title from “cmd/link: "invalid relocation" failures” to “runtime: GC corrupting live objects in cmd/link” on Jul 14, 2015
@mdempsky
Contributor Author

As an experiment, I changed 0xdeaddeaddeaddead to 0xdeafdeafdeafdeaf and the 173 value changed to 175. So it's definitely that.

@aclements
Member

Ah, there you go. So we're prematurely collecting the slice backing store and overwriting the first word of the first Reloc with a free list pointer and the second word with 0xdeaddeaddeaddead.

I don't think there are any channels in the linker (and grep agrees), so this isn't #11643.

Can you run your repro with GODEBUG=gccheckmark=1? That may give us a better idea of what's going on.

@aclements
Member

I suspect this is also the cause of the markflood faults that we see from time to time. These happen in a loop over s.R that attempts to dereference the s.R[i].Sym pointer, so if s.R's backing array is being reclaimed, this could be overwritten with a non-pointer. Here are some recent examples:

2015-07-05T03:52:56-cc3f031/openbsd-386-gce56
2015-07-05T03:52:56-cc3f031/openbsd-amd64-gce56
2015-07-09T04:09:00-91976aa/openbsd-amd64-gce56
2015-07-11T14:36:33-2ae7737/openbsd-amd64-gce56
2015-07-11T19:38:57-7a982e3/openbsd-amd64-gce56
2015-07-12T04:42:40-783297a/openbsd-amd64-gce56
2015-07-13T19:42:35-0bcdffe/openbsd-amd64-gce56
2015-07-13T22:36:24-b06961b/openbsd-amd64-gce56
2015-07-14T04:42:12-3290e9c/linux-amd64
2015-07-14T05:33:14-a74d030/openbsd-amd64-gce56

@mdempsky
Contributor Author

I'll try with GODEBUG=gccheckmark=1 today.

For what it's worth, my current test is that I added this function:

var sanitycheckfailed = false

func sanitycheck(when string) {
    if sanitycheckfailed {
        return
    }
    check := func(s *LSym, where string) {
        for i := range s.R {
            if s.R[i].Siz == 173 || s.R[i].Done > 1 {
                fmt.Println("sanity check failed", when, where)
                fmt.Println(s, s.R, i)
                debug.PrintStack()
                sanitycheckfailed = true
            }
        }
    }
    for s := Ctxt.Allsym; s != nil; s = s.Allsym {
        check(s, "allsym")
    }
    runtime.GC()
    for s := Ctxt.Allsym; s != nil; s = s.Allsym {
        check(s, "allsym after GC")
    }
}

and then I instrumented Ldmain like:

loadlib()
sanitycheck("after loadlib")
[...]
checkgo()
sanitycheck("after checkgo")
checkstrdata()
sanitycheck("after checkstrdata")

and so on.

I got maybe a dozen crashes running this overnight on openbsd/386. One was "sanity check failed after loadlib allsym", but the rest were all "sanity check failed after loadlib allsym after GC". (Before I added the runtime.GC() call and second pass, the failures were more sporadically distributed.)

I also ran a similar test overnight on linux/amd64 (tweaking the sanity check heuristics accordingly), but didn't get any failures.

@aclements
Member

I'll try with GODEBUG=gccheckmark=1 today.

Thanks!

One was "sanity check failed after loadlib allsym", but the rest were all "sanity check failed after loadlib allsym after GC".

Interesting. Since runtime.GC is stop-the-world, this suggests that it's not a concurrency issue or a missed write barrier, but we're simply tracing the Allsym list wrong. That could indicate a problem with the heap bitmap. If that's the case, gccheckmark=1 won't panic with a check mark failure, and we'll continue to see your "sanity check failed".

BTW, I've been able to reproduce this twice in a cluster of 5 openbsd-386 VMs over the past few hours. I've tried it with and without gccheckmark=1 and have only been able to reproduce it without gccheckmark=1, so it could be that check mark somehow inhibits this failure, or it could be random chance since it takes so long to reproduce.

@mdempsky
Contributor Author

I just saw a repro with GODEBUG=gccheckmark=1, but I didn't see any additional runtime/GC debugging output. Just my own instrumentation prints.

@aclements
Member

I just saw a repro with GODEBUG=gccheckmark=1, but I didn't see any additional runtime/GC debugging output. Just my own instrumentation prints.

Okay, this definitely sounds like a bitmap problem, then. I'm working on some debugging code, but in the meantime, could you try setting doubleCheck in mbitmap.go to true?

@mdempsky
Contributor Author

Can do. Should I continue with GODEBUG=gccheckmark=1 or not?

@aclements
Member

For now, yes, I'd run with GODEBUG=gccheckmark=1 just to get more samples of failures with that enabled. If you get a few more failures where the runtime doesn't panic, you can stop using checkmark since it slows things down.

@mdempsky
Contributor Author

Just saw another "after loadlib allsym after GC" failure with GODEBUG=gccheckmark=1 and doubleCheck=true, but still no extra runtime output.

@aclements
Member

@mdempsky, do you have any more examples of your sanity check failure message? In the one you posted, the LSym object is very aligned (so aligned that the corrupted bitmap byte is also very aligned) and I'd like to know if that's just coincidence or if it's likely to be relevant.

I haven't figured out much more since last night, unfortunately. I dug around in the neighboring memory for clues. The preceding span (which neighbors the corrupted byte in the bitmap) is also full of LSyms. Based on their location in Ctxt.Allsym, they were allocated long before the LSym with the corrupted bitmap (over 10,000 LSyms were allocated in-between), so it wasn't the initialization of that span that corrupted the bitmap byte, or the unrolling of the bitmap for those LSyms. Also, since the LSym is 176 bytes, the preceding span has 96 bytes of padding at the end, which leaves quite a bit of space between the bitmap for the preceding LSym and the corrupted bitmap byte.

The LSym with the corrupted bitmap byte was one of the more recently allocated LSyms, with only 2,027 more recently allocated LSyms (out of 17,335 on Ctxt.Allsym). It's also the only LSym with a corrupted bitmap. In the core dump you sent me, the majority of the LSyms are not yet marked (it's in the middle of a GC, so that's not entirely surprising).
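
The 96 bytes of padding mentioned above follow directly from the span arithmetic; a quick standalone check (the 8 KB span size for this size class is an assumption about the runtime of that era):

    package main

    import "fmt"

    func main() {
        const spanBytes = 8192 // assumed span size for the 176-byte size class
        const objBytes = 176   // sizeof(LSym) per the discussion above

        n := spanBytes / objBytes
        fmt.Printf("%d objects per span, %d bytes of tail padding\n",
            n, spanBytes-n*objBytes)
        // Prints: 46 objects per span, 96 bytes of tail padding
    }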

@mdempsky
Contributor Author

Ah, sorry, I stopped running the tests after I got that coredump. I'll resume running to try to get you some more.

@aclements
Member

In the one you posted, the LSym object is very aligned (so aligned that the corrupted bitmap byte is also very aligned) and I'd like to know if that's just coincidence or if it's likely to be relevant.

Just got a crash of my own on an openbsd-386 VM.

sanity check failed after loadlib allsym after GC
go.typelink.[16]uint32  [16]uint32 [{409197984 173 222 1 0 0 0 type.[16]uint32 <nil>}] 0
0x18786000 0x187860a0 0x1863d5f0

The LSym is still the first object on a span, but that's as aligned as it gets, and the bitmap doesn't have any stronger alignment either.

It's still a fairly recently allocated LSym, but the preceding span is something else entirely (it's a 16K span of 3072 byte slots, again with some padding at the end.)

@mdempsky
Contributor Author

Does that crash show the same bitmap corruption as the first one?

@aclements
Member

Does that crash show the same bitmap corruption as the first one?

Yes, exactly the same bitmap corruption (sorry, should have said). Also, as in the first one, this is the only LSym that's corrupted.

@aclements
Member

Still no progress on a solution. At @mdempsky's request, my notes on my hypotheses and the tests I've tried are below; the code for these tests is at https://go-review.googlesource.com/12277. I'll be away until Monday, so I probably won't make much more progress on this before then.

Overwritten byte contains the mark bit. Are we screwing up a mark bit write?

  • What manipulates that byte after allocation?
    • heapBits.setMarked
      • NO: Atomic or (checked assembly); can't clear bits
    • heapBits.setMarkedNonAtomic
      • Only used by sweeping, which SHOULD own the span and never be concurrent with GC.
      • TEST: Try to reproduce with setMarked instead of setMarkedNonAtomic.
        • NO: Still occurred (sample 3).
    • heapBits.setCheckmarked
      • NO: Checkmarking is not enabled, so we don't reach this code
    • heapBits.initCheckmarkSpan
      • NO: Checkmarking is not enabled, so we don't reach this code
    • heapBits.clearCheckmarkSpan
      • NO: Checkmarking is not enabled, so we don't reach this code
  • What else manipulates mark bits?
    • heapBits.initSpan
      • NO: The preceding span was initialized long before the span with the bad bitmap.
        • 0x18770000 is the 2028th LSym in Allsym. The preceding span is LSyms except for 0x1876f290; those LSyms are Allsyms 13989--14033, so they were allocated long before 0x18770000 (since the list goes from newest to oldest). Further, the span of 0x18770000 consists entirely of LSyms, and 0x18770000 would have been the first one allocated from that span (causing the initSpan).
          • 0x1876f290 from the previous span is the only object in that span not on Allsym. It's definitely not an LSym. No pointers.
      • NO: the memclr seems to be right (as does the memclr 386 implementation).
    • heapBitsSetType
      • TEST: Check last byte written.
        • NO: Reproduced twice without this test failing.
    • heapBitsSetTypeGCProg
    • heapBitsSweepSpan
      • Sweep of neighboring span?
        • Unlikely, but not impossible that a sweep of a neighboring span overflowed. I wouldn't expect it to hit that exact byte of the bitmap, since that's not where the step would take it if it went one over.
      • Sweep of this span without proper ownership?
        • heapBitsSweepSpan does do non-atomic writes to this byte.
        • Could happen if object is allocated (bitmap written), but not marked, sweep runs, sees it's not marked, writes a zero to this byte, and then it gets marked.
        • Could happen if sweep reads the mark byte before allocation writes the bitmap, then allocation writes the bitmap, then sweep writes a zero byte back out.
        • Can there be multiple sweepers?
          • No. Every mSpan_Sweep is protected by a cas(&s.sweepgen), so only one sweeper is sweeping a span at a time.
        • Can there be a sweeper and an allocator?
          • TEST: Check sweepgen before and after heapBitsSetType.
            • NO: Reproduced 10 times (!) without this failure. Did this make it easier to trigger?
        • Can there be a sweeper and a GC?
          • TEST: Check sweepgen in greyobject and gcmarknewobject_m.
            • NO: Reproduced twice without this failure.
      • TEST: Check that bitp doesn't change across sweep callback.
        • NO (probably): Reproduced 10 times (!) without this test failing. Did this make it easier to trigger?
      • TEST: Look for the LSym bitmap pattern and panic before writing.
        • ???
  • TEST: Skip allocating the first object in 176 byte spans, put a test pattern there, and check the test pattern when sweeping. (Tests for a stray write, which seems very unlikely, but who knows at this point.)
    • ???

@aclements
Member

@mdempsky, if you're still plugging away at this (I can definitely understand if you're not :), I just uploaded https://go-review.googlesource.com/12249, which records an in-memory log of all intentional bitmap writes that may be to the overwritten byte. It would be great to get a core dump with that CL. I'm going to try, but I'm on a plane and it's hard enough to juggle a bunch of OpenBSD VMs with a wired connection to the corp network, much less a flaky connection from 33,576 feet. :)

@aclements
Member

As another data point, I've been running

GODEBUG=gccheckmark=1 GOTRACEBACK=crash stress -p 1 ./run.bash --no-rebuild

on my linux-amd64 workstation for the past few days. Out of ~1000 runs, I saw 8 failures. 2 were the markflood crash, which I believe is this bug. 1 was a runtime test timeout (maybe bad). 4 were tests that can fail with some probability (fine). 1 was #11640, but I was running off a commit before my possible fix for that went in.

@RLH
Contributor

RLH commented Jul 16, 2015

If one wanted to increase the number of span-start bytes where the corruption can happen, one could reduce the size of a span, thus increasing the number of spans and the chances of hitting the corruption.


@aclements
Member

@RLH, what do you mean by "reduce the size of a span"? Do you mean decrease the 8K page size?

@RLH
Contributor

RLH commented Jul 16, 2015

Yes, make it 1k.


@aclements
Member

Now we're getting somewhere! Using @mdempsky's link5 core dump with PS3 of my CL, I think I've found the interleaving that leads to this. I don't know why this happens yet.

(gdb) p 'runtime.allp'[0].bitmapLogPos%2048
$4 = 443
(gdb) p 'runtime.allp'[1].bitmapLogPos%2048
$5 = 315

(gdb) afind 'runtime.allp'[0].bitmapLog, !(x['addr']==0x184ef7ff)
[1310]={addr = 407828479, entry = 0x500, pre = 196 '\304', post = 0 '\000'}
entry=initSpan, pre=0b11000100 [presumably reusing a span]
[1311]={addr = 407828479, entry = 0x615, pre = 197 '\305', post = 0 '\000'}
entry=heapBitsSweepSpan, pre=0b11000101
[92]={addr = 407828479, entry = 0x297, pre = 0 '\000', post = 16 '\020'}
entry=setMarked, post=0b00010000 (marked, no pointers)

(gdb) afind 'runtime.allp'[1].bitmapLog, !(x['addr']==0x184ef7ff)
[370]={addr = 407828479, entry = 0x500, pre = 0 '\000', post = 0 '\000'}
entry=initSpan
[371]={addr = 407828479, entry = 0x954, pre = 0 '\000', post = 197 '\305'}
entry=heapBitsSetType, post=0b11000101 [correct LSym bitmap, not marked]
[1529]={addr = 407828479, entry = 0x615, pre = 16 '\020', post = 0 '\000'}
entry=heapBitsSweepSpan
[314]={addr = 407828479, entry = 0x297, pre = 0 '\000', post = 16 '\020'}
entry=setMarked

My guess about the global interleaving (I should probably add timestamps to the log...):

P0 [1310]={addr = 407828479, entry = 0x500, pre = 196 '\304', post = 0 '\000'}
entry=initSpan, pre=0b11000100 [presumably reusing a span]

P1 [370]={addr = 407828479, entry = 0x500, pre = 0 '\000', post = 0 '\000'}
entry=initSpan

P1 [371]={addr = 407828479, entry = 0x954, pre = 0 '\000', post = 197 '\305'}
entry=heapBitsSetType, post=0b11000101 [correct LSym bitmap, not marked]

P0 [1311]={addr = 407828479, entry = 0x615, pre = 197 '\305', post = 0 '\000'}
entry=heapBitsSweepSpan, pre=0b11000101

At this point we're toast. Note that the above four events probably happened in quick succession, given their event indexes, and appear to be interleaved.

P0 [92]={addr = 407828479, entry = 0x297, pre = 0 '\000', post = 16 '\020'}
entry=setMarked, post=0b00010000 (marked, no pointers)

P1 [1529]={addr = 407828479, entry = 0x615, pre = 16 '\020', post = 0 '\000'}
entry=heapBitsSweepSpan

P1 [314]={addr = 407828479, entry = 0x297, pre = 0 '\000', post = 16 '\020'}
entry=setMarked [post agrees with current value of byte, and 314 was *just* logged]

@aclements
Member

I'm almost out of battery and distance to my flight's destination. Here's my current brain dump:

P0 [1310]={addr = 407828479, entry = 0x500, pre = 196 '\304', post = 0 '\000'}
entry=initSpan, pre=0b11000100 [presumably reusing a span]

Guess: initSpan called from mSpan_Sweep for large span

P1 [370]={addr = 407828479, entry = 0x500, pre = 0 '\000', post = 0 '\000'}
entry=initSpan

I don't know what this is, but I think it's key. How are we doing two inits of the same span at the same time?

P1 [371]={addr = 407828479, entry = 0x954, pre = 0 '\000', post = 197 '\305'}
entry=heapBitsSetType, post=0b11000101 [correct LSym bitmap, not marked]

Guess: We grabbed this span because we needed to allocate an LSym; now we're allocating it.

P0 [1311]={addr = 407828479, entry = 0x615, pre = 197 '\305', post = 0 '\000'}
entry=heapBitsSweepSpan, pre=0b11000101

Guess: heapBitsSweepSpan in mSpan_Sweep finishing

A possible next step would be to call runtime.Callers (or something) from logWrite and dump the whole darn call stack into the log record so we don't have to guess.
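
A minimal sketch of that idea (logWrite, the record layout, and the buffer size here are hypothetical; only the runtime.Callers / runtime.CallersFrames calls are real APIs):

    package main

    import (
        "fmt"
        "runtime"
    )

    type logRecord struct {
        addr uintptr
        n    int
        pcs  [8]uintptr // truncated call stack of the writer
    }

    var bitmapLog [2048]logRecord
    var logPos int

    // logWrite records which code path touched a given bitmap address.
    func logWrite(addr uintptr) {
        rec := &bitmapLog[logPos%len(bitmapLog)]
        logPos++
        rec.addr = addr
        rec.n = runtime.Callers(2, rec.pcs[:]) // skip Callers and logWrite frames
    }

    func main() {
        logWrite(0x184ef7ff)
        rec := &bitmapLog[0]
        frames := runtime.CallersFrames(rec.pcs[:rec.n])
        for {
            f, more := frames.Next()
            fmt.Println(f.Function)
            if !more {
                break
            }
        }
    }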

@mdempsky
Contributor Author

I'll try to get another coredump with timestamps and caller PCs.

@mwhudson
Contributor

mwhudson commented Jul 16, 2015 via email

@aclements
Member

P0 calls mSpan_Sweep on a large span with an unmarked object on it. mSpan_Sweep calls heapBitsSweepSpan, which invokes the callback for the one object on the span. Because this is a large span, the callback calls initSpan, which clears the bitmap (P0 [1310] above). The callback then calls mHeap_Free. This makes the span available for allocation, but this is too early.

P1 grabs this span from the heap to use for 176 byte objects and allocates an LSym in the first slot (P1 [371]).

P0 returns from the callback to heapBitsSweepSpan. It clears the byte containing the mark, even though this span has been taken over by P1 at this point.

I can think of a few solutions.

  1. mSpan_Sweep could delay the mHeap_Free until after the callback and heapBitsSweepSpan returns.

  2. The mSpan_Sweep callback could return a boolean telling heapBitsSweepSpan that the span has been taken care of and it shouldn't touch the bitmap or do anything else with that span.

I like solution 1 best.
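
Schematically, the bug and solution 1 look something like this (the names echo the runtime's, but these are simplified stand-ins that only illustrate the ordering, not the actual runtime code):

    package main

    import "fmt"

    // Minimal stand-ins so the sketch compiles; not the runtime's real types.
    type mspan struct{ id int }

    func heapBitsSweepSpan(s *mspan, f func()) {
        f()
        // After the callback returns, the sweep still writes to the span's
        // bitmap, clearing the byte that holds the mark bit.
        fmt.Printf("span %d: sweep clears mark byte in bitmap\n", s.id)
    }

    func initSpanBitmap(s *mspan) { fmt.Printf("span %d: bitmap cleared\n", s.id) }
    func mHeap_Free(s *mspan)     { fmt.Printf("span %d: freed to heap; other Ps may reuse it\n", s.id) }

    // Buggy shape: the large-object callback frees the span while
    // heapBitsSweepSpan still has bitmap writes left to do.
    func sweepLargeBuggy(s *mspan) {
        heapBitsSweepSpan(s, func() {
            initSpanBitmap(s)
            mHeap_Free(s) // too early: another P can now allocate from s
        })
    }

    // Solution 1: remember that the span must be freed and do it only after
    // heapBitsSweepSpan (and all of its bitmap writes) have finished.
    func sweepLargeFixed(s *mspan) {
        freeToHeap := false
        heapBitsSweepSpan(s, func() {
            initSpanBitmap(s)
            freeToHeap = true
        })
        if freeToHeap {
            mHeap_Free(s) // safe: no further bitmap writes for this span
        }
    }

    func main() {
        sweepLargeBuggy(&mspan{id: 1})
        sweepLargeFixed(&mspan{id: 2})
    }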

@aclements
Member

@mdempsky
Contributor Author

@aclements Testing.

Also, an hour is too short to say anything definitive, but after adding cputicks() and Callers() instrumentation to writeLog(), I didn't get any more repros. I'm worried the probe effect (especially from Callers) is too much, so I've reverted those changes for now. I.e., I'm just testing 12249 PS3 + 12320 PS1 for now to see if I get any more crashes.

@gopherbot
Contributor

CL https://golang.org/cl/12320 mentions this issue.

@aclements
Member

On Jul 16, 2015, Michael Hudson-Doyle wrote:

A couple of peanut gallery questions I've been wondering about: do we have any idea why this happens more often on openbsd? The only thing I know that's odd about it in this context is that there is no platform support for thread local storage, but I can't see how that matters with an internally linked program like cmd/link.

It's a good question. My guess is that it has to do with thread scheduling, but that's just a guess.

I bet we could trigger this much more (and on other platforms) by adding a usleep(10) or something just after the mHeap_Free.

The other question is, does this happen with GOMAXPROCS=1?

I don't know. It would be good to confirm it doesn't happen with GOMAXPROCS=1, though it's hard enough to reproduce that if it didn't happen you wouldn't really be sure it wasn't going to happen.

@mdempsky
Contributor Author

Three hours and no failures of any sort so far. Not even the occasional segfaults or runaway CPU usage issues I was sporadically seeing while trying to reproduce the "invalid relocation" errors.

I'll continue running overnight (and then through the weekend too), but so far it seems promising.

@mdempsky
Contributor Author

Another 16 hours of testing, and still no failures. At this point, I think it's reasonable to assume @aclements's patch fixed the issue.

I'll keep the tests going over the weekend, but unless I see any new failures, I won't spam with further updates.

@aclements
Member

Great!

Thanks for all the help and for sticking it through!

@gopherbot
Contributor

CL https://golang.org/cl/12451 mentions this issue.

mk0x9 pushed a commit to mk0x9/go that referenced this issue Sep 14, 2015
This is a cleanup following cc8f544, which was a minimal change to fix
issue golang#11617. This consolidates the two places in mSpan_Sweep that
update sweepgen. Previously this was necessary because sweepgen must
be updated before freeing the span, but we freed large spans early.
Now we free large spans later, so there's no need to duplicate the
sweepgen update. This also means large spans can take advantage of the
sweepgen sanity checking performed for other spans.

Change-Id: I23b79dbd9ec81d08575cd307cdc0fa6b20831768
Reviewed-on: https://go-review.googlesource.com/12451
Reviewed-by: Rick Hudson <[email protected]>
@golang locked and limited conversation to collaborators on Jul 20, 2016