runtime: GC corrupting live objects in cmd/link #11617


Closed
mdempsky opened this issue Jul 6, 2015 · 62 comments

@mdempsky
Contributor

mdempsky commented Jul 6, 2015

http://build.golang.org/log/dbd04ef919b4bba31c79a2882fda1ba41008cace (386)

# go run run.go -- fixedbugs/bug484.go
exit status 2
# command-line-arguments
go.string."npcdata": go.string."npcdata": invalid relocation 414675216+173 not in [0,16)

http://build.golang.org/log/0b184b4b886b6bf86634e81ca10fadbc6fd5d0f4 (386)

# go run run.go -- nosplit.go
incorrect output
BUG
rejected incorrectly:
    main 128

    linker output:
    # _/tmp/go-test-nosplit187529840
    go.string.hdr."eodSent": go.string.hdr."eodSent": invalid relocation 414647280+173 not in [0,8)

http://build.golang.org/log/a9e3dca50638f63391106fa63af858506eba811a (386)

# go run run.go -- fixedbugs/bug290.go
exit status 2
# command-line-arguments
go.string."runtime.tforkt": go.string."runtime.tforkt": invalid relocation 410364976+173 not in [0,23)

http://build.golang.org/log/7c9e9ef6f2e9f8b568fc5efaa40368f480ce6818 (amd64)

# go run run.go -- fixedbugs/issue6055.go
exit status 2
# command-line-arguments
go.string."bad frame layout": go.string."bad frame layout": invalid relocation 544120224+200 not in [0,33)

http://build.golang.org/log/3f96c7856ce7548618d41419892da7fa7d8ab1e5 (amd64)

# go run run.go -- fixedbugs/bug242.go
exit status 2
# command-line-arguments
go.string."runtime.hchan": go.string."runtime.hchan": invalid relocation 544858352+200 not in [0,30)

See also issue #10567 where TestCgoExternalThread was failing with "invalid relocation" errors (among others).

Somewhat small sample set, but notably the second value (the relocation size) seems to always be 173 on openbsd/386 and 200 on openbsd/amd64.

@mdempsky
Contributor Author

mdempsky commented Jul 7, 2015

Here's a failure on linux/amd64, suggesting it's not OpenBSD-specific: http://build.golang.org/log/735c2d636a8cc1f5303994c593c7b5d18b789912

# go run run.go -- varinit.go
exit status 2
# command-line-arguments
go.string."NextGC": go.string."NextGC": invalid relocation 540185808+200 not in [0,23)

Here's a failure on openbsd/386 for a non-go.string symbol: http://build.golang.org/log/f57efb3a7916e4742eaac8a7288e95f65db4b83c

# go run run.go -- fixedbugs/bug082.go
exit status 2
# command-line-arguments
type..eqfunc.runtime.modulehash: type..eqfunc.runtime.modulehash: invalid relocation 415126928+173 not in [0,4)

Here's another failure on openbsd/386 that doesn't fit the 173/200 pattern: http://build.golang.org/log/5baea92cb737e12a38f631a199b657c182def33b

# testmain
type.[]*uint8: type.[]*uint8: invalid relocation 43+4 not in [0,44)
type.[]*uint8: type.[]*uint8: invalid relocation 59+4 not in [0,44)
type.[]*uint8: type.[]*uint8: invalid relocation 76+4 not in [0,44)
type.[]*uint8: type.[]*uint8: invalid relocation 93+4 not in [0,44)
type.[]*uint8: type.[]*uint8: invalid relocation 120+4 not in [0,44)
FAIL    encoding/base64 [build failed]

@mdempsky changed the title from “cmd/link: "invalid relocation" failures on openbsd” to “cmd/link: "invalid relocation" failures” on Jul 7, 2015
@ianlancetaylor added this to the Go1.5 milestone on Jul 10, 2015
@ianlancetaylor
Contributor

CC @rsc @RLH @aclements

This is an intermittent problem that is still happening periodically on the build dashboard. The values that the linker is reporting are completely bogus. I can not think of any plausible way this could occur other than a GC problem in the compiler or linker. Or, I suppose, a file system problem.

@mdempsky: are you able to repeat this at all outside of the build dashboard? I have not been able to, but I also don't have an OpenBSD system.

@mdempsky
Contributor Author

@ianlancetaylor I spent a while trying to repro it on my openbsd/amd64 machine, but haven't succeeded yet. I was going to try an openbsd/386 VM next since it seems to repro there more often with the trybots.

@mdempsky
Contributor Author

BTW, I noticed this code in cmd/link/internal/ld/objfile.go:

                    r.Off = int32(rdint(f))
                    r.Siz = uint8(rdint(f))
                    r.Type = int32(rdint(f))
                    r.Add = rdint(f)

Is it reasonable to change this code to error if the encoded int values are out of range for an int32/uint8 rather than silently truncate them? That might help narrow down if the corruption is occurring in the compiler or linker.
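
For illustration, here is a standalone sketch of the range-checked narrowing being proposed (the rdint stub and helper names are made up for the example, not the linker's actual reader):

    package main

    import (
        "fmt"
        "log"
    )

    // rdint stands in for the object-file varint reader; it returns a canned
    // out-of-range value so the example is self-contained.
    func rdint() int64 { return 1 << 40 }

    // toInt32 and toUint8 narrow an int64 but fail loudly instead of silently
    // truncating, which is the change being proposed above.
    func toInt32(v int64) int32 {
        if int64(int32(v)) != v {
            log.Fatalf("value %d out of range for int32", v)
        }
        return int32(v)
    }

    func toUint8(v int64) uint8 {
        if int64(uint8(v)) != v {
            log.Fatalf("value %d out of range for uint8", v)
        }
        return uint8(v)
    }

    func main() {
        fmt.Println(toUint8(173))     // fine: fits in a uint8
        fmt.Println(toInt32(rdint())) // dies: 2^40 does not fit in an int32
    }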

@ianlancetaylor
Contributor

Sounds like a good idea to me.

@gopherbot
Contributor

CL https://golang.org/cl/12050 mentions this issue.

mdempsky added a commit that referenced this issue Jul 10, 2015
Instead of silently truncating integers to their expected range, check
that they're within range and emit errors if not.  Intended to help
narrow down the cause of issue #11617.

Change-Id: Ia7b577270f8438ca7479262702371e26277f1ea7
Reviewed-on: https://go-review.googlesource.com/12050
Reviewed-by: Ian Lance Taylor <[email protected]>
@mdempsky
Contributor Author

The same "invalid relocation" error occurred a few more times on build.golang.org over the weekend:

openbsd/386 http://build.golang.org/log/e6b76c598ae8232a031d8494e724fd395ba784ef

go.string.hdr."func(*runtime.wbufptr) *runtime.workbuf": go.string.hdr."func(*runtime.wbufptr) *runtime.workbuf": invalid relocation 415022496+173 not in [0,8)

openbsd/386 http://build.golang.org/log/8a864cdb337aad199245615a5e9e74c62f3e6424

go.string.hdr."logput": go.string.hdr."logput": invalid relocation 414345424+173 not in [0,8)

openbsd/386 http://build.golang.org/log/f079859ada15356800d9dcba0b1b838901dbd6ed

type..alg.runtime.bgMarkSignal: type..alg.runtime.bgMarkSignal: invalid relocation 413976688+173 not in [0,8)

openbsd/amd64 http://build.golang.org/log/6d93047a2121ed1c02ca23338a71a566cd48c196

type..alg24: type..alg24: invalid relocation 48+0 not in [0,16)

No rdint panics, so it's seemingly not file corruption/desync at least.

It also occurs to me now that symbols like "type..alg.runtime.bgMarkSignal" are only defined in runtime.a, which is only compiled once and reused for all builds/tests. If the corruption was in the compiler, every link should fail. So it seems like something must be going wrong in cmd/link.

@aclements
Member

Ignoring the small ones, I think these are pointers gone rogue.

They're all multiples of 16. 386 and amd64 each show a distinct, consistent pattern: 386 is 0x18xxxxxx and amd64 is 0x20xxxxxx (on both openbsd and linux). Since Off is an int32, it would hold only the bottom 32 bits of a 64-bit pointer, and Siz would hold the next 8 bits, which gives us 0xc820xxxxxx, and that definitely looks like a 64-bit Go heap pointer.
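
For example, the varinit.go report above decomposes exactly that way (a standalone sketch; the pointer value is reconstructed from that report, and the little-endian layout of Off and Siz within Reloc is assumed):

    package main

    import "fmt"

    func main() {
        // Hypothetical amd64 heap pointer overwriting the start of a Reloc.
        p := uint64(0xc8203294d0)

        off := int32(uint32(p)) // first 4 bytes (Off) receive the low 32 bits
        siz := uint8(p >> 32)   // 5th byte (Siz) receives bits 32..39

        fmt.Printf("Off=%d Siz=%d\n", off, siz)
        // Prints Off=540185808 Siz=200, the same values as the varinit.go failure above.
    }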

@mdempsky
Contributor Author

@aclements Is it still useful for me to work on repro'ing this on openbsd/386?

@aclements
Member

It would be very helpful to have a reproducible (or reasonably reproducible) case. Most convenient would be on linux, but any repro is good. I have no idea what's going on; I was merely adding a fact to our debugging database. :)

@mdempsky
Contributor Author

Managed two repros so far with "./run func7.go":

go.string.hdr."align": go.string.hdr."align": invalid relocation 414699840+173 not in [0,8)

go.string.hdr."struct { F uintptr; gp *runtime.g; p uintptr }": unknown reloc -1961808867
2015/07/13 14:06:06 unexpected relocation variant

Seems exceedingly rare unfortunately. Any ideas for improving the repro rate or useful debugging to apply once corruption is detected are welcome.

@mdempsky
Contributor Author

I started sprinkling sanity checks for the relocation values throughout cmd/link, and I noticed that the value of Done (the next byte after Siz) is getting set to 222. Together, 173 and 222 would be 0xdead in little endian, so I'm suspecting this is the 0xdeaddeaddeaddead in mgcsweep.go?

To further support that theory, the handful of cases I've witnessed so far have also always been the first Reloc in the s.R slice.
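
A quick standalone check of that byte-order reading (0xAD = 173, 0xDE = 222):

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func main() {
        // Siz and Done are adjacent bytes in the Reloc; these are the values
        // observed in the corrupted symbols.
        b := []byte{173, 222}
        fmt.Printf("%#x\n", binary.LittleEndian.Uint16(b)) // prints 0xdead
    }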

@mdempsky changed the title from “cmd/link: "invalid relocation" failures” to “runtime: GC corrupting live objects in cmd/link” on Jul 14, 2015
@mdempsky
Contributor Author

As an experiment, I changed 0xdeaddeaddeaddead to 0xdeafdeafdeafdeaf and the 173 value changed to 175. So it's definitely that.

@aclements
Member

Ah, there you go. So we're prematurely collecting the slice backing store and overwriting the first word of the first Reloc with a free list pointer and the second word with 0xdeaddeaddeaddead.

I don't think there are any channels in the linker (and grep agrees), so this isn't #11643.

Can you run your repro with GODEBUG=gccheckmark=1? That may give us a better idea of what's going on.

@aclements
Member

I suspect this is also the cause of the markflood faults that we see from time to time. These happen in a loop over s.R that attempts to dereference the s.R[i].Sym pointer, so if s.R's backing array is being reclaimed, this could be overwritten with a non-pointer. Here are some recent examples:

2015-07-05T03:52:56-cc3f031/openbsd-386-gce56
2015-07-05T03:52:56-cc3f031/openbsd-amd64-gce56
2015-07-09T04:09:00-91976aa/openbsd-amd64-gce56
2015-07-11T14:36:33-2ae7737/openbsd-amd64-gce56
2015-07-11T19:38:57-7a982e3/openbsd-amd64-gce56
2015-07-12T04:42:40-783297a/openbsd-amd64-gce56
2015-07-13T19:42:35-0bcdffe/openbsd-amd64-gce56
2015-07-13T22:36:24-b06961b/openbsd-amd64-gce56
2015-07-14T04:42:12-3290e9c/linux-amd64
2015-07-14T05:33:14-a74d030/openbsd-amd64-gce56

@mdempsky
Contributor Author

I'll try with GODEBUG=gccheckmark=1 today.

For what it's worth, my current test is that I added this function:

var sanitycheckfailed = false

func sanitycheck(when string) {
    if sanitycheckfailed {
        return
    }
    check := func(s *LSym, where string) {
        for i := range s.R {
            if s.R[i].Siz == 173 || s.R[i].Done > 1 {
                fmt.Println("sanity check failed", when, where)
                fmt.Println(s, s.R, i)
                debug.PrintStack()
                sanitycheckfailed = true
            }
        }
    }
    for s := Ctxt.Allsym; s != nil; s = s.Allsym {
        check(s, "allsym")
    }
    runtime.GC()
    for s := Ctxt.Allsym; s != nil; s = s.Allsym {
        check(s, "allsym after GC")
    }
}

and then I instrumented Ldmain like:

loadlib()
sanitycheck("after loadlib")
[...]
checkgo()
sanitycheck("after checkgo")
checkstrdata()
sanitycheck("after checkstrdata")

and so on.

I got maybe a dozen crashes running this overnight on openbsd/386. One was "sanity check failed after loadlib allsym", but the rest were all "sanity check failed after loadlib allsym after GC". (Before I added the runtime.GC() call and second pass, the failures were more sporadically distributed.)

I also ran a similar test overnight on linux/amd64 (tweaking the sanity check heuristics accordingly), but didn't get any failures.

@aclements
Member

I'll try with GODEBUG=gccheckmark=1 today.

Thanks!

One was "sanity check failed after loadlib allsym", but the rest were all "sanity check failed after loadlib allsym after GC".

Interesting. Since runtime.GC is stop-the-world, this suggests that it's not a concurrency issue or a missed write barrier, but we're simply tracing the Allsym list wrong. That could indicate a problem with the heap bitmap. If that's the case, gccheckmark=1 won't panic with a check mark failure, and we'll continue to see your "sanity check failed".

BTW, I've been able to reproduce this twice in a cluster of 5 openbsd-386 VMs over the past few hours. I've tried it with and without gccheckmark=1 and have only been able to reproduce it without gccheckmark=1, so it could be that check mark somehow inhibits this failure, or it could be random chance since it takes so long to reproduce.

@mdempsky
Contributor Author

I just saw a repro with GODEBUG=gccheckmark=1, but I didn't see any additional runtime/GC debugging output. Just my own instrumentation prints.

@aclements
Member

I just saw a repro with GODEBUG=gccheckmark=1, but I didn't see any additional runtime/GC debugging output. Just my own instrumentation prints.

Okay, this definitely sounds like a bitmap problem, then. I'm working on some debugging code, but in the meantime, could you try setting doubleCheck in mbitmap.go to true?

@mdempsky
Contributor Author

Can do. Should I continue with GODEBUG=gccheckmark=1 or not?

@aclements
Member

For now, yes, I'd run with GODEBUG=gccheckmark=1 just to get more samples of failures with that enabled. If you get a few more failures where the runtime doesn't panic, you can stop using checkmark since it slows things down.

@mdempsky
Contributor Author

Just saw another "after loadlib allsym after GC" failure with GODEBUG=gccheckmark=1 and doubleCheck=true, but still no extra runtime output.

@aclements
Member

@mdempsky, do you have any more examples of your sanity check failure message? In the one you posted, the LSym object is very aligned (so aligned that the corrupted bitmap byte is also very aligned) and I'd like to know if that's just coincidence or if it's likely to be relevant.

I haven't figured out much more since last night, unfortunately. I dug around in the neighboring memory for clues. The preceding span (which neighbors the corrupted byte in the bitmap) is also full of LSyms. Based on their location in Ctxt.Allsym, they were allocated long before the LSym with the corrupted bitmap (over 10,000 LSyms were allocated in-between), so it wasn't the initialization of that span that corrupted the bitmap byte, or the unrolling of the bitmap for those LSyms. Also, since the LSym is 176 bytes, the preceding span has 96 bytes of padding at the end, which leaves quite a bit of space between the bitmap for the preceding LSym and the corrupted bitmap byte.

The LSym with the corrupted bitmap byte was one of the more recently allocated LSyms, with only 2,027 more recently allocated LSyms (out of 17,335 on Ctxt.Allsym). It's also the only LSym with a corrupted bitmap. In the core dump you sent me, the majority of the LSyms are not yet marked (it's in the middle of a GC, so that's not entirely surprising).
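
The 96 bytes of padding mentioned above follow directly from the span arithmetic; a quick standalone check (the 8 KB span size for this size class is an assumption about the runtime of that era):

    package main

    import "fmt"

    func main() {
        const spanBytes = 8192 // assumed span size for the 176-byte size class
        const objBytes = 176   // sizeof(LSym) per the discussion above

        n := spanBytes / objBytes
        fmt.Printf("%d objects per span, %d bytes of tail padding\n",
            n, spanBytes-n*objBytes)
        // Prints: 46 objects per span, 96 bytes of tail padding
    }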

@mdempsky
Contributor Author

Ah, sorry, I stopped running the tests after I got that coredump. I'll resume running to try to get you some more.

@aclements
Member

In the one you posted, the LSym object is very aligned (so aligned that the corrupted bitmap byte is also very aligned) and I'd like to know if that's just coincidence or if it's likely to be relevant.

Just got a crash of my own on an openbsd-386 VM.

sanity check failed after loadlib allsym after GC
go.typelink.[16]uint32  [16]uint32 [{409197984 173 222 1 0 0 0 type.[16]uint32 <nil>}] 0
0x18786000 0x187860a0 0x1863d5f0

The LSym is still the first object on a span, but that's as aligned as it gets, and the bitmap doesn't have any stronger alignment either.

It's still a fairly recently allocated LSym, but the preceding span is something else entirely (it's a 16K span of 3072 byte slots, again with some padding at the end.)

@mdempsky
Contributor Author

Does that crash show the same bitmap corruption as the first one?

@aclements
Member

Does that crash show the same bitmap corruption as the first one?

Yes, exactly the same bitmap corruption (sorry, should have said). Also, as in the first one, this is the only LSym that's corrupted.

@aclements
Member

Still no progress on a solution. At @mdempsky's request, my notes on my hypotheses and the tests I've tried are below; the code for these tests is at https://go-review.googlesource.com/12277. I'll be away until Monday, so I probably won't make much more progress on this before then.

Overwritten byte contains the mark bit. Are we screwing up a mark bit write?

  • What manipulates that byte after allocation?
    • heapBits.setMarked
      • NO: Atomic or (checked assembly); can't clear bits
    • heapBits.setMarkedNonAtomic
      • Only used by sweeping, which SHOULD own the span and never be concurrent with GC.
      • TEST: Try to reproduce with setMarked instead of setMarkedNonAtomic.
        • NO: Still occurred (sample 3).
    • heapBits.setCheckmarked
      • NO: Checkmarking is not enabled, so we don't reach this code
    • heapBits.initCheckmarkSpan
      • NO: Checkmarking is not enabled, so we don't reach this code
    • heapBits.clearCheckmarkSpan
      • NO: Checkmarking is not enabled, so we don't reach this code
  • What else manipulates mark bits?
    • heapBits.initSpan
      • NO: The preceding span was initialized long before the span with the bad bitmap.
        • 0x18770000 is the 2028th LSym in Allsym. The preceding span is LSyms except for 0x1876f290; those LSyms are Allsyms 13989--14033, so they were allocated long before 0x18770000 (since the list goes from newest to oldest). Further, the span of 0x18770000 consists entirely of LSyms, and 0x18770000 would have been the first one allocated from that span (causing the initSpan).
          • 0x1876f290 from the previous span is the only object in that span not on Allsym. It's definitely not an LSym. No pointers.
      • NO: the memclr seems to be right (as does the memclr 386 implementation).
    • heapBitsSetType
      • TEST: Check last byte written.
        • NO: Reproduced twice without this test failing.
    • heapBitsSetTypeGCProg
    • heapBitsSweepSpan
      • Sweep of neighboring span?
        • Unlikely, but not impossible that a sweep of a neighboring span overflowed. I wouldn't expect it to hit that exact byte of the bitmap, since that's not where the step would take it if it went one over.
      • Sweep of this span without proper ownership?
        • heapBitsSweepSpan does do non-atomic writes to this byte.
        • Could happen if object is allocated (bitmap written), but not marked, sweep runs, sees it's not marked, writes a zero to this byte, and then it gets marked.
        • Could happen if sweep reads the mark byte before allocation writes the bitmap, then allocation writes the bitmap, then sweep writes a zero byte back out.
        • Can there be multiple sweepers?
          • No. Every mSpan_Sweep is protected by a cas(&s.sweepgen), so only one sweeper is sweeping a span at a time.
        • Can there be a sweeper and an allocator?
          • TEST: Check sweepgen before and after heapBitsSetType.
            • NO: Reproduced 10 times (!) without this failure. Did this make it easier to trigger?
        • Can there be a sweeper and a GC?
          • TEST: Check sweepgen in greyobject and gcmarknewobject_m.
            • NO: Reproduced twice without this failure.
      • TEST: Check that bitp doesn't change across sweep callback.
        • NO (probably): Reproduced 10 times (!) without this test failing. Did this make it easier to trigger?
      • TEST: Look for the LSym bitmap pattern and panic before writing.
        • ???
  • TEST: Skip allocating the first object in 176 byte spans, put a test pattern there, and check the test pattern when sweeping. (Tests for a stray write, which seems very unlikely, but who knows at this point.)
    • ???

@aclements
Member

@mdempsky, if you're still plugging away at this (I can definitely understand if you're not :), I just uploaded https://go-review.googlesource.com/12249, which records an in-memory log of all intentional bitmap writes that may be to the overwritten byte. It would be great to get a core dump with that CL. I'm going to try, but I'm on a plane and it's hard enough to juggle a bunch of OpenBSD VMs with a wired connection to the corp network, much less a flaky connection from 33,576 feet. :)

@aclements
Member

As another data point, I've been running

GODEBUG=gccheckmark=1 GOTRACEBACK=crash stress -p 1 ./run.bash --no-rebuild

on my linux-amd64 workstation for the past few days. Out of ~1000 runs, I saw 8 failures. 2 were the markflood crash, which I believe is this bug. 1 was a runtime test timeout (maybe bad). 4 were tests that can fail with some probability (fine). 1 was #11640, but I was running off a commit before my possible fix for that went in.

@RLH
Contributor

RLH commented Jul 16, 2015

If one wanted to increase the number of span-start bytes where the corruption can happen, one could reduce the size of a span, thus increasing the number of spans and the chances of hitting the corruption.


@aclements
Member

@RLH, what do you mean by "reduce the size of a span"? Do you mean decrease the 8K page size?

@RLH
Contributor

RLH commented Jul 16, 2015

Yes, make it 1k.


@aclements
Member

Now we're getting somewhere! Using @mdempsky's link5 core dump with PS3 of my CL, I think I've found the interleaving that leads to this. I don't know why this happens yet.

(gdb) p 'runtime.allp'[0].bitmapLogPos%2048
$4 = 443
(gdb) p 'runtime.allp'[1].bitmapLogPos%2048
$5 = 315

(gdb) afind 'runtime.allp'[0].bitmapLog, !(x['addr']==0x184ef7ff)
[1310]={addr = 407828479, entry = 0x500, pre = 196 '\304', post = 0 '\000'}
entry=initSpan, pre=0b11000100 [presumably reusing a span]
[1311]={addr = 407828479, entry = 0x615, pre = 197 '\305', post = 0 '\000'}
entry=heapBitsSweepSpan, pre=0b11000101
[92]={addr = 407828479, entry = 0x297, pre = 0 '\000', post = 16 '\020'}
entry=setMarked, post=0b00010000 (marked, no pointers)

(gdb) afind 'runtime.allp'[1].bitmapLog, !(x['addr']==0x184ef7ff)
[370]={addr = 407828479, entry = 0x500, pre = 0 '\000', post = 0 '\000'}
entry=initSpan
[371]={addr = 407828479, entry = 0x954, pre = 0 '\000', post = 197 '\305'}
entry=heapBitsSetType, post=0b11000101 [correct LSym bitmap, not marked]
[1529]={addr = 407828479, entry = 0x615, pre = 16 '\020', post = 0 '\000'}
entry=heapBitsSweepSpan
[314]={addr = 407828479, entry = 0x297, pre = 0 '\000', post = 16 '\020'}
entry=setMarked

My guess about the global interleaving (I should probably add timestamps to the log...):

P0 [1310]={addr = 407828479, entry = 0x500, pre = 196 '\304', post = 0 '\000'}
entry=initSpan, pre=0b11000100 [presumably reusing a span]

P1 [370]={addr = 407828479, entry = 0x500, pre = 0 '\000', post = 0 '\000'}
entry=initSpan

P1 [371]={addr = 407828479, entry = 0x954, pre = 0 '\000', post = 197 '\305'}
entry=heapBitsSetType, post=0b11000101 [correct LSym bitmap, not marked]

P0 [1311]={addr = 407828479, entry = 0x615, pre = 197 '\305', post = 0 '\000'}
entry=heapBitsSweepSpan, pre=0b11000101

At this point we're toast. Note that the above four events probably happened in quick succession, given their event indexes, and appear to be interleaved.

P0 [92]={addr = 407828479, entry = 0x297, pre = 0 '\000', post = 16 '\020'}
entry=setMarked, post=0b00010000 (marked, no pointers)

P1 [1529]={addr = 407828479, entry = 0x615, pre = 16 '\020', post = 0 '\000'}
entry=heapBitsSweepSpan

P1 [314]={addr = 407828479, entry = 0x297, pre = 0 '\000', post = 16 '\020'}
entry=setMarked [post agrees with current value of byte, and 314 was *just* logged]

@aclements
Member

I'm almost out of battery and distance to my flight's destination. Here's my current brain dump:

P0 [1310]={addr = 407828479, entry = 0x500, pre = 196 '\304', post = 0 '\000'}
entry=initSpan, pre=0b11000100 [presumably reusing a span]

Guess: initSpan called from mSpan_Sweep for large span

P1 [370]={addr = 407828479, entry = 0x500, pre = 0 '\000', post = 0 '\000'}
entry=initSpan

I don't know what this is, but I think it's key. How are we doing two inits of the same span at the same time?

P1 [371]={addr = 407828479, entry = 0x954, pre = 0 '\000', post = 197 '\305'}
entry=heapBitsSetType, post=0b11000101 [correct LSym bitmap, not marked]

Guess: We grabbed this span because we needed to allocate an LSym; now we're allocating it.

P0 [1311]={addr = 407828479, entry = 0x615, pre = 197 '\305', post = 0 '\000'}
entry=heapBitsSweepSpan, pre=0b11000101

Guess: heapBitsSweepSpan in mSpan_Sweep finishing

A possible next step would be to call runtime.Callers (or something) from logWrite and dump the whole darn call stack into the log record so we don't have to guess.
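
A minimal sketch of that idea (logWrite, the record layout, and the buffer size here are hypothetical; only the runtime.Callers / runtime.CallersFrames calls are real APIs):

    package main

    import (
        "fmt"
        "runtime"
    )

    type logRecord struct {
        addr uintptr
        n    int
        pcs  [8]uintptr // truncated call stack of the writer
    }

    var bitmapLog [2048]logRecord
    var logPos int

    // logWrite records which code path touched a given bitmap address.
    func logWrite(addr uintptr) {
        rec := &bitmapLog[logPos%len(bitmapLog)]
        logPos++
        rec.addr = addr
        rec.n = runtime.Callers(2, rec.pcs[:]) // skip Callers and logWrite frames
    }

    func main() {
        logWrite(0x184ef7ff)
        rec := &bitmapLog[0]
        frames := runtime.CallersFrames(rec.pcs[:rec.n])
        for {
            f, more := frames.Next()
            fmt.Println(f.Function)
            if !more {
                break
            }
        }
    }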

@mdempsky
Contributor Author

I'll try to get another coredump with timestamps and caller PCs.

@mwhudson
Contributor

mwhudson commented Jul 16, 2015 via email

@aclements
Member

P0 calls mSpan_Sweep on a large span with an unmarked object on it. mSpan_Sweep calls heapBitsSweepSpan, which invokes the callback for the one object on the span. Because this is a large span, the callback calls initSpan, which clears the bitmap (P0 [1310] above). The callback then calls mHeap_Free. This makes the span available for allocation, but this is too early.

P1 grabs this span from the heap to use for 176 byte objects and allocates an LSym in the first slot (P1 [371]).

P0 returns from the callback to heapBitsSweepSpan. It clears the byte containing the mark, even though this span has been taken over by P1 at this point.

I can think of a few solutions.

  1. mSpan_Sweep could delay the mHeap_Free until after the callback and heapBitsSweepSpan returns.

  2. The mSpan_Sweep callback could return a boolean telling heapBitsSweepSpan that the span has been taken care of and it shouldn't touch the bitmap or do anything else with that span.

I like solution 1 best.
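
Schematically, the bug and solution 1 look something like this (the names echo the runtime's, but these are simplified stand-ins that only illustrate the ordering, not the actual runtime code):

    package main

    import "fmt"

    // Minimal stand-ins so the sketch compiles; not the runtime's real types.
    type mspan struct{ id int }

    func heapBitsSweepSpan(s *mspan, f func()) {
        f()
        // After the callback returns, the sweep still writes to the span's
        // bitmap, clearing the byte that holds the mark bit.
        fmt.Printf("span %d: sweep clears mark byte in bitmap\n", s.id)
    }

    func initSpanBitmap(s *mspan) { fmt.Printf("span %d: bitmap cleared\n", s.id) }
    func mHeap_Free(s *mspan)     { fmt.Printf("span %d: freed to heap; other Ps may reuse it\n", s.id) }

    // Buggy shape: the large-object callback frees the span while
    // heapBitsSweepSpan still has bitmap writes left to do.
    func sweepLargeBuggy(s *mspan) {
        heapBitsSweepSpan(s, func() {
            initSpanBitmap(s)
            mHeap_Free(s) // too early: another P can now allocate from s
        })
    }

    // Solution 1: remember that the span must be freed and do it only after
    // heapBitsSweepSpan (and all of its bitmap writes) have finished.
    func sweepLargeFixed(s *mspan) {
        freeToHeap := false
        heapBitsSweepSpan(s, func() {
            initSpanBitmap(s)
            freeToHeap = true
        })
        if freeToHeap {
            mHeap_Free(s) // safe: no further bitmap writes for this span
        }
    }

    func main() {
        sweepLargeBuggy(&mspan{id: 1})
        sweepLargeFixed(&mspan{id: 2})
    }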

@aclements
Member

@mdempsky
Contributor Author

@aclements Testing.

Also, an hour is too short to say anything definitive, but after adding cputicks() and Callers() instrumentation to writeLog(), I didn't get any more repros. I'm worried the probe effect (especially from Callers) is too much, so I've reverted those changes for now. I.e., I'm just testing 12249 PS3 + 12320 PS1 for now to see if I get any more crashes.

@gopherbot
Contributor

CL https://golang.org/cl/12320 mentions this issue.

@aclements
Member

On Jul 16, 2015, Michael Hudson-Doyle wrote:

A couple of peanut gallery questions I've been wondering about: do we have any idea why this happens more often on openbsd? The only thing I know that's odd about it in this context is that there is no platform support for thread local storage, but I can't see how that matters with an internally linked program like cmd/link.

It's a good question. My guess is that it has to do with thread scheduling, but that's just a guess.

I bet we could trigger this much more (and on other platforms) by adding a usleep(10) or something just after the mHeap_Free.

The other question is, does this happen with GOMAXPROCS=1?

I don't know. It would be good to confirm it doesn't happen with GOMAXPROCS=1, though it's hard enough to reproduce that if it didn't happen you wouldn't really be sure it wasn't going to happen.

@mdempsky
Contributor Author

Three hours and no failures of any sort so far. Not even the occasional segfaults or runaway CPU usage issues I was sporadically seeing while trying to reproduce the "invalid relocation" errors.

I'll continue running overnight (and then through the weekend too), but so far it seems promising.

@mdempsky
Contributor Author

Another 16 hours of testing, and still no failures. At this point, I think it's reasonable to assume @aclements's patch fixed the issue.

I'll keep the tests going over the weekend, but unless I see any new failures, I won't spam with further updates.

@aclements
Member

Great!

Thanks for all the help and for sticking it through!

@gopherbot
Contributor

CL https://golang.org/cl/12451 mentions this issue.

mk0x9 pushed a commit to mk0x9/go that referenced this issue Sep 14, 2015
This is a cleanup following cc8f544, which was a minimal change to fix
issue golang#11617. This consolidates the two places in mSpan_Sweep that
update sweepgen. Previously this was necessary because sweepgen must
be updated before freeing the span, but we freed large spans early.
Now we free large spans later, so there's no need to duplicate the
sweepgen update. This also means large spans can take advantage of the
sweepgen sanity checking performed for other spans.

Change-Id: I23b79dbd9ec81d08575cd307cdc0fa6b20831768
Reviewed-on: https://go-review.googlesource.com/12451
Reviewed-by: Rick Hudson <[email protected]>
@golang locked and limited conversation to collaborators on Jul 20, 2016