runtime: SIGILL: illegal instruction from runtime.deductSweepCredit #29862
Comments
Which code? How do we run it? Could you show us a small self-contained program that reproduces the issue? That will help us debug it further. Have you tried 1.12beta2 to see if the problem exists there too? Thanks.
Right now there is nothing in this crash stack trace that gives me hints for creating a self-contained snippet that reproduces this. The project is compiled as a static Go binary running inside a pod in Kubernetes (environment details specified). The problem happened once; if it starts repeating, we will experiment with 1.12 and see whether this is specific to 1.11.4. I would like to understand why the following Go runtime instruction is reported as an illegal instruction (full disassembly provided).
It may be possible that there is memory corruption somewhere. Please also try running a race-enabled binary to ensure there is no race anywhere. Also, is that the entire stack trace? There should also be some user-level code in it. I will page some folks who can shed some light: @ianlancetaylor @randall77 @aclements.
Full stack trace. https://github.com/golang/go/files/2782346/sigill.log
==
SIGILL: illegal instruction goroutine 0 [idle]: goroutine 444 [running]: goroutine 1 [IO wait, 16 minutes]: goroutine 18 [syscall, 16 minutes]: goroutine 4 [chan receive, 16 minutes]: goroutine 5 [chan receive]: goroutine 6 [sleep]: goroutine 10 [select, 16 minutes]: goroutine 130 [select, 6 minutes]: goroutine 12 [select, 16 minutes]: goroutine 13 [select, 1 minutes]: goroutine 14 [select, 16 minutes]: goroutine 15 [chan receive, 16 minutes]: goroutine 23 [IO wait, 16 minutes]: goroutine 53 [IO wait]: goroutine 35 [select, 16 minutes]: goroutine 43 [IO wait, 16 minutes]: goroutine 44 [select, 16 minutes]: goroutine 46 [select, 1 minutes]: goroutine 180 [select, 16 minutes]: goroutine 24 [sleep]: goroutine 63 [sleep]: goroutine 128 [sleep, 1 minutes]: goroutine 140 [select, 6 minutes]: goroutine 141 [chan receive, 16 minutes]: goroutine 127 [chan receive, 16 minutes]: goroutine 194 [chan receive, 16 minutes]: goroutine 191 [select, 16 minutes]: goroutine 70 [IO wait]: goroutine 121 [chan receive, 6 minutes]: goroutine 134 [chan receive, 16 minutes]: goroutine 441 [chan receive, 15 minutes]: goroutine 193 [sleep]: goroutine 113 [sleep]: goroutine 69 [sleep]: goroutine 133 [chan receive, 1 minutes]: goroutine 64 [IO wait]: goroutine 182 [select, 16 minutes]: goroutine 132 [chan receive, 16 minutes]: goroutine 120 [chan receive, 6 minutes]: goroutine 114 [chan receive, 16 minutes]: goroutine 131 [chan receive, 1 minutes]: goroutine 190 [IO wait, 16 minutes]: goroutine 179 [chan receive, 16 minutes]: goroutine 185 [select, 16 minutes]: goroutine 195 [sleep, 1 minutes]: goroutine 184 [select]: goroutine 181 [select]: goroutine 183 [chan receive, 16 minutes]: goroutine 119 [sleep]: goroutine 177 [chan receive, 1 minutes]: goroutine 178 [chan receive, 16 minutes]: goroutine 305 [IO wait]: goroutine 142 [chan receive, 16 minutes]: goroutine 143 [chan receive, 16 minutes]: goroutine 144 [sleep, 1 minutes]: goroutine 145 [chan receive, 1 minutes]: goroutine 125 [chan 
receive, 16 minutes]: goroutine 122 [chan receive, 6 minutes]: goroutine 126 [chan receive, 15 minutes]: goroutine 151 [select, 6 minutes]: goroutine 152 [select, 15 minutes]: goroutine 153 [chan receive, 16 minutes]: goroutine 154 [chan receive, 15 minutes]: goroutine 155 [chan receive, 16 minutes]: goroutine 124 [select, 15 minutes]: goroutine 123 [chan receive, 6 minutes]: goroutine 162 [select, 6 minutes]: goroutine 169 [runnable]: goroutine 160 [select]: goroutine 453 [chan receive]: goroutine 457 [chan receive]: goroutine 196 [chan receive, 16 minutes]: goroutine 197 [select, 16 minutes]: goroutine 198 [select]: goroutine 199 [select, 16 minutes]: goroutine 200 [chan receive, 16 minutes]: goroutine 201 [select, 16 minutes]: goroutine 217 [IO wait]: goroutine 2547 [select, 1 minutes]: goroutine 171 [chan receive]: goroutine 498 [select, 15 minutes]: goroutine 288 [IO wait]: goroutine 242 [select, 16 minutes]: goroutine 211 [select, 16 minutes]: goroutine 212 [select]: goroutine 235 [select, 16 minutes]: goroutine 214 [select, 16 minutes]: goroutine 215 [select]: goroutine 218 [select]: goroutine 225 [chan receive, 16 minutes]: goroutine 220 [sleep, 6 minutes]: goroutine 446 [select, 15 minutes]: goroutine 295 [runnable]: goroutine 299 [IO wait]: goroutine 226 [select, 16 minutes]: goroutine 227 [select]: goroutine 228 [select, 16 minutes]: goroutine 229 [chan receive, 16 minutes]: goroutine 230 [select, 16 minutes]: goroutine 234 [IO wait, 16 minutes]: goroutine 296 [runnable]: goroutine 237 [select, 16 minutes]: goroutine 238 [select, 16 minutes]: goroutine 241 [select, 16 minutes]: goroutine 240 [select, 16 minutes]: goroutine 243 [sleep, 1 minutes]: goroutine 415 [IO wait]: goroutine 442 [chan receive]: goroutine 461 [chan receive]: goroutine 459 [select]: goroutine 447 [chan receive, 15 minutes]: goroutine 483 [select, 15 minutes]: goroutine 491 [select]: goroutine 1046 [chan receive, 11 minutes]: goroutine 497 [chan receive]: goroutine 494 [chan receive]: 
goroutine 493 [chan receive, 15 minutes]: goroutine 486 [select, 15 minutes]: goroutine 485 [chan receive]: goroutine 488 [select]: goroutine 490 [chan receive]: goroutine 492 [select]: goroutine 500 [select, 1 minutes]: goroutine 439 [select, 15 minutes]: goroutine 438 [select]: goroutine 443 [runnable]: goroutine 460 [chan receive, 15 minutes]: goroutine 489 [chan receive, 15 minutes]: goroutine 452 [chan receive, 15 minutes]: goroutine 484 [chan receive, 15 minutes]: goroutine 440 [select, 15 minutes]: goroutine 448 [chan receive]: goroutine 458 [select]: goroutine 456 [chan receive, 15 minutes]: goroutine 455 [select]: goroutine 495 [select, 15 minutes]: goroutine 454 [select]: goroutine 482 [select]: goroutine 481 [select]: goroutine 445 [select, 15 minutes]: goroutine 487 [select, 1 minutes]: goroutine 499 [select, 15 minutes]: goroutine 496 [chan receive, 15 minutes]: goroutine 464 [chan receive, 15 minutes]: goroutine 605 [select, 14 minutes]: goroutine 501 [select]: goroutine 1062 [chan receive, 1 minutes]: goroutine 465 [chan receive]: goroutine 416 [IO wait]: goroutine 462 [select]: goroutine 301 [IO wait]: goroutine 1814 [chan receive, 6 minutes]: goroutine 425 [select]: goroutine 466 [select, 15 minutes]: goroutine 629 [select]: goroutine 463 [select]: goroutine 393 [IO wait]: goroutine 300 [IO wait]: goroutine 2540 [chan receive, 1 minutes]: goroutine 2539 [chan receive, 1 minutes]: goroutine 2546 [IO wait, 1 minutes]: goroutine 1053 [chan receive, 1 minutes]: goroutine 1049 [chan receive, 1 minutes]: goroutine 1078 [chan receive, 6 minutes]: goroutine 1054 [chan receive, 11 minutes]: goroutine 1080 [select, 1 minutes]: goroutine 1048 [select, 1 minutes]: goroutine 628 [select, 14 minutes]: goroutine 1081 [chan receive, 1 minutes]: goroutine 1813 [chan receive, 6 minutes]: goroutine 606 [select]: goroutine 1045 [chan receive, 1 minutes]: goroutine 2486 [chan receive, 1 minutes]: goroutine 1056 [select, 6 minutes]: goroutine 1073 [chan receive, 6 
minutes]: goroutine 1061 [chan receive, 6 minutes]: goroutine 1806 [chan receive, 6 minutes]: rax 0x2000 |
I have to agree that a |
The text/code section cannot be corrupted by user code, right? Are you suggesting that a hardware bit-flip type of corruption must have caused the SIGILL?
Yes, I am suggesting a hardware problem. Of course if it happens on a different machine then my suggestion is almost certainly wrong. But a single occurrence of |
Thanks. I've asked the hosting provider if they provide any guarantees on hardware corruption (using ECC or other mechanisms). |
The hosting provider has yet to get back to us about the memory corruption. But this time a different program running in the same environment (on a different hardware node and VM) has crashed with SIGSEGV around the same location. If this is the same problem as before, it does not point towards hardware corruption.
Stack trace:
2019-01-23T09:04:07.309761319Z fatal error: unexpected signal during runtime execution
Assembly at PC:
(gdb) disassemble 0x420c60
Full stack trace attached below.
SIGILL at the same location on a different Node+VM happened again. 2019-01-23T12:16:06.202968107Z SIGILL: illegal instruction |
Thanks, that's interesting. I have no idea what could cause this. The instruction looks fine, and presumably the same instruction is executed many times in different functions, so why a Is there anything in the kernel logs? The sigcode value of In the stack trace You said it was a static Go binary. Is there any C code in the program, or is it pure Go? Approximately how often does this happen? That is, you've seen it twice or maybe three times; how many times is your program running in that time period? I gather your program is running inside a VM? Do you have any information about that? |
Thanks for looking into this.
3 times in 2 days, but approximately how many instances of the program were run during those 2 days? |
CC @aclements |
2019-01-25T09:20:06.904573082Z fatal error: unexpected signal during runtime execution |
No significant update from hosting provider yet. Several crashes have happened in the past few days. One being "kubectl" command (which is not built or modified by us). Latest instance is failure of "kubectl" command (trace below). 2019-01-28T02:48:05.901242067Z SIGILL: illegal instruction |
Occurrence of these issues stopped after change of hardware node on which these crashes were happening. Hosting provider has not provided any technical details around what might have caused the issue (ticket still pending with them). This golang git issue can be closed. /close |
Thanks for following up. Sorry about your hardware. |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes, go 1.11.4. But it happened only once.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
Nothing specific was being done. This code has been running unmodified for a few days without any issue.
What did you expect to see?
No panic
What did you see instead?
SIGILL: illegal instruction
PC=0x420c60 m=8 sigcode=2
goroutine 0 [idle]:
runtime.deductSweepCredit(0x2000, 0x0)
/usr/local/go1.11.4/go/src/runtime/mgcsweep.go:390
runtime.(*mcentral).cacheSpan(0x1385618, 0xc0005cfca8)
/usr/local/go1.11.4/go/src/runtime/mcentral.go:43 +0x60
runtime.(*mcache).refill(0x7f95da4e3000, 0xc0005cfc11)
/usr/local/go1.11.4/go/src/runtime/mcache.go:122 +0x95
runtime.(*mcache).nextFree.func1()
/usr/local/go1.11.4/go/src/runtime/malloc.go:749 +0x32
runtime.systemstack(0x0)
/usr/local/go1.11.4/go/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart()
/usr/local/go1.11.4/go/src/runtime/proc.go:1229
goroutine 444 [running]:
runtime.systemstack_switch()
/usr/local/go1.11.4/go/src/runtime/asm_amd64.s:311 fp=0xc0005cf9a0 sp=0xc0005cf998 pc=0x4581d0
runtime.(*mcache).nextFree(0x7f95da4e3000, 0x11, 0x1000100, 0xc0005cfa40, 0x40a03d)
/usr/local/go1.11.4/go/src/runtime/malloc.go:748 +0xb6 fp=0xc0005cf9f8 sp=0xc0005cf9a0 pc=0x40b756
runtime.mallocgc(0x70, 0xb96800, 0x1, 0x0)
/usr/local/go1.11.4/go/src/runtime/malloc.go:903 +0x793 fp=0xc0005cfa98 sp=0xc0005cf9f8 pc=0x40c0a3
runtime.makeslice(0xb96800, 0x65, 0x65, 0x0, 0x0, 0xa891ff)
...
rax 0x2000
rbx 0x7f95d827c350
rcx 0x8
rdx 0x440
rdi 0x4553c0
rsi 0x137ddc0
rbp 0xc000399f70
rsp 0xc000399f30
r8 0x1
r9 0x11
r10 0x7f95d827c350
r11 0x7fffffffffffff
r12 0x0
r13 0x49
r14 0x49
r15 0x49
rip 0x420c60
rflags 0x10206
cs 0x33
fs 0x0
gs 0x0
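The `sigcode=2` field in the crash header is the signal's `si_code`. As a small sketch for reading it, the table below lists the `SIGILL` codes defined by Linux in its `siginfo.h` header; the helper name `describeSigill` is hypothetical and is not part of the Go runtime.

```go
package main

import "fmt"

// sigillCodes maps Linux si_code values for SIGILL to their symbolic
// names (from the kernel's siginfo.h).
var sigillCodes = map[int]string{
	1: "ILL_ILLOPC", // illegal opcode
	2: "ILL_ILLOPN", // illegal operand
	3: "ILL_ILLADR", // illegal addressing mode
	4: "ILL_ILLTRP", // illegal trap
	5: "ILL_PRVOPC", // privileged opcode
	6: "ILL_PRVREG", // privileged register
	7: "ILL_COPROC", // coprocessor error
	8: "ILL_BADSTK", // internal stack error
}

// describeSigill returns the symbolic name for a SIGILL si_code,
// or a placeholder for values that are not in the table.
func describeSigill(code int) string {
	if name, ok := sigillCodes[code]; ok {
		return name
	}
	return fmt.Sprintf("unknown(%d)", code)
}

func main() {
	// sigcode=2 is the value printed in the crash header above.
	fmt.Println(describeSigill(2))
}
```

For the value in this report the program prints `ILL_ILLOPN`, the "illegal operand" code.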
Processor
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
microcode : 0x1
cpu MHz : 2200.000
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni
pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes
xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch
invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle
avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 4400.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Runtime
Kubernetes 1.11.6 on Google Container-Optimized OS (COS).
docker version
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.10.3
Git commit: f5ec1e2
Built: Thu Oct 25 10:42:32 2018
OS/Arch: linux/amd64
Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.10.3
Git commit: f5ec1e2
Built: Thu Oct 25 10:42:32 2018
OS/Arch: linux/amd64
Experimental: false
Assembly at the PC
(gdb) disassemble 0x420c60
Dump of assembler code for function runtime.deductSweepCredit:
0x0000000000420c60 <+0>: mov %fs:0xfffffffffffffff8,%rcx
0x0000000000420c69 <+9>: cmp 0x10(%rcx),%rsp
0x0000000000420c6d <+13>: jbe 0x420dc8 <runtime.deductSweepCredit+360>
0x0000000000420c73 <+19>: sub $0x20,%rsp
0x0000000000420c77 <+23>: mov %rbp,0x18(%rsp)
0x0000000000420c7c <+28>: lea 0x18(%rsp),%rbp
0x0000000000420c81 <+33>: movsd 0xf642cf(%rip),%xmm0 # 0x1384f58 <runtime.mheap_+4280>
0x0000000000420c89 <+41>: xorps %xmm1,%xmm1
0x0000000000420c8c <+44>: ucomisd %xmm1,%xmm0
0x0000000000420c90 <+48>: jne 0x420c98 <runtime.deductSweepCredit+56>
0x0000000000420c92 <+50>: jnp 0x420dbe <runtime.deductSweepCredit+350>
0x0000000000420c98 <+56>: lea 0xf6a751(%rip),%rax # 0x138b3f0 <runtime.trace+16>
0x0000000000420c9f <+63>: cmpb $0x0,(%rax)
0x0000000000420ca2 <+66>: jne 0x420daa <runtime.deductSweepCredit+330>
0x0000000000420ca8 <+72>: mov 0x28(%rsp),%rcx
0x0000000000420cad <+77>: mov 0x30(%rsp),%rdx
0x0000000000420cb2 <+82>: jmp 0x420d1c <runtime.deductSweepCredit+188>
0x0000000000420cb4 <+84>: lea 0xf6a735(%rip),%rax # 0x138b3f0 <runtime.trace+16>
0x0000000000420cbb <+91>: mov 0x28(%rsp),%rcx
0x0000000000420cc0 <+96>: mov 0x30(%rsp),%rdx
0x0000000000420cc5 <+101>: mov 0x8(%rsp),%rbx
0x0000000000420cca <+106>: mov 0x10(%rsp),%rsi
0x0000000000420ccf <+111>: xorps %xmm1,%xmm1
0x0000000000420cd2 <+114>: mov 0xf64267(%rip),%rdi # 0x1384f40 <runtime.mheap_+4256>
0x0000000000420cd9 <+121>: sub %rbx,%rdi
0x0000000000420cdc <+124>: cmp %rdi,%rsi
0x0000000000420cdf <+127>: jle 0x420d72 <runtime.deductSweepCredit+274>
0x0000000000420ce5 <+133>: callq 0x4201c0 <runtime.gosweepone>
0x0000000000420cea <+138>: cmpq $0xffffffffffffffff,(%rsp)
0x0000000000420cef <+143>: je 0x420d67 <runtime.deductSweepCredit+263>
0x0000000000420cf1 <+145>: mov 0xf64250(%rip),%rax # 0x1384f48 <runtime.mheap_+4264>
0x0000000000420cf8 <+152>: mov 0x8(%rsp),%rcx
0x0000000000420cfd <+157>: cmp %rcx,%rax
0x0000000000420d00 <+160>: je 0x420cb4 <runtime.deductSweepCredit+84>