MTT master: Segmentation faults in various Fortran_20.0_32_CentOS6.10 basic tests #7627
@jsquyres Boris needs to see some debug info from Absoft - can you put him in contact with somebody there? We are unable to reproduce the problem, even when using the same OS version. |
Hey @cagoelz -- can you help out here? We're getting some errors in the MTT Absoft runs, but no one is able to reproduce them. |
I will take a look and see what I can find out.
|
Here are stack traces for c_hello and c_ring from the core files left behind in
scratches/2020-04-15/installs/ompi-nightly-master--absoft--master-202004150242-708f945/tests/trivial/test_get__trivial
```
$ file core.29236
core.29236: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from './c_hello', real uid: 500, effective uid: 500,
real gid: 500, effective gid: 500, execfn: './c_hello', platform: 'i686'
$ gdb ./c_hello core.29236
(gdb) bt
#0 0x007dc85a in ?? ()
#1 0x0041ecaa in OPAL_MCA_PMIX4X_pmix_pshmem_base_select ()
at base/pshmem_base_select.c:89
#2 0x004c691a in pmix_common_dstor_init (ds_name=0xf98964 "ds12", info=0x0,
ninfo=0, lock_cb=0xf99d00, file_cb=0xf99e60) at dstore_base.c:1590
#3 0x00f972ac in ds12_init (info=0x0, ninfo=0) at gds_ds12_base.c:39
#4 0x00325c05 in OPAL_MCA_PMIX4X_pmix_gds_base_select (info=0x0, ninfo=0)
at base/gds_base_select.c:86
#5 0x003930f6 in OPAL_MCA_PMIX4X_pmix_rte_init (type=1, info=0x0, ninfo=0,
cbfunc=0x3473f7 <pmix_client_notify_recv>) at runtime/pmix_init.c:396
#6 0x0034980d in OPAL_MCA_PMIX4X_PMIx_Init (proc=0xafbc28, info=0x0, ninfo=0)
at client/pmix_client.c:561
#7 0x009e448e in ompi_rte_init (pargc=0xbfc31350, pargv=0xbfc31354)
at runtime/ompi_rte.c:552
#8 0x00ac6f5b in ompi_mpi_init (argc=1, argv=0xbfc31464, requested=0,
provided=0xbfc31370, reinit_ok=false) at runtime/ompi_mpi_init.c:508
#9 0x00a30a80 in PMPI_Init (argc=0xbfc313c0, argv=0xbfc313c4) at pinit.c:67
#10 0x080485ff in main ()
```
```
$ file core.29113
core.29113: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from './c_ring', real uid: 500, effective uid: 500,
real gid: 500, effective gid: 500, execfn: './c_ring', platform: 'i686'
$ gdb ./c_ring core.29113
(gdb) bt
#0 0x0064485a in ?? ()
#1 0x00f6ecaa in OPAL_MCA_PMIX4X_pmix_pshmem_base_select ()
at base/pshmem_base_select.c:89
#2 0x0023891a in pmix_common_dstor_init (ds_name=0x1ca964 "ds12", info=0x0,
ninfo=0, lock_cb=0x1cbd00, file_cb=0x1cbe60) at dstore_base.c:1590
#3 0x001c92ac in ds12_init (info=0x0, ninfo=0) at gds_ds12_base.c:39
#4 0x00e75c05 in OPAL_MCA_PMIX4X_pmix_gds_base_select (info=0x0, ninfo=0)
at base/gds_base_select.c:86
#5 0x00ee30f6 in OPAL_MCA_PMIX4X_pmix_rte_init (type=1, info=0x0, ninfo=0,
cbfunc=0xe973f7 <pmix_client_notify_recv>) at runtime/pmix_init.c:396
#6 0x00e9980d in OPAL_MCA_PMIX4X_PMIx_Init (proc=0x80bc28, info=0x0, ninfo=0)
at client/pmix_client.c:561
#7 0x006f448e in ompi_rte_init (pargc=0xbfb50450, pargv=0xbfb50454)
at runtime/ompi_rte.c:552
#8 0x007d6f5b in ompi_mpi_init (argc=1, argv=0xbfb505a4, requested=0,
provided=0xbfb50470, reinit_ok=false) at runtime/ompi_mpi_init.c:508
#9 0x00740a80 in PMPI_Init (argc=0xbfb50500, argv=0xbfb50504) at pinit.c:67
#10 0x080486d7 in main ()
```
Anything else I can provide?
|
Possibly exceeding the default stack limit? Working in scratches/2020-04-15/installs/ompi-nightly-master--absoft--master-202004150242-708f945/tests/trivial/test_get__trivial
|
Has to be something like stack corruption as it otherwise makes no sense. |
That's an awfully short stack limit - is there a reason for it? I don't believe that is what comes with CentOS by default - is it? |
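For anyone wanting to check the stack limit from inside a test process rather than from the shell, here is a minimal sketch using the POSIX `getrlimit` call. The helper names `query_stack_limit` and `print_stack_limit` are made up for illustration and are not part of OMPI or PMIx.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft stack limit of the current process.
 * Returns 0 on success and stores the limit (in bytes) in *soft_out;
 * RLIM_INFINITY means "unlimited". Returns -1 if getrlimit fails. */
int query_stack_limit(rlim_t *soft_out)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        return -1;
    }
    *soft_out = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: print the limit in KiB, the unit `ulimit -s` uses. */
void print_stack_limit(void)
{
    rlim_t soft;
    if (query_stack_limit(&soft) != 0) {
        perror("getrlimit");
        return;
    }
    if (soft == RLIM_INFINITY) {
        printf("stack limit: unlimited\n");
    } else {
        printf("stack limit: %llu KiB\n", (unsigned long long)(soft / 1024));
    }
}
```

Calling `print_stack_limit()` at the top of `main()` in a test such as c_hello would show the limit the process actually inherited, which may differ from what an interactive shell reports.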
I did not do anything special when creating our 32 bit test bed, just a stock install with the extra packages for development. I just checked a 64 bit version of CentOS 6.10 that I set up a |
Just to clarify: are all the failures on 32-bit test beds? |
Yes, all of the failures at Absoft are on the 32-bit test bed. |
According to the backtrace it fails at pmix_pshmem.finalize(), line 89:
    85 if (best_pri < priority) {
    86     best_pri = priority;
    87     /* give any prior module a chance to finalize */
    88     if (NULL != pmix_pshmem.finalize) {
    89         pmix_pshmem.finalize();
    90     }
    91     pmix_pshmem = *nmodule;
    92     inserted = true;
    93 }
In the case that we see, the
I am not able to reproduce this in my VirtualBox with the same 32-bit OS, even setup |
We are running the 32-bit test bed on VMWare 6.5, but that is probably not significant. The only other thing I can add is that anything that changes the memory layout causes the seg fault to go away. Examples: loading c_ring into gdb, running it under valgrind, setting LD_USE_LOAD_BIAS in the environment. |
And I noted that all the problems went away with last night's tarball. So it does seem to be something overwriting the
@karasevb Perhaps valgrind could tell us something? I'm not entirely sure how to approach this one.
@cagoelz IIRC, you guys have a Fortran compiler, but use standard gcc for building the C portions of OMPI? Can you tell us what C compiler and version is being used? |
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23) |
@karasevb let's try to use address sanitizer in llvm. |
What about compiling with Note - the above won't help with an issue with a global variable, but will for locally declared variables. If you have time, it might be worth a shot, nonetheless. |
@cagoelz You mentioned that you are running VMWare for this? I'm wondering if you could set up an instance with the OMPI hash that fails, and then provide me with a copy of that VMWare image? I have VMWare here as well and can perhaps debug this more easily (and without bothering you so much with it). |
I am happy to attempt this but keep in mind that I am far from a VMware expert. I know how to create the machines we need for testing but that's about it. We are using the free ESXi 6.5 with the web interface. Do you want me to just create the smallest possible virtual disk with the required install of CentOS and the OMPI files from 04-15-2020 or is there some magic I don't know about that will make transferring everything to you easier? |
We haven't been able to reproduce the problem even with a VM containing CentOS and that OMPI hash that we construct, so I think what we need is to have you do the VM build just as you normally do, then verify that OMPI fails and send us that complete image. Have to figure out where to post it so we can retrieve it as GitHub won't be able to handle something of that size. I do have a dropbox we could possibly use (depending on the final size), or maybe somebody can suggest an alternative? |
Not seeing this any more - chalking it up to something odd over there. Can revisit it if/when it recurs. |
https://mtt.open-mpi.org/index.php?do_redir=3224
c_hello and c_ring are both segv'ing: