MTT master: Segmentation faults in various Fortran_20.0_32_CentOS6.10 basic tests #7627

Closed
awlauria opened this issue Apr 14, 2020 · 21 comments


@awlauria
Contributor

awlauria commented Apr 14, 2020

https://mtt.open-mpi.org/index.php?do_redir=3224

c_hello and c_ring are both segv'ing:

[ompi32:22230] *** Process received signal ***
[ompi32:22230] Signal: Segmentation fault (11)
[ompi32:22230] Signal code: Address not mapped (1)
[ompi32:22230] Failing at address: 0x86985a
[ompi32:22230] [ 0] [0xc4440c]
[ompi32:22230] [ 1] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libmca_common_dstore.so.0(pmix_common_dstor_init+0x156)[0x40791a]
[ompi32:22230] [ 2] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/pmix/mca_gds_ds12.so(+0x12ac)[0x4122ac]
[ompi32:22230] [ 3] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libopen-pal.so.0(OPAL_MCA_PMIX4X_pmix_gds_base_select+0x1de)[0x6dac05]
[ompi32:22230] [ 4] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libopen-pal.so.0(OPAL_MCA_PMIX4X_pmix_rte_init+0x141b)[0x7480f6]
[ompi32:22230] [ 5] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libopen-pal.so.0(OPAL_MCA_PMIX4X_PMIx_Init+0x238)[0x6fe80d]
[ompi32:22230] [ 6] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libmpi.so.0(ompi_rte_init+0x123)[0x47f48e]
[ompi32:22230] [ 7] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libmpi.so.0(ompi_mpi_init+0x2c6)[0x561f5b]
[ompi32:22230] [ 8] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libmpi.so.0(MPI_Init+0x89)[0x4cba80]
[ompi32:22230] [ 9] ./c_hello[0x80485ff]
[ompi32:22230] [10] /lib/libc.so.6(__libc_start_main+0xe8)[0x20ed28]
[ompi32:22230] [11] ./c_hello[0x8048551]
[ompi32:22230] *** End of error message ***
[ompi32:22229] *** Process received signal ***
[ompi32:22229] Signal: Segmentation fault (11)
[ompi32:22229] Signal code: Address not mapped (1)
[ompi32:22229] Failing at address: 0xa4d85a
[ompi32:22229] [ 0] [0xb0340c]
[ompi32:22229] [ 1] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libmca_common_dstore.so.0(pmix_common_dstor_init+0x156)[0xe3091a]
[ompi32:22229] [ 2] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/pmix/mca_gds_ds12.so(+0x12ac)[0xdb52ac]
[ompi32:22229] [ 3] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libopen-pal.so.0(OPAL_MCA_PMIX4X_pmix_gds_base_select+0x1de)[0x8cfc05]
[ompi32:22229] [ 4] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libopen-pal.so.0(OPAL_MCA_PMIX4X_pmix_rte_init+0x141b)[0x93d0f6]
[ompi32:22229] [ 5] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libopen-pal.so.0(OPAL_MCA_PMIX4X_PMIx_Init+0x238)[0x8f380d]
[ompi32:22229] [ 6] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libmpi.so.0(ompi_rte_init+0x123)[0x5bf48e]
[ompi32:22229] [ 7] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libmpi.so.0(ompi_mpi_init+0x2c6)[0x6a1f5b]
[ompi32:22229] [ 8] /home/ompitest/scratches/2020-04-14/installs/tki8/install/lib/libmpi.so.0(MPI_Init+0x89)[0x60ba80]
[ompi32:22229] [ 9] ./c_hello[0x80485ff]
[ompi32:22229] [10] /lib/libc.so.6(__libc_start_main+0xe8)[0xb9cd28]
[ompi32:22229] [11] ./c_hello[0x8048551]
[ompi32:22229] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ompi32 exited on signal 11 (Segmentation fault).
@rhc54
Contributor

rhc54 commented Apr 14, 2020

@jsquyres Boris needs to see some debug info from Absoft - can you put him in contact with somebody there? We are unable to reproduce the problem, even when using the same OS version.

@rhc54
Contributor

rhc54 commented Apr 15, 2020

Refs openpmix/openpmix#1720

@jsquyres
Member

Hey @cagoelz -- can you help out here? We're getting some errors in the MTT Absoft runs, but no one is able to reproduce them.

@cagoelz

cagoelz commented Apr 15, 2020 via email

@cagoelz

cagoelz commented Apr 15, 2020 via email

@cagoelz

cagoelz commented Apr 15, 2020

Possibly exceeding the default stack limit?

Working in scratches/2020-04-15/installs/ompi-nightly-master--absoft--master-202004150242-708f945/tests/trivial/test_get__trivial

$ source ../../../mpi_installed_vars.sh 
$ ulimit -s
10240
$ mpiexec -n 2 ./c_hello 
[ompi32:29858] *** Process received signal ***
[ompi32:29858] Signal: Segmentation fault (11)
[ompi32:29858] Signal code: Address not mapped (1)
[ompi32:29858] Failing at address: 0xab985a
   (output snipped)

$ ulimit -s unlimited
$ mpiexec -n 2 ./c_hello 
Hello, C world!  I am 0 of 2
Hello, C world!  I am 1 of 2
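
For anyone poking at this outside the MTT harness, a minimal sketch (not part of the test suite; all names here are mine) that prints the current stack limit and raises the soft limit to the hard limit before MPI_Init, roughly the in-process equivalent of the "ulimit -s unlimited" step above:

/* Minimal sketch, not an MTT test: report and raise the stack soft limit
 * before MPI_Init (bounded by the hard limit). */
#include <stdio.h>
#include <sys/resource.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    struct rlimit rl;

    if (0 == getrlimit(RLIMIT_STACK, &rl)) {
        printf("stack limit: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
        rl.rlim_cur = rl.rlim_max;      /* raise the soft limit as far as allowed */
        if (0 != setrlimit(RLIMIT_STACK, &rl)) {
            perror("setrlimit");
        }
    }

    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}

Built with mpicc and run with mpiexec -n 2 as above, this should (if the 10240k soft limit really is the trigger) behave like the ulimit -s unlimited case rather than the default one.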

@rhc54
Contributor

rhc54 commented Apr 15, 2020

Has to be something like stack corruption as it otherwise makes no sense.

@rhc54
Contributor

rhc54 commented Apr 15, 2020

That's an awfully short stack limit - is there a reason for it? I don't believe that is what comes with CentOS by default - is it?

@cagoelz

cagoelz commented Apr 15, 2020

I did not do anything special when creating our 32-bit test bed, just a stock install with the extra packages for development. I just checked a 64-bit version of CentOS 6.10 that I set up a week or so ago and it also has 10240k as the default stack limit. Out of curiosity, I checked on a 64-bit RHEL 8 system and the default stack limit there is 8192k.

@rhc54
Contributor

rhc54 commented Apr 15, 2020

Just to clarify: are all the failures on 32-bit test beds?

@cagoelz

cagoelz commented Apr 16, 2020

Yes, all of the failures at Absoft are on the 32-bit test bed.

@karasevb
Member

According to the backtrace it fails at the pmix_pshmem.finalize() call on line 89:

85        if (best_pri < priority) {
86            best_pri = priority;
87            /* give any prior module a chance to finalize */
88            if (NULL != pmix_pshmem.finalize) {
89                pmix_pshmem.finalize(); 
90            }
91            pmix_pshmem = *nmodule;
92            inserted = true;
93        }

In the case we are seeing, the pmix_pshmem.finalize pointer should be NULL. It looks like the finalize pointer has been corrupted and points to an unexpected address.

I am not able to reproduce this in my VirtualBox VM with the same 32-bit OS; even with ulimit -s 32 it works fine.
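
To illustrate the failure mode being suggested here, a self-contained sketch (not PMIx code; all names are invented): an out-of-bounds write into one global can land on a neighbouring module struct's function pointer, and whether it does depends entirely on how the linker happened to lay the globals out, which would explain a crash that shows up on one build but not another:

/* Illustrative sketch only -- not PMIx code. An overrun of one global can
 * clobber the function pointer in a neighbouring global struct; whether it
 * actually does depends on the link-time layout. */
#include <stdio.h>
#include <string.h>

typedef struct {
    void (*finalize)(void);
} module_t;

static char scratch[16];
static module_t active_module;      /* stand-in for pmix_pshmem */

static void real_finalize(void) { puts("finalize called"); }

int main(void)
{
    active_module.finalize = real_finalize;

    /* Deliberate overrun of 'scratch'; it may or may not hit
     * active_module.finalize, depending on where the linker put things. */
    memset(scratch, 0x5a, sizeof(scratch) + sizeof(void *));

    if (NULL != active_module.finalize) {
        active_module.finalize();   /* may jump to a garbage address */
    }
    return 0;
}

A corruption of this kind typically moves or disappears as soon as anything perturbs the memory layout (debuggers, valgrind, a different link order), which would fit how hard this has been to reproduce.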

@cagoelz

cagoelz commented Apr 16, 2020

We are running the 32 bit test bed on VMWare 6.5, but that is probably not significant. The only other thing I can add is that anything that changes the memory layout causes the seg fault to go away. Examples: loading c_ring into gdb, running it under valgrind, setting LD_USE_LOAD_BIAS in the environment.

@rhc54
Contributor

rhc54 commented Apr 16, 2020

And I noted that all the problems went away with last night's tarball. So it does seem to be something overwriting the pmix_pshmem location.

@karasevb Perhaps valgrind could tell us something? I'm not entirely sure how to approach this one.

@cagoelz IIRC, you guys have a fortran compiler, but use standard gcc for building the C portions of OMPI? Can you tell us what C compiler and version is being used?

@cagoelz

cagoelz commented Apr 16, 2020

gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23)

@artpol84
Contributor

@karasevb let's try to use address sanitizer in llvm.

@awlauria
Contributor Author

awlauria commented Apr 16, 2020

What about compiling with -fstack-protector-all?

Note - the above won't help with corruption of a global variable, but it will catch overruns of locally declared variables. If you have time, it might be worth a shot nonetheless.
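
As a concrete illustration of what the flag buys (hypothetical demo, not an OMPI test): built with gcc -fstack-protector-all -O0 -g, the local-buffer overrun below aborts immediately with "*** stack smashing detected ***" instead of becoming a silent, layout-dependent corruption. A global such as pmix_pshmem would instead need something like the address sanitizer suggested above (-fsanitize=address), which the gcc 4.4.7 on this test bed predates.

/* Hypothetical demo, not an OMPI test: a local-buffer overrun that
 * -fstack-protector-all turns into an immediate abort rather than
 * silent corruption. */
#include <string.h>

static void smash(const char *src)
{
    char buf[8];
    strcpy(buf, src);       /* overruns buf for any src longer than 7 chars */
}

int main(void)
{
    smash("definitely longer than eight bytes");
    return 0;
}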

@rhc54
Contributor

rhc54 commented Apr 16, 2020

@cagoelz You mentioned that you are running VMWare for this? I'm wondering if you could set up an instance with the OMPI hash that fails, and then provide me with a copy of that VMWare image? I have VMWare here as well and could perhaps debug this more easily (and without bothering you so much with it).

@cagoelz

cagoelz commented Apr 17, 2020

I am happy to attempt this but keep in mind that I am far from a VMware expert. I know how to create the machines we need for testing but that's about it. We are using the free ESXi 6.5 with the web interface. Do you want me to just create the smallest possible virtual disk with the required install of CentOS and the OMPI files from 04-15-2020 or is there some magic I don't know about that will make transferring everything to you easier?

@rhc54
Contributor

rhc54 commented Apr 17, 2020

Do you want me to just create the smallest possible virtual disk with the required install of CentOS and the OMPI files from 04-15-2020

We haven't been able to reproduce the problem even with a VM containing CentOS and that OMPI hash that we construct, so I think what we need is to have you do the VM build just as you normally do, then verify that OMPI fails and send us that complete image. Have to figure out where to post it so we can retrieve it as GitHub won't be able to handle something of that size. I do have a dropbox we could possibly use (depending on the final size), or maybe somebody can suggest an alternative?

awlauria added the bug label Apr 17, 2020
@rhc54
Contributor

rhc54 commented Apr 24, 2020

Not seeing this any more - chalking it up to something odd over there. Can revisit it if/when it recurs.

rhc54 closed this as completed Apr 24, 2020