Failure using hdf5 with ompio component of v3.x #4745

Closed
smguzik opened this issue Jan 24, 2018 · 5 comments
Comments

@smguzik

smguzik commented Jan 24, 2018

iotest.cpp.gz

Background information

A failure is observed when using HDF5 with the ompio component of Open MPI v3.x. The example works with '--mca io romio314', with MVAPICH2, and when collective IO is disabled for HDF5. I am only assuming the bug is in the ompio component.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Latest test with openmpi-v3.0.x-201801180304-86f3448.tar.bz2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Above tarball:
CC="gcc" CXX="g++" F77="gfortran" FC="gfortran" \
./configure \
  --prefix=${DEST} \
  --enable-mpirun-prefix-by-default \
  --with-pmi=/usr \
  --with-pmi-libdir=/usr/lib/x86_64-linux-gnu \
  --with-verbs=/usr \
  --with-verbs-libdir=/usr/lib

Please describe the system on which you are running

  • Operating system/version:

cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"

uname -a
Linux 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux

gcc --version
gcc (Debian 6.3.0-18) 6.3.0 20170516

  • Computer hardware:

Intel(R) Xeon(R) CPU E5-2670 v2
NFS file system

  • Network type:

Infiniband/Mellanox ConnectX-3


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

The attached program is a simple HDF5 example that reproduces the failure. If it works correctly, there should be no output to the terminal. With 'mpirun -np 4 ./iotest', I see:
BAD write!
BAD write!

Using the romio314 component or commenting out USECOLLECTIVE in the code works.

Note: hdf5 1.10.1 was built with the following (szip probably makes no difference):
./configure --enable-parallel --enable-shared --prefix=${HOME}/local --enable-build-mode=debug --with-szlib=/usr/local/szip/2.1.1
make
make install

See top of attached code for remaining build instructions that I used.

@edgargabriel
Member

Thank you for the bug report, I will look into this.

@edgargabriel
Member

Brief update: I have good news and bad news. I can in fact reproduce the problem with 3.0.x. The good news is that the problem does not occur on master; the test passes in all possible combinations (with all fcoll modules that we have). The bad news is that I looked over the list of PRs that have been applied to 3.0.x (for ompio), and I do not see anything major missing, so I am a bit lost as to where the problem comes from. Will continue digging.

@edgargabriel
Member

The test also passes with the 3.1.x branch, so it is purely the 3.0.x branch that seems to be affected.

@edgargabriel
Member

Found the reason. There is one erroneous line that somehow crept into fs/ufs/fs_file_open.c. Will mark this item as a blocker, since this could affect many jobs.

@edgargabriel
Member

edgargabriel commented Jan 24, 2018

The bug is in fact present on master and 3.1.x as well, but it has less impact there due to changes in the fcoll selection logic. Note that this is a regression compared to 3.0.0. Will file PRs in the next couple of minutes.

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Jan 24, 2018
an erroneous return statement crept in with commit 1885d99
which leads to some processes not resetting stripe_size
and stripe_count correctly. This can lead in 3.0.x to different
fcoll modules being selected. The impact is not that dramatic on
master and 3.1.x, but could lead to problems as well.

Fixes open-mpi#4745

Signed-off-by: Edgar Gabriel <[email protected]>
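
The mechanism described in the commit message above can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions, not the Open MPI code: the `sim_*` names, the handle layout, the selection rule, and the choice of which ranks take the stray return are all hypothetical. It shows how a misplaced early return in an open routine can leave stale `stripe_size` and `stripe_count` values on some ranks, so that a selection rule keyed on those values picks different modules on different ranks.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_MAX_RANKS 64

/* Hypothetical per-rank file handle. Field names echo the commit message;
   this is not Open MPI's actual structure. */
typedef struct {
    int rank;
    int stripe_size;   /* bytes; 0 means "no striping information" */
    int stripe_count;
} sim_file_t;

typedef enum { SIM_FCOLL_DEFAULT, SIM_FCOLL_STRIPED } sim_fcoll_t;

/* Stand-in selection rule (an assumption): prefer the stripe-aware module
   only when the handle claims to know the striping layout. */
sim_fcoll_t sim_select_fcoll(const sim_file_t *fh)
{
    return (fh->stripe_size > 0 && fh->stripe_count > 1)
               ? SIM_FCOLL_STRIPED
               : SIM_FCOLL_DEFAULT;
}

/* Simulated open on a file system without striping. When `buggy` is true,
   non-zero ranks hit a stray early return and keep whatever values were
   already in the handle. Which ranks are affected is an assumption. */
int sim_file_open(sim_file_t *fh, bool buggy)
{
    if (buggy && fh->rank != 0) {
        return 0;              /* stray return: the reset below is skipped */
    }
    fh->stripe_size = 0;       /* reset: no striping information */
    fh->stripe_count = 1;
    return 0;
}

/* Returns true when every simulated rank selects the same module. */
bool sim_ranks_agree(int nranks, bool buggy)
{
    assert(nranks >= 1 && nranks <= SIM_MAX_RANKS);

    sim_file_t fh[SIM_MAX_RANKS];
    for (int r = 0; r < nranks; r++) {
        fh[r].rank = r;
        fh[r].stripe_size = 1048576;   /* stale values left in the handle */
        fh[r].stripe_count = 4;
        sim_file_open(&fh[r], buggy);
    }

    sim_fcoll_t first = sim_select_fcoll(&fh[0]);
    for (int r = 1; r < nranks; r++) {
        if (sim_select_fcoll(&fh[r]) != first) {
            return false;
        }
    }
    return true;
}
```

In this sketch, `sim_ranks_agree(4, true)` reports a disagreement between ranks, while `sim_ranks_agree(4, false)` does not. Ranks running different collective algorithms against the same file is one plausible way for a collective write to produce wrong data, which would be consistent with the "BAD write!" output in the report.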
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Jan 24, 2018
an erroneous return statement crept in with commit 1885d99
which leads to some processes not resetting stripe_size
and stripe_count correctly. This can lead in 3.0.x to different
fcoll modules being selected. The impact is not that dramatic on
master and 3.1.x, but could lead to problems as well.

Fixes open-mpi#4745

This is equivalent to commit 22a2b99
on master. I did not cherry pick, since there are some differences in the
fs/lustre component and I did not want to introduce those
changes in this commit.

Signed-off-by: Edgar Gabriel <[email protected]>
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Jan 24, 2018
an erroneous return statement crept in with commit 1885d99
which leads to some processes not resetting stripe_size
and stripe_count correctly. This can lead in 3.0.x to different
fcoll modules being selected. The impact is not that dramatic on
master and 3.1.x, but could lead to problems as well.

Fixes open-mpi#4745

This is equivalent to commit 22a2b99
on master. I did not cherry pick, since there are some differences in the
fs/lustre component and I did not want to introduce those
changes in this commit.

Signed-off-by: Edgar Gabriel <[email protected]>
(cherry picked from commit f31f4b2)