Failure using hdf5 with ompio component of v3.x #4745
Comments
Thank you for the bug report; I will look into this.
Brief update: I have good news and bad news. I can in fact reproduce the problem with 3.0.x. The good news is that the problem does not occur on master; the test passes in all possible combinations (with all fcoll modules that we have). The bad news is that I looked over the list of PRs that have been applied to 3.0.x (for ompio), and I do not see anything major missing, so I am a bit lost as to where the problem comes from. Will continue digging.
The test also passes with the 3.1.x branch, so it is purely the 3.0.x branch that seems to be affected.
Found the reason. There is one erroneous line that somehow crept into fs/ufs/fs_file_open.c. Will mark this item as a blocker, since this could affect many jobs.
The bug is in fact present on master and 3.1.x as well, but it has less impact there due to changes in the fcoll selection logic. Note that this is a regression compared to 3.0.0. Will file PRs in the next couple of minutes.
An erroneous return statement crept in with commit 1885d99, which leads to some processes not resetting stripe_size and stripe_count correctly. In 3.0.x, this can lead to different fcoll modules being selected. The impact is not that dramatic on master and 3.1.x, but it could lead to problems there as well. Fixes open-mpi#4745 Signed-off-by: Edgar Gabriel <[email protected]>
An erroneous return statement crept in with commit 1885d99, which leads to some processes not resetting stripe_size and stripe_count correctly. In 3.0.x, this can lead to different fcoll modules being selected. The impact is not that dramatic on master and 3.1.x, but it could lead to problems there as well. Fixes open-mpi#4745 This is equivalent to commit 22a2b99 on master. I did not cherry-pick, since there are some differences in the fs/lustre component and I did not want to introduce those changes in this commit. Signed-off-by: Edgar Gabriel <[email protected]>
An erroneous return statement crept in with commit 1885d99, which leads to some processes not resetting stripe_size and stripe_count correctly. In 3.0.x, this can lead to different fcoll modules being selected. The impact is not that dramatic on master and 3.1.x, but it could lead to problems there as well. Fixes open-mpi#4745 This is equivalent to commit 22a2b99 on master. I did not cherry-pick, since there are some differences in the fs/lustre component and I did not want to introduce those changes in this commit. Signed-off-by: Edgar Gabriel <[email protected]> (cherry picked from commit f31f4b2)
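For readers who want to see the failure mode described in the commit messages above, here is a minimal sketch in plain C with no MPI. It is not the Open MPI code: the struct, the function names, the stale values, the selection rule, and the choice of rank 0 as the process that returns early are all assumptions made for illustration. Only the field names stripe_size and stripe_count come from the commit message.

```c
#include <stdbool.h>

/* Hypothetical per-process file state; only the two field names
 * are taken from the commit message. */
typedef struct {
    int stripe_size;
    int stripe_count;
} sketch_file_t;

/* Hypothetical stand-ins for two different fcoll modules. */
typedef enum { SKETCH_FCOLL_A, SKETCH_FCOLL_B } sketch_fcoll_t;

/* Buggy open: one process (assumed here to be rank 0) returns early
 * and therefore never reaches the reset of the stripe fields. */
int sketch_open_buggy(sketch_file_t *fh, int rank)
{
    if (rank == 0) {
        return 0;            /* erroneous early return skips the reset */
    }
    fh->stripe_size = 0;
    fh->stripe_count = 1;
    return 0;
}

/* Fixed open: every process resets the stripe fields. */
int sketch_open_fixed(sketch_file_t *fh, int rank)
{
    (void)rank;              /* all ranks take the same path */
    fh->stripe_size = 0;
    fh->stripe_count = 1;
    return 0;
}

/* Hypothetical selection rule: the module depends on the stripe values,
 * so processes holding different values pick different modules. */
sketch_fcoll_t sketch_select_fcoll(const sketch_file_t *fh)
{
    if (fh->stripe_size > 0 && fh->stripe_count > 1) {
        return SKETCH_FCOLL_B;
    }
    return SKETCH_FCOLL_A;
}

/* Simulates nranks processes one after another and reports whether
 * they all end up selecting the same module. */
bool sketch_ranks_agree(int (*open_fn)(sketch_file_t *, int), int nranks)
{
    sketch_fcoll_t first = SKETCH_FCOLL_A;
    for (int rank = 0; rank < nranks; rank++) {
        /* Assumed stale values present before the open call. */
        sketch_file_t fh = { 1048576, 4 };
        open_fn(&fh, rank);
        sketch_fcoll_t chosen = sketch_select_fcoll(&fh);
        if (rank == 0) {
            first = chosen;
        } else if (chosen != first) {
            return false;    /* processes disagree on the module */
        }
    }
    return true;
}
```

Calling sketch_ranks_agree(sketch_open_buggy, 4) returns false, because the process that returned early keeps its stale stripe values and selects a different module than the others. With sketch_open_fixed it returns true. A collective operation in which processes run different modules cannot be expected to write consistent data, which is consistent with the "BAD write!" symptom reported in the issue.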
iotest.cpp.gz
Thank you for taking the time to submit an issue!
Background information
A failure is observed using hdf5 with the ompio component of openmpi v3.x. The example works with '--mca io romio314', with mvapich2, and when collective IO is disabled for HDF5. I am only assuming the bug is with the ompio component.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Latest test with openmpi-v3.0.x-201801180304-86f3448.tar.bz2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Above tarball:
CC="gcc" CXX="g++" F77="gfortran" FC="gfortran" \
./configure \
    --prefix=${DEST} \
    --enable-mpirun-prefix-by-default \
    --with-pmi=/usr \
    --with-pmi-libdir=/usr/lib/x86_64-linux-gnu \
    --with-verbs=/usr \
    --with-verbs-libdir=/usr/lib
Please describe the system on which you are running
cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
uname -a
Linux 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux
gcc --version
gcc (Debian 6.3.0-18) 6.3.0 20170516
Intel(R) Xeon(R) CPU E5-2670 v2
NFS file system
Infiniband/Mellanox ConnectX-3
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
The attached program is a simple HDF5 example that reproduces the failure. If it works correctly, it prints nothing to the terminal. With 'mpirun -np 4 ./iotest', I see:
BAD write!
BAD write!
Using the romio314 component or commenting out USECOLLECTIVE in the code works.
Note: HDF5 1.10.1 was built with the following commands (szip probably makes no difference):
./configure --enable-parallel --enable-shared --prefix=${HOME}/local --enable-build-mode=debug --with-szlib=/usr/local/szip/2.1.1
make
make install
See top of attached code for remaining build instructions that I used.