Description
Thank you for taking the time to submit an issue!
Background information
There is a possible close/read race condition in the close_open_file_descriptors() function of all odls modules. The trigger is unknown, and the condition is not easily reproducible. The underlying cause may be a bug in libc or in the kernel's open/read/close system calls, but there is a safe workaround.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Looking at the code, this affects all active branches, and possibly stale branches, of Open MPI.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Open MPI was installed from source when this bug was detected.
Please describe the system on which you are running
- Operating system/version: CentOS 7
- Computer hardware: x86_64 arch
- Network type: OpenIB
Details of the problem
close_open_file_descriptors() appears to iterate over the entries in /proc/self/fd/ and close every open fd it finds. However, one of the last fds it closes is the fd backing the DIR structure returned by opendir(). In most instances, this works fine. Under certain, currently unknown, circumstances (possibly kernel or libc related), a segmentation fault occurs in readdir() on the DIR whose fd was closed inside the while loop.
The proposal is to skip the fd of the open DIR structure. I actually have a working patch for the odls_default_module and plan to port the patch to the other odls modules. I do not, however, have a way to test the patch with the Cray Alps launcher. I also do not have a good way to trigger this possible race condition, but the proposed patch will avoid it.
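As a rough illustration of the idea (not the actual patch), the loop can ask dirfd() for the descriptor behind the DIR stream and skip it, leaving closedir() to release it at the end. The function name close_fds_from() and its min_fd parameter are hypothetical and exist only to keep the sketch self-contained; the real odls code is structured differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: close every open fd >= min_fd, except the fd that
 * backs the DIR stream we are iterating. Returns the number of fds closed,
 * or -1 if /proc/self/fd cannot be opened (e.g. on a non-Linux system). */
static int close_fds_from(int min_fd)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL) {
        return -1;
    }

    /* Remember which fd belongs to the DIR stream so the loop can skip it. */
    int dir_fd = dirfd(dir);
    if (dir_fd < 0) {
        closedir(dir);
        return -1;
    }

    int closed = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        char *end;
        errno = 0;
        long fd = strtol(entry->d_name, &end, 10);

        /* Skip non-numeric entries such as "." and "..". */
        if (end == entry->d_name || *end != '\0' || errno != 0) {
            continue;
        }
        /* Skip fds below the threshold and, crucially, the DIR's own fd. */
        if (fd < min_fd || fd == dir_fd) {
            continue;
        }
        if (close((int)fd) == 0) {
            closed++;
        }
    }

    /* closedir() releases the DIR stream and its fd together. */
    closedir(dir);
    return closed;
}
```

With this shape, readdir() never operates on a stream whose descriptor has already been closed, which is the situation suspected of causing the segmentation fault.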
I am planning to issue pull requests for all of the active branches as well as master per the guidelines.