OSC shared memory fence segfault #5262

Closed
PeterGottesman opened this issue Jun 12, 2018 · 6 comments · Fixed by #5341

Comments

@PeterGottesman
Contributor

PeterGottesman commented Jun 12, 2018

When running the onesided/c_strided_getacc_indexed_shared test, an open syscall fails while MPI_Win_allocate_shared is initializing shared memory. Execution continues until a call to MPI_Win_fence accesses the uninitialized shared memory, at which point the process segfaults.

The segfault occurs at line 103 of osc_sm_active_target.c (because module->global_state is NULL).
The failing open syscall occurs at line 495 of shmem_mmap_module.c.
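
The failure pattern described here (an initialization error that goes unnoticed, followed by a NULL dereference later) can be illustrated with a small standalone sketch. This is not Open MPI's code: `demo_module_t`, `demo_attach`, and `demo_fence` are hypothetical names, and only the `global_state` field name is borrowed from the report. It shows one way to surface the failed `open` and to guard the later access.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for a per-window module; not Open MPI's struct. */
typedef struct {
    int *global_state;   /* stays NULL if the segment was never attached */
} demo_module_t;

/* Try to attach to a backing file. Returns 0 on success, -1 on failure. */
static int demo_attach(demo_module_t *module, const char *path)
{
    module->global_state = NULL;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;       /* the caller must not ignore this */
    }
    close(fd);

    /* A real implementation would mmap the file; calloc keeps the demo small. */
    module->global_state = calloc(1, sizeof(int));
    return module->global_state ? 0 : -1;
}

/* Fence-like operation that refuses to touch unattached state. */
static int demo_fence(demo_module_t *module)
{
    if (module->global_state == NULL) {
        return -1;       /* report an error instead of dereferencing NULL */
    }
    *module->global_state += 1;
    return 0;
}
```

With a path that cannot be opened, `demo_attach` returns -1 and `demo_fence` then returns -1 as well, rather than crashing.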

The stdout/stderr can be seen here:
https://mtt.open-mpi.org/index.php?do_redir=2633

@hjelmn I believe this is up your alley

@hjelmn
Member

hjelmn commented Jun 12, 2018

That looks like a new failure mode. Will take a look tomorrow.

@jsquyres
Member

@hjelmn FWIW, the file that fails to open in shmem_mmap_module.c:495 is /dev/shm/vader_segment.savbu-usnic-a.58f00001.1.

@PeterGottesman
Contributor Author

@hjelmn This is still appearing on MTT. Any luck tracking down the issue?

@hjelmn
Member

hjelmn commented Jun 25, 2018

Not yet. I haven't been able to recreate the issue. Plan to try again this week.

@jsquyres
Member

Per 2018-06-26 webex, please re-create with a debug build and run with osc_base_verbose 100 and send output to @hjelmn.

Note: this is happening across the board on master, v3.0, v3.1. 😢

hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 26, 2018
This commit fixes a typo where a bcast is used instead of the intended
collective (barrier).

References open-mpi#5262

Signed-off-by: Nathan Hjelm <[email protected]>
@jsquyres
Member

Per off-issue discussion, @hjelmn found the issue before we sent him the verbose output. See #5341.
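
For context on the commit message above: `MPI_Bcast` does not synchronize the participating processes, so the root may return before the other ranks have entered the call, whereas `MPI_Barrier` does not let any rank leave until all have entered. The thread does not spell out how that led to the failed `open`; one plausible reading, which is an assumption here, is that a rank removed the backing file before its peers had opened it. The sketch below imitates that ordering requirement with plain POSIX processes instead of MPI; `demo_attach_then_unlink` and the `/tmp/demo_segment_XXXXXX` path are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo, not Open MPI code. The parent plays the rank that
 * creates and later unlinks a backing file; each child plays a rank that
 * opens it. Returns the number of children that failed to open the file,
 * or -1 if the setup itself failed.
 */
static int demo_attach_then_unlink(int nchildren)
{
    char path[] = "/tmp/demo_segment_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    close(fd);

    int ready[2];
    if (pipe(ready) != 0) {
        unlink(path);
        return -1;
    }

    int started = 0;
    for (int i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            break;                      /* could not start this child */
        }
        if (pid == 0) {
            close(ready[0]);
            int cfd = open(path, O_RDONLY);
            char ok = (cfd >= 0);
            if (write(ready[1], &ok, 1) != 1) {
                _exit(1);
            }
            if (cfd >= 0) {
                close(cfd);
            }
            _exit(ok ? 0 : 1);
        }
        started++;
    }
    close(ready[1]);

    /* Barrier-like step: wait until every started child has reported. */
    int failed = nchildren - started;
    for (int i = 0; i < started; i++) {
        char ok = 0;
        if (read(ready[0], &ok, 1) != 1 || !ok) {
            failed++;
        }
    }
    close(ready[0]);

    /* Only now is it safe to remove the name from the filesystem. */
    unlink(path);

    while (wait(NULL) > 0) {
        /* reap all children */
    }
    return failed;
}
```

Because the `unlink` happens only after every child has reported, `demo_attach_then_unlink(4)` is expected to return 0. Moving the `unlink` call above the read loop reintroduces the race, since a child may then try to open a file that no longer exists.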

hjelmn added a commit that referenced this issue Jun 26, 2018
This commit fixes a typo where a bcast is used instead of the intended
collective (barrier).

References #5262

Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 26, 2018
This commit fixes a typo where a bcast is used instead of the intended
collective (barrier).

References open-mpi#5262

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 4c23068)
Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 26, 2018
This commit fixes a typo where a bcast is used instead of the intended
collective (barrier).

References open-mpi#5262

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 4c23068)
Signed-off-by: Nathan Hjelm <[email protected]>