'nonblocking3' BVT test fails #2151

Closed
tjcw opened this issue Oct 3, 2016 · 13 comments

@tjcw
Contributor

tjcw commented Oct 3, 2016

This is the second of the MPICH BVT tests which fails with today's 'git clone' of OMPI.

It fails with the following message

--------------------------------------------------------------------------
[1,0]<stderr>:[oc0436844531:11983] *** Process received signal ***
[1,0]<stderr>:[oc0436844531:11983] Signal: Segmentation fault (11)
[1,0]<stderr>:[oc0436844531:11983] Signal code:  (128)
[1,0]<stderr>:[oc0436844531:11983] Failing at address: (nil)
[1,1]<stderr>:[oc0436844531:11984] *** Process received signal ***
[1,1]<stderr>:[oc0436844531:11984] Signal: Segmentation fault (11)
[1,1]<stderr>:[oc0436844531:11984] Signal code:  (128)
[1,1]<stderr>:[oc0436844531:11984] Failing at address: (nil)
[1,0]<stderr>:[oc0436844531:11983] [ 0] /lib64/libpthread.so.0(+0xf100)[0x7f55dc43d100]
[1,0]<stderr>:[oc0436844531:11983] [ 1] /usr/local/lib/libmpi.so.0(PMPI_Ialltoallw+0x13e)[0x7f55dc6aadce]
[1,1]<stderr>:[oc0436844531:11984] [ 0] [1,0]<stderr>:[oc0436844531:11983] [ 2] nonblocking3[0x40383d]
[1,0]<stderr>:[oc0436844531:11983] [ 3] nonblocking3[0x403ab5]
[1,0]<stderr>:[oc0436844531:11983] [ 4] [1,1]<stderr>:/lib64/libpthread.so.0(+0xf100)[0x7f57c14f1100]
[1,1]<stderr>:[oc0436844531:11984] [ 1] [1,0]<stderr>:/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f55dc08db15]
[1,0]<stderr>:[oc0436844531:11983] [ 5] nonblocking3[0x402bf9]
[1,0]<stderr>:[oc0436844531:11983] *** End of error message ***
[1,1]<stderr>:/usr/local/lib/libmpi.so.0(PMPI_Ialltoallw+0x13e)[0x7f57c175edce]
[1,1]<stderr>:[oc0436844531:11984] [ 2] nonblocking3[0x40383d]
[1,1]<stderr>:[oc0436844531:11984] [ 3] nonblocking3[0x403ab5]
[1,1]<stderr>:[oc0436844531:11984] [ 4] [1,1]<stderr>:/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f57c1141b15]
[1,1]<stderr>:[oc0436844531:11984] [ 5] nonblocking3[0x402bf9]
[1,1]<stderr>:[oc0436844531:11984] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node oc0436844531 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[tjcw@oc0436844531 coll]$ 
@tjcw
Contributor Author

tjcw commented Oct 3, 2016

The 'gist' for this problem is here <script src="https://gist.github.com/tjcw/530ba5bb593a0f480b08de89d0497dc6.js"></script>

@tjcw
Contributor Author

tjcw commented Oct 3, 2016

nonblocking3.zip

@tjcw
Contributor Author

tjcw commented Oct 3, 2016

Trying again with the 'gist' for this problem

https://gist.github.com/tjcw/530ba5bb593a0f480b08de89d0497dc6

@ggouaillardet
Contributor

can you please double check your source file, or post a link to the mpich git repository?
i do not see any MPI_Ialltoallw in the gist (broken link), nor in mpich's nonblocking.c or nonblocking3.c

@tjcw
Contributor Author

tjcw commented Oct 3, 2016

I think I have fixed the 'gist' link now.

@ggouaillardet
Contributor

thanks, i will have a look at it
btw, how many MPI tasks are you running ?
can you also please post your configure command line?

@tjcw
Contributor Author

tjcw commented Oct 3, 2016

This was with 2 MPI tasks. I configured with './configure' .

@jsquyres
Member

jsquyres commented Oct 3, 2016

Is this a duplicate issue of #2150?

@ggouaillardet
Contributor

@jsquyres these are two distinct issues, i can reproduce both and am now on it

@ggouaillardet
Contributor

@tjcw the issue here is that we do not (yet) support calling MPI_Op_free before the non-blocking request has completed

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Oct 4, 2016
MPI standard states a user MPI_Op and/or user MPI_Datatype can be free'd
after a call to a non blocking collective and before the non-blocking
collective completes.
Retain user (only) MPI_Op and MPI_Datatype when the non blocking call is
invoked, and set a request callback so they are free'd when the MPI_Request
completes.

Fixes open-mpi#2151
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Oct 27, 2016
MPI standard states a user MPI_Op and/or user MPI_Datatype can be free'd
after a call to a non blocking collective and before the non-blocking
collective completes.
Retain user (only) MPI_Op and MPI_Datatype when the non blocking call is
invoked, and set a request callback so they are free'd when the MPI_Request
completes.

Fixes open-mpi#2151

Signed-off-by: Gilles Gouaillardet <[email protected]>
@jjhursey jjhursey added the bug label Nov 18, 2016
@jjhursey jjhursey added this to the v2.1.0 milestone Jan 10, 2017
@jjhursey
Member

@ggouaillardet @bosilca I wanted to check in on this ticket to see if there has been progress. And where I can be of help.

I see two PRs that are related:

@bosilca
Member

bosilca commented Jan 12, 2017

@jjhursey the two PRs you mention are entirely different. The first one (PR #2154) specifically addresses the refcount for ops and datatypes in non-blocking collectives by adding a completion callback. The latter (PR #2393) prevents any refcount update for all send/recv operations on predefined data. I see them as complementary, solving different parts of a larger problem.
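
The retain-then-release-on-completion idea described for PR #2154 can be modelled in plain C. This is only a sketch under stated assumptions, not Open MPI code: every `model_*` name is hypothetical, and a `destroyed` flag stands in for the actual deallocation so the state stays inspectable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user-defined op; not Open MPI's type. */
typedef struct {
    int  refcount;    /* holders: the user handle plus pending requests */
    bool predefined;  /* predefined objects are never refcounted */
    bool destroyed;   /* set when the last reference is dropped */
} model_op_t;

typedef struct model_request model_request_t;
typedef void (*model_complete_cb_t)(model_request_t *req);

struct model_request {
    model_op_t          *op;          /* op retained for this request */
    model_complete_cb_t  on_complete; /* runs when the request completes */
    bool                 complete;
};

static void model_op_retain(model_op_t *op)
{
    if (!op->predefined) {
        op->refcount++;
    }
}

static void model_op_release(model_op_t *op)
{
    if (op->predefined) {
        return;
    }
    if (--op->refcount == 0) {
        op->destroyed = true; /* a real implementation would free here */
    }
}

/* Completion callback: drop the reference taken when the call started. */
static void model_release_op_on_completion(model_request_t *req)
{
    model_op_release(req->op);
    req->op = NULL;
}

/* Models the start of a non-blocking collective that uses 'op'. */
static void model_start_nonblocking(model_request_t *req, model_op_t *op)
{
    model_op_retain(op);
    req->op = op;
    req->on_complete = model_release_op_on_completion;
    req->complete = false;
}

/* Models the user freeing the op: the user gives up their reference. */
static void model_op_free(model_op_t *op)
{
    model_op_release(op);
}

/* Models request completion, e.g. inside a wait call. */
static void model_complete(model_request_t *req)
{
    req->complete = true;
    if (req->on_complete != NULL) {
        req->on_complete(req);
    }
}
```

Calling `model_start_nonblocking`, then `model_op_free`, then `model_complete` leaves the op alive until the last step, which is the ordering the commit message says the MPI standard allows.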

@ggouaillardet
Contributor

@jjhursey as pointed out by @bosilca, #2154 addresses only the non-blocking collectives, but at the MPI level.
(and btw, it also prevents updating the refcounts of predefined datatypes/ops)

@bosilca had a negative comment about the implementation itself and i acknowledge that, though i have not had much time to revamp it.

currently, refcounts are updated at the pml level, and per our SC discussion, we should do that at the MPI level, regardless of whether communications are point to point vs collective or blocking vs non-blocking.
that requires that we extend the current ompi_request_t structure (e.g. add two pointers to non-predefined datatypes, one for the op and a last one for the communicator), which will greatly simplify #2154 (e.g. no more need to insert a callback)...
except when handling the infamous MPI_{A,Ia}lltoallw and MPI_{N,In}eighbor_alltoallw, since we potentially have 2*comm_size datatypes there

makes sense ?
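
The proposal above (tracking retained objects in the request itself instead of installing a callback) might look roughly like this. It is a plain C sketch, not Open MPI code: `sketch_request_t`, `sketch_obj_t` and the helpers are hypothetical names, and the real `ompi_request_t` layout is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted handle (datatype, op or communicator). */
typedef struct {
    int  refcount;
    bool predefined; /* predefined objects are left untouched */
} sketch_obj_t;

/* Hypothetical request with fixed slots for the objects it retains. */
typedef struct {
    sketch_obj_t *datatypes[2]; /* e.g. send and receive datatype */
    sketch_obj_t *op;
    sketch_obj_t *comm;
    bool          complete;
} sketch_request_t;

/* Retain a non-predefined object; return NULL if nothing needs tracking. */
static sketch_obj_t *sketch_retain(sketch_obj_t *obj)
{
    if (obj == NULL || obj->predefined) {
        return NULL;
    }
    obj->refcount++;
    return obj;
}

/* Release whatever a slot holds and clear it. */
static void sketch_release(sketch_obj_t **slot)
{
    if (*slot != NULL) {
        (*slot)->refcount--;
        *slot = NULL;
    }
}

/* Done at the MPI level when the operation is started. */
static void sketch_request_init(sketch_request_t *req,
                                sketch_obj_t *sendtype,
                                sketch_obj_t *recvtype,
                                sketch_obj_t *op,
                                sketch_obj_t *comm)
{
    req->datatypes[0] = sketch_retain(sendtype);
    req->datatypes[1] = sketch_retain(recvtype);
    req->op           = sketch_retain(op);
    req->comm         = sketch_retain(comm);
    req->complete     = false;
}

/* Done when the request completes; no callback is involved. */
static void sketch_request_complete(sketch_request_t *req)
{
    sketch_release(&req->datatypes[0]);
    sketch_release(&req->datatypes[1]);
    sketch_release(&req->op);
    sketch_release(&req->comm);
    req->complete = true;
}
```

The fixed two-slot array is exactly what breaks down for the alltoallw variants mentioned above: they take one datatype per peer in each direction, so a sketch like this would need a dynamically sized array there.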

@hppritcha hppritcha modified the milestones: v2.0.3, v2.1.0 Jan 24, 2017
jjhursey pushed a commit to jjhursey/ompi that referenced this issue Mar 6, 2017
@hppritcha hppritcha modified the milestones: v2.0.3, v2.0.4 Jun 1, 2017
@hppritcha hppritcha modified the milestones: v3.0.0, v2.0.4 Jul 12, 2017
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Sep 1, 2017
MPI standard states a user MPI_Op and/or user MPI_Datatype can be free'd
after a call to a non blocking collective and before the non-blocking
collective completes.
Retain user (only) MPI_Op and MPI_Datatype when the non blocking call is
invoked, and set a request callback so they are free'd when the MPI_Request
completes.

Thanks Thomas Ponweiser for reporting this

Fixes open-mpi#2151
Fixes open-mpi#1304

Signed-off-by: Gilles Gouaillardet <[email protected]>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Sep 1, 2017
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Sep 1, 2017
@bwbarrett bwbarrett modified the milestones: v3.0.1, v3.0.0 Sep 12, 2017
@bwbarrett bwbarrett removed this from the v3.0.1 milestone Mar 1, 2018
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Apr 9, 2019
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Apr 9, 2019
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jul 4, 2019
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jul 8, 2019
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jul 12, 2019
(cherry picked from commit open-mpi/ompi@0fe756d)
8 participants