loop_spawn IBM test hanging #99
Comments
Imported from trac issue 2224. Created by jsquyres on 2010-02-03T15:43:20, last modified: 2013-10-30T19:07:44
Trac comment by jsquyres on 2010-02-03 16:37:42: I updated the parenthetical about openib/v1.4 to be a bit clearer. Additionally, this appears to be an openib BTL issue somehow: if I run with --mca btl tcp,self, everything works fine (!), but if I run with --mca btl openib,self, I get the hanging behavior. I found another detail: during a hang, at least one parent is still stuck in the MPI_Comm_disconnect from the prior iteration (i.e., all the other parent procs have moved on to the next MPI_Comm_spawn, but this one is still stuck at the MPI_Comm_disconnect at the end of the prior iteration). That stuck parent is blocked in a condition wait in ompi_request_default_wait_all() -- it doesn't think that any of its requests have completed. I have no idea why this would happen 200+ iterations into the run, nor why it seems to be related to the openib BTL...
Trac comment by jsquyres on 2010-10-26 17:32:20: Just a ping that this is still happening on the SVN trunk HEAD as of r23957. It is definitely related to the openib BTL. FWIW, it now hangs for me on the very first spawn in loop_spawn -- one parent process gets stuck somewhere in MPI_COMM_DISCONNECT (all the other parents complete it). Re-assigning to Vasily, since Pasha left Mellanox.
Trac comment by jsquyres on 2011-01-11 07:23:33: Just an update... This is still happening on the trunk at r24214; I ran across it yesterday. It appears to be some kind of progression race condition in the COMM_DISCONNECT code, specifically in the ORTE DPM ompi_dpm_base_disconnect_init() and ompi_dpm_base_disconnect_waitall() functions. When I ran into it yesterday, one parent process was stuck in ompi_dpm_base_disconnect_waitall() while all the others had moved on to the modex / grpcomm. I reproduced the problem by running the following on a single node with 2 active IB HCA ports. The node has 4 cores; each iteration launches 4 processes (3 parents + 1 spawned child). I added some additional printf's into the IBM loop_spawn test (its overall spawn/disconnect structure is sketched after this comment); you can clearly see that one of the parent processes does not complete MPI_COMM_DISCONNECT in the very first iteration:
Here's the BT from 2 of the 3 parent processes:
and here's the BT from the process that was stuck in COMM_DISCONNECT, waiting for its MPI requests to complete:
(I updated the CC list, too)
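For context, the parent side of a loop_spawn-style test reduces to a spawn/disconnect loop like the sketch below. This is a simplified stand-in, not the IBM test itself; the child executable name, iteration count, and printf placement are illustrative assumptions.

```c
/* Simplified sketch of a loop_spawn-style parent (not the actual IBM test).
 * Each iteration spawns a child and then disconnects from it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 1000; ++i) {        /* iteration count is arbitrary */
        /* "loop_child" is a placeholder executable name */
        MPI_Comm_spawn("loop_child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);

        printf("parent %d: iteration %d before disconnect\n", rank, i);
        MPI_Comm_disconnect(&child);
        printf("parent %d: iteration %d after disconnect\n", rank, i);
    }

    MPI_Finalize();
    return 0;
}
```

In the hang described above, the stuck parent's "after disconnect" line never appears for the failing iteration, while the other parents print theirs and move on to the next MPI_Comm_spawn.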
Trac comment by jsquyres on 2011-01-11 07:29:31: (adding George, Edgar, and Ralph because this is now a PML / DPM / ORTE question) One interesting thing that I just noticed: if I change MCA_PML_BASE_SEND_SYNCHRONOUS to STANDARD in the DPM base function, it works:
Trac comment by bosilca on 2011-01-12 13:45:03: So we do have a fully-fledged barrier in MPI_Comm_disconnect? Terrific! I might have an explanation for why going from SYNCHRONOUS to STANDARD removes the deadlock. Imagine that no process has yet exchanged messages with rank 0. Since disconnect does a full barrier (not an optimized one), and this barrier contacts the processes in rank order, at the beginning of the barrier every process starts hammering poor rank 0 with connection requests. This could exceed the number of accepts allowed and cause some of the connection requests to be dropped; as a result, we might end up deadlocking. We had a similar problem a while back in the TCP BTL, but it was fixed. You might have the same issue in the openib BTL, but at this point this is pure conjecture.
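To make the pattern described above concrete, here is a rough sketch of a rank-ordered, all-to-all disconnect handshake. It is illustrative only -- the real code lives in ompi_dpm_base_disconnect_init()/ompi_dpm_base_disconnect_waitall() and talks to the PML directly; the function name and tag below are made up.

```c
/* Illustrative rank-ordered handshake, not the actual ompi/dpm code.
 * Every process posts its recv/send pairs starting at rank 0, so rank 0
 * sees N-1 near-simultaneous connection attempts the moment the
 * "barrier" begins -- the overload scenario described in the comment above. */
#include <mpi.h>
#include <stdlib.h>

static void naive_disconnect_barrier(MPI_Comm comm)
{
    int rank, size, peer, nreqs = 0;
    int token = 0;
    MPI_Request *reqs;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    reqs = malloc(2 * (size_t)size * sizeof(*reqs));

    for (peer = 0; peer < size; ++peer) {   /* rank order: everyone starts at 0 */
        if (peer == rank) {
            continue;
        }
        MPI_Irecv(&token, 0, MPI_INT, peer, 200, comm, &reqs[nreqs++]);
        /* Synchronous-mode send, mirroring MCA_PML_BASE_SEND_SYNCHRONOUS:
         * it cannot complete until the peer matches it. */
        MPI_Issend(&token, 0, MPI_INT, peer, 200, comm, &reqs[nreqs++]);
    }

    /* The "stuck" parent described earlier sits in a wait like this,
     * convinced that none of its requests have completed. */
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```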
Trac comment by jsquyres on 2011-04-26 09:58:47: Re-assigning to new Mellanox contact...
Trac comment by hjelmn on 2013-10-30 19:07:44: I want to get this bug (and another one associated with loop_spawn) fixed for the 1.7.x/1.8.x series. Quick question: why is the isend in disconnect synchronous anyway? The irecv will end up synchronizing the two peers, so it seems unnecessary. I can confirm that loop_spawn does indeed hang in 1.7.x and trunk when the isend is synchronous but does not when it is standard.
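The send-mode distinction behind that question can be illustrated at the MPI level (a sketch of the semantics only, not the ompi/dpm code; the handshake() helper and the tag are made up): a synchronous-mode send cannot complete until the receiver has matched it, while a standard-mode send of a tiny message may complete as soon as it is buffered locally, and the posted irecv already forces the two peers to meet before the waitall returns.

```c
/* MPI-level sketch of synchronous vs. standard send in a disconnect-style
 * handshake (not the actual DPM code). */
#include <mpi.h>

static void handshake(MPI_Comm comm, int peer, int use_synchronous)
{
    int token_in = 0, token_out = 0;
    MPI_Request reqs[2];

    /* The irecv alone already keeps us here until the peer sends --
     * the synchronization hjelmn points out. */
    MPI_Irecv(&token_in, 1, MPI_INT, peer, 300, comm, &reqs[0]);

    if (use_synchronous) {
        /* Cannot complete until the peer has matched it with a receive. */
        MPI_Issend(&token_out, 1, MPI_INT, peer, 300, comm, &reqs[1]);
    } else {
        /* May complete locally once the message is buffered. */
        MPI_Isend(&token_out, 1, MPI_INT, peer, 300, comm, &reqs[1]);
    }

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```

Either way the waitall returns only after both peers have entered the handshake; the synchronous mode just adds a second rendezvous, and the thread above observes the hang only when that mode is used.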
Fixed long ago
On the trunk and v1.5 branches (r22536), the IBM test loop_spawn is hanging. The exact iteration on which it hangs is nondeterministic; it hangs for me somewhere around iteration 200.
I'm running on 2 Linux 4-core nodes as follows:
Note that this does *not* happen on the v1.4 branch; the test seems to work fine there. This suggests that something has changed on the trunk/v1.5 that caused the problem.
SIDENOTE: When using the openib BTL with this test on the v1.4 branch, the test fails much later (around iteration 1300 for me) because of what looks like a problem in the openib BTL; see https://svn.open-mpi.org/trac/ompi/ticket/1928.
I was unable to determine *why* it was hanging. The BT from 2 of the 3 parent processes appears to be nearly the same:
Here's a bt from one of the two children:
So they all appear to be in a modex. Beyond that, I am unfamiliar with this portion of the code base...