Fix debugger attach and cospawn of debugger daemons for the STAT debugger #2425


Merged

merged 1 commit into open-mpi:v2.0.x on Dec 2, 2016

Conversation

rhc54
Contributor

@rhc54 rhc54 commented Nov 16, 2016

Add ability to test the support minus the actual debugger.

Fixes #2411

Signed-off-by: Ralph Castain [email protected]

@gpaulsen @dsolt

gpaulsen
gpaulsen previously approved these changes Nov 16, 2016
Member

@gpaulsen gpaulsen left a comment

Looks good - code inspection. Thanks for jumping on this.

@jjhursey
Member

I think this looks fine. I have a question about the test attach variable usage.

I see these two environment variables: ORTE_TEST_DEBUGGER_ATTACH and ORTE_TEST_DEBUGGER_SLEEP
Plus this MCA variable: orte_debugger_test_attach

In orterun the environment variables are used, but everywhere else the MCA variable is used. How are these variables used to activate the testing mechanism? Does the user have to set both the MCA parameter and the environment variables?

I'm thinking about code maintenance as well. Would it be possible to consolidate these into just MCA parameters? Say, if the user specifies the environment variables, then that flips the corresponding MCA parameters (one for attach and one for the sleep). Then we would just use the MCA variables throughout the code.
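As a rough illustration of the consolidation being suggested here, a startup hook could flip MCA-style parameters from the legacy environment variables. This is a minimal sketch, not actual ORTE code; the function name is made up, and only the variable names come from the discussion above:

```c
/* Illustrative sketch only: if the legacy ORTE_TEST_DEBUGGER_* environment
 * variables are set, flip the corresponding MCA-style parameters once at
 * startup, so the rest of the code consults only the MCA variables. */
#include <stdbool.h>
#include <stdlib.h>

static bool orte_debugger_test_attach = false;  /* stand-in for the MCA param */
static int  orte_debugger_test_sleep  = 0;      /* hypothetical second param */

static void debugger_test_params_init(void)
{
    if (NULL != getenv("ORTE_TEST_DEBUGGER_ATTACH")) {
        orte_debugger_test_attach = true;
    }
    const char *sleep_val = getenv("ORTE_TEST_DEBUGGER_SLEEP");
    if (NULL != sleep_val) {
        orte_debugger_test_sleep = atoi(sleep_val);
    }
}
```

With this shape, only the init function ever looks at the environment; every other code path reads the two parameters.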

@jjhursey
Member

Does this need to be pulled into v2.x as well? And master would need the new test attach ability ported back, right?

@hppritcha
Member

@jsquyres I think this is ready to go.

@rhc54
Contributor Author

rhc54 commented Nov 18, 2016

@jjhursey These changes do indeed need to be ported to master and v2.x - I'll do that today, now that I'm back at a real keyboard again. The port isn't a clean "merge" due to the code path differences.

The debugger test code is a bit of a mishmash as it was done at different times. We started with MCA params, but then a concern was raised that a user might actually try to use them. So we switched to "invisible" environment variables that have an ORTE prefix.

Due to the number of code path options, the number of these "knobs" is sadly more than I'd like. We need to be able to separately test the following options:

  • mpirun started under a debugger, ensuring that the MPI process release gets properly sent and detected
  • debugger attaching to a running job, ensuring the FIFO correctly "fires" and the resulting actions occur
  • cospawn of a debugger daemon under both of the above scenarios

I'd welcome it if you'd like to take a look at the code and clean it up. This all evolved over a long period, and not from any high-level design.

@gpaulsen
Member

I just heard from the lab that they tried this fix with Open MPI, but that it didn't resolve their issue with STAT. We should figure out what's going on before merging this in.

@gpaulsen gpaulsen dismissed their stale review November 18, 2016 21:36

The lab tested, and this fix didn't resolve the issue for them.

@jjhursey
Member

I'll see if I can get STAT set up on one of our test machines on Monday so I can help debug what might be missing here.

@rhc54
Contributor Author

rhc54 commented Nov 18, 2016

If they could provide more info than "didn't work", it would help 😄

@gpaulsen
Member

Seriously. We're working on getting details. Sorry for delay.

@hppritcha
Member

hppritcha commented Nov 22, 2016

Which version of STAT is being used for this testing?

@hppritcha
Member

Removing the blocker tag. We can fix this in a later 2.0.x release.

@rhc54
Contributor Author

rhc54 commented Nov 24, 2016

I received some updated input from LLNL:

Here are the two main failure modes:

For launch mode (the mode where orterun is under the control of the
LaunchMON debug engine from the beginning), orterun seems to
fail to unlock the main target MPI application from MPI_Init(). To
be clear, orterun does co-spawn the tool daemon in this mode, but
seems to fail afterwards: when LaunchMON tells it to continue
from MPIR_Breakpoint(), orterun (and the MPI application)
doesn't respond for a while and then prints the following error
messages in what appears to be an infinite loop:

ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/routed/radix/routed_radix.c at line 375
ORTE_ERROR_LOG: Unreachable in file ../../../../../orte/mca/oob/ud/oob_ud_component.c at line 589
ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/routed/radix/routed_radix.c at line 375
ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/routed/radix/routed_radix.c at line 375

For attach mode (the mode in which the MPI application is started and LaunchMON
later tries to attach a debugger daemon to the MPI processes), orterun
doesn't seem to co-spawn the daemon at all. Orterun does print out the
following message, but as far as I can tell, the be_kicker daemon is never
spawned and never handshakes with the tool front end.

<Nov 23 11:34:59> APP (INFO): stall for 3 secs
[rzmanta10:47587] [[20639,0],0] Attaching debugger <CUT>/tests/be_kicker
[rzmanta10:47587] [[20639,0],0] Releasing job data for [INVALID]
<Nov 23 11:35:02> APP (INFO): stall for 3 secs

As before, I launched orterun with -d -mca debugger mpirx, and it seems
orterun creates the FIFO okay. If a different option set needs to be passed
with newer versions of Open MPI to enable this co-spawn service in attach
mode, please let me know. Since attach is STAT's main mode, the tool cannot
be fully validated while this failure persists.

@rhc54
Contributor Author

rhc54 commented Nov 28, 2016

I have updated this PR with a set of fixes that resolves the first reported problem and the issue of launching the daemons upon attach:

  • Limit the number of times we retry sending a message, to avoid an infinite loop

  • Don't execute the "init_debugger_after_spawn" state for debugger jobs

  • Add a new test program "attach" that takes the debugger attach fifo as its argument, and then simulates attach by writing a byte down the fifo

I cannot test with STAT directly - so I'll have to hand this off to @jjhursey to see if he can take it further.
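The "attach" test program described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual orte/test source; the function name is made up, and only the one-byte-down-the-FIFO behavior comes from the description:

```c
/* Minimal sketch of the "attach" simulation: open the debugger-attach
 * FIFO that mpirun reported and write a single byte down it. That one
 * byte is what makes mpirun believe a debugger has attached. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int simulate_debugger_attach(const char *fifo_path)
{
    /* opening a FIFO write-only blocks until mpirun has the read end open */
    int fd = open(fifo_path, O_WRONLY);
    if (fd < 0) {
        perror("open attach fifo");
        return -1;
    }
    unsigned char flag = 1;
    ssize_t n = write(fd, &flag, sizeof(flag));
    close(fd);
    return (n == (ssize_t)sizeof(flag)) ? 0 : -1;
}
```

In the real test program this would be driven from main as ./attach <FIFO>, where <FIFO> is the path mpirun prints when monitoring for attach requests.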

@@ -2712,7 +2715,8 @@ static void open_fifo (void)
         return;
     }

-    opal_output_verbose(2, orte_debug_output,
+    // opal_output_verbose(2, orte_debug_output,
+    opal_output(0,
         "%s Monitoring debugger attach fifo %s",
Member

Should this be turned back into an opal_output_verbose?

@rhc54
Contributor Author

rhc54 commented Nov 28, 2016

Good point - just updated it so that we always output the fifo path if we are testing attach, but otherwise use output_verbose. Thx!
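In self-contained form, using plain stdio stand-ins for the opal output calls (so everything here is illustrative except the condition being described), the updated logic amounts to:

```c
/* Stand-in sketch: always announce the attach FIFO when attach testing
 * is enabled; otherwise gate the message behind the verbose level, as
 * opal_output_verbose(2, ...) would. Returns whether it printed. */
#include <stdbool.h>
#include <stdio.h>

static bool test_attach_mode = false;  /* stands in for orte_debugger_test_attach */
static int  verbose_level    = 0;      /* stands in for the orte_debug_output level */

static bool report_attach_fifo(const char *fifo_path)
{
    if (test_attach_mode || verbose_level >= 2) {
        printf("Monitoring debugger attach fifo %s\n", fifo_path);
        return true;
    }
    return false;
}
```

The point of the unconditional branch is that someone running the attach test always needs the FIFO path, regardless of verbosity settings.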

@gpaulsen
Member

Is LLNL able to retest with these additional commits?

@rhc54
Contributor Author

rhc54 commented Nov 29, 2016

Once Dong returns from vacation next week (12/5). Maybe earlier, but doubtful.

@lee218llnl

In Dong's absence, I was able to do some testing here at LLNL. I built the branch from this PR and got a copy of Dong's latest additions to LaunchMON, which haven't been committed/pushed to GitHub yet. The good news is that with this combination I was able to both attach to a running job and launch a job under STAT!

I will note, however, that in both cases I get some error messages when detaching STAT from mpirun:

[rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 375
[rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 375
[rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 375
[rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 375
[rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 375
[rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 375
[rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589

I'm not sure if these are errors in how I configured/built Open MPI or what, but figured it was worth pointing out.

Thanks for the fixes!

@rhc54
Contributor Author

rhc54 commented Nov 29, 2016

Hmmm...well, that is certainly good news! If I update this patch to print out a little more info on those error outputs, would you be able to run it again?

@lee218llnl

Yes, I can do additional testing when you push your changes.

@rhc54
Contributor Author

rhc54 commented Nov 29, 2016

@lee218llnl I've added a print statement that will help debug that remaining error output on detach. Could you please give this a spin and post back the output?

@lee218llnl

[rzmanta17:131453] [[26848,0],0] ATTEMPTING TO SEND TO [[26848,2],0]
[rzmanta17:131453] [[26848,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:131453] [[26848,0],0] ATTEMPTING TO SEND TO [[26848,2],0]
[rzmanta17:131453] [[26848,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:131453] [[26848,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta17:131453] [[26848,0],0] ATTEMPTING TO SEND TO [[26848,2],0]
[rzmanta17:131453] [[26848,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:131453] [[26848,0],0] ATTEMPTING TO SEND TO [[26848,2],0]
[rzmanta17:131453] [[26848,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:131453] [[26848,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta17:131453] [[26848,0],0] ATTEMPTING TO SEND TO [[26848,2],0]
[rzmanta17:131453] [[26848,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:131453] [[26848,0],0] ATTEMPTING TO SEND TO [[26848,2],0]
[rzmanta17:131453] [[26848,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:131453] [[26848,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589

@lee218llnl

[rzmanta15:108267] [[31540,0],0] ATTEMPTING TO SEND TO [[31540,2],0]
[rzmanta15:108267] [[31540,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta15:108267] [[31540,0],0] ATTEMPTING TO SEND TO [[31540,2],0]
[rzmanta15:108267] [[31540,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta15:108267] [[31540,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta15:108267] [[31540,0],0] ATTEMPTING TO SEND TO [[31540,2],0]
[rzmanta15:108267] [[31540,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta15:108267] [[31540,0],0] ATTEMPTING TO SEND TO [[31540,2],0]
[rzmanta15:108267] [[31540,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta15:108267] [[31540,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta15:108267] [[31540,0],0] ATTEMPTING TO SEND TO [[31540,2],0]
[rzmanta15:108267] [[31540,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta15:108267] [[31540,0],0] ATTEMPTING TO SEND TO [[31540,2],0]
[rzmanta15:108267] [[31540,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta15:108267] [[31540,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta15:108267] [[31540,0],0] CANNOT SEND TO [[31540,2],0]: TAG 37

@rhc54
Contributor Author

rhc54 commented Nov 30, 2016

@lee218llnl Sorry to be such a pain - appreciate your help in tracking this down!

@lee218llnl

No problem, glad to be of assistance. Here's the latest:

[rzmanta17:160800] [[7613,0],0] INIT AFTER SPAWN FOR [7613,1]
[rzmanta17:160800] [[7613,0],0] INIT AFTER SPAWN FOR [7613,2]
[rzmanta17:160800] [[7613,0],0] SENDING DEBUGGER RELEASE orterun.c:2553
[rzmanta17:160800] [[7613,0],0] ATTEMPTING TO SEND TO [[7613,2],0]
[rzmanta17:160800] [[7613,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:160800] [[7613,0],0] ATTEMPTING TO SEND TO [[7613,2],0]
[rzmanta17:160800] [[7613,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:160800] [[7613,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta17:160800] [[7613,0],0] ATTEMPTING TO SEND TO [[7613,2],0]
[rzmanta17:160800] [[7613,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:160800] [[7613,0],0] ATTEMPTING TO SEND TO [[7613,2],0]
[rzmanta17:160800] [[7613,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:160800] [[7613,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta17:160800] [[7613,0],0] ATTEMPTING TO SEND TO [[7613,2],0]
[rzmanta17:160800] [[7613,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:160800] [[7613,0],0] ATTEMPTING TO SEND TO [[7613,2],0]
[rzmanta17:160800] [[7613,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 378
[rzmanta17:160800] [[7613,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_component.c at line 589
[rzmanta17:160800] [[7613,0],0] CANNOT SEND TO [[7613,2],0]: TAG 37

@rhc54
Contributor Author

rhc54 commented Nov 30, 2016

Hmmm...I'm beginning to see the problem. The job you are "debugging" - is it an MPI job? Is it being released?

@lee218llnl

Yes this is an MPI job. This happens after I detach from the mpirun process.

@rhc54
Contributor Author

rhc54 commented Nov 30, 2016

Do all your daemons "die" prior to you detaching from mpirun?

@lee218llnl

There may be a race there, as my tool frontend tells the daemons to go ahead and exit and then the frontend will detach without waiting for a response. Before the exit is issued, though, the daemons have already detached from the application.

@rhc54
Contributor Author

rhc54 commented Nov 30, 2016

Okay, I think this last commit will fix the problem - can you please let me know, and I'll remove the debug output?

@jjhursey
Member

bot:ibm:retest

@lee218llnl

Looks like I'm seg faulting:

[rzmanta30:75070] [[4295,0],0] INIT AFTER SPAWN FOR [4295,2]
[rzmanta30:75070] *** Process received signal ***
[rzmanta30:75070] Signal: Segmentation fault (11)
[rzmanta30:75070] Signal code: Address not mapped (1)
[rzmanta30:75070] Failing at address: 0xaffffffff
[rzmanta30:75070] [ 0] [0x100000050478]
[rzmanta30:75070] [ 1] /nfs/tmp2//lee218/prefix/ompi.rzmanta/lib/libopen-pal.so.20(opal_free+0x34)[0x100000230b30]
[rzmanta30:75070] [ 2] /nfs/tmp2/lee218/prefix/ompi.rzmanta/bin/mpirun[0x1000af1c]
[rzmanta30:75070] [ 3] /nfs/tmp2//lee218/prefix/ompi.rzmanta/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0xcd0)[0x10000024bbc0]
[rzmanta30:75070] [ 4] /nfs/tmp2/lee218/prefix/ompi.rzmanta/bin/mpirun[0x10005c90]
[rzmanta30:75070] [ 5] /nfs/tmp2/lee218/prefix/ompi.rzmanta/bin/mpirun[0x100034c0]
[rzmanta30:75070] [ 6] /lib64/power8/libc.so.6(+0x24580)[0x100000524580]
[rzmanta30:75070] [ 7] /lib64/power8/libc.so.6(__libc_start_main+0xc4)[0x100000524774]
[rzmanta30:75070] *** End of error message ***
[2]+ Segmentation fault (core dumped) /nfs/tmp2/lee218/prefix/ompi.rzmanta/bin/mpirun mpi_ringtopo.rzmanta2 180

Here's a peek at the core file:

bash-4.2$ gdb /nfs/tmp2/lee218/prefix/ompi.rzmanta/bin/mpirun rzmanta30-mpirun-75070.core
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /nfs/tmp2/lee218/prefix/ompi.rzmanta/bin/orterun...done.

warning: core file may not match specified executable file.
[New LWP 75070]
[New LWP 75071]
[New LWP 75073]
[New LWP 75074]
[New LWP 75072]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/power8/libthread_db.so.1".
Core was generated by `/nfs/tmp2/lee218/prefix/ompi.rzmanta/bin/mpirun mpi_ringtopo.rzmanta2 180 '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000010003890 in opal_pointer_array_get_item (table=0x10047360,
element_index=0) at ../../../opal/class/opal_pointer_array.h:131
131 p = table->addr[element_index];
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.ppc64le libgcc-4.8.5-4.el7.ppc64le libibumad-1.3.10.2.MLNX20150406.966500d-0.1.33100.ppc64le libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.ppc64le libmlx4-1.0.6mlnx1-OFED.3.3.0.0.7.33100.ppc64le libmlx5-1.0.2mlnx1-OFED.3.3.0.0.9.33100.ppc64le libnl-1.1.4-3.el7.ppc64le libpciaccess-0.13.4-2.el7.ppc64le munge-libs-0.5.12-1.SAG.ppc64le numactl-libs-2.0.9-5.el7_1.ppc64le opensm-libs-4.7.0.MLNX20160523.25f7c7a-0.1.33100.ppc64le
(gdb) where
#0 0x0000000010003890 in opal_pointer_array_get_item (table=0x10047360,
element_index=0) at ../../../opal/class/opal_pointer_array.h:131
#1 0x000000001000af94 in orte_debugger_init_after_spawn (fd=-1, event=4,
cbdata=0x103efe70) at orterun.c:2550
#2 0x000010000024bbc0 in event_process_active_single_queue (
activeq=0x1008fe80, base=0x1008f900) at event.c:1370
#3 event_process_active (base=) at event.c:1440
#4 opal_libevent2022_event_base_loop (base=0x1008f900, flags=)
at event.c:1644
#5 0x0000000010005c90 in orterun (argc=3, argv=0x3fffffffde78)
at orterun.c:1071
#6 0x00000000100034c0 in main (argc=3, argv=0x3fffffffde78) at main.c:13

@rhc54
Contributor Author

rhc54 commented Nov 30, 2016

Ouch - embarrassing typo. Sorry about that; it should now be fixed.

@lee218llnl

The fix looks good. I am able to attach and detach, and can see the MPI job continue after detaching. Thanks!

@rhc54
Contributor Author

rhc54 commented Nov 30, 2016

Excellent - thanks so much for your patience! I will clean up the patch.

@gpaulsen
Member

Yes, thank you both!

…gger. Add ability to test the support minus the actual debugger.

Fixes #2411

Continue cleanup of STAT debugger attach:

* Limit the number of times we retry sending of a message to avoid an infinite loop

* Don't execute the "init_debugger_after_spawn" state for debugger jobs

* Add a new test program "attach" that takes the debugger attach fifo as its argument, and then simulates attach by writing a byte down the fifo

Output the attach fifo info if we are testing attach so we know where to attach to - otherwise, use the output_verbose

Always send "debugger release" to the job actually being debugged, not the debugger itself

Signed-off-by: Ralph Castain <[email protected]>

Remove debug

Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Contributor Author

rhc54 commented Nov 30, 2016

@lee218llnl This should be the final version of the patch, with all debug removed. Could you please run a smoke-test on it to ensure nothing got broken?

@rhc54
Contributor Author

rhc54 commented Nov 30, 2016

@jjhursey Assuming the smoke-test passes, could you please review this for release?

@lee218llnl

I ran my tests and it appears to work OK still. Thanks a bunch!

Member

@jjhursey jjhursey left a comment

These changes look good to me. Thanks @rhc54!

The only thing I might suggest we add (either as a comment to the PR or in the code) is how to use the attach test to simulate the way STAT is expecting to attach to the running job. If you have a quick example you can copy/paste into a comment that would help.

@rhc54
Contributor Author

rhc54 commented Dec 1, 2016

Testing the "attach" capability is fairly straightforward:

  1. Set a couple of environment variables:

     export OMPI_MCA_orte_debugger_test_daemon=hostname
     export OMPI_MCA_orte_debugger_test_attach=1

  2. Execute mpirun -npernode 2 ./hello from the orte/test/mpi directory. This will initiate an MPI program and (because of the envars) tell it to block waiting for debugger release. mpirun itself will automatically print out the attach FIFO it is listening on for attach requests.

  3. In another window, run the attach program in orte/test/mpi:

     ./attach <FIFO>

     where <FIFO> is the string output by mpirun. This will write a flag down the FIFO pipe, causing mpirun to think a debugger has attached. It will then launch the executable specified in the OMPI_MCA_orte_debugger_test_daemon envar and, once that executable is running, send the "debugger release" message to the MPI application.

@rhc54
Contributor Author

rhc54 commented Dec 1, 2016

@hppritcha @jsquyres This is ready to go!

@jsquyres
Member

jsquyres commented Dec 1, 2016

Travis seems to have been running extraordinarily slow the past few days. 😦

@jsquyres jsquyres merged commit ce73959 into open-mpi:v2.0.x Dec 2, 2016