-
Notifications
You must be signed in to change notification settings - Fork 897
Fix debugger attach and cospawn of debugger daemons for the STAT debugger #2425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - code inspection. Thanks for jumping on this.
I think this looks fine. I have a question about the test attach variable usage. I see these two environment variables: In I'm thinking about code maintenance as well. Would it be possible to consolidate these into just MCA parameters? Say, if the user specifies the environment variables then that flips the MCA parameter(s) [plural making one for attach and one for the sleep]. Then we just use the MCA variables though out the code. |
Does this need to be pulled into |
@jsquyres I think this is ready to go. |
@jjhursey These changes do indeed need to be ported to master and v2.x - I'll do that today, now that I'm back at a real keyboard again. The port isn't a clean "merge" due to the code path differences. The debugger test code is a bit of a mishmash as it was done at different times. We started with MCA params, but then a concern was raised that a user might actually try to use them. So we switched to "invisible" environmental variables that have an ORTE prefix. Due to the number of code path options, the number of these "knobs" is sadly more than I'd like. We need to be able to separately test the following options:
I'd welcome it if you'd like to take a look at the code and clean it up. This all evolved over a long period, and not from any high-level design. |
I just heard from the lab that they tried this fix with Open MPI, but that it didn't resolve their issue with STAT. We should figure out what's going on before merging this in. |
Lab tested, and this fix didn't resolve for them.
I'll see if I can get stat setup on one of our test machines on Monday so I can help debug what might be missing here. |
If they could provide more info than "didn't work", it would help 😄 |
Seriously. We're working on getting details. Sorry for delay. |
Which version of STAT is being used for this testing? |
removing blocker tag. we can fix this in a later 2.0.x release |
I received some updated input from LLNL:
|
I have updated this PR with a set of fixes that resolves the first reported problem and the issue of launching the daemons upon attach:
I cannot test with STAT directly - so I'll have to hand this off to @jjhursey to see if he can take it further. |
@@ -2712,7 +2715,8 @@ static void open_fifo (void) | |||
return; | |||
} | |||
|
|||
opal_output_verbose(2, orte_debug_output, | |||
// opal_output_verbose(2, orte_debug_output, | |||
opal_output(0, | |||
"%s Monitoring debugger attach fifo %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be turned back into an opal_output_verbose
?
Good point - just updated it so that we always output the fifo path if we are testing attach, but otherwise use output_verbose. Thx! |
Is LLNL able to retest with these additional commits? |
Once Dong returns next week (12/5) from vacation - maybe earlier, but doubtful |
In Dong's absence, I was able to do some testing here at LLNL. I built the branch from this PR and got a copy of Dong's latest additions to LaunchMON, which haven't been committed/pushed to github yet. The good news is that with this combination I was able to both attach to a running job and launch a job under STAT! I will note, however, that in both cases I get some error messages when detaching STAT from mpirun: [rzmanta15:55045] [[2779,0],0] ORTE_ERROR_LOG: Not found in file routed_radix.c at line 375 I'm not sure if these are errors in how I configured/built OpenMPI or what, but figured it worth pointing out. Thanks for the fixes! |
Hmmm...well, that is certainly good news! If I update this patch to print out a little more info on those error outputs, would you be able to run it again? |
Yes, I can do additional testing when you push your changes. |
@lee218llnl I've added a print statement that will help debug that remaining error output on detach. Could you please give this a spin and post back the output? |
[rzmanta17:131453] [[26848,0],0] ATTEMPTING TO SEND TO [[26848,2],0] |
[rzmanta15:108267] [[31540,0],0] ATTEMPTING TO SEND TO [[31540,2],0] |
@lee218llnl Sorry to be such a pain - appreciate your help in tracking this down! |
No problem, glad to be of assistance. Here's the latest: [rzmanta17:160800] [[7613,0],0] INIT AFTER SPAWN FOR [7613,1] |
Hmmm...I'm beginning to see the problem. The job you are "debugging" - is it an MPI job? It is being released? |
Yes this is an MPI job. This happens after I detach from the mpirun process. |
Do all your daemons "die" prior to you detaching from mpirun? |
There may be a race there, as my tool frontend tells the daemons to go ahead and exit and then the frontend will detach without waiting for a response. Before the exit is issued, though, the daemons have already detached from the application. |
Okay, I think this last commit will fix the problem - can you please let me know, and I'll remove the debug output? |
bot:ibm:retest |
Looks like I'm seg faulting: [rzmanta30:75070] [[4295,0],0] INIT AFTER SPAWN FOR [4295,2] Here's a peek at the core file: bash-4.2$ gdb /nfs/tmp2/lee218/prefix/ompi.rzmanta/bin/mpirun rzmanta30-mpirun-75070.core warning: core file may not match specified executable file. |
Ouch - embarrassing typo. Sorry about that, should now be fixed |
The fix looks good. I am able to attach and detach, and can see the MPI job continue after detaching. Thanks! |
Excellent - thanks so much for your patience! I will cleanup the patch |
Yes, thank you both! |
…gger. Add ability to test the support minus the actual debugger. Fixes #2411 Continue cleanup of STAT debugger attach: * Limit the number of times we retry sending of a message to avoid an infinite loop * Don't execute the "init_debugger_after_spawn" state for debugger jobs * Add a new test program "attach" that takes the debugger attach fifo as its argument, and then simulates attach by writing a byte down the fifo Output the attach fifo info if we are testing attach so we know where to attach to - otherwise, use the output_verbose Always send "debugger release" to the job actually being debugged, not the debugger itself Signed-off-by: Ralph Castain <[email protected]> Remove debug Signed-off-by: Ralph Castain <[email protected]>
@lee218llnl This should be the final version of the patch, with all debug removed. Could you please run a smoke-test on it to ensure nothing got broken? |
@jjhursey Assuming the smoke-test passes, could you please review this for release? |
I ran my tests and it appears to work OK still. Thanks a bunch! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look good to me. Thanks @rhc54!
The only thing I might suggest we add (either as a comment to the PR or in the code) is how to use the attach
test to simulate the way STAT is expecting to attach to the running job. If you have a quick example you can copy/paste into a comment that would help.
Testing the "attach" capability is fairly straightforward:
where <FIFO> is the string output by mpirun. This will write a flag down the FIFO pipe, causing mpirun to think a debugger has attached. It will then launch the executable specified in the |
@hppritcha @jsquyres This is ready to go! |
Travis seems to have been running extraordinarily slow the past few days. 😦 |
Add ability to test the support minus the actual debugger.
Fixes #2411
Signed-off-by: Ralph Castain [email protected]
@gpaulsen @dsolt