# WeeklyTelcon_20220913

    - Dialup Info: (Do not post to public mailing list or public wiki)
 
## Attendees

- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- Christoph Niethammer (HLRS)
- David Bernholdt (ORNL)
- Edgar Gabriel (UoH)
- Geoffrey Paulsen (IBM)
- George Bosilca (UTK)
- Howard Pritchard (LANL)
- Joseph Schuchart
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Todd Kordenbrock (Sandia)
- Tommy Janjusic (nVidia)
- William Zhang (AWS)
 
## Not attending

- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Erik Zeiske
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (UCX/nVidia)
- Jan (Sandia - ULT support in Open MPI)
- Jeff Squyres (Cisco)
- Jingyin Tang
- Josh Fisher (Cornelis Networks)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Sam Gutierrez (LLNL)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Thomas Naughton (ORNL)
- Xin Zhao (nVidia)
 
- Multiple weeks on the CVE reported by NVIDIA.
- v4.1.5
  - Schedule: targeting ~6 months out (Nov?)
  - No driver on the schedule yet.
- Potential CVE from a 4-year-old issue in libevent, but we might not need to do anything.
  - Update: one company reported that their scanner didn't report anything.
  - Waiting on confirmation that the patches to remove the dead code were enough.
 
- 10779: OPAL "core" library for internal usage
  - Approach is to separate out pieces of OPAL into "core" and "top".
  - All internal things, not exposed to the user.
  - Brian and George worked on it, and then Josh picked it up and PRed 10779.
  - Still in Draft because he wants to resolve any high-level issues first.
  - As far as code layout, we could move some things around, but if we do this too much, we're worried about losing history.
 
 
- Discuss MCA (https://github.com/open-mpi/ompi/pull/10793#issuecomment-1244714561)
  - --mca is how we've set MCA parameters in Open MPI (see the sketch after this list).
  - Could PRRTE just "do the right thing" for --mca?
  - Agree that --mca is an Open MPI-specific option.
  - When PRRTE and PMIx split off, they prefixed their options.
    - They don't have ownership over MCA.
  - At the end of the day our docs can't change because...
  - We'd have hundreds or thousands of...
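
For context, a minimal sketch of the mechanism under discussion (illustrative only, not taken from the PR): besides the `--mca` command-line option, Open MPI also reads MCA parameters from `OMPI_MCA_`-prefixed environment variables, so the same setting can be expressed either way. The parameter value used below is just an example.

```c
/* Illustrative sketch: setting an MCA parameter through the OMPI_MCA_
 * environment-variable prefix, equivalent to passing
 * "--mca btl self,tcp" on the mpirun command line.  The variable has to
 * be in the environment before MPI_Init() reads it; normally you would
 * set it in the shell rather than in the program. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    setenv("OMPI_MCA_btl", "self,tcp", 1);

    MPI_Init(&argc, &argv);
    /* ... application code ... */
    MPI_Finalize();
    return 0;
}
```
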
- Discuss mca_base_env_list (https://github.com/open-mpi/ompi/pull/10788)
  - Did some googling around, and this is documented: https://oar.imag.fr/wiki:passing_environment_variables_to_openmpi_nodes
  - Mentions that -x is deprecated?
  - Easy to fix the Mellanox CI, but SHOULD we?
  - Let's remove the test, and add it to Issue 10698.
- Discuss remaining PRRTE CLI issues (https://github.com/open-mpi/ompi/issues/10698)
  - -N: document an error if they try it when it conflicts with --map-by.
  - --show-progress: used to show the little "..." progress display on the terminal; now it doesn't do anything.
    - DOE may set this by default in MCA parameters (makes some users feel happy).
  - --display-topo: generally we've tried to be backwards compatible.
  - -v: version
  - -V: verbose
  - -s | --preload-binary <- functionally it works, but with -n it gets messed up.
  - rankfile <- NOT deprecating.
  - --mca is Open MPI's framework.
  - No --gprtemca. Created by PRRTE, but do we continue to support --gpmixmca?
  - --test-suicide and the others are all prrted (PRRTE daemon) options not exposed to users; they are passed to the prrte launcher.
 
 
- Posted Open MPI Issue #10698 with about 13 issues that will need to be addressed.
  - NEED an mpirun manpage.
  - NEED mpirun --help.
  - Need all of these fixes before PRRTE ships v3.0.0.
  - Are any of these issues complex?
- Testing mpirun command-line options.
  - Supposed to do automatic translation from old command-line options to new options.
  - Are we planning to get rid of the old options at some point?
  - Not printing the deprecation warning by default.
  - We've made new options (that are the new way), but if we're not encouraging people to move to them, why?
  - Can we even map old to new options one-to-one?
  - We "own" the schizo component, and we could ditch the new options and only use the old options if we want.
  - Before we force any change, we should get users' feedback.
  - The old options had auto-completion.
  - If we have old options that map to new options, it's weird that we don't print the messages.
  - v5.0 was supposed to be pretty disruptive, but if we go back and make it less disruptive, that's fine; we are kind of saying that the old options are the way.
 
- Do we want HW_GUIDED in v5?
  - No discussion.

- It'd be nice to make a test suite that assumes 2-4 nodes with 4 ppr or so.
- Schedule:
  - PMIx and PRRTE changes coming at the end of August.
  - PMIx v3.2 released.
  - Try to have bugfixes PRed by the end of August, to give time to iterate and get them merged.

- Still using the Critical v5.0.x Issues board (https://github.com/open-mpi/ompi/projects/3) as of yesterday.
- Docs
  - mpirun --help is OUT OF DATE.
    - Have to do this relatively quickly, before PRRTE releases.
    - Austen, Geoff, and Tommy will be working on this.
    - The REASON for this is that the mpirun command line lives in PRRTE.
  - The mpirun manpage needs to be re-written.
    - Docs are online and can be updated asynchronously.
  - Jeff posted a PR to document runpath vs. rpath.
    - Our configure checks some linker flags, but there might be a default in the linker or in the system that really governs what happens.
 
 
- Symbol pollution - need an issue posted.
  - OPAL_DECLSPEC - do we have docs on this? (See the sketch after this list.)
    - No. The intent is: where do you want a symbol available?
    - If outside of your library, then use OPAL_DECLSPEC (like Windows DECLSPEC): "I want you to export this symbol."
  - From the Open MPI community's perspective, our ABI is just the MPI_ symbols.
  - Still unfortunate; we need to clean up as much as possible.
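
As a rough sketch of the idea (not taken from the Open MPI tree; the macro and function names below are hypothetical): when a library is compiled with -fvisibility=hidden, only symbols explicitly marked with "default" visibility are exported, which is the role an OPAL_DECLSPEC-style macro plays.

```c
/* Sketch of how an OPAL_DECLSPEC-style export macro typically works.
 * With the library built using -fvisibility=hidden, only symbols
 * carrying the "default" visibility attribute are visible outside the
 * shared library; everything else stays internal. */

#if defined(__GNUC__)
#  define EXAMPLE_DECLSPEC __attribute__((visibility("default")))
#else
#  define EXAMPLE_DECLSPEC
#endif

/* Exported: callers outside the shared library can link against this. */
EXAMPLE_DECLSPEC int example_public_init(void);

/* Not exported: hidden, usable only inside the library itself. */
int example_internal_helper(void);
```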
 
- Case of Qthreads, where they need a recursive lock (see the sketch below).
  - A configury problem was fixed.
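
Purely for reference (not code from Open MPI or Qthreads), a recursive lock with plain pthreads looks like the sketch below; it shows why re-entering a default, non-recursive lock from the same thread would deadlock.

```c
/* Reference sketch only: a recursive lock in plain pthreads.  The same
 * thread may re-acquire the mutex, which would deadlock with a default
 * (non-recursive) mutex. */
#include <pthread.h>

static pthread_mutex_t lock;

static void init_recursive_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

static void nested_call(void)
{
    pthread_mutex_lock(&lock);    /* second acquisition by the same thread */
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    init_recursive_lock();
    pthread_mutex_lock(&lock);
    nested_call();                /* OK because the mutex is recursive */
    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    return 0;
}
```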
 
 
- Trying to flesh out a couple more things.
  - Not merged into main or v5 yet.
  - Still a couple of discussion points.
 
- No discussion. Still some changes needed before we can retest/re-review.
  - "Show load errors" came out of this.
    - The intent is to turn this error off by default.
  - In Open MPI v5, we've slurped all of the MCA component libraries into libmpi (you can still build them as DSOs via configure).
  - If you build them as a DSO (say the CUDA component):
    - dlopen will fail because CUDA isn't there,
    - and the MCA framework will emit a warning on STDERR.
    - Accelerators are expensive, so you might not have them on all nodes.
  - BUT customers have hit this ERROR in the field.
  - In this case, what if we make this switch not be a boolean (always show the warning, or never show it)?
    - Jeff posted 10763.
  - Two mechanisms... accelerators could be DSOs.
    - Because if you're in libmpi.so, the whole job will not run. (See the sketch after this list for the dlopen failure mode.)
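
An illustrative sketch of the failure mode described above (the component filename is hypothetical, and this is not Open MPI source): a component built as a DSO is dlopen'd at run time, and when a dependency such as the CUDA library is missing on a node, dlopen fails and the framework has to decide whether to warn on stderr or silently skip the component.

```c
/* Illustrative sketch of dlopen'ing an optional component and handling
 * the case where it (or one of its dependencies, e.g. libcuda) is not
 * present on this node.  Link with -ldl on older glibc. */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical component path, for illustration only. */
    void *handle = dlopen("mca_accelerator_cuda.so", RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        /* This is where a framework can choose to warn loudly or to
         * quietly fall back when the library is absent. */
        fprintf(stderr, "component skipped: %s\n", dlerror());
        return 0;
    }

    dlclose(handle);
    return 0;
}
```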
 
- Overall, Edgar likes the ideas in the PR.
  - How is Open MPI (or PRRTE) dealing with Slurm?
    - The Slurm component is built every time, even if configure doesn't find Slurm.
    - Slurm headers/libs are GPL.
    - So Open MPI fork/execs srun.
  - An MCA component can still do a dlopen on required libraries.
    - The HCOLL component must be dlopening also.
  - If we don't get the Accelerator Framework in v5, is there any AMD accelerator support?
    - Not much... just some specific derived-datatype support.
    - No streams, no abstraction, etc.
    - Would be a big gap.
  - William will try.
  - Edgar also has a follow-up commit.
    - Waiting until the big commit is merged into main, to not further complicate this commit.
  - Any testing with libfabric and accelerator support?
    - Edgar is hoping to test this week.
    - If something is missing, it'd probably be on the libfabric side.
 
 
- Switching to builtin atomics.
  - 10613 - preferred PR. GCC / Clang should have these builtins (see the sketch after this list).
  - Next step would be to refactor the atomics for post-v5.0.
  - Waiting on Brian's review and CI fixes.
  - Joseph will post some additional info in the ticket.
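
A minimal sketch of the GCC/Clang "__atomic" builtins being referred to (generic compiler-builtin usage, not code from PR 10613):

```c
/* Minimal demonstration of the GCC/Clang __atomic builtins. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t counter = 0;

    /* Atomic fetch-and-add with sequentially consistent ordering. */
    int64_t old = __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);

    /* Atomic compare-and-swap: set counter to 42 only if it is still 1. */
    int64_t expected = 1;
    int swapped = __atomic_compare_exchange_n(&counter, &expected, 42,
                                              0 /* strong */,
                                              __ATOMIC_SEQ_CST,
                                              __ATOMIC_SEQ_CST);

    printf("old=%lld swapped=%d counter=%lld\n",
           (long long)old, swapped, (long long)counter);
    return 0;
}
```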
 
- We're probably not getting together in person anytime soon.
  - So we'll send around a Doodle to find time to talk about our rules.
  - The rules reflect the way we worked several years ago, but not really how we work now.

- We're to review the administrative steering committee in July (per our rules).
- We're to review the technical steering committee in July (per our rules).
- We should also review all of the OMPI GitHub, Slack, and Coverity members during the month of July.
  - Jeff will kick that off sometime this week or next week.
 
- We mentioned this on the call, but there was no real discussion.
 
- Wiki for the face-to-face: https://github.com/open-mpi/ompi/wiki/Meeting-2022
  - Might be better to do a half-day or day-long virtual working session instead, due to companies' travel policies and for convenience.
  - Could do administrative tasks there too.