v3.1.x: Update to PMIx v2.2.3 pre-release candidate #6778

Merged: 1 commit merged into open-mpi:v3.1.x on Jul 8, 2019

Conversation

@rhc54 (Contributor) commented Jun 26, 2019

Will update to official release after passing MTT

Refs #6763

Signed-off-by: Ralph Castain <[email protected]>
@rhc54 added this to the v3.1.5 milestone Jun 26, 2019
@rhc54 requested a review from jjhursey June 26, 2019 18:15
@rhc54 self-assigned this Jun 26, 2019
@rhc54 (Contributor, Author) commented Jun 26, 2019

@artpol84 Is this some hiccup on the Mellanox Jenkins? I'm not quite sure I see how PMIx would have caused this failure, but I'm not ruling it out - just need guidance.

21:50:43 + taskset -c 2,3 timeout -s SIGSEGV 17m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.sBsKcd8v7T --report-state-on-timeout --get-stack-traces --timeout 900 -mca coll '^hcoll' -mca btl_openib_if_include mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -mca btl_openib_allow_ib true -mca pml ob1 -mca btl self,openib taskset -c 2,3 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
21:50:43 --------------------------------------------------------------------------
21:50:43 WARNING: There is at least non-excluded one OpenFabrics device found,
21:50:43 but there are no active ports detected (or Open MPI was unable to use
21:50:43 them).  This is most certainly not what you wanted.  Check your
21:50:43 cables, subnet manager configuration, etc.  The openib BTL will be
21:50:43 ignored for this job.
21:50:43 
21:50:43   Local host: jenkins03
21:50:43 --------------------------------------------------------------------------
21:50:43 [1561575043.381564] [jenkins03:22333:0]    ucp_context.c:601  UCX  WARN  device 'mlx4_0:1' is not available
21:50:43 [1561575043.381933] [jenkins03:22289:0]    ucp_context.c:601  UCX  WARN  device 'mlx4_0:1' is not available
21:50:43 --------------------------------------------------------------------------
21:50:43 At least one pair of MPI processes are unable to reach each other for
21:50:43 MPI communications.  This means that no Open MPI device has indicated
21:50:43 that it can be used to communicate between these processes.  This is
21:50:43 an error; Open MPI requires that all MPI processes be able to reach
21:50:43 each other.  This error can sometimes be the result of forgetting to
21:50:43 specify the "self" BTL.
21:50:43 
21:50:43   Process 1 ([[27929,1],4]) is on host: jenkins03
21:50:43   Process 2 ([[27929,1],0]) is on host: jenkins03
21:50:43   BTLs attempted: self
21:50:43 
21:50:43 Your MPI job is now going to abort; sorry.
21:50:43 --------------------------------------------------------------------------
21:50:43 [1561575043.382579] [jenkins03:22294:0]    ucp_context.c:601  UCX  WARN  device 'mlx4_0:1' is not available
21:50:43 [1561575043.382703] [jenkins03:22374:0]    ucp_context.c:601  UCX  WARN  device 'mlx4_0:1' is not available
21:50:43 --------------------------------------------------------------------------
21:50:43 MPI_INIT has failed because at least one MPI process is unreachable
21:50:43 from another.  This *usually* means that an underlying communication
21:50:43 plugin -- such as a BTL or an MTL -- has either not loaded or not
21:50:43 allowed itself to be used.  Your MPI job will now abort.
21:50:43 
21:50:43 You may wish to try to narrow down the problem;
21:50:43 
21:50:43  * Check the output of ompi_info to see which BTL/MTL plugins are
21:50:43    available.
21:50:43  * Run your application with MPI_THREAD_SINGLE.
21:50:43  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
21:50:43    if using MTL-based communications) to see exactly which
21:50:43    communication plugins were considered and/or discarded.
21:50:43 --------------------------------------------------------------------------
21:50:43 [jenkins03:22333] *** An error occurred in MPI_Init
21:50:43 [jenkins03:22333] *** reported by process [140735023742977,140733193388036]
21:50:43 [jenkins03:22333] *** on a NULL communicator
21:50:43 [jenkins03:22333] *** Unknown error
21:50:43 [jenkins03:22333] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
21:50:43 [jenkins03:22333] ***    and potentially your MPI job)
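The suggestions in that help text map to a couple of quick diagnostics. A minimal sketch (not taken from this CI run; it assumes mpirun and ompi_info from the same install are on PATH and that hello_c has been built in the examples directory):

    # List the BTL components this Open MPI build actually contains
    ompi_info | grep "MCA btl"

    # Re-run the failing case with verbose BTL selection to see which
    # components are considered and why openib is being discarded
    mpirun -np 8 -mca pml ob1 -mca btl self,openib \
           -mca btl_base_verbose 100 ./examples/hello_c

With btl_base_verbose set, the openib component reports why it excludes each device (e.g. no active ports), which is usually enough to separate a node configuration problem from a code regression.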

@rhc54 (Contributor, Author) commented Jun 26, 2019

Looking at the output from Mellanox Jenkins, this is clearly some problem with the openib btl - the exact same program ran successfully right before it with vader instead of openib.
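For comparison, the two invocations being contrasted differ only in the BTL list. An illustrative sketch (simplified from the full Jenkins command shown above, which also sets btl_openib_allow_ib and site-specific tmpdir/timeout options):

    # Shared-memory run that reportedly passed
    mpirun -np 8 -mca pml ob1 -mca btl self,vader ./examples/hello_c

    # openib run that failed with "no active ports detected"
    mpirun -np 8 -mca pml ob1 -mca btl self,openib \
           -mca btl_openib_if_include mlx4_0:1 ./examples/hello_c

Since the only difference is which BTL carries inter-process traffic, a failure that appears only with openib points at the fabric or driver state on the node rather than at the PMIx update in this PR.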

@artpol84 (Contributor) commented Jul 1, 2019

@karasevb please have a look

@jsquyres (Member) commented Jul 1, 2019

Same Mellanox CI error is occurring on #6779.

@jsquyres (Member) commented Jul 2, 2019

bot:mellanox:retest

@jsquyres changed the title from "Update to PMIx v2.2.3 pre-release candidate" to "v3.1.x: Update to PMIx v2.2.3 pre-release candidate" Jul 2, 2019
@artpol84 (Contributor) commented Jul 2, 2019

@jsquyres I've checked with @karasevb.
This looks like a hardware issue on the Jenkins node. It is still being fixed, so I expect this retest to fail, but hopefully we will have it working by tomorrow.

@jsquyres (Member) commented Jul 2, 2019

Oh, ok. The retest on another PR just succeeded. Shrug. Thanks for the heads up.

@artpol84 (Contributor) commented Jul 2, 2019

Looks like it got fixed recently.
All good.

@jsquyres (Member) commented Jul 8, 2019

Once this PR is merged, we'll watch MTT. If it's all good, there will be another PR to get the final final final PMIx 2.2.3 release.

@jsquyres merged commit add0446 into open-mpi:v3.1.x Jul 8, 2019
@rhc54 deleted the cmr31/pmix branch November 27, 2019 19:16