v3.1.x: Update to PMIx v2.2.3 pre-release candidate #6778
Conversation
Will update to official release after passing MTT.

Signed-off-by: Ralph Castain <[email protected]>
@artpol84 Is this some hiccup on the Mellanox Jenkins? I'm not quite sure I see how PMIx would have caused this failure, but I'm not ruling it out - just need guidance.

21:50:43 + taskset -c 2,3 timeout -s SIGSEGV 17m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.sBsKcd8v7T --report-state-on-timeout --get-stack-traces --timeout 900 -mca coll '^hcoll' -mca btl_openib_if_include mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -mca btl_openib_allow_ib true -mca pml ob1 -mca btl self,openib taskset -c 2,3 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
21:50:43 --------------------------------------------------------------------------
21:50:43 WARNING: There is at least non-excluded one OpenFabrics device found,
21:50:43 but there are no active ports detected (or Open MPI was unable to use
21:50:43 them). This is most certainly not what you wanted. Check your
21:50:43 cables, subnet manager configuration, etc. The openib BTL will be
21:50:43 ignored for this job.
21:50:43
21:50:43 Local host: jenkins03
21:50:43 --------------------------------------------------------------------------
21:50:43 [1561575043.381564] [jenkins03:22333:0] ucp_context.c:601 UCX WARN device 'mlx4_0:1' is not available
21:50:43 [1561575043.381933] [jenkins03:22289:0] ucp_context.c:601 UCX WARN device 'mlx4_0:1' is not available
21:50:43 --------------------------------------------------------------------------
21:50:43 At least one pair of MPI processes are unable to reach each other for
21:50:43 MPI communications. This means that no Open MPI device has indicated
21:50:43 that it can be used to communicate between these processes. This is
21:50:43 an error; Open MPI requires that all MPI processes be able to reach
21:50:43 each other. This error can sometimes be the result of forgetting to
21:50:43 specify the "self" BTL.
21:50:43
21:50:43 Process 1 ([[27929,1],4]) is on host: jenkins03
21:50:43 Process 2 ([[27929,1],0]) is on host: jenkins03
21:50:43 BTLs attempted: self
21:50:43
21:50:43 Your MPI job is now going to abort; sorry.
21:50:43 --------------------------------------------------------------------------
21:50:43 [1561575043.382579] [jenkins03:22294:0] ucp_context.c:601 UCX WARN device 'mlx4_0:1' is not available
21:50:43 [1561575043.382703] [jenkins03:22374:0] ucp_context.c:601 UCX WARN device 'mlx4_0:1' is not available
21:50:43 --------------------------------------------------------------------------
21:50:43 MPI_INIT has failed because at least one MPI process is unreachable
21:50:43 from another. This *usually* means that an underlying communication
21:50:43 plugin -- such as a BTL or an MTL -- has either not loaded or not
21:50:43 allowed itself to be used. Your MPI job will now abort.
21:50:43
21:50:43 You may wish to try to narrow down the problem;
21:50:43
21:50:43 * Check the output of ompi_info to see which BTL/MTL plugins are
21:50:43 available.
21:50:43 * Run your application with MPI_THREAD_SINGLE.
21:50:43 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
21:50:43 if using MTL-based communications) to see exactly which
21:50:43 communication plugins were considered and/or discarded.
21:50:43 --------------------------------------------------------------------------
21:50:43 [jenkins03:22333] *** An error occurred in MPI_Init
21:50:43 [jenkins03:22333] *** reported by process [140735023742977,140733193388036]
21:50:43 [jenkins03:22333] *** on a NULL communicator
21:50:43 [jenkins03:22333] *** Unknown error
21:50:43 [jenkins03:22333] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
21:50:43 [jenkins03:22333] *** and potentially your MPI job)
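For reference, the help text in the log above suggests raising BTL verbosity to see which components were considered and discarded. An illustrative re-run with that setting (standard Open MPI MCA options; install and example paths taken from the log above, not verified here) might look like:

    /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 8 \
        -mca pml ob1 -mca btl self,openib \
        -mca btl_base_verbose 100 \
        /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c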
Looking at the output from Mellanox Jenkins, this is clearly some problem with the openib btl - the exact same program ran successfully right before it with vader instead of openib.
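(Illustrative comparison only, reusing the paths from the log above: the passing run referenced here would select the shared-memory BTL instead of openib, i.e. -mca btl self,vader rather than -mca btl self,openib, with an otherwise similar command line.)

    /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 8 \
        -mca pml ob1 -mca btl self,vader \
        /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c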
@karasevb please have a look
Same Mellanox CI error is occurring on #6779.
bot:mellanox:retest
Oh, ok. The retest on another PR just succeeded. Shrug. Thanks for the heads up.
Looks like it got fixed recently.
Once this PR is merged, we'll watch MTT. If it's all good, there will be another PR to get the final final final PMIx 2.2.3 release.
Will update to official release after passing MTT
Refs #6763
Signed-off-by: Ralph Castain <[email protected]>