From caf78243bf28b459b258d42433801aae422750f3 Mon Sep 17 00:00:00 2001
From: Howard Pritchard
Date: Mon, 27 Jan 2025 13:21:36 -0700
Subject: [PATCH] dpm: don't use locality info for multi PMIX namespace environments

Some of our collective frameworks are now locality aware and use this
information to decide how to handle app collective operations.

It turns out that in certain multi-namespace situations (jobid in ompi
speak), some procs can get locality info about other procs via PMIx
mechanisms, but not in a symmetric fashion. This can lead to communicators
with different locality information on different procs, which in turn can
lead to deadlock when using certain collectives.

This situation can be seen with the ompi-tests/ibm/dynamic/intercomm_merge.c
test. In this test the following happens:

1. process set A is started with mpirun
2. process set A spawns a set of processes B
3. processes in sets A and B create an intra comm using the intercomm from
   MPI_Comm_spawn and MPI_Comm_get_parent in the spawners and spawnees
   respectively
4. process set A spawns a set of processes C
5. processes in sets A and C create an intra comm using the intercomm from
   MPI_Comm_spawn and MPI_Comm_get_parent in the spawners and spawnees
   respectively
6. processes in A and B create a new intercomm
7. processes in A and C create a new intercomm
8. processes in A, B, and C create a new intra comm using the intercomms
   from steps 6 and 7
9. processes in A, B, and C try to do an MPI_Barrier using the intra comm
   from step 8

It turns out that in step 8 the locality info supplied by PMIx is
asymmetric. Processes in sets B and C aren't able to determine locality
info about each other (PMIx returns "not found" when attempts are made to
get locality info for the remote processes).

This causes issues when step 9 is executed. Processes in set A are trying
to use the tuned collective component for the barrier. Processes in sets B
and C are trying to use the HAN collective component for the barrier. In
process sets B and C, HAN thinks that the communicator has both local and
remote procs, so it tries to use a hierarchical algorithm. Meanwhile, procs
in set A can retrieve locality info for all procs in sets B and C and think
the collective is occurring on a single node - which in fact it is.

This behavior can be observed using prrte master at 8ecee645de and openpmix
master at a083d8f9.

This patch restricts the use of locality info to procs in the same PMIx
namespace. It also removes some comments which are no longer accurate.

Signed-off-by: Howard Pritchard
---
 3rd-party/openpmix |  2 +-
 3rd-party/prrte    |  2 +-
 ompi/dpm/dpm.c     | 12 ++++++++----
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/3rd-party/openpmix b/3rd-party/openpmix
index 08e41ed5629..707920c99de 160000
--- a/3rd-party/openpmix
+++ b/3rd-party/openpmix
@@ -1 +1 @@
-Subproject commit 08e41ed5629b51832f5708181af6d89218c7a74e
+Subproject commit 707920c99de946a5c3a1850da457340f38c0caf2
diff --git a/3rd-party/prrte b/3rd-party/prrte
index 30cadc6746e..f6f5c181c1d 160000
--- a/3rd-party/prrte
+++ b/3rd-party/prrte
@@ -1 +1 @@
-Subproject commit 30cadc6746ebddd69ea42ca78b964398f782e4e3
+Subproject commit f6f5c181c1dec317c31f61effd73f960ce2eac25
diff --git a/ompi/dpm/dpm.c b/ompi/dpm/dpm.c
index 4b5dbf623e1..5616c91b422 100644
--- a/ompi/dpm/dpm.c
+++ b/ompi/dpm/dpm.c
@@ -436,14 +436,18 @@ int ompi_dpm_connect_accept(ompi_communicator_t *comm, int root,
             opal_list_remove_item(&ilist, (opal_list_item_t*)cd);
             // TODO: do we need to release cd ?
             OBJ_RELEASE(cd);
             /* ompi_proc_complete_init_single() initializes and optionally retrieves
-             * OPAL_PMIX_LOCALITY and OPAL_PMIX_HOSTNAME. since we can live without
-             * them, we are just fine */
+             * OPAL_PMIX_LOCALITY and OPAL_PMIX_HOSTNAME.
+             */
             ompi_proc_complete_init_single(proc);
             /* if this proc is local, then get its locality */
             if (NULL != local_ranks_in_jobid) {
-                uint16_t u16;
+                uint16_t u16 = 0;
                 for (prn=0; prn < nprn; prn++) {
-                    if (local_ranks_in_jobid[prn] == proc->super.proc_name.vpid) {
+                    /*
+                     * exclude procs not in our job id (aka pmix namespace) from localization optimizations
+                     */
+                    if ((local_ranks_in_jobid[prn] == proc->super.proc_name.vpid)
+                        && (OMPI_PROC_MY_NAME->jobid == proc->super.proc_name.jobid)) {
                         /* get their locality string */
                         val = NULL;
                         OPAL_MODEX_RECV_VALUE_IMMEDIATE(rc, PMIX_LOCALITY_STRING,
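For context, below is a minimal sketch of the spawn-and-merge pattern that
steps 1-5 of the commit message describe, not the actual
ompi-tests/ibm/dynamic/intercomm_merge.c source. It shows one spawn of a
child set and the MPI_Intercomm_merge plus MPI_Barrier on the merged comm;
the real test repeats the spawn for a second child set and then merges all
three sets, which is where the asymmetric locality shows up. The self-spawn
via argv[0] and the process count of 2 are illustrative choices only.

/* sketch: one parent set spawns one child set, both merge and barrier */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm, merged;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* process set A: spawn process set B (step 2) */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        /* spawners come first in the merged ordering */
        MPI_Intercomm_merge(intercomm, 0, &merged);
    } else {
        /* process set B: the parent comm is the spawn intercomm */
        intercomm = parent;
        /* spawnees come last in the merged ordering */
        MPI_Intercomm_merge(intercomm, 1, &merged);
    }

    /* A and B now share one intracommunicator (step 3); the reported
     * deadlock appears later, once a second spawned set ends up in the
     * same merged comm with asymmetric locality info */
    MPI_Comm_rank(merged, &rank);
    MPI_Barrier(merged);
    printf("rank %d in merged comm passed the barrier\n", rank);

    MPI_Comm_free(&merged);
    if (MPI_COMM_NULL == parent) {
        MPI_Comm_disconnect(&intercomm);
    } else {
        MPI_Comm_disconnect(&parent);
    }
    MPI_Finalize();
    return 0;
}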