
Conversation

@rhc54 (Contributor) commented Nov 3, 2024

Repoint submodules

@rhc54 rhc54 marked this pull request as draft November 3, 2024 14:24
@rhc54 rhc54 added test mpi4py-all Run the optional mpi4py CI tests and removed Target: main labels Nov 3, 2024
@rhc54 rhc54 force-pushed the topic/chk branch 2 times, most recently from 380433e to 14750c2 Compare November 12, 2024 20:00
hppritcha added a commit to hppritcha/prrte that referenced this pull request Nov 13, 2024
Till we figure out what got busted in upstream pmix/prrte combo.
See what's happening with

open-mpi/ompi#12906

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/prrte that referenced this pull request Nov 13, 2024
add fetch depth 0

Till we figure out what got busted in upstream pmix/prrte combo.
See what's happening with

open-mpi/ompi#12906

Signed-off-by: Howard Pritchard <[email protected]>
Repoint submodules. Disable han and hcoll components
to avoid bug when testing singleton comm_spawn.

Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
@rhc54 (Contributor, Author) commented Nov 25, 2024

For the life of me, I cannot figure out this Jenkins console. It makes zero sense. It claims the run failed, but there are "no logs" available as to why. I assume it is yet another startup failure - but how do I re-trigger it?

@hppritcha (Member) commented:

bot:ompi:retest

Signed-off-by: Ralph Castain <[email protected]>
@jsquyres (Member) commented Dec 2, 2024

Yo @rhc54 Per our discussion today, here's a C code equivalent of the mpi4py testCreateFromGroups:

#include <stdio.h>
#include <mpi.h>

int main()
{
    int size, rank;
    int color, key=0;
    int local_leader, remote_leader;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank < size / 2) {
        color = 0;
        local_leader = 0;
        // EDIT: Per comments later in the thread, even though
        // remote_leader is not zero in this case in the original
        // Python test, it appears to need to be 0 for this C
        // perhaps-not-entirely-correctly-translated-from-Python
        // test...?
        //remote_leader = size / 2;
        remote_leader = 0;
    } else {
        color = 1;
        local_leader = 0;
        remote_leader = 0;
    }

    // Split MPI_COMM_WORLD into two halves and build the "classic"
    // intercommunicator between them.
    int tag = 17;
    MPI_Comm intracomm, intercomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &intracomm);
    MPI_Intercomm_create(intracomm, local_leader,
                         MPI_COMM_WORLD, remote_leader, tag, &intercomm);

    // Extract the local and remote groups of that intercommunicator; these
    // are the inputs for MPI_Intercomm_create_from_groups() below.
    MPI_Group lgroup, rgroup;
    MPI_Comm_group(intercomm, &lgroup);
    MPI_Comm_remote_group(intercomm, &rgroup);

    MPI_Info info;
    MPI_Info_create(&info);

    MPI_Comm intercomm2;
    printf("Calling MPI_Intercomm_create_from_groups()\n");
    MPI_Intercomm_create_from_groups(lgroup, local_leader,
                                     rgroup, remote_leader,
                                     "the tag", info,
                                     MPI_ERRORS_ABORT, &intercomm2);

    printf("Done!\n");
    MPI_Finalize();
    return 0;
}

For me, this fails and hangs on my Mac:

$ mpicc mpi4py-test-create-from-groups.c -o a.out && mpirun -np 4 a.out 
Calling ic cfromgroups
Calling ic cfromgroups
Calling ic cfromgroups
Calling ic cfromgroups
[hostname:53112] PRTE ERROR: Not found in file grpcomm_direct_group.c at line 1137
[hostname:53112] PRTE ERROR: Not found in file grpcomm_direct_group.c at line 1090
[hostname:53112] PRTE ERROR: Not found in file grpcomm_direct_group.c at line 124
...hang...

@rhc54 (Contributor, Author) commented Dec 2, 2024

Hilarious - I get a completely different failure signature, and it comes from the MPI layer (no error reports from PRRTE or PMIx):

$ mpirun -np 4 ./intercomm_from_group
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
[rhc-node01:76677] ompi_group_dense_lookup: invalid peer index (2)
[rhc-node01:76677] *** Process received signal ***
[rhc-node01:76677] Signal: Segmentation fault (11)
[rhc-node01:76677] Signal code: Address not mapped (1)
[rhc-node01:76677] Failing at address: 0x48
[rhc-node01:76677] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff884f47a0]
[rhc-node01:76677] [ 1] /opt/hpc/external/ompi/lib/libmpi.so.0(ompi_intercomm_create_from_groups+0x1c0)[0xffff87e66308]
[rhc-node01:76677] [ 2] /opt/hpc/external/ompi/lib/libmpi.so.0(PMPI_Intercomm_create_from_groups+0x1d4)[0xffff87f19274]
[rhc-node01:76677] [ 3] ./intercomm_from_group[0x400b88]
[rhc-node01:76677] [ 4] /lib64/libc.so.6(+0x27300)[0xffff87c69300]
[rhc-node01:76677] [ 5] /lib64/libc.so.6(__libc_start_main+0x98)[0xffff87c693d8]
[rhc-node01:76677] [ 6] ./intercomm_from_group[0x400970]
[rhc-node01:76677] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 76677 on node rhc-node01 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

and looking at it with gdb:

(gdb) where
#0  ompi_intercomm_create_from_groups (local_group=0x2cc5ed80, local_leader=0, remote_group=0x2cc613f0, remote_leader=2, tag=0x400bf8 "the tag", info=0x2cc67ce0,
    errhandler=0x420078 <ompi_mpi_errors_abort>, newintercomm=0xfffff9f19b20) at communicator/comm.c:1779
#1  0x0000ffff87f19274 in PMPI_Intercomm_create_from_groups (local_group=0x2cc5ed80, local_leader=0, remote_group=0x2cc613f0, remote_leader=2, tag=0x400bf8 "the tag",
    info=0x2cc67ce0, errhandler=0x420078 <ompi_mpi_errors_abort>, newintercomm=0xfffff9f19b20) at intercomm_create_from_groups.c:85
#2  0x0000000000400b88 in main ()
(gdb) print leader_procs
$1 = (ompi_proc_t **) 0x2cc6bb00
(gdb) print leader_procs[0]
$2 = (ompi_proc_t *) 0x2cbe39e0
(gdb) print leader_procs[0]->super.proc
There is no member named proc.
(gdb) print leader_procs[0]->super.proc_name
$3 = {jobid = 2092761089, vpid = 0}
(gdb) print leader_procs[1]->super.proc_name
Cannot access memory at address 0x48

indicating that this line:

        leader_procs[1] = ompi_group_get_proc_ptr (remote_group, remote_leader, true);

returned trash. It could be due to slight differences in the PMIx/PRRTE hashes.

@bosilca (Member) commented Dec 2, 2024

The code is incorrect: on a 4-rank run, the remote_leader for the MPI_Intercomm_create_from_groups call cannot be 2 here, as there are only two processes in the remote group. Assuming the code wanted to let the last rank in remote_group be the leader, you need to add

MPI_Group_size(rgroup, &remote_leader);
remote_leader--;

before the call to MPI_Intercomm_create_from_groups.
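
For concreteness, here is the relevant excerpt of the C test above with that suggested change applied - a sketch only; as noted in the follow-up below, it removes the segfault but the test then hangs:

    // Suggested change: derive remote_leader from the remote group itself
    // (last rank of the remote group), rather than reusing the
    // MPI_COMM_WORLD rank computed earlier.
    MPI_Group_size(rgroup, &remote_leader);
    remote_leader--;

    MPI_Comm intercomm2;
    MPI_Intercomm_create_from_groups(lgroup, local_leader,
                                     rgroup, remote_leader,
                                     "the tag", info,
                                     MPI_ERRORS_ABORT, &intercomm2);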

@rhc54 (Contributor, Author) commented Dec 2, 2024

Thanks @bosilca - that fixed the segfault. Now it just hangs, but hopefully that's a bug I can do something about.

@rhc54 (Contributor, Author) commented Dec 2, 2024

Looks like the intercomm_create_from_groups failure is caused by the underlying code passing different group IDs to PMIx from the different participants. Using Jeff's provided example, I'm seeing "the tag-OMPIi-[[19550,1],0]" and "the tag-OMPIi-[[19550,1],2]" - so the two groups don't match and things hang. I haven't dug deeper to see where the mistake was made.

@rhc54 (Contributor, Author) commented Dec 2, 2024

Of course, I am assuming that there shouldn't be two disjoint PMIx groups being constructed, each with two procs in it - is that assumption correct?

@hppritcha (Member) commented:

okay this test is not correct.

@hppritcha (Member) commented:

The remote leader in both cases needs to be 0.

@hppritcha (Member) commented:

hpritchard@er-head:~/ompi-er2/examples> (fix_for_issue10895)!mpicc
mpicc -o test test.c
hpritchard@er-head:~/ompi-er2/examples> (fix_for_issue10895)mpirun -np 4 ./test
Hey the remote group size is 2 but i''m putting in this for remote leader! 2
Hey the remote group size is 2 but i''m putting in this for remote leader! 0
Hey the remote group size is 2 but i''m putting in this for remote leader! 2
Hey the remote group size is 2 but i''m putting in this for remote leader! 0
[er-head.usrc:3071320] calling PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],2] size 2 ninfo 2 cid_base 0
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
[er-head.usrc:3071317] calling PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],0] size 2 ninfo 2 cid_base 0
[er-head.usrc:3071319] calling PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],2] size 2 ninfo 2 cid_base 0
[er-head.usrc:3071318] calling PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],0] size 2 ninfo 2 cid_base 0
[er-head.usrc:3071320] PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],2] size 2 ninfo 2 cid_base 4294967295
[er-head.usrc:3071319] PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],2] size 2 ninfo 2 cid_base 4294967295
[er-head.usrc:3071317] PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],0] size 2 ninfo 2 cid_base 4294967294
[er-head.usrc:3071318] PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],0] size 2 ninfo 2 cid_base 4294967294
[er-head.usrc:3071320] PMIx_Get PMIX_GROUP_LOCAL_CID 6 for cid_base 4294967295
[er-head.usrc:3071318] PMIx_Get PMIX_GROUP_LOCAL_CID 6 for cid_base 4294967294
[er-head.usrc:3071317] PMIx_Get PMIX_GROUP_LOCAL_CID 6 for cid_base 4294967294
[er-head.usrc:3071319] PMIx_Get PMIX_GROUP_LOCAL_CID 6 for cid_base 4294967295
[er-head.usrc:3071317] ompi_group_dense_lookup: invalid peer index (2)
[er-head.usrc:3071319] calling PMIx_Group_construct - tag the tag-OMPIi-LC-[[59914,1],0] size 2 ninfo 2 cid_base 0
[er-head:3071317:0:3071317] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x48)
==== backtrace (tid:3071317) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x000000000006c880 ompi_intercomm_create_from_groups()  /home/hpritchard/ompi-er2/ompi/communicator/comm.c:1779
 2 0x000000000010dfdd PMPI_Intercomm_create_from_groups()  /home/hpritchard/ompi-er2/ompi/mpi/c/intercomm_create_from_groups.c:85
 3 0x0000000000400c5d main()  ???:0
 4 0x000000000003ad85 __libc_start_main()  ???:0
 5 0x0000000000400a3e _start()  ???:0
=================================
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.

@hppritcha (Member) commented:

Your C version of the Python code is incorrect.

@jsquyres (Member) commented Dec 4, 2024

okay this test is not correct.

Ok, perhaps I translated it from Python incorrectly. In the original Python test, the remote leader is definitely not 0 in both cases. But perhaps I missed some other part of the setup...? Shrug.

@hppritcha (Member) commented:

Trying to beef up parameter checking. I'll probably do that in a separate PR.
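
As a rough illustration of the kind of check meant here (a hypothetical user-level guard, not the actual patch): the leaders have to be valid ranks within their respective groups, so the arguments can be validated with MPI_Group_size before calling into the library.

#include <stdio.h>
#include <mpi.h>

// Hypothetical wrapper: reject out-of-range leader arguments (such as the
// remote_leader of 2 against a 2-process remote group seen above) before
// they ever reach MPI_Intercomm_create_from_groups().
static int checked_intercomm_create_from_groups(MPI_Group lgroup, int local_leader,
                                                MPI_Group rgroup, int remote_leader,
                                                const char *tag, MPI_Info info,
                                                MPI_Errhandler errh, MPI_Comm *newcomm)
{
    int lsize, rsize;
    MPI_Group_size(lgroup, &lsize);
    MPI_Group_size(rgroup, &rsize);
    if (local_leader < 0 || local_leader >= lsize ||
        remote_leader < 0 || remote_leader >= rsize) {
        fprintf(stderr, "bad leader: local %d (group size %d), remote %d (group size %d)\n",
                local_leader, lsize, remote_leader, rsize);
        return MPI_ERR_ARG;
    }
    return MPI_Intercomm_create_from_groups(lgroup, local_leader, rgroup, remote_leader,
                                            tag, info, errh, newcomm);
}

The library-side version would presumably report the error through the supplied error handler rather than printing, but the shape of the check is the same.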

hppritcha added a commit to hppritcha/ompi that referenced this pull request Dec 4, 2024
The MPI_Comm_create_from_group and especially the
MPI_Intercomm_create_from_groups functions are recent additions
to the standard (MPI 4.0) and users may get confused easily
trying to use them.

So better parameter checking is needed.

Related to open-mpi#12906 where an incorrect code example showed up.

Signed-off-by: Howard Pritchard <[email protected]>
brennan-carson pushed a commit to uofl-capstone-open-mpi/prrte that referenced this pull request Dec 5, 2024
add fetch depth 0

Till we figure out what got busted in upstream pmix/prrte combo.
See what's happening with

open-mpi/ompi#12906

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this pull request Dec 10, 2024
The MPI_Comm_create_from_group and especially the
MPI_Intercomm_create_from_groups functions are recent additions
to the standard (MPI 4.0) and users may get confused easily
trying to use them.

So better parameter checking is needed.

Related to open-mpi#12906 where an incorrect code example showed up.

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit a0486e0)
hppritcha added a commit to hppritcha/ompi that referenced this pull request Dec 16, 2024
The MPI_Comm_create_from_group and especially the
MPI_Intercomm_create_from_groups functions are recent additions
to the standard (MPI 4.0) and users may get confused easily
trying to use them.

So better parameter checking is needed.

Related to open-mpi#12906 where an incorrect code example showed up.

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit a0486e0)
@rhc54 (Contributor, Author) commented Dec 16, 2024

Closing this for now - will reopen when upstream is complete

@rhc54 rhc54 closed this Dec 16, 2024
@rhc54 rhc54 deleted the topic/chk branch December 16, 2024 20:32