Thank you for taking the time to submit an issue!
## Background information
### What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.0
### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Compiled from source (https://www.open-mpi.org/software/ompi/v4.1/), debug build.
### If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`
N/A
### Please describe the system on which you are running
- Operating system/version: Ubuntu 20.04
- Computer hardware: i7-10810U
- Network type: N/A
## Details of the problem
I am trying to implement non-blocking communication in a large code, but it tends to fail there. I have reduced the failure to the reproducer below: running on one rank, the code works when `switch` is set to `.false.` but fails when `switch` is set to `.true.`. See also https://stackoverflow.com/questions/66932156/mpi-alltoallw-working-and-mpi-ialltoallw-failing
```fortran
program main
  use mpi
  implicit none
  logical :: switch
  integer, parameter :: maxSize = 128
  integer scounts(maxSize), sdispls(maxSize)
  integer rcounts(maxSize), rdispls(maxSize)
  integer :: types(maxSize)
  double precision sbuf(maxSize), rbuf(maxSize)
  integer comm, size, rank, req
  integer ierr
  integer ii

  call MPI_Init(ierr)
  comm = MPI_COMM_WORLD
  call MPI_Comm_size(comm, size, ierr)
  call MPI_Comm_rank(comm, rank, ierr)

  switch = .true.

  ! Init
  sbuf(:) = rank
  scounts(:) = 0
  rcounts(:) = 0
  sdispls(:) = 0
  rdispls(:) = 0
  types(:) = MPI_INTEGER

  if (switch) then
    ! Send one time N double precision
    scounts(1) = 1
    rcounts(1) = 1
    sdispls(1) = 0
    rdispls(1) = 0
    call MPI_Type_create_subarray(1, (/maxSize/), &
                                  (/maxSize/),    &
                                  (/0/),          &
                                  MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, &
                                  types(1), ierr)
    call MPI_Type_commit(types(1), ierr)
  else
    ! Send N times one double precision
    ! (alltoallw displacements are in bytes, hence the factor 8)
    do ii = 1, maxSize
      scounts(ii) = 1
      rcounts(ii) = 1
      sdispls(ii) = (ii-1) * 8
      rdispls(ii) = (ii-1) * 8
      types(ii) = MPI_DOUBLE_PRECISION
    enddo
  endif

  call MPI_Ibarrier(comm, req, ierr)
  call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)

  if (switch) then
    call MPI_Ialltoallw(sbuf, scounts, sdispls, types, &
                        rbuf, rcounts, rdispls, types, &
                        comm, req, ierr)
    call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
    call MPI_Type_free(types(1), ierr)
  else
    call MPI_Alltoallw(sbuf, scounts, sdispls, types, &
                       rbuf, rcounts, rdispls, types, &
                       comm, ierr)
  endif

  call MPI_Finalize(ierr)
end program main
```
Running the program on one rank as follows:

```
$ mpirun -np 1 valgrind --vgdb=yes --vgdb-error=0 ./a.out
```

Valgrind reports the following error:
```
==249074== Invalid read of size 8
==249074==    at 0x4EB0A6D: release_vecs_callback (coll_base_util.c:222)
==249074==    by 0x4EB100A: complete_vecs_callback (coll_base_util.c:245)
==249074==    by 0x74AD1CC: ompi_request_complete (request.h:441)
==249074==    by 0x74AE86D: ompi_coll_libnbc_progress (coll_libnbc_component.c:466)
==249074==    by 0x4FC0C39: opal_progress (opal_progress.c:231)
==249074==    by 0x4E04795: ompi_request_wait_completion (request.h:415)
==249074==    by 0x4E047EB: ompi_request_default_wait (req_wait.c:42)
==249074==    by 0x4E80AF7: PMPI_Wait (pwait.c:74)
==249074==    by 0x48A30D2: mpi_wait (pwait_f.c:76)
==249074==    by 0x10961A: MAIN__ (tmp.f90:61)
==249074==    by 0x1096C6: main (tmp.f90:7)
==249074==  Address 0x7758830 is 0 bytes inside a block of size 8 free'd
==249074==    at 0x483CA3F: free (vg_replace_malloc.c:540)
==249074==    by 0x4899CCC: PMPI_IALLTOALLW (pialltoallw_f.c:125)
==249074==    by 0x1095FC: MAIN__ (tmp.f90:61)
==249074==    by 0x1096C6: main (tmp.f90:7)
==249074==  Block was alloc'd at
==249074==    at 0x483B7F3: malloc (vg_replace_malloc.c:309)
==249074==    by 0x4899B4A: PMPI_IALLTOALLW (pialltoallw_f.c:90)
==249074==    by 0x1095FC: MAIN__ (tmp.f90:61)
==249074==    by 0x1096C6: main (tmp.f90:7)
```
gdb produces the following error and backtrace:

```
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x0000000004eb0a6d in release_vecs_callback (request=0x7758af8) at ../../../../openmpi-4.1.0/ompi/mca/coll/base/coll_base_util.c:222
222         if (NULL != request->data.vecs.stypes[i]) {
(gdb) bt
#0  0x0000000004eb0a6d in release_vecs_callback (request=0x7758af8) at ../../../../openmpi-4.1.0/ompi/mca/coll/base/coll_base_util.c:222
#1  0x0000000004eb100b in complete_vecs_callback (req=0x7758af8) at ../../../../openmpi-4.1.0/ompi/mca/coll/base/coll_base_util.c:245
#2  0x00000000074ad1cd in ompi_request_complete (request=0x7758af8, with_signal=true) at ../../../../../openmpi-4.1.0/ompi/request/request.h:441
#3  0x00000000074ae86e in ompi_coll_libnbc_progress () at ../../../../../openmpi-4.1.0/ompi/mca/coll/libnbc/coll_libnbc_component.c:466
#4  0x0000000004fc0c3a in opal_progress () at ../../openmpi-4.1.0/opal/runtime/opal_progress.c:231
#5  0x0000000004e04796 in ompi_request_wait_completion (req=0x7758af8) at ../../openmpi-4.1.0/ompi/request/request.h:415
#6  0x0000000004e047ec in ompi_request_default_wait (req_ptr=0x1ffeffdbb8, status=0x1ffeffdbc0) at ../../openmpi-4.1.0/ompi/request/req_wait.c:42
#7  0x0000000004e80af8 in PMPI_Wait (request=0x1ffeffdbb8, status=0x1ffeffdbc0) at pwait.c:74
#8  0x00000000048a30d3 in ompi_wait_f (request=0x1ffeffe6cc, status=0x10c0a0 <mpi_fortran_status_ignore_>, ierr=0x1ffeffeee0) at pwait_f.c:76
#9  0x000000000010961b in MAIN__ () at tmp.f90:61
```