Commit 822f2d4

coll/han: alltoallv should fall-back if transform is in-place
mca_coll_han_alltoallv_using_smsc is not compatible with an in-place transform. Update comments and fall-back logic to catch this case.

Signed-off-by: Luke Robison <[email protected]>
1 parent e25e897 commit 822f2d4

1 file changed: +20 -5 lines changed

ompi/mca/coll/han/coll_han_alltoallv.c

@@ -489,14 +489,20 @@ static int decide_to_use_smsc_alg(
 
     /*
     Perform an allreduce on all ranks to decide if this algorithm is worth
-    using. There are three important things:
-
-    1. Device buffers. XPMEM doesn't support GPU buffers, so we cannot proceed
+    using. There are four important things:
+
+    1. sbuf == MPI_IN_PLACE. This algorithm does not currently support the
+       in-place operation. (Future developers may note that the inter-node
+       communications are ordered such that in-place could be supported, but
+       additional ordering and/or buffering would be required to ensure local
+       ranks do not overwrite buffers before sends are posted. However, for now
+       we just bail out.)
+    2. Device buffers. XPMEM doesn't support GPU buffers, so we cannot proceed
        with this algorithm.
-    2. Send size per rank. This algorithm can pack small messages together,
+    3. Send size per rank. This algorithm can pack small messages together,
        but this typically isn't helpful for large messages, and XPMEM-mapped
        memory cannot be registered with ibv_reg_mr.
-    3. Contiguous buffers. The exception to #2 above is if we can't post our
+    4. Contiguous buffers. The exception to #2 above is if we can't post our
        sends/recvs in one large block to the NIC. For these non-contiguous
        datatypes, our packing algorithm is better because (a) we re-use our
        buffers from a free-list which can remain registered to the NIC, and (b)
@@ -509,6 +515,15 @@ static int decide_to_use_smsc_alg(
        our execution time, which is <1/10 of the "basic" algorithm.
     */
 
+    if (sbuf == MPI_IN_PLACE) {
+        if (comm_rank == 0) {
+            opal_output_verbose(10, mca_coll_han_component.han_output, "alltoallv: decide_to_use_smsc_alg: "
+                "MPI_IN_PLACE operation prevents smsc_alg from being used. "
+                "Continue with SMSC? ==> no.\n");
+        }
+        *use_smsc = 0;
+    }
+
     /* some magic in the count: if we pick 1, need_buffers() might not be
     accurate. We could be precisely correct and compute need_buffers for every
     rank's count, but that could be a lot of iteration. Just use 2 and assume
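
The in-place guard added in the second hunk can be illustrated in isolation. The following is a minimal sketch, not the Open MPI implementation: it builds without MPI, so `SKETCH_IN_PLACE` is a hypothetical stand-in for `MPI_IN_PLACE`, `sketch_check_in_place` is a hypothetical helper name, and `fprintf` stands in for `opal_output_verbose`. It covers only the first of the four criteria listed in the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for MPI_IN_PLACE so the sketch builds without MPI.
   Any unique address works as a sentinel. */
static char sketch_in_place_sentinel;
#define SKETCH_IN_PLACE ((const void *) &sketch_in_place_sentinel)

/* Hypothetical helper mirroring the guard in the diff: if the send buffer
   is the in-place sentinel, clear *use_smsc so the caller falls back to
   another alltoallv algorithm. Otherwise *use_smsc is left unchanged so
   that later checks (device buffers, send size, contiguity) can decide. */
static void sketch_check_in_place(const void *sbuf, int comm_rank, int *use_smsc)
{
    assert(use_smsc != NULL);

    if (sbuf == SKETCH_IN_PLACE) {
        /* Only rank 0 logs, to avoid one message per rank. */
        if (comm_rank == 0) {
            fprintf(stderr, "alltoallv: in-place operation prevents the "
                            "SMSC algorithm from being used.\n");
        }
        *use_smsc = 0;
    }
}
```

Calling `sketch_check_in_place(SKETCH_IN_PLACE, 0, &use)` with `use` initially 1 leaves `use` at 0, while passing an ordinary buffer leaves it untouched.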
