@@ -489,14 +489,20 @@ static int decide_to_use_smsc_alg(

  /*
  Perform an allreduce on all ranks to decide if this algorithm is worth
- using. There are three important things:
-
- 1. Device buffers. XPMEM doesn't support GPU buffers, so we cannot proceed
+ using. There are four important things:
+
+ 1. sbuf == MPI_IN_PLACE. This algorithm does not currently support the
+ in-place operation. (Future developers may note that the inter-node
+ communications are ordered such that in-place could be supported, but
+ additional ordering and/or buffering would be required to ensure local
+ ranks do not overwrite buffers before sends are posted. However, for now
+ we just bail out.)
+ 2. Device buffers. XPMEM doesn't support GPU buffers, so we cannot proceed
  with this algorithm.
- 2. Send size per rank. This algorithm can pack small messages together,
+ 3. Send size per rank. This algorithm can pack small messages together,
  but this typically isn't helpful for large messages, and XPMEM-mapped
  memory cannot be registered with ibv_reg_mr.
- 3. Contiguous buffers. The exception to #2 above is if we can't post our
+ 4. Contiguous buffers. The exception to #3 above is if we can't post our
  sends/recvs in one large block to the NIC. For these non-contiguous
  datatypes, our packing algorithm is better because (a) we re-use our
  buffers from a free-list which can remain registered to the NIC, and (b)
@@ -509,6 +515,15 @@ static int decide_to_use_smsc_alg(
  our execution time, which is <1/10 of the "basic" algorithm.
  */

+ if (sbuf == MPI_IN_PLACE) {
+     if (comm_rank == 0) {
+         opal_output_verbose(10, mca_coll_han_component.han_output, "alltoallv: decide_to_use_smsc_alg: "
+                             "MPI_IN_PLACE operation prevents smsc_alg from being used. "
+                             "Continue with SMSC? ==> no.\n");
+     }
+     *use_smsc = 0;
+ }
+
  /* some magic in the count: if we pick 1, need_buffers() might not be
  accurate. We could be precisely correct and compute need_buffers for every
  rank's count, but that could be a lot of iteration. Just use 2 and assume
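As a reading aid for the four criteria listed in the comment, the per-rank part of the decision could be sketched roughly as follows. This is only an illustration under stated assumptions: `struct smsc_inputs`, `sketch_use_smsc`, and `small_msg_limit` are hypothetical names that do not appear in the patch, and the sketch leaves out the allreduce that the real `decide_to_use_smsc_alg` uses to reach a decision across all ranks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one rank's view of the call. These fields are
   illustrative only and are not part of coll/han. */
struct smsc_inputs {
    bool in_place;          /* criterion 1: sbuf == MPI_IN_PLACE */
    bool has_device_buffer; /* criterion 2: a GPU buffer is involved */
    bool is_contiguous;     /* criterion 4: datatype posts as one block */
    size_t send_bytes;      /* criterion 3: send size per rank */
};

/* Returns 1 if the packing (SMSC) path looks worthwhile for this rank,
   0 otherwise. small_msg_limit is an assumed tunable, not a real MCA
   parameter. */
static int sketch_use_smsc(const struct smsc_inputs *in, size_t small_msg_limit)
{
    assert(in != NULL);

    if (in->in_place) {
        return 0; /* in-place is not supported, so bail out */
    }
    if (in->has_device_buffer) {
        return 0; /* XPMEM cannot map GPU buffers */
    }
    if (!in->is_contiguous) {
        return 1; /* packing helps non-contiguous data at any size */
    }
    /* Contiguous data: packing only pays off for small messages. */
    return in->send_bytes <= small_msg_limit ? 1 : 0;
}
```

In this sketch the two hard exclusions come first, so an in-place or device-buffer call is rejected regardless of message size or datatype layout, mirroring the ordering of the list in the comment.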