coll/han: alltoallv should fall back if transform is in-place

lrbison · lrbison · commit 822f2d40bad7 · 2024-09-11T15:13:24.000Z
mca_coll_han_alltoallv_using_smsc is not compatible with an in-place
transform.  Update comments and fall-back logic to catch this case.

Signed-off-by: Luke Robison <lrbison@amazon.com>
diff --git a/ompi/mca/coll/han/coll_han_alltoallv.c b/ompi/mca/coll/han/coll_han_alltoallv.c
@@ -489,14 +489,20 @@ static int decide_to_use_smsc_alg(
 
     /*
     Perform an allreduce on all ranks to decide if this algorithm is worth
-    using. There are three important things:
-
-     1. Device buffers.  XPMEM doesn't support GPU buffers, so we cannot proceed
+    using. There are four important things:
+
+     1. sbuf == MPI_IN_PLACE.  This algorithm does not currently support the
+        in-place operation.  (Future developers may note that the inter-node
+        communications are ordered such that in-place could be supported, but
+        additional ordering and/or buffering would be required to ensure local
+        ranks do not overwrite buffers before sends are posted. However, for now
+        we just bail out.)
+     2. Device buffers.  XPMEM doesn't support GPU buffers, so we cannot proceed
         with this algorithm.
-     2. Send size per rank.  This algorithm can pack small messages together,
+     3. Send size per rank.  This algorithm can pack small messages together,
         but this typically isn't helpful for large messages, and XPMEM-mapped
         memory cannot be registered with ibv_reg_mr.
-     3. Contiguous buffers.  The exception to #2 above is if we can't post our
+     4. Contiguous buffers.  The exception to #3 above is if we can't post our
         sends/recvs in one large block to the NIC.  For these non-contiguous
         datatypes, our packing algorithm is better because (a) we re-use our
         buffers from a free-list which can remain registered to the NIC, and (b)
@@ -509,6 +515,15 @@ static int decide_to_use_smsc_alg(
     our execution time, which is <1/10 of the "basic" algorithm.
     */
 
+    if (sbuf == MPI_IN_PLACE) {
+        if (comm_rank == 0) {
+            opal_output_verbose(10, mca_coll_han_component.han_output, "alltoallv: decide_to_use_smsc_alg: "
+                "MPI_IN_PLACE operation prevents smsc_alg from being used.  "
+                "Continue with SMSC? ==> no.\n");
+        }
+        *use_smsc = 0;
+    }
+
     /* some magic in the count: if we pick 1, need_buffers() might not be
     accurate.  We could be precisely correct and compute need_buffers for every
     rank's count, but that could be a lot of iteration.  Just use 2 and assume