Skip to content
This repository was archived by the owner on Sep 30, 2022. It is now read-only.

v2.x: ompi/request: correctly handle zero count in ompi_request_default_wai… #1368

Merged
merged 1 commit into from
Sep 12, 2016

Conversation

ggouaillardet
Copy link
Contributor

…t_{all,any,some}

(cherry picked from commit open-mpi/ompi@91e1200)

@ggouaillardet
Copy link
Contributor Author

:bot🏷️bug
:bot🏷️blocker
:bot:milestone:v2.0.2
:bot:assign: @bosilca

this fixes a hang in the collective/alltoallvw_zero from the ibm test suite.
i am not sure why this was not spotted earlier, fwiw, this occurs when running on 2 nodes, 16 mpi tasks and --mca btl_tcp_progress_thread 1 --mca btl vader,tcp,self, though only the number of tasks might be relevant here

@bosilca
Copy link
Member

bosilca commented Sep 5, 2016

I don't think we want to totally prevent the progress engine from triggering. We always forced a call to progress on all wait and test.

@ggouaillardet
Copy link
Contributor Author

i do not think opal_progress() is invoked when count == 0, and or i understand your comment.

int sync_wait_mt(ompi_wait_sync_t *sync)
{
    if(sync->count <= 0)
        return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
    /* ... */
    opal_progress();
    /* ... */
}

and/or

static inline int sync_wait_st (ompi_wait_sync_t *sync)
{
    while (sync->count > 0) {
        opal_progress();
    }
    /* ... */

anyway, on second thought, my patch was an overkill for ompi_request_default_wait_{any,some}, and here are two simpler ones (only one of them is needed) that replaces it

diff --git a/opal/threads/wait_sync.h b/opal/threads/wait_sync.h
index 9ebb4d7..9f28bdf 100644
--- a/opal/threads/wait_sync.h
+++ b/opal/threads/wait_sync.h
@@ -6,6 +6,8 @@
  * Copyright (c) 2016      Los Alamos National Security, LLC. All rights
  *                         reserved.
  * Copyright (c) 2016      Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2016      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -47,7 +49,8 @@ typedef struct ompi_wait_sync_t {
  * the critical path. */
 #define WAIT_SYNC_RELEASE(sync)                       \
     if (opal_using_threads()) {                       \
-        while ((sync)->signaling) {                   \
+        while ((sync)->count > 0 &&                   \
+               (sync)->signaling) {                   \
             continue;                                 \
         }                                             \
         pthread_cond_destroy(&(sync)->condition);     \

or

diff --git a/ompi/request/req_wait.c b/ompi/request/req_wait.c
index 141c101..b87a922 100644
--- a/ompi/request/req_wait.c
+++ b/ompi/request/req_wait.c
@@ -16,6 +16,8 @@
  * Copyright (c) 2016      Los Alamos National Security, LLC. All rights
  *                         reserved.
  * Copyright (c) 2016      Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2016      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -356,7 +358,11 @@ int ompi_request_default_wait_all( size_t count,
             }
         }
     }
-    WAIT_SYNC_RELEASE(&sync);
+    if (OPAL_LIKELY(0 != count)) {
+        WAIT_SYNC_RELEASE(&sync);
+    } else {
+        WAIT_SYNC_RELEASE_NOWAIT(&sync);
+    }
     return mpi_error;
 }

can you please advise ?

@mellanox-github
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2151/ for details.

@bosilca
Copy link
Member

bosilca commented Sep 6, 2016

We have explicit calls to opal_progress where needed, so you can ignore my previous comment.

Can you explain why you need to check for count in WAIT_SYNC_RELEASE. To be more precise I don't see the corner case that this patch is trying to solve.

If count become 0 after a wait_sync_update, then signaling is set to false and things should go smoothly. However, if count is 0 from the beginning, as we don't call the wait_sync_update, we might deadlock in SYNC_WAIT. If this is the use case, I think a better fix would be to change
(sync)->signaling = true;
to
(sync)->signaling = (c != 0);
in WAIT_SYNC_INIT().

WAIT_SYNC_INIT(sync,0); WAIT_SYNC_RELEASE(sync);
hanged because sync->signaled was initialised to true, and
there is no reason to invoke WAIT_SYNC_SIGNALED(sync) before
WAIT_SYNC_RELEASE(sync)
this commit initializes sync->signaled to true unless the count is zero.

Thanks George for the review and guidance.

(cherry picked from commit open-mpi/ompi@44a66e2)
@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2165/ for details.

@ggouaillardet
Copy link
Contributor Author

@bosilca i updated the PR, could you please review it ?

@bosilca
Copy link
Member

bosilca commented Sep 9, 2016

👍

@ggouaillardet
Copy link
Contributor Author

@jsquyres @hppritcha it would be great to have this merged so MTT does not issue "false positives"

there are quite a lot of hangs right now on both master and v2.x, though quite a lot will likely be fixed by open-mpi/ompi-tests@53ad6f22e1f9ceda580a06bf8e00324a5f15c6c0

@jsquyres
Copy link
Member

jsquyres commented Sep 9, 2016

@hppritcha Good to go.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants