v2.x: ompi/request: correctly handle zero count in ompi_request_default_wai… #1368
:bot:label:bug
this fixes a hang in the
I don't think we want to totally prevent the progress engine from triggering. We have always forced a call to progress on every wait and test.
I do not think
int sync_wait_mt(ompi_wait_sync_t *sync)
{
    if(sync->count <= 0)
        return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
    /* ... */
    opal_progress();
    /* ... */
}
and/or
static inline int sync_wait_st (ompi_wait_sync_t *sync)
{
    while (sync->count > 0) {
        opal_progress();
    }
    /* ... */
}
Anyway, on second thought, my patch was overkill for
diff --git a/opal/threads/wait_sync.h b/opal/threads/wait_sync.h
index 9ebb4d7..9f28bdf 100644
--- a/opal/threads/wait_sync.h
+++ b/opal/threads/wait_sync.h
@@ -6,6 +6,8 @@
* Copyright (c) 2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2016 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2016 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -47,7 +49,8 @@ typedef struct ompi_wait_sync_t {
* the critical path. */
#define WAIT_SYNC_RELEASE(sync) \
if (opal_using_threads()) { \
- while ((sync)->signaling) { \
+ while ((sync)->count > 0 && \
+ (sync)->signaling) { \
continue; \
} \
pthread_cond_destroy(&(sync)->condition); \
or
diff --git a/ompi/request/req_wait.c b/ompi/request/req_wait.c
index 141c101..b87a922 100644
--- a/ompi/request/req_wait.c
+++ b/ompi/request/req_wait.c
@@ -16,6 +16,8 @@
* Copyright (c) 2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2016 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2016 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -356,7 +358,11 @@ int ompi_request_default_wait_all( size_t count,
}
}
}
- WAIT_SYNC_RELEASE(&sync);
+ if (OPAL_LIKELY(0 != count)) {
+ WAIT_SYNC_RELEASE(&sync);
+ } else {
+ WAIT_SYNC_RELEASE_NOWAIT(&sync);
+ }
return mpi_error;
}
Can you please advise?
Test FAILed.
We have explicit calls to opal_progress where needed, so you can ignore my previous comment. Can you explain why you need to check for count in WAIT_SYNC_RELEASE? To be more precise, I don't see the corner case that this patch is trying to solve. If count becomes 0 after a wait_sync_update, then signaling is set to false and things should go smoothly. However, if count is 0 from the beginning, as we don't call wait_sync_update, we might deadlock in SYNC_WAIT. If this is the use case, I think a better fix would be to change
WAIT_SYNC_INIT(sync,0); WAIT_SYNC_RELEASE(sync); hung because sync->signaling was initialized to true, and there is no reason to invoke WAIT_SYNC_SIGNALED(sync) before WAIT_SYNC_RELEASE(sync).
This commit initializes sync->signaling to true unless the count is zero.
Thanks George for the review and guidance.
(cherry picked from commit open-mpi/ompi@44a66e2)
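For readers following the zero-count case, here is a minimal sketch of the behaviour discussed in this thread. It is not the Open MPI code: `toy_sync_t` and every `toy_*` function are hypothetical names, only the `count` and `signaling` fields are modelled, threading is left out, and the release spin is bounded so the "hang" can be observed without actually hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for ompi_wait_sync_t. */
typedef struct toy_sync {
    int count;               /* requests still outstanding */
    volatile bool signaling; /* true while a completer may still touch the sync */
} toy_sync_t;

/* Behaviour before the fix, as described in the thread:
 * signaling always starts out true. */
static inline void toy_sync_init_old(toy_sync_t *s, int count)
{
    s->count = count;
    s->signaling = true;
}

/* Behaviour after the fix: signaling starts true unless count is zero. */
static inline void toy_sync_init_fixed(toy_sync_t *s, int count)
{
    s->count = count;
    s->signaling = (0 != count);
}

/* Models the completion path: once count drops to zero,
 * signaling is cleared. */
static inline void toy_sync_update(toy_sync_t *s, int updates)
{
    s->count -= updates;
    if (s->count <= 0) {
        s->signaling = false;
    }
}

/* Bounded model of the release spin loop. Returns true if the release
 * completes, false if it would still be spinning after max_spins. */
static inline bool toy_sync_release(const toy_sync_t *s, size_t max_spins)
{
    for (size_t i = 0; i < max_spins && s->signaling; ++i) {
        /* spin */
    }
    return !s->signaling;
}

/* Walks through the three cases; returns 0 if all behave as expected. */
int toy_sync_demo(void)
{
    toy_sync_t s;

    /* Zero count, old init: nobody ever clears signaling, release spins. */
    toy_sync_init_old(&s, 0);
    assert(!toy_sync_release(&s, 1000));

    /* Zero count, fixed init: release returns immediately. */
    toy_sync_init_fixed(&s, 0);
    assert(toy_sync_release(&s, 1000));

    /* Non-zero count: release completes once all updates have arrived. */
    toy_sync_init_fixed(&s, 2);
    toy_sync_update(&s, 2);
    assert(toy_sync_release(&s, 1000));
    return 0;
}
```

Calling `toy_sync_demo()` exercises all three cases. The point of the sketch is only that with a zero count no update ever runs, so the initial value of `signaling` alone decides whether the release can return.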
377955a to f807f1e
Test PASSed.
@bosilca I updated the PR, could you please review it?
👍
@jsquyres @hppritcha it would be great to have this merged so MTT does not issue "false positives". There are quite a lot of hangs right now on both master and v2.x, though quite a lot will likely be fixed by open-mpi/ompi-tests@53ad6f22e1f9ceda580a06bf8e00324a5f15c6c0
@hppritcha Good to go.
…t_{all,any,some}
(cherry picked from commit open-mpi/ompi@91e1200)