Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 3 additions & 14 deletions ompi/mca/osc/ucx/osc_ucx_passive_target.c
Original file line number Diff line number Diff line change
Expand Up @@ -76,21 +76,10 @@ static inline int start_exclusive(ompi_osc_ucx_module_t *module, int target) {
}

static inline int end_exclusive(ompi_osc_ucx_module_t *module, int target) {
uint64_t result_value = 0;
uint64_t remote_addr = (module->state_addrs)[target] + OSC_UCX_STATE_LOCK_OFFSET;
int ret = OMPI_SUCCESS;

ret = opal_common_ucx_wpmem_fetch(module->state_mem,
UCP_ATOMIC_FETCH_OP_SWAP, TARGET_LOCK_UNLOCKED,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please elaborate a bit about the hang? If we set TARGET_LOCK_UNLOCKED (which is zero), start_shared should succeed, because result_value is supposed to be less than TARGET_LOCK_EXCLUSIVE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume that process 0 holds an exclusive lock and process 1 tries to acquire a shared lock. Process 1 will retrieve the lock value (which is TARGET_LOCK_EXCLUSIVE) and add 1 it, making it TARGET_LOCK_EXCLUSIVE + 1. In the meantime, however, process 0 releases the lock by resetting it to TARGET_LOCK_UNLOCKED (which is 0). Since Process 1 has seen the lock as being taken exclusively, it will subtract 1 again, leading to a value of -1 in the lock variable (the lock is thus out of sync).

The next time Process 1 tries to take a shared lock, it will get the value -1 in a variable of type uint64_t, which is definitely >=TARGET_LOCK_EXCLUSIVE.

I realized, though, the my patch used -((int64_t)TARGET_LOCK_UNLOCKED) instead of -((int64_t)TARGET_LOCK_EXCLUSIVE). That should be fixed now. Sorry if that led to confusion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now, thanks

target, &result_value, sizeof(result_value),
remote_addr);
if (OMPI_SUCCESS != ret) {
return ret;
}

assert(result_value >= TARGET_LOCK_EXCLUSIVE);

return ret;
return opal_common_ucx_wpmem_post(module->state_mem, UCP_ATOMIC_POST_OP_ADD,
-((int64_t)TARGET_LOCK_EXCLUSIVE), target,
sizeof(uint64_t), remote_addr);
}

int ompi_osc_ucx_lock(int lock_type, int target, int assert, struct ompi_win_t *win) {
Expand Down