UCX osc: mixing exclusive and shared locks leads to lockup #6931

@devreal

Description

Looking at the code in osc_ucx_passive_target.c, I found that the implementation of end_exclusive is flawed: it leads to a lock-up if ranks try to take a shared lock while another rank holds an exclusive lock on the same target. Simply replacing the value in the lock's memory with TARGET_LOCK_UNLOCKED overwrites any changes made to that value by other processes trying to acquire a shared lock, so the lock gets out of sync. Instead, TARGET_LOCK_EXCLUSIVE should be subtracted from the lock value to release it without interfering with the attempts of other ranks.
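To illustrate the difference, here is a minimal single-process sketch that replays one problematic interleaving with C11 atomics instead of UCX remote atomics. It is not the osc/ucx code: the constant values, the helper names (end_exclusive_overwrite, end_exclusive_subtract, replay), and the assumption that a shared-lock attempt optimistically increments the lock word and undoes the increment on failure are all made up for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed values for illustration only, not copied from osc/ucx. */
#define TARGET_LOCK_UNLOCKED  ((uint64_t)0)
#define TARGET_LOCK_EXCLUSIVE ((uint64_t)0x0000000100000000ULL)

/* Flawed release: overwrites the whole lock word, discarding any
 * increment a concurrent shared-lock attempt has already applied. */
static void end_exclusive_overwrite(_Atomic uint64_t *lock)
{
    atomic_store(lock, TARGET_LOCK_UNLOCKED);
}

/* Proposed release: removes only the exclusive holder's contribution. */
static void end_exclusive_subtract(_Atomic uint64_t *lock)
{
    atomic_fetch_sub(lock, TARGET_LOCK_EXCLUSIVE);
}

/* Replays one interleaving sequentially and returns the final lock word:
 *   1. rank A holds the exclusive lock
 *   2. rank B attempts a shared lock (optimistic increment)
 *   3. rank A releases the exclusive lock
 *   4. rank B sees that its attempt failed and undoes its increment */
static uint64_t replay(void (*end_exclusive)(_Atomic uint64_t *))
{
    _Atomic uint64_t lock = TARGET_LOCK_EXCLUSIVE;   /* step 1 */
    uint64_t seen = atomic_fetch_add(&lock, 1);      /* step 2 */
    end_exclusive(&lock);                            /* step 3 */
    if (seen >= TARGET_LOCK_EXCLUSIVE) {
        atomic_fetch_sub(&lock, 1);                  /* step 4 */
    }
    return atomic_load(&lock);
}
```

Under these assumptions, replay(end_exclusive_overwrite) leaves the lock word at UINT64_MAX: rank B's decrement lands on the freshly written zero and wraps around, so every later shared or exclusive attempt sees a value that looks exclusively locked. replay(end_exclusive_subtract) returns the word to TARGET_LOCK_UNLOCKED, because rank B's increment survives the release and is then undone cleanly.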

While debugging, I also found that some of the asserts in these code paths are overly strict and trigger even when they should not.

I will post PRs for master, v4.0.x, and v3.1.x soon.

Potentially related: #6549
