UCX: Hang combining exclusive/shared window lock #6549

@devreal

Running Open MPI 4.0.1 with Open UCX 1.5, my application hangs when one process attempts to release an exclusive window lock while the target process attempts to acquire a shared lock on the same window. The code below reproduces the issue (tested on our InfiniBand cluster):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Win win;
  int elem_per_unit = 1;
  int *baseptr;
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  MPI_Win_allocate(
    elem_per_unit*sizeof(int), 1, MPI_INFO_NULL,
    MPI_COMM_WORLD, &baseptr, &win);

  if (size == 2) {
    // get exclusive lock
    if (rank != 0) {
      int val = 0;  // initialized so MPI_Put does not send an indeterminate value
      printf("[%d] Acquiring exclusive lock\n", rank);
      MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
      MPI_Put(&val, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
      MPI_Win_flush(0, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    // release exclusive lock
    if (rank != 0) {
      printf("[%d] Releasing exclusive lock\n", rank);
      // Rank 1 hangs here
      MPI_Win_unlock(0, win);
    }
  }

  // Rank 0 hangs here
  printf("[%d] Acquiring shared lock\n", rank);
  MPI_Win_lock_all(0, win);

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();

  return 0;
}

Build with:

$ mpicc mpi_shared_excl_lock.c -o mpi_shared_excl_lock

Run with:

$ mpirun -n 2 -N 1 ./mpi_shared_excl_lock
[1] Acquiring exclusive lock
[1] Releasing exclusive lock
[0] Acquiring shared lock

Interestingly, the example runs successfully if the barrier between acquiring and releasing the lock is removed. It also runs fine when using Open IB instead of UCX.
