Address Registration Error in CUDA-Aware MPICH 4.2.2 + UCX 1.17.0 Application #10085

@cl3to

Description

I'm running an application on a cluster that uses CUDA-aware MPICH (v4.2.2) and UCX (v1.17.0). The application consists of two binaries, a server and a client, so I launch it in MPMD mode: mpirun -np 1 server : -np 1 client (a reduced sketch of the communication pattern appears after the benchmark numbers below). Whenever I run it, either intra-node or inter-node, it prints the following error and hangs:

[1724426336.207066] [c066:48733:0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x55bace2c02a0, length=49792, access=0xf) failed: Bad address
[1724426336.207083] [c066:48733:0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x55bace2c02a0 (host) length 49792 on md[8]=mlx5_1: Input/output error (md supports: host|cuda)
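
A standalone verbs-level probe might help isolate this. The sketch below is hypothetical (my own test harness, not part of the application); it copies the device name (mlx5_1), the length, and the access flags (0xf = LOCAL_WRITE|REMOTE_WRITE|REMOTE_READ|REMOTE_ATOMIC) from the error above and registers a plain malloc'd host buffer, bypassing UCX entirely:

/* reg_probe.c — hypothetical standalone probe; build: gcc reg_probe.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { perror("ibv_get_device_list"); return 1; }

    struct ibv_context *ctx = NULL;
    for (int i = 0; i < num; i++)
        if (!strcmp(ibv_get_device_name(devs[i]), "mlx5_1")) {  /* md named in the error */
            ctx = ibv_open_device(devs[i]);
            break;
        }
    if (!ctx) { fprintf(stderr, "could not open mlx5_1\n"); return 1; }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { perror("ibv_alloc_pd"); return 1; }

    /* Same length and access mask (0xf) as the failing ibv_reg_mr call. */
    size_t len = 49792;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE  |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ  |
                                   IBV_ACCESS_REMOTE_ATOMIC);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %zu bytes at %p, lkey=0x%x\n", len, buf, mr->lkey);
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

If this also fails with "Bad address", the problem would sit below UCX (driver/rdma-core); if it succeeds, the failure is more likely specific to how the registration cache re-registers regions.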

After some research, I found that setting the environment variable UCX_RCACHE_ENABLE=n, which disables UCX's memory registration cache, lets the application run without errors. However, it then runs much slower than expected: profiling shows that most of the runtime is spent on data transfer between the nodes.

The OSU 7.4 bandwidth benchmark quantifies the cost: with UCX_RCACHE_ENABLE=n, inter-node InfiniBand bandwidth is roughly 5.6 times lower (about 23,000 MB/s vs. about 4,000 MB/s):

export UCX_RCACHE_ENABLE=y
mpirun -ppn 1 -np 2 osu_bw -m 100000000:1000000000
# or, equivalently, in MPMD mode:
mpirun -ppn 1 -np 1 osu_bw -m 100000000:1000000000 : -np 1 osu_bw -m 100000000:1000000000

# OSU MPI Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
100000000           23073.19
200000000           23037.84
400000000           23001.02
800000000           22948.29

export UCX_RCACHE_ENABLE=n
mpirun -ppn 1 -np 2 osu_bw -m 100000000:1000000000
# or, equivalently, in MPMD mode:
mpirun -ppn 1 -np 1 osu_bw -m 100000000:1000000000 : -np 1 osu_bw -m 100000000:1000000000

# OSU MPI Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
100000000            4083.05
200000000            4078.84
400000000            4076.13
800000000            4076.14
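
As mentioned above, the application's communication pattern reduces to roughly the sketch below. This is hypothetical (the real server and client are separate binaries launched under MPMD; here both roles are collapsed into one binary with a rank check to keep the sketch short), but the essential operation is the same: moving large buffers between ranks through CUDA-aware MPI.

/* pattern_sketch.c — hypothetical reduction; build (paths may vary):
 * mpicc pattern_sketch.c -lcudart */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t n = (size_t)1 << 24;             /* 16 Mi chars */
    char *dev_buf;
    cudaMalloc((void **)&dev_buf, n);

    if (rank == 0) {                        /* "server" role */
        cudaMemset(dev_buf, 1, n);
        MPI_Send(dev_buf, (int)n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                 /* "client" role */
        MPI_Recv(dev_buf, (int)n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}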

Any suggestions on why the application might be failing to register addresses?

Setup and versions

OS version:

  • cat /etc/redhat-release: Red Hat Enterprise Linux Server release 7.9 (Maipo)
  • uname -r: 3.10.0-1160.49.1.el7.x86_64

RDMA/IB version:

  • rpm -q libibverbs: libibverbs-54mlnx1-1.54310.x86_64
  • rpm -q rdma-core: rdma-core-devel-54mlnx1-1.54310.x86_64

IB HW:

  • Each node has 2 IB NICs.
  • ibstat:
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.27.1016
        Hardware version: 0
        Node GUID: 0x0800380300b49dac
        System image GUID: 0x0800380300b49dac
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 115
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x0800380300b49dac
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.27.1016
        Hardware version: 0
        Node GUID: 0x0800380300b49da0
        System image GUID: 0x0800380300b49da0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 177
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x0800380300b49da0
                Link layer: InfiniBand

CUDA 12.0:

  • Each node has four 32GB V100 GPUs

  • cuda libraries: cuda-toolkit-12-0-12.0.0-1.x86_64

  • cuda drivers: cuda-driver-devel-12-0-12.0.107-1.x86_64

  • lsmod | grep nv_peer_mem:

nv_peer_mem            13369  0 
ib_core               358225  11 rdma_cm,ib_cm,iw_cm,beegfs,nv_peer_mem,ko2iblnd,mlx5_ib,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
nvidia              56056886  55 nv_peer_mem,gdrdrv,nvidia_modeset,nvidia_uvm
  • lsmod | grep gdrdrv:
gdrdrv                 18183  0 
nvidia              56056886  55 nv_peer_mem,gdrdrv,nvidia_modeset,nvidia_uvm
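
Since nv_peer_mem and gdrdrv are loaded, a variant of the earlier host probe could check whether GPUDirect RDMA registration works on its own. Again a hypothetical sketch of mine (first HCA, 1 MiB buffer): it registers a cudaMalloc'd pointer directly with ibv_reg_mr, which exercises the nv_peer_mem path.

/* gpu_reg_probe.cu — hypothetical; build: nvcc gpu_reg_probe.cu -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { perror("ibv_get_device_list"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA */
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { perror("ibv_alloc_pd"); return 1; }

    void *gpu_buf;
    size_t len = (size_t)1 << 20;                         /* 1 MiB on the GPU */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    /* With nv_peer_mem loaded this should succeed on a GPUDirect-capable
     * node; failure here would point at the peer-memory path rather than
     * at UCX's registration cache. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE  |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) { perror("ibv_reg_mr (cuda)"); return 1; }

    printf("registered GPU buffer %p, lkey=0x%x\n", gpu_buf, mr->lkey);
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_buf);
    return 0;
}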

ucx_info -v:

# Library version: 1.17.0
# Library path: /home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/ucx-1.17.0-qq5l5fowibcomrutchar7maekewkiloo/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch '', revision 4ef9a09
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/ucx-1.17.0-qq5l5fowibcomrutchar7maekewkiloo --without-go --disable-doxygen-doc --disable-assertions --enable-compiler-opt=3 --without-java --enable-shared --enable-static --disable-logging --disable-mt --with-openmp --enable-optimizations --disable-params-check --disable-gtest --with-pic --with-cuda=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/cuda-12.4.0-tddfkicmflo4uydz5vvubsl5233hiasi --enable-cma --without-dc --without-dm --with-gdrcopy=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/gdrcopy-2.4.1-i7vxfrthjgn7ojewfj5a4pwsspcsg4te --with-ib-hw-tm --with-knem=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/knem-1.1.4-bhkutyn7invsbjv3e32yg3k5fiusiah6 --without-mlx5-dv --with-rc --with-ud --with-xpmem=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/xpmem-2.6.5-36-oeerzcdtxg5h6qhtv7s2nmmsh5imj4xl --without-fuse3 --without-bfd --with-rdmacm=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/rdma-core-52.0-frbk7sgqzmo2vjgu642ryhq26e3dxma7 --with-verbs=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/rdma-core-52.0-frbk7sgqzmo2vjgu642ryhq26e3dxma7 --with-avx --without-rocm
