MPI Spawn jobs don't work on multinode LSF cluster #9041

Closed
@Extremys

Description

Background information

Version of Open MPI used

Open MPI v4.0.5

Open MPI installation

Installed from the EasyBuild recipe built with the GCC 10.2 toolchain. ompi_info output:

[lsf-host:31240] mca: base: components_register: registering framework btl components
[lsf-host:31240] mca: base: components_register: found loaded component self
[lsf-host:31240] mca: base: components_register: component self register function successful
[lsf-host:31240] mca: base: components_register: found loaded component tcp
[lsf-host:31240] mca: base: components_register: component tcp register function successful
[lsf-host:31240] mca: base: components_register: found loaded component sm
[lsf-host:31240] mca: base: components_register: found loaded component usnic
[lsf-host:31240] mca: base: components_register: component usnic register function successful
[lsf-host:31240] mca: base: components_register: found loaded component vader
[lsf-host:31240] mca: base: components_register: component vader register function successful
                 Package: Open MPI easybuild@lsf-host Distribution
                Open MPI: 4.0.5
  Open MPI repo revision: v4.0.5
   Open MPI release date: Aug 26, 2020
                Open RTE: 4.0.5
  Open RTE repo revision: v4.0.5
   Open RTE release date: Aug 26, 2020
                    OPAL: 4.0.5
      OPAL repo revision: v4.0.5
       OPAL release date: Aug 26, 2020
                 MPI API: 3.1.0
            Ident string: 4.0.5
                  Prefix: /cm/easybuild/software/OpenMPI/4.0.5-GCC-10.2.0
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: lsf-host
           Configured by: easybuild
           Configured on: Fri Jun  4 17:58:42 CEST 2021
          Configure host: lsf-host
  Configure command line: '--prefix=/cm/easybuild/software/OpenMPI/4.0.5-GCC-10.2.0'
                          '--build=x86_64-pc-linux-gnu'
                          '--host=x86_64-pc-linux-gnu'
                          '--with-lsf=/pss/lsf/9.1/'
                          '--with-lsf-libdir=/pss/lsf/9.1/linux2.6-glibc2.3-x86_64/lib'
                          '--with-pmix=/cm/easybuild/software/PMIx/3.1.5-GCCcore-10.2.0'
                          '--enable-mpirun-prefix-by-default'
                          '--enable-shared'
                          '--with-hwloc=/cm/easybuild/software/hwloc/2.2.0-GCCcore-10.2.0'
                          '--with-libevent=/cm/easybuild/software/libevent/2.1.12-GCCcore-10.2.0'
                          '--with-ucx=/cm/easybuild/software/UCX/1.9.0-GCCcore-10.2.0'
                          '--without-verbs'
                Built by: easybuild
                Built on: Fri Jun  4 18:10:10 CEST 2021
              Built host: lsf-host
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /cm/easybuild/software/GCCcore/10.2.0/bin/gcc
  C compiler family name: GNU
      C compiler version: 10.2.0
            C++ compiler: g++
   C++ compiler absolute: /cm/easybuild/software/GCCcore/10.2.0/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /cm/easybuild/software/GCCcore/10.2.0/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: no
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.0.5)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.0.5)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.5)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.5)
                 MCA btl: usnic (MCA v2.1.0, API v3.1.0, Component v4.0.5)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.5)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.0.5)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA event: external (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.0.5)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.0.5)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.0.5)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v4.0.5)
                MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.0.5)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.0.5)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.0.5)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: lsf (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.0.5)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.0.5)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA plm: lsf (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA ras: lsf (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.0.5)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.0.5)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.0.5)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.0.5)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.0.5)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.0.5)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.5)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.0.5)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.0.5)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.0.5)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA mtl: psm2 (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.0.5)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.0.5)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v4.0.5)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
[lsf-host:31240] mca: base: close: unloading component self
[lsf-host:31240] mca: base: close: unloading component tcp
[lsf-host:31240] mca: base: close: unloading component usnic
[lsf-host:31240] mca: base: close: unloading component vader

System description

  • Operating system/version: Red Hat Enterprise Linux 7.3
  • Computer hardware: 64-bit Intel Broadwell-generation nodes
  • Network type: Ethernet
  • iptables rules: empty
  • Job scheduler: LSF 9.1, cluster of 3 Intel nodes

Details of the problem

I am trying to run a simple MPI spawn program on an LSF cluster. When the scheduler allocates a single node the execution works fine, but when the allocation spans multiple nodes, the MPI processes spawned on separate hosts cannot talk to each other and the job aborts. What am I doing wrong? Is it an Open MPI bug? Thank you for your help!

producer.cpp source:

#include "mpi.h"
int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  int rank;
  int size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  std::cout << rank << " " << size << '\n';
  const char* command = "/home/user/worker";
  MPI_Comm everyone;
  int nslaves = 2;
  MPI_Comm_spawn(command, MPI_ARGV_NULL, nslaves, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &everyone,
                 MPI_ERRCODES_IGNORE);
  std::cout << "END" << std::endl;
  MPI_Finalize();
}
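
As a diagnostic sketch (not the program I actually ran), here is a variant of producer.cpp that switches MPI_COMM_WORLD to MPI_ERRORS_RETURN, so a failed spawn is reported as an error string in the parent instead of only through the MPI_ERRORS_ARE_FATAL abort seen in the log further down. It assumes the same /home/user/worker path as above:

#include "mpi.h"
#include <iostream>
#include <vector>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Report spawn failures as error codes instead of aborting
  // (MPI_ERRORS_ARE_FATAL is the default handler on MPI_COMM_WORLD).
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

  const char* command = "/home/user/worker";  // same worker path as above
  const int nworkers = 2;
  std::vector<int> errcodes(nworkers, MPI_SUCCESS);
  MPI_Comm everyone = MPI_COMM_NULL;

  int rc = MPI_Comm_spawn(command, MPI_ARGV_NULL, nworkers, MPI_INFO_NULL, 0,
                          MPI_COMM_WORLD, &everyone, errcodes.data());
  if (rc != MPI_SUCCESS) {
    // Translate the error code into Open MPI's human-readable message.
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(rc, msg, &len);
    std::cerr << "rank " << rank << ": MPI_Comm_spawn failed: " << msg << '\n';
  } else {
    std::cout << "rank " << rank << ": spawn succeeded" << std::endl;
  }

  MPI_Finalize();
}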

worker.cpp source:

#include "mpi.h"
#include <iostream>
int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Comm com;
  MPI_Comm_get_parent(&com);
  std::cout << "Hello" << std::endl;
  MPI_Finalize();
}
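
Similarly, a sketch of a worker.cpp variant (again not the binary used in the run below) that checks whether a parent intercommunicator exists at all; if MPI_Comm_get_parent returns MPI_COMM_NULL, the process was started directly by mpirun rather than spawned:

#include "mpi.h"
#include <iostream>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);

  if (parent == MPI_COMM_NULL) {
    // Launched directly, not via MPI_Comm_spawn.
    std::cout << "worker: no parent communicator" << std::endl;
  } else {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::cout << "worker " << rank << "/" << size << ": spawned by parent" << std::endl;
    MPI_Comm_disconnect(&parent);
  }

  MPI_Finalize();
}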

launching commands:

bash-4.2$ module load OpenMPI/4.0.5-GCC-10.2.0
bash-4.2$ export OMPI_MCA_orte_base_help_aggregate=0
bash-4.2$ export OMPI_MCA_btl_base_verbose=100
bash-4.2$ mpic++ -o producer producer.cpp
bash-4.2$ mpic++ -o worker worker.cpp
bash-4.2$ bsub -n 3 -R "span[ptile=1]" -o output.log mpirun -np 1 ./producer # span[ptile=1] forces one slot per host, so the job spans multiple nodes

output.log content:

Sender: LSF System <[email protected]>
Subject: Job 4088: <mpirun -n 1 /home/user/producer> in cluster <r_cluster> Exited

Job <mpirun -n 1 /home/user/producer> was submitted from host <lsf-host.cm.cluster> by user
 <user> in cluster <r_cluster>.
Job was executed on host(s) <1*lsf-host-001.cm.cluster>, in queue <STANDARD_BATCH>, as user <user> in cluster <r_cluster>.
                            <1*lsf-host-002.cm.cluster>
                            <1*lsf-host.cm.cluster>
</home/user> was used as the home directory.
</home/user> was used as the working directory.
Started at Mon Jun  7 11:42:40 2021
Results reported on Mon Jun  7 11:42:50 2021

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 /home/user/producer
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   0.47 sec.
    Max Memory :                                 53 MB
    Average Memory :                             7.00 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Processes :                              3
    Max Threads :                                7
    Run time :                                   9 sec.
    Turnaround time :                            10 sec.

The output (if any) follows:

[lsf-host-001:03013] mca: base: components_register: registering framework btl components
[lsf-host-001:03013] mca: base: components_register: found loaded component self
[lsf-host-001:03013] mca: base: components_register: component self register function successful
[lsf-host-001:03013] mca: base: components_register: found loaded component tcp
[lsf-host-001:03013] mca: base: components_register: component tcp register function successful
[lsf-host-001:03013] mca: base: components_register: found loaded component sm
[lsf-host-001:03013] mca: base: components_register: found loaded component usnic
[lsf-host-001:03013] mca: base: components_register: component usnic register function successful
[lsf-host-001:03013] mca: base: components_register: found loaded component vader
[lsf-host-001:03013] mca: base: components_register: component vader register function successful
[lsf-host-001:03013] mca: base: components_open: opening btl components
[lsf-host-001:03013] mca: base: components_open: found loaded component self
[lsf-host-001:03013] mca: base: components_open: component self open function successful
[lsf-host-001:03013] mca: base: components_open: found loaded component tcp
[lsf-host-001:03013] mca: base: components_open: component tcp open function successful
[lsf-host-001:03013] mca: base: components_open: found loaded component usnic
[lsf-host-001:03013] mca: base: components_open: component usnic open function successful
[lsf-host-001:03013] mca: base: components_open: found loaded component vader
[lsf-host-001:03013] mca: base: components_open: component vader open function successful
[lsf-host-001:03013] select: initializing btl component self
[lsf-host-001:03013] select: init of component self returned success
[lsf-host-001:03013] select: initializing btl component tcp
[lsf-host-001:03013] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[lsf-host-001:03013] btl: tcp: Found match: 127.0.0.1 (lo)
[lsf-host-001:03013] btl:tcp: Attempting to bind to AF_INET port 1024
[lsf-host-001:03013] btl:tcp: Successfully bound to AF_INET port 1024
[lsf-host-001:03013] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[lsf-host-001:03013] btl:tcp: examining interface eth0
[lsf-host-001:03013] btl:tcp: using ipv6 interface eth0
[lsf-host-001:03013] btl:tcp: examining interface eth1
[lsf-host-001:03013] btl:tcp: using ipv6 interface eth1
[lsf-host-001:03013] select: init of component tcp returned success
[lsf-host-001:03013] select: initializing btl component usnic
[lsf-host-001:03013] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[lsf-host-001:03013] select: init of component usnic returned failure
[lsf-host-001:03013] mca: base: close: component usnic closed
[lsf-host-001:03013] mca: base: close: unloading component usnic
[lsf-host-001:03013] select: initializing btl component vader
[lsf-host-001:03013] select: init of component vader returned failure
[lsf-host-001:03013] mca: base: close: component vader closed
[lsf-host-001:03013] mca: base: close: unloading component vader
0 1
[lsf-host:14462] mca: base: components_register: registering framework btl components
[lsf-host:14462] mca: base: components_register: found loaded component self
[lsf-host:14462] mca: base: components_register: component self register function successful
[lsf-host:14462] mca: base: components_register: found loaded component tcp
[lsf-host:14462] mca: base: components_register: component tcp register function successful
[lsf-host:14462] mca: base: components_register: found loaded component sm
[lsf-host:14462] mca: base: components_register: found loaded component usnic
[lsf-host:14462] mca: base: components_register: component usnic register function successful
[lsf-host:14462] mca: base: components_register: found loaded component vader
[lsf-host:14462] mca: base: components_register: component vader register function successful
[lsf-host:14462] mca: base: components_open: opening btl components
[lsf-host:14462] mca: base: components_open: found loaded component self
[lsf-host:14462] mca: base: components_open: component self open function successful
[lsf-host:14462] mca: base: components_open: found loaded component tcp
[lsf-host:14462] mca: base: components_open: component tcp open function successful
[lsf-host:14462] mca: base: components_open: found loaded component usnic
[lsf-host:14462] mca: base: components_open: component usnic open function successful
[lsf-host:14462] mca: base: components_open: found loaded component vader
[lsf-host:14462] mca: base: components_open: component vader open function successful
[lsf-host:14462] select: initializing btl component self
[lsf-host:14462] select: init of component self returned success
[lsf-host:14462] select: initializing btl component tcp
[lsf-host:14462] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[lsf-host:14462] btl: tcp: Found match: 127.0.0.1 (lo)
[lsf-host:14462] btl:tcp: Attempting to bind to AF_INET port 1024
[lsf-host:14462] btl:tcp: Successfully bound to AF_INET port 1024
[lsf-host:14462] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[lsf-host:14462] btl:tcp: examining interface eth0
[lsf-host:14462] btl:tcp: using ipv6 interface eth0
[lsf-host:14462] btl:tcp: examining interface eth1
[lsf-host:14462] btl:tcp: using ipv6 interface eth1
[lsf-host:14462] select: init of component tcp returned success
[lsf-host:14462] select: initializing btl component usnic
[lsf-host-002:27348] mca: base: components_register: registering framework btl components
[lsf-host-002:27348] mca: base: components_register: found loaded component self
[lsf-host-002:27348] mca: base: components_register: component self register function successful
[lsf-host-002:27348] mca: base: components_register: found loaded component tcp
[lsf-host-002:27348] mca: base: components_register: component tcp register function successful
[lsf-host-002:27348] mca: base: components_register: found loaded component sm
[lsf-host-002:27348] mca: base: components_register: found loaded component usnic
[lsf-host-002:27348] mca: base: components_register: component usnic register function successful
[lsf-host-002:27348] mca: base: components_register: found loaded component vader
[lsf-host-002:27348] mca: base: components_register: component vader register function successful
[lsf-host-002:27348] mca: base: components_open: opening btl components
[lsf-host-002:27348] mca: base: components_open: found loaded component self
[lsf-host-002:27348] mca: base: components_open: component self open function successful
[lsf-host-002:27348] mca: base: components_open: found loaded component tcp
[lsf-host-002:27348] mca: base: components_open: component tcp open function successful
[lsf-host-002:27348] mca: base: components_open: found loaded component usnic
[lsf-host-002:27348] mca: base: components_open: component usnic open function successful
[lsf-host-002:27348] mca: base: components_open: found loaded component vader
[lsf-host-002:27348] mca: base: components_open: component vader open function successful
[lsf-host-002:27348] select: initializing btl component self
[lsf-host-002:27348] select: init of component self returned success
[lsf-host-002:27348] select: initializing btl component tcp
[lsf-host-002:27348] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[lsf-host-002:27348] btl: tcp: Found match: 127.0.0.1 (lo)
[lsf-host-002:27348] btl:tcp: Attempting to bind to AF_INET port 1024
[lsf-host-002:27348] btl:tcp: Successfully bound to AF_INET port 1024
[lsf-host-002:27348] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[lsf-host-002:27348] btl:tcp: examining interface eth0
[lsf-host-002:27348] btl:tcp: using ipv6 interface eth0
[lsf-host-002:27348] btl:tcp: examining interface eth1
[lsf-host-002:27348] btl:tcp: using ipv6 interface eth1
[lsf-host-002:27348] select: init of component tcp returned success
[lsf-host-002:27348] select: initializing btl component usnic
[lsf-host:14462] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[lsf-host:14462] select: init of component usnic returned failure
[lsf-host:14462] mca: base: close: component usnic closed
[lsf-host:14462] mca: base: close: unloading component usnic
[lsf-host:14462] select: initializing btl component vader
[lsf-host:14462] select: init of component vader returned failure
[lsf-host:14462] mca: base: close: component vader closed
[lsf-host:14462] mca: base: close: unloading component vader
[lsf-host-002:27348] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[lsf-host-002:27348] select: init of component usnic returned failure
[lsf-host-002:27348] mca: base: close: component usnic closed
[lsf-host-002:27348] mca: base: close: unloading component usnic
[lsf-host-002:27348] select: initializing btl component vader
[lsf-host-002:27348] select: init of component vader returned failure
[lsf-host-002:27348] mca: base: close: component vader closed
[lsf-host-002:27348] mca: base: close: unloading component vader
[lsf-host-001:03013] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[lsf-host-001:03013] [[59089,1],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[lsf-host-002:27348] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[lsf-host-002:27348] [[59089,2],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[lsf-host:14462] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[lsf-host:14462] [[59089,2],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[lsf-host:14462] *** An error occurred in MPI_Init
[lsf-host:14462] *** reported by process [3872456706,1]
[lsf-host:14462] *** on a NULL communicator
[lsf-host:14462] *** Unknown error
[lsf-host:14462] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lsf-host:14462] ***    and potentially your MPI job)
[lsf-host-001:03013] *** An error occurred in MPI_Comm_spawn
[lsf-host-001:03013] *** reported by process [3872456705,0]
[lsf-host-001:03013] *** on communicator MPI_COMM_WORLD
[lsf-host-001:03013] *** MPI_ERR_OTHER: known error not in list
[lsf-host-001:03013] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lsf-host-001:03013] ***    and potentially your MPI job)
[lsf-host-001:03008] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[lsf-host-002:27348] *** An error occurred in MPI_Init
[lsf-host-002:27348] *** reported by process [3872456706,0]
[lsf-host-002:27348] *** on a NULL communicator
[lsf-host-002:27348] *** Unknown error
[lsf-host-002:27348] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lsf-host-002:27348] ***    and potentially your MPI job)
