An internal error has occurred in ORTE #4416

Closed
teonnik opened this issue Oct 28, 2017 · 2 comments

Comments


teonnik commented Oct 28, 2017

Background information

Version

Open MPI 3.0.0 with CUDA support. I don't know exactly how Open MPI was installed; I am not the system administrator, but I will let you know as soon as I find out.

System

Linux juron1-adm 3.10.0-514.26.2.el7.ppc64le #1 SMP Mon Jul 10 02:18:17 GMT 2017 ppc64le ppc64le ppc64le GNU/Linux

18 IBM S822LC servers ("Minsky"), each with

  • 2 IBM POWER8 processors (up to 4.023 GHz, 2*10 cores, 8 threads/core)
  • 4 NVIDIA Tesla P100 GPUs ("Pascal")
  • 4 × 16 GByte HBM memory attached to the GPUs
  • 256 GByte DDR4 memory attached to the POWER8 processors
  • 1.6 TByte NVMe SSD

All nodes are connected to a single Mellanox InfiniBand EDR switch.


Details of the problem

I have a C++14 code that uses CUDA Thrust (the CUDA part is C++11). When I tried to run it on multiple nodes, I received the following error:

--------------------------------------------------------------------------
mpirun: Forwarding signal 12 to job
[juronc06.juron.dns.zone:70129] [[1582,0],0] grpcomm:direct:send_relay proc [[1582,0],1] not running - cannot relay: NOT ALIVE 
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[1582,0],0] FORCE-TERMINATE AT Unreachable:-12 - error ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c(548)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

I was trying to execute a test from a library I wrote. The code can be found here.

The cluster uses LSF; I ran with the following command:

bsub -J gvec -n 4 -R "span[ptile=1]" -R "rusage[ngpus_shared=1]" -W 00:01 -q normal -e gvec.err -o gvec.out "mpirun /homeb/padc/padc013/asynchronator/juron/test/gvec"

Output from LSF:

Sender: LSF System <[email protected]>
Subject: Job 7650: <gvec> in cluster <juron> Exited

Job <gvec> was submitted from host <juron1-adm> by user <padc013> in cluster <juron>.
Job was executed on host(s) <1*juronc06>, in queue <normal>, as user <padc013> in cluster <juron>.
                            <1*juronc04>
                            <1*juronc07>
                            <1*juronc03>
</gpfs/homeb/padc/padc013> was used as the home directory.
</gpfs/work/padc/padc013/alss> was used as the working directory.
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun /homeb/padc/padc013/asynchronator/juron/test/gvec 
------------------------------------------------------------

TERM_RUNLIMIT: job killed after reaching LSF run time limit.
Exited with exit code 244.

Resource usage summary:

    CPU time :                                   0.47 sec.
    Max Memory :                                 25 MB
    Average Memory :                             25.00 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                9
    Run time :                                   62 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.



PS:

Read file <gvec.err> for stderr output of this job.
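
To isolate this, a stripped-down launch along the following lines might help (a sketch only, not yet run here; the job name and output files are placeholders, and the LSF options are copied from the submission above). Replacing the test binary with plain hostname shows whether mpirun can start processes on all four nodes at all, independent of the application:

# placeholder job name and output files; LSF options identical to the original submission
bsub -J probe -n 4 -R "span[ptile=1]" -R "rusage[ngpus_shared=1]" -W 00:01 -q normal -e probe.err -o probe.out "mpirun hostname"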

teonnik commented Oct 31, 2017

The issue does not pertain to Open MPI; it is due to an incorrect cluster configuration.
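
For anyone who hits the same message: as a rough sketch (these are generic Open MPI mpirun options, not the exact steps used to diagnose this cluster), the following can help narrow down where the daemon launch goes wrong:

# show which nodes Open MPI believes the resource manager allocated to the job
mpirun --display-allocation hostname

# verbose output from the process-launch (plm) and daemon (odls) frameworks;
# this typically shows on which host the orted daemon fails to start or report back
mpirun --mca plm_base_verbose 5 --mca odls_base_verbose 5 hostname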

teonnik closed this as completed Oct 31, 2017
lisalenorelowe commented

We are getting this same error message - can you please tell me how you determined what was wrong with the cluster configuration? We are also using LSF.
