An internal error has occurred in ORTE #4416

Closed
teonnik opened this issue Oct 28, 2017 · 2 comments

Comments


teonnik commented Oct 28, 2017

Background information

Version

Open MPI 3.0.0 with CUDA support. I don't know exactly how Open MPI was installed; I am not the system administrator, but I will let you know as soon as I find out.

System

Linux juron1-adm 3.10.0-514.26.2.el7.ppc64le #1 SMP Mon Jul 10 02:18:17 GMT 2017 ppc64le ppc64le ppc64le GNU/Linux

18 IBM S822LC servers ("Minsky"), each with

  • 2 IBM POWER8 processors (up to 4.023 GHz, 2*10 cores, 8 threads/core)
  • 4 NVIDIA Tesla P100 GPUs ("Pascal")
  • 4 × 16 GByte HBM memory attached to the GPUs
  • 256 GByte DDR4 memory attached to the POWER8 processors
  • 1.6 TByte NVMe SSD

All nodes are connected to a single Mellanox InfiniBand EDR switch.


Details of the problem

I have a C++14 code that uses CUDA Thrust (the CUDA part is C++11). When I tried to run it on multiple nodes, I received the following error:

--------------------------------------------------------------------------
mpirun: Forwarding signal 12 to job
[juronc06.juron.dns.zone:70129] [[1582,0],0] grpcomm:direct:send_relay proc [[1582,0],1] not running - cannot relay: NOT ALIVE 
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[1582,0],0] FORCE-TERMINATE AT Unreachable:-12 - error ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c(548)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

I was trying to execute a test from a library I wrote. The code can be found here.

The cluster uses LSF; I ran with the following command:

bsub -J gvec -n 4 -R "span[ptile=1]" -R "rusage[ngpus_shared=1]" -W 00:01 -q normal -e gvec.err -o gvec.out "mpirun /homeb/padc/padc013/asynchronator/juron/test/gvec"

Output from LSF:

Sender: LSF System <[email protected]>
Subject: Job 7650: <gvec> in cluster <juron> Exited

Job <gvec> was submitted from host <juron1-adm> by user <padc013> in cluster <juron>.
Job was executed on host(s) <1*juronc06>, in queue <normal>, as user <padc013> in cluster <juron>.
                            <1*juronc04>
                            <1*juronc07>
                            <1*juronc03>
</gpfs/homeb/padc/padc013> was used as the home directory.
</gpfs/work/padc/padc013/alss> was used as the working directory.
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun /homeb/padc/padc013/asynchronator/juron/test/gvec 
------------------------------------------------------------

TERM_RUNLIMIT: job killed after reaching LSF run time limit.
Exited with exit code 244.

Resource usage summary:

    CPU time :                                   0.47 sec.
    Max Memory :                                 25 MB
    Average Memory :                             25.00 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                9
    Run time :                                   62 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.



PS:

Read file <gvec.err> for stderr output of this job.
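
To isolate this, a stripped-down launch along the following lines might help (a sketch only, not yet run here; the job name and output files are placeholders, and the LSF options are copied from the submission above). Replacing the test binary with plain hostname shows whether mpirun can start processes on all four nodes at all, independent of the application:

# placeholder job name and output files; LSF options identical to the original submission
bsub -J probe -n 4 -R "span[ptile=1]" -R "rusage[ngpus_shared=1]" -W 00:01 -q normal -e probe.err -o probe.out "mpirun hostname"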

teonnik commented Oct 31, 2017

The issue does not pertain to Open MPI; it is due to an incorrect cluster configuration.
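
For anyone who hits the same message: as a rough sketch (these are generic Open MPI mpirun options, not the exact steps used to diagnose this cluster), the following can help narrow down where the daemon launch goes wrong:

# show which nodes Open MPI believes the resource manager allocated to the job
mpirun --display-allocation hostname

# verbose output from the process-launch (plm) and daemon (odls) frameworks;
# this typically shows on which host the orted daemon fails to start or report back
mpirun --mca plm_base_verbose 5 --mca odls_base_verbose 5 hostname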

teonnik closed this as completed Oct 31, 2017
lisalenorelowe commented

We are getting this same error message - can you please tell me how you determined what was wrong with the cluster configuration? We are also using LSF.
