Background information

Version

OpenMPI 3.0.0 with CUDA support. I don't know exactly how OpenMPI was installed; I am not the system administrator, but I will let you know as soon as I find out.

System

Linux juron1-adm 3.10.0-514.26.2.el7.ppc64le #1 SMP Mon Jul 10 02:18:17 GMT 2017 ppc64le ppc64le ppc64le GNU/Linux

18 IBM S822LC servers ("Minsky"), each with:
2 IBM POWER8 processors (up to 4.023 GHz, 2*10 cores, 8 threads/core)
4 NVIDIA Tesla P100 GPUs ("Pascal")
4x16 GByte HBM memory attached to GPU
256 GByte DDR4 memory attached to the POWER8 processors
1.6 TByte NVMe SSD
All nodes are connected to a single Mellanox InfiniBand EDR switch.
Details of the problem
I have C++14 code that uses CUDA Thrust (the CUDA part is C++11). When I tried to run it on multiple nodes, I received the following error:
--------------------------------------------------------------------------
mpirun: Forwarding signal 12 to job
[juronc06.juron.dns.zone:70129] [[1582,0],0] grpcomm:direct:send_relay proc [[1582,0],1] not running - cannot relay: NOT ALIVE
--------------------------------------------------------------------------
An internal error has occurred in ORTE:
[[1582,0],0] FORCE-TERMINATE AT Unreachable:-12 - error ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c(548)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
I was trying to execute a test from a library I wrote. The code can be found here.
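To give an idea of the kind of program involved (the real gvec test is in the linked repository), here is a minimal MPI + CUDA Thrust sketch. It is only an illustration added for context, not code from the library; the vector size, the reduction, and the output are made up.

// Minimal MPI + CUDA Thrust sketch (illustration only, not the actual gvec test).
// Each rank reduces a thrust::device_vector on its local GPU; the partial sums
// are then combined across ranks with MPI_Reduce.
#include <mpi.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Fill a device vector with 0..N-1 and reduce it on this rank's GPU.
    thrust::device_vector<double> v(1 << 20);
    thrust::sequence(v.begin(), v.end());
    double local = thrust::reduce(v.begin(), v.end(), 0.0);

    // Combine the per-rank sums on rank 0 through host memory.
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("%d ranks, global sum = %f\n", nranks, global);

    MPI_Finalize();
    return 0;
}

A file like this would typically be compiled with the MPI wrapper as the host compiler, e.g. nvcc -ccbin mpicxx -std=c++11 sketch.cu -o sketch; the exact build of the real test is in the repository.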
The cluster uses LSF; the command I ran is shown under "LSBATCH: User input" in the job report below. Output from LSF:
Sender: LSF System <[email protected]>
Subject: Job 7650: <gvec> in cluster <juron> Exited

Job <gvec> was submitted from host <juron1-adm> by user <padc013> in cluster <juron>.
Job was executed on host(s) <1*juronc06> <1*juronc04> <1*juronc07> <1*juronc03>, in queue <normal>, as user <padc013> in cluster <juron>.
</gpfs/homeb/padc/padc013> was used as the home directory.
</gpfs/work/padc/padc013/alss> was used as the working directory.
Started at
Results reported on
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun /homeb/padc/padc013/asynchronator/juron/test/gvec
------------------------------------------------------------
TERM_RUNLIMIT: job killed after reaching LSF run time limit.
Exited with exit code 244.
Resource usage summary:
CPU time: 0.47 sec.
Max Memory : 25 MB
Average Memory : 25.00 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : 3
Max Threads : 9
Run time: 62 sec.
Turnaround time: 64 sec.
The output (if any) is above this job summary.
PS:
Read file <gvec.err> for stderr output of this job.
We are getting this same error message - can you please tell me how you determined what was wrong with the cluster configuration? We are also using LSF.