Background information
What version of Open MPI are you using?
v5.0.0rc7
Describe how Open MPI was installed
tarball
Please describe the system on which you are running
- Operating system/version: Linux 4.19.0-18-cloud-amd64 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
- Network type: TCP/IP
Details of the problem
I am trying to make a distributed system built on Open MPI continue running after a node failure. To do this, I first need to detect and handle the failure.
I am using Open MPI v5.0.0rc7, launched with "mpirun --with-ft ulfm", and have called "MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN)". However, the node failure does not seem to be returned as an error code that the application can handle.
Example:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int comm_size;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int window_buffer = 0;
    if (my_rank == 1)
    {
        window_buffer = 12345;
    }

    MPI_Win window;
    MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);
    MPI_Win_fence(0, window);

    int value_fetched;
    if (my_rank == 0)
    {
        // Network fails. Attempt to fetch the value from the MPI process 1 window
        system("sudo iptables -A OUTPUT -d 10.166.0.18 -j DROP");
        system("sudo iptables -A INPUT -s 10.166.0.18 -j DROP");
        int err = MPI_Get(&value_fetched, 1, MPI_INT, 1, 0, 1, MPI_INT, window);
        // Handle error
        if (err)
        {
            printf("Received error from MPI_Get: %d\n", err);
        }
        // reset firewall
        system("sudo iptables --flush");
    }

    MPI_Win_fence(0, window);
    MPI_Win_free(&window);
    MPI_Finalize();
    return EXIT_SUCCESS;
}
$ /home/ompi5rc7/bin/mpic++ example.cpp
$ /home/ompi5rc7/bin/mpirun --with-ft ulfm -n 2 --hostfile ../hosts ./a.out
--------------------------------------------------------------------------
WARNING: The selected 'osc' module 'rdma' is not tested for post-failure
operation, yet you have requested support for fault tolerance.
When using this component, normal failure free operation is expected;
However, failures may cause the application to abort, crash or deadlock.
In this framework, the following components are tested to operate under
failure scenarios: {}
--------------------------------------------------------------------------
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef
< long wait here >
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
node-died
But I couldn't open the help file:
(null). Sorry!
--------------------------------------------------------------------------
I have also tried running with "/home/ompi5rc7/bin/mpirun --with-ft ulfm --mca btl tcp,self -n 2 --hostfile ../hosts ./a.out" but I get the same output. I am not using RDMA; the network is plain TCP/IP.
Is it possible to print out the error code after a node failure?