
Return error from node failure #10389

Open
@hatmer

Background information

What version of Open MPI are you using?

v5.0.0rc7

Describe how Open MPI was installed

tarball

Please describe the system on which you are running

  • Operating system/version: Linux 4.19.0-18-cloud-amd64 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
  • Network type: TCP/IP

Details of the problem

I am trying to make a distributed system built on Open MPI continue running after a node failure. To do this, I must detect the failure and handle it in application code.

I am using Open MPI v5.0.0rc7, launched with "--with-ft ulfm", and have set "MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN)". However, the node failure does not appear to be returned as an error code that the application can handle.

Example:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
 
int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int comm_size;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
 
    int window_buffer = 0;
    if (my_rank == 1)
    {
        window_buffer = 12345;
    } 

    MPI_Win window;
    MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);
    MPI_Win_fence(0, window);
 
    int value_fetched;
    if(my_rank == 0)
    {
        // Simulate a network failure by dropping all traffic to and from 10.166.0.18,
        // then attempt to fetch the value from MPI process 1's window
        system("sudo iptables -A OUTPUT -d 10.166.0.18 -j  DROP");
        system("sudo iptables -A INPUT -s 10.166.0.18 -j DROP");
        int err = MPI_Get(&value_fetched, 1, MPI_INT, 1, 0, 1, MPI_INT, window);

        // Handle error
        if (err)
        {
            printf("Received error from MPI_Get: %d\n", err);
        }
        // reset firewall
        system("sudo iptables --flush");
    }
 
    MPI_Win_fence(0, window);
    MPI_Win_free(&window); 
    MPI_Finalize();
    return EXIT_SUCCESS;
}

$ /home/ompi5rc7/bin/mpic++ example.cpp
$ /home/ompi5rc7/bin/mpirun --with-ft ulfm -n 2 --hostfile ../hosts ./a.out
--------------------------------------------------------------------------
WARNING: The selected 'osc' module 'rdma' is not tested for post-failure
operation, yet you have requested support for fault tolerance.
When using this component, normal failure free operation is expected;
However, failures may cause the application to abort, crash or deadlock.

In this framework, the following components are tested to operate under
failure scenarios: {}
--------------------------------------------------------------------------
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef 

< long wait here >

--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    node-died
But I couldn't open the help file:
    (null).  Sorry!
--------------------------------------------------------------------------

I have also tried running with "/home/ompi5rc7/bin/mpirun --with-ft ulfm --mca btl tcp,self -n 2 --hostfile ../hosts ./a.out", but I get the same output. I am not using RDMA.

Is it possible to print out the error code after a node failure?
