Wrong result on non-blocking collectives with user-defined op #1754

Closed
@yukiM-fj

Our team (the Fujitsu MPI team) found a problem in libnbc in Open MPI 2.0.0rc1.
Non-blocking collectives can return wrong results when used with a user-defined op.
The problem occurs for the following two reasons:

  • libnbc does not take non-commutative operations into account.
  • Results may be overwritten unexpectedly in the ompi_3buff_op_user() function.
    • The overwrite happens when the src2 and dst arguments of ompi_3buff_op_user() point to the same buffer.
    • The "chain" algorithm of MPI_Ireduce follows this pattern.
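
To make the overwrite concrete, here is a minimal stand-alone sketch in plain C (no MPI needed). It is only a model: it assumes the 3-buffer helper copies src1 into dst and then applies the user function with src2 as the input vector. The names affine, compose_fn, three_buff_op, reduce_separate and reduce_aliased are made up for this illustration and are not Open MPI identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One element: the affine map x -> a*x + b. Composition of such maps is
 * associative but not commutative, so it is a legal MPI reduction op. */
typedef struct { int a, b; } affine;

/* Modelled on MPI_User_function (datatype argument dropped):
 * inout[i] = in[i] (op) inout[i]. */
typedef void user_fn(const affine *in, affine *inout, int count);

static void compose_fn(const affine *in, affine *inout, int count)
{
    for (int i = 0; i < count; i++) {
        affine f = in[i], g = inout[i];   /* read both before writing */
        inout[i].a = f.a * g.a;
        inout[i].b = f.a * g.b + f.b;
    }
}

/* ASSUMED model of a 3-buffer user op: copy src1 into dst, then fold
 * src2 in. This is a simplification, not the Open MPI source. */
static void three_buff_op(user_fn *fn, const affine *src1,
                          const affine *src2, affine *dst, int count)
{
    assert(count >= 0);
    /* If src2 == dst, this copy destroys src2 before it is used. */
    memcpy(dst, src1, (size_t)count * sizeof *dst);
    fn(src2, dst, count);                 /* dst = src2 (op) dst */
}

/* dst is a separate buffer: the result is s2 (op) s1. */
static affine reduce_separate(affine s1, affine s2)
{
    affine dst;
    three_buff_op(compose_fn, &s1, &s2, &dst, 1);
    return dst;
}

/* dst aliases src2: the original value of s2 is lost. */
static affine reduce_aliased(affine s1, affine s2)
{
    three_buff_op(compose_fn, &s1, &s2, &s2, 1);
    return s2;
}
```

With s1 = {2, 1} and s2 = {3, 5}, reduce_separate() gives s2 (op) s1 = {6, 8}, while reduce_aliased() gives {4, 3}, which is s1 (op) s1. Because the op is non-commutative, swapping the operands does not help either: s1 (op) s2 is {6, 11}.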

I wrote a program that reproduces this problem and posted it as a gist.

We fixed libnbc in the following ways:

  • Changed the algorithm selection and the algorithm behavior for non-commutative ops.
    • When a non-commutative op is set, the iallreduce ring algorithm is not called, and the ireduce binomial algorithm temporarily uses rank 0 as the root.
  • Added a wrapper around each call to "NBC_Sched_op" in every algorithm; the wrapper sets the parameters so that the result is computed correctly.
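
As a rough illustration of the wrapper idea, the sketch below shows one possible way to keep the operand order fixed and stay correct when dst aliases either source. It is plain C without MPI and is not the pseudo-code from the gist: op_wrapper, affine and compose_fn are hypothetical names, the real NBC_Sched_op signature is not reproduced, and partially overlapping buffers are not handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Element type and op: composition of affine maps x -> a*x + b,
 * which is associative but not commutative. */
typedef struct { int a, b; } affine;

/* Modelled on MPI_User_function: inout[i] = in[i] (op) inout[i]. */
typedef void user_fn(const affine *in, affine *inout, int count);

static void compose_fn(const affine *in, affine *inout, int count)
{
    for (int i = 0; i < count; i++) {
        affine f = in[i], g = inout[i];
        inout[i].a = f.a * g.a;
        inout[i].b = f.a * g.b + f.b;
    }
}

/* Hypothetical wrapper: computes dst = left (op) right, in that order,
 * whether dst is a separate buffer or the same buffer as left or right.
 * Returns 0 on success, -1 if the scratch allocation fails. */
static int op_wrapper(user_fn *fn, const affine *left, const affine *right,
                      affine *dst, int count)
{
    assert(count >= 0);
    size_t bytes = (size_t)count * sizeof *dst;

    if (dst == right) {
        /* dst already holds the right operand: fold left in place. */
        fn(left, dst, count);
        return 0;
    }
    if (dst == left) {
        /* Copying right into dst would destroy left, so stage the
         * result in a scratch buffer first. */
        affine *tmp = malloc(bytes ? bytes : 1);
        if (tmp == NULL)
            return -1;
        memcpy(tmp, right, bytes);
        fn(left, tmp, count);             /* tmp = left (op) right */
        memcpy(dst, tmp, bytes);
        free(tmp);
        return 0;
    }
    /* No aliasing: seed dst with the right operand, then fold left in. */
    memcpy(dst, right, bytes);
    fn(left, dst, count);
    return 0;
}
```

With left = {2, 1} and right = {3, 5}, all three call patterns (separate dst, dst == right, dst == left) yield {6, 11}. The scratch buffer is only allocated in the one case that needs it.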

I wrote two files showing pseudo-code for the fix and posted them as another gist.
These files are based on Open MPI v1.8.4. In that gist, "psedo-alg-selection.c" covers the algorithm selection and behavior, and "pseudo-wrapper_libnbcop.c" covers the wrapper.
