Added an initial implementation of partitioned communication. #8044

Closed
wants to merge 1 commit into from

mdosanjh
Contributor

@mdosanjh mdosanjh commented Sep 14, 2020

This is an initial implementation of partitioned communication. It adds the part MCA component and an 'rma' module as the initial back end, which is built on the MPI RMA interface.

There is one notable caveat to this implementation: both MPI_Psend_init and MPI_Precv_init are blocking, which is not compliant with the upcoming MPI-4 specification.

@ompiteam-bot

Can one of the admins verify this patch?

@jsquyres
Member

ok to test

@jsquyres
Member

Also added @mdosanjh to the Open MPI github org.

@hjelmn
Member

hjelmn commented Sep 14, 2020

@mdosanjh We should probably work on getting the non-blocking window creation done if it isn't already underway. That would address the caveat.

@mdosanjh
Contributor Author

@hjelmn Agreed; I haven't started looking at that yet, but it is on my todo list. The init calls also contain an MPI_Comm_create_group, which likewise prevents a non-blocking implementation (it limits the window to the two processes involved and ensures the tags match). I'm wondering whether it is feasible to create a non-blocking MPI_Win_create that takes a group so that it includes only a subset of the processes, eliminating the need for MPI_Comm_create_group.

@mdosanjh
Contributor Author

@jsquyres I got the following error message:
Started by upstream project "open-mpi.pull_request" build number 6962
originally caused by:
GitHub pull request #8044 of commit e443621, no merge conflicts.
Running as SYSTEM
FATAL: no longer a configured node for EC2 (Open MPI AWS Production) - Ubuntu 16.04 (i-08271dfed355b1be0)
java.lang.IllegalStateException: no longer a configured node for EC2 (Open MPI AWS Production) - Ubuntu 16.04 (i-08271dfed355b1be0)
at hudson.model.AbstractBuild$AbstractBuildExecution.getCurrentNode(AbstractBuild.java:415)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:455)
at hudson.model.Run.execute(Run.java:1880)
at hudson.matrix.MatrixBuild.run(MatrixBuild.java:323)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:428)
Finished: FAILURE

This doesn't appear to be an issue with my code (though I could be wrong). Is there a way to re-run the tests?

@jsquyres
Member

bot:aws:retest

@jsquyres
Member

jsquyres commented Mar 6, 2021

bot:retest

@ibm-ompi

ibm-ompi commented Mar 6, 2021

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/bd05eb727e830d4a91645c5f35ee6ff6

@jjhursey
Member

jjhursey commented Mar 8, 2021

The PGI failure log is super long, but it's there if you download it.

Here is the error:

NVC++-S-0103-Illegal operand types for comparison operator (osc_rdma_frag.h: 75)
NVC++/power Linux 20.11-0: compilation completed with severe errors
NVC++-S-0103-Illegal operand types for comparison operator (osc_rdma_frag.h: 75)
NVC++/power Linux 20.11-0: compilation completed with severe errors
NVC++-S-0103-Illegal operand types for comparison operator (osc_rdma_frag.h: 75)
NVC++/power Linux 20.11-0: compilation completed with severe errors
NVC++-S-0103-Illegal operand types for comparison operator (osc_rdma_frag.h: 75)
NVC++/power Linux 20.11-0: compilation completed with severe errors
NVC++-S-0103-Illegal operand types for comparison operator (osc_rdma_frag.h: 75)
NVC++/power Linux 20.11-0: compilation completed with severe errors
make[2]: *** [Makefile:1573: osc_rdma_dynamic.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [Makefile:1573: osc_rdma_accumulate.lo] Error 1
make[2]: *** [Makefile:1573: osc_rdma_comm.lo] Error 1
make[2]: *** [Makefile:1573: osc_rdma_passive_target.lo] Error 1
make[2]: *** [Makefile:1573: osc_rdma_active_target.lo] Error 1
make[2]: Leaving directory '/tmp/ompi-src/ompi/mca/osc/rdma'
make[1]: *** [Makefile:2737: all-recursive] Error 1
make[1]: Leaving directory '/tmp/ompi-src/ompi'
make: *** [Makefile:1545: all-recursive] Error 1

The file is super long because of piles and piles of these warnings (@awlauria do we have a PGI cleanup PR to address this?):

"../../opal/include/opal/sys/atomic_stdc.h", line 119: warning: argument of
          type "opal_atomic_int32_t *" is incompatible with parameter of type
          "volatile void *"
  OPAL_ATOMIC_STDC_DEFINE_FETCH_OP(add, 32, int32_t, +)
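
For context, the kind of wrapper such a macro might generate can be sketched with plain C11 atomics. Everything prefixed demo_ or DEMO_ below is a hypothetical stand-in, not the actual Open MPI definition. The warning text suggests that this compiler's fetch operation is declared with a "volatile void *" parameter, so passing a pointer to an atomic-qualified integer would be flagged at every expansion.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a type such as opal_atomic_int32_t. */
typedef _Atomic int32_t demo_atomic_int32_t;

/* Hypothetical macro: defines demo_atomic_fetch_<op>_<bits>() on top of the
 * C11 atomic_fetch_<op>_explicit() operations with relaxed ordering. */
#define DEMO_DEFINE_FETCH_OP(op, bits, type)                                    \
    static inline type demo_atomic_fetch_##op##_##bits(demo_atomic_##type *addr, \
                                                       type value)              \
    {                                                                           \
        /* Returns the value held before the operation was applied. */          \
        return atomic_fetch_##op##_explicit(addr, value, memory_order_relaxed); \
    }

DEMO_DEFINE_FETCH_OP(add, 32, int32_t)
DEMO_DEFINE_FETCH_OP(sub, 32, int32_t)
```

Each expansion yields one small inline function, so a header that instantiates the macro for many operations and widths would repeat the same diagnostic many times per translation unit.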

@awlauria
Contributor

awlauria commented Mar 9, 2021

#8444 should clean up that PGI warning. I just need to push myself to finish it.

I'll try to rebase this week

@mdosanjh mdosanjh closed this Mar 18, 2021

7 participants