OSC/UCX: Adding the following optimizations (nonblocking accumulate and reusing resources) #10709

Merged 1 commit on Nov 2, 2022

Contributor

@MamziB MamziB commented Aug 23, 2022

Adding the following optimizations:
1) Reuse the same workers/eps in single-threaded applications. This helps when an application creates many windows, because it avoids unnecessary overhead.
2) Add truly nonblocking MPI_Accumulate/MPI_Get_accumulate.

Signed-off-by: Mamzi Bayatpour [email protected]
Co-authored-by: Tomislav Janjusic [email protected]
Co-authored-by: Joseph Schuchart [email protected]
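
The first optimization can be illustrated with a minimal sketch. This is not the PR's code: `ep_set_t`, `window_t`, `window_create`, `window_free`, and `component_finalize` are hypothetical names, and the structs only stand in for a UCX worker and its endpoints. The sketch assumes that single-threaded windows share one lazily created default set, that multi-threaded windows each get a private set, and that the shared set is released at finalize time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a UCX worker plus its endpoints (not a UCX type). */
typedef struct {
    int  refcount;  /* windows currently using this set */
    bool shared;    /* true for the process-wide default set */
} ep_set_t;

typedef struct {
    ep_set_t *eps;  /* resources this window communicates through */
} window_t;

static ep_set_t *default_eps = NULL;  /* created lazily, reused by later windows */
static int ep_sets_created = 0;       /* counts the expensive setup step */

static ep_set_t *ep_set_new(bool shared) {
    ep_set_t *s = malloc(sizeof(*s));
    if (s == NULL) return NULL;
    s->refcount = 0;
    s->shared = shared;
    ep_sets_created++;
    return s;
}

/* Single-threaded: reuse the default set. Multi-threaded: private set per window. */
static int window_create(window_t *win, bool multi_threaded) {
    ep_set_t *s;
    if (multi_threaded) {
        s = ep_set_new(false);
    } else {
        if (default_eps == NULL) default_eps = ep_set_new(true);
        s = default_eps;
    }
    if (s == NULL) return -1;
    s->refcount++;
    win->eps = s;
    return 0;
}

/* Private sets are released with their window; the shared set survives. */
static void window_free(window_t *win) {
    ep_set_t *s = win->eps;
    assert(s != NULL && s->refcount > 0);
    win->eps = NULL;
    s->refcount--;
    if (!s->shared) free(s);
}

/* Called once at shutdown, after all windows are freed. */
static void component_finalize(void) {
    if (default_eps != NULL) {
        assert(default_eps->refcount == 0);
        free(default_eps);
        default_eps = NULL;
    }
}
```

With this layout, creating many windows in a single-threaded run performs the setup step once, which is the overhead the description says it avoids.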

@devreal
Contributor

devreal commented Aug 24, 2022

Can we split this into two PRs? The two changes seem to be independent.

@janjust
Contributor

janjust commented Aug 24, 2022

@devreal The features are independent, but the changes are intertwined, so we'd prefer to leave it as is. FWIW, these are significant improvements in NWChem performance: at least at small scale, we now seem to outperform IMPI by about 20%. We need to test larger scales to see the effects.

Contributor

@devreal devreal left a comment

Here is a preliminary review. There is good stuff in this PR. Some things aren't clear to me and I'd like to see some documentation on how the default ep mechanism works, as there seems to be some special handling depending on whether multiple-thread support is enabled.

@@ -150,6 +152,17 @@ typedef struct ompi_osc_ucx_lock {
#define OSC_UCX_GET_EP(comm_, rank_) (ompi_comm_peer_lookup(comm_, rank_)->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_UCX])
#define OSC_UCX_GET_DISP(module_, rank_) ((module_->disp_unit < 0) ? module_->disp_units[rank_] : module_->disp_unit)

extern bool mpi_thread_multiple_enabled;
Contributor

This looks like it has the same meaning as opal_using_threads(). Is it different?

Contributor Author

@MamziB MamziB Oct 7, 2022

There are two reasons why, for now, we did not set mpi_thread_multiple_enabled to the same value as opal_using_threads():

  1. We need access to comm world in OSC finalize to destroy all the created default EPs. Therefore, we need to fix the following issue before making this change:

#10629

  2. We want to support high multi-threading performance for applications that call init with thread-single, but they still use threads for RMA.

Contributor

> We want to support high multi-threading performance for applications that call init with thread-single, but they still use threads for RMA.

I'm not sure I understand. There shouldn't be multiple threads if MPI was initialized with thread-single. Can you elaborate? Did you mean thread-serialized? opal_using_threads should be false for thread-serialized, so it has the semantics you're looking for, no?
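
The distinction drawn here can be sketched as a small predicate. This is an illustration only: `thread_level_t` and `needs_internal_locking` are hypothetical names that mirror the four MPI thread-support levels, and the sketch assumes, as the comment above states for opal_using_threads, that only the thread-multiple level requires internal thread safety.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the four MPI thread-support levels, in increasing order. */
typedef enum {
    THREAD_SINGLE,
    THREAD_FUNNELED,
    THREAD_SERIALIZED,
    THREAD_MULTIPLE
} thread_level_t;

/* Only THREAD_MULTIPLE allows two threads inside the library at the same time,
 * so it is the only level that needs internal locking in this sketch. */
static bool needs_internal_locking(thread_level_t provided) {
    assert(provided >= THREAD_SINGLE && provided <= THREAD_MULTIPLE);
    return provided == THREAD_MULTIPLE;
}
```

Under this reading, a thread-serialized application can take the same shared-resource path as a thread-single one, because the application itself guarantees that calls never overlap.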

Contributor Author

This is good to know. If this is a requirement, we can rely on opal_using_threads(). I found a workaround for the first issue, and now the osc finalize is fixed.

Contributor

The symbol is still here. Can it be removed?

Contributor Author

Can you please check the latest commits and let me know if you have more thoughts about it?
@devreal

MCA_BASE_VAR_SCOPE_GROUP, &enable_nonblocking_accumulate);
(void) mca_base_component_var_register(&mca_osc_ucx_component.super.osc_version, "enable_wpool_thread_multiple",
description_str, MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0, OPAL_INFO_LVL_5,
MCA_BASE_VAR_SCOPE_GROUP, &mpi_thread_multiple_enabled);
Contributor

Why is that an MCA parameter? That should be determined from how MPI is initialized (see my other comment about the opal variable)

Contributor Author

Answered, please see above.

Contributor

I'm still not sure why this is an MCA parameter? Why would a user have to control thread-safety from the command line?

Contributor Author

@MamziB MamziB Oct 17, 2022

With the latest commits, we set this variable to the same value as opal_thread_multiple().
I would like to keep this as an MCA variable for the following reason: it gives users the option to provide each window with its own endpoint instead of sharing one across all windows. So far, I have not found an application that benefits from it, but it would be harmless to keep this option open. What do you think?

@github-actions

github-actions bot commented Oct 6, 2022

Hello! The Git Commit Checker CI bot found a few problems with this PR:

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@janjust
Contributor

janjust commented Oct 6, 2022

@MamziB missing signed off in the last commit

@MamziB
Contributor Author

MamziB commented Oct 6, 2022

> @MamziB missing signed off in the last commit

@janjust I am planning to squash all the new commits, and then add sign-off. Does that sound good to you?

@github-actions

github-actions bot commented Oct 7, 2022

Hello! The Git Commit Checker CI bot found a few problems with this PR:

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@MamziB
Contributor Author

MamziB commented Oct 7, 2022

@devreal, Thanks for your constructive comments. Can you please let us know if you have more comments?

@github-actions

github-actions bot commented Oct 7, 2022

Hello! The Git Commit Checker CI bot found a few problems with this PR:

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@MamziB
Contributor Author

MamziB commented Oct 13, 2022

Hey @devreal, please let us know if you have further comments. Again, thanks for all your help so far.

Contributor

@devreal devreal left a comment

Thanks @MamziB! The issue with the thread-multiple variable still exists (it either needs renaming or should be replaced by opal_using_threads()), and there are a few more points I'm not clear about.

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

ddf5d70: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

dcb44cb: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

9f46f48: create separate req obj for accumulate

  • check_signed_off: does not contain a valid Signed-off-by line

ddf5d70: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

dcb44cb: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

63e6893: add description for mca vars

  • check_signed_off: does not contain a valid Signed-off-by line

9f46f48: create separate req obj for accumulate

  • check_signed_off: does not contain a valid Signed-off-by line

ddf5d70: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

dcb44cb: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

18ff47b: fix the naming of req comp

  • check_signed_off: does not contain a valid Signed-off-by line

63e6893: add description for mca vars

  • check_signed_off: does not contain a valid Signed-off-by line

9f46f48: create separate req obj for accumulate

  • check_signed_off: does not contain a valid Signed-off-by line

ddf5d70: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

dcb44cb: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

7651aa3: fix thread_enabled name

  • check_signed_off: does not contain a valid Signed-off-by line

18ff47b: fix the naming of req comp

  • check_signed_off: does not contain a valid Signed-off-by line

63e6893: add description for mca vars

  • check_signed_off: does not contain a valid Signed-off-by line

9f46f48: create separate req obj for accumulate

  • check_signed_off: does not contain a valid Signed-off-by line

ddf5d70: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

dcb44cb: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

5444d2e: disable nb acc for dyn windows

  • check_signed_off: does not contain a valid Signed-off-by line

7651aa3: fix thread_enabled name

  • check_signed_off: does not contain a valid Signed-off-by line

18ff47b: fix the naming of req comp

  • check_signed_off: does not contain a valid Signed-off-by line

63e6893: add description for mca vars

  • check_signed_off: does not contain a valid Signed-off-by line

9f46f48: create separate req obj for accumulate

  • check_signed_off: does not contain a valid Signed-off-by line

ddf5d70: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

dcb44cb: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

174b53c: adding ref count for accumulate lock

  • check_signed_off: does not contain a valid Signed-off-by line

5444d2e: disable nb acc for dyn windows

  • check_signed_off: does not contain a valid Signed-off-by line

7651aa3: fix thread_enabled name

  • check_signed_off: does not contain a valid Signed-off-by line

18ff47b: fix the naming of req comp

  • check_signed_off: does not contain a valid Signed-off-by line

63e6893: add description for mca vars

  • check_signed_off: does not contain a valid Signed-off-by line

9f46f48: create separate req obj for accumulate

  • check_signed_off: does not contain a valid Signed-off-by line

ddf5d70: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

dcb44cb: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@MamziB
Contributor Author

MamziB commented Oct 20, 2022

@devreal Please kindly let us know if you have more comments. Thanks a lot.

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

00ebe71: Enhance ref counters

  • check_signed_off: does not contain a valid Signed-off-by line

174b53c: adding ref count for accumulate lock

  • check_signed_off: does not contain a valid Signed-off-by line

5444d2e: disable nb acc for dyn windows

  • check_signed_off: does not contain a valid Signed-off-by line

7651aa3: fix thread_enabled name

  • check_signed_off: does not contain a valid Signed-off-by line

18ff47b: fix the naming of req comp

  • check_signed_off: does not contain a valid Signed-off-by line

63e6893: add description for mca vars

  • check_signed_off: does not contain a valid Signed-off-by line

9f46f48: create separate req obj for accumulate

  • check_signed_off: does not contain a valid Signed-off-by line

ddf5d70: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

dcb44cb: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Contributor

@devreal devreal left a comment

Thanks @MamziB, looks good to me now 👍

@MamziB
Contributor Author

MamziB commented Oct 24, 2022

@devreal I'm about to push one last commit. Please wait for that.

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

a2d2993: Using a separate lock for handling the dynamic win...

  • check_signed_off: does not contain a valid Signed-off-by line

00ebe71: Enhance ref counters

  • check_signed_off: does not contain a valid Signed-off-by line

174b53c: adding ref count for accumulate lock

  • check_signed_off: does not contain a valid Signed-off-by line

5444d2e: disable nb acc for dyn windows

  • check_signed_off: does not contain a valid Signed-off-by line

7651aa3: fix thread_enabled name

  • check_signed_off: does not contain a valid Signed-off-by line

18ff47b: fix the naming of req comp

  • check_signed_off: does not contain a valid Signed-off-by line

63e6893: add description for mca vars

  • check_signed_off: does not contain a valid Signed-off-by line

9f46f48: create separate req obj for accumulate

  • check_signed_off: does not contain a valid Signed-off-by line

ddf5d70: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

dcb44cb: cleanup

  • check_signed_off: does not contain a valid Signed-off-by line

8213d67: make nonblocking acc default

  • check_signed_off: does not contain a valid Signed-off-by line

d91d76a: atomic add for nb outstanding ops and renaming the...

  • check_signed_off: does not contain a valid Signed-off-by line

37163ce: move num_incomplete_req_ops to osc ucx context

  • check_signed_off: does not contain a valid Signed-off-by line

08a847c: use opal_uses_thread to set the mpi_thread_multipl...

  • check_signed_off: does not contain a valid Signed-off-by line

e4072fe: fixing the macros

  • check_signed_off: does not contain a valid Signed-off-by line

5c28f90: adding the prefix

  • check_signed_off: does not contain a valid Signed-off-by line

6db6d31: Enhancing the osc finalize for resource utilizati...

  • check_signed_off: does not contain a valid Signed-off-by line

4afa6c7: code reorganization

  • check_signed_off: does not contain a valid Signed-off-by line

addc6b8: enhance the datatype handling in nb acc

  • check_signed_off: does not contain a valid Signed-off-by line

6b537d1: Fixing some corner cases in nonblocking accumulate...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: ae97b9792e246f4397fc175fde36f41e7c35637f

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@MamziB
Contributor Author

MamziB commented Oct 25, 2022

@devreal Please take a look at the latest commits. I reverted a couple of commits because they did not comply with MPI atomicity requirements. With those two commits, one process could initiate multiple in-flight nonblocking accumulates to the same target, which can violate the atomicity of each MPI_Accumulate/MPI_Get_accumulate call. Now everything should be good to go.
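
To illustrate the atomicity concern, here is a toy model, not the osc/ucx code. It assumes an accumulate is carried out as a fetch of the target value, a local addition, and a write-back; all names (`example_acc_t`, `example_overlapped`, `example_serialized`) are hypothetical. When two such operations to the same target overlap, the second one reads a stale value and its write-back discards the first contribution.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one accumulate split into its phases. */
typedef struct {
    long fetched;   /* value read from the target in the "get" phase */
    long operand;   /* local contribution to be added */
} example_acc_t;

/* Phase 1: read the current target value. */
static void example_acc_get(example_acc_t *op, const long *target)
{
    assert(op != NULL && target != NULL);
    op->fetched = *target;
}

/* Phases 2 and 3: apply the operand locally and write the result back. */
static void example_acc_put(const example_acc_t *op, long *target)
{
    assert(op != NULL && target != NULL);
    *target = op->fetched + op->operand;
}

/* Two accumulates in flight at once: both read before either writes. */
long example_overlapped(long initial, long a, long b)
{
    long target = initial;
    example_acc_t op1 = { 0, a };
    example_acc_t op2 = { 0, b };

    example_acc_get(&op1, &target);
    example_acc_get(&op2, &target);   /* reads a value op1 has not updated yet */
    example_acc_put(&op1, &target);
    example_acc_put(&op2, &target);   /* overwrites op1's contribution */
    return target;
}

/* The same two accumulates, each completed before the next one starts. */
long example_serialized(long initial, long a, long b)
{
    long target = initial;
    example_acc_t op1 = { 0, a };
    example_acc_t op2 = { 0, b };

    example_acc_get(&op1, &target);
    example_acc_put(&op1, &target);
    example_acc_get(&op2, &target);
    example_acc_put(&op2, &target);
    return target;
}
```

With an initial value of 0 and operands 5 and 7, `example_overlapped` ends at 7 while `example_serialized` ends at 12, which is the result an atomic accumulate should produce.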

@MamziB MamziB force-pushed the mamzi/single-thread-enhancements-3 branch from a2d2993 to 5e6ba1a Compare October 26, 2022 18:21
@MamziB
Contributor Author

MamziB commented Oct 26, 2022

I squashed the commits; however, since my branch was not based on the latest main branch, some unwanted commits sneaked in. I will rebase my branch onto the latest main and try again. Thanks for your patience.

@MamziB MamziB force-pushed the mamzi/single-thread-enhancements-3 branch from 5e6ba1a to d1b0e37 Compare October 26, 2022 18:53
Contributor

@devreal devreal left a comment

One more thing, otherwise it's looking good now 👍

#define OSC_UCX_GET_DISP(module_, rank_) ((module_->disp_unit < 0) ? module_->disp_units[rank_] : module_->disp_unit)

extern bool opal_common_ucx_thread_enabled;
Contributor

This is an opal variable and should be defined in an opal header.

Contributor Author

Thanks for the comment. I noticed a warning message regarding this variable and have fixed it in the latest push. Please take a look.
By the way, when I added the definition to the opal header file, I got many more warnings like the ones below:

libopen-palmca_common_ucx_noinst_la-common_ucx_wpool.o: 0000000000000001 C opal_common_ucx_thread_enabled
      osc_ucx_component.o: 0000000000000001 C opal_common_ucx_thread_enabled
  osc_ucx_active_target.o: 0000000000000001 C opal_common_ucx_thread_enabled
           osc_ucx_comm.o: 0000000000000001 C opal_common_ucx_thread_enabled
osc_ucx_passive_target.o: 0000000000000001 C opal_common_ucx_thread_enabled
        osc_ucx_request.o: 0000000000000001 C opal_common_

Therefore, I instead defined it once in an opal .c file and declared it extern in the files that use it, since we do not need a copy of the variable in every file that includes the opal header.

Contributor

If you declare the variable in a header file, you mark it as extern there and define it once in a .c file. The messages suggest that you have definitions of the same variable in multiple files.

Contributor

I always forget the details about this warning. Here is a thread about the warning you're seeing: https://users.open-mpi.narkive.com/fTvNIGxP/ompi-make-install-warns-about-common-symbols

Bottom line: global variables (in your case opal_common_ucx_thread_enabled) should always be initialized where they are defined.
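
A minimal sketch of the pattern discussed in this thread, collapsed into one file for brevity. The names `example_thread_enabled`, `example_is_threaded`, and `example_set_threaded` are hypothetical, and the `false` initial value is an assumption for illustration. The header carries only an `extern` declaration, and exactly one .c file holds the definition with an explicit initializer. On toolchains that default to `-fcommon` (GCC before version 10), an uninitialized file-scope definition is a tentative definition and may be emitted as a common symbol, which `nm` prints as `C`; the initializer makes it an ordinary definition instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Header part (hypothetical example_flags.h): declaration only, no storage. */
extern bool example_thread_enabled;

/* Source part (exactly one .c file): the single definition, with an
 * explicit initializer so it is not a tentative definition. */
bool example_thread_enabled = false;

/* Any file that includes the header reads the one shared object. */
bool example_is_threaded(void)
{
    return example_thread_enabled;
}

/* Hypothetical setter: updates the single shared flag and checks that
 * readers observe the new value. */
void example_set_threaded(bool enabled)
{
    example_thread_enabled = enabled;
    assert(example_is_threaded() == enabled);
}
```

In a real tree the three parts would live in separate files; only the `extern` line would be visible to every includer.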

Contributor Author

Thanks @devreal, I fixed this issue.
@janjust The branch is ready.

@MamziB MamziB force-pushed the mamzi/single-thread-enhancements-3 branch 4 times, most recently from ee49d3b to 34a98f3 Compare October 27, 2022 18:37
…eps in

single-threaded applications; this is helpful if an application
creates many windows, since it avoids unnecessary overhead, and 2) adding the truly nonblocking
MPI_Accumulate/Get_Accumulate.

Signed-off-by: Mamzi Bayatpour <[email protected]>
Co-authored-by: Tomislav Janjusic <[email protected]>
Co-authored-by: Joseph Schuchart <[email protected]>
@MamziB MamziB force-pushed the mamzi/single-thread-enhancements-3 branch from 34a98f3 to 1ea6fb9 Compare October 27, 2022 18:44
@janjust
Contributor

janjust commented Oct 31, 2022

bot:retest

@janjust
Contributor

janjust commented Oct 31, 2022

bot:retest

@janjust
Contributor

janjust commented Nov 1, 2022

bot:ibm:retest

@janjust
Contributor

janjust commented Nov 1, 2022

bot:retest

@janjust
Contributor

janjust commented Nov 2, 2022

@MamziB please open a v5.0 PR ASAP

@janjust janjust merged commit 710ff57 into open-mpi:main Nov 2, 2022
@MamziB
Contributor Author

MamziB commented Nov 2, 2022

@janjust I opened the v5.0 PR here:
#11025
