-
Notifications
You must be signed in to change notification settings - Fork 901
OSC/UCX: Adding the following optimizations (nonblocking accumulate and reusing resources) #10709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OSC/UCX: Adding the following optimizations (nonblocking accumulate and reusing resources) #10709
Conversation
Can we split this into two PRs? The two changes seem to be independent. |
@devreal The features are independent, but the changes are intertwined, we'd prefer to leave it as is. FWIW, these are significant improvements in NWCHEM performance, now at least on small scale, we seem to outperform IMPI by about 20%, need to test larger scales to see effects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a preliminary review. There is good stuff in this PR. Some things aren't clear to me and I'd like to see some documentation on how the default ep mechanism works, as there seems to be some special handling depending on whether multiple-thread support is enabled.
ompi/mca/osc/ucx/osc_ucx.h
Outdated
@@ -150,6 +152,17 @@ typedef struct ompi_osc_ucx_lock { | |||
#define OSC_UCX_GET_EP(comm_, rank_) (ompi_comm_peer_lookup(comm_, rank_)->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_UCX]) | |||
#define OSC_UCX_GET_DISP(module_, rank_) ((module_->disp_unit < 0) ? module_->disp_units[rank_] : module_->disp_unit) | |||
|
|||
extern bool mpi_thread_multiple_enabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it has the same meaning opal_using_threads()
. Is it different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are two reasons that for now, we did not set mpi_thread_multiple_enabled the same as opal_using_threads():
- We need access to comm world in OSC Finalize, to destroy all the created default EPs. Therefore, we need to fix the following issue before making this change:
- We want to support high multi-threading performance for applications that call init with thread-single, but they still use threads for RMA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to support high multi-threading performance for applications that call init with thread-single, but they still use threads for RMA.
I'm not sure I understand. There shouldn't be multiple threads if MPI was initialized with thread-single. Can you elaborate? Did you mean thread-serialized? opal_using_threads
should be false for thread-serialized, so it has the semantics you're looking for, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good to know. If this is a requirement, we can rely on opal_using_threads(). I found a workaround for the first issue, and now the osc finalize is fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The symbol is still here. Can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please check the latest commits and let me know if you have more thoughts about it?
@devreal
ompi/mca/osc/ucx/osc_ucx_component.c
Outdated
MCA_BASE_VAR_SCOPE_GROUP, &enable_nonblocking_accumulate); | ||
(void) mca_base_component_var_register(&mca_osc_ucx_component.super.osc_version, "enable_wpool_thread_multiple", | ||
description_str, MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0, OPAL_INFO_LVL_5, | ||
MCA_BASE_VAR_SCOPE_GROUP, &mpi_thread_multiple_enabled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is that an MCA parameter? That should be determined from how MPI is initialized (see my other comment about the opal variable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answered, please see above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure why this is an MCA parameter? Why would a user have to control thread-safety from the command line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the latest commits, we are setting this variable as the same as the opal_thread_multiple().
I would like to keep this as an MCA variable for the following reason: Users will have the option to provide each window with an endpoint, instead of sharing it across all the windows. So far, I have not found such an application that can benefit from it, but it would be harmless to keep this option open. What do you think?
Hello! The Git Commit Checker CI bot found a few problems with this PR: 6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
@MamziB missing signed off in the last commit |
Hello! The Git Commit Checker CI bot found a few problems with this PR: addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
@devreal, Thanks for your constructive comments. Can you please let us know if you have more comments? |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
ompi/mca/osc/ucx/osc_ucx.h
Outdated
@@ -150,6 +152,17 @@ typedef struct ompi_osc_ucx_lock { | |||
#define OSC_UCX_GET_EP(comm_, rank_) (ompi_comm_peer_lookup(comm_, rank_)->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_UCX]) | |||
#define OSC_UCX_GET_DISP(module_, rank_) ((module_->disp_unit < 0) ? module_->disp_units[rank_] : module_->disp_unit) | |||
|
|||
extern bool mpi_thread_multiple_enabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to support high multi-threading performance for applications that call init with thread-single, but they still use threads for RMA.
I'm not sure I understand. There shouldn't be multiple threads if MPI was initialized with thread-single. Can you elaborate? Did you mean thread-serialized? opal_using_threads
should be false for thread-serialized, so it has the semantics you're looking for, no?
Hello! The Git Commit Checker CI bot found a few problems with this PR: 6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
hey @devreal please let us know your further comments. Again, thanks for your various help so far. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MamziB! The issue with the thread-multiple variable still exists (either needs renaming or be replaced by opal_using_threads()
) and a few more points I'm not clear about.
ompi/mca/osc/ucx/osc_ucx.h
Outdated
@@ -150,6 +152,17 @@ typedef struct ompi_osc_ucx_lock { | |||
#define OSC_UCX_GET_EP(comm_, rank_) (ompi_comm_peer_lookup(comm_, rank_)->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_UCX]) | |||
#define OSC_UCX_GET_DISP(module_, rank_) ((module_->disp_unit < 0) ? module_->disp_units[rank_] : module_->disp_unit) | |||
|
|||
extern bool mpi_thread_multiple_enabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The symbol is still here. Can it be removed?
ompi/mca/osc/ucx/osc_ucx_component.c
Outdated
MCA_BASE_VAR_SCOPE_GROUP, &enable_nonblocking_accumulate); | ||
(void) mca_base_component_var_register(&mca_osc_ucx_component.super.osc_version, "enable_wpool_thread_multiple", | ||
description_str, MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0, OPAL_INFO_LVL_5, | ||
MCA_BASE_VAR_SCOPE_GROUP, &mpi_thread_multiple_enabled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure why this is an MCA parameter? Why would a user have to control thread-safety from the command line?
Hello! The Git Commit Checker CI bot found a few problems with this PR: 08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: ddf5d70: cleanup
dcb44cb: cleanup
8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 9f46f48: create separate req obj for accumulate
ddf5d70: cleanup
dcb44cb: cleanup
8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 63e6893: add description for mca vars
9f46f48: create separate req obj for accumulate
ddf5d70: cleanup
dcb44cb: cleanup
8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 18ff47b: fix the naming of req comp
63e6893: add description for mca vars
9f46f48: create separate req obj for accumulate
ddf5d70: cleanup
dcb44cb: cleanup
8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 7651aa3: fix thread_enabled name
18ff47b: fix the naming of req comp
63e6893: add description for mca vars
9f46f48: create separate req obj for accumulate
ddf5d70: cleanup
dcb44cb: cleanup
8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 5444d2e: disable nb acc for dyn windows
7651aa3: fix thread_enabled name
18ff47b: fix the naming of req comp
63e6893: add description for mca vars
9f46f48: create separate req obj for accumulate
ddf5d70: cleanup
dcb44cb: cleanup
8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 174b53c: adding ref count for accumulate lock
5444d2e: disable nb acc for dyn windows
7651aa3: fix thread_enabled name
18ff47b: fix the naming of req comp
63e6893: add description for mca vars
9f46f48: create separate req obj for accumulate
ddf5d70: cleanup
dcb44cb: cleanup
8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
@devreal Please kindly let us know if you have more comments. Thanks a lot. |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 00ebe71: Enhance ref counters
174b53c: adding ref count for accumulate lock
5444d2e: disable nb acc for dyn windows
7651aa3: fix thread_enabled name
18ff47b: fix the naming of req comp
63e6893: add description for mca vars
9f46f48: create separate req obj for accumulate
ddf5d70: cleanup
dcb44cb: cleanup
8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MamziB, looks good to me now 👍
@devreal I'm going to push the last commit. Please wait for that. I am working on pushing it. |
Hello! The Git Commit Checker CI bot found a few problems with this PR: a2d2993: Using a separate lock for handling the dynamic win...
00ebe71: Enhance ref counters
174b53c: adding ref count for accumulate lock
5444d2e: disable nb acc for dyn windows
7651aa3: fix thread_enabled name
18ff47b: fix the naming of req comp
63e6893: add description for mca vars
9f46f48: create separate req obj for accumulate
ddf5d70: cleanup
dcb44cb: cleanup
8213d67: make nonblocking acc default
d91d76a: atomic add for nb outstanding ops and renaming the...
37163ce: move num_incomplete_req_ops to osc ucx context
08a847c: use opal_uses_thread to set the mpi_thread_multipl...
e4072fe: fixing the macros
5c28f90: adding the prefix
6db6d31: Enhancing the osc finalize for resource utilizati...
4afa6c7: code reorganization
addc6b8: enhance the datatype handling in nb acc
6b537d1: Fixing some corner cases in nonblocking accumulate...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
@devreal Please take a look at the last commits. I reverted a couple of commits, as they were not MPI atomicity compliant. Basically, using those two commits, one process can initiate multiple on-the-flight nonblocking accumulate to the same target. This can violate the atomicity of each MPI_Accumulate/Get_accumulate call. Now everything should be good to go. |
a2d2993
to
5e6ba1a
Compare
I squashed the commits, however, since my branch is not based on the latest main branch, some unwanted commits sneaked in. I will rebase my branch on top of the latest main and I will try again. Thanks for your patience. |
5e6ba1a
to
d1b0e37
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing, otherwise it's looking good now 👍
ompi/mca/osc/ucx/osc_ucx.h
Outdated
#define OSC_UCX_GET_DISP(module_, rank_) ((module_->disp_unit < 0) ? module_->disp_units[rank_] : module_->disp_unit) | ||
|
||
extern bool opal_common_ucx_thread_enabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an opal variables and should be defined in an opal header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment, Actually, I noticed a warning message regarding this variable. I have fixed this warning in the latest push. Please kindly take look.
By the way, I added the definition to the opal header file, and I got many more warnings like below:
libopen-palmca_common_ucx_noinst_la-common_ucx_wpool.o: 0000000000000001 C opal_common_ucx_thread_enabled
osc_ucx_component.o: 0000000000000001 C opal_common_ucx_thread_enabled
osc_ucx_active_target.o: 0000000000000001 C opal_common_ucx_thread_enabled
osc_ucx_comm.o: 0000000000000001 C opal_common_ucx_thread_enabled
osc_ucx_passive_target.o: 0000000000000001 C opal_common_ucx_thread_enabled
osc_ucx_request.o: 0000000000000001 C opal_common_
Therefore, instead, I defined it in an opal .c file, and extern'ed it in suitable places. This is because we do not need a copy of the variable on each file that includes the opal header, instead, I defined it in a c file and extern it on other files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you declare the variable in a header file you mark it as extern
and define it once in a .c file. The messages suggest that you have definitions of the same variable in multiple files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always forget the details about this warning. Here is a thread about the warning you're seeing: https://users.open-mpi.narkive.com/fTvNIGxP/ompi-make-install-warns-about-common-symbols
Bottom line: global variables (in your case opal_common_ucx_thread_enabled
) should always be initialized where they are defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ee49d3b
to
34a98f3
Compare
…eps in single threaded applications, this is helpful if an application creates many windows, therefore, we avoid the unnecessary overheads and 2) adding the truely nonblocking MPI_Accumulate/Get_Accumulate. Signed-off-by: Mamzi Bayatpour <[email protected]> Co-authored-by: Tomislav Janjusic <[email protected]> Co-authored-by: Joseph Schuchart <[email protected]>>
34a98f3
to
1ea6fb9
Compare
bot:retest |
bot:retest |
bot:ibm:retest |
bot:retest |
@MamziB please open up v5.0 PR asap |
Adding the following optimizations: 1) Reuse the same workers/eps in
single-threaded applications, this is helpful if an application
creates many windows, therefore, we avoid the unnecessary overheads and 2) adding the truly nonblocking
MPI_Accumulate/Get_Accumulate.
Signed-off-by: Mamzi Bayatpour [email protected]
Co-authored-by: Tomislav Janjusic [email protected]
Co-authored-by: Joseph Schuchart [email protected]