UCX osc: add support for acc_single_intrinsic #6980
9836a6c to
393e962
Compare
artpol84
left a comment
ompi/mca/osc/ucx/osc_ucx_comm.c
Outdated
| ompi_datatype_type_size(origin_dt, &origin_dt_bytes); | ||
| ompi_datatype_type_size(target_dt, &target_dt_bytes); | ||
|
|
||
| if (origin_dt_bytes > sizeof(uint64_t) || |
Should we also have ompi_datatype_is_predefined(origin_dt) check?
I'd expect that this is checked upon entry to MPI. Have to check if that happens.
ompi/mca/osc/ucx/osc_ucx_comm.c
Outdated
| return ret; | ||
| } | ||
|
|
||
| ret = opal_common_ucx_wpmem_flush(module->mem, OPAL_COMMON_UCX_SCOPE_EP, target); |
You also will need to flush it per element.
|
Overall, this looks good to me. Though I'd like to double-check some details. |
0a87dcf to
6e500e8
Compare
|
I addressed the reviewer comments and added two optimizations: the accumulate lock won't be acquired if an exclusive lock is present; and the release of the accumulate lock will be overlapped with any previous operations. |
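The lock-avoidance rule described in this thread can be sketched as a small predicate. This is only an illustration under stated assumptions: the struct `hypothetical_win_state_t`, its fields, and the function `needs_acc_lock` are invented names and are not taken from the osc/ucx sources, and the real module may consider more state than these two flags.

```c
#include <stdbool.h>

/* Hypothetical per-window state; the field names are assumptions made for
 * this sketch and do not mirror the actual osc/ucx module structure. */
typedef struct {
    bool acc_single_intrinsic;  /* user promised a single intrinsic type */
    bool holds_exclusive_lock;  /* origin already holds an exclusive lock */
} hypothetical_win_state_t;

/* Returns true if the accumulate lock would have to be taken.
 * The lock is skipped when the user set acc_single_intrinsic, or when an
 * exclusive lock already serializes access to the target. */
static bool needs_acc_lock(const hypothetical_win_state_t *state)
{
    return !(state->acc_single_intrinsic || state->holds_exclusive_lock);
}
```

With both flags cleared the predicate returns true; setting either flag makes it return false.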
|
Thanks, I’ll go over later this week |
6e500e8 to
3e43a73
Compare
|
@artpol84 I finally found the time to work on this again. I hopefully have addressed all the points we found during our review in December. Please take a look if you can spare some time :) I would be happy if we could get this into 5.0.
@yosefe One point that is not clear to me: There is a specific function in
static inline
int opal_common_ucx_atomic_cswap(ucp_ep_h ep, uint64_t compare,
                                 uint64_t value, void *result, size_t op_size,
                                 uint64_t remote_addr, ucp_rkey_h rkey,
                                 ucp_worker_h worker)
{
    uint64_t tmp = value;
    int ret;
    ret = opal_common_ucx_atomic_fetch(ep, UCP_ATOMIC_FETCH_OP_CSWAP, compare, &tmp,
                                       op_size, remote_addr, rkey, worker);
    if (OPAL_LIKELY(OPAL_SUCCESS == ret)) {
        /* in case if op_size is constant (like sizeof(type)) then this condition
         * is evaluated in compile time */
        if (op_size == sizeof(uint64_t)) {
            *(uint64_t*)result = tmp;
        } else {
            assert(op_size == sizeof(uint32_t));
            *(uint32_t*)result = tmp;
        }
    }
    return ret;
}
AFAICS, the assignment from |
|
@devreal it works because the data is taken from the first 4B of the target address and is also placed in the first 4B of tmp. On an LE architecture, the lower 4B of tmp are overwritten with the previous value at remote_addr, not the upper 4B. |
Ahh, I see. I was not thinking in terms of LE architectures. That would not be portable to BE architectures though, right? Why not copy the In similar terms, it is not entirely clear to me what the format of the 64bit value passed to |
|
@devreal it works on BE as well; in this case the first 4B of tmp will be the "upper" part. However, it doesn't really matter: we don't do an integer cast on tmp, we just do a pointer cast, which does not care about endianness.
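To make the two ways of reading a 4-byte result out of a 64-bit temporary concrete, here is a standalone sketch in plain C with no UCX involved. All names (`fake_fetch32`, `read_first_bytes`, `read_low_order_bits`, `host_is_little_endian`) are hypothetical helpers written for this illustration. `fake_fetch32` only assumes, as described in the thread, that a 4-byte fetch stores its result in the first 4 bytes of the buffer it is given.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 4-byte remote fetch: stores the fetched
 * value in the first 4 bytes of the result buffer. */
static void fake_fetch32(void *result, uint32_t remote_value)
{
    memcpy(result, &remote_value, sizeof(remote_value));
}

/* Reads back the first 4 bytes of tmp. This is a byte-wise copy, so the
 * outcome does not depend on the host byte order. */
static uint32_t read_first_bytes(const uint64_t *tmp)
{
    uint32_t out;
    memcpy(&out, tmp, sizeof(out));
    return out;
}

/* Integer conversion: keeps the low-order 32 bits of tmp. Which bytes in
 * memory hold those bits depends on the host byte order. */
static uint32_t read_low_order_bits(uint64_t tmp)
{
    return (uint32_t)tmp;
}

/* Returns 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, sizeof(first));
    return first == 1;
}
```

If `tmp` starts as `0x1111111122222222` and `fake_fetch32(&tmp, 0xDEADBEEF)` is called, `read_first_bytes(&tmp)` yields `0xDEADBEEF` on any host, while `read_low_order_bits(tmp)` yields `0xDEADBEEF` only on a little-endian host. On a big-endian host it yields `0x22222222`, the untouched low-order half.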
|
|
I updated this PR to include the following:
|
|
@awlauria I think this PR is now in a good state again (after a long pause). It's awaiting a review now. Can we add this to the list of tracked issues for 5.0? |
|
@devreal I will add it to the list. Thanks! However - it looks like there may be a commit or two here missing the signed-off by, can you fix that? |
|
@awlauria Yes, I'm aware of that. I wasn't sure whether some of the commits should be squashed. I will fix that one unsigned today though. |
516a5ba to
25440c4
Compare
|
Fixed two bugs that slipped through during initial testing and squashed down some commits. |
|
@janjust can you please review |
25440c4 to
5d9b3da
Compare
|
I'd like to review one more time. I'll plan to go over it this week. I am sorry for not doing this before. |
| if (ret != OMPI_SUCCESS) { | ||
| return ret; | ||
| } | ||
|
|
Can we remove this?
What is 'this'? The whitespace? It's part of a longer section of code that was removed.
ok
Signed-off-by: Joseph Schuchart <[email protected]>
… by UCX Signed-off-by: Joseph Schuchart <[email protected]>
… free memory if required Signed-off-by: Joseph Schuchart <[email protected]>
5d9b3da to
e3b417c
Compare
|
@devreal , Do we have everything in place now? |
artpol84
left a comment
Thank you very much!
This PR adds support for the acc_single_intrinsic info key / mca parameter that allows the user to specify that only a single intrinsic type will be used in accumulate operations to avoid taking the accumulate lock for every operation. It also enables asynchronicity for accumulate operations that do not require emulation through compare-and-swap (currently only replace and sum and compare-and-swap itself).

Limitations:
- 128 bit types are not supported and lead to an error if this info key is set (not sure how to handle 128 bit values with the current UCX API)
- Operations other than replace and sum in MPI_Accumulate require a get followed by a cas loop until successful replacement. This loop is currently executed directly but could be encapsulated in a UCX completion request, issuing a new cas if the previous one failed. Not sure whether UCX allows new operations to be issued from within a completion callback though.
- After merging of UCX osc: properly release exclusive lock to avoid lockup #6933 there might be a few more chances to reduce latency, e.g., by fencing the release of the accumulate lock with the completion of the final put (if acc_single_intrinsic is not enabled).
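The get-followed-by-cas pattern mentioned in the limitations can be illustrated with a local-memory analogue. The sketch below uses C11 `<stdatomic.h>` on a process-local word instead of UCX remote atomics, and picks "max" as an example of an operation that is neither replace nor sum. The function name `fetch_max_u64` is hypothetical, and this is not the code from the PR.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Emulates an atomic "max" with a get followed by a compare-and-swap loop.
 * Returns the value that was stored in *target before the update. */
static uint64_t fetch_max_u64(_Atomic uint64_t *target, uint64_t operand)
{
    /* The initial "get": read the current value once. */
    uint64_t observed = atomic_load(target);
    uint64_t desired;

    do {
        /* Compute the new value from the last observed one. */
        desired = (observed > operand) ? observed : operand;
        /* If *target still equals observed, it is replaced with desired.
         * Otherwise observed is refreshed with the current value and the
         * loop retries. */
    } while (!atomic_compare_exchange_weak(target, &observed, desired));

    return observed;
}
```

Starting from a word holding 5, `fetch_max_u64(&word, 9)` returns 5 and leaves 9 in the word. A following `fetch_max_u64(&word, 3)` returns 9 and leaves the word unchanged.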