UCX osc: add support for acc_single_intrinsic #6980

devreal · 2019-09-16T16:42:17Z

This PR adds support for the acc_single_intrinsic info key / mca parameter that allows the user to specify that only a single intrinsic type will be used in accumulate operations to avoid taking the accumulate lock for every operation. It also enables asynchronicity for accumulate operations that do not require emulation through compare-and-swap (currently only replace and sum and compare-and-swap itself).

Limitations:

~~128 bit types are not supported and lead to an error if this info key is set (not sure how to handle 128 bit values that with the current UCX API)~~
Operations other than replace and sum in MPI_Accumulate require a get followed by a cas loop until successful replacement. This loop is currently executed directly but could be encapsulated in a UCX completion request, issuing a new cas if the previous one failed. Not sure whether UCX allows new operations to be issued from within a completion callback though.
After merging of UCX osc: properly release exclusive lock to avoid lockup #6933 there might be a few more chances to reduce latency, e.g., by fencing the release of the accumulate lock with the completion of the final put (if acc_single_intrinsic is not enabled).

artpol84 · 2019-10-17T18:02:26Z

@devreal, sorry for the delay
I'll try to review later this week and will ask @janjust to go over it next week (as h isn't available this week).

artpol84

@devreal Let's discuss offline.
@yosefe please advise on my questions.

artpol84 · 2019-10-30T03:17:08Z

ompi/mca/osc/ucx/osc_ucx_comm.c

+    ompi_datatype_type_size(origin_dt, &origin_dt_bytes);
+    ompi_datatype_type_size(target_dt, &target_dt_bytes);
+
+    if (origin_dt_bytes  > sizeof(uint64_t) ||


Should we also have ompi_datatype_is_predefined(origin_dt) check?

I'd expect that this is checked upon entry to MPI. Have to check if that happens.

ompi/mca/osc/ucx/osc_ucx_comm.c

artpol84 · 2019-10-30T04:12:26Z

ompi/mca/osc/ucx/osc_ucx_comm.c

+    ompi_datatype_type_size(origin_dt, &origin_dt_bytes);
+    ompi_datatype_type_size(target_dt, &target_dt_bytes);
+
+    if (origin_dt_bytes  > sizeof(uint64_t) ||


Should we also have ompi_datatype_is_predefined(origin_dt) check?

ompi/mca/osc/ucx/osc_ucx_comm.c

artpol84 · 2019-10-30T04:24:02Z

ompi/mca/osc/ucx/osc_ucx_comm.c

+            return ret;
+        }
+
+        ret = opal_common_ucx_wpmem_flush(module->mem, OPAL_COMMON_UCX_SCOPE_EP, target);


You also will need to flush it per element.

artpol84 · 2019-10-30T04:25:40Z

Overall, this looks good to me. Though I'd like to double-check some details.

devreal · 2019-11-04T16:29:03Z

I addressed the reviewer comments and added two optimizations: the accumulate lock won't be acquired if an exclusive lock is present; and the release of the accumulate lock will be overlapped with any previous operations.

artpol84 · 2019-11-05T02:37:24Z

Thanks, I’ll go over later this week

ompi/mca/osc/ucx/osc_ucx_comm.c

devreal · 2020-03-30T20:32:35Z

@artpol84 I finally found the time to work on this again. I hopefully have addressed all the points we found during our review in December. Please take a look if you can spare some time :) I would be happy if we could get this into 5.0.

@yosefe One point that is not clear to me:

There is a specific function in opal/mca/common/ucx/common_ucx.h for compare-and-swap (https://github.com/open-mpi/ompi/blob/master/opal/mca/common/ucx/common_ucx.h#L204). I don't really understand why this code is needed (and doubt that it's correct):

static inline
int opal_common_ucx_atomic_cswap(ucp_ep_h ep, uint64_t compare,
                                 uint64_t value, void *result, size_t op_size,
                                 uint64_t remote_addr, ucp_rkey_h rkey,
                                 ucp_worker_h worker)
{
    uint64_t tmp = value;
    int ret;

    ret = opal_common_ucx_atomic_fetch(ep, UCP_ATOMIC_FETCH_OP_CSWAP, compare, &tmp,
                                       op_size, remote_addr, rkey, worker);
    if (OPAL_LIKELY(OPAL_SUCCESS == ret)) {
        /* in case if op_size is constant (like sizeof(type)) then this condition
         * is evaluated in compile time */
        if (op_size == sizeof(uint64_t)) {
            *(uint64_t*)result = tmp;
        } else {
            assert(op_size == sizeof(uint32_t));
            *(uint32_t*)result = tmp;
        }
    }
    return ret;
}

AFAICS, the assignment from tmp after the operation is faulty in the 32 bit case: with op_size == 4 the upper 4B of tmp will be overwritten with the previous value in remote_addr and the lower 4B should be left untouched by UCP_ATOMIC_FETCH_OP_CSWAP. However, the assignment *(uint32_t*)result = tmp; will downcast the value in tmp, cutting off the upper 4B and assigning the lower 4B. Am I missing something here? Do we really need a special function for compare-and-swap or can't we just handle it through opal_common_ucx_atomic_fetch with UCP_ATOMIC_FETCH_OP_CSWAP? See also the comment here #6980 (comment)

yosefe · 2020-03-31T09:19:22Z

@devreal it works because the data is taken from first 4B of the target address and also placed in first 4B of tmp. on LE arch, the lower 4B of tmp are overwritten with the previous value in remote_addr, not the upper 4B.

devreal · 2020-03-31T13:34:40Z

it works because the data is taken from first 4B of the target address and also placed in first 4B of tmp. on LE arch, the lower 4B of tmp are overwritten with the previous value in remote_addr, not the upper 4B.

Ahh, I see. I was not thinking in terms of LE architectures. That would not be portable to BE architectures though, right? Why not copy the value into result and pass the result pointer directly to opal_common_ucx_atomic_fetch? That would make it easier to create a nonblocking version of compare-and-swap.

In similar terms, it is not entirely clear to me what the format of the 64bit value passed to ucp_atomic_fetch_nb is if op_size == 4. Is the 64-bit value cast down to a 32-bit integer? Are either the upper or lower 4B used? I don't see that specified in the UCX standard.

yosefe · 2020-03-31T13:42:30Z

@devreal it works on BE as well, in this case the first 4B of tmp will be tye "upper" part. However, it doesn't really matter, we don't to integer case on tmp, we just do pointer cast, which does not care about endianness.

the input value for UCX atomic is "uint64_t" so it's int-casted to uint32 if needed.
the output value of the atomic is "void*" so it's a direct pointer to 32-bit result

devreal · 2020-04-03T10:45:09Z

I updated this PR to include the following:

a non-blocking variant of opal_common_ucx_wpmem_cmpswp
a safe and portable way to load and store 64-bit integer values to variable-size pointers without relying on memcpy
support for compare-and-swap for types that are not supported by UCX atomics (from what I understand that is 8, 16, and 128-bit types)

devreal · 2020-04-07T15:59:52Z

@awlauria I think this PR is now in a good state again (after a long pause). It's awaiting a review now. Can we add this to the list of tracked issues for 5.0?

awlauria · 2020-04-15T02:37:29Z

@devreal I will add it to the list. Thanks! However - it looks like there may be a commit or two here missing the signed-off by, can you fix that?

devreal · 2020-04-15T05:05:01Z

@awlauria Yes, I'm aware of that. I wasn't sure whether some of the commits should be squashed. I will fix that one unsigned today though.

devreal · 2020-04-15T10:26:28Z

Fixed two bugs that slipped through during initial testing and squashed down some commits.

artpol84 · 2020-04-28T16:54:22Z

@janjust can you please review

jladd-mlnx · 2020-06-17T13:32:20Z

@janjust @artpol84 @devreal what is the status? Seems like it's a critical fix. Can we merge it?

artpol84 · 2020-06-17T14:21:56Z

I'd like to review one more time. I'll plan to go over it this week. I am sorry for not doing this before.

ompi/mca/osc/ucx/osc_ucx_comm.c

artpol84 · 2020-06-18T15:26:34Z

ompi/mca/osc/ucx/osc_ucx_comm.c

            if (ret != OMPI_SUCCESS) {
                return ret;
            }
-


Can we remove this?

What is 'this'? The whitespace? It's part of a longer section of code that was removed.

Signed-off-by: Joseph Schuchart <[email protected]>

… by UCX Signed-off-by: Joseph Schuchart <[email protected]>

… free memory if required Signed-off-by: Joseph Schuchart <[email protected]>

Signed-off-by: Joseph Schuchart <[email protected]>

artpol84 · 2020-06-25T14:29:14Z

@devreal , Do we have everything in place now?

artpol84

Thank you very much!

devreal force-pushed the ucx-acc-singel-intrinsics branch 2 times, most recently from 9836a6c to 393e962 Compare September 16, 2019 18:37

artpol84 requested review from artpol84 and janjust October 9, 2019 10:37

artpol84 added the enhancement label Oct 9, 2019

artpol84 requested changes Oct 30, 2019

View reviewed changes

devreal force-pushed the ucx-acc-singel-intrinsics branch 2 times, most recently from 0a87dcf to 6e500e8 Compare November 4, 2019 16:26

devreal commented Dec 18, 2019

View reviewed changes

ompi/mca/osc/ucx/osc_ucx_comm.c Outdated Show resolved Hide resolved

devreal commented Dec 18, 2019

View reviewed changes

ompi/mca/osc/ucx/osc_ucx_comm.c Outdated Show resolved Hide resolved

devreal commented Dec 18, 2019

View reviewed changes

ompi/mca/osc/ucx/osc_ucx_comm.c Outdated Show resolved Hide resolved

devreal commented Dec 18, 2019

View reviewed changes

ompi/mca/osc/ucx/osc_ucx_comm.c Outdated Show resolved Hide resolved

awlauria added this to the v5.0.0 milestone Mar 27, 2020

awlauria added the Target: v5.0.x label Mar 27, 2020

devreal force-pushed the ucx-acc-singel-intrinsics branch from 6e500e8 to 3e43a73 Compare March 30, 2020 20:32

devreal force-pushed the ucx-acc-singel-intrinsics branch from 516a5ba to 25440c4 Compare April 15, 2020 10:25

devreal force-pushed the ucx-acc-singel-intrinsics branch from 25440c4 to 5d9b3da Compare June 17, 2020 08:53

devreal mentioned this pull request Jun 17, 2020

RMA Functionality #7813

Open

artpol84 reviewed Jun 18, 2020

View reviewed changes

devreal added 18 commits June 23, 2020 12:41

UCX osc: add support for acc_single_intrinsic info key / mca param

73a1834

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: properly clean up requests in case of errors

d448efd

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: fix macro parameter name usage in OMPI_OSC_UCX_REQUEST_RETURN

8606a02

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: re-use value returned by cswap to save additional get

1a3c6bb

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: allow for overlap with (some) request-based atomic operations

557ae80

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: allow to asynchronously compare-and-swap

7cfc0e7

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: correctly handle MPI_NO_OP

d888b4f

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: simplify output address computation

899f58c

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: Use accumulate for operations/datatypes that are not covered…

4d7a385

… by UCX Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: fence active operations before releasing accumulate lock and…

471d767

… free memory if required Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: do not acquire accumulate lock if exclusive lock was taken

427d4bd

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: centralize decision on whether to use AMOs

d8696aa

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: make MPI_Fetch_and_op non-blocking if possible

5f786bc

Signed-off-by: Joseph Schuchart <[email protected]>

UCX common: add non-blocking compare-and-swap

824afac

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: safely load/store 64bit integer from variable size pointer

7d5a6e3

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: fall back to get-compare-put for unsupported datatypes

434c905

Signed-off-by: Joseph Schuchart <[email protected]>

UCX osc: atomic fetch-and-op only on 32 and 64bit values

e215eff

Signed-off-by: Joseph Schuchart <[email protected]>

Add missing copyright header

e3b417c

Signed-off-by: Joseph Schuchart <[email protected]>

devreal force-pushed the ucx-acc-singel-intrinsics branch from 5d9b3da to e3b417c Compare June 23, 2020 10:42

artpol84 approved these changes Jun 25, 2020

View reviewed changes

artpol84 merged commit 907f4e1 into open-mpi:master Jun 25, 2020

This was referenced Jun 25, 2020

OSC UCX: Remove stale free from merge conflict #7875

Merged

Compile: error (undeclared variable) in UCX osc component #7874

Closed

devreal deleted the ucx-acc-singel-intrinsics branch October 3, 2022 15:53

UCX osc: add support for acc_single_intrinsic #6980

UCX osc: add support for acc_single_intrinsic #6980

Uh oh!

Conversation

devreal commented Sep 16, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

artpol84 commented Oct 17, 2019

Uh oh!

artpol84 left a comment

Choose a reason for hiding this comment

Uh oh!

artpol84 Oct 30, 2019

Choose a reason for hiding this comment

Uh oh!

devreal Nov 4, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

artpol84 Oct 30, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

artpol84 Oct 30, 2019

Choose a reason for hiding this comment

Uh oh!

artpol84 commented Oct 30, 2019

Uh oh!

devreal commented Nov 4, 2019

Uh oh!

artpol84 commented Nov 5, 2019

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

devreal commented Mar 30, 2020 • edited by artemry-nv Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yosefe commented Mar 31, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devreal commented Mar 31, 2020

Uh oh!

yosefe commented Mar 31, 2020

Uh oh!

devreal commented Apr 3, 2020

Uh oh!

devreal commented Apr 7, 2020

Uh oh!

awlauria commented Apr 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devreal commented Apr 15, 2020

Uh oh!

devreal commented Apr 15, 2020

Uh oh!

artpol84 commented Apr 28, 2020

Uh oh!

jladd-mlnx commented Jun 17, 2020

Uh oh!

artpol84 commented Jun 17, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

artpol84 Jun 18, 2020

Choose a reason for hiding this comment

Uh oh!

devreal Jun 23, 2020

Choose a reason for hiding this comment

Uh oh!

artpol84 Jun 25, 2020

Choose a reason for hiding this comment

Uh oh!

devreal commented Sep 16, 2019 •

edited

Loading

devreal commented Mar 30, 2020 •

edited by artemry-nv

Loading

yosefe commented Mar 31, 2020 •

edited

Loading

awlauria commented Apr 15, 2020 •

edited

Loading