Skip to content

Conversation

@devreal
Copy link
Contributor

@devreal devreal commented Sep 16, 2019

This PR adds support for the acc_single_intrinsic info key / mca parameter that allows the user to specify that only a single intrinsic type will be used in accumulate operations to avoid taking the accumulate lock for every operation. It also enables asynchronicity for accumulate operations that do not require emulation through compare-and-swap (currently only replace and sum and compare-and-swap itself).

Limitations:

  • 128 bit types are not supported and lead to an error if this info key is set (not sure how to handle 128 bit values that with the current UCX API)
  • Operations other than replace and sum in MPI_Accumulate require a get followed by a cas loop until successful replacement. This loop is currently executed directly but could be encapsulated in a UCX completion request, issuing a new cas if the previous one failed. Not sure whether UCX allows new operations to be issued from within a completion callback though.
  • After merging of UCX osc: properly release exclusive lock to avoid lockup #6933 there might be a few more chances to reduce latency, e.g., by fencing the release of the accumulate lock with the completion of the final put (if acc_single_intrinsic is not enabled).

@devreal devreal force-pushed the ucx-acc-singel-intrinsics branch 2 times, most recently from 9836a6c to 393e962 Compare September 16, 2019 18:37
@artpol84 artpol84 requested review from artpol84 and janjust October 9, 2019 10:37
@artpol84
Copy link
Contributor

@devreal, sorry for the delay
I'll try to review later this week and will ask @janjust to go over it next week (as h isn't available this week).

Copy link
Contributor

@artpol84 artpol84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devreal Let's discuss offline.
@yosefe please advise on my questions.

ompi_datatype_type_size(origin_dt, &origin_dt_bytes);
ompi_datatype_type_size(target_dt, &target_dt_bytes);

if (origin_dt_bytes > sizeof(uint64_t) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have ompi_datatype_is_predefined(origin_dt) check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect that this is checked upon entry to MPI. Have to check if that happens.

ompi_datatype_type_size(origin_dt, &origin_dt_bytes);
ompi_datatype_type_size(target_dt, &target_dt_bytes);

if (origin_dt_bytes > sizeof(uint64_t) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have ompi_datatype_is_predefined(origin_dt) check?

return ret;
}

ret = opal_common_ucx_wpmem_flush(module->mem, OPAL_COMMON_UCX_SCOPE_EP, target);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also will need to flush it per element.

@artpol84
Copy link
Contributor

Overall, this looks good to me. Though I'd like to double-check some details.

@devreal devreal force-pushed the ucx-acc-singel-intrinsics branch 2 times, most recently from 0a87dcf to 6e500e8 Compare November 4, 2019 16:26
@devreal
Copy link
Contributor Author

devreal commented Nov 4, 2019

I addressed the reviewer comments and added two optimizations: the accumulate lock won't be acquired if an exclusive lock is present; and the release of the accumulate lock will be overlapped with any previous operations.

@artpol84
Copy link
Contributor

artpol84 commented Nov 5, 2019

Thanks, I’ll go over later this week

@awlauria awlauria added this to the v5.0.0 milestone Mar 27, 2020
@devreal devreal force-pushed the ucx-acc-singel-intrinsics branch from 6e500e8 to 3e43a73 Compare March 30, 2020 20:32
@devreal
Copy link
Contributor Author

devreal commented Mar 30, 2020

@artpol84 I finally found the time to work on this again. I hopefully have addressed all the points we found during our review in December. Please take a look if you can spare some time :) I would be happy if we could get this into 5.0.

@yosefe One point that is not clear to me:

There is a specific function in opal/mca/common/ucx/common_ucx.h for compare-and-swap (https://github.com/open-mpi/ompi/blob/master/opal/mca/common/ucx/common_ucx.h#L204). I don't really understand why this code is needed (and doubt that it's correct):

static inline
int opal_common_ucx_atomic_cswap(ucp_ep_h ep, uint64_t compare,
                                 uint64_t value, void *result, size_t op_size,
                                 uint64_t remote_addr, ucp_rkey_h rkey,
                                 ucp_worker_h worker)
{
    uint64_t tmp = value;
    int ret;

    ret = opal_common_ucx_atomic_fetch(ep, UCP_ATOMIC_FETCH_OP_CSWAP, compare, &tmp,
                                       op_size, remote_addr, rkey, worker);
    if (OPAL_LIKELY(OPAL_SUCCESS == ret)) {
        /* in case if op_size is constant (like sizeof(type)) then this condition
         * is evaluated in compile time */
        if (op_size == sizeof(uint64_t)) {
            *(uint64_t*)result = tmp;
        } else {
            assert(op_size == sizeof(uint32_t));
            *(uint32_t*)result = tmp;
        }
    }
    return ret;
}

AFAICS, the assignment from tmp after the operation is faulty in the 32 bit case: with op_size == 4 the upper 4B of tmp will be overwritten with the previous value in remote_addr and the lower 4B should be left untouched by UCP_ATOMIC_FETCH_OP_CSWAP. However, the assignment *(uint32_t*)result = tmp; will downcast the value in tmp, cutting off the upper 4B and assigning the lower 4B. Am I missing something here? Do we really need a special function for compare-and-swap or can't we just handle it through opal_common_ucx_atomic_fetch with UCP_ATOMIC_FETCH_OP_CSWAP? See also the comment here #6980 (comment)

@yosefe
Copy link
Contributor

yosefe commented Mar 31, 2020

@devreal it works because the data is taken from first 4B of the target address and also placed in first 4B of tmp. on LE arch, the lower 4B of tmp are overwritten with the previous value in remote_addr, not the upper 4B.

@devreal
Copy link
Contributor Author

devreal commented Mar 31, 2020

it works because the data is taken from first 4B of the target address and also placed in first 4B of tmp. on LE arch, the lower 4B of tmp are overwritten with the previous value in remote_addr, not the upper 4B.

Ahh, I see. I was not thinking in terms of LE architectures. That would not be portable to BE architectures though, right? Why not copy the value into result and pass the result pointer directly to opal_common_ucx_atomic_fetch? That would make it easier to create a nonblocking version of compare-and-swap.

In similar terms, it is not entirely clear to me what the format of the 64bit value passed to ucp_atomic_fetch_nb is if op_size == 4. Is the 64-bit value cast down to a 32-bit integer? Are either the upper or lower 4B used? I don't see that specified in the UCX standard.

@yosefe
Copy link
Contributor

yosefe commented Mar 31, 2020

@devreal it works on BE as well, in this case the first 4B of tmp will be tye "upper" part. However, it doesn't really matter, we don't to integer case on tmp, we just do pointer cast, which does not care about endianness.

  • the input value for UCX atomic is "uint64_t" so it's int-casted to uint32 if needed.
  • the output value of the atomic is "void*" so it's a direct pointer to 32-bit result

@devreal
Copy link
Contributor Author

devreal commented Apr 3, 2020

I updated this PR to include the following:

  1. a non-blocking variant of opal_common_ucx_wpmem_cmpswp
  2. a safe and portable way to load and store 64-bit integer values to variable-size pointers without relying on memcpy
  3. support for compare-and-swap for types that are not supported by UCX atomics (from what I understand that is 8, 16, and 128-bit types)

@devreal
Copy link
Contributor Author

devreal commented Apr 7, 2020

@awlauria I think this PR is now in a good state again (after a long pause). It's awaiting a review now. Can we add this to the list of tracked issues for 5.0?

@awlauria
Copy link
Contributor

awlauria commented Apr 15, 2020

@devreal I will add it to the list. Thanks! However - it looks like there may be a commit or two here missing the signed-off by, can you fix that?

@devreal
Copy link
Contributor Author

devreal commented Apr 15, 2020

@awlauria Yes, I'm aware of that. I wasn't sure whether some of the commits should be squashed. I will fix that one unsigned today though.

@devreal devreal force-pushed the ucx-acc-singel-intrinsics branch from 516a5ba to 25440c4 Compare April 15, 2020 10:25
@devreal
Copy link
Contributor Author

devreal commented Apr 15, 2020

Fixed two bugs that slipped through during initial testing and squashed down some commits.

@artpol84
Copy link
Contributor

@janjust can you please review

@devreal devreal force-pushed the ucx-acc-singel-intrinsics branch from 25440c4 to 5d9b3da Compare June 17, 2020 08:53
@devreal devreal mentioned this pull request Jun 17, 2020
@jladd-mlnx
Copy link
Member

@janjust @artpol84 @devreal what is the status? Seems like it's a critical fix. Can we merge it?

@artpol84
Copy link
Contributor

I'd like to review one more time. I'll plan to go over it this week. I am sorry for not doing this before.

if (ret != OMPI_SUCCESS) {
return ret;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is 'this'? The whitespace? It's part of a longer section of code that was removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

devreal added 18 commits June 23, 2020 12:41
Signed-off-by: Joseph Schuchart <[email protected]>
… free memory if required

Signed-off-by: Joseph Schuchart <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>
@devreal devreal force-pushed the ucx-acc-singel-intrinsics branch from 5d9b3da to e3b417c Compare June 23, 2020 10:42
@artpol84
Copy link
Contributor

@devreal , Do we have everything in place now?

Copy link
Contributor

@artpol84 artpol84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much!

@artpol84 artpol84 merged commit 907f4e1 into open-mpi:master Jun 25, 2020
@devreal devreal deleted the ucx-acc-singel-intrinsics branch October 3, 2022 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants