Skip to content

Conversation

@wzamazon
Copy link
Contributor

This PR contains a series of patches that aims to fix

#8983

and

#7830

@wzamazon wzamazon requested review from bwbarrett and hjelmn October 25, 2021 16:56
@wzamazon
Copy link
Contributor Author

tested using ompi-tests/onesided

mca_btl_base_module_t *module);
OPAL_DECLSPEC int mca_btl_base_param_verify(mca_btl_base_module_t *module);

OPAL_DECLSPEC int mca_btl_base_am_atomic_32(int32_t *operand, opal_atomic_int32_t *addr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right way to add atomic support to btl/self. You shouldn't have to do anything to use the am interface with btl/self. The osc rdma component should call mca_btl_base_am_rdma_init() on the selected btl, which will fill in these functions. Now, we can (and probably should) have a discussion with Nathan about why the osc rdma component calls am_rdma_init instead of the btl itself, but either way, I don't think this commit is right.

free(description_str);

ompi_osc_rdma_btl_alternate_names = "sm,tcp";
ompi_osc_rdma_btl_alternate_names = "self,sm,tcp";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the right thing to do. There's no real reason to allow-list alternate BTLs, and it still means that we don't have a good solution for BTLs that aren't in one of the two allowlists. I can understand the first-level allow list (for now), but here we should just handle as many BTLs as it takes to reach everyone in the widow, and any BTL should work (that's the whole point of the AM wrapper).


/* find a btl/endpoint to use for this peer */
int ret = ompi_osc_rdma_peer_btl_endpoint (module, peer_id, &module_btl_index, &endpoint);
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret && !((module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conditional is almost certainly wrong, but I don't entirely understand Nathan's comments about the local leader optimization being required on uGNI. I'm pretty sure that this is to handle whatever condition causes us to have to worry about the local leader optimization, so we should explicitly get an ack from @hjelmn here and understand what's going on.

/* The BTL has active-message based atomics */
#define MCA_BTL_FLAGS_ATOMIC_AM_FOP 0x400000

/* This flag indicates whether the BTL's RDMA operation supports
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know most of this file isn't, but everything in this file should be in doxygen format (so start with /**)

module->state->num_fenced_peers = 0;
OPAL_THREAD_UNLOCK(&(module->lock));
ret = module->comm->c_coll->coll_barrier(module->comm, module->comm->c_coll->coll_barrier_module);
if (ret) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no implicit cast from int to bool

}

/* for each process in the group increment their number of fenced peers */
for (int i = 0 ; i < num_peers; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be consistent in spacing around ;

/* for each process in the group increment their number of fenced peers */
for (int i = 0 ; i < num_peers; ++i) {
ompi_osc_rdma_peer_t *peer = peers[i];
intptr_t target = (intptr_t) peer->state + offsetof (ompi_osc_rdma_state_t, num_fenced_peers);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no space between offseof and argument.

ret = module->comm->c_coll->coll_barrier(module->comm, module->comm->c_coll->coll_barrier_module);
if (module->btl_support_remote_completion) {
/* if all selected btls support remote completion, then all RMA operations have finished
* on remote side. A barrier is enough to complete the fence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indenting

* if that is the case, this function will not have been called
*/
assert (!ompi_osc_rdma_peer_local_state (peer));
ret = ompi_osc_rdma_lock_btl_op (module, peer, target, MCA_BTL_ATOMIC_ADD, 1, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is sufficient. BTL operations (including RDMA operations) are not implicitly ordered. An ordering flag must be set on every operation that must be ordered in order for this to be guaranteed to work. TCP doesn't take advantage of reordering (and I don't think OFI does either) so you likely won't see this in testing. but the verbs BTL did have an ordering problem and we haven't audited the other BTLs, so we should really follow the spec).

@wzamazon wzamazon force-pushed the oneside_fix_upstream branch from 5daf595 to 70285a8 Compare November 14, 2021 22:19
@ibm-ompi
Copy link

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/37f2b0805c6ce06764bb9c85e9542070

@ibm-ompi
Copy link

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e62bf6d29d28dbe7485e433e33d4a2ab

@ibm-ompi
Copy link

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/d6cd677f88a124039f111dd524099019

Currently, osc_rdma_query_alternate_btls() select alternate btls
from a pre-defined list of btls osc_rdma_alternate_btl_names.

The patch removed the restriction so any btl can be used as
alternate btl. However, for alternate btls, osc/rdma will disable
their native rdma and atomics implementations and use active message
RDMA/atomics for them. This is because the atomicity across btls
cannot be guranteed.

Signed-off-by: Wei Zhang <[email protected]>
@wzamazon wzamazon force-pushed the oneside_fix_upstream branch from 70285a8 to fcdb885 Compare November 15, 2021 00:45
Currently, ompi_osc_rdma_new_peer() can still proceed if no btl endpoint has
been found a new peer. This caused segfault when a process try
to communicated with the peer.

This patch change it to: if a btl endpoint cannot be found for
a peer, ompi_osc_rdma_new_peer() wil not proceed.

Signed-off-by: Wei Zhang <[email protected]>
This patch introduced a new flag MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION,
which is used to indicate whether a btl's RDMA operations support
remote completion.

3 btls support this feature: self, ofi and ugni, thus this patch
added the flags to them.

Signed-off-by: Wei Zhang <[email protected]>
this patch introduce 3 types of btl ordering requirements:

    MCA_BTL_IN_ORDER_RDMA_ATOMICS,
    MCA_BTL_IN_ORDER_SEND,
    MCA_BTL_IN_ORDER_ALL.

Signed-off-by: Wei Zhang <[email protected]>
Currently, active message RDMA/atomics always use MCA_BTL_NO_ORDER
on the btl, even when caller specified ordering requirement. This
patch makes active message RDMA/atomics to honor caller's ordering
requirement.

Signed-off-by: Wei Zhang <[email protected]>
This patch fix the condition of using local leader and
cpu atomics.

The local leader optimization is:

  on each node, a process was designated as the local leader,
  who setup shared memory, and other processes on the same
  node would map their states to local leader's shared memory.

  When a process try to update a peer process's state, the
  process will do that through atomic actions on local leader's
  memory. The peer's state is then updated through shard memory.

The cpu atomics optimizaton is:

  for processes on the same node, use cpu atomics to update peer's
  state (instead of using the selected btl).

Both "local leader" and "cpu atomics" are using different
channels to transfer data and to update peer's state.

This kind of optimizations requires selected BTLs to support
remote completion. Otherwise, there may be data corruption,
because peer's state can be updated before the RDMA operation
is completed on the peer.

However, currently, local leader is used unconditionally.

"cpu atomics" is used when btl support mixed usage of cpu
and NIC atomics.

Both are used wrongly and will cause data corruption.

This patch address the issue by only use "local leader" and
"cpu atomics" if all seleted BTLs support remote completion.

When "local leader" is not used, each process need to have its peers'
state, for which this patch introduced a function gather_peer_state().

This patch then sets peer's state pointer using gathered information,
and use the same endpoint to update state and transfer data.

Signed-off-by: Wei Zhang <[email protected]>
Currently, fence is implemented by two steps:

First, waiting for RDMA operiatons to complete locally.
Second, call coll->barrier on the communicator.

This is correct only if selected BTL support remote completion.
Otherwise, it can happen that when coll->barrier() finished, the remote
side of the RDMA operation has not completed yet.

This patch implemented a different barrier mechanism, which is used
when any of the selected BTL does not support remote completion.

In which case, each process will post an atomic operation to every
peer to increase a counter on the peer through the selected BTL endpoint.
Though wait for its own counter to reach number of peers. This ensures
all previous RDMA operations have completed.

Signed-off-by: Wei Zhang <[email protected]>
@awlauria
Copy link
Contributor

bot:aws:retest

java failure.

@wzamazon
Copy link
Contributor Author

Replaced by 3 PRs:

#9694
#9695
#9696

@wzamazon wzamazon closed this Nov 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants