osc/pt2pt: various threading fixes #1339

Merged 2 commits on Feb 2, 2016

Conversation

hjelmn
Member

@hjelmn hjelmn commented Feb 2, 2016

This commit fixes several bugs identified by a new multi-threaded RMA
benchmarking suite. The following bugs have been identified and fixed:

  • The code that signaled the actual start of an access epoch changed
    the eager_send_active flag on a synchronization object without
    holding the object's lock. This could cause another thread waiting
    on eager sends to block indefinitely because the entirety of
    ompi_osc_pt2pt_sync_expected could execute between the check of
    eager_send_active and the condition wait in
    ompi_osc_pt2pt_sync_wait.
  • The bookkeeping of fragments could become inconsistent when
    performing long put/accumulate operations from different threads.
    This was caused by the fragment flush code at the end of both put
    and accumulate. This code was put in place to avoid sending a large
    number of unexpected messages to a peer. To fix the bookkeeping
    issue we now 1) wait for eager sends to be active before starting
    any large isends, and 2) keep track of the number of large isends
    associated with a fragment. If the number of large isends reaches
    32, the active fragment is flushed.
  • The large receive/send tag counters are now updated with atomics.
    This prevents duplicate tags from being used. The tag space has
    also been updated to use all 16 bits available to it.
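
The lost wakeup described in the first item can be illustrated with a small pthreads sketch. This is not the osc/pt2pt code: `sync_sketch_t` and the `*_sketch` functions are hypothetical stand-ins, and only the `eager_send_active` flag name is taken from the description above. The point it shows is that the flag is both checked and changed under the same lock, so the setter cannot run in the gap between the waiter's check and its condition wait.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the synchronization object (not the Open MPI struct). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            eager_send_active;
} sync_sketch_t;

/* Waiter: the flag check and the condition wait happen under one lock,
 * so a concurrent setter cannot slip in between them. */
static void sync_wait_sketch(sync_sketch_t *sync)
{
    pthread_mutex_lock(&sync->lock);
    while (!sync->eager_send_active) {
        pthread_cond_wait(&sync->cond, &sync->lock);
    }
    pthread_mutex_unlock(&sync->lock);
}

/* Setter: the flag is changed while holding the lock, then waiters are woken. */
static void sync_start_sketch(sync_sketch_t *sync)
{
    pthread_mutex_lock(&sync->lock);
    sync->eager_send_active = true;
    pthread_cond_broadcast(&sync->cond);
    pthread_mutex_unlock(&sync->lock);
}

static void *waiter_sketch(void *arg)
{
    sync_wait_sketch((sync_sketch_t *) arg);
    return NULL;
}

/* Starts one waiter, activates eager sends, and returns 0 once the waiter is released. */
int demo_sync_sketch(void)
{
    sync_sketch_t sync = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false };
    pthread_t waiter;
    int rc;

    rc = pthread_create(&waiter, NULL, waiter_sketch, &sync);
    assert(rc == 0);
    sync_start_sketch(&sync);
    rc = pthread_join(waiter, NULL);
    assert(rc == 0);
    return sync.eager_send_active ? 0 : 1;
}
```

If the setter wrote the flag without taking the lock, the waiter could read `false`, the setter could then set the flag and broadcast, and the waiter would enter `pthread_cond_wait` after the only wakeup had already been delivered.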

These changes should also fix #1299.
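
The tag counter item above can be sketched with C11 atomics. This is an illustration under assumptions, not the osc/pt2pt implementation: the counter, the `0xffff` mask used to wrap into a 16-bit range, and all `*_sketch` names are hypothetical. It shows that an atomic fetch-and-add hands each thread a distinct ticket, so two threads never receive the same tag while fewer than 65536 tags are outstanding.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_THREADS         4
#define SKETCH_TAGS_PER_THREAD 1000

/* Hypothetical shared counter and a table recording which tags were handed out. */
static atomic_uint  tag_counter_sketch;
static atomic_uchar tag_seen_sketch[1 << 16];

/* Atomically reserve the next tag; the mask keeps it within 16 bits (an assumption). */
static uint16_t next_tag_sketch(void)
{
    unsigned int ticket = atomic_fetch_add(&tag_counter_sketch, 1u);
    return (uint16_t) (ticket & 0xffffu);
}

/* Each worker reserves tags and counts any tag that was already marked as seen. */
static void *tag_worker_sketch(void *arg)
{
    atomic_int *duplicates = arg;

    for (int i = 0; i < SKETCH_TAGS_PER_THREAD; ++i) {
        uint16_t tag = next_tag_sketch();
        if (atomic_exchange(&tag_seen_sketch[tag], 1) != 0) {
            atomic_fetch_add(duplicates, 1);
        }
    }
    return NULL;
}

/* Runs the workers concurrently and returns how many duplicate tags were observed. */
int count_duplicate_tags_sketch(void)
{
    pthread_t threads[SKETCH_THREADS];
    atomic_int duplicates;
    int rc;

    atomic_init(&duplicates, 0);
    for (int t = 0; t < SKETCH_THREADS; ++t) {
        rc = pthread_create(&threads[t], NULL, tag_worker_sketch, &duplicates);
        assert(rc == 0);
    }
    for (int t = 0; t < SKETCH_THREADS; ++t) {
        rc = pthread_join(threads[t], NULL);
        assert(rc == 0);
    }
    return atomic_load(&duplicates);
}
```

With a plain `counter++` in place of `atomic_fetch_add`, two threads could read the same value before either writes it back, and both would use the same tag.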

Signed-off-by: Nathan Hjelm [email protected]

@hjelmn
Member Author

hjelmn commented Feb 2, 2016

@ggouaillardet Can you verify #1299 is fixed by this commit?

@hjelmn
Member Author

hjelmn commented Feb 2, 2016

Missed one change in this relevant to #1299. Will push update in a bit.

@hjelmn hjelmn added this to the v2.0.0 milestone Feb 2, 2016
@hjelmn hjelmn self-assigned this Feb 2, 2016
@hjelmn
Member Author

hjelmn commented Feb 2, 2016

@jsquyres Mellanox failure shows usnic issue.

@jsquyres
Member

jsquyres commented Feb 2, 2016

I see the problem. Fixing...

@jsquyres
Member

jsquyres commented Feb 2, 2016

@hjelmn master has been fixed. Rebase and you should be ok. Sorry about that.

@hjelmn
Member Author

hjelmn commented Feb 2, 2016

:bot:retest:

hjelmn added a commit that referenced this pull request Feb 2, 2016
osc/pt2pt: various threading fixes
@hjelmn hjelmn merged commit 615b27c into open-mpi:master Feb 2, 2016
Development

Successfully merging this pull request may close these issues.

osc/pt2pt hang in master
2 participants