Skip to content

pml/ob1: match callback will now queue wrong sequence frag and return. #4419

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 11, 2017

Conversation

thananon
Copy link
Member

@thananon thananon commented Oct 29, 2017

In multithreaded case, it is expensive to release the lock, call the slow match
and retake the lock again just to queue the frag. This patch will eliminate number of
lock taken by queueing the frag right away and return.

Also, when the message arrive out of order, we queue it up in an ordered list to minimize the searching time.

@bosilca please review.

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to keep the PERUSE API functional please add before the append_frag_to_list

 /**
  * We generate the MSG_ARRIVED event as soon as the PML is aware of a matching
  * fragment arrival. Independing if it is received on the correct order or not.
  * This will allow the tools to figure out if the messages are not received in the
  * correct order (if multiple network interfaces).
  */
PERUSE_TRACE_MSG_EVENT(PERUSE_COMM_MSG_ARRIVED, comm_ptr,
                       hdr->hdr_src, hdr->hdr_tag, PERUSE_RECV);

In multithreaded case, it is expensive to release the lock, call the slow match
and retake the lock again just to queue the frag. This patch will eliminate number of
lock taken by queueing the frag right away and return.

Signed-off-by: Thananon Patinyasakdikul <[email protected]>
@thananon
Copy link
Member Author

adjusted the commit per @bosilca comment.

@thananon
Copy link
Member Author

thananon commented Nov 7, 2017

Added a patch from @bosilca to put incoming out of sequence fragments into an ordered list to minimize the searching time.

@ompiteam-bot
Copy link

All Tests Passed!

Rework the logic to handle the out-of-sequence fragments on the receiver
side. A large number of OOS messages are still arriving even in single
threaded scenarios.

Signed-off-by: George Bosilca <[email protected]>
@bosilca bosilca merged commit 3f43e2c into open-mpi:master Nov 11, 2017
@jsquyres
Copy link
Member

@thananon @bosilca This PR resulted in 2 serious-looking Coverity warnings:

*** CID 1420944:  Program hangs  (LOCK)
/ompi/mca/pml/ob1/pml_ob1_recvfrag.c: 357 in mca_pml_ob1_recv_frag_callback_match()
351                                                  hdr->hdr_common.hdr_type, frag);
352             } else {
353                 OB1_MATCHING_UNLOCK(&comm->matching_lock);
354             }
355         }
356     
>>   CID 1420944:  Program hangs  (LOCK)
>>   Returning without unlocking "comm->matching_lock.m_lock_pthread".
357         return;
358     }
359     
360     
361     void mca_pml_ob1_recv_frag_callback_rndv(mca_btl_base_module_t* btl,
362                                              mca_btl_base_tag_t tag,
*** CID 1420943:  Program hangs  (LOCK)
/ompi/mca/pml/ob1/pml_ob1_recvfrag.c: 788 in mca_pml_ob1_recv_frag_match()
782                                         num_segments, NULL);
783             OB1_MATCHING_UNLOCK(&comm->matching_lock);
784             return OMPI_SUCCESS;
785         }
786     
787         /* mca_pml_ob1_recv_frag_match_proc() will release the lock. */
>>   CID 1420943:  Program hangs  (LOCK)
>>   Returning without unlocking "comm->matching_lock.m_lock_pthread".
788         return mca_pml_ob1_recv_frag_match_proc(btl, comm_ptr, proc, hdr,
789                                                 segments, num_segments,
790                                                 type, NULL);
791     }
792     
793     

@bosilca
Copy link
Member

bosilca commented Nov 11, 2017

Thanks @jsquyres , both can be safely ignored, the design is meant that way. If you look at the comment right next to CID 1420943, there is a comment about this.

@bosilca bosilca deleted the pr_newob1 branch November 11, 2017 13:33
@jsquyres
Copy link
Member

Cool; thanks @bosilca

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants