Fix a race condition in pthread call targets not waking up. #12244

Closed
wants to merge 2 commits into from

Conversation

kripken
Member

@kripken kripken commented Sep 17, 2020

A thread may happen to have finished handling its events right before we add
another one. If it is then idle, it may never handle the new event. To avoid that,
when we wait on a call, notify the target thread so that it wakes up.

To allow that, track the target thread of each proxied call, so we know who to
notify.
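
The idea can be sketched with plain POSIX threads. This is only an illustration under stated assumptions, not the code in this PR: the `worker` and `call` types, `call_enqueue`, `call_wait`, `add_one`, and `run_demo` are all hypothetical names, and condition variables stand in for whatever wake mechanism the runtime actually uses. Each call records its target, and the waiter uses that record to notify the target before blocking.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* All names here are hypothetical; this is not Emscripten's code. */
typedef struct worker {
  pthread_mutex_t lock;  /* guards the queue and every call's done flag */
  pthread_cond_t wake;   /* signaled to wake an idle worker */
  pthread_cond_t done;   /* broadcast when a call completes */
  struct call *head;     /* pending calls (LIFO, for brevity) */
  bool quit;
} worker;

typedef struct call {
  worker *target;        /* the thread that must run this call */
  int (*fn)(int);
  int arg, result;
  bool done;
  struct call *next;
} call;

static void *worker_main(void *p) {
  worker *w = p;
  pthread_mutex_lock(&w->lock);
  for (;;) {
    while (w->head == NULL && !w->quit)
      pthread_cond_wait(&w->wake, &w->lock);  /* idle until notified */
    if (w->head == NULL) break;               /* quit requested, queue empty */
    call *c = w->head;
    w->head = c->next;
    pthread_mutex_unlock(&w->lock);
    int r = c->fn(c->arg);                    /* run the call outside the lock */
    pthread_mutex_lock(&w->lock);
    c->result = r;
    c->done = true;
    pthread_cond_broadcast(&w->done);
  }
  pthread_mutex_unlock(&w->lock);
  return NULL;
}

static void call_enqueue(call *c) {
  worker *t = c->target;
  pthread_mutex_lock(&t->lock);
  c->next = t->head;
  t->head = c;
  pthread_cond_signal(&t->wake);
  pthread_mutex_unlock(&t->lock);
}

static int call_wait(call *c) {
  worker *t = c->target;          /* known because the call records it */
  pthread_mutex_lock(&t->lock);
  pthread_cond_signal(&t->wake);  /* nudge the target in case it is idle */
  while (!c->done)
    pthread_cond_wait(&t->done, &t->lock);
  int r = c->result;
  pthread_mutex_unlock(&t->lock);
  return r;
}

static int add_one(int x) { return x + 1; }

/* Proxies one call to a worker thread and returns its result. */
int run_demo(void) {
  worker w = { .head = NULL, .quit = false };
  pthread_mutex_init(&w.lock, NULL);
  pthread_cond_init(&w.wake, NULL);
  pthread_cond_init(&w.done, NULL);
  pthread_t tid;
  if (pthread_create(&tid, NULL, worker_main, &w) != 0) return -1;

  call c = { .target = &w, .fn = add_one, .arg = 41 };
  call_enqueue(&c);
  int result = call_wait(&c);

  pthread_mutex_lock(&w.lock);
  w.quit = true;
  pthread_cond_signal(&w.wake);
  pthread_mutex_unlock(&w.lock);
  pthread_join(tid, NULL);
  pthread_cond_destroy(&w.done);
  pthread_cond_destroy(&w.wake);
  pthread_mutex_destroy(&w.lock);
  return result;
}
```

With condition variables the extra signal in `call_wait` is redundant, because the worker checks the queue and goes to sleep atomically under the lock. It is kept only to show where the notification described above would sit.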

@kripken kripken requested a review from juj September 17, 2020 01:13
kripken added a commit that referenced this pull request Sep 17, 2020
Comment on lines +426 to +427
// which in a race condition may have finished handling its event queue
// just after we added our event. (We could also notify it once right
Member

How is this possible? It looks like all enqueue and dequeue operations are protected by the call_queue_lock. So the event must have been added either after the target thread finished handling its event queue and released the lock, or before the target thread finished handling its event queue. In the latter case the event would be handled, because the enqueue would synchronize with the dequeue.

Member Author

I'm not entirely sure. But we get clear deadlocks without this patch, where the main thread needs to be woken up, which this patch fixes (see #12258).

It may be that there is something not entirely atomic about our mutexes, in which case I'm not sure what the best debugging approach is (maybe we need to debug the browser itself?).

Member

> It may be that there is something not entirely atomic about our mutexes

Yikes! Perhaps we could demonstrate such an issue with our mutexes in a smaller, controlled experiment? A test of a dining philosophers solution could be a good simple stress test for deadlock. It would also be good to narrow down whether the lock misbehaves on the main thread or on non-main threads.
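
Such a stress test might look roughly like the following. This is a generic pthreads sketch, not the test from #12258, and `chopstick`, `philosopher`, `dine`, and `run_philosophers` are hypothetical names. Each philosopher takes the lower-numbered chopstick first, so a correct mutex cannot deadlock here. Each chopstick also carries a plain counter that is only touched while its lock is held, so a mutex that fails to exclude would tend to show up as a wrong total, and a lost wakeup as a hang.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names throughout; a sketch, not an Emscripten test. */
typedef struct {
  pthread_mutex_t lock;
  long uses;  /* plain counter, only touched while holding lock */
} chopstick;

typedef struct {
  int id, n, rounds;
  chopstick *sticks;
} philosopher;

static void *dine(void *p) {
  philosopher *ph = p;
  int a = ph->id, b = (ph->id + 1) % ph->n;
  /* Always lock the lower index first to rule out circular waits. */
  int first = a < b ? a : b;
  int second = a < b ? b : a;
  for (int i = 0; i < ph->rounds; i++) {
    pthread_mutex_lock(&ph->sticks[first].lock);
    pthread_mutex_lock(&ph->sticks[second].lock);
    ph->sticks[first].uses++;
    ph->sticks[second].uses++;
    pthread_mutex_unlock(&ph->sticks[second].lock);
    pthread_mutex_unlock(&ph->sticks[first].lock);
  }
  return NULL;
}

/* Returns the sum of all chopstick counters, or -1 on bad input or
   on a failure to allocate or start threads. */
long run_philosophers(int n, int rounds) {
  if (n < 2 || rounds < 0) return -1;
  chopstick *sticks = calloc((size_t)n, sizeof *sticks);
  philosopher *ph = calloc((size_t)n, sizeof *ph);
  pthread_t *tids = calloc((size_t)n, sizeof *tids);
  long total = -1;
  int started = 0;
  if (sticks && ph && tids) {
    for (int i = 0; i < n; i++) pthread_mutex_init(&sticks[i].lock, NULL);
    for (; started < n; started++) {
      ph[started] = (philosopher){ started, n, rounds, sticks };
      if (pthread_create(&tids[started], NULL, dine, &ph[started]) != 0) break;
    }
    for (int i = 0; i < started; i++) pthread_join(tids[i], NULL);
    if (started == n) {
      total = 0;
      for (int i = 0; i < n; i++) total += sticks[i].uses;
    }
    for (int i = 0; i < n; i++) pthread_mutex_destroy(&sticks[i].lock);
  }
  free(sticks);
  free(ph);
  free(tids);
  return total;
}
```

Every meal increments two counters, so with working mutexes the return value should equal `2 * n * rounds`. Running the same function from the main thread and from a non-main thread would be one way to narrow down where a lock misbehaves.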

Member Author

#12258 has the smallest controlled experiment I can get so far. But it still depends on allocation, proxying, and mutexes...

@kripken
Member Author

kripken commented Sep 22, 2020

I have found the actual cause here, and will open a refactoring PR and then a fix PR shortly.

@kripken kripken closed this Sep 22, 2020
@kripken kripken deleted the pthread2 branch September 22, 2020 23:26
2 participants