
Fix a rare pthreads main thread deadlock that worsened in 2.0.2 #12318

Merged: 28 commits into master, Sep 24, 2020

Conversation

@kripken kripken (Member) commented Sep 23, 2020

The deadlock regression appeared with #12055, which was released in 2.0.2. However, the
issue is deeper: that change only made a pre-existing, rarer deadlock more frequent.

Background: our pthreads implementation has some extra complexity because of
how the main thread works on the web; specifically, we can't block there, so we
can't do Atomics.wait there. Instead we do what is effectively a busy-wait, and to make
that as painless as possible we special-case the main thread. There is a memory address,
the "main thread futex address", which holds the futex the main thread is waiting
on, or 0 if none. When a thread runs futex_wake, it first checks that address to see
whether the main thread is waiting on that futex, and if it is, we notify the main thread
before any workers. In effect this gives the main thread priority. And by using that
address, the main thread can busy-wait on it, simply polling until a pthread sets it to 0,
which indicates the wait can stop.
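As a rough illustration of that mechanism, here is a minimal C11 model (the names
main_thread_futex_addr, main_thread_wait and my_futex_wake are made up for this
sketch; the real implementation is split between C and JS and differs in detail):

#include <stdatomic.h>
#include <stdint.h>

// Futex address the main thread is currently waiting on, or 0 if none.
static _Atomic uintptr_t main_thread_futex_addr;

// Main thread: we cannot block on the web, so busy-wait, publishing the
// address we wait on so that a wake from another thread can find us.
static void main_thread_wait(_Atomic int *addr, int expected) {
  if (atomic_load(addr) != expected) return;             // nothing to wait for
  atomic_store(&main_thread_futex_addr, (uintptr_t)addr);
  while (atomic_load(&main_thread_futex_addr) == (uintptr_t)addr) {
    // poll; a futex_wake on another thread clears the address to stop us
  }
}

// Any other thread: wake waiters on addr, giving the main thread priority.
static int my_futex_wake(_Atomic int *addr, int count) {
  int woken = 0;
  if (count > 0 && atomic_load(&main_thread_futex_addr) == (uintptr_t)addr) {
    atomic_store(&main_thread_futex_addr, 0);            // ends the busy-wait
    woken++;
    count--;
  }
  // ... then wake up to count blocked workers, as a normal futex would ...
  return woken;
}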

This broke in #12055 because it erroneously ended up setting the main thread
futex address only on the main thread. We did not get that address to workers.
Instead, they got a JS undefined. So they would not wake up the main thread.

It is actually a lot more complicated than that, because futex_wait can recurse:
while it runs the busy-wait loop, it also runs the event loop, so that it does not starve
pthreads and cause deadlocks - that is, it checks whether there are incoming proxied
calls, and runs them. To run the event loop we must take a mutex. With the
right timing we can hit the slow path both in the original mutex (leading to
a call to futex_wait instead of a quick atomic operation) and in the proxying
mutex, and then we get this recursion. The old code did not handle this
correctly, in two ways.
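(A rough sketch of how that nesting can happen, with illustrative names and a plain
pthreads mutex standing in for the proxying mutex:)

#include <pthread.h>

static pthread_mutex_t proxy_queue_mutex = PTHREAD_MUTEX_INITIALIZER;

// Called from the main thread's busy-wait loop so workers are not starved.
static void run_proxied_calls(void) {
  // If the queue mutex is contended we hit the lock's slow path, which ends
  // up in futex_wait on the mutex's own futex - a nested wait on the main
  // thread, while it is still logically waiting on the original futex.
  pthread_mutex_lock(&proxy_queue_mutex);
  /* ... execute any queued proxied calls ... */
  pthread_mutex_unlock(&proxy_queue_mutex);
}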

First, it did not clear the main thread futex address when futex_wait exited
due to a timeout. So it looked like we were still waiting. A later futex_wake
on that futex would unset the main thread futex address, counting that as
one of the waiters it woke up - but nothing was actually woken there, so it could
end up waking no one at all (if it was asked to wake just one waiter).

Second, futex_wake did not check if it was on the main thread. The main
thread should never wake itself up, but it did, making it more confusing
(I'm not sure if this caused deadlocks by itself, though).
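(A check like the following prevents that, sketched against the model above; the name
my_futex_wake_fixed and the on_main_thread flag are illustrative, not the actual code:)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

extern _Atomic uintptr_t main_thread_futex_addr;   // as in the earlier sketch

static int my_futex_wake_fixed(_Atomic int *addr, int count, bool on_main_thread) {
  int woken = 0;
  // The main thread must never "wake itself"; only workers take this path.
  if (!on_main_thread && count > 0 &&
      atomic_load(&main_thread_futex_addr) == (uintptr_t)addr) {
    atomic_store(&main_thread_futex_addr, 0);
    woken++;
    count--;
  }
  // ... wake up to count blocked workers ...
  return woken;
}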

A possible third issue is that when recursing in that way we could have
the main thread in effect waiting on two futexes, A and then B. If a
worker wants to wake up A while we are in the recursed call that set B, it
does not see A in the main thread futex address - there is just one such
address, and B has trampled it. The old code would actually just
stop waiting in this case, since it looped only while the address was still
the one it set: after the nested wait wrote B, when we returned to the
first loop we'd see either 0 (if the wait for B ended cleanly) or B (if we
timed out and, due to the bug mentioned earlier, erroneously left B as the
main thread futex address). Both are not A, so we'd stop. That wasn't
correct, though - it's possible we should still be waiting for A.

To fix all this, this PR does two things:

  1. Allocate the main thread futex address in C, so that it
    is accessible in workers properly.
  2. Change how futex_wait works: it sets the main thread futex
    address and unsets it before exiting. It also unsets it before
    doing a nested call, so that they cannot interfere with each other.

This PR also adds a lot of assertions in that code, which would have caught
the previous bug, and which I verified are exercised by our test suite, so we have
decent coverage.

The one tricky thing here is handling the case in the paragraph from
earlier with futexes A and B that we wait for in a nested manner. When
nested, we wait until B is done, then return to the loop for A. How can
we tell if A is done or not? The pthread waking it up may have already
had its chance to wake it. The solution taken here is to read the futex's
memory value, exactly like when starting in futex_wait. That is,
futex_wait checks if an address has a value, and only if it does then
it waits. I believe it is valid to check it later as well, since the result
is the same as if the main thread were busy with other stuff, then
checked the value later, after the other thread did some work. See
the comment in the source for more details.
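Putting the pieces together, the corrected main-thread wait looks roughly like the
following sketch (against the model above; the names are illustrative and a spin count
stands in for the real millisecond timeout):

#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uintptr_t main_thread_futex_addr;   // as in the earlier sketch
extern void run_proxied_calls(void);               // may nest another wait

static int main_thread_futex_wait(_Atomic int *addr, int val, int max_spins) {
  // Standard futex semantics: only wait if the value still matches.
  if (atomic_load(addr) != val) return 0;
  atomic_store(&main_thread_futex_addr, (uintptr_t)addr);
  for (int i = 0; i < max_spins; i++) {
    if (atomic_load(&main_thread_futex_addr) != (uintptr_t)addr) {
      return 0;                                     // a futex_wake woke us
    }
    // Clear the address before servicing proxied calls, so that a nested
    // wait on some other futex cannot trample it.
    atomic_store(&main_thread_futex_addr, 0);
    run_proxied_calls();
    // Back from the (possibly nested) call: we may have missed a wake while
    // the address was cleared, so re-check the futex value itself, exactly
    // as on entry, to decide whether we should still be waiting.
    if (atomic_load(addr) != val) return 0;
    atomic_store(&main_thread_futex_addr, (uintptr_t)addr);
  }
  // Timed out: always clear the address, so a later futex_wake does not
  // think the main thread is still waiting here.
  atomic_store(&main_thread_futex_addr, 0);
  return -1;                                        // timeout
}

The key invariants in this sketch are that the address is never left set once the wait
returns (including on timeout), and that re-reading the futex value replaces the old
"is the address still the one I set" test.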

This fixes a test that was wrong in how it handled a race condition,
and adds a disabled test that was very useful for manual debugging
here. That test basically hammers on all the proxying and locking mechanisms
at once. It's not a small test, but I couldn't find a way to simplify it, and
it's only for manual debugging.

@sbc100 I did a KEEPALIVE on a global in C, and it seems to just work, leading to
something like Module['_main_thread_futex'] = 4060; in the output, which is nice,
but I was a little surprised it works - is it ok to depend on that?

@kripken kripken requested review from juj and tlively September 23, 2020 20:16
@tlively tlively (Member) left a comment

Nice detective work and nice explanation!

// Stores the memory address that the main thread is waiting on, if any. If
// the main thread is waiting, we wake it up before waking up any workers.
EMSCRIPTEN_KEEPALIVE
int main_thread_futex;
A reviewer (Member) commented:

Why int if this stores a memory address? Or is it the address of this data that is useful?

@kripken kripken (Member, Author) replied:

Oh you're right, this should be a void*, fixed.

While futexes in general are always 32 bits, this is not an actual futex; it's the address of the futex (which some day soon may be 64 bits).

@kripken kripken mentioned this pull request Sep 24, 2020
@kripken kripken commented Sep 24, 2020

@juj please review this when you can. Landing now as the testcase shows this resolves an issue, and it may fix some other user reports we are getting.

@kripken kripken commented Sep 24, 2020

(also @sbc100: see the question at the end of the first comment; again, no rush)

@kripken kripken merged commit 0014175 into master Sep 24, 2020
@kripken kripken deleted the pthread5 branch September 24, 2020 17:59
@belraquib belraquib (Contributor) left a comment

I believe there's a race here in the initShared() call inside initWorker(). The worker JS can initialize before the module is fully loaded (because the worker JS is loaded in parallel with the main JS), and at that point we don't yet have the main thread futex. It looks as if initShared is idempotent and main_thread_futex shouldn't change - could you always call it as the first line of the command == 'run' handler in worker.js?

@kripken kripken commented Oct 2, 2020

@belraquib Hmm, I think it's safe already, but I may be wrong. We call initWorker from postamble, and that is part of the main JS file. So the main JS file must have been loaded. And that file will contain var _main_thread_futex = ..constant.. (we simplified it to that after this landed) set to the proper constant. So I don't think there's a way to reach initWorker before the futex address is set?

@belraquib belraquib (Contributor) commented:

Looking at the generated code for the worker, I see the call to
PThread.initWorker(); -- line 6376
is being done before setting the constant
Module["_main_thread_futex"] = 29886168; -- line 6581

Could this be an ordering issue in the jsifier?

@belraquib belraquib (Contributor) commented:

Ah, closing this, #12308 fixes it (and I only had this change patched into 2.0.2).

sbc100 added a commit that referenced this pull request Nov 25, 2020
This variable is initialized from `_main_thread_futex` which is a data
address exported from native code.  Rather than copy it we can simply
use that variable directly.

I think it made sense to have a separate JS variable in the past when
this address was allocated at runtime, but since #12318 landed it
is no longer needed.
sbc100 added a commit that referenced this pull request Nov 26, 2020
This variable is initialized from `_main_thread_futex` which is a data
address exported from native code.  Rather than copy it we can simply
use that variable directly.

I think it made sense to have a separate JS variable in the past when
this address was allocated at runtime, but since #12318 landed it
is no longer needed.

Also rename the exported variable to avoid conflict with user symbols.