Skip to content

Ensure Node worker threads are exited gracefully #12963

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 18, 2021

Conversation

kleisauke
Copy link
Collaborator

@kleisauke kleisauke commented Dec 4, 2020

Resolves (partially): #9763.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

// cp suffix in the function name means "cancellation point", so this wait can be cancelled
// by the users unless current threads cancelability is set to PTHREAD_CANCEL_DISABLE
// which may be either done by the user of __timedwait() function.
if (is_main_runtime_thread || pthread_self()->canceldisable != PTHREAD_CANCEL_DISABLE) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does doesn't makes sense to me. Only the main thread should need to busy wait right? The main runtime thread, if running off the main thread should be able to block.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, only the main browser thread should busy wait. Somehow when changing it to emscripten_is_main_browser_thread() makes some posix tests stuck on Node (i.e. they never return anything). I still need to investigate why, hence the added TODO item:
https://github.com/kleisauke/emscripten/blob/f4c49eb7bdb36eaebefcca9d5a41c387d415d5ca/system/lib/libc/musl/src/thread/__timedwait.c#L44-L45

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've dropped this commit for now, will investigate it later (I'm not sure if emscripten_is_main_browser_thread() should return true on Node).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to do busy waiting on side threads in __timedwait_cp function. Because this function may be called from function which is cancellation point (see https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html point "2.9.5 Thread Cancellation").

As we don't have signals we need to check if request to cancel thread hasn't occurred during call to functions like pthread_cond_wait .

@@ -8062,6 +8062,7 @@ def test_fpic_static(self):
self.do_run_in_out_file_test('tests', 'core', 'test_hello_world.c')

@node_pthreads
@unittest.skip("this test can not be run without a pool, see test_pthread_create_pool")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we do run this today, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of tests/core/pthread/create.cpp is that it spawns 2 detached threads where the first thread blocks until the second one is spawned. Without setting -sPTHREAD_POOL_SIZE=2 this can not be done properly, I think.

This test case was previously successful due to the process.exit(status); call. For example, testing this on master:

diff --git a/tests/core/pthread/create.cpp b/tests/core/pthread/create.cpp
index 1111111..2222222 100644
--- a/tests/core/pthread/create.cpp
+++ b/tests/core/pthread/create.cpp
@@ -23,7 +23,12 @@ void *ThreadMain(void *arg) {
     int last = sum.load();
     while (sum.load() == last) {}
   }
-  pthread_exit((void*)TOTAL);
+  EM_ASM({
+    PThread.threadExit($0);
+    //process.exit($0);
+    throw 'unwind';
+  }, TOTAL);
+  return nullptr;
 }
 
 pthread_t thread[NUM_THREADS];

Will also deadlock, uncommenting process.exit($0); makes the test case "pass".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So why not add PTHREAD_POOL_SIZE to this test rather than disabling it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could add PTHREAD_POOL_SIZE, but then the test case is essentially the same as test_pthread_create_pool, see:

emscripten/tests/test_core.py

Lines 8066 to 8071 in 62007d1

@node_pthreads
def test_pthread_create_pool(self):
# with a pool, we can synchronously depend on workers being available
self.set_setting('PTHREAD_POOL_SIZE', '2')
self.emcc_args += ['-DALLOW_SYNC']
self.do_run_in_out_file_test('tests', 'core', 'pthread', 'create.cpp')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It seems like the code is supposed to honor ALLOW_SYNC and when that is not defined it should not depend on synchronous thread creation. Maybe we need to fix the code?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You say "is that it spawns 2 detached threads where the first thread blocks until the second one is spawned"... I'm a little confused. I don't see any blocking happening the create.cpp code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought it was blocking on:

// wait for a change, so we see interleaved processing.
int last = sum.load();
while (sum.load() == last) {}

fwiw, here's a native program: https://gist.github.com/kleisauke/232be709d82b6aaecfa0f28a8be27fe3

(with that program I see that it runs 2 threads next to each other, I suppose that isn't supported unless you set PTHREAD_POOL_SIZE to a number equal to or greater than 2(?)).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats not quite true. Without PTHREAD_POOL_SIZE emscripten needs to start new workers for each thread, which means you need to return to event loop before the thread will actually start. This means what you can't do is not return from main.. this code in create.cpp as two modes: if ALLOW_SYNC is set it will busy wait in main (and not return), if ALLOW_SYNC is not set it will run an event loop by calling emscripten_set_main_loop (which returns from main).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another way of putting: the thread pool is only needed if you need to do synchronous thread creation. At least IIUC.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you're right. This test case isn't blocking, sorry for the noise. It looks like the Node workers are not teared down when the main loop is canceled. I could fix this with explicitly calling PThread.terminateAllThreads(); on exit, see for example: kleisauke@6141e7e.

I think this behavior is being investigated within issue #12801.

@kleisauke
Copy link
Collaborator Author

kleisauke commented Dec 5, 2020

Some flaky tests I noticed:

Test case Commit / CircleCI log Info
wasm2js1.test_pthread_c11_threads 2831f73 / https://circleci.com/gh/emscripten-core/emscripten/378948
4376982 / https://circleci.com/gh/emscripten-core/emscripten/379155
c753de8 / https://circleci.com/gh/emscripten-core/emscripten/379205
Perhaps the stdout stream has not been flushed properly(?)
posixtest_browser.test_pthread_create_1_1 2831f73 / https://circleci.com/gh/emscripten-core/emscripten/378950
f388f6f / https://circleci.com/gh/emscripten-core/emscripten/379095
Happens only on Firefox so far, unable to reproduce locally.
posixtest.test_pthread_barrier_wait_2_1 4376982 / https://circleci.com/gh/emscripten-core/emscripten/379149
c753de8 / https://circleci.com/gh/emscripten-core/emscripten/379211
Unable to reproduce locally with Node 15.3.0.

kleisauke added a commit to kleisauke/emscripten that referenced this pull request Dec 10, 2020
@kleisauke
Copy link
Collaborator Author

To keep the changes minimal, I've dropped commit kleisauke@36236cf and a part of commit kleisauke@d2182e3 as they are still work-in-progress. For the same reason, this PR is also no longer dependent on #10524.

This is ready for reviewing now.

@kleisauke kleisauke marked this pull request as ready for review January 6, 2021 20:42
@kleisauke kleisauke changed the title [WIP] Fix failing Node tests in posixtest Fix failing Node tests in posixtest Jan 6, 2021
@@ -8062,6 +8062,7 @@ def test_fpic_static(self):
self.do_run_in_out_file_test('tests', 'core', 'test_hello_world.c')

@node_pthreads
@unittest.skip("this test can not be run without a pool, see test_pthread_create_pool")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So why not add PTHREAD_POOL_SIZE to this test rather than disabling it?

// exit the pthread properly on node, as a normal JS exception will halt
// the entire application.
process.exit(status);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why this was here in the first place? I don't really understand the comment or why process.exit would ever maskde sense for exiting a thread. (unless perhaps process.exit only applies to the current worker thread which seem unlikely).

In order works I agree this change looks good I just can't see why it was ever any different. Perhaps @kripken can lend some background. I believe this code dates back to the very first pthread node code: #9745

Also, can you mention this fix in the title and/or PR description so it describes what you are fixing exacly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was discovered by trial-and-error as a attempt to fix the failing Node posix tests. So, I'm not entirely sure why it was added (with commit 5a8c86d) and whether this change is correct.

I'll fix the title and PR description, after I've investigated it a bit further.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is adding EXIT_RUNTIME enough to make this test keep working?

Copy link
Collaborator Author

@kleisauke kleisauke Jan 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some further research, it looks like the process.exit() call in a Node worker will immediately cause the thread to exit, and not the whole program. Somehow this call also prevents from new threads to be created(?) and causes the program to hang indefinitely (see for example the added test_pthread_exit.c test case in 38b4bde which hangs on master). The failing stress-test mentioned in #9763 now also seems to work correctly on Node.

EXIT_RUNTIME doesn't make the test_pthread_create test case pass after this change, but calling explicitly exit(0); after cancelling the main loop does (see PR #13211).

Copy link
Member

@kripken kripken Jun 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I added this because indeed the throw on the next line would shut down the entire program (threads + main runtime), as the comment says. And process.exit seemed to just shut down the thread, but I can't find any docs to justify that, so it may have been a reliance on undefined behavior...

But this testcase generally shows that:

const { Worker, isMainThread, parentPort } = require('worker_threads');
if (isMainThread) {
  // This code is executed in the main thread and not in the worker.
  
  // Create the worker.
  const worker = new Worker(__filename);

  setInterval(() => { console.log('time') }, 1000);
} else {
  // This code is executed in the worker and not in the main thread.

  process.exit();
  //throw "foo";
}

Replace the process.exit with the throw to see the result (that is, the throw stops everything, while the process.exit just stops the worker, but the main thread can keep printing "time" every second).

But, it does look good to me to remove the process.exit here, assuming the throw does not halt the entire program. Do we have tests checking that, that is, that a node pthread can exit while things keep running?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(If things work then I assume we catch that throw higher up.)

@kleisauke kleisauke changed the title Fix failing Node tests in posixtest Ensure Node worker threads are exited gracefully Jan 8, 2021
@kleisauke
Copy link
Collaborator Author

Rebased on top of the master branch to resolve merge conflict.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

lgtm, although I'd like to get @kripken to take a look at the process.exit stuff. It certainly seems to be incompatible to the thread polling stuff so it makes sense to me that it should be removed... but I'd like ot know why it was ever added in the first place.

int main() {
printf("main\n");
for (int i = 0; i < NUM_THREADS; ++i) {
thread_state = NOT_CREATED_THREAD;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put pthread_t t; here in the loop?

* the argument (if the thread is not detached).
*
* We do this twice to ensure that the thread has exited
* gracefully and that we are able to recreate the thread.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have thought we have tests that cover these basics already don't we? I guess this is specific to pthread_exit rather then returning from the thread entry point?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would be useful to include a simplified test to make this issue easier to reproduce on the master branch. But, indeed this should already be covered with test_pthread_exit_1_2 in posixtestsuite which is enabled within commit 38b4bde.

Removed this test with commit d0f3bc0.

I guess this is specific to pthread_exit rather then returning from the thread entry point?

Yes, I think this issue only manifests when a call to pthread_exit is made explicitly, returning from the thread entry point should work without problems.

self.set_setting('EXIT_RUNTIME')
if not self.has_changed_setting('INITIAL_MEMORY'):
self.set_setting('INITIAL_MEMORY', '64mb')
self.do_runf(path_from_root('tests', 'pthread', 'test_pthread_create.cpp'), '')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if running these tests under tests/pthread that normally run in test_browser.py is something we should try to do more consistently here in test_core.py? Is there a particular reason you include this one as part of this PR?

Copy link
Collaborator Author

@kleisauke kleisauke Jan 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I included tests/pthread/test_pthread_create.cpp in test_core was that in #9763 it was mentioned that this could not be run with Node, this PR seems to fix that. IIRC there was a PR that allowed the test_browser tests to be run with Node, so this seems a bit redundant.

Removed this with commit d0f3bc0.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I think didn't mean to make you remove those two tests necessarily, I was mostly wondering out loud. For tests/pthread/test_pthread_create.cpp I think it probably is worth trying to get all of the tests/pthread/ tests working in test_core.py.. perhaps as a followup? You could include this single one with at TODO to add the rest if you like. Or not, its up to you.

@kleisauke
Copy link
Collaborator Author

but I'd like ot know why it was ever added in the first place.

I noticed that tests/core/pthread/create.cpp was also added in #9745. Perhaps the process.exit thing was a way to get this test pass without making an explicit call to exit(0); after cancelling the main loop?

fwiw, I can confirm that installing Emscripten v1.39.2 and cherry-picking 38b4bde on top of 5e3029d make this test hang indefinitely, but applying the exit(0); change from d1d1147 makes this test pass again. See for e.g. this test branch:
5e3029d...kleisauke:old-state

Base automatically changed from master to main March 8, 2021 23:49
@kleisauke
Copy link
Collaborator Author

Rebased and squashed, PTAL. I would also like to know why this was added in the first place, but can't quite answer this without guessing.

As for the test cases, it seems a bit redundant to copy already covered test cases from posixtestsuite to core.

Somehow these tests hang infinitely without using this setting.
This was not necessary before the rebase.
self.emcc_args += ['--bind']
self.do_run_in_out_file_test('core/pthread/create.cpp')

@node_pthreads
def test_pthread_exceptions(self):
self.set_setting('PTHREAD_POOL_SIZE', '2')
self.set_setting('EXIT_RUNTIME')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need EXIT_RUNTIME in basically all the pthreads tests on node, I wonder if maybe we should just enforce that, that is, error if node pthreads are run without it. There may not be a good enough reason to support pthreads without that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we need this in all our unit test I don't think that necessarily means that every program wants that behaviour. Basically its always desirable for our unittests to exit the process when they are done, but that isn't true for JS programs in general.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I worry this might be confusing but I agree we can't simplify it as easily as I thought.

@sbc100
Copy link
Collaborator

sbc100 commented Jun 18, 2021

I just ran into this issue myself when using pthead_exit() in test... the worker will die completely running under node and not be re-usable. This fix fixed my issue too.

I think we can/should land it.

@sbc100
Copy link
Collaborator

sbc100 commented Jun 18, 2021

I think this could be re-titled "ensure that node workers stay alive after they are done running a thread". It seems to me that without this fix workers cannot be re-used.

@sbc100 sbc100 merged commit bf7c1ea into emscripten-core:main Jun 18, 2021
@kleisauke kleisauke deleted the fix-posix-tests branch June 18, 2021 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants