-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Ensure Node worker threads are exited gracefully #12963
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
// cp suffix in the function name means "cancellation point", so this wait can be cancelled | ||
// by the users unless current threads cancelability is set to PTHREAD_CANCEL_DISABLE | ||
// which may be either done by the user of __timedwait() function. | ||
if (is_main_runtime_thread || pthread_self()->canceldisable != PTHREAD_CANCEL_DISABLE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does doesn't makes sense to me. Only the main thread should need to busy wait right? The main runtime thread, if running off the main thread should be able to block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, only the main browser thread should busy wait. Somehow when changing it to emscripten_is_main_browser_thread()
makes some posix tests stuck on Node (i.e. they never return anything). I still need to investigate why, hence the added TODO item:
https://github.com/kleisauke/emscripten/blob/f4c49eb7bdb36eaebefcca9d5a41c387d415d5ca/system/lib/libc/musl/src/thread/__timedwait.c#L44-L45
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've dropped this commit for now, will investigate it later (I'm not sure if emscripten_is_main_browser_thread()
should return true
on Node).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to do busy waiting on side threads in __timedwait_cp function. Because this function may be called from function which is cancellation point (see https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html point "2.9.5 Thread Cancellation").
As we don't have signals we need to check if request to cancel thread hasn't occurred during call to functions like pthread_cond_wait .
tests/test_core.py
Outdated
@@ -8062,6 +8062,7 @@ def test_fpic_static(self): | |||
self.do_run_in_out_file_test('tests', 'core', 'test_hello_world.c') | |||
|
|||
@node_pthreads | |||
@unittest.skip("this test can not be run without a pool, see test_pthread_create_pool") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we do run this today, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of tests/core/pthread/create.cpp
is that it spawns 2 detached threads where the first thread blocks until the second one is spawned. Without setting -sPTHREAD_POOL_SIZE=2
this can not be done properly, I think.
This test case was previously successful due to the process.exit(status);
call. For example, testing this on master:
diff --git a/tests/core/pthread/create.cpp b/tests/core/pthread/create.cpp
index 1111111..2222222 100644
--- a/tests/core/pthread/create.cpp
+++ b/tests/core/pthread/create.cpp
@@ -23,7 +23,12 @@ void *ThreadMain(void *arg) {
int last = sum.load();
while (sum.load() == last) {}
}
- pthread_exit((void*)TOTAL);
+ EM_ASM({
+ PThread.threadExit($0);
+ //process.exit($0);
+ throw 'unwind';
+ }, TOTAL);
+ return nullptr;
}
pthread_t thread[NUM_THREADS];
Will also deadlock, uncommenting process.exit($0);
makes the test case "pass".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. So why not add PTHREAD_POOL_SIZE to this test rather than disabling it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could add PTHREAD_POOL_SIZE
, but then the test case is essentially the same as test_pthread_create_pool
, see:
Lines 8066 to 8071 in 62007d1
@node_pthreads | |
def test_pthread_create_pool(self): | |
# with a pool, we can synchronously depend on workers being available | |
self.set_setting('PTHREAD_POOL_SIZE', '2') | |
self.emcc_args += ['-DALLOW_SYNC'] | |
self.do_run_in_out_file_test('tests', 'core', 'pthread', 'create.cpp') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. It seems like the code is supposed to honor ALLOW_SYNC
and when that is not defined it should not depend on synchronous thread creation. Maybe we need to fix the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You say "is that it spawns 2 detached threads where the first thread blocks until the second one is spawned"... I'm a little confused. I don't see any blocking happening the create.cpp code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I thought it was blocking on:
emscripten/tests/core/pthread/create.cpp
Lines 22 to 24 in b97bbc5
// wait for a change, so we see interleaved processing. | |
int last = sum.load(); | |
while (sum.load() == last) {} |
fwiw, here's a native program: https://gist.github.com/kleisauke/232be709d82b6aaecfa0f28a8be27fe3
(with that program I see that it runs 2 threads next to each other, I suppose that isn't supported unless you set PTHREAD_POOL_SIZE
to a number equal to or greater than 2(?)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats not quite true. Without PTHREAD_POOL_SIZE
emscripten needs to start new workers for each thread, which means you need to return to event loop before the thread will actually start. This means what you can't do is not return from main.. this code in create.cpp as two modes: if ALLOW_SYNC is set it will busy wait in main (and not return), if ALLOW_SYNC is not set it will run an event loop by calling emscripten_set_main_loop (which returns from main).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another way of putting: the thread pool is only needed if you need to do synchronous thread creation. At least IIUC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, you're right. This test case isn't blocking, sorry for the noise. It looks like the Node workers are not teared down when the main loop is canceled. I could fix this with explicitly calling PThread.terminateAllThreads();
on exit, see for example: kleisauke@6141e7e.
I think this behavior is being investigated within issue #12801.
Some flaky tests I noticed:
|
2831f73
to
4376982
Compare
c753de8
to
71333f7
Compare
To keep the changes minimal, I've dropped commit kleisauke@36236cf and a part of commit kleisauke@d2182e3 as they are still work-in-progress. For the same reason, this PR is also no longer dependent on #10524. This is ready for reviewing now. |
tests/test_core.py
Outdated
@@ -8062,6 +8062,7 @@ def test_fpic_static(self): | |||
self.do_run_in_out_file_test('tests', 'core', 'test_hello_world.c') | |||
|
|||
@node_pthreads | |||
@unittest.skip("this test can not be run without a pool, see test_pthread_create_pool") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. So why not add PTHREAD_POOL_SIZE to this test rather than disabling it?
// exit the pthread properly on node, as a normal JS exception will halt | ||
// the entire application. | ||
process.exit(status); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why this was here in the first place? I don't really understand the comment or why process.exit would ever maskde sense for exiting a thread. (unless perhaps process.exit only applies to the current worker thread which seem unlikely).
In order works I agree this change looks good I just can't see why it was ever any different. Perhaps @kripken can lend some background. I believe this code dates back to the very first pthread node code: #9745
Also, can you mention this fix in the title and/or PR description so it describes what you are fixing exacly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was discovered by trial-and-error as a attempt to fix the failing Node posix tests. So, I'm not entirely sure why it was added (with commit 5a8c86d) and whether this change is correct.
I'll fix the title and PR description, after I've investigated it a bit further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is adding EXIT_RUNTIME enough to make this test keep working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some further research, it looks like the process.exit()
call in a Node worker will immediately cause the thread to exit, and not the whole program. Somehow this call also prevents from new threads to be created(?) and causes the program to hang indefinitely (see for example the added test_pthread_exit.c
test case in 38b4bde which hangs on master). The failing stress-test mentioned in #9763 now also seems to work correctly on Node.
EXIT_RUNTIME
doesn't make the test_pthread_create
test case pass after this change, but calling explicitly exit(0);
after cancelling the main loop does (see PR #13211).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I added this because indeed the throw on the next line would shut down the entire program (threads + main runtime), as the comment says. And process.exit
seemed to just shut down the thread, but I can't find any docs to justify that, so it may have been a reliance on undefined behavior...
But this testcase generally shows that:
const { Worker, isMainThread, parentPort } = require('worker_threads');
if (isMainThread) {
// This code is executed in the main thread and not in the worker.
// Create the worker.
const worker = new Worker(__filename);
setInterval(() => { console.log('time') }, 1000);
} else {
// This code is executed in the worker and not in the main thread.
process.exit();
//throw "foo";
}
Replace the process.exit with the throw to see the result (that is, the throw stops everything, while the process.exit just stops the worker, but the main thread can keep printing "time" every second).
But, it does look good to me to remove the process.exit here, assuming the throw does not halt the entire program. Do we have tests checking that, that is, that a node pthread can exit while things keep running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(If things work then I assume we catch that throw higher up.)
3b7d851
to
63d8c81
Compare
63d8c81
to
38b4bde
Compare
Rebased on top of the master branch to resolve merge conflict. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
lgtm, although I'd like to get @kripken to take a look at the process.exit
stuff. It certainly seems to be incompatible to the thread polling stuff so it makes sense to me that it should be removed... but I'd like ot know why it was ever added in the first place.
int main() { | ||
printf("main\n"); | ||
for (int i = 0; i < NUM_THREADS; ++i) { | ||
thread_state = NOT_CREATED_THREAD; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put pthread_t t;
here in the loop?
* the argument (if the thread is not detached). | ||
* | ||
* We do this twice to ensure that the thread has exited | ||
* gracefully and that we are able to recreate the thread. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have thought we have tests that cover these basics already don't we? I guess this is specific to pthread_exit rather then returning from the thread entry point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it would be useful to include a simplified test to make this issue easier to reproduce on the master branch. But, indeed this should already be covered with test_pthread_exit_1_2
in posixtestsuite which is enabled within commit 38b4bde.
Removed this test with commit d0f3bc0.
I guess this is specific to pthread_exit rather then returning from the thread entry point?
Yes, I think this issue only manifests when a call to pthread_exit
is made explicitly, returning from the thread entry point should work without problems.
tests/test_core.py
Outdated
self.set_setting('EXIT_RUNTIME') | ||
if not self.has_changed_setting('INITIAL_MEMORY'): | ||
self.set_setting('INITIAL_MEMORY', '64mb') | ||
self.do_runf(path_from_root('tests', 'pthread', 'test_pthread_create.cpp'), '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if running these tests under tests/pthread
that normally run in test_browser.py
is something we should try to do more consistently here in test_core.py
? Is there a particular reason you include this one as part of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I included tests/pthread/test_pthread_create.cpp
in test_core
was that in #9763 it was mentioned that this could not be run with Node, this PR seems to fix that. IIRC there was a PR that allowed the test_browser
tests to be run with Node, so this seems a bit redundant.
Removed this with commit d0f3bc0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I think didn't mean to make you remove those two tests necessarily, I was mostly wondering out loud. For tests/pthread/test_pthread_create.cpp
I think it probably is worth trying to get all of the tests/pthread/
tests working in test_core.py.. perhaps as a followup? You could include this single one with at TODO to add the rest if you like. Or not, its up to you.
I noticed that fwiw, I can confirm that installing Emscripten v1.39.2 and cherry-picking 38b4bde on top of 5e3029d make this test hang indefinitely, but applying the |
d0f3bc0
to
a0b9031
Compare
Rebased and squashed, PTAL. I would also like to know why this was added in the first place, but can't quite answer this without guessing. As for the test cases, it seems a bit redundant to copy already covered test cases from posixtestsuite to core. |
Somehow these tests hang infinitely without using this setting. This was not necessary before the rebase.
self.emcc_args += ['--bind'] | ||
self.do_run_in_out_file_test('core/pthread/create.cpp') | ||
|
||
@node_pthreads | ||
def test_pthread_exceptions(self): | ||
self.set_setting('PTHREAD_POOL_SIZE', '2') | ||
self.set_setting('EXIT_RUNTIME') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need EXIT_RUNTIME
in basically all the pthreads tests on node, I wonder if maybe we should just enforce that, that is, error if node pthreads are run without it. There may not be a good enough reason to support pthreads without that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we need this in all our unit test I don't think that necessarily means that every program wants that behaviour. Basically its always desirable for our unittests to exit the process when they are done, but that isn't true for JS programs in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I worry this might be confusing but I agree we can't simplify it as easily as I thought.
I just ran into this issue myself when using I think we can/should land it. |
I think this could be re-titled "ensure that node workers stay alive after they are done running a thread". It seems to me that without this fix workers cannot be re-used. |
Resolves (partially): #9763.