Skip to content

Deadlock on using ProcessPoolExecutor in fork mp context #105464

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
kekekekule opened this issue Jun 7, 2023 · 8 comments
Open

Deadlock on using ProcessPoolExecutor in fork mp context #105464

kekekekule opened this issue Jun 7, 2023 · 8 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@kekekekule
Copy link

kekekekule commented Jun 7, 2023

Hi!
I got an issue using Python 3.9.16 and 3.9.17

Launching this code (inspired by #90622)

from concurrent import futures
import faulthandler
import asyncio

def work(iteration, item):
    #anything here, even a pass
    pass


async def do_one(i):
    faulthandler.dump_traceback_later(20, repeat=True)
    with futures.ProcessPoolExecutor() as executor:
        loop = asyncio.get_event_loop()
        loop.run_in_executor(executor, work, i, 1)
        print(f"executor: max_workers={executor._max_workers}, processes={executor._processes}, method={executor._mp_context.get_start_method(allow_none=False)}, safe={executor._safe_to_dynamically_spawn_children}")
    faulthandler.cancel_dump_traceback_later()

async def run():
    for i in range(10000):
        await do_one(i)


async def main():
    task = asyncio.create_task(run())
    await asyncio.gather(task)

if __name__ == "__main__":
    asyncio.run(main())

Causes a deadlock sometimes.

The faulthandler dumps the traceback below on timeout:

Thread 0x0000004018db2700 (most recent call first):
  File "/usr/local/lib/python3.9/multiprocessing/popen_fork.py", line 27 in poll
  File "/usr/local/lib/python3.9/multiprocessing/popen_fork.py", line 43 in wait
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 149 in join
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 521 in join_executor_internals
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 343 in run
  File "/usr/local/lib/python3.9/threading.py", line 980 in _bootstrap_inner
  File "/usr/local/lib/python3.9/threading.py", line 937 in _bootstrap

Thread 0x0000004002fcf3c0 (most recent call first):
  File "/usr/local/lib/python3.9/threading.py", line 1080 in _wait_for_tstate_lock
  File "/usr/local/lib/python3.9/threading.py", line 1060 in join
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 767 in shutdown
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 637 in __exit__
  File "//t.py", line 17 in do_one
  File "//t.py", line 22 in run
  File "/usr/local/lib/python3.9/asyncio/events.py", line 80 in _run
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 1905 in _run_once
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 601 in run_forever
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 634 in run_until_complete
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44 in run
  File "//t.py", line 31 in <module>

The pstree shows the following:

root@6b8cab06d7d7:/# pstree -pst
bash(1)─┬─python(4650)─┬─python(11873)───{python}(11875)
        │              ├─{python}(4652)
        │              └─{python}(11874)
        └─{bash}(8)

Running under docker launched in QEMU on Mac OS M1:

root@6b8cab06d7d7:/# uname -a
Linux 6b8cab06d7d7 5.15.68-0-virt #1-Alpine SMP Fri, 16 Sep 2022 06:29:31 +0000 x86_64 x86_64 x86_64 GNU/Linux

Tried to analyze under strace and suspect that the problem, here's the last output, after that it simply hangs:

37 close(7) = 0
37 close(6) = 0
37 getpid() = 37
304 write(11,0x630d440,104) = 104
304 read(8,0x5e2c410,4) = 4
304 read(8,0x5e2c410,4) = 4
304 getpid() = 304
304 write(11,0x5e427d0,19) = 19
304 exit_group(0)
37 wait4(304,275152568292,0,0,0,39) = 304
37 getpid() = 37

Having child processes with pids 304 and 305, on interrupting my script I got from strace that it hung on wait4 of pid = 305.

^C37 wait4(305,275152568292,0,0,0,39) = -1 errno=4 (Interrupted system call)
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL, si_pid=0, si_uid=0} ---

304 exits normally after SIGINT, 305 not, still hangs.

Did not check versions upper than 3.9.x. However, I do not exclude that the problem can be reproducible on higher versions of Python.

UPD. Reproduced the same on Python 3.10.10:

Thread 0x00000040106b5700 (most recent call first):
  File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 27 in poll
  File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 43 in wait
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 149 in join
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 521 in join_executor_internals
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 343 in run
  File "/usr/local/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/local/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x0000004002fcd3c0 (most recent call first):
  File "/usr/local/lib/python3.10/threading.py", line 1116 in _wait_for_tstate_lock
  File "/usr/local/lib/python3.10/threading.py", line 1096 in join
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 775 in shutdown
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 649 in __exit__
  File "//t.py", line 14 in do_one
  File "//t.py", line 22 in run
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80 in _run
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1909 in _run_once
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 603 in run_forever
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 636 in run_until_complete
  File "/usr/local/lib/python3.10/asyncio/runners.py", line 44 in run
  File "//t.py", line 31 in <module>

UPD2. Just investigated, that when issue happens, one of the children isn't even introduced in strace except wait4.
UPD3. Some logging from multiprocessing:

[DEBUG/MainProcess] created semlock with handle 275333459968
[DEBUG/MainProcess] created semlock with handle 275333464064
[DEBUG/MainProcess] created semlock with handle 275333468160
[DEBUG/MainProcess] created semlock with handle 275333472256
[DEBUG/MainProcess] created semlock with handle 275333476352
[DEBUG/ForkProcess-3332] Queue._after_fork()
[INFO/ForkProcess-3332] child process calling self.run()
executor: max_workers=2, processes={32178: <ForkProcess name='ForkProcess-3331' pid=32178 parent=22185 started>, 32179: <ForkProcess name='ForkProcess-3332' pid=32179 parent=22185 started>}, method=fork, safe=False
[DEBUG/MainProcess] Queue._start_thread()
[DEBUG/MainProcess] doing self._thread.start()
[DEBUG/ForkProcess-3331] Queue._after_fork()
[DEBUG/MainProcess] starting thread to feed data to pipe
[DEBUG/MainProcess] ... done self._thread.start()
[INFO/ForkProcess-3331] child process calling self.run()
[DEBUG/MainProcess] telling queue thread to quit
[INFO/ForkProcess-3332] process shutting down
[DEBUG/ForkProcess-3332] running all "atexit" finalizers with priority >= 0
[DEBUG/ForkProcess-3332] running the remaining "atexit" finalizers
[DEBUG/MainProcess] Queue.join_thread()
[INFO/ForkProcess-3332] process exiting with exitcode 0
[DEBUG/MainProcess] joining queue thread
[INFO/ForkProcess-3331] process shutting down
[DEBUG/MainProcess] feeder thread got sentinel -- exiting
[DEBUG/MainProcess] ... queue thread joined
[DEBUG/ForkProcess-3331] running all "atexit" finalizers with priority >= 0
[DEBUG/ForkProcess-3331] running the remaining "atexit" finalizers
[INFO/ForkProcess-3331] process exiting with exitcode 0
[DEBUG/MainProcess] created semlock with handle 275333480448
[DEBUG/MainProcess] created semlock with handle 275333484544
[DEBUG/MainProcess] created semlock with handle 275333488640
[DEBUG/MainProcess] created semlock with handle 275333492736
[DEBUG/MainProcess] created semlock with handle 275333496832
executor: max_workers=2, processes={32184: <ForkProcess name='ForkProcess-3333' pid=32184 parent=22185 started>, 32186: <ForkProcess name='ForkProcess-3334' pid=32186 parent=22185 started>}, method=fork, safe=False
[DEBUG/MainProcess] Queue._start_thread()
[DEBUG/MainProcess] doing self._thread.start()
[DEBUG/MainProcess] starting thread to feed data to pipe
[DEBUG/MainProcess] ... done self._thread.start()
[DEBUG/ForkProcess-3334] Queue._after_fork()
[INFO/ForkProcess-3334] child process calling self.run()
[DEBUG/MainProcess] telling queue thread to quit
[DEBUG/MainProcess] Queue.join_thread()
[DEBUG/MainProcess] joining queue thread
[DEBUG/MainProcess] feeder thread got sentinel -- exiting
[INFO/ForkProcess-3334] process shutting down
[DEBUG/MainProcess] ... queue thread joined
[DEBUG/ForkProcess-3334] running all "atexit" finalizers with priority >= 0
[DEBUG/ForkProcess-3334] running the remaining "atexit" finalizers
[INFO/ForkProcess-3334] process exiting with exitcode 0
@kekekekule kekekekule added the type-bug An unexpected behavior, bug, or error label Jun 7, 2023
@kekekekule
Copy link
Author

@AlexWaygood see you added this issue on a board. However, could you validate please whether this is a real bug that was not fixes by some multiprocessing/threading/concurrent patches in 3.9

@AlexWaygood
Copy link
Member

Hi, I was just doing some drive-by triaging. I'm a core dev, but I don't primarily work on multiprocessing issues in CPython. I could investigate further, but I'm not a multiprocessing expert, so I'm probably not best placed to do so. I also have a number of other projects on the go at the moment. I added it to the board so that members of the team who are experts on multiprocessing will be able to find it and work on it more easily.

However, please try to see if the problem reproduces on Python 3.11 or later. Python 3.9 and 3.10 are now in "security-only" mode — we won't fix bugs in those versions of Python unless they relate to a security vulnerability.

@kekekekule
Copy link
Author

@AlexWaygood Could reproduce on Python 3.11.2.

So, maybe it's worth labeling the issue for python 3.9, 3.10, 3.11.

@AlexWaygood AlexWaygood added the 3.11 only security fixes label Jun 8, 2023
@AlexWaygood
Copy link
Member

So, maybe it's worth labeling the issue for python 3.9, 3.10, 3.11.

Unless you can persuasively argue it's a security issue, we won't be backporting a fix to 3.10 or older, so I won't add the 3.10 or 3.9 label for now.

@kekekekule
Copy link
Author

kekekekule commented Jun 8, 2023

A new thing I noticed: could reproduce this deadlock even on Python 3.8. Just highlighting for easier debugging.

@YvesDup
Copy link
Contributor

YvesDup commented Jun 13, 2023

Hi @kekekekule , I am just wondering if bug is still present when running ProcessPoolExecuter with a mp_context as spawn or forkserver ?

@kekekekule
Copy link
Author

Spawn/forkserver -- no.
From my repro it follows that I was using fork. I know spawn/forkserver is safer, but this does not cancel the issue :)

@YvesDup
Copy link
Contributor

YvesDup commented Jun 13, 2023

Spawn/forkserver -- no

Sorry, I don't understand if you ran your script with other context than 'fork' just to check if there is still something wrong.

@iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 26, 2023
@gpshead gpshead removed the 3.11 only security fixes label Oct 14, 2024
@gpshead gpshead changed the title Deadlock on using ProcessPoolExecutor Deadlock on using ProcessPoolExecutor in fork mp context Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
Status: No status
Development

No branches or pull requests

5 participants