-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
Deadlock in threaded application when using sys._current_frames
#106883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Oh, after several modifications to the interpreter, Anything can happen during GC. In this case, it releases GIL and allows other threads to get deadlocked. Also, GC itself can already trigger deadlocks. "GC being triggered with a lock held" has caused serval problems. See the discussion in #106207 |
We are also seeing this issue at work. I found one reproducible example running Python 3.11.7 (see below), it's much more difficult to reproduce on higher spec machines. On my 8-core cloud VM, I had to run 8 processes at the same time. After a few minutes, a few processes no longer produce output. Although they are still using some CPUs doing something. Since GC is involved here, I haven't been able to reproduce it on Python 3.12.1, likely due to the same 3.12 change to only run GC on the eval breaker as mentioned by @pablogsal here: #106905 (comment) import argparse
import datetime
import difflib
import gc
import random
import sys
import threading
import time
def do_some_work(iterations, sleep_before_work=False):
if sleep_before_work:
# Wait before starting the work, so more concurrent threads could be created
time.sleep(1)
for _ in range(iterations):
seq_a = [str(random.randint(0, 10)) + "\n" for _ in range(100)]
seq_b = [str(random.randint(0, 10)) + "\n" for _ in range(100)]
udiff = list(difflib.unified_diff(seq_a, seq_b))
udiff.append("THE END")
fibonacci(16)
def fibonacci(n):
if n <= 2:
return 1
return fibonacci(n - 1) + fibonacci(n - 2)
def gc_callback(phase, info):
del phase, info
do_some_work(1)
gc.callbacks.append(gc_callback)
class Runner:
def __init__(self):
self.watch = True
def watch_frames(self):
while self.watch:
time.sleep(0.1)
cf = sys._current_frames()
print(
f">>>> {datetime.datetime.now()} number of frames: {len(cf)}",
flush=True,
)
def run(self, num_runs, num_threads, work_load):
self.watch = True
watcher = threading.Thread(target=self.watch_frames)
watcher.start()
for i in range(num_runs):
start = time.time()
print(
f">>>> {datetime.datetime.now()} Run: {i}/{num_runs} started",
flush=True,
)
ts = []
for _ in range(num_threads):
t = threading.Thread(target=do_some_work, args=(work_load, True))
t.start()
ts.append(t)
for t in ts:
t.join()
print(
f">>>> {datetime.datetime.now()} Run: {i}/{num_runs} took"
f" {time.time() - start} seconds",
flush=True,
)
self.watch = False
watcher.join()
def main():
print(f">>>> Running on {sys.version_info=}")
print(f">>>> {datetime.datetime.now()} START")
parser = argparse.ArgumentParser()
parser.add_argument("--num_runs", type=int, default=1000)
parser.add_argument("--num_threads", type=int, default=512)
parser.add_argument("--work_load", type=int, default=1)
ns = parser.parse_args()
runner = Runner()
runner.run(ns.num_runs, ns.num_threads, ns.work_load)
print(f">>>> {datetime.datetime.now()} DONE")
if __name__ == "__main__":
main() |
FWIW, backporting #97920 to 3.11 makes the above example no longer reproducible. |
Should we temporarily disable the GC during |
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. It has been suggested to disable GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls. Jira: ENTLLT-7285 Change-Id: I2548d07803fc98db8717057ae3006e6af68b2f86
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. It has been suggested to disable GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls. Jira: ENTLLT-7285 Change-Id: I2548d07803fc98db8717057ae3006e6af68b2f86
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. It has been suggested to disable GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls.
@colesbury I've created a PR #117332 trying to fix this issue by implementing what you have suggested. Please let me know if this looks reasonable or if I am completely off-road :) |
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. By disabling the GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls fixes the issue.
This is a reland of commit ae42a26 The CL was reverted as it caused devices to hang when running tests on android. (python/cpython#106883) This has been fixed with a patch that fixed a bug in the python 3.11 interpreter. https://chromium-review.googlesource.com/c/infra/infra/+/6155095 And then rolling out that change to depot_tools: https://chromium-review.googlesource.com/c/chromium/tools/depot_tools/+/6158273 Original change's description: > Reland Migrate to python 3.11 from 3.8 > > Original CL: https://chromium-review.googlesource.com/c/chromium/src/+/5872818 > was reverted and was not able to create an actual reland because of > rebase merge failure with "create reland" button. > > Previous CL was reverted from failing: > chromeos-jacuzzi-rel > chromeos-octopus-rel > > This is from a incompatibility in aioquic, which is now fixed with > an update to aioquick 1.2 (and its supporting libraries) as well as > adding python and permission changes to chromeos. > > crrev.com/5904256 > crrev.com/5904734 > > and some internal changes. > > The webtransport_h3_server.py file has to be changed here with the > vpython change, and then wpt can be rolled. The change is already > in the wpt repo, but has been manually excluded in the roll. > > Bug: 40942322 > Change-Id: Id12b7085fbc1ffd7694712013e6f746e65c3499f > Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5887447 > Reviewed-by: Ben Pastene <[email protected]> > Reviewed-by: Brian Sheedy <[email protected]> > Commit-Queue: Benjamin Joyce (Ben) <[email protected]> > Reviewed-by: Jonathan Lee <[email protected]> > Cr-Commit-Position: refs/heads/main@{#1370866} Bug: 40942322 Change-Id: Icce4762639fa96104f81c114e853f8069c95cf85 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6168586 Reviewed-by: Jonathan Lee <[email protected]> Reviewed-by: mmenke <[email protected]> Reviewed-by: Ben Pastene <[email protected]> Reviewed-by: Brian Sheedy <[email protected]> Commit-Queue: Benjamin Joyce (Ben) <[email protected]> Cr-Commit-Position: refs/heads/main@{#1410521}
This reverts commit a5a4a98. Reason for revert: LUCI Bisection has identified this change as the cause of a test failure. See the analysis: https://ci.chromium.org/ui/p/chromium/bisection/test-analysis/b/5987431141605376 Sample build with failed test: https://ci.chromium.org/b/8724910883092854417 Affected test(s): [ninja://:chrome_wpt_tests/external/wpt/webdriver/tests/bidi/session/unsubscribe/invalid.py](https://ci.chromium.org/ui/test/chromium/ninja:%2F%2F:chrome_wpt_tests%2Fexternal%2Fwpt%2Fwebdriver%2Ftests%2Fbidi%2Fsession%2Funsubscribe%2Finvalid.py?q=VHash%3A1b1e9c5537be9730) If this is a false positive, please report it at http://b.corp.google.com/createIssue?component=1199205&description=Analysis%3A+https%3A%2F%2Fci.chromium.org%2Fui%2Fp%2Fchromium%2Fbisection%2Ftest-analysis%2Fb%2F5987431141605376&format=PLAIN&priority=P3&title=Wrongly+blamed+https%3A%2F%2Fchromium-review.googlesource.com%2Fc%2Fchromium%2Fsrc%2F%2B%2F6168586&type=BUG Original change's description: > Reland "Reland Migrate to python 3.11 from 3.8" > > This is a reland of commit ae42a26 > > The CL was reverted as it caused devices to hang when running tests > on android. (python/cpython#106883) > > This has been fixed with a patch that fixed a bug in the python > 3.11 interpreter. > > https://chromium-review.googlesource.com/c/infra/infra/+/6155095 > > And then rolling out that change to depot_tools: > https://chromium-review.googlesource.com/c/chromium/tools/depot_tools/+/6158273 > > Original change's description: > > Reland Migrate to python 3.11 from 3.8 > > > > Original CL: https://chromium-review.googlesource.com/c/chromium/src/+/5872818 > > was reverted and was not able to create an actual reland because of > > rebase merge failure with "create reland" button. > > > > Previous CL was reverted from failing: > > chromeos-jacuzzi-rel > > chromeos-octopus-rel > > > > This is from a incompatibility in aioquic, which is now fixed with > > an update to aioquick 1.2 (and its supporting libraries) as well as > > adding python and permission changes to chromeos. > > > > crrev.com/5904256 > > crrev.com/5904734 > > > > and some internal changes. > > > > The webtransport_h3_server.py file has to be changed here with the > > vpython change, and then wpt can be rolled. The change is already > > in the wpt repo, but has been manually excluded in the roll. > > > > Bug: 40942322 > > Change-Id: Id12b7085fbc1ffd7694712013e6f746e65c3499f > > Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5887447 > > Reviewed-by: Ben Pastene <[email protected]> > > Reviewed-by: Brian Sheedy <[email protected]> > > Commit-Queue: Benjamin Joyce (Ben) <[email protected]> > > Reviewed-by: Jonathan Lee <[email protected]> > > Cr-Commit-Position: refs/heads/main@{#1370866} > > Bug: 40942322 > Change-Id: Icce4762639fa96104f81c114e853f8069c95cf85 > Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6168586 > Reviewed-by: Jonathan Lee <[email protected]> > Reviewed-by: mmenke <[email protected]> > Reviewed-by: Ben Pastene <[email protected]> > Reviewed-by: Brian Sheedy <[email protected]> > Commit-Queue: Benjamin Joyce (Ben) <[email protected]> > Cr-Commit-Position: refs/heads/main@{#1410521} > Bug: 40942322 Change-Id: I0610989dbb4011d4d10b3c5b2cedaf1417e80c99 No-Presubmit: true No-Tree-Checks: true No-Try: true Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6196028 Owners-Override: Ivana Žužić <[email protected]> Reviewed-by: Ivana Žužić <[email protected]> Commit-Queue: Ivana Žužić <[email protected]> Cr-Commit-Position: refs/heads/main@{#1410881}
Is this issue affecting Python 3.10 as well, or only Python 3.11? |
All instances of this issue I have observed so far happened with Python 3.11. I can also reproduce the deadlock with Yilei's example in Python 3.11, but not in 3.10, so going to assume Python 3.10 is not affected. Planning to work around by replacing calls to
|
@tvalentyn you can also upgrade to 3.12. |
This is a reland of commit a5a4a98 The revert was caused by hanging swarming tasks which had a separate version of python that wasn't synced to the buildbucket version. (See crbug.com/395160553) That version was fixed with https://chrome-internal-review.googlesource.com/c/infradata/config/+/8013788 and a bug has been filed for the versions not being in sync. Original change's description: > Reland "Reland Migrate to python 3.11 from 3.8" > > This is a reland of commit ae42a26 > > The CL was reverted as it caused devices to hang when running tests > on android. (python/cpython#106883) > > This has been fixed with a patch that fixed a bug in the python > 3.11 interpreter. > > https://chromium-review.googlesource.com/c/infra/infra/+/6155095 > > And then rolling out that change to depot_tools: > https://chromium-review.googlesource.com/c/chromium/tools/depot_tools/+/6158273 > > Original change's description: > > Reland Migrate to python 3.11 from 3.8 > > > > Original CL: https://chromium-review.googlesource.com/c/chromium/src/+/5872818 > > was reverted and was not able to create an actual reland because of > > rebase merge failure with "create reland" button. > > > > Previous CL was reverted from failing: > > chromeos-jacuzzi-rel > > chromeos-octopus-rel > > > > This is from a incompatibility in aioquic, which is now fixed with > > an update to aioquick 1.2 (and its supporting libraries) as well as > > adding python and permission changes to chromeos. > > > > crrev.com/5904256 > > crrev.com/5904734 > > > > and some internal changes. > > > > The webtransport_h3_server.py file has to be changed here with the > > vpython change, and then wpt can be rolled. The change is already > > in the wpt repo, but has been manually excluded in the roll. > > > > Bug: 40942322 > > Change-Id: Id12b7085fbc1ffd7694712013e6f746e65c3499f > > Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5887447 > > Reviewed-by: Ben Pastene <[email protected]> > > Reviewed-by: Brian Sheedy <[email protected]> > > Commit-Queue: Benjamin Joyce (Ben) <[email protected]> > > Reviewed-by: Jonathan Lee <[email protected]> > > Cr-Commit-Position: refs/heads/main@{#1370866} > > Bug: 40942322 > Change-Id: Icce4762639fa96104f81c114e853f8069c95cf85 > Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6168586 > Reviewed-by: Jonathan Lee <[email protected]> > Reviewed-by: mmenke <[email protected]> > Reviewed-by: Ben Pastene <[email protected]> > Reviewed-by: Brian Sheedy <[email protected]> > Commit-Queue: Benjamin Joyce (Ben) <[email protected]> > Cr-Commit-Position: refs/heads/main@{#1410521} Bug: 40942322 Change-Id: I10c428827d8c5a5618dd39f5a5f09484ee3f2d62 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6198919 Reviewed-by: mmenke <[email protected]> Reviewed-by: Ben Pastene <[email protected]> Reviewed-by: Brian Sheedy <[email protected]> Commit-Queue: Benjamin Joyce (Ben) <[email protected]> Reviewed-by: Jonathan Lee <[email protected]> Cr-Commit-Position: refs/heads/main@{#1420258}
When using threaded applications, there is a high risk of a deadlock in the interpreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. By disabling the GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls fixes the issue.
On some platforms the test test_current_frames_exceptions_deadlock goes in timeout when CPython is built with debug. The execution time of the test is about 4x and the current timeout is not enough to complete the test.
On some platforms the test test_current_frames_exceptions_deadlock goes in timeout when CPython is built with debug. The execution time of the test is about 4x and the current timeout is not enough to complete the test.
On some platforms the test test_current_frames_exceptions_deadlock goes in timeout when CPython is built with debug. The execution time of the test is about 4x and the current timeout is not enough to complete the test.
Bug report
When using
sys._current_frames
in a threaded application there is a high risk of a deadlock in the interpreter. Below is my analysis from one such hang. We have never seen this problem with Python 3.8 but experience it regularly with Python 3.11. I know that thesys._current_frames
documentation warns about the function and one solution to this bug report could be to document that the function is not thread safe.We have a huge number of threads in this application but I'll limit the output to the two threads that cause the deadlock.
We have a thread calling
sys._current_frames
with a Python stack trace:and a corresponding C stack trace:
So, this thread does not hold the GIL but seems to hold the
runtime->interpreters.mutex
We have another thread that seems to hold the GIL:
See the
native_thread_id
in the following gdb output.Your environment
Python 3.11.4 on a CentOS 7 machine with conda-forge binaries
Linked PRs
The text was updated successfully, but these errors were encountered: