asyncio: FutureIter_dealloc() crashes with negative refcount #122695
Comments
I can take a look at this whenever you get a reproducer. |
Recent change related to FutureIter_dealloc(): commit 23192ab. Python 3.12.4 contains this change. Do you reproduce this issue with Python 3.12.3? |
I'll give it a go on 3.12.3. On 3.12.4, I hit another symptom of the issue this weekend: instead of crashing on the assert, it sometimes sends the code into an infinite loop (or at least a collection takes more than ~20h). It's not clear whether the loop comes from gc.get_referents() returning some sort of infinite stream of values or whether the GC itself is just stuck. Here is the gdb backtrace of the process interrupted at a random point:
|
Now trying with
The C backtrace looks like that:
|
Obviously, debugging will be much easier with a reproducer :-) |
I've spent some time trying to get a minimal reproducer and could not get it to work. It's possible this is related to running under pytest for some reason. As it stands, the only reproducer I can offer is:
EDIT: remove |
A bit of progress with gdb on v3.12.3: the infinite loop is here: EDIT: It looks like module_free_freelists would cause a refcount issue if |
I added some asserts on the v3.12.3 tag:
diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index a465090bfaa..9b47661b1fc 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -1597,6 +1597,7 @@ FutureIter_dealloc(futureiterobject *it)
     if (state->fi_freelist_len < FI_FREELIST_MAXLEN) {
         state->fi_freelist_len++;
         it->future = (FutureObj*) state->fi_freelist;
+        assert((void*)it->future != (void*)it);
         state->fi_freelist = it;
     }
     else {
@@ -1818,6 +1819,7 @@ future_new_iter(PyObject *fut)
     }
     it->future = (FutureObj*)Py_NewRef(fut);
+    assert((void*)it->future != (void*)it);
     PyObject_GC_Track(it);
     return (PyObject*)it;
 }
@@ -3595,6 +3597,7 @@ module_traverse(PyObject *mod, visitproc visit, void *arg)
         PyObject *current = next;
         Py_VISIT(current);
         next = (PyObject*) ((futureiterobject*) current)->future;
+        assert((void*)next != (void*)current);
     }
     return 0;
 }
And this one failed to pass:
|
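The asserts in the diff above all guard against a freelist entry whose `future` link points back at itself. As a rough illustration of why that would matter, here is a toy Python model (not CPython's code; `Slot`, `Freelist`, `push` and `walk` are hypothetical names) of an intrusive singly linked freelist that reuses the `future` field as its "next" pointer. It assumes, as is hypothesized later in this thread, that the same object can be released twice:

```python
class Slot:
    """Stand-in for a futureiterobject; `future` doubles as the freelist link."""

    def __init__(self, name):
        self.name = name
        self.future = None


class Freelist:
    def __init__(self):
        self.head = None
        self.length = 0

    def push(self, it):
        # Same shape as the dealloc path in the diff:
        #   it->future = freelist; freelist = it;
        self.length += 1
        it.future = self.head
        self.head = it

    def walk(self, limit=10):
        # Same shape as the traversal loop, but with a step limit
        # so that a cycle cannot hang this example.
        names = []
        current = self.head
        while current is not None and len(names) < limit:
            names.append(current.name)
            current = current.future
        return names


fl = Freelist()
a = Slot("a")

fl.push(a)
once = fl.walk()             # a single entry, list terminates normally

fl.push(a)                   # assumption: the same object is released again
twice = fl.walk()            # only stops because of the step limit
self_linked = a.future is a  # the entry now links to itself
```

In this model the second `push` makes the head entry its own successor, so an unbounded traversal would never reach the end of the list. Whether this is what actually happens in `_asynciomodule.c` is exactly what the asserts are meant to find out.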
At a very speculative glance, I would guess that the infinite loop problem originates from |
@ZeroIntensity unfortunately I could not create a reproducer beyond the existing test suite, but I'm happy to apply a patch on cpython and run with that to help diagnose further, provide coredumps etc. What is quite strange is that this only triggers an issue on arm64, even though I could see nothing arch-specific in that code. That test has been running on x86-64 since ~2018 without problems (obviously not on Python 3.12 all this time but still). |
If this only happens on ARM, then I think this might be an actual bug with the garbage collector. Since we don't have a standalone reproducer, you'll have to help me here a bit:
|
That was my initial guess, but then the issue seemed to be a logic issue in Modules/_asynciomodule.c . It's hard to know for sure though as it could just be an invariant the gc is supposed to provide that is broken.
Yes, see #122695 (comment) . You may be able to skip the The issue manifests as either the test hanging or sometimes (rarely) triggering a negative refcount check assert.
The gc hangs while executing that function: But the exact location is kind of variable and probably depends on when the gc decides to run. Always inside
So far I've tested 3.12.3 and 3.12.4 and it fails on both. I don't think I can easily test on 3.13 as some of our dependencies have extension modules that may or may not be compatible with 3.13, but I may be able to hack through the code to still have the offending code run with no dep. |
Apologies, my connection is slow at the moment so I've been trying to install it for the past hour. I'll see if I can reproduce it locally whenever it installs. Does |
not in itself, only via dependencies (e.g. pandas)
On another note, I've tried on the v3.12.4 tag with the following patch:
diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index 05e79915ba7..3228aed87cb 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -17,7 +17,7 @@ module _asyncio
 /*[clinic end generated code: output=da39a3ee5e6b4b0d input=8fd17862aa989c69]*/
-#define FI_FREELIST_MAXLEN 255
+#define FI_FREELIST_MAXLEN 0
 typedef struct futureiterobject futureiterobject;
and that "fixes" the issue. So it's quite clear that this free list either has a problem of its own or something its implementation depends on (e.g. the gc) has an unexpected behavior. |
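A small sketch of why setting the cap to 0 sidesteps the problem: with the `fi_freelist_len < FI_FREELIST_MAXLEN` check shown in the diffs, a cap of 0 means the "park it on the freelist" branch is never taken, so nothing half-dead is ever reachable from the list. The Python below is only a model of that branch; `BoundedFreelist`, `dealloc` and `released` are made-up names, not anything from CPython:

```python
class BoundedFreelist:
    """Toy model of a capped freelist; not CPython's implementation."""

    def __init__(self, maxlen):
        self.maxlen = maxlen
        self.items = []    # objects parked for later reuse
        self.released = 0  # objects handed straight back to the allocator

    def dealloc(self, obj):
        # Same test as `fi_freelist_len < FI_FREELIST_MAXLEN` in the diff.
        if len(self.items) < self.maxlen:
            self.items.append(obj)
        else:
            self.released += 1


default_cap = BoundedFreelist(maxlen=255)
zero_cap = BoundedFreelist(maxlen=0)
for obj in ("x", "y", "z"):
    default_cap.dealloc(obj)
    zero_cap.dealloc(obj)
```

With `maxlen=0` the list stays empty, which matches the observation that the patch hides the symptom without saying whether the freelist or something underneath it is at fault.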
Also, I think I just reproduced on |
Another thing I've observed is that cpython/Modules/_asynciomodule.c Line 3609 in 8e8a4ba
Is a |
That's odd, if something's reference count is zero, it's in freed memory. |
Yeah, I had a look at the implementation of I'll try to re-run with extra asserts regarding refcount. |
Storing it in a list increments the reference count. |
That makes sense, but then the object would be freed when the list gets destroyed, so it would be destroyed again every time it lands in such a list, wouldn't it? Also, the docs state:
If the refcount is 0, no one owns any strong reference to it, so this might trigger weird things. But I'm not super familiar with the details of Python GC. |
Yup, but it's very difficult to find the cause without a standalone reproducer :( |
Yes, that's very annoying. I tried hacking the code to remove the dependencies that were unused in the test path, and the issue disappears ... It also looks like v3.13 has changed the freelist code to use |
I wonder if the problem is with |
Let me try. I ran on 3.12.4 with this patch:
diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index 05e79915ba7..964f291eab5 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -1612,6 +1612,7 @@ FutureIter_dealloc(futureiterobject *it)
         state->fi_freelist_len++;
         it->future = (FutureObj*) state->fi_freelist;
         state->fi_freelist = it;
+        assert(Py_REFCNT(state->fi_freelist) > 0);
     }
     else {
         PyObject_GC_Del(it);
@@ -1821,6 +1822,7 @@ future_new_iter(PyObject *fut)
         state->fi_freelist_len--;
         it = state->fi_freelist;
         state->fi_freelist = (futureiterobject*) it->future;
+        assert(Py_REFCNT(state->fi_freelist) > 0);
         it->future = NULL;
         _Py_NewReference((PyObject*) it);
     }
and it triggered the assert on line 1615. So that means that
EDIT: it is expected that tp_dealloc is called with refcnt == 0: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc |
Oh right, I was being dumb. Does my patch work? |
So if I understand correctly:
|
still building it |
This might be the problem, being finalized twice causes a double free. Though, it's odd that you can't reproduce it. |
I think I found the issue:
Does that make sense? I'm now running your patch, let's see ... |
That makes sense, I'm just skeptical considering the segfault occurs only under very specific conditions -- I would think it would have been more reproducible if that's the problem. |
The patch does not seem to work, it gets stuck in module_traverse() the same way. |
That's wrong. A freelist only contains raw uninitialized bytes, not Python objects. |
Maybe this only triggers when:
EDIT: FutureIter is simply created by |
I'll defer to @vstinner, I don't know nearly enough about |
@ZeroIntensity at last, here is your reproducer:
import asyncio
import _asyncio
import gc

async def main():
    it = iter(asyncio.Future())
    del it
    xs = gc.get_referents(_asyncio)

asyncio.run(main())
asyncio.run() and the coroutine can be removed at the cost of getting a warning when creating
EDIT: I confirm that this reproduces the issue on 3.12 on any arch. There is no issue on 3.8, 3.9, 3.10 or 3.11. From staring at the sources, it looks like 3.13 should not be affected since c908d1f removed the |
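Because the failure shows up either as a hang or as an assert, it can be handy to run the reproducer in a child process with a timeout when comparing interpreters. The helper below is only a sketch; `classify_run` and `REPRODUCER` are hypothetical names, the 10 second default is an arbitrary choice, and the "crash" classification relies on the POSIX convention that a process killed by a signal gets a negative return code:

```python
import subprocess
import sys

# The reproducer from this thread, kept as source text so that it only
# ever runs in a child process.
REPRODUCER = """
import asyncio
import _asyncio
import gc

async def main():
    it = iter(asyncio.Future())
    del it
    xs = gc.get_referents(_asyncio)

asyncio.run(main())
"""


def classify_run(source, timeout=10.0, python=sys.executable):
    """Run `source` with `python -c` and report 'ok', 'error', 'crash' or 'hang'."""
    try:
        proc = subprocess.run(
            [python, "-c", source],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising.
        return "hang"
    if proc.returncode < 0:
        # Negative return code: terminated by a signal (e.g. SIGABRT, SIGSEGV).
        return "crash"
    return "ok" if proc.returncode == 0 else "error"
```

Calling `classify_run(REPRODUCER, python="/path/to/python3.12")` with the path to each interpreter under test (the path here is a placeholder) would then give one label per build.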
Thanks! It is odd that this only happens on 3.12, I'll try to write a patch. @Eclips4, this issue needs the |
For me, the root issue is that |
Would it make sense to add an assert in debug builds in Py_VISIT() to check for a non-zero refcnt? If the cost is too high for debug builds, maybe that could be done at least in gc.get_referents() to make sure we don't return anything half dead |
Agreeing, but wouldn't that be a breaking change for a patch release? My solution as of now is to store an extra strong reference to prevent the double-free (it's not a leak, since it still gets deallocated by |
It's not a breaking change but a bugfix. As you can see, the current code is wrong and causes undefined behavior when gc.get_referrers()/get_referents() is used. |
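For readers less familiar with the gc module, the reason a traverse function matters here is that `gc.get_referents()` hands back, as ordinary Python objects, whatever the argument's traverse function visits. A minimal illustration with a plain list (the variable names are arbitrary):

```python
import gc

leaf = object()
container = [leaf]

# gc.get_referents() returns the objects directly referred to by its
# arguments, i.e. the objects their traverse functions visit.
referents = gc.get_referents(container)
found = any(r is leaf for r in referents)
```

So if a module's traverse function visits something that is not a live object, that something ends up in the returned list and gets treated like any other reference.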
I've created #122834 as a fix, but Victor's way could be better. @douglas-raillard-arm does |
@ZeroIntensity no, it's completely accidental. I think the link between my root object and async stuff is another object that uses some asyncio closures, that themselves probably somehow refer to the asyncio module and then we land on that free list. Also, I'm going to replace that code that tries to fix up an object by just building it the right way directly. There is no way to pass any state to the yaml deserializer that builds it but I can probably just use a thread local or context variable to keep that state in my module. |
… with a freed `_asyncio.FutureIter` (python#122837) * Backport python#122834 for 3.13 (cherry picked from commit e8fb088)
Thanks @douglas-raillard-arm for the bug report, thanks @ZeroIntensity for the analysis and the fix! I'm closing the issue. The 3.12 backport will land soon. |
Crash report
What happened?
This is not a standalone reproducer as I'm still working on that, but the code triggering it is:
The issue was triggered on Gentoo on an arm64 machine. For some reason, this works fine on x86-64. I then managed to trigger it again under a different Python 3.12.4 interpreter compiled from the git repo with debug symbols to get a gdb backtrace:
The Python backtrace at this point was deeply recursed in update_refs().
CPython versions tested on:
3.12
Operating systems tested on:
Linux
Output from running 'python -VV' on the command line:
Python 3.12.4 (tags/v3.12.4:8e8a4baf652, Aug 2 2024, 18:22:31) [GCC 13.3.1 20240614]
Linked PRs
gc.get_referents with a freed _asyncio.FutureIter #122834
gc.get_referents with a freed _asyncio.FutureIter #122837
gc.get_referents with a freed _asyncio.FutureIter (#122837) #122859