Skip to content

asyncio: FutureIter_dealloc() crashes with negative refcount #122695

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
douglas-raillard-arm opened this issue Aug 5, 2024 · 43 comments
Closed
Labels
3.12 only security fixes topic-asyncio type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@douglas-raillard-arm
Copy link

douglas-raillard-arm commented Aug 5, 2024

Crash report

What happened?

This is not a standalone reproducer as I'm still working on that, but the code triggering it is:

    @classmethod
    def _get_referred_objs(cls, obj, predicate=lambda x: True):
        visited = set()
        objs = []

        def update_refs(obj):
            obj_id = id(obj)
            # Avoid cycles. Use the id() of the objects directly since the
            # inclusion check is orders of magnitude faster than checking for
            # inclusing on the object directly. It also handles well non hashable
            # objects and broken __eq__ implementations.
            if obj_id in visited:
                return
            else:
                visited.add(obj_id)
                # Filter-out weird objects that end up in the list and that can
                # trigger a coredump on the interpreter
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    has_class = hasattr(obj, '__class__')

                if has_class and predicate(obj):
                    objs.append(obj)

                for sub in gc.get_referents(obj):
                    update_refs(sub)

        update_refs(obj)
        return objs

The issue was triggered on Gentoo on an arm64 machine. For some reason, this works fine on x86-64. I then managed to trigger it again under a different Python 3.12.4 interpreter compiled from the git repo with debug symbols to get a gdb backtrace:

#6  0x0000aaaaaadfd4a0 in _PyObject_AssertFailed (obj=obj@entry=<_asyncio.FutureIter at remote 0xfffff210b3d0>, expr=expr@entry=0x0, 
    msg=msg@entry=0xaaaaaaea5c50 "object has negative ref count", file=<optimized out>, line=<optimized out>, 
    function=function@entry=0xaaaaaaed4928 <__func__.44.lto_priv.1> "_Py_NegativeRefcount") at Objects/object.c:2603
#7  0x0000aaaaaadfd550 in _Py_NegativeRefcount (filename=<optimized out>, lineno=<optimized out>, op=op@entry=<_asyncio.FutureIter at remote 0xfffff210b3d0>) at Objects/object.c:209
#8  0x0000fffff5df4bb0 in Py_DECREF (filename=filename@entry=0xfffff5dfba58 "./Modules/_asynciomodule.c", lineno=lineno@entry=1764, op=<_asyncio.FutureIter at remote 0xfffff210b3d0>)
    at ./Include/object.h:682
#9  0x0000fffff5df59a0 in FutureIter_clear (it=<optimized out>) at ./Modules/_asynciomodule.c:1764
#10 0x0000fffff5dfb57c in FutureIter_dealloc (it=0xfffff2109f50) at ./Modules/_asynciomodule.c:1601
#11 0x0000aaaaaabc9658 in _Py_Dealloc (op=op@entry=<_asyncio.FutureIter at remote 0xfffff2109f50>) at Objects/object.c:2625
#12 0x0000aaaaaabc9acc in Py_DECREF (filename=filename@entry=0xaaaaaae65f90 "./Include/object.h", lineno=lineno@entry=798, op=<_asyncio.FutureIter at remote 0xfffff2109f50>)
    at ./Include/object.h:690
#13 0x0000aaaaaabc9a68 in Py_XDECREF (op=<optimized out>) at ./Include/object.h:798
#14 0x0000aaaaaabc9780 in list_dealloc (op=0xffffdd2bc5a0) at Objects/listobject.c:356
#15 0x0000aaaaaabc9658 in _Py_Dealloc (

The Python backtrace at this point was deeply recursed in update_refs().

CPython versions tested on:

3.12

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.12.4 (tags/v3.12.4:8e8a4baf652, Aug 2 2024, 18:22:31) [GCC 13.3.1 20240614]

Linked PRs

@douglas-raillard-arm douglas-raillard-arm added the type-crash A hard crash of the interpreter, possibly with a core dump label Aug 5, 2024
@ZeroIntensity
Copy link
Member

I can take a look at this whenever you get a reproducer.

@vstinner
Copy link
Member

vstinner commented Aug 5, 2024

Recent change related to FutureIter_dealloc(): commit 23192ab. Python 3.12.4 contains this change.

Do you reproduce this issue with Python 3.12.3?

@douglas-raillard-arm
Copy link
Author

I'll give it a go on 3.12.3. On 3.12.4, I hit another symptom of the issue this weekend: instead of crashing on the assert, it sometimes throws the code in an infinite loop (or at least it takes more than ~20h to do a collection). It's not clear if the loop is coming from gc.get_referents() returning some sort of infinite stream of value or if the GC itself is just stuck. Here is the gdb backtrace of the process interrupted at a random point:

#0  0x0000aaaaaabced44 in _PyMem_IsPtrFreed (ptr=<optimized out>) at ./Include/internal/pycore_pymem.h:83
#1  0x0000aaaaaabceccc in _PyObject_IsFreed (op=op@entry=<_asyncio.FutureIter at remote 0xffff369d7b10>) at ./Include/object.h:220
#2  0x0000aaaaaac0c428 in visit_decref (op=<_asyncio.FutureIter at remote 0xffff369d7b10>, parent=0xfffff6427710) at Modules/gcmodule.c:463
#3  0x0000fffff5df4d84 in module_traverse (mod=<optimized out>, visit=0xaaaaaac0c410 <visit_decref>, arg=0xfffff6427710) at ./Modules/_asynciomodule.c:3610
#4  0x0000aaaaaabd600c in module_traverse (m=0xfffff6427710, visit=0xaaaaaac0c410 <visit_decref>, arg=0xfffff6427710) at Objects/moduleobject.c:874
#5  0x0000aaaaaac0b9b0 in subtract_refs (containers=containers@entry=0xaaaaab1a31e8 <_PyRuntime+92624>) at Modules/gcmodule.c:491
#6  0x0000aaaaaac0ad68 in deduce_unreachable (base=base@entry=0xaaaaab1a31e8 <_PyRuntime+92624>, unreachable=unreachable@entry=0xffffffffc1e8) at Modules/gcmodule.c:1116
#7  0x0000aaaaaac091a4 in gc_collect_main (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>, generation=generation@entry=2, n_collected=n_collected@entry=0xffffffffc270, 
    n_uncollectable=n_uncollectable@entry=0xffffffffc268, nofail=nofail@entry=0) at Modules/gcmodule.c:1242
#8  0x0000aaaaaac07e94 in gc_collect_with_callback (tstate=0xaaaaab200a70 <_PyRuntime+475736>, generation=2) at Modules/gcmodule.c:1426
#9  0x0000aaaaaac07e18 in gc_collect_generations (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>) at Modules/gcmodule.c:1481
#10 0x0000aaaaaac07c94 in _Py_RunGC (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>) at Modules/gcmodule.c:2295
#11 0x0000aaaaaac077e0 in _Py_HandlePending (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>) at Python/ceval_gil.c:1045
#12 0x0000aaaaaabf1eac in _PyEval_EvalFrameDefault (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>, frame=0xfffff74e9930, throwflag=throwflag@entry=0) at Python/ceval.c:834
#13 0x0000aaaaaabf1910 in _PyEval_EvalFrame (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>, frame=<optimized out>, throwflag=throwflag@entry=0)
    at ./Include/internal/pycore_ceval.h:89
#14 0x0000aaaaaabee828 in _PyEval_Vector (tstate=0xaaaaab200a70 <_PyRuntime+475736>, func=0xffff369baf90, locals=locals@entry=0x0, args=0xffffffffc550, argcount=1, kwnames=0x0)
    at Python/ceval.c:1683

@douglas-raillard-arm
Copy link
Author

Now trying with Python 3.12.3 (tags/v3.12.3:f6650f9ad73, Aug 6 2024, 10:09:17) [GCC 13.3.1 20240614] and I get the same "stuck in garbage collector" symptom (I stopped and continued 10 times with random delays and I always land on a backtrace starting with Garbage-collecting):

Traceback (most recent call first):
  Garbage-collecting
  File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
  File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
  File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
  File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
  File "/home/dourai01/cpython/Lib/abc.py", line 119, in __instancecheck__
    return _abc_instancecheck(cls, instance)
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1168, in predicate
    return isinstance(x, TestBundleBase)
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1144, in update_refs
    if has_class and predicate(obj):
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1148, in update_refs
    update_refs(sub)
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1148, in update_refs
    update_refs(sub)
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1148, in update_refs

[...]

The C backtrace looks like that:

#0  0x0000aaaaaabce7d8 in _PyMem_IsPtrFreed (ptr=ptr@entry=0xffff368a7490) at ./Include/internal/pycore_pymem.h:83
#1  0x0000aaaaaabce754 in _PyObject_IsFreed (op=op@entry=<_asyncio.FutureIter at remote 0xffff368a7490>) at Objects/object.c:468
#2  0x0000aaaaaac0be50 in visit_decref (op=<_asyncio.FutureIter at remote 0xffff368a7490>, parent=0xfffff64cdcd0) at Modules/gcmodule.c:463
#3  0x0000fffff5d54d84 in module_traverse (mod=<optimized out>, visit=0xaaaaaac0be38 <visit_decref>, arg=0xfffff64cdcd0) at ./Modules/_asynciomodule.c:3596
#4  0x0000aaaaaabd5a8c in module_traverse (m=0xfffff64cdcd0, visit=0xaaaaaac0be38 <visit_decref>, arg=0xfffff64cdcd0) at Objects/moduleobject.c:874
#5  0x0000aaaaaac0b3d8 in subtract_refs (containers=containers@entry=0xaaaaab1a2738 <_PyRuntime+92576>) at Modules/gcmodule.c:491
#6  0x0000aaaaaac0a790 in deduce_unreachable (base=base@entry=0xaaaaab1a2738 <_PyRuntime+92576>, unreachable=unreachable@entry=0xffffffffaeb8) at Modules/gcmodule.c:1116
#7  0x0000aaaaaac08bcc in gc_collect_main (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, generation=generation@entry=2, n_collected=n_collected@entry=0xffffffffaf40, 
    n_uncollectable=n_uncollectable@entry=0xffffffffaf38, nofail=nofail@entry=0) at Modules/gcmodule.c:1242
#8  0x0000aaaaaac078bc in gc_collect_with_callback (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, generation=2) at Modules/gcmodule.c:1426
#9  0x0000aaaaaac07840 in gc_collect_generations (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>) at Modules/gcmodule.c:1481
#10 0x0000aaaaaac076bc in _Py_RunGC (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>) at Modules/gcmodule.c:2295
#11 0x0000aaaaaac07208 in _Py_HandlePending (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>) at Python/ceval_gil.c:1045
#12 0x0000aaaaaabf1a6c in _PyEval_EvalFrameDefault (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, frame=0xfffff7f0de70, throwflag=throwflag@entry=0) at Python/ceval.c:834
#13 0x0000aaaaaabf1378 in _PyEval_EvalFrame (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, frame=<optimized out>, throwflag=throwflag@entry=0)
    at ./Include/internal/pycore_ceval.h:89
#14 0x0000aaaaaabee290 in _PyEval_Vector (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, func=func@entry=0xfffff7839c10, locals=locals@entry=0x0, args=args@entry=0xffffffffb2e8, 
    argcount=argcount@entry=2, kwnames=kwnames@entry=0x0) at Python/ceval.c:1683
#15 0x0000aaaaaac4663c in _PyFunction_Vectorcall (func=func@entry=<function at remote 0xfffff7839c10>, stack=stack@entry=0xffffffffb2e8, nargsf=nargsf@entry=2, kwnames=kwnames@entry=0x0)
    at Objects/call.c:419
#16 0x0000aaaaaac8ce2c in _PyObject_VectorcallTstate (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, callable=callable@entry=<function at remote 0xfffff7839c10>, 
    args=args@entry=0xffffffffb2e8, nargsf=nargsf@entry=2, kwnames=kwnames@entry=0x0) at ./Include/internal/pycore_call.h:92
#17 0x0000aaaaaac8c134 in method_vectorcall (method=<optimized out>, args=0xffffffffb2f0, nargsf=9223372036854775809, kwnames=0x0) at Objects/classobject.c:61
#18 0x0000aaaaaabdcea8 in _PyObject_VectorcallTstate (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, callable=callable@entry=<method at remote 0xffffcdcfdc10>, args=args@entry=0xffffffffb2f0, 
    nargsf=nargsf@entry=9223372036854775809, kwnames=kwnames@entry=0x0) at ./Include/internal/pycore_call.h:92
#19 0x0000aaaaaac4c2b4 in PyObject_CallOneArg (func=func@entry=<method at remote 0xffffcdcfdc10>, arg=arg@entry=<type at remote 0xaaaaab0ea7d0>) at Objects/call.c:401
#20 0x0000aaaaaac7fd1c in object_issubclass (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, derived=derived@entry=<type at remote 0xaaaaab0ea7d0>, 
    cls=cls@entry=<TestBundleMeta(__module__='lisa.tests.base', __doc__='Dummy class used as a base class for all tests.\n\n.. warning:: Arbitrary code can be executed while loading an instance from a YAML or Pickle file. To include untrusted data in YAML, use the !untrusted tag along with a string\n\n.. note:: As a subclass of :class:`lisa.tests.base.TestBundleBase`, this class is considered as "application" and its API is therefore more subject to change than other parts of :mod:`lisa`.', check_from_target=<classmethod at remote 0xffff36a61b80>, FTRACE_CONF=<FtraceConf(_key_desc_path=[], _src_prio=['TestBundle(required)', 'TestBundle'], _src_override={}, _key_map={'buffer-size': {'TestBundle': 10240}, 'events': {'TestBundle': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab440>, 'TestBundle(required)': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab990>}, 'events-namespaces': {'TestBundle': [], 'TestBundle(required)': []}, 'sav...(truncated)) at Objects/abstract.c:2736
#21 0x0000aaaaaac7fae0 in PyObject_IsSubclass (derived=derived@entry=<type at remote 0xaaaaab0ea7d0>, 
    cls=cls@entry=<TestBundleMeta(__module__='lisa.tests.base', __doc__='Dummy class used as a base class for all tests.\n\n.. warning:: Arbitrary code can be executed while loading an instance from a YAML or Pickle file. To include untrusted data in YAML, use the !untrusted tag along with a string\n\n.. note:: As a subclass of :class:`lisa.tests.base.TestBundleBase`, this class is considered as "application" and its API is therefore more subject to change than other parts of :mod:`lisa`.', check_from_target=<classmethod at remote 0xffff36a61b80>, FTRACE_CONF=<FtraceConf(_key_desc_path=[], _src_prio=['TestBundle(required)', 'TestBundle'], _src_override={}, _key_map={'buffer-size': {'TestBundle': 10240}, 'events': {'TestBundle': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab440>, 'TestBundle(required)': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab990>}, 'events-namespaces': {'TestBundle': [], 'TestBundle(required)': []}, 'sav...(truncated)) at Objects/abstract.c:2758
#22 0x0000aaaaaacbd030 in _abc__abc_subclasscheck_impl (module=module@entry=<module at remote 0xfffff7849190>, self=<optimized out>, subclass=<type at remote 0xaaaaab0ea7d0>)
    at ./Modules/_abc.c:779
#23 0x0000aaaaaacbcbf0 in _abc__abc_subclasscheck (module=<module at remote 0xfffff7849190>, args=args@entry=0xfffff7f0de60, nargs=nargs@entry=2) at ./Modules/clinic/_abc.c.h:141

@vstinner
Copy link
Member

vstinner commented Aug 6, 2024

This is not a standalone reproducer as I'm still working on that

Obviously, the debug will be way easier with a reproducer :-)

@douglas-raillard-arm
Copy link
Author

douglas-raillard-arm commented Aug 6, 2024

I've spent some time trying to get a minimal reproducer and could not get it to work. It's possible this is related to running under pytest for some reason. As it stands, the only reproducer I can offer is:

  1. install lisa from the git repo: https://gitlab.arm.com/tooling/lisa
git clone https://gitlab.arm.com/tooling/lisa
cd lisa
# A few packages need to be installed, like python3 or kernelshark. Python
# modules will be installed in a venv at the next step, without touching
# any system-wide install location.
./install_base.sh --install-all

# LISA_PYTHON value cannot be a path apparently, otherwise venv gets confused. Add the relevant folder to PATH first and set LISA_PYTHON to the binary filename.
export LISA_PYTHON=<name of your python binary>

# On the first run, it will take care of creating a Python venv and populating it. This will take LISA_PYTHON into account when creating the venv
source init_env
  1. Run the offending test on an arm64 platform: gdb $LISA_PYTHON -ex 'run -m pytest -s tests/test_test_bundle.py'

EDIT: remove -ex quit

@douglas-raillard-arm
Copy link
Author

douglas-raillard-arm commented Aug 6, 2024

A bit of progress with gdb on v3.12.3 : the infinite loop is here:
((futureiterobject*) current)->future is set to current, so it keeps going around without making any progress.

EDIT: It looks like module_free_freelists would cause a refcount issue if current->future == current since it calls PyObject_GC_Del(current); and then will try with current->future at the next iteration. That would try to garbage collect the same object multiple times.

@douglas-raillard-arm
Copy link
Author

I added some asserts on v3.12.3 tag:

diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index a465090bfaa..9b47661b1fc 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -1597,6 +1597,7 @@ FutureIter_dealloc(futureiterobject *it)
     if (state->fi_freelist_len < FI_FREELIST_MAXLEN) {
         state->fi_freelist_len++;
         it->future = (FutureObj*) state->fi_freelist;
+        assert((void*)it->future != (void*)it);
         state->fi_freelist = it;
     }
     else {
@@ -1818,6 +1819,7 @@ future_new_iter(PyObject *fut)
     }
 
     it->future = (FutureObj*)Py_NewRef(fut);
+    assert((void*)it->future != (void*)it);
     PyObject_GC_Track(it);
     return (PyObject*)it;
 }
@@ -3595,6 +3597,7 @@ module_traverse(PyObject *mod, visitproc visit, void *arg)
         PyObject *current = next;
         Py_VISIT(current);
         next = (PyObject*) ((futureiterobject*) current)->future;
+        assert((void*)next != (void*)current);
     }
     return 0;
 }

And this one failed to pass:

static void 
FutureIter_dealloc(futureiterobject *it) 
{ 
    PyTypeObject *tp = Py_TYPE(it); 
    asyncio_state *state = get_asyncio_state_by_def((PyObject *)it); 
    PyObject_GC_UnTrack(it); 
    tp->tp_clear((PyObject *)it); 
 
    if (state->fi_freelist_len < FI_FREELIST_MAXLEN) { 
        state->fi_freelist_len++; 
        it->future = (FutureObj*) state->fi_freelist; 
        assert((void*)it->future != (void*)it); // <------- ADDED THIS ASSERT
        state->fi_freelist = it; 
    } 
    else { 
        PyObject_GC_Del(it); 
        Py_DECREF(tp); 
    } 
} 

@ZeroIntensity
Copy link
Member

At a very speculative glance, I would guess that the infinite loop problem originates from visited and obj, or possibly the context returned by warnings.catch_warnings, being tracked by the GC. It's difficult to tell without a full reproducer, though.

@douglas-raillard-arm
Copy link
Author

@ZeroIntensity unfortunately I could not create a reproducer beyond the existing test suite, but I'm happy to apply a patch on cpython and run with that to help diagnose further, provide coredumps etc.

What is quite strange that this only triggers an issue on arm64 where there seemingly is nothing arch-specific in that code I could see. That test has been running on x86-64 since ~2018 without problems (obviously not on Python 3.12 all this time but still).

@ZeroIntensity
Copy link
Member

If this only happens on ARM, then I think this might be an actual bug with the garbage collector. Since we don't have a standalone reproducer, you'll have to help me here a bit:

  • Is there a repository where you're running this? If we could run the test suite and reproduce the bug locally, that's still better than no reproducer.
  • What line specifically causes this? (as in, if you comment things out, commenting out which line(s) prevents the segfault/infinite loop?)
  • Does this happen on other CPython versions (main branch, 3.13, 3.11, etc)

@douglas-raillard-arm
Copy link
Author

If this only happens on ARM, then I think this might be an actual bug with the garbage collector.

That was my initial guess, but then the issue seemed to be a logic issue in Modules/_asynciomodule.c . It's hard to know for sure though as it could just be an invariant the gc is supposed to provide that is broken.

Is there a repository where you're running this? If we could run the test suite and reproduce the bug locally, that's still better than no reproducer.

Yes, see #122695 (comment) . You may be able to skip the install_base.sh part and either just source init_env or create your own venv and pip install -r devmode_requirements.txt .

The issue manifests as either the test hanging or sometimes (rarely) triggering a negative refcount check assert.

What line specifically causes this? (as in, if you comment things out, commenting out which line(s) prevents the segfault/infinite loop?)

The gc hangs while executing that function:
https://gitlab.arm.com/tooling/lisa/-/blob/main/lisa/tests/base.py?ref_type=heads#L1128
Sometimes it hangs inside predicate(obj) that is defined here as can be seen on this Python backtrace: #122695 (comment)

But the exact location is kind of variable and probably depends on when the gc decides to run. Always inside update_ref() though.

Does this happen on other CPython versions (main branch, 3.13, 3.11, etc)

So far I've tested 3.12.3 and 3.12.4 and it fails on both. I don't think I can easily test on 3.13 as some of our dependencies have extension modules that may or may not be compatible with 3.13, but I may be able to hack through the code to still have the offending code run with no dep.

@ZeroIntensity
Copy link
Member

Apologies, my connection is slow at the moment so I've been trying to install it for the past hour. I'll see if I can reproduce it locally whenever it installs. Does lisa contain any C extensions? (If so, this might be user-error.)

@douglas-raillard-arm
Copy link
Author

Does lisa contain any C extensions? (If so, this might be user-error.)

not in itself, only via dependencies (e.g. pandas)

On another note, I've tried on v3.12.4 tag with the following patch:

diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index 05e79915ba7..3228aed87cb 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -17,7 +17,7 @@ module _asyncio
 /*[clinic end generated code: output=da39a3ee5e6b4b0d input=8fd17862aa989c69]*/
 
 
-#define FI_FREELIST_MAXLEN 255
+#define FI_FREELIST_MAXLEN 0
 
 typedef struct futureiterobject futureiterobject;

and that "fixes" the issue. So it's quite clear that this free list either has a problem on its own or something its implementation depends on (e.g. the gc) has an unexpected behavior.

@douglas-raillard-arm
Copy link
Author

Also, I think I just reproduced on Python 3.12.4 (main, Jun 8 2024, 18:29:57) [GCC 9.4.0] on x86-64, so it may not be arch specific after all.

@douglas-raillard-arm
Copy link
Author

Another thing I've observed is that Py_REFCNT(current) == 0 in an infinite loop case where current->future == current:

PyObject *current = next;

Is a m_traverse implementation supposed to call Py_VISIT() on an object with a refcount == 0 ?

@ZeroIntensity
Copy link
Member

That's odd, if something's reference count is zero, it's in freed memory.

@douglas-raillard-arm
Copy link
Author

Yeah, I had a look at the implementation of gc_get_referents() and it just stores everything visited in a list, no questions asked on refcounts. The tp_traverse() implementation for modules also just delegates to m_traverse() with just an extra condition on the module object itself to avoid an issue at module init, irrelevant here I think.

I'll try to re-run with extra asserts regarding refcount.

@ZeroIntensity
Copy link
Member

stores everything visited in a list, no questions asked on refcounts

Storing it in a list increments the reference count.

@douglas-raillard-arm
Copy link
Author

That makes sense, but then it would free it when the list gets destroyed, leading to destroying the object again and again every time it lands in such list isn't it ? Also the doc states:

When implementing tp_traverse, only the members that the instance owns (by having strong references to them) must be visited

If the refcount is 0, no one owns any strong reference to it, so this might trigger weird things. But I'm not super familiar with the details of Python GC.

@ZeroIntensity
Copy link
Member

Yup, but it's very difficult to find the cause without a standalone reproducer :(

@douglas-raillard-arm
Copy link
Author

Yes, that's very annoying. I tried hacking the code to remove the dependencies that were unused in the test path, and the issue disappears ...

It also looks like v3.13 has changed the freelist code to use _Py_FREELIST_* macros instead in c908d1f. The change seems to be motivated by supporting free-threaded builds.

@ZeroIntensity
Copy link
Member

I wonder if the problem is with FutureObj_clear on 3.12. Does this branch fix it?

@douglas-raillard-arm
Copy link
Author

douglas-raillard-arm commented Aug 8, 2024

Let me try. I've ran on 3.12.4 with that patch:

diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index 05e79915ba7..964f291eab5 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -1612,6 +1612,7 @@ FutureIter_dealloc(futureiterobject *it)
         state->fi_freelist_len++;
         it->future = (FutureObj*) state->fi_freelist;
         state->fi_freelist = it;
+       assert(Py_REFCNT(state->fi_freelist) > 0);
     }
     else {
         PyObject_GC_Del(it);
@@ -1821,6 +1822,7 @@ future_new_iter(PyObject *fut)
         state->fi_freelist_len--;
         it = state->fi_freelist;
         state->fi_freelist = (futureiterobject*) it->future;
+       assert(Py_REFCNT(state->fi_freelist) > 0);
         it->future = NULL;
         _Py_NewReference((PyObject*) it);
     }

and it triggered the assert line 1615

So that means that FutureIter_dealloc() is called on a dead object, or that code somewhat destroys it:

    PyObject_GC_UnTrack(it);
    tp->tp_clear((PyObject *)it);

EDIT: it is expected that tp_dealloc is called with refcnt == 0: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc

@ZeroIntensity
Copy link
Member

it is expected that tp_dealloc is called with refcnt == 0: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc

Oh right, I was being dumb. Does my patch work?

@douglas-raillard-arm
Copy link
Author

So if I understand correctly:

  1. the freelist keeps objects with refcount == 0 by design
  2. However, it Py_VISIT() them in the module's tp_traverse()
  3. So that means that gc.get_referents() will add them to the list. Once that list is destroyed, their refcount will drop back to 0 again. Since this can happen an arbitrary number of time, I suppose they will be finalized multiple times ?
  4. The freelist seems to be a linked list using the futureiter->future pointer to point to the previous item. The module_traverse() function explicitly traverses the list. At the same time, FutureIter_traverse() also visits futureiter->future. Is that an issue ?

@douglas-raillard-arm
Copy link
Author

Oh right, I was being dumb. Does my patch work?

still building it

@ZeroIntensity
Copy link
Member

3. So that means that gc.get_referents() will add them to the list.

This might be the problem, being finalized twice causes a double free. Though, it's odd that you can't reproduce it.

@douglas-raillard-arm
Copy link
Author

douglas-raillard-arm commented Aug 8, 2024

I think I found the issue:

  1. module_traverse() visits the freelist
  2. gc.get_referents() adds the references to a list. Since those objects had refcount == 0, the refcount is now 1.
  3. the user code drops the list. The refcount of the items goes back to 0, thereby calling FutureIter_dealloc() on each of them.
  4. As long as the freelist is small enough, FutureIter_dealloc() will add the destroyed futureiter add the front of the list, and store the previous freelist item in its ->future attribute.
  5. Since we end up running that dealloc function every time the refcount drops to 0, this can happen multiple times via the gc.get_referents() path. The objects for which that happens are the objects stored in that freelist. In some cases, the object that will be re-destroyed happens to also be at the top of the freelist. In that case, FutureIter_dealloc() will put it again at the front of the list. By doing so, it creates the infinite cycle where it->future == future.

Does that make sense ? I'm now running your patch, let's see ...

@ZeroIntensity
Copy link
Member

That makes sense, I'm just skeptical considering the segfault occurs only under very specific conditions -- I would think it would have been more reproducible if that's the problem.

@douglas-raillard-arm
Copy link
Author

The patch does not seem to work, it gets stuck in module_traverse() the same way.

@vstinner
Copy link
Member

vstinner commented Aug 8, 2024

module_traverse() visits the freelist

That's wrong. A freelist only contains raw uninitialized bytes, not Python objects.

@douglas-raillard-arm
Copy link
Author

douglas-raillard-arm commented Aug 8, 2024

I would think it would have been more reproducible if that's the problem.

Maybe this only triggers when:

  1. there is something in that freelist
  2. on 3.12
    Now that I think about it, the only place where we have used 3.12 were user machines, nowhere in the CI yet. I'll try to make a reproducer with that in mind. I suppose a futureiter is created every time you await on a coroutine somehow ?

EDIT: FutureIter is simply created by iter(asyncio.Future())

@ZeroIntensity
Copy link
Member

I'll defer to @vstinner, I don't know nearly enough about FutureIter (or the freelist, for that matter) to answer that.

@douglas-raillard-arm
Copy link
Author

douglas-raillard-arm commented Aug 8, 2024

@ZeroIntensity at last, here is your reproducer:

import asyncio
import _asyncio
import gc

async def main():

    it = iter(asyncio.Future())
    del it
    xs = gc.get_referents(_asyncio)

asyncio.run(main())

asyncio.run() and the coroutine can be removed at the cost of getting a warning when creating asyncio.Future()

EDIT: I confirm that this reproduces the issue on 3.12 on any arch. There is no issue on 3.8, 3.9, 3.10 nor 3.11. From staring at the sources, it looks like 3.13 should not be affected since c908d1f removed the Py_VISIT() on the freelist in module_traverse().

@ZeroIntensity
Copy link
Member

Thanks! It is odd that this only happens on 3.12, I'll try to write a patch.

@Eclips4, this issue needs the topic-asyncio and 3.12 label.

@github-project-automation github-project-automation bot moved this to Todo in asyncio Aug 8, 2024
@Eclips4 Eclips4 added the 3.12 only security fixes label Aug 8, 2024
@vstinner
Copy link
Member

vstinner commented Aug 8, 2024

For me, the root issue is that _asyncio module_traverse() function visits the free list. It doesn't make sense to me and it should be removed.

cc @kumaraditya303

@vstinner vstinner changed the title FutureIter_dealloc() crashes with negative refcount asyncio: FutureIter_dealloc() crashes with negative refcount Aug 8, 2024
@douglas-raillard-arm
Copy link
Author

Would it make sense to add an asset in debug builds in Py_VISIT() to check for non-zero refcnt ? If the cost is too high for debug builds, maybe that could be done at least in gc.get_referents() to make sure we don't return anything half dead

@ZeroIntensity
Copy link
Member

For me, the root issue is that _asyncio module_traverse() function visits the free list

Agreeing, but wouldn't that be a breaking change for a patch release? My solution as of now is to store an extra strong reference to prevent the double-free (it's not a leak, since it still gets deallocated by PyObject_GC_Del).

@vstinner
Copy link
Member

vstinner commented Aug 8, 2024

Agreeing, but wouldn't that be a breaking change for a patch release?

It's not a breaking change but a bugfix. As you can see, the current code is wrong and causes undefined behavior when gc.get_referrers()/get_referents() is used.

@ZeroIntensity
Copy link
Member

I've created #122834 as a fix, but Victor's way could be better. @douglas-raillard-arm does lisa rely on those FutureIter objects being returned by get_referents?

@douglas-raillard-arm
Copy link
Author

@ZeroIntensity no, it's completely accidental. I think the link between my root object and async stuff is another object that uses some asyncio closures, that themselves probably somehow refer to the asyncio module and then we land on that free list.

Also, I'm going to replace that code that tries to fix up an object by just building it the right way directly. There is no way to pass any state to the yaml deserializer that builds it but I can probably just use a thread local or context variable to keep that state in my module.

kumaraditya303 pushed a commit that referenced this issue Aug 9, 2024
…a freed `_asyncio.FutureIter` (#122837)

* Backport #122834 for 3.13
vstinner pushed a commit to vstinner/cpython that referenced this issue Aug 9, 2024
… with a freed `_asyncio.FutureIter` (python#122837)

* Backport python#122834 for 3.13

(cherry picked from commit e8fb088)
@vstinner
Copy link
Member

vstinner commented Aug 9, 2024

Thanks @douglas-raillard-arm for the bug report, thanks @ZeroIntensity for the analysis and the fix! I close the issue. The 3.12 backport will land soon.

@vstinner vstinner closed this as completed Aug 9, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in asyncio Aug 9, 2024
vstinner added a commit that referenced this issue Aug 9, 2024
…a freed `_asyncio.FutureIter` (#122837) (#122859)

[3.13] gh-122695: Fix double-free when using `gc.get_referents` with a freed `_asyncio.FutureIter` (#122837)

* Backport #122834 for 3.13

(cherry picked from commit e8fb088)

Co-authored-by: Peter Bierma <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 only security fixes topic-asyncio type-crash A hard crash of the interpreter, possibly with a core dump
Projects
Status: Done
Development

No branches or pull requests

4 participants