Skip to content

Conversation

@Yhg1s
Copy link

@Yhg1s Yhg1s commented Oct 27, 2025

Add support for free-threaded Python (PEP 703).

The significant change here is the use of thread local instead of a volatile global for the switching_thread_state global (which is otherwise protected by the GIL). There's some overhead to using a thread local, so only do this in the free-threaded build.

The only other two bits of shared mutable data are G_TOTAL_MAIN_GREENLETS and ThreadState::clocks_used_during_gc. Modify the latter to use a std::atomic with relaxed memory order, which should be good enough, and performance probably matters for those updates.

For G_TOTAL_MAIN_GREENLETS, switch to a std::atomic without changing the inc/dec operations (which means they use sequential consistency), because they're rare enough that performance doesn't really matter.

Also mark the main extension modules and the two test extensions as supporting free-threading (without switching to multi-phase init). The GIL will still temporarily be enabled during module import, but that probably won't matter (modules are usually imported before starting threads). If it does, switching to multi-phase init is always an option.

The existing test suite cover threads extensively enough that no extra tests are necessary. There is an intermittent failure (<0.2% of runs) that shows up when running the testsuite in a tight loop, but this happens in regular Python builds (and before 3.14) too. ThreadSanitizer can't be used on greenlet, from what I can tell because of how it gets confused by the stack switching. This is the case for GILful Python builds as well.

Yhg1s added 2 commits October 27, 2025 14:34
The significant change here is the use of thread local instead of a volatile
global for the switching_thread_state global (which is otherwise protected
by the GIL). There's some overhead to using a thread local, so only do this
in the free-threaded build.

The only other two bits of shared mutable data are `G_TOTAL_MAIN_GREENLETS
and ThreadState::clocks_used_during_gc. Modify the latter to use a
std::atomic with relaxed memory order, which should be good enough, and
performance probably matters for those updates.

For G_MAIN_TOTAL_GREENLETS, switch to a std::atomic without changing the
inc/dec operations (which means they use sequential consistency), because
they're rare enough that performance doesn't really matter.

Also mark the main extension modules and the two test extensions as
supporting free-threading (without switching to multi-phase init). The GIL
will still temporarily be enabled during module import, but that probably
won't matter (modules are usually imported before starting threads). If it
does, switching to multi-phase init is always an option.

The existing test suite cover threads extensively enough that no extra tests
are necessary. There is an intermittent failure (<0.2% of runs) that shows
up when running the testsuite in a tight loop, but this happens in regular
Python builds (and before 3.14) too. ThreadSanitizer can't be used on
greenlet, from what I can tell because of how it gets confused by the stack
switching. This is the case for GILful Python builds as well.
@jamadden
Copy link
Contributor

Thanks, this is a great start! I've picked it up to take it across the finish line. There are still some things that clearly need done (e.g., actually enabling the allocator the correct way) and, unfortunately, some hard interpreter crashes to finish debugging.

@Yhg1s
Copy link
Author

Yhg1s commented Nov 24, 2025

Feel free to let me know if you want any help debugging the crashes. I've seen one on CI that I haven't been able to locally reproduce, but I do have some experience with both free-threading and the CPython internals.

@jamadden
Copy link
Contributor

The fun one right now is a crash clearing module globals on shutdown because an object (set_trace) still in a module (glob) dict has been deallocated (it's type pointer is 0xDDD...); the crash fixes itself if thread-local byte code is disabled. It only shows up when running the doctests from sphinx, and the only doctest needed to produce it is a very small portion of greenlet_gc.rst (that is, you can disable all other doctests except this one and it crashes). Running the same snippet as a regular unit test doesn't trigger the crash.

A more complete backtrace with debugging symbols (cpython commit af586d8d2601b5fe52277ba7bf5d9e1ff93ffbb6, built with assertions enabled):

  * frame #0: 0x00000001000eb0b0 python`_Py_Dealloc(op=0x0000039749036f10) at object.c:3047:32 [opt]
    frame #1: 0x00000001000eb21c python`_Py_MergeZeroLocalRefcount(op=<unavailable>) at object.c:0 [opt] [artificial]
    frame #2: 0x00000001000c9928 python`Py_DECREF(op=<unavailable>) at refcount.h:375:13 [opt] [inlined]
    frame #3: 0x00000001000c98fc python`Py_XDECREF(op=<unavailable>) at refcount.h:514:9 [opt] [inlined]
    frame #4: 0x00000001000c98f8 python`insertdict(interp=<unavailable>, mp=0x0000039748b1ac30, key=0x0000039748b2c250, hash=<unavailable>, value=0x00000001004eef60) at dictobject.c:1876:5 [opt]
    frame #5: 0x00000001000c8fa8 python`setitem_take2_lock_held(mp=<unavailable>, key=<unavailable>, value=<unavailable>) at dictobject.c:2627:12 [opt] [artificial]
    frame #6: 0x00000001000c91dc python`_PyDict_SetItem_Take2(mp=0x0000039748b1ac30, key=<unavailable>, value=<unavailable>) at dictobject.c:2635:11 [opt]
    frame #7: 0x00000001000c9170 python`PyDict_SetItem(op=0x0000039748b1ac30, key=<unavailable>, value=<unavailable>) at dictobject.c:2655:12 [opt]
    frame #8: 0x00000001000e8324 python`_PyModule_ClearDict(d=0x0000039748b1ac30) at moduleobject.c:780:21 [opt]
    frame #9: 0x00000001000e815c python`_PyModule_Clear(m=<unavailable>) at moduleobject.c:727:9 [opt] [artificial]
    frame #10: 0x000000010027bd68 python`finalize_modules_clear_weaklist(interp=0x00000001005314c0, weaklist=0x000003974671eb10, verbose=3) at pylifecycle.c:1673:9 [opt]

Usually something like this means I'm not switching something correctly but I just started debugging and haven't found it yet.

@jamadden
Copy link
Contributor

jamadden commented Nov 24, 2025

Running only this doctest from sphinx causes the crash:

==================================
 Garbage Collection and greenlets
==================================

.. doctest::

   >>> from greenlet import getcurrent, greenlet, GreenletExit

.. doctest::

   >>> import gc
   >>> glet = greenlet(gc.collect)
   >>> _ = glet.switch()
  • Using two doctest snippets is critical.
  • When we call gc.collect, the garbage collector determines that pdb.set_trace is unreachable garbage and frees it. This is the root cause, it obviously shouldn't be collected.
  • This is despite the fact that sys.getrefcount reports a value of 2 for set_trace when called before collecting.
  • If I modify the sphinx_build script itself to hold a reference to set_trace, everything works. (That same sys.getrefcount still reports a value of 2 when I would have expected it to now be 3. But I know those numbers aren't as reliable as they used to be because of deferred refcounting for objects on the stack.)
  • If I modify sphinx_build to print(pdb.set_trace) before exiting, the crash happens there, so it's not directly related to interpreter cleanup.
  • Modifying greenlet's tp_traverse to look at the stashed values of c_stack_refs changes nothing --- because c_stack_refs is NULL in the greenlets that exist at this point.
  • Again, disabling the TLBC solves the problem. So some interaction with that is causing us to "lose" the reference to pdb.set_trace.
  • Adding __getattr__ to pdb.py, thus disabling specialization of attribute access on the pdb module (if I'm understanding specialize.c correctly) also solves the problem.

Still debugging...I've tried various things with the tlbc_index member and the brc member without success...

@jamadden
Copy link
Contributor

The interpreter is much more complex than it used to be, even from 3.13, and I don't yet fully understand all the new interactions.

The issue is definitely some difference between the generic LOAD_ATTR and the specialized LOAD_ATTR_MODULE (both in generated_cases.c.h).

No threads are involved, so I believe we can take any of the cross-thread reference counting stuff off the table.

I can edit specialize.c:specialize_module_load_attr_lock_held to limit the specialization transformation of LOAD_ATTR to LOAD_ATTR_MODULE to just set_trace, in which case we still fail. Alternately, I can edit that function to disable specialization for set_trace only, and things are fine. (This is a more targeted, limited version of what's accomplished by adding __getattr__ to pdb.py.)

Both bytecodes make use of _PyStackRef objects, and both bytecodes manipulate _PyInterpreterFrame->stackpointer in subtly different ways. The stackpointer member is new in 3.14, but saving and restoring it around switches doesn't make any difference (though it seems like the right thing to do, GIL or no).

I've about exhausted the amount of time I have to spend on this, and I'm kind of stumped as to where to look next.

Since I do have a workaround, I may go ahead and merge the changes (these plus my other fixes from branch free-threading-fixes-part-deux) with the workaround documented, and hope for help or a brainstorm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants