From b8a82f02d5e03e6874483451a02c50d14a61a28f Mon Sep 17 00:00:00 2001
From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
Date: Wed, 7 May 2025 23:25:04 +0800
Subject: [PATCH 1/4] Create dual_call_stacks.md

---
 3.15/dual_call_stacks.md | 113 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 3.15/dual_call_stacks.md

diff --git a/3.15/dual_call_stacks.md b/3.15/dual_call_stacks.md
new file mode 100644
index 0000000..aba401a
--- /dev/null
+++ b/3.15/dual_call_stacks.md
@@ -0,0 +1,113 @@
+# Dual Call Stacks
+
+Authors: Ken Jin, Pablo Galindo
+
+CPython currently uses a single Python call stack
+(separate from the C call stack). This is in contrast to other runtimes,
+like HotSpot and some Wasm runtimes, which use dual call stacks.
+In this document, we propose changing CPython's runtime to use dual
+call stacks.
+
+## The Problem
+
+There are two main issues faced with single call stacks:
+1. Traversing all the frames in a thread state is expensive, as
+   the frame layout is roughly a linked list.
+2. Reconstructing frames in the case of function inlining is complicated.
+
+The impact of point 1. alone is wide-reaching. Traversing frames
+is a common operation done by out-of-memory profilers, and free-threaded
+CPython (see `PyUnstable_Object_IsUniqueReferencedTemporary` and the
+free-threaded GC). Improving these would have substantial impact on both
+these applications.
+
+Point 2. is mainly applicable if we want "full" function inlining without
+shim frames. Even so, with shim frames, dual call stacks make full
+reconstruction easier, as we can push a "skeleton" frame when inlining
+(at lower cost), and rehydrate the frame as and when it is needed by
+tools like `sys._getframe`.
+
+## Our Approach
+
+One possible solution is to separate out all non-`f_localsplus` fields
+into a separate contiguous array that is allocated to the size set by
+`sys.setrecursionlimit`, called the "control frame".
+This is the first stack.
+The second stack
+is all the `f_localsplus` fields, called the "locals frame".
+This would solve the linked list problem,
+as now everything is a contiguous array requiring only a single memcpy
+or array traversal.
+
+As a reminder, `_PyInterpreterFrame` currently looks like this:
+
+```C
+struct _PyInterpreterFrame {
+    _PyStackRef f_executable; /* Deferred or strong reference (code object or None) */
+    struct _PyInterpreterFrame *previous;
+    _PyStackRef f_funcobj; /* Deferred or strong reference. Only valid if not on C stack */
+    PyObject *f_globals; /* Borrowed reference. Only valid if not on C stack */
+    PyObject *f_builtins; /* Borrowed reference. Only valid if not on C stack */
+    PyObject *f_locals; /* Strong reference, may be NULL. Only valid if not on C stack */
+    PyFrameObject *frame_obj; /* Strong reference, may be NULL. Only valid if not on C stack */
+    _Py_CODEUNIT *instr_ptr; /* Instruction currently executing (or about to begin) */
+    _PyStackRef *stackpointer;
+#ifdef Py_GIL_DISABLED
+    /* Index of thread-local bytecode containing instr_ptr. */
+    int32_t tlbc_index;
+#endif
+    uint16_t return_offset; /* Only relevant during a function call */
+    char owner;
+#ifdef Py_DEBUG
+    uint8_t visited:1;
+    uint8_t lltrace:7;
+#else
+    uint8_t visited;
+#endif
+    /* Locals and stack */
+    _PyStackRef localsplus[1];
+};
+```
+
+The dual call stack layout would look something like this:
+
+<!--
+-->
+
+```mermaid
+erDiagram
+    _PyInterpreterControlFrame ||--||{ _PyInterpreterLocalsFrame : matches
+    _PyInterpreterControlFrame {
+        _PyStackRef f_executable;
+        struct _PyInterpreterFrame *previous; %% We only need this when yielding to generators and backtraces (see genobject.c)
+        _PyStackRef f_funcobj;
+        PyObject *f_globals;
+        ...
+        uint8_t visited;
+        _PyInterpreterLocalsFrame *localsplus_frame;
+    }
+
+    _PyInterpreterLocalsFrame {
+        _PyStackRef localsplus[1];
+    }
+```
+
+## Performance Impact
+
+We hope to achieve a less than 1% pyperformance slowdown with this approach.
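The claim that contiguous stacks reduce frame traversal to a single array walk (or one memcpy) can be made concrete with a few lines of C. The sketch below is purely illustrative: the type names, sizes, and the `push_frame`/`pop_frame` helpers are hypothetical stand-ins, not CPython's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch only: names, sizes, and helpers are illustrative
 * stand-ins, not CPython's real implementation. */
typedef struct { void *bits; } StackRef;      /* stand-in for _PyStackRef */

typedef struct {
    StackRef *localsplus;    /* points into the locals stack (second stack) */
    int n_localsplus;        /* number of locals + eval-stack slots */
    /* ... f_executable, f_globals, instr_ptr, etc. would live here ... */
} ControlFrame;

#define MAX_DEPTH  1024      /* analogous to the recursion limit */
#define LOCALS_CAP 65536

/* First stack: one contiguous array of fixed-size control frames. */
static ControlFrame control_stack[MAX_DEPTH];
/* Second stack: one contiguous array holding every frame's localsplus. */
static StackRef locals_stack[LOCALS_CAP];
static int depth = 0;
static size_t locals_top = 0;

/* Push: bump both stack tops. There is no `previous` pointer to write,
 * because the caller is simply the preceding array element. */
ControlFrame *push_frame(int n_localsplus) {
    assert(depth < MAX_DEPTH);
    assert(locals_top + (size_t)n_localsplus <= LOCALS_CAP);
    ControlFrame *f = &control_stack[depth++];
    f->localsplus = &locals_stack[locals_top];
    f->n_localsplus = n_localsplus;
    locals_top += (size_t)n_localsplus;
    return f;
}

void pop_frame(void) {
    assert(depth > 0);
    depth--;
    locals_top -= (size_t)control_stack[depth].n_localsplus;
}

/* A stack walker (e.g. a profiler) can scan control_stack[0..depth)
 * directly, or copy it out with a single memcpy. */
int count_frames(void) {
    return depth;
}
```

In this toy model, `push_frame` writes one extra field (the `localsplus` pointer) compared to a single-stack frame, but no longer writes a `previous` link, and walking the stack never chases pointers.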
+
+Frame pushing and popping will require 1 more write, but will also be cheaper
+because we don't need to write `previous` anymore. So the writes net out to zero.
+Additionally, the frame bump allocator
+that CPython currently uses should be exhausted much more slowly in the case of recursive calls,
+as it will only consume the second (locals) call stack, not the first one.
+Lastly, the performance improvements from making `PyUnstable_Object_IsUniqueReferencedTemporary`
+significantly cheaper on the free-threaded build will likely make this
+more than worth it. A reminder that this replaced the `Py_REFCNT(op) == 1` optimization
+that libraries like `numpy` use to reuse objects.
+
+The initial naive implementation may use an extra register in the tail-calling
+interpreter for the control frame pointer. This can be offset by some tricks
+to store the `tstate` variable at a fixed offset from the control stack. This will
+save us the register, making the net register usage zero. Alternatively,
+we can store a `tstate` field in the control frame pointing to the real tstate.

From 4fab366c6100c58925bbfe44ed1afd714e9e5ce6 Mon Sep 17 00:00:00 2001
From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
Date: Wed, 7 May 2025 23:37:05 +0800
Subject: [PATCH 2/4] fix mermaid diagram

---
 3.15/dual_call_stacks.md | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/3.15/dual_call_stacks.md b/3.15/dual_call_stacks.md
index aba401a..bd95b1a 100644
--- a/3.15/dual_call_stacks.md
+++ b/3.15/dual_call_stacks.md
@@ -75,19 +75,17 @@ The dual call stack layout would look something like this:
 -->
 
 ```mermaid
 erDiagram
-    _PyInterpreterControlFrame ||--||{ _PyInterpreterLocalsFrame : matches
-    _PyInterpreterControlFrame {
-        _PyStackRef f_executable;
-        struct _PyInterpreterFrame *previous; %% We only need this when yielding to generators and backtraces (see genobject.c)
-        _PyStackRef f_funcobj;
-        PyObject *f_globals;
-        ...
-        uint8_t visited;
-        _PyInterpreterLocalsFrame *localsplus_frame;
+    CONTROL_FRAME ||--|| LOCALS_FRAME : contains
+    CONTROL_FRAME {
+        _PyStackRef* f_executable
+        _PyInterpreterFrame* previous
+        _PyStackRef *f_funcobj
+        PyObject* f_globals
+        Rest the_rest
+        LOCALS_FRAME* localsplus_frame
     }
-
-    _PyInterpreterLocalsFrame {
-        _PyStackRef localsplus[1];
+    LOCALS_FRAME {
+        _PyStackRef localsplus[1]
     }
 ```

From 74a0a59339643e51b1372b17c39f96344d933ee5 Mon Sep 17 00:00:00 2001
From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
Date: Thu, 8 May 2025 00:06:05 +0800
Subject: [PATCH 3/4] correct

---
 3.15/dual_call_stacks.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/3.15/dual_call_stacks.md b/3.15/dual_call_stacks.md
index bd95b1a..ade6f5a 100644
--- a/3.15/dual_call_stacks.md
+++ b/3.15/dual_call_stacks.md
@@ -17,8 +16,7 @@ There are two main issues faced with single call stacks:
 2. Reconstructing frames in the case of function inlining is complicated.
 
 The impact of point 1. alone is wide-reaching. Traversing frames
 is a common operation done by out-of-memory profilers, and free-threaded
-CPython (see `PyUnstable_Object_IsUniqueReferencedTemporary` and the
-free-threaded GC). Improving these would have substantial impact on both
+CPython (see free-threaded GC). Improving these would have substantial impact on both
 these applications.
 
 Point 2. is mainly applicable if we want "full" function inlining without
@@ -98,8 +97,8 @@ because we don't need to write `previous` anymore. So the writes net out to zero.
 Additionally, the frame bump allocator
 that CPython currently uses should be exhausted much more slowly in the case of recursive calls,
 as it will only consume the second (locals) call stack, not the first one.
-Lastly, the performance improvements from making `PyUnstable_Object_IsUniqueReferencedTemporary`
-significantly cheaper on the free-threaded build will likely make this
+Lastly, the performance improvements from making GC mark/sweep
+cheaper on the free-threaded build will likely make this
 more than worth it. A reminder that this replaced the `Py_REFCNT(op) == 1` optimization
 that libraries like `numpy` use to reuse objects.

From 79a56afdf04db32d3dd15b11a8d69b4f3bf3bad1 Mon Sep 17 00:00:00 2001
From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
Date: Thu, 8 May 2025 00:19:42 +0800
Subject: [PATCH 4/4] Update dual_call_stacks.md

---
 3.15/dual_call_stacks.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/3.15/dual_call_stacks.md b/3.15/dual_call_stacks.md
index ade6f5a..6f3de01 100644
--- a/3.15/dual_call_stacks.md
+++ b/3.15/dual_call_stacks.md
@@ -16,8 +15,7 @@ There are two main issues faced with single call stacks:
 2. Reconstructing frames in the case of function inlining is complicated.
 
 The impact of point 1. alone is wide-reaching. Traversing frames
-is a common operation done by out-of-memory profilers, and free-threaded
-CPython (see free-threaded GC). Improving these would have substantial impact on both
+is a common operation done by out-of-memory profilers. Improving this would have a substantial impact on
 these applications.
 
 Point 2. is mainly applicable if we want "full" function inlining without
@@ -97,10 +96,6 @@ because we don't need to write `previous` anymore. So the writes net out to zero.
 Additionally, the frame bump allocator
 that CPython currently uses should be exhausted much more slowly in the case of recursive calls,
 as it will only consume the second (locals) call stack, not the first one.
-Lastly, the performance improvements from making GC mark/sweep
-cheaper on the free-threaded build will likely make this
-more than worth it. A reminder that this replaced the `Py_REFCNT(op) == 1` optimization
-that libraries like `numpy` use to reuse objects.
 
 The initial naive implementation may use an extra register in the tail-calling
 interpreter for the control frame pointer. This can be offset by some tricks
@@ -108,3 +103,7 @@ to store the `tstate` variable at a fixed offset from the control stack.
 save us the register, making the net register usage zero. Alternatively,
 we can store a `tstate` field in the control frame pointing to the real tstate.
 
+
+## Open problems
+
+How to handle line numbers?
\ No newline at end of file
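One way to picture the fixed-offset `tstate` trick from the patches above is to give the control stack a power-of-two-aligned allocation and store a `tstate` pointer at its base; any control-frame pointer can then recover `tstate` by masking its low bits, without dedicating a register. The sketch below is a hypothetical illustration under those assumptions: the names, the allocation scheme, and the stack size are stand-ins, not CPython's code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: the control stack is allocated at a power-of-two
 * aligned address, with `tstate` stored once at its base. */
#define STACK_BYTES (1u << 16)   /* illustrative size; must be a power of two */

typedef struct { int id; } ThreadState;   /* stand-in for PyThreadState */

typedef struct {
    ThreadState  *tstate;                 /* stored once, at the aligned base */
    unsigned char frames[STACK_BYTES - sizeof(ThreadState *)];
} ControlStack;

ControlStack *control_stack_new(ThreadState *ts) {
    /* aligned_alloc (C11): alignment == total size keeps the base maskable. */
    ControlStack *cs = aligned_alloc(STACK_BYTES, sizeof(ControlStack));
    if (cs != NULL) {
        cs->tstate = ts;
    }
    return cs;
}

/* Recover the thread state from any pointer into the stack's frame area:
 * masking the low bits of the address yields the aligned base. */
ThreadState *tstate_from_frame(void *frame_ptr) {
    uintptr_t base = (uintptr_t)frame_ptr & ~(uintptr_t)(STACK_BYTES - 1);
    return ((ControlStack *)base)->tstate;
}
```

Under this scheme the interpreter only needs the control frame pointer it already holds; `tstate` is one mask and one load away, which is the sense in which the extra register can be recovered.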