Dual call stacks #1
Open
Fidget-Spinner
wants to merge
4
commits into
main
from
dual_call_stacks
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
# Dual Call Stacks | ||
|
||
Authors: Ken Jin, Pablo Galindo | ||
|
||
CPython currently uses a single Python call stack | ||
(separate from the C call stack). This is in contrast to other runtimes | ||
like HotSpot and some Wasm runtimes which use dual call stacks. | ||
In this document, we propose changing CPython's runtime to use dual | ||
call stacks. | ||
|
||
## The Problem | ||
|
||
There are two main issues faced with single call stacks: | ||
1. Traversing all the frames in a thread state is expensive, as | ||
the frame layout is roughly a linked list. | ||
2. Reconstructing frames in the case of function inlining is complicated. | ||
|
||
The impact of point 1 alone is wide-reaching. Traversing frames | ||
is a common operation done by out-of-memory profilers. Making traversal cheaper would have a substantial impact on | ||
these applications. | ||
|
||
Point 2 is mainly applicable if we want "full" function inlining without | ||
shim frames. Even with shim frames, dual call stacks make full | ||
reconstruction easier, as we can push a "skeleton" frame when inlining | ||
(at lower cost) and rehydrate the frame as and when needed by | ||
tools like `sys._getframe`. | ||
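
The skeleton-and-rehydrate idea above can be illustrated with a minimal sketch. Everything here is hypothetical: `Function`, `ControlFrame`, `push_skeleton` and `inspect_frame` are illustrative names, not CPython APIs, and the choice of which fields to defer is an assumption made only for the example.

```C
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a function object; not a CPython type. */
typedef struct {
    const void *globals;
    const void *builtins;
} Function;

/* Hypothetical control frame that can exist in a cheap "skeleton" state. */
typedef struct {
    const Function *func;   /* written on the inlined fast path */
    const void *globals;    /* only valid once hydrated */
    const void *builtins;   /* only valid once hydrated */
    bool hydrated;
} ControlFrame;

/* Inlined call: record just enough to rebuild the frame later. */
static void push_skeleton(ControlFrame *cf, const Function *func)
{
    cf->func = func;
    cf->hydrated = false;
}

/* Slow path for introspection: fill in the deferred fields on first use. */
static const ControlFrame *inspect_frame(ControlFrame *cf)
{
    if (!cf->hydrated) {
        cf->globals = cf->func->globals;
        cf->builtins = cf->func->builtins;
        cf->hydrated = true;
    }
    return cf;
}
```

In this sketch the cost of the deferred fields is paid only by callers that actually inspect the frame; frames that are never inspected stay in their skeleton state.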
|
||
## Our Approach | ||
|
||
One possible solution is to move all non-`localsplus` fields | ||
into a separate contiguous array of "control frames", sized to the limit set by | ||
`sys.setrecursionlimit`. | ||
This is the first stack. The second stack | ||
holds the `localsplus` fields of every frame, the "locals frames". | ||
This would solve the linked list problem, | ||
as everything is now a contiguous array requiring only a single memcpy | ||
or array traversal. | ||
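
A rough sketch of the two stacks, under loud assumptions: `StackRef`, `ControlFrame`, `DualStack`, the fixed `MAX_DEPTH` and `LOCALS_SLOTS` sizes, and all function names are hypothetical, and the control frame carries only a few of the real fields. It is meant to show the shape of the idea, not a proposed implementation.

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for _PyStackRef. */
typedef struct { uintptr_t bits; } StackRef;

/* Hypothetical control frame: fixed size, no `previous` pointer. */
typedef struct {
    StackRef executable;
    StackRef funcobj;
    StackRef *localsplus;    /* this frame's slots in the locals stack */
    StackRef *stackpointer;
} ControlFrame;

enum { MAX_DEPTH = 1000, LOCALS_SLOTS = 4096 };  /* assumed sizes */

typedef struct {
    ControlFrame control[MAX_DEPTH];  /* first stack */
    size_t depth;
    StackRef locals[LOCALS_SLOTS];    /* second stack */
    size_t locals_top;
} DualStack;

/* Push a frame needing `nslots` locals/stack slots; NULL on overflow. */
static ControlFrame *push_frame(DualStack *ds, StackRef code, StackRef func,
                                size_t nslots)
{
    if (ds->depth == MAX_DEPTH || nslots > LOCALS_SLOTS - ds->locals_top) {
        return NULL;
    }
    ControlFrame *cf = &ds->control[ds->depth++];
    cf->executable = code;
    cf->funcobj = func;
    cf->localsplus = &ds->locals[ds->locals_top];
    cf->stackpointer = cf->localsplus;
    ds->locals_top += nslots;
    return cf;
}

/* Pop the innermost frame and release its locals slots. */
static void pop_frame(DualStack *ds)
{
    assert(ds->depth > 0);
    ControlFrame *cf = &ds->control[--ds->depth];
    ds->locals_top = (size_t)(cf->localsplus - ds->locals);
}

/* Innermost-to-outermost walk by index; no pointer chasing. */
static size_t count_frames_for(const DualStack *ds, uintptr_t code_bits)
{
    size_t n = 0;
    for (size_t i = ds->depth; i-- > 0; ) {
        if (ds->control[i].executable.bits == code_bits) {
            n++;
        }
    }
    return n;
}

/* Copy up to `cap` control frames (outermost first) with one memcpy. */
static size_t snapshot(const DualStack *ds, ControlFrame *out, size_t cap)
{
    size_t n = ds->depth < cap ? ds->depth : cap;
    memcpy(out, ds->control, n * sizeof(ControlFrame));
    return n;
}
```

The caller of a frame is simply the previous array element, which is why no `previous` link appears in this sketch.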
|
||
As a reminder, `_PyInterpreterFrame` currently looks like this: | ||
|
||
```C | ||
struct _PyInterpreterFrame { | ||
_PyStackRef f_executable; /* Deferred or strong reference (code object or None) */ | ||
struct _PyInterpreterFrame *previous; | ||
_PyStackRef f_funcobj; /* Deferred or strong reference. Only valid if not on C stack */ | ||
PyObject *f_globals; /* Borrowed reference. Only valid if not on C stack */ | ||
PyObject *f_builtins; /* Borrowed reference. Only valid if not on C stack */ | ||
PyObject *f_locals; /* Strong reference, may be NULL. Only valid if not on C stack */ | ||
PyFrameObject *frame_obj; /* Strong reference, may be NULL. Only valid if not on C stack */ | ||
_Py_CODEUNIT *instr_ptr; /* Instruction currently executing (or about to begin) */ | ||
_PyStackRef *stackpointer; | ||
#ifdef Py_GIL_DISABLED | ||
/* Index of thread-local bytecode containing instr_ptr. */ | ||
int32_t tlbc_index; | ||
#endif | ||
uint16_t return_offset; /* Only relevant during a function call */ | ||
char owner; | ||
#ifdef Py_DEBUG | ||
uint8_t visited:1; | ||
uint8_t lltrace:7; | ||
#else | ||
uint8_t visited; | ||
#endif | ||
/* Locals and stack */ | ||
_PyStackRef localsplus[1]; | ||
}; | ||
``` | ||
|
||
The dual call stack layout would look something like this: | ||
|
||
|
||
<!--- Not really an entity relationship diagram, but the syntax is useful. | ||
--> | ||
```mermaid | ||
erDiagram | ||
CONTROL_FRAME ||--|| LOCALS_FRAME : contains | ||
CONTROL_FRAME { | ||
_PyStackRef* f_executable | ||
_PyInterpreterFrame* previous | ||
_PyStackRef* f_funcobj | ||
PyObject* f_globals | ||
Rest the_rest | ||
LOCALS_FRAME* localsplus_frame | ||
} | ||
LOCALS_FRAME { | ||
_PyStackRef localsplus[1] | ||
} | ||
``` | ||
|
||
## Performance Impact | ||
|
||
We hope to keep the pyperformance slowdown from this approach below 1%. | ||
|
||
Frame pushing and popping will require one more write, but will also be cheaper | ||
because we no longer need to write `previous`. The net change is zero extra writes. | ||
Additionally, the frame bump allocator | ||
that CPython currently uses should be exhausted much more slowly in the case of recursive calls, | ||
as it will only be consumed by the second (locals) call stack, not the first one. | ||
|
||
The initial naive implementation may use an extra register in the tail-calling | ||
interpreter for the control frame pointer. This can be offset by some tricks | ||
to store the `tstate` variable at a fixed offset from the control stack. This will | ||
save a register, bringing the net extra register usage to zero. Alternatively, | ||
we can store a `tstate` field in the control frame pointing to the real tstate. | ||
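
One way the "fixed offset" trick might work is sketched below. This is only a guess at a possible technique, not something the proposal specifies: the control stack is allocated as a power-of-two-aligned block with the thread state pointer at its base, so the thread state can be recovered from any control frame pointer by masking. `ThreadState`, `ControlStack`, `CONTROL_STACK_BYTES` and both functions are hypothetical names, and the block size is an arbitrary assumption.

```C
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed block size; also used as the alignment, so it is a power of two. */
#define CONTROL_STACK_BYTES ((size_t)1 << 16)

/* Hypothetical stand-in for PyThreadState. */
typedef struct ThreadState { int id; } ThreadState;

/* Hypothetical, heavily trimmed control frame. */
typedef struct {
    void *executable;
    void *localsplus;
} ControlFrame;

typedef struct {
    ThreadState *tstate;     /* sits at offset 0 of the aligned block */
    ControlFrame frames[];   /* control frames fill the rest of the block */
} ControlStack;

/* Allocate the block aligned to its own size (C11 aligned_alloc). */
static ControlStack *control_stack_new(ThreadState *ts)
{
    ControlStack *cs = aligned_alloc(CONTROL_STACK_BYTES, CONTROL_STACK_BYTES);
    if (cs != NULL) {
        cs->tstate = ts;
    }
    return cs;
}

/* Recover tstate from any frame in the block: mask down to the block base. */
static ThreadState *tstate_from_frame(const ControlFrame *cf)
{
    uintptr_t base = (uintptr_t)cf & ~(uintptr_t)(CONTROL_STACK_BYTES - 1);
    return ((const ControlStack *)base)->tstate;
}
```

With a layout like this, the interpreter would only need to keep the control frame pointer in a register and could derive `tstate` from it with one mask and one load. The alternative mentioned above, a `tstate` field in every control frame, trades that mask for an extra write on each push.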
|
||
|
||
## Open Problems | ||
|
||
How to handle line numbers? |
Isn't this resolved by the introduction of `_PyStackChunk`? Or am I misreading what is meant by "frames" here? `_PyStackChunk` is still a linked list, but is much better than copying all `_PyInterpreterFrame`.

Yeah it's a bit subtle, that's what I meant by "roughly" a linked list :).