TL;DR: MMTk core should be friendlier to VMs where each Mutator does not correspond to one language-level thread. To do this, we must reconsider the relation between mutators, threads and stacks, and reconsider when to scan them. In short:
- We should have as many `MutatorContext` instances as the level of true parallelism.
- We can scan `MutatorContext` instances, threads and stacks when they are not being used by a native thread. Note, however, that those moments may not coincide. Read the last section for details.
Motivation
OpenJDK 21 stabilised the "virtual thread" feature. It is basically an M*P thread model where the number of application threads (M) may not be the same as the number of underlying OS-level native threads (P).
Ruby uses one native thread for each Ruby thread, but due to the "global VM lock", only one thread in each "Ractor" can be executing Ruby code at any time. Meanwhile, Ruby threads can swap between different "Fibers", which provide the semantics of symmetric or asymmetric coroutines. Under the hood, this swaps between stacks using `swapcontext`.
GHC also uses green threads. See https://mmtk.zulipchat.com/#narrow/stream/315620-Porting/topic/Green.20Threads
History
In the past, JikesRVM used the M*P thread model. There was one `Processor` instance for each native thread, and one `RVMThread` for each application-level thread.
Regarding what a `MutatorContext` corresponds to, here is an excerpt from the comment of the abstract class `MutatorContext` in an early version of JikesRVM:
```java
/**
 * This class (and its sub-classes) implement <i>per-mutator thread</i>
 * behavior. We assume <i>N</i> collector threads and <i>M</i>
 * mutator threads, where <i>N</i> is often equal to the number of
 * available processors, P (for P-way parallelism at GC-time), and
 * <i>M</i> may simply be the number of mutator (application) threads.
 * Both <i>N</i> and <i>M</i> are determined by the VM, not MMTk. In
 * the case where a VM uses posix threads (pthreads) for each mutator
 * ("1:1" threading), <i>M</i> will typically be equal to the number of
 * mutator threads. When a uses "green threads" or a hybrid threading
 * scheme (such as Jikes RVM), <i>M</i> will typically be equal to the
 * level of <i>true</i> parallelism (ie the number of underlying
 * kernel threads).<p>
 * ...
 */
```
This means there is one `MutatorContext` per native thread, not per app thread. The `Processor` class extends `Selected.Mutator`, which is an alias of the concrete `MutatorContext` of the selected `Plan`. The method call `Selected.Mutator.get()` simply returns `Processor.getCurrentProcessor()`.
Then, when JikesRVM started using native threads exclusively, `MutatorContext` became per-`RVMThread`, too. `Selected.Mutator.get()` now returns `RVMThread.getCurrentThread()`.
After we ported MMTk to Rust, we still assume there is one `MutatorContext` per thread.
What are `MutatorContext`, thread and stack, anyway?
MutatorContext
In simple terms, the trait `MutatorContext` defines something that:
- We can allocate objects with it (`alloc` and `post_alloc`).
- It contains states and methods related to accessing objects, such as the mod-buf and write barriers (`barrier`, `barrier_impl`, `flush`, `flush_remembered_set`).
- Its states need to be inspected and updated during GC (`prepare` and `release`).
Implementation-wise, the struct `Mutator` contains:
- Many allocators (depending on the plan). Some allocators contain data structures for allocation fast paths, such as `struct BumpPointer`, which has a `cursor` and a `limit`. (See the sketch after this list.)
- The concrete `Barrier` implementation.
- References to related data structures (plans, spaces, etc.) and functions (`prepare_func` and `release_func`).
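To make this concrete, below is a minimal, self-contained sketch of the fast-path idea. It is not MMTk's actual code: only the names `BumpPointer`, `cursor` and `limit` come from the description above, and the `Space` type, chunk size and addresses are invented for illustration. The point is that the fast path only touches per-mutator state, and synchronisation is confined to the slow path.

```rust
use std::sync::Mutex;

/// A shared space that hands out chunks to mutators. (Illustrative only.)
struct Space {
    /// Next free address in the space, protected by a lock.
    cursor: Mutex<usize>,
    end: usize,
}

impl Space {
    /// Slow path: take the lock and carve out a new chunk for one mutator.
    fn acquire_chunk(&self, chunk_size: usize) -> Option<(usize, usize)> {
        let mut cursor = self.cursor.lock().unwrap();
        if *cursor + chunk_size > self.end {
            return None; // Out of memory; a real implementation would trigger GC here.
        }
        let start = *cursor;
        *cursor += chunk_size;
        Some((start, *cursor))
    }
}

/// Per-mutator bump pointer: the allocation fast path touches only these
/// two words, so no synchronisation is needed.
struct BumpPointer {
    cursor: usize,
    limit: usize,
}

impl BumpPointer {
    /// Fast path: bump `cursor`; fall back to the slow path when the chunk is full.
    /// (A real allocator would also handle requests larger than a chunk.)
    fn alloc(&mut self, size: usize, space: &Space) -> Option<usize> {
        if self.cursor + size <= self.limit {
            let result = self.cursor;
            self.cursor += size;
            Some(result)
        } else {
            let (start, limit) = space.acquire_chunk(4096)?; // Slow path: synchronised.
            self.cursor = start + size;
            self.limit = limit;
            Some(start)
        }
    }
}

fn main() {
    let space = Space { cursor: Mutex::new(0x1000_0000), end: 0x1001_0000 };
    let mut bp = BumpPointer { cursor: 0, limit: 0 }; // Empty: the first alloc takes the slow path.
    let a = bp.alloc(32, &space).unwrap(); // slow path (acquires a chunk under the lock)
    let b = bp.alloc(32, &space).unwrap(); // fast path (no synchronisation)
    println!("allocated at {:#x} and {:#x}", a, b);
}
```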
What do `prepare_func` and `release_func` do?
- For MarkSweep, it sweeps the buckets in `prepare_func` if sweeping lazily, or in `release_func` if sweeping eagerly.
- For MarkCompact, the `release_func` resets the bump pointer because the space is compacted.
- For SemiSpace, the `release_func` rebinds the bump pointer to the right CopySpace (because the from/to spaces are swapped) and resets the bump pointer (sketched below).
- For GenCopy and GenImmix, the `release_func` resets the bump pointer because the nursery is evacuated.
- For Immix and StickyImmix, both the `prepare_func` and the `release_func` reset the bump pointer and the "large bump pointer" (for medium objects). (P.S. I wonder why they need to reset the bump pointers twice. Isn't that idempotent?)
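As a rough illustration of what "resetting the bump pointer" means for a SemiSpace-like plan, here is a sketch (again, invented types, not MMTk's actual `release_func`): after the semispaces are flipped, the allocator is rebound to the new to-space and its `cursor`/`limit` are cleared, so the next allocation takes the slow path and acquires a fresh chunk.

```rust
/// Per-mutator bump-pointer state, as in the sketch above.
struct BumpPointer {
    cursor: usize,
    limit: usize,
}

/// Illustrative stand-in for a copying plan with two semispaces.
struct SemiSpaceLike {
    copyspaces: [&'static str; 2],
    to_space_index: usize,
}

/// Illustrative stand-in for the mutator's copy allocator bound to one space.
struct CopyAllocator {
    bound_space: &'static str,
    bump: BumpPointer,
}

/// What a `release_func` conceptually does for a SemiSpace-like plan:
/// the from/to spaces have just been swapped, so rebind the allocator to the
/// new to-space and reset the bump pointer so the next allocation refills it.
fn release(alloc: &mut CopyAllocator, plan: &SemiSpaceLike) {
    alloc.bound_space = plan.copyspaces[plan.to_space_index];
    alloc.bump.cursor = 0;
    alloc.bump.limit = 0; // cursor == limit forces the slow path on the next allocation
}

fn main() {
    let mut plan = SemiSpaceLike { copyspaces: ["copyspace0", "copyspace1"], to_space_index: 0 };
    let mut alloc = CopyAllocator {
        bound_space: plan.copyspaces[plan.to_space_index],
        bump: BumpPointer { cursor: 0x2000, limit: 0x3000 },
    };
    // GC happens: the plan flips the semispaces, then calls release on each mutator.
    plan.to_space_index = 1;
    release(&mut alloc, &plan);
    println!(
        "now allocating into {} (bump pointer reset: cursor={:#x}, limit={:#x})",
        alloc.bound_space, alloc.bump.cursor, alloc.bump.limit
    );
}
```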
Thread
For most modern OSes, a native thread is the unit of scheduling, and threads share the same address space, thus the same heap. The OS may schedule two threads on two different processors, letting them run simultaneously. The OS can also schedule different threads on the same processor at different times, preempting the execution of the previous thread. Because of the concurrent and non-cooperative nature of native threads, we need synchronisation when two native threads access shared data.
(Avoiding synchronisation is what `MutatorContext` is all about: a `MutatorContext` holds thread-local data so that native threads can avoid expensive synchronisation on their fast paths.)
Some programming languages use the M*P thread model: M application threads (in OpenJDK's terminology, "virtual threads"; in Go's terminology, "goroutines") are scheduled in user space onto P native threads (in JikesRVM's and Go's terminology, "processors"). This addresses the fact that it is expensive to create or switch between native threads (at least on some OSes). In this model, at most P application threads can be executing simultaneously at any given time. Implementation-wise, such user-level scheduling is usually built on swap-stack mechanisms (explained below).
If P = 1, only one native thread exists, and only one application thread can be executing at any given time. That's "green threads".
Some languages, such as Python and Ruby, have a global interpreter lock. Although language-level application threads are implemented 1:1 as native threads, only the one native thread that holds the global lock can execute Python/Ruby code (and thereby access the heap) at any time. Other threads may be executing native code (such as doing IO in syscalls, or running C code).
Stack
A control stack (or simply a "stack") holds the activation records (stack frames) of nested function calls. It is LIFO. The most recently entered function will return first. For traditional programming languages that do not support stackful coroutines (such as C, C++, Java, etc.), one thread only needs one stack.
A swap-stack operation (the name comes from this paper; it actually swaps the stack pointer (SP) rather than the stack contents) can unbind a thread from a stack and rebind it to another stack. This changes the PC and the SP of the current thread, changing where the thread is executing as well as where it will return to when the current function returns. The old stack is preserved, and can be swapped back later, resuming from where the swap-stack operation happened. (You may be interested in this paper and the section on swap-stack in this thesis.) This effectively provides the semantics of symmetric coroutines. Some languages, such as Ruby, natively support swap-stack (Fiber in Ruby). Other languages can support it via extensions, such as the `swapcontext` POSIX function, the `boost-context` library for C++, and the `greenlet` library for Python. (You may be interested in this blog post, too.)
The M*P thread model can be implemented with swap-stack:
- Create one stack for each app thread.
- Whenever a thread is about to perform a blocking operation (such as reading a file), it instead
  - performs the non-blocking version of the same operation (such as registering the file with an `epoll` syscall that handles many open files), and
  - swaps to a different stack.
- Whenever a thread is about to call a long-running native function, it instead
  - creates a task to be executed in a thread pool dedicated to long-running native functions, and
  - swaps to a different stack.
- Whenever a thread has executed a large number of instructions, it swaps stacks, too.
In this way, M app threads can share P native threads, and the scheduling happens in user space, at the time of swapping stacks. (A minimal model of this scheduling structure is sketched below.)
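The sketch below only models the scheduling structure described in this list: M application threads multiplexed over P native threads, with rescheduling happening in user space at yield points. It deliberately represents an application thread as a plain struct on a shared run queue rather than a real stack; a production implementation would perform an actual swap-stack (via `swapcontext`, boost-context, or compiler support) at each of these points.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// An "application thread" that runs until its next yield point.
/// Real green threads would keep their state on their own stack; here we
/// just keep a counter to model "remaining work between yields".
struct AppThread {
    id: usize,
    remaining_steps: usize,
}

fn main() {
    const M: usize = 8; // application threads
    const P: usize = 2; // native threads ("processors")

    let run_queue: Arc<Mutex<VecDeque<AppThread>>> = Arc::new(Mutex::new(
        (0..M).map(|id| AppThread { id, remaining_steps: 3 }).collect(),
    ));

    let workers: Vec<_> = (0..P)
        .map(|p| {
            let run_queue = Arc::clone(&run_queue);
            thread::spawn(move || loop {
                // Pick the next runnable app thread (a real system would swap to its stack).
                let Some(mut t) = run_queue.lock().unwrap().pop_front() else { break };
                println!("processor {} runs app thread {} until its next yield point", p, t.id);
                t.remaining_steps -= 1;
                if t.remaining_steps > 0 {
                    // The app thread "yielded" (e.g. a blocking read turned into an epoll
                    // registration, or its time slice expired): requeue it and pick another.
                    run_queue.lock().unwrap().push_back(t);
                } else {
                    println!("app thread {} finished", t.id);
                }
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
}
```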
So what should we do then?
How many MutatorContext instances do we need?
Just like the comment in JikesRVM stated, we need as many `MutatorContext` instances as the level of true parallelism.
- For OpenJDK <= 20, it is the number of Java threads.
- For OpenJDK 21, it is the number of native threads.
- For Ruby, it is the number of Ractors.
- For Python, one.
But it is still correct to have more `MutatorContext` instances than necessary. For example, MMTk-Ruby currently has one `MutatorContext` per Ruby Thread. I am not reducing that to one per Ractor yet because MMTk core currently assumes one thread per Mutator.
When can we scan threads/stacks/mutators?
Instead of thinking about when we can scan them, it is easier to think about what prevents us from scanning them. It is all about synchronisation: if some other thread is using it, don't scan it.
What's preventing us from scanning MutatorContext?
- In OpenJDK 20, a Java thread can allocate objects and access fields when it is executing Java code, and it needs its `MutatorContext` to allocate objects (using `Allocator`) and access fields (using `Barrier`). So we cannot scan a `MutatorContext` while its thread is executing Java code.
  - Corollary: We can scan the `MutatorContext` if either (1) the thread has yielded, or (2) the thread is executing native code (in JNI). (This decision is sketched after this list.)
- In OpenJDK 21, it is basically the same as OpenJDK 20, but since a `MutatorContext` is per native thread, we cannot scan it while the native thread is executing Java code (in any virtual thread).
- In Ruby, only one Thread in a Ractor can be running Ruby code.
  - In the status quo where each Ruby Thread has a `MutatorContext`, only the running thread in a Ractor is accessing its `MutatorContext`, and all other threads are idle (or running native code, not using the `MutatorContext`).
    - This means we only need to wait for the active Thread to stop before scanning its `MutatorContext`, while the `MutatorContext` of every other Thread is free to scan, as long as we prevent those threads from waking up while we scan their `MutatorContext`.
  - If we change MMTk-Ruby so that there is only one `MutatorContext` per Ractor, we only need the single running thread to stop before scanning the unique `MutatorContext` of the Ractor.
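The corollaries above boil down to a per-mutator decision that the binding has to make. The following sketch spells it out; the enum and function names are invented for this issue and are not MMTk, OpenJDK or Ruby APIs.

```rust
/// What the thread owning a MutatorContext is currently doing.
/// (Illustrative names; not an MMTk type.)
#[derive(Clone, Copy, Debug)]
enum MutatorOwnerState {
    /// Executing managed (Java/Ruby/...) code: may allocate and hit barriers.
    ExecutingManagedCode,
    /// Executing native code (e.g. in JNI) that does not touch the MutatorContext.
    ExecutingNativeCode,
    /// Stopped at a yieldpoint / blocked for GC.
    Yielded,
    /// Not scheduled at all (e.g. a non-running Ruby Thread in a Ractor),
    /// provided we prevent it from waking up while we scan.
    NotScheduled,
}

/// A MutatorContext can be scanned as soon as no thread can be mutating it.
fn can_scan_mutator_context(state: MutatorOwnerState) -> bool {
    match state {
        MutatorOwnerState::ExecutingManagedCode => false,
        MutatorOwnerState::ExecutingNativeCode
        | MutatorOwnerState::Yielded
        | MutatorOwnerState::NotScheduled => true,
    }
}

fn main() {
    for s in [
        MutatorOwnerState::ExecutingManagedCode,
        MutatorOwnerState::ExecutingNativeCode,
        MutatorOwnerState::Yielded,
        MutatorOwnerState::NotScheduled,
    ] {
        println!("{:?}: can scan MutatorContext = {}", s, can_scan_mutator_context(s));
    }
}
```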
What's preventing us from scanning thread-local states?
As long as a thread is running, it can access its thread-local states freely, such as its JNI handles. This means we can only scan them when the thread yields.
What's preventing us from scanning stacks?
If a native thread is running on the stack (i.e. if the stack is bound to a native thread), we can't scan the stack.
- In OpenJDK 20, it is one stack per thread.
  - So we can only scan a stack when its unique thread has yielded.
  - Currently, `Collection::stop_all_mutators` makes this assumption and provides a callback for the VM to notify MMTk core when to scan a mutator (actually it is "when to scan a stack").
- In OpenJDK 21, we expect there to be more virtual (app) threads than native threads. If a virtual thread (implemented as a stack) is running on a native thread (i.e. the stack is bound to a thread), we can't scan the stack.
  - The stacks of idle virtual threads are ready for scanning when GC starts, but we need to prevent them from being bound to native threads while we scan them. It is sufficient to block all native threads that attempt to swap stacks (i.e. to reschedule and run other virtual threads) during GC. (A sketch of this policy follows this list.)
  - If a virtual thread is running on a native thread, we have to wait for the native thread to yield before scanning the stack of the virtual thread.
- In Ruby, we scan stacks conservatively. We have to wait for threads to yield.
  - Unbound Fibers, however, are not roots. They may have been abandoned, and those stacks should be garbage-collected. But they are PPPs anyway, and have to be scanned (conservatively).
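To tie the OpenJDK 21 case together, here is a sketch of the scanning policy described above (all types and names are invented for illustration): when GC starts, block swap-stack so that no idle stack can become bound, scan unbound stacks immediately, and scan bound stacks only after their native threads yield.

```rust
use std::collections::HashSet;

/// Illustrative model of a virtual thread's stack: either bound to a native
/// thread (i.e. running on it) or unbound (idle).
struct VirtualThreadStack {
    id: usize,
    bound_to: Option<usize>, // native thread id, if currently bound
}

struct Scheduler {
    /// While true, native threads must not bind (swap to) any stack.
    swap_stack_blocked: bool,
    stacks: Vec<VirtualThreadStack>,
    /// Native threads that have reached a yieldpoint for this GC.
    yielded_native_threads: HashSet<usize>,
}

impl Scheduler {
    /// Called by a native thread before swapping to another stack.
    fn can_bind_stacks(&self) -> bool {
        !self.swap_stack_blocked
    }

    /// GC entry point: block rescheduling, then scan whatever is already scannable.
    fn scan_all_stacks_for_gc(&mut self) {
        self.swap_stack_blocked = true; // no stack can become bound from now on
        for stack in &self.stacks {
            match stack.bound_to {
                // Idle stacks are ready to scan as soon as rebinding is blocked.
                None => println!("scan unbound stack {} immediately", stack.id),
                // Bound stacks whose native thread has already yielded are also safe.
                Some(native) if self.yielded_native_threads.contains(&native) => {
                    println!("scan stack {}: native thread {} has yielded", stack.id, native)
                }
                // Otherwise we must wait for the native thread to yield first.
                Some(native) => {
                    println!("wait for native thread {} before scanning stack {}", native, stack.id)
                }
            }
        }
    }
}

fn main() {
    let mut sched = Scheduler {
        swap_stack_blocked: false,
        stacks: vec![
            VirtualThreadStack { id: 0, bound_to: Some(1) }, // running on native thread 1
            VirtualThreadStack { id: 1, bound_to: None },    // idle
            VirtualThreadStack { id: 2, bound_to: Some(2) }, // running on native thread 2
        ],
        yielded_native_threads: HashSet::from([2]),
    };
    sched.scan_all_stacks_for_gc();
    println!("may native threads swap stacks during GC? {}", sched.can_bind_stacks());
}
```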