TL;DR: MMTk core should be friendlier to VMs where each Mutator does not correspond to one language-level thread. To do this, we must reconsider the relation between mutators, threads and stacks, and reconsider when to scan them. In short:
- We should have as many `MutatorContext` instances as the level of true parallelism.
- We can scan `MutatorContext` instances, threads and stacks when they are not being used by a native thread. Note, however, that those moments may not coincide. Read the last section for details.
Motivation
OpenJDK 21 stabilised the "virtual thread" feature. It is basically an M*P thread model where the number of application threads (M) may not be the same as the number of underlying OS-level native threads (P).
Ruby uses one native thread for each Ruby thread, but due to the "global VM lock", only one thread in each "Ractor" can be executing Ruby code at any time. Meanwhile, Ruby threads can swap between different "Fibers", which provide the semantics of symmetric or asymmetric coroutines. Under the hood, this swaps between stacks using `swapcontext`.
GHC also uses green threads. See https://mmtk.zulipchat.com/#narrow/stream/315620-Porting/topic/Green.20Threads
History
In the past, JikesRVM used the M*P thread model. There was one `Processor` instance for each native thread, and one `RVMThread` for each application-level thread.
Regarding what a `MutatorContext` corresponds to, here is an excerpt from the comment of the abstract class `MutatorContext` in an early version of JikesRVM:
```java
/**
 * This class (and its sub-classes) implement <i>per-mutator thread</i>
 * behavior. We assume <i>N</i> collector threads and <i>M</i>
 * mutator threads, where <i>N</i> is often equal to the number of
 * available processors, P (for P-way parallelism at GC-time), and
 * <i>M</i> may simply be the number of mutator (application) threads.
 * Both <i>N</i> and <i>M</i> are determined by the VM, not MMTk. In
 * the case where a VM uses posix threads (pthreads) for each mutator
 * ("1:1" threading), <i>M</i> will typically be equal to the number of
 * mutator threads. When a uses "green threads" or a hybrid threading
 * scheme (such as Jikes RVM), <i>M</i> will typically be equal to the
 * level of <i>true</i> parallelism (ie the number of underlying
 * kernel threads).<p>
 * ...
 */
```
This means there is one `MutatorContext` per native thread, not per app thread. The `Processor` class extends `Selected.Mutator`, which is an alias of the concrete `MutatorContext` of the selected `Plan`. The method call `Selected.Mutator.get()` simply returns `Processor.getCurrentProcessor()`.
Then, when JikesRVM started using native threads exclusively, `MutatorContext` became per-`RVMThread`, too. `Selected.Mutator.get()` now returns `RVMThread.getCurrentThread()`.
After we ported MMTk to Rust, we still assume there is one `MutatorContext` per thread.
What are `MutatorContext`, thread and stack, anyway?
MutatorContext
In simple terms, the trait `MutatorContext` defines something that:
- We can allocate objects with it (`alloc` and `post_alloc`).
- It contains states and methods related to accessing objects, such as the mod-buf and write barriers (`barrier`, `barrier_impl`, `flush`, `flush_remembered_set`).
- Its states need to be inspected and updated during GC (`prepare` and `release`).
Implementation-wise, the struct `Mutator` contains:
- Many allocators (depending on the plan). Some allocators contain data structures for allocation fast paths, such as `struct BumpPointer`, which has a `cursor` and a `limit`. (See the sketch after this list.)
- The concrete `Barrier` implementation.
- References to related data structures (plans, spaces, etc.) and functions (`prepare_func` and `release_func`).
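To make this concrete, below is a minimal, self-contained sketch of the fast-path idea. It is not MMTk's actual code: only the names `BumpPointer`, `cursor` and `limit` come from the description above, and the `Space` type, chunk size and addresses are invented for illustration. The point is that the fast path only touches per-mutator state, and synchronisation is confined to the slow path.

```rust
use std::sync::Mutex;

/// A shared space that hands out chunks to mutators. (Illustrative only.)
struct Space {
    /// Next free address in the space, protected by a lock.
    cursor: Mutex<usize>,
    end: usize,
}

impl Space {
    /// Slow path: take the lock and carve out a new chunk for one mutator.
    fn acquire_chunk(&self, chunk_size: usize) -> Option<(usize, usize)> {
        let mut cursor = self.cursor.lock().unwrap();
        if *cursor + chunk_size > self.end {
            return None; // Out of memory; a real implementation would trigger GC here.
        }
        let start = *cursor;
        *cursor += chunk_size;
        Some((start, *cursor))
    }
}

/// Per-mutator bump pointer: the allocation fast path touches only these
/// two words, so no synchronisation is needed.
struct BumpPointer {
    cursor: usize,
    limit: usize,
}

impl BumpPointer {
    /// Fast path: bump `cursor`; fall back to the slow path when the chunk is full.
    /// (A real allocator would also handle requests larger than a chunk.)
    fn alloc(&mut self, size: usize, space: &Space) -> Option<usize> {
        if self.cursor + size <= self.limit {
            let result = self.cursor;
            self.cursor += size;
            Some(result)
        } else {
            let (start, limit) = space.acquire_chunk(4096)?; // Slow path: synchronised.
            self.cursor = start + size;
            self.limit = limit;
            Some(start)
        }
    }
}

fn main() {
    let space = Space { cursor: Mutex::new(0x1000_0000), end: 0x1001_0000 };
    let mut bp = BumpPointer { cursor: 0, limit: 0 }; // Empty: the first alloc takes the slow path.
    let a = bp.alloc(32, &space).unwrap(); // slow path (acquires a chunk under the lock)
    let b = bp.alloc(32, &space).unwrap(); // fast path (no synchronisation)
    println!("allocated at {:#x} and {:#x}", a, b);
}
```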
What do `prepare_func` and `release_func` do?
- For MarkSweep, it sweeps the buckets in `prepare_func` if sweeping lazily, or in `release_func` if sweeping eagerly.
- For MarkCompact, the `release_func` resets the bump pointer because the space is compacted.
- For SemiSpace, the `release_func` rebinds the bump pointer to the right CopySpace (because the from/to spaces are swapped) and resets the bump pointer (sketched below).
- For GenCopy and GenImmix, the `release_func` resets the bump pointer because the nursery is evacuated.
- For Immix and StickyImmix, both the `prepare_func` and the `release_func` reset the bump pointer and the "large bump pointer" (for medium objects). (P.S. I wonder why they need to reset the bump pointers twice. Isn't that idempotent?)
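As a rough illustration of what "resetting the bump pointer" means for a SemiSpace-like plan, here is a sketch (again, invented types, not MMTk's actual `release_func`): after the semispaces are flipped, the allocator is rebound to the new to-space and its `cursor`/`limit` are cleared, so the next allocation takes the slow path and acquires a fresh chunk.

```rust
/// Per-mutator bump-pointer state, as in the sketch above.
struct BumpPointer {
    cursor: usize,
    limit: usize,
}

/// Illustrative stand-in for a copying plan with two semispaces.
struct SemiSpaceLike {
    copyspaces: [&'static str; 2],
    to_space_index: usize,
}

/// Illustrative stand-in for the mutator's copy allocator bound to one space.
struct CopyAllocator {
    bound_space: &'static str,
    bump: BumpPointer,
}

/// What a `release_func` conceptually does for a SemiSpace-like plan:
/// the from/to spaces have just been swapped, so rebind the allocator to the
/// new to-space and reset the bump pointer so the next allocation refills it.
fn release(alloc: &mut CopyAllocator, plan: &SemiSpaceLike) {
    alloc.bound_space = plan.copyspaces[plan.to_space_index];
    alloc.bump.cursor = 0;
    alloc.bump.limit = 0; // cursor == limit forces the slow path on the next allocation
}

fn main() {
    let mut plan = SemiSpaceLike { copyspaces: ["copyspace0", "copyspace1"], to_space_index: 0 };
    let mut alloc = CopyAllocator {
        bound_space: plan.copyspaces[plan.to_space_index],
        bump: BumpPointer { cursor: 0x2000, limit: 0x3000 },
    };
    // GC happens: the plan flips the semispaces, then calls release on each mutator.
    plan.to_space_index = 1;
    release(&mut alloc, &plan);
    println!(
        "now allocating into {} (bump pointer reset: cursor={:#x}, limit={:#x})",
        alloc.bound_space, alloc.bump.cursor, alloc.bump.limit
    );
}
```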
Thread
For most modern OSes, a native thread is the unit of scheduling, and threads share the same address space, thus the same heap. The OS may schedule two threads on two different processors, letting them run simultaneously. The OS can also schedule different threads on the same processor at different times, preempting the execution of the previous thread. Because of the concurrent and non-cooperative nature of native threads, we need synchronisation when two native threads access shared data.
(Avoiding synchronisation is what `MutatorContext` is all about: a `MutatorContext` holds thread-local data so that native threads can avoid expensive synchronisation on their fast paths.)
Some programming languages use the M*P thread model: M application threads (in OpenJDK's terminology, "virtual threads"; in Go's terminology, "goroutines") are scheduled in user space onto P native threads (in JikesRVM's and Go's terminology, "processors"). This addresses the fact that it is expensive to create or switch between native threads (at least on some OSes). In this model, at most P application threads can be executing simultaneously at any given time. Implementation-wise, such user-level scheduling is usually built on swap-stack mechanisms (explained below).
If P = 1, only one native thread exists, and only one application thread can be executing at any given time. That's "green threads".
Some languages, such as Python and Ruby, have a global interpreter lock. Although language-level application threads are implemented 1:1 as native threads, only the one native thread that holds the global lock can execute Python/Ruby code (and thereby access the heap) at any time. Other threads may be executing native code (such as doing IO in syscalls, or running C code).
Stack
A control stack (or simply a "stack") holds the activation records (stack frames) of nested function calls. It is LIFO. The most recently entered function will return first. For traditional programming languages that do not support stackful coroutines (such as C, C++, Java, etc.), one thread only needs one stack.
A swap-stack operation (the name comes from this paper; it actually swaps the stack pointer (SP) rather than the stack contents) can unbind a thread from a stack and rebind it to another stack. This changes the PC and the SP of the current thread, changing where the thread is executing as well as where it will return to when the current function returns. The old stack is preserved, and can be swapped back later, resuming from where the swap-stack operation happened. (You may be interested in this paper and the section on swap-stack in this thesis.) This effectively provides the semantics of symmetric coroutines. Some languages, such as Ruby, natively support swap-stack (Fiber in Ruby). Other languages can support it via extensions, such as the `swapcontext` POSIX function, the `boost-context` library for C++, and the `greenlet` library for Python. (You may be interested in this blog post, too.)
The M*P thread model can be implemented with swap-stack:
- Create one stack for each app thread.
- Whenever a thread is about to perform a blocking operation (such as reading a file), it instead
  - performs the non-blocking version of the same operation (such as registering the file with an `epoll` syscall that handles many open files), and
  - swaps to a different stack.
- Whenever a thread is about to call a long-running native function, it instead
  - creates a task to be executed in a thread pool dedicated to long-running native functions, and
  - swaps to a different stack.
- Whenever a thread has executed a large number of instructions, it swaps stacks, too.
In this way, M app threads can share P native threads, and the scheduling happens in user space, at the time of swapping stacks. (A minimal model of this scheduling structure is sketched below.)
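The sketch below only models the scheduling structure described in this list: M application threads multiplexed over P native threads, with rescheduling happening in user space at yield points. It deliberately represents an application thread as a plain struct on a shared run queue rather than a real stack; a production implementation would perform an actual swap-stack (via `swapcontext`, boost-context, or compiler support) at each of these points.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// An "application thread" that runs until its next yield point.
/// Real green threads would keep their state on their own stack; here we
/// just keep a counter to model "remaining work between yields".
struct AppThread {
    id: usize,
    remaining_steps: usize,
}

fn main() {
    const M: usize = 8; // application threads
    const P: usize = 2; // native threads ("processors")

    let run_queue: Arc<Mutex<VecDeque<AppThread>>> = Arc::new(Mutex::new(
        (0..M).map(|id| AppThread { id, remaining_steps: 3 }).collect(),
    ));

    let workers: Vec<_> = (0..P)
        .map(|p| {
            let run_queue = Arc::clone(&run_queue);
            thread::spawn(move || loop {
                // Pick the next runnable app thread (a real system would swap to its stack).
                let Some(mut t) = run_queue.lock().unwrap().pop_front() else { break };
                println!("processor {} runs app thread {} until its next yield point", p, t.id);
                t.remaining_steps -= 1;
                if t.remaining_steps > 0 {
                    // The app thread "yielded" (e.g. a blocking read turned into an epoll
                    // registration, or its time slice expired): requeue it and pick another.
                    run_queue.lock().unwrap().push_back(t);
                } else {
                    println!("app thread {} finished", t.id);
                }
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
}
```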
So what should we do then?
How many MutatorContext instances do we need?
Just like the comment in JikesRVM stated, we need as many `MutatorContext` instances as the level of true parallelism.
- For OpenJDK <= 20, it is the number of Java threads.
- For OpenJDK 21, it is the number of native threads.
- For Ruby, it is the number of Ractors.
- For Python, one.
But it is still correct to have more `MutatorContext` instances than necessary. For example, MMTk-Ruby currently has one `MutatorContext` per Ruby Thread. I am not reducing that to one per Ractor yet because MMTk core currently assumes one thread per Mutator.
When can we scan threads/stacks/mutators?
Instead of thinking about when we can scan them, it is easier to think about what prevents us from scanning them. It is all about synchronisation: if some other thread is using it, don't scan it.
What's preventing us from scanning MutatorContext?
- In OpenJDK 20, a Java thread can allocate objects and access fields when it is executing Java code, and it needs its `MutatorContext` to allocate objects (using `Allocator`) and access fields (using `Barrier`). So we cannot scan a `MutatorContext` while its thread is executing Java code.
  - Corollary: We can scan the `MutatorContext` if either (1) the thread has yielded, or (2) the thread is executing native code (in JNI). (This decision is sketched after this list.)
- In OpenJDK 21, it is basically the same as OpenJDK 20, but since a `MutatorContext` is per native thread, we cannot scan it while the native thread is executing Java code (in any virtual thread).
- In Ruby, only one Thread in a Ractor can be running Ruby code.
  - In the status quo where each Ruby Thread has a `MutatorContext`, only the running thread in a Ractor is accessing its `MutatorContext`, and all other threads are idle (or running native code, not using the `MutatorContext`).
    - This means we only need to wait for the active Thread to stop before scanning its `MutatorContext`, while the `MutatorContext` of every other Thread is free to scan, as long as we prevent those threads from waking up while we scan their `MutatorContext`.
  - If we change MMTk-Ruby so that there is only one `MutatorContext` per Ractor, we only need the single running thread to stop before scanning the unique `MutatorContext` of the Ractor.
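The corollaries above boil down to a per-mutator decision that the binding has to make. The following sketch spells it out; the enum and function names are invented for this issue and are not MMTk, OpenJDK or Ruby APIs.

```rust
/// What the thread owning a MutatorContext is currently doing.
/// (Illustrative names; not an MMTk type.)
#[derive(Clone, Copy, Debug)]
enum MutatorOwnerState {
    /// Executing managed (Java/Ruby/...) code: may allocate and hit barriers.
    ExecutingManagedCode,
    /// Executing native code (e.g. in JNI) that does not touch the MutatorContext.
    ExecutingNativeCode,
    /// Stopped at a yieldpoint / blocked for GC.
    Yielded,
    /// Not scheduled at all (e.g. a non-running Ruby Thread in a Ractor),
    /// provided we prevent it from waking up while we scan.
    NotScheduled,
}

/// A MutatorContext can be scanned as soon as no thread can be mutating it.
fn can_scan_mutator_context(state: MutatorOwnerState) -> bool {
    match state {
        MutatorOwnerState::ExecutingManagedCode => false,
        MutatorOwnerState::ExecutingNativeCode
        | MutatorOwnerState::Yielded
        | MutatorOwnerState::NotScheduled => true,
    }
}

fn main() {
    for s in [
        MutatorOwnerState::ExecutingManagedCode,
        MutatorOwnerState::ExecutingNativeCode,
        MutatorOwnerState::Yielded,
        MutatorOwnerState::NotScheduled,
    ] {
        println!("{:?}: can scan MutatorContext = {}", s, can_scan_mutator_context(s));
    }
}
```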
What's preventing us from scanning thread-local states?
As long as a thread is running, it can access its thread-local states freely, such as its JNI handles. This means we can only scan them when the thread yields.
What's preventing us from scanning stacks?
If a native thread is running on the stack (i.e. if the stack is bound to a native thread), we can't scan the stack.
- In OpenJDK 20, it is one stack per thread.
  - So we can only scan a stack when its unique thread has yielded.
  - Currently, `Collection::stop_all_mutators` makes this assumption and provides a callback for the VM to notify MMTk core when to scan a mutator (actually it is "when to scan a stack").
- In OpenJDK 21, we expect there to be more virtual (app) threads than native threads. If a virtual thread (implemented as a stack) is running on a native thread (i.e. the stack is bound to a thread), we can't scan the stack.
  - The stacks of idle virtual threads are ready for scanning when GC starts, but we need to prevent them from being bound to native threads while we scan them. It is sufficient to block all native threads that attempt to swap stacks (i.e. to reschedule and run other virtual threads) during GC. (A sketch of this policy follows this list.)
  - If a virtual thread is running on a native thread, we have to wait for the native thread to yield before scanning the stack of the virtual thread.
- In Ruby, we scan stacks conservatively. We have to wait for threads to yield.
  - Unbound Fibers, however, are not roots. They may have been abandoned, and those stacks should be garbage-collected. But they are PPPs anyway, and have to be scanned (conservatively).
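To tie the OpenJDK 21 case together, here is a sketch of the scanning policy described above (all types and names are invented for illustration): when GC starts, block swap-stack so that no idle stack can become bound, scan unbound stacks immediately, and scan bound stacks only after their native threads yield.

```rust
use std::collections::HashSet;

/// Illustrative model of a virtual thread's stack: either bound to a native
/// thread (i.e. running on it) or unbound (idle).
struct VirtualThreadStack {
    id: usize,
    bound_to: Option<usize>, // native thread id, if currently bound
}

struct Scheduler {
    /// While true, native threads must not bind (swap to) any stack.
    swap_stack_blocked: bool,
    stacks: Vec<VirtualThreadStack>,
    /// Native threads that have reached a yieldpoint for this GC.
    yielded_native_threads: HashSet<usize>,
}

impl Scheduler {
    /// Called by a native thread before swapping to another stack.
    fn can_bind_stacks(&self) -> bool {
        !self.swap_stack_blocked
    }

    /// GC entry point: block rescheduling, then scan whatever is already scannable.
    fn scan_all_stacks_for_gc(&mut self) {
        self.swap_stack_blocked = true; // no stack can become bound from now on
        for stack in &self.stacks {
            match stack.bound_to {
                // Idle stacks are ready to scan as soon as rebinding is blocked.
                None => println!("scan unbound stack {} immediately", stack.id),
                // Bound stacks whose native thread has already yielded are also safe.
                Some(native) if self.yielded_native_threads.contains(&native) => {
                    println!("scan stack {}: native thread {} has yielded", stack.id, native)
                }
                // Otherwise we must wait for the native thread to yield first.
                Some(native) => {
                    println!("wait for native thread {} before scanning stack {}", native, stack.id)
                }
            }
        }
    }
}

fn main() {
    let mut sched = Scheduler {
        swap_stack_blocked: false,
        stacks: vec![
            VirtualThreadStack { id: 0, bound_to: Some(1) }, // running on native thread 1
            VirtualThreadStack { id: 1, bound_to: None },    // idle
            VirtualThreadStack { id: 2, bound_to: Some(2) }, // running on native thread 2
        ],
        yielded_native_threads: HashSet::from([2]),
    };
    sched.scan_all_stacks_for_gc();
    println!("may native threads swap stacks during GC? {}", sched.can_bind_stacks());
}
```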