Make sure we never context switch while holding VM lock. #735

luke-gruber · 2025-09-10T19:04:00Z

We were seeing errors in our application that looked like:

[BUG] unexpected situation - recordd:1 current:0
/error.c:1097 rb_bug_without_die_internal
/vm_sync.c:275 disallow_reentry
/eval_intern.h:136 rb_ec_vm_lock_rec_check
/eval_intern.h:147 rb_ec_tag_state
/vm.c:2619 rb_vm_exec
/vm.c:1702 rb_yield
/eval.c:1173 rb_ensure

We concluded that a context switch was happening while a thread held the VM lock. While investigating, we added assertions that a thread never yields to another thread with the VM lock held, and we enabled these assertions even in single-ractor mode. They failed in a few places, most notably in finalizers: we were running finalizers with the VM lock held, and the finalizers were context switching, which caused this bug.

The following rules must hold going forward to ensure we don't context switch unexpectedly:

If you hold the VM lock:
* Don't enter the interpreter loop.
* Don't yield to Ruby code.
* Don't call `rb_nogvl` (it will context switch you and will not unlock the VM lock).
* Don't check your own interrupts (doing so can context switch you).

If you don't hold the GVL:
* Don't call `rb_ensure`, `rb_protect`, etc. (these are old rules, but it is good to have assertions for them).
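
To make the invariant behind these rules concrete, here is a minimal, self-contained sketch in plain C. It is not CRuby code: `toy_thread_t`, `toy_vm_lock`, `toy_vm_unlock`, `toy_can_switch`, and `toy_context_switch` are hypothetical names invented for illustration. The idea it models is that every switch point first checks that the thread's lock recursion count is zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-thread VM lock state; not CRuby's real layout. */
typedef struct {
    unsigned int lock_rec;  /* how many times this thread has taken the VM lock */
    unsigned int switches;  /* context switches performed so far */
} toy_thread_t;

static void toy_vm_lock(toy_thread_t *th) { th->lock_rec++; }

static void toy_vm_unlock(toy_thread_t *th)
{
    assert(th->lock_rec > 0);  /* unlocking without holding the lock is a bug */
    th->lock_rec--;
}

/* The invariant: a thread may only be switched out when it holds no VM lock. */
static bool toy_can_switch(const toy_thread_t *th) { return th->lock_rec == 0; }

/* Every switch point (yielding to Ruby code, blocking calls, interrupt checks)
 * would funnel through a guard like this one. A real implementation would
 * abort on violation; this sketch reports the refusal so it is easy to test. */
static bool toy_context_switch(toy_thread_t *th)
{
    if (!toy_can_switch(th)) return false;
    th->switches++;
    return true;
}
```

In this sketch, a thread that has called `toy_vm_lock` twice is refused a switch until it has called `toy_vm_unlock` twice, which mirrors the recursive nature of the lock count shown in the crash message.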

luke-gruber · 2025-09-11T19:08:44Z

There's one bug in bigdecimal right now: it crashes due to `GC.add_stress_to_class(BigDecimal)`. This is an old issue that was fixed by Matt. I'll try to investigate more later. For now, I'm going to see how this does in the experimental cluster, to check whether it stops the errors and doesn't crash.
cc @jhawthorn

luke-gruber · 2025-09-12T21:40:42Z

The BigDecimal bug has been fixed.

ioquatix · 2025-09-18T01:11:52Z

I wonder if this could be useful: https://github.com/ruby/ruby/blob/0bb6a8bea49fed8ccef0a70aca5f2ea05af94292/vm_core.h#L73-L103

We were seeing errors in our application that looked like:

```
[BUG] unexpected situation - recordd:1 current:0
/error.c:1097 rb_bug_without_die_internal
/vm_sync.c:275 disallow_reentry
/eval_intern.h:136 rb_ec_vm_lock_rec_check
/eval_intern.h:147 rb_ec_tag_state
/vm.c:2619 rb_vm_exec
/vm.c:1702 rb_yield
/eval.c:1173 rb_ensure
```

We concluded that there was context switching going on while a thread held the VM lock. During the investigation into the issue, we added assertions that we never yield to another thread with the VM lock held. We enabled these VM lock assertions even in single ractor mode. These assertions were failing in a few places, but most notably in finalizers. We were running finalizers with the VM lock held, and they were context switching and causing this issue.

These rules must be held going forward to ensure we don't context switch unexpectedly:

If you have the VM lock held,
* Don't enter the interpreter loop.
* Don't call ruby methods whether or not they are defined in ruby
* Don't call `rb_nogvl`, `rb_mutex_lock`, etc.
* Don't check interrupts

Rework global variables, don't lock when calling getter or setter.

Add a test that fails without these lock_rec changes.

Add ASSERT_vm_unlocking() to vm_call0_body

This uncovered many more test failures.

Revert changes introduced in 2f6c694

This didn't appear to be a correct fix. We should allow raising NoMemoryError even if we're under the VM lock. It will automatically unlock us.
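
The "don't lock when calling getter or setter" change can be pictured with a small standalone sketch in plain C. This is an assumption about the general shape of such a fix, not the code in this PR: `toy_gvar_t`, `toy_gvar_table_t`, `toy_lock`, `toy_unlock`, `toy_checked_getter`, and `toy_gvar_get` are all hypothetical names. The lookup copies the entry out while the lock is held, releases the lock, and only then calls the getter, so arbitrary code never runs under the lock.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All names here are hypothetical; CRuby's real global-variable table differs. */
typedef struct toy_gvar_table toy_gvar_table_t;
typedef long (*toy_getter_t)(const toy_gvar_table_t *tbl, long stored);

typedef struct {
    const char *name;
    long stored;
    toy_getter_t getter;
} toy_gvar_t;

struct toy_gvar_table {
    unsigned int lock_rec;  /* depth of the toy lock guarding `entries` */
    toy_gvar_t entries[4];
    size_t len;
};

static void toy_lock(toy_gvar_table_t *tbl) { tbl->lock_rec++; }

static void toy_unlock(toy_gvar_table_t *tbl)
{
    assert(tbl->lock_rec > 0);
    tbl->lock_rec--;
}

/* Example getter: returns the stored value, or -1 if it was (wrongly)
 * invoked while the table lock was still held. */
static long toy_checked_getter(const toy_gvar_table_t *tbl, long stored)
{
    return tbl->lock_rec == 0 ? stored : -1;
}

/* Look the entry up under the lock, copy it out, unlock, then call the getter.
 * Returns 0 and writes *out on success, -1 if the name is unknown. */
static int toy_gvar_get(toy_gvar_table_t *tbl, const char *name, long *out)
{
    toy_gvar_t found = { NULL, 0, NULL };
    int have = 0;

    toy_lock(tbl);
    for (size_t i = 0; i < tbl->len; i++) {
        if (strcmp(tbl->entries[i].name, name) == 0) {
            found = tbl->entries[i];  /* copy, so no pointer into the table escapes */
            have = 1;
            break;
        }
    }
    toy_unlock(tbl);

    if (!have) return -1;
    *out = found.getter(tbl, found.stored);  /* getter runs with the lock released */
    return 0;
}
```

Copying the entry rather than keeping a pointer into the table is a deliberate choice in this sketch: once the lock is released, another thread could modify the table, so nothing read after the unlock should depend on table memory.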

luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch 7 times, most recently from 7ca26e8 to 1fa307d Compare September 11, 2025 16:38

luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch from 36d6901 to f8887a8 Compare September 11, 2025 19:28

luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch 6 times, most recently from 938a94c to 29575c4 Compare September 22, 2025 16:45

luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch from 29575c4 to 8cf74f1 Compare September 22, 2025 17:52
