luke-gruber

We were seeing errors in our application that looked like:

[BUG] unexpected situation - recordd:1 current:0
/error.c:1097 rb_bug_without_die_internal
/vm_sync.c:275 disallow_reentry
/eval_intern.h:136 rb_ec_vm_lock_rec_check
/eval_intern.h:147 rb_ec_tag_state
/vm.c:2619 rb_vm_exec
/vm.c:1702 rb_yield
/eval.c:1173 rb_ensure

We concluded that a context switch was happening while a thread held the VM lock. During the investigation, we added assertions that we never yield to another thread with the VM lock held, and we enabled these assertions even in single-Ractor mode. The assertions failed in a few places, most notably in finalizers: we were running finalizers with the VM lock held, and they were context switching, which caused this issue.

These rules must hold going forward to ensure we don't context switch unexpectedly:

If you have the VM lock held:
* Don't enter the interpreter loop.
* Don't yield to Ruby code.
* Don't call rb_nogvl (it will context switch you and will not unlock the VM lock).
* Don't check your own interrupts; doing so can switch you.
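
A minimal sketch of the kind of assertion these rules imply, written as a standalone toy model rather than Ruby's actual code. Every `toy_*` name is hypothetical; the only idea taken from this PR is that a point which may context switch should see a VM lock depth of zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a recursive VM lock; not Ruby's real struct. */
typedef struct {
    unsigned int lock_rec; /* recursion depth: 0 means "not held" */
} toy_vm_t;

static void toy_vm_lock(toy_vm_t *vm)
{
    vm->lock_rec++;
}

static void toy_vm_unlock(toy_vm_t *vm)
{
    assert(vm->lock_rec > 0);
    vm->lock_rec--;
}

/* True when it is safe to reach a point that may switch threads. */
static bool toy_may_context_switch(const toy_vm_t *vm)
{
    return vm->lock_rec == 0;
}

/*
 * Stand-in for any operation that may context switch (entering the
 * interpreter, yielding to Ruby code, checking interrupts).
 * Returns false instead of aborting so the rule is easy to test.
 */
static bool toy_yield(toy_vm_t *vm, int *switches)
{
    if (!toy_may_context_switch(vm)) {
        return false; /* rule violated: lock held at a switch point */
    }
    (*switches)++;
    return true;
}
```

In this model, `toy_yield` refuses to run while the lock is held and succeeds once the lock has been released. A real assertion would abort instead of returning `false`.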

If you don't have the GVL:
* Don't call rb_ensure/rb_protect, etc. (these are old rules, but it's good to have assertions for them).

@luke-gruber luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch 7 times, most recently from 7ca26e8 to 1fa307d Compare September 11, 2025 16:38
@luke-gruber
Author

There's one bug in bigdecimal right now: it crashes due to GC.add_stress_to_class(BigDecimal). This is an old issue that was fixed by Matt. I'll try to investigate more later. For now, I'm going to see how this does in the experimental cluster, to check whether it stops the errors and doesn't crash.
cc @jhawthorn

@luke-gruber luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch from 36d6901 to f8887a8 Compare September 11, 2025 19:28
@luke-gruber
Author

The BigDecimal bug has been fixed.

@ioquatix

@luke-gruber luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch 6 times, most recently from 938a94c to 29575c4 Compare September 22, 2025 16:45
We were seeing errors in our application that looked like:

```
[BUG] unexpected situation - recordd:1 current:0
/error.c:1097 rb_bug_without_die_internal
/vm_sync.c:275 disallow_reentry
/eval_intern.h:136 rb_ec_vm_lock_rec_check
/eval_intern.h:147 rb_ec_tag_state
/vm.c:2619 rb_vm_exec
/vm.c:1702 rb_yield
/eval.c:1173 rb_ensure
```

We concluded that a context switch was happening while a thread held
the VM lock. During the investigation, we added assertions that we
never yield to another thread with the VM lock held, and we enabled
these assertions even in single-Ractor mode. The assertions failed in a
few places, most notably in finalizers: we were running finalizers with
the VM lock held, and they were context switching, which caused this
issue.

These rules must hold going forward to ensure we don't context switch unexpectedly:

If you have the VM lock held:
    * Don't enter the interpreter loop.
    * Don't call Ruby methods, whether or not they are defined in Ruby.
    * Don't call `rb_nogvl`, `rb_mutex_lock`, etc.
    * Don't check interrupts.

Rework global variables: don't lock when calling a getter or setter.
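
One way to picture "don't lock when calling getter or setter" is the following standalone sketch. It is not the code from this PR: the `gv_*` names, the table layout, and the copy-then-call approach are all assumptions made for illustration. The table lookup happens under the lock, and the getter (which could run arbitrary code) is only called after the lock is released.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical global-variable entry with a user-supplied getter. */
typedef long (*gv_getter_t)(void *data);

typedef struct {
    const char *name;
    gv_getter_t getter;
    void *data;
} gv_entry_t;

typedef struct {
    gv_entry_t entries[4];
    size_t len;
} gv_table_t;

/* Toy lock depth standing in for the VM lock. */
static unsigned int gv_lock_depth = 0;
/* Lock depth observed by the sample getter; 99 means "never called". */
static unsigned int gv_depth_seen_by_getter = 99;

static void gv_lock(void)
{
    gv_lock_depth++;
}

static void gv_unlock(void)
{
    assert(gv_lock_depth > 0);
    gv_lock_depth--;
}

/* Sample getter that records the lock depth it was called with. */
static long gv_answer_getter(void *data)
{
    gv_depth_seen_by_getter = gv_lock_depth;
    return *(long *)data;
}

static long gv_get(const gv_table_t *tbl, const char *name, long fallback)
{
    gv_entry_t found = {0};
    int ok = 0;

    gv_lock();
    for (size_t i = 0; i < tbl->len; i++) {
        if (strcmp(tbl->entries[i].name, name) == 0) {
            found = tbl->entries[i]; /* copy while the table is protected */
            ok = 1;
            break;
        }
    }
    gv_unlock();

    if (!ok) {
        return fallback;
    }
    /* The getter runs with the lock released. */
    return found.getter(found.data);
}
```

The sketch ignores the lifetime of whatever `data` points to after the lock is dropped. A real implementation would need to keep that alive by some other means.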

Add a test that fails without these lock_rec changes.

Add ASSERT_vm_unlocking() to vm_call0_body
This uncovered many more test failures.

Revert changes introduced in 2f6c694

This didn't appear to be a correct fix. We should allow raising
NoMemoryError even if we're under the VM lock. It will automatically
unlock us.
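
The "automatically unlock" behaviour can be pictured with a toy model. It assumes, without confirmation from this thread, that the lock depth is recorded when a jump target is set up and restored when a raise lands there. All `jmp_*` names are hypothetical, and `setjmp`/`longjmp` stand in for the interpreter's own non-local exit machinery.

```c
#include <assert.h>
#include <setjmp.h>

/* Toy lock depth standing in for the VM lock's recursion count. */
static unsigned int jmp_lock_rec = 0;
static jmp_buf jmp_tag;

static void jmp_lock(void)
{
    jmp_lock_rec++;
}

/* Stand-in for raising an exception: jumps out without unlocking. */
static void jmp_raise(void)
{
    longjmp(jmp_tag, 1);
}

/*
 * Runs body under a jump target.
 * Returns 0 if body returned normally, 1 if it raised and the lock
 * depth was restored, and -1 if the depth at the jump was lower than
 * the depth recorded when the target was set up.
 */
static int jmp_protect(void (*body)(void))
{
    const unsigned int recorded = jmp_lock_rec; /* depth at setup time */

    if (setjmp(jmp_tag) == 0) {
        body();
        return 0;
    }
    /* Arrived here via longjmp. */
    if (jmp_lock_rec < recorded) {
        return -1; /* inconsistent: someone released below our depth */
    }
    jmp_lock_rec = recorded; /* drop levels acquired after setup */
    return 1;
}

/* Takes the lock and then raises without releasing it. */
static void jmp_body_raises_under_lock(void)
{
    jmp_lock();
    jmp_raise();
}

/* Releases a lock level it did not take, then raises. */
static void jmp_body_unlocks_then_raises(void)
{
    jmp_lock_rec--;
    jmp_raise();
}
```

In this model a raise under the lock is harmless because the depth is restored at the jump target. The `-1` case, where the recorded depth is 1 and the current depth is 0, has the same shape as the numbers in the `[BUG]` message quoted in this PR.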
@luke-gruber luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch from 29575c4 to 8cf74f1 Compare September 22, 2025 17:52