[Push|Pop]ErrorScope thread local error stacks #211

lokokung · 2023-08-09T22:38:16Z

Came up in a discussion with Kai today, but main question is how we intend to deal with error scopes in a multithread environment. The current model could race when an unrelated error on thread 1 is caught in a popErrorScope on thread 2.

The simplest solution for now is a thread-local error stack, however, there are still some open questions / discussion points especially for other language bindings...

Users must understand that they cannot rely on error scopes set on the passing thread of a callback. Instead, users that are interested in error scopes in callbacks need to explicitly call the pop/push in the callback. (This is probably fine?)
Does this work for cases where tasks can yield and be rescheduled on a different OS level thread? One thing we talked about is potentially having to add an API that can be called in a context switch hook (does this exist?)? Do we have any other alternatives for this?
Given that a thread-local solution should be fine for JS, is it possible to make the native bindings the same? One thought was that since the JS API isn't standardized for multithread yet, we could make sure that the multithread version doesn't allow calls to [Push|Pop]ErrorScope, and instead expose something different so that we could be aligned with how the native version would work? Kai also mentioned that while the thread-local solution should work for JS, that it may be possible for some Rust runtimes on top of WASM to do something similar to the point above and run into the same issue.

The text was updated successfully, but these errors were encountered:

kainino0x · 2023-08-11T00:05:47Z

(Mostly my own thoughts but some from @lokokung @cwfitzgerald @rajveermalviya and old ideas from the JS API standardization)

The current proposal for JS multithreading would have error scope stacks be thread-local. It's unlikely we'd want to make them racy, as this introduces major threading hazards to the JS API, and JS usually doesn't like those (I think the web platform currently has none - they're in JavaScript itself via SharedArrayBuffer - though the proposal does add another hazard to GPUBuffer mapping).

If we went forward with that, then the model for WASM would have to be the same. Then in order to match that in native, we'd need to make the error scope stack thread local.

Some contexts will not like this, e.g. bindings from languages that have green threads. It could still be the default, if we add a way for language bindings to cover it up. For example:

we could make the thread-to-error-scope-stack mapping mutable, so the bindings can maintain one per green thread (instead of one per OS thread) and set it current whenever the green thread's OS thread changes. May require JS API support.

Maybe we could turn the thread-locality on/off in a global way (in requestDevice). It would be required to be on in initial WASM implementations (because that's all we can do with the JS API now) but then maybe allow turning it off later. If we can find a way to make it somewhat reasonable for error scope stacks to be non-thread-local but still usable in multithreaded native code (see "forking" below) then maybe this could work.

We do still have the opportunity to change the proposal for multithreading in JS. We might have the opportunity to remove the "mistake" of global error scope state when you're using multithreading.

Ideas (a grab bag, not a list of alternatives):

just have it be racy (most likely not palatable)
as mentioned above, have it be thread-local but allow swapping out the thread's error-scope-stack
when you send the GPUDevice to another js thread, it has no error scope support. That is, push/popErrorScope don't exist (and uncapturederror events don't exist either probably, as has been proposed). All errors go straight to uncapturederror (on the originating GPUDevice) until you set an error-scope-stack.
opt into multithreading in requestDevice, and if you do this then you get no error scope stack even on the originating thread
allow "forking" the device so you can have multiple error scopes that aren't tied to specific threads (Multiple components cannot effectively share a GPUDevice because it is stateful gpuweb/gpuweb#3250) - not fully figured out what would need to provide here to make something work for WASM
allow overriding the error-sink for individual function calls (see Consider making ShaderModule compilation errors special gpuweb/gpuweb#2119 (comment))
- I had pointed out we made it so all encoding calls can generate validation errors (if called on a closed encoder) but I forgot that encoders are non-thread-shareable, so we wouldn't need to set the error-sink on every command, only once on createCommandEncoder and createRenderBundleEncoder.

This reminds me there's been interest in faking multithreaded webgpu.h over single-threaded WebGPU by proxying WebGPU work to one thread: emscripten-core/emscripten#19645
This, of course, would implement the racy version - we can't do anything thread-local unless we do some serious emulation to fully reimplement error scopes in WASM land, over JS ones that we automatically push and pop on the webgpu thread when switching which application thread is executing webgpu calls there.

kainino0x · 2023-08-17T18:58:20Z

webgpu.h meeting partial resolution:

Keep the current error scopes, make them thread-local.
- In Rust, use an inverted-control withErrorScope(|| { closure }) type thing to make it reliably stay on-thread
Add explicit error sink (native only) to createTexture/Buffer/QuerySet/RenderPipeline/ComputePipeline.
- In WASM emulate this with error scopes
- TBD the exact form of this; perhaps not a full error scope stack, just one-off sinks for internal and OOM errors (makes WASM implementation easier)

kainino0x · 2024-12-04T02:36:56Z

Posting some old minutes because this issue is complicated and I need them to be accessible. (I wasn't posting full minutes back then for some reason.)

2023-08-10:

CF: in wgpu ~~I believe it’s thread local~~ it’s a thread-global thing on the device
CF: In rust could return a thing from PushErrorScope that’s not Send that you have to use to Pop
KN/LK propose having it be thread local, then later providing ways to get out of it
… KN concerned about exposing something this racy to JS, think JS will need to be thread local
…
CF: It’s global but you can override it in every call?
CF: Error scope tied to the GPUDevice, and you can make copies of the GPUDevice? In JS when you postMessage you get a new one
- KN: actually similar to Multiple components cannot effectively share a GPUDevice because it is stateful gpuweb/gpuweb#3250
need to think more about this

2023-08-17:

CF: Ran into a similar issue with a related timestamp query library that has global locked state
KN: Kelsey filed a related issue in WebGPU ErrorScopes are not useful for CommandEncoder methods · Issue #4269
…
CF: Error scope sink may be the best solution
…
CW: Think TLS is OK. In Rust could invert control withErrorScope(|| { … })
CW: Worried about big changes because we know what’s wrong with the current thing but don’t know what’s wrong with the new idea yet.
AE: Kai mentioned forking the device, what if you can fork every object?
- KN: think only device, queue, texture
- CW: for now.
LK: compromise where current version is TLS, and also have the object?
- CW: seems similar: have one in TLS (the default), one you can override, and a macro that goes “push into TLS, do call, remove from TLS”
- CF: would need pop to return an object that’s not Send. (other concerns about concurrency)
CF: in native it would be nice to be able to do what all the backend APIs do, just return the error
- CW: could return a future that’s already resolved
KN: suspect we can rely on green threads not to be rescheduled in unexpected places. or can we ask green threads systems to pin us to a thread?
- CW: almost none of the things that would yield a green thread happen inside graphics code. You know the granularity of your task in practice
CF/CW: could pass an extra error scope only for out-of-memory/internal (createTexture/Buffer/QuerySet/Pipeline)
CF: in rust most people don’t use error scopes and just rely on the fact that the default UncapturedError callback panics.
(discussion of the fact that out-of-memory only comes from a few calls, how do you write a program that intentionally runs up against memory limits)
- CF: wgpu/dawn reserve memory and OOM early so it doesn’t run out internally?
- CW: can also clean up resources, or wait for the GPU to finish so we can free locked resources, etc.
Proposal
- Current error scopes: Use TLS in C. Invert control in Rust.
- Add explicit error sink (in native only) for out-of-memory and internal errors, in createTexture/Buffer/QuerySet/RenderPipeline/ComputePipeline
  - Emulated with error scopes in JS

kainino0x · 2024-12-09T22:24:52Z

Dec 9 meeting:

KN: Do we really need [error sinks] for 1.0? Do we need them after?
LK: Think it was for green threads (...)
CF: Need to define wrt async.
…
KN: createPipelineAsync don't produce error scope errors. mapAsync does, but only immediately (not in the "return" task)
Close this for now, pending more concrete use-case.

lokokung · 2024-12-10T18:00:42Z

Just to clarify the notes from above, I believe that the decision was Yes, do TLS for push/pop error stacks, and No, don't do error sinks for now. The notes in the comment above were all pertaining to error sinks, not TLS.

lokokung added the threading Thread safety and WASM-JS threading problems label Aug 9, 2023

kainino0x added the !discuss Needs discussion (at meeting or online) label Aug 10, 2023

kainino0x added errors Error reporting and removed !discuss Needs discussion (at meeting or online) labels Aug 17, 2023

kainino0x mentioned this issue Nov 22, 2023

Implementing mapping in multithreaded WASM #248

Closed

kainino0x added the needs exploration Needs offline exploration (e.g. in an implementation) label Jun 6, 2024

kainino0x mentioned this issue Dec 4, 2024

Document that error scopes are thread-local #454

Merged

This was referenced Dec 4, 2024

Some additional flavor about thread local error scopes #455

Merged

Add error sinks #456

Closed

kainino0x removed the needs exploration Needs offline exploration (e.g. in an implementation) label Dec 4, 2024

kainino0x closed this as completed Dec 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Push|Pop]ErrorScope thread local error stacks #211

[Push|Pop]ErrorScope thread local error stacks #211

lokokung commented Aug 9, 2023

kainino0x commented Aug 11, 2023

Uh oh!

kainino0x commented Aug 17, 2023

Uh oh!

kainino0x commented Dec 4, 2024 •

edited

Loading

Uh oh!

kainino0x commented Dec 9, 2024 •

edited

Loading

Uh oh!

lokokung commented Dec 10, 2024

Uh oh!

[Push|Pop]ErrorScope thread local error stacks #211

[Push|Pop]ErrorScope thread local error stacks #211

Comments

lokokung commented Aug 9, 2023

kainino0x commented Aug 11, 2023

Uh oh!

kainino0x commented Aug 17, 2023

Uh oh!

kainino0x commented Dec 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kainino0x commented Dec 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lokokung commented Dec 10, 2024

Uh oh!

kainino0x commented Dec 4, 2024 •

edited

Loading

kainino0x commented Dec 9, 2024 •

edited

Loading