Skip to content

[Push|Pop]ErrorScope thread local error stacks #211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lokokung opened this issue Aug 9, 2023 · 5 comments
Closed

[Push|Pop]ErrorScope thread local error stacks #211

lokokung opened this issue Aug 9, 2023 · 5 comments
Labels
errors Error reporting threading Thread safety and WASM-JS threading problems

Comments

@lokokung
Copy link
Collaborator

lokokung commented Aug 9, 2023

Came up in a discussion with Kai today, but main question is how we intend to deal with error scopes in a multithread environment. The current model could race when an unrelated error on thread 1 is caught in a popErrorScope on thread 2.

The simplest solution for now is a thread-local error stack, however, there are still some open questions / discussion points especially for other language bindings...

  • Users must understand that they cannot rely on error scopes set on the passing thread of a callback. Instead, users that are interested in error scopes in callbacks need to explicitly call the pop/push in the callback. (This is probably fine?)
  • Does this work for cases where tasks can yield and be rescheduled on a different OS level thread? One thing we talked about is potentially having to add an API that can be called in a context switch hook (does this exist?)? Do we have any other alternatives for this?
  • Given that a thread-local solution should be fine for JS, is it possible to make the native bindings the same? One thought was that since the JS API isn't standardized for multithread yet, we could make sure that the multithread version doesn't allow calls to [Push|Pop]ErrorScope, and instead expose something different so that we could be aligned with how the native version would work? Kai also mentioned that while the thread-local solution should work for JS, that it may be possible for some Rust runtimes on top of WASM to do something similar to the point above and run into the same issue.
@lokokung lokokung added the threading Thread safety and WASM-JS threading problems label Aug 9, 2023
@kainino0x kainino0x added the !discuss Needs discussion (at meeting or online) label Aug 10, 2023
@kainino0x
Copy link
Collaborator

(Mostly my own thoughts but some from @lokokung @cwfitzgerald @rajveermalviya and old ideas from the JS API standardization)

The current proposal for JS multithreading would have error scope stacks be thread-local. It's unlikely we'd want to make them racy, as this introduces major threading hazards to the JS API, and JS usually doesn't like those (I think the web platform currently has none - they're in JavaScript itself via SharedArrayBuffer - though the proposal does add another hazard to GPUBuffer mapping).

If we went forward with that, then the model for WASM would have to be the same. Then in order to match that in native, we'd need to make the error scope stack thread local.

Some contexts will not like this, e.g. bindings from languages that have green threads. It could still be the default, if we add a way for language bindings to cover it up. For example:

  • we could make the thread-to-error-scope-stack mapping mutable, so the bindings can maintain one per green thread (instead of one per OS thread) and set it current whenever the green thread's OS thread changes. May require JS API support.

Maybe we could turn the thread-locality on/off in a global way (in requestDevice). It would be required to be on in initial WASM implementations (because that's all we can do with the JS API now) but then maybe allow turning it off later. If we can find a way to make it somewhat reasonable for error scope stacks to be non-thread-local but still usable in multithreaded native code (see "forking" below) then maybe this could work.


We do still have the opportunity to change the proposal for multithreading in JS. We might have the opportunity to remove the "mistake" of global error scope state when you're using multithreading.

Ideas (a grab bag, not a list of alternatives):

  • just have it be racy (most likely not palatable)
  • as mentioned above, have it be thread-local but allow swapping out the thread's error-scope-stack
  • when you send the GPUDevice to another js thread, it has no error scope support. That is, push/popErrorScope don't exist (and uncapturederror events don't exist either probably, as has been proposed). All errors go straight to uncapturederror (on the originating GPUDevice) until you set an error-scope-stack.
  • opt into multithreading in requestDevice, and if you do this then you get no error scope stack even on the originating thread
  • allow "forking" the device so you can have multiple error scopes that aren't tied to specific threads (Multiple components cannot effectively share a GPUDevice because it is stateful gpuweb/gpuweb#3250) - not fully figured out what would need to provide here to make something work for WASM
  • allow overriding the error-sink for individual function calls (see Consider making ShaderModule compilation errors special gpuweb/gpuweb#2119 (comment))
    • I had pointed out we made it so all encoding calls can generate validation errors (if called on a closed encoder) but I forgot that encoders are non-thread-shareable, so we wouldn't need to set the error-sink on every command, only once on createCommandEncoder and createRenderBundleEncoder.

This reminds me there's been interest in faking multithreaded webgpu.h over single-threaded WebGPU by proxying WebGPU work to one thread: emscripten-core/emscripten#19645
This, of course, would implement the racy version - we can't do anything thread-local unless we do some serious emulation to fully reimplement error scopes in WASM land, over JS ones that we automatically push and pop on the webgpu thread when switching which application thread is executing webgpu calls there.

@kainino0x
Copy link
Collaborator

webgpu.h meeting partial resolution:

  • Keep the current error scopes, make them thread-local.
    • In Rust, use an inverted-control withErrorScope(|| { closure }) type thing to make it reliably stay on-thread
  • Add explicit error sink (native only) to createTexture/Buffer/QuerySet/RenderPipeline/ComputePipeline.
    • In WASM emulate this with error scopes
    • TBD the exact form of this; perhaps not a full error scope stack, just one-off sinks for internal and OOM errors (makes WASM implementation easier)

@kainino0x kainino0x added errors Error reporting and removed !discuss Needs discussion (at meeting or online) labels Aug 17, 2023
@kainino0x kainino0x added the needs exploration Needs offline exploration (e.g. in an implementation) label Jun 6, 2024
@kainino0x
Copy link
Collaborator

kainino0x commented Dec 4, 2024

Posting some old minutes because this issue is complicated and I need them to be accessible. (I wasn't posting full minutes back then for some reason.)

2023-08-10:

  • CF: in wgpu I believe it’s thread local it’s a thread-global thing on the device
  • CF: In rust could return a thing from PushErrorScope that’s not Send that you have to use to Pop
  • KN/LK propose having it be thread local, then later providing ways to get out of it
  • … KN concerned about exposing something this racy to JS, think JS will need to be thread local
  • CF: It’s global but you can override it in every call?
  • CF: Error scope tied to the GPUDevice, and you can make copies of the GPUDevice? In JS when you postMessage you get a new one
  • need to think more about this

2023-08-17:

  • CF: Ran into a similar issue with a related timestamp query library that has global locked state
  • KN: Kelsey filed a related issue in WebGPU ErrorScopes are not useful for CommandEncoder methods · Issue #4269
  • CF: Error scope sink may be the best solution
  • CW: Think TLS is OK. In Rust could invert control withErrorScope(|| { … })
  • CW: Worried about big changes because we know what’s wrong with the current thing but don’t know what’s wrong with the new idea yet.
  • AE: Kai mentioned forking the device, what if you can fork every object?
    • KN: think only device, queue, texture
    • CW: for now.
  • LK: compromise where current version is TLS, and also have the object?
    • CW: seems similar: have one in TLS (the default), one you can override, and a macro that goes “push into TLS, do call, remove from TLS”
    • CF: would need pop to return an object that’s not Send. (other concerns about concurrency)
  • CF: in native it would be nice to be able to do what all the backend APIs do, just return the error
    • CW: could return a future that’s already resolved
  • KN: suspect we can rely on green threads not to be rescheduled in unexpected places. or can we ask green threads systems to pin us to a thread?
    • CW: almost none of the things that would yield a green thread happen inside graphics code. You know the granularity of your task in practice
  • CF/CW: could pass an extra error scope only for out-of-memory/internal (createTexture/Buffer/QuerySet/Pipeline)
  • CF: in rust most people don’t use error scopes and just rely on the fact that the default UncapturedError callback panics.
  • (discussion of the fact that out-of-memory only comes from a few calls, how do you write a program that intentionally runs up against memory limits)
    • CF: wgpu/dawn reserve memory and OOM early so it doesn’t run out internally?
    • CW: can also clean up resources, or wait for the GPU to finish so we can free locked resources, etc.
  • Proposal
    • Current error scopes: Use TLS in C. Invert control in Rust.
    • Add explicit error sink (in native only) for out-of-memory and internal errors, in createTexture/Buffer/QuerySet/RenderPipeline/ComputePipeline
      • Emulated with error scopes in JS

@kainino0x kainino0x removed the needs exploration Needs offline exploration (e.g. in an implementation) label Dec 4, 2024
@kainino0x
Copy link
Collaborator

kainino0x commented Dec 9, 2024

Dec 9 meeting:

  • KN: Do we really need [error sinks] for 1.0? Do we need them after?
  • LK: Think it was for green threads (...)
  • CF: Need to define wrt async.
  • KN: createPipelineAsync don't produce error scope errors. mapAsync does, but only immediately (not in the "return" task)
  • Close this for now, pending more concrete use-case.

@lokokung
Copy link
Collaborator Author

Just to clarify the notes from above, I believe that the decision was Yes, do TLS for push/pop error stacks, and No, don't do error sinks for now. The notes in the comment above were all pertaining to error sinks, not TLS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
errors Error reporting threading Thread safety and WASM-JS threading problems
Projects
None yet
Development

No branches or pull requests

2 participants