PaulJKathmann commented Aug 25, 2025

Before this PR

  1. If posting the job result fails 5 times, an exception is raised, but the forwarder never learns about it because the user code container never sends it a message.
  2. If the user code crashes, the forwarder assumes the job is still running until the user code posts a result for the same jobId. After restarting, however, the user code has no record of the failed job, so it never reports on it. As a result, a node/module can be blocked from receiving new requests after a restart:

https://github.palantir.build/foundry/interactive-infra/issues/9237

After this PR

  1. If posting the job result fails 5 times (e.g. because the result is too large), the client tries 5 more times to post just a simple error message instead.
  2. The forwarder is now informed whenever a node starts up, so it can remove all jobs for that node that it still believes are running.
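
The two behaviors above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `post_with_fallback`, the `Forwarder` class, its method names, and the payload shapes are all hypothetical, and the real client and forwarder communicate over a transport that is not modeled here. Only the limit of 5 attempts comes from the description.

```python
from typing import Callable, Dict, Optional, Set

MAX_ATTEMPTS = 5  # attempt limit taken from the PR description


def post_with_fallback(post: Callable[[str, dict], None], job_id: str, result: dict) -> str:
    """Post the full result; if that keeps failing, post a small error payload instead.

    `post` is a hypothetical callable standing in for the real client call.
    Returns "result" or "error" depending on which payload was delivered.
    """
    last_error: Optional[Exception] = None
    for _ in range(MAX_ATTEMPTS):
        try:
            post(job_id, result)
            return "result"
        except Exception as exc:  # e.g. the result is too large
            last_error = exc

    # The full result could not be delivered, so fall back to a minimal message
    # that lets the forwarder mark the job as finished (with an error).
    fallback = {"status": "error", "message": f"failed to post result: {last_error}"}
    for _ in range(MAX_ATTEMPTS):
        try:
            post(job_id, fallback)
            return "error"
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"could not report job {job_id}") from last_error


class Forwarder:
    """Hypothetical in-memory model of the forwarder's running-job bookkeeping."""

    def __init__(self) -> None:
        self.running: Dict[str, Set[str]] = {}  # node id -> job ids believed running

    def job_started(self, node_id: str, job_id: str) -> None:
        self.running.setdefault(node_id, set()).add(job_id)

    def node_started(self, node_id: str) -> Set[str]:
        """Called when a node (re)starts: drop every job still recorded for it."""
        return self.running.pop(node_id, set())
```

In this sketch a restarted node cannot leave stale entries behind, because `node_started` clears everything recorded for that node regardless of whether a result was ever posted.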

Possible downsides?

Are Docs needed?
