
s1gr1d
Member

@s1gr1d s1gr1d commented Sep 10, 2025

Previously, our Cloudflare Workflows instrumentation created a new Sentry client inside every step.do call. This reset the tracing context: a custom span started with startSpan around multiple step.do calls would finish under a different client/scope than its children.

This change initializes Sentry once per workflow run and preserves the active scope across steps. We capture the current scope in step.do and pass it to startSpan.

A new test was added to check that a step.do span becomes a child of a surrounding custom span.
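
To illustrate the parenting change, here is a minimal toy model in TypeScript. It is a sketch under stated assumptions, not the SDK's code: `ToyTracer`, `ToyScope`, `stepDoWithFreshScope`, and `stepDoWithCapturedScope` are hypothetical names that only mimic the idea of capturing the active scope when step.do is called and handing it to startSpan.

```typescript
// Toy model only: these types and methods are NOT the Sentry SDK API.
interface ToySpan {
  name: string;
  parentName: string | undefined;
}

interface ToyScope {
  activeSpan: ToySpan | undefined;
}

class ToyTracer {
  currentScope: ToyScope = { activeSpan: undefined };
  finished: ToySpan[] = [];

  // Starts a span whose parent is whatever span is active on the given scope.
  async startSpan<T>(name: string, scope: ToyScope, callback: () => Promise<T>): Promise<T> {
    const span: ToySpan = { name, parentName: scope.activeSpan?.name };
    const previous = this.currentScope;
    this.currentScope = { activeSpan: span };
    try {
      return await callback();
    } finally {
      this.currentScope = previous;
      this.finished.push(span);
    }
  }

  // Models the old behaviour: every step starts from a fresh scope,
  // as it would with a newly created client, so the parent is lost.
  stepDoWithFreshScope<T>(name: string, callback: () => Promise<T>): Promise<T> {
    return this.startSpan(name, { activeSpan: undefined }, callback);
  }

  // Models the new behaviour: capture the scope that is active when the
  // step is called and pass it on, so the step span keeps its parent.
  stepDoWithCapturedScope<T>(name: string, callback: () => Promise<T>): Promise<T> {
    const scope = this.currentScope;
    return this.startSpan(name, scope, callback);
  }
}

async function demo(): Promise<ToySpan[]> {
  const tracer = new ToyTracer();
  await tracer.startSpan('custom span', tracer.currentScope, async () => {
    await tracer.stepDoWithFreshScope('step (fresh scope)', async () => undefined);
    await tracer.stepDoWithCapturedScope('step (captured scope)', async () => undefined);
  });
  return tracer.finished;
}
```

In this model, `demo()` resolves to three spans: the fresh-scope step has no parent, while the captured-scope step reports `custom span` as its parent.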

fixes #17419

ref (for being able to test this): cloudflare/workerd#5030

Multiple step.do with a surrounding startSpan call will now result in this:
[image]

@timfish
Collaborator

timfish commented Sep 10, 2025

I don't think this will work reliably in production.

Cloudflare workflows don't run the way the code might suggest. A workflow can be suspended between steps and retries, at which point all in-memory state is lost. The only state that is persisted is the return values from step.do, which allow workflows to continue where they left off.

We derive the traceId (and sample random) from the workflow instanceId as this is the only reliable shared data between step runs and the only way we can reliably link the steps.
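
As a rough sketch of what deriving trace data from the instance ID can look like: the helper below hashes the instance ID and reuses the digest for both values, so every step run of the same instance computes the same result without any shared memory. `deriveTraceContext` is a hypothetical name, and the choice of SHA-256 via `node:crypto` and of which digest characters feed which value are assumptions for illustration, not necessarily what the SDK does.

```typescript
import { createHash } from 'node:crypto';

interface DerivedTraceContext {
  traceId: string; // 32 lowercase hex characters
  sampleRand: number; // number in the range [0, 1)
}

// Hypothetical helper: the same instanceId always yields the same values,
// no matter which execution context computes them.
function deriveTraceContext(instanceId: string): DerivedTraceContext {
  const digest = createHash('sha256').update(instanceId).digest('hex');

  // First 128 bits of the digest become the trace ID.
  const traceId = digest.slice(0, 32);

  // The next 32 bits are scaled into [0, 1) to act as a stable sample random.
  const sampleRand = parseInt(digest.slice(32, 40), 16) / 2 ** 32;

  return { traceId, sampleRand };
}
```

Because the output depends only on the input string, two step runs that never share memory still agree on the trace they belong to and on the sampling decision.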

https://developers.cloudflare.com/workflows/build/rules-of-workflows/

Do not rely on state outside of a step

Workflows may hibernate and lose all in-memory state. This will happen when engine detects that there is no pending work and can hibernate until it needs to wake-up (because of a sleep, retry, or event).

This means that you should not store state outside of a step

This means that any spans created outside of step.do might not exist later. That is not to say that the code in this PR won't work much of the time; in fact, I suspect that in many cases it will. I'm just not sure we should be violating the workflow rules and creating an implementation that doesn't work reliably.

I'm seeking clarification from Cloudflare because I'm not even convinced the user's example will work reliably if the workflow gets suspended. If no state is preserved between the steps, how will it know how far it has iterated through page.item if it gets suspended?

class WorkflowIngestBase extends WorkflowEntrypoint<Env, Param> {
  async run(event, step): Promise<void> {
    const page = await step.do('fetch page', async () => {...})

    for (const item of page.item) {
      await step.do('process page item', async () => {...})
      await step.do('save item', async () => {...})
    }
  }
}

@timfish
Collaborator

timfish commented Sep 10, 2025

Ah ok, I just looked through the docs again and it looks like it must persist the iteration point because they have this as an example:

[image]

const config = typeof configOrCallback === 'function' ? undefined : configOrCallback;

const instrumentedCallback: () => Promise<T> = async () => {
  return workflowStepWithSentry(this._instanceId, this._options, async () => {
Member

Could it be a problem that we no longer create a Sentry instance here? 🤔 Is there always an existing Sentry instance outside of here...?

Collaborator

Yes, we always need to create a Sentry instance (or maybe check that one exists?) inside step.do because these can be run in a new execution context if the workflow gets suspended.

Member Author

Under the assumption that run is called on every new invocation of the workflow (e.g. after hibernation), we should have a client here.

@s1gr1d
Member Author

s1gr1d commented Sep 10, 2025

Thanks Tim for the review! I am not 100% sure if this will really violate the rules here 🤔 The docs are not too clear about that.

This is how I understood it:

  • Every time a workflow runs (first invocation, after hibernation), it will call run
    • run is now the place where we init Sentry, so there should always be a client
  • We get the current scope from inside step.do and so it's only using the parenting logic inside the do invocation

@timfish
Collaborator

timfish commented Sep 10, 2025

If run was executed again after hibernation, it would have to skip over the steps that have already completed successfully. Maybe that is how it works: the steps that have already resolved successfully just return the previously serialised results? That would explain how it is able to cope with for (const a of arr) and why step names need to be deterministic.

Ok, so I'm more convinced that run will be re-run after hibernation, but that still leaves us with some issues. If run gets called multiple times and users create spans in there, those spans will be re-created after any hibernation too!

@LuisDuarte1

Hey from the Workflows team 👋

Ok, so I'm more convinced that run will be re-run after hibernation, but that still leaves us with some issues. If run gets called multiple times and users create spans in there, those spans will be re-created after any hibernation too!

Yes, at the moment, every time "hibernation" happens, run is executed again from the start (but no work should be duplicated in step.* functions since they are idempotent). This also means that spans created in the top-level scope of run can be created more than once.

Not sure what would be the best way to fix this (in the specific context of the sentry-sdk) - but I would recommend only creating spans inside the step.do callback to avoid duplication.
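
The replay behaviour described above can be sketched with a small simulation. This is a toy model, not Cloudflare's engine: `ReplayingStep`, `simulateHibernation`, the counters, and the `name#n` cache key (used so that repeated step names inside a loop stay distinct) are all assumptions made for illustration. A counter incremented in `run` stands in for a span created outside of a step, and a counter incremented in the callbacks stands in for spans created inside step.do.

```typescript
// Toy step runner: completed steps return their persisted result instead of
// running the callback again. Not Cloudflare's actual implementation.
class ReplayingStep {
  private readonly persisted: Map<string, unknown>;
  private readonly occurrences = new Map<string, number>();

  constructor(persisted: Map<string, unknown>) {
    this.persisted = persisted;
  }

  async do<T>(name: string, callback: () => Promise<T>): Promise<T> {
    // Assumption: repeated names are told apart by their call order.
    const n = (this.occurrences.get(name) ?? 0) + 1;
    this.occurrences.set(name, n);
    const key = `${name}#${n}`;

    if (this.persisted.has(key)) {
      return this.persisted.get(key) as T; // replay: callback is skipped
    }
    const result = await callback();
    this.persisted.set(key, result);
    return result;
  }
}

interface Counters {
  spansOutsideSteps: number;
  spansInsideSteps: number;
}

async function run(step: ReplayingStep, counters: Counters): Promise<string[]> {
  counters.spansOutsideSteps++; // stands in for a span started directly in run

  const items = await step.do('fetch page', async () => {
    counters.spansInsideSteps++; // stands in for a span started inside step.do
    return ['a', 'b'];
  });

  const processed: string[] = [];
  for (const item of items) {
    const value = await step.do('process page item', async () => {
      counters.spansInsideSteps++;
      return item.toUpperCase();
    });
    processed.push(value);
  }
  return processed;
}

async function simulateHibernation() {
  const persisted = new Map<string, unknown>(); // survives "hibernation"
  const counters: Counters = { spansOutsideSteps: 0, spansInsideSteps: 0 };

  const first = await run(new ReplayingStep(persisted), counters);
  // Second call models run being re-executed from the start: all in-memory
  // state is new, only the persisted step results are shared.
  const second = await run(new ReplayingStep(persisted), counters);

  return { counters, first, second };
}
```

In this model both executions return the same results, the inside-step counter ends at 3 (one per executed callback), and the outside-step counter ends at 2, which mirrors why spans created at the top level of run can be duplicated while spans created inside the callback are not.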

Initializing Sentry at the start of every run function should be okay (if the spans generated are deterministic?). I personally think that the "Do not rely on state outside of a step" rule generally doesn't apply to things like o11y, and that this might be improved in the docs. 🔜 ™️

@s1gr1d
Member Author

s1gr1d commented Sep 11, 2025

Thanks for the clarification! 🙌

Sentry still starts the span inside step.do (see here) and the init happens in run, so the change in this PR should be fine.

I only added a test-case where a span is created around step.do (users could do that like in this issue).

Collaborator

@timfish timfish left a comment

Sounds like this will be a good improvement! It also has the bonus that it will catch exceptions thrown in run but outside of step.do, which would have been missed before?

We should possibly update the docs to discourage users from creating spans outside of step.do and explain why.

@s1gr1d s1gr1d merged commit 71880da into develop Sep 11, 2025
48 checks passed
@s1gr1d s1gr1d deleted the sig/cloudflare-workflow-fix branch September 11, 2025 11:25
s1gr1d added a commit to getsentry/sentry-docs that referenced this pull request Sep 11, 2025
## DESCRIBE YOUR PR

Adds a notice advising against creating spans outside of `step.do`,
as mentioned in this PR:
getsentry/sentry-javascript#17582

ref: getsentry/sentry-javascript#17419

Linked issue: [Cloudflare workflows] Dedicated span do not show up in error reports