Timer function replays when singleton lock lost #2968

@mathewc

Summary

There are some situations where a timer trigger function can be run again for the same schedule occurrence.

Details

As described here, timer trigger functions rely on a blob lease behind the scenes to ensure that only a single instance of the timer is running across scaled-out instances. In normal processing, an instance is long-lived: it holds the lease and monitors the schedule, invoking the function at the right time. The lease is renewed periodically in the background so that no other host runs the timer.

However, if the lease is lost during the invocation of the timer function, another host instance can start up, get the lease and start running the timer while the original invocation is still completing. A lease can be lost due to a host-level failure, resource exhaustion or thread starvation preventing the lease from being renewed, etc. Because of our graceful host shutdown, active function invocations aren't terminated; they're allowed to run to completion while the host is stopped. When the new instance gets the lease, it sees that the timer is behind schedule and starts running the function again. Thus, in cases like this it's possible to get two completed invocations for the same schedule occurrence.
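
The sequence above can be traced with a toy model. This is only an illustrative sketch: `Lease`, `simulate`, the 15-second lease duration, and the 30-second invocation time are hypothetical stand-ins, not the host's actual implementation or settings.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Lease:
    """Toy stand-in for the blob lease; times are plain integer seconds."""
    duration: int
    holder: Optional[str] = None
    expires_at: int = 0

    def try_acquire(self, host: str, now: int) -> bool:
        # The lease is free if nobody holds it or the holder let it expire.
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = host, now + self.duration
            return True
        return self.holder == host

    def renew(self, host: str, now: int) -> bool:
        # Renewal only succeeds while the caller still holds a live lease.
        if self.holder == host and now < self.expires_at:
            self.expires_at = now + self.duration
            return True
        return False


def simulate() -> List[Tuple[str, int]]:
    lease = Lease(duration=15)        # assumed duration, for illustration
    completed: List[Tuple[str, int]] = []
    last_recorded: Optional[int] = None  # last occurrence marked complete
    occurrence = 0                    # the schedule occurrence due at t=0

    # t=0: host A holds the lease and starts a 30-second invocation.
    assert lease.try_acquire("A", now=0)

    # t=20: A was starved and missed its renewals, so the lease has expired.
    assert not lease.renew("A", now=20)

    # t=20: host B starts, takes the lease, and finds the occurrence past due
    # because A has not recorded a completion yet.
    assert lease.try_acquire("B", now=20)
    b_runs = last_recorded != occurrence

    # t=30: A's invocation still completes (graceful shutdown lets it finish).
    completed.append(("A", occurrence))
    last_recorded = occurrence

    # t=50: B's replay of the same occurrence completes as well.
    if b_runs:
        completed.append(("B", occurrence))
    return completed


print(simulate())
```

Both hosts end up in the completed list for occurrence 0, which is the duplicate this issue describes.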

A related situation can happen if, rather than simply losing the lease, the entire VM goes down in the middle of the function invocation. Another instance will get the lease and start running that schedule occurrence (which in this case never completed). The first invocation is no longer running, but depending on how far it had progressed when the VM went down, it might have already sent some emails, etc. So you could do duplicate work in this case as well, even though the functions aren’t running concurrently. The general guidance is to write idempotent code that can handle retries (e.g. in queue processing, where the first invocation might fail partway through), which is effectively what this would be at that point.
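
One way to approach the idempotency guidance is to key each side effect by the schedule occurrence it belongs to. The following is a minimal sketch under stated assumptions: `OccurrenceLedger` and `occurrence_key` are hypothetical names, and the in-memory set stands in for a durable store shared by all instances (a real one would be needed across hosts).

```python
from datetime import datetime, timezone
from typing import Callable, Set


class OccurrenceLedger:
    """In-memory stand-in for a durable, shared store (hypothetical)."""

    def __init__(self) -> None:
        self._done: Set[str] = set()

    def run_once(self, key: str, step: Callable[[], None]) -> bool:
        """Run `step` unless `key` was already recorded; return True if run."""
        if key in self._done:
            return False
        step()
        # Recording after the step means a crash in between can still cause
        # one repeat, so the step itself should tolerate being repeated.
        self._done.add(key)
        return True


def occurrence_key(scheduled_for: datetime, step_name: str) -> str:
    # Use the scheduled time (not the actual start time) so that a replay of
    # the same occurrence on another host produces the same key.
    stamp = scheduled_for.astimezone(timezone.utc).isoformat()
    return f"{stamp}/{step_name}"


sent = []
ledger = OccurrenceLedger()
scheduled = datetime(2018, 6, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary example
key = occurrence_key(scheduled, "send-report-email")

first = ledger.run_once(key, lambda: sent.append("report"))
replay = ledger.run_once(key, lambda: sent.append("report"))
print(first, replay, sent)
```

The replayed call is skipped because it carries the same occurrence key, so the email step runs only once in this sketch.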

Our TimerTrigger listener lease ensures that the timer scheduler is only running on a single app instance. However, it does not appear to enforce a single invocation of the function in error cases like the ones above. One idea for ensuring that only a single instance of a timer function is running would be to bring the process down when the lease is lost, which would kill any running functions. Note that when using TimerTrigger in WebJobs classic, a lost lease will cause the process to go down - so this is only an issue in Functions because we're handling host errors and keeping the process alive. Bringing down the process would prevent duplicate completed invocations, which are worse than an invocation failing halfway through and then being retried (a retry is expected in that case). However, bringing the process down also has the drawback of terminating in-flight executions of other functions.

Mitigation/Solutions

Schedule monitoring can be disabled by setting "useMonitor": false on the timer trigger binding in the function.json for your timer function. See here for details. The host will then no longer check for past-due occurrences on startup, which prevents the concurrent executions described above.
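
As a rough illustration, a timer binding with monitoring disabled might look like the JSON printed below. It is generated from Python only to keep the examples on this page in one language; the `name` and `schedule` values are placeholders, not taken from any particular app.

```python
import json

# Timer trigger binding for function.json with schedule monitoring turned off.
timer_binding = {
    "type": "timerTrigger",
    "name": "myTimer",            # placeholder parameter name
    "direction": "in",
    "schedule": "0 */5 * * * *",  # placeholder: every five minutes
    "useMonitor": False,          # serialized as JSON false
}

function_json = json.dumps({"bindings": [timer_binding]}, indent=2)
print(function_json)
```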

Alternatively, if your function is written in a .NET-based language, modify it to accept a CancellationToken, check the token during execution, and pass it into async operations. That way, if a shutdown is in progress, you can abort the execution of your function.
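
The cooperative cancellation pattern can be sketched as follows. This is a Python analogue for illustration only, not the .NET API: `threading.Event` stands in for the CancellationToken, and `process_batch` is a hypothetical function body.

```python
import threading
from typing import List


def process_batch(items: List[str], shutdown: threading.Event) -> List[str]:
    """Process items one at a time, stopping early once shutdown is signaled."""
    processed = []
    for item in items:
        # Check the signal between units of work so that an invocation on a
        # host that is shutting down stops instead of running to completion.
        if shutdown.is_set():
            break
        processed.append(item.upper())
    return processed


shutdown = threading.Event()
before = process_batch(["a", "b"], shutdown)   # runs normally

shutdown.set()                                  # host signals shutdown
after = process_batch(["a", "b"], shutdown)    # stops before any work
print(before, after)
```

The finer-grained the checks, the sooner an invocation on a host that lost the lease stops doing work that another host may already be repeating.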

Also, check the resource consumption of your host instances. Lost leases should be very rare. However, if an instance is running into thread starvation or high CPU, the background lease renewals may fail. Throttling your functions to keep the instance under redline will solve these issues.
