Auto merge of #1803 - sgrif:sg-environment-variable-for-background-timeout, r=jtgeibel
Configure the background job timeout via an environment variable
An incident was caused by #1798. There is a description below if you're
interested, but this PR does not fix the underlying problem. However, the
band-aid fix to get things running again is to increase the timeout for the
job runner. When responding to an incident, waiting for a full rebuild to
change this value is not acceptable. This replaces the hard-coded value with
an environment variable so we can quickly change it on the fly in the
future.
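
As a rough illustration, the timeout can be read from the environment with a
small helper along these lines (the variable name `BACKGROUND_JOB_TIMEOUT`
and the 10-second default are assumptions for this sketch, not necessarily
what the actual change uses):

```rust
use std::env;
use std::time::Duration;

/// Read the job runner timeout from the environment, falling back to a
/// default when the variable is unset or unparsable.
fn job_runner_timeout() -> Duration {
    const DEFAULT_SECS: u64 = 10; // assumed default, for illustration only
    let secs = env::var("BACKGROUND_JOB_TIMEOUT")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(DEFAULT_SECS);
    Duration::from_secs(secs)
}
```

With something like this in place, the timeout can be changed by setting the
environment variable on the running app rather than shipping a new build.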
Description of the actual problem that this does not fix
---
The problem was that the `update_downloads` job takes longer than the
timeout we had set for jobs to begin running. So swirl would start the
`update_downloads` job, try to spawn another worker, and then time out
waiting to hear from that worker whether it had picked up a job. So we would
crash the process, the job would be left incomplete, and the whole thing
would start over again.
There are several real fixes for this, and I will open a PR that is some
combination of all of them. Ultimately each of these fixes just increases
the number of slow concurrent jobs that can be run before we hit the
timeout and the problem re-appears, but that's fundamentally always
going to be the case... If we are getting more jobs than we can process,
we do need to get paged so we can remedy the situation. Still, any or
all of these will be the "real" fix:
- Increasing the number of concurrent jobs
- Increasing the timeout
- Re-building the runner before crashing (a rough sketch of this follows the list)
  - The reason this would fix the issue is that by not crashing the
    process, we give the spawned threads a chance to finish. We do still
    want to *eventually* crash the process, as there might be something
    inherent to this process or machine preventing the jobs from
    running, but starting with a new thread/connection pool a few times
    gives things a better chance to recover on their own.
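
For the "re-build before crashing" idea, a minimal sketch of the control flow
might look like the following. The `Runner` type, its `run_all_pending_jobs`
method, and the error type here are stand-ins, not the real swirl API:

```rust
/// Stand-in for the real job runner type; only the control flow matters here.
struct Runner;

#[derive(Debug)]
struct RunnerError(String);

impl Runner {
    fn run_all_pending_jobs(&self) -> Result<(), RunnerError> {
        // ... run pending jobs, returning Err on a timeout or other failure
        Ok(())
    }
}

/// Rebuild the runner (and its thread/connection pool) a few times before
/// letting the process crash, giving in-flight jobs a chance to finish.
fn run_with_rebuilds(build_runner: impl Fn() -> Runner, max_rebuilds: usize) {
    for attempt in 0..=max_rebuilds {
        match build_runner().run_all_pending_jobs() {
            Ok(()) => return,
            Err(e) if attempt < max_rebuilds => {
                eprintln!("runner error (attempt {attempt}): {e:?}; rebuilding");
            }
            Err(e) => panic!("runner failed after {max_rebuilds} rebuilds: {e:?}"),
        }
    }
}
```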