Auto merge of #1803 - sgrif:sg-environment-variable-for-background-timeout, r=jtgeibel
Configure the background job timeout via an environment variable
An incident was caused by #1798. There is a description below if you're
interested, but this PR does not fix the underlying problem. However, the
band-aid fix to get things running again is to increase the timeout for the
job runner. When responding to an incident, waiting for a full rebuild to
change this value is not acceptable. This replaces the hard-coded value with
an environment variable so we can quickly change it on the fly in the
future.
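
As a rough illustration, the timeout can be read from the environment with a
small helper along these lines (the variable name `BACKGROUND_JOB_TIMEOUT`
and the 10-second default are assumptions for this sketch, not necessarily
what the actual change uses):

```rust
use std::env;
use std::time::Duration;

/// Read the job runner timeout from the environment, falling back to a
/// default when the variable is unset or unparsable.
fn job_runner_timeout() -> Duration {
    const DEFAULT_SECS: u64 = 10; // assumed default, for illustration only
    let secs = env::var("BACKGROUND_JOB_TIMEOUT")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(DEFAULT_SECS);
    Duration::from_secs(secs)
}
```

With something like this in place, the timeout can be changed by setting the
environment variable on the running app rather than shipping a new build.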
Description of the actual problem that this does not fix
---
The problem was that the `update_downloads` job takes longer than the
timeout we had set for jobs to begin running. So swirl would start the
`update_downloads` job, try to spawn another worker, and then time out
waiting to hear from that worker whether it had picked up a job. So we would
crash the process, the job would be left incomplete, and the whole thing
would start over again.
There are several real fixes for this, and I will open a PR that is some
combination of all of them. Ultimately each of these fixes just increases
the number of slow concurrent jobs that can be run before we hit the
timeout and the problem re-appears, but that's fundamentally always
going to be the case... If we are getting more jobs than we can process,
we do need to get paged so we can remedy the situation. Still, any or
all of these will be the "real" fix:
- Increasing the number of concurrent jobs
- Increasing the timeout
- Re-building the runner before crashing (a rough sketch of this follows the list)
  - The reason this would fix the issue is that by not crashing the
    process, we give the spawned threads a chance to finish. We do still
    want to *eventually* crash the process, as there might be something
    inherent to this process or machine preventing the jobs from
    running, but starting with a new thread/connection pool a few times
    gives things a better chance to recover on their own.
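
For the "re-build before crashing" idea, a minimal sketch of the control flow
might look like the following. The `Runner` type, its `run_all_pending_jobs`
method, and the error type here are stand-ins, not the real swirl API:

```rust
/// Stand-in for the real job runner type; only the control flow matters here.
struct Runner;

#[derive(Debug)]
struct RunnerError(String);

impl Runner {
    fn run_all_pending_jobs(&self) -> Result<(), RunnerError> {
        // ... run pending jobs, returning Err on a timeout or other failure
        Ok(())
    }
}

/// Rebuild the runner (and its thread/connection pool) a few times before
/// letting the process crash, giving in-flight jobs a chance to finish.
fn run_with_rebuilds(build_runner: impl Fn() -> Runner, max_rebuilds: usize) {
    for attempt in 0..=max_rebuilds {
        match build_runner().run_all_pending_jobs() {
            Ok(()) => return,
            Err(e) if attempt < max_rebuilds => {
                eprintln!("runner error (attempt {attempt}): {e:?}; rebuilding");
            }
            Err(e) => panic!("runner failed after {max_rebuilds} rebuilds: {e:?}"),
        }
    }
}
```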