Description
The current implementation of os.proc.call's timeout flag makes two consecutive calls, p.destroy() followed by p.forciblyDestroy(). The effect is to send SIGTERM to the process and then immediately send SIGKILL.
The roles of these two signals are:
- SIGTERM: instruct the process to terminate; the process may intercept this and perform necessary clean-up operations, or may decide to ignore it entirely
- SIGKILL: instruct the process to terminate immediately -- this signal cannot be intercepted.
By sending these two signals back-to-back, the parent process creates a race condition between how quickly the child can run its SIGTERM handler and clean up resources, and the delivery of SIGKILL. In my experiments, I've found that SIGKILL is the cause of process exit the vast majority of the time. This means that the (potentially necessary) clean-up of the process is often not performed or, worse, is interrupted partway. If the handler contains code to write file contents back to disk, modify a database, and so on, these operations may be corrupted. If the child process itself has children that need terminating, the signal to terminate them may never be issued, leading to the parent process hanging.
What are the possible desired outcomes?
There are three ways a timeout could reasonably terminate the process:
- Only send SIGTERM: it doesn't matter how long it takes, we need to ensure safe clean-up
- Only send SIGKILL: the process has no important state, it should be terminated immediately
- Send SIGTERM, wait an appropriate amount of time, then send SIGKILL: we want to offer the process an opportunity to clean up, but if this takes too long (perhaps the clean-up process itself is hanging), we want to forcibly terminate -- this is the scenario attempted by os.lib, albeit without allowing sufficient time for the handler to run.
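
The three outcomes above can be sketched directly against `java.lang.Process`. This is only an illustration, not os.lib's code: the `TimeoutTermination` class, the `Policy` enum and the `gracePeriodMillis` parameter are hypothetical names. It also assumes a POSIX platform, where `destroy()` typically sends SIGTERM and `destroyForcibly()` sends SIGKILL (the JDK leaves this implementation dependent).

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from os.lib.
class TimeoutTermination {

    /** Illustrative names for the three outcomes listed above. */
    enum Policy { TERM_ONLY, KILL_ONLY, TERM_THEN_KILL }

    /**
     * Terminates a process that has exceeded its timeout.
     * gracePeriodMillis is only consulted for TERM_THEN_KILL.
     * Returns true if destroyForcibly() had to be called.
     */
    static boolean terminate(Process p, Policy policy, long gracePeriodMillis)
            throws InterruptedException {
        switch (policy) {
            case TERM_ONLY:
                p.destroy();         // typically SIGTERM on POSIX
                p.waitFor();         // wait however long clean-up takes
                return false;
            case KILL_ONLY:
                p.destroyForcibly(); // typically SIGKILL on POSIX
                p.waitFor();
                return true;
            default:                 // TERM_THEN_KILL
                p.destroy();
                // Give the SIGTERM handler a chance to finish before escalating.
                if (p.waitFor(gracePeriodMillis, TimeUnit.MILLISECONDS)) {
                    return false;    // exited within the grace period
                }
                p.destroyForcibly();
                p.waitFor();
                return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumes a `sleep` binary is on the PATH.
        Process p = new ProcessBuilder("sleep", "60").start();
        boolean forced = terminate(p, Policy.TERM_THEN_KILL, 2000);
        System.out.println("forced=" + forced + " alive=" + p.isAlive());
    }
}
```

The key difference from back-to-back calls is the bounded `waitFor` between the two signals: the forced kill is issued only if the process is still alive once the grace period has elapsed.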
What is normally done?
SIGKILL is useful when a process does not respond to SIGTERM in a timely fashion, so the two are usually sent as a pair with a delay in between. The Linux timeout command offers this with the -k n flag, which sends SIGKILL if the process is still running n seconds after the original timeout sent SIGTERM.
Solutions
The race condition caused by the consecutive calls to destroy and forciblyDestroy is a bug, and could be addressed by supporting a similar mechanism that lets the caller choose outcome (1), (2), or (3) safely.
For backwards compatibility, however, it might be wise to just support (1) with the current system.