Commit 89fffb5

cg505 authored and AlexCuadron committed
[jobs] autodown managed job clusters (skypilot-org#4267)
* [jobs] autodown managed job clusters

  If all goes correctly, the managed job controller should tear down a managed job cluster once the managed job completes. However, if the controller fails somehow (e.g. crashes, is terminated, etc), we don't want to leak resources. As a failsafe, set autodown on the job cluster. This is not foolproof, since the skylet on the cluster can also crash, but it's likely to catch many cases.

* add comment about autodown duration

* add leading _
1 parent 3525689 commit 89fffb5
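
The failsafe described in the commit message can be sketched with a toy model. This is an illustration of the reasoning only, not SkyPilot code: `ToyCluster` and `run_job` are hypothetical names, and the "skylet" is reduced to a boolean flag that says whether the on-cluster agent is alive to enforce autodown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a job cluster; not a SkyPilot class."""
    autodown_minutes: Optional[float]  # None means no autodown configured
    idle_minutes: float = 0.0
    alive: bool = True

    def tick(self, minutes: float, skylet_alive: bool = True) -> None:
        """Advance idle time; autodown fires only if the agent is alive."""
        if not self.alive:
            return
        self.idle_minutes += minutes
        if (skylet_alive and self.autodown_minutes is not None and
                self.idle_minutes >= self.autodown_minutes):
            self.alive = False


def run_job(cluster: ToyCluster,
            controller_alive: bool,
            skylet_alive: bool = True,
            wait_minutes: float = 10.0) -> bool:
    """Return True if the cluster is leaked (still alive) after the job."""
    if controller_alive:
        # Normal path: the controller tears the cluster down itself.
        cluster.alive = False
    else:
        # Controller is gone: only the cluster-side autodown can clean up.
        cluster.tick(wait_minutes, skylet_alive=skylet_alive)
    return cluster.alive
```

In this model, a cluster without autodown leaks when the controller dies, a cluster with a 5-minute autodown does not, and a cluster whose agent has also crashed still leaks. That last case matches the "not foolproof" caveat above.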

1 file changed, +16 -5 lines changed

sky/jobs/recovery_strategy.py

+16 -5
@@ -36,6 +36,11 @@
 # 10 * JOB_STARTED_STATUS_CHECK_GAP_SECONDS = 10 * 5 = 50 seconds
 MAX_JOB_CHECKING_RETRY = 10
 
+# Minutes to job cluster autodown. This should be significantly larger than
+# managed_job_utils.JOB_STATUS_CHECK_GAP_SECONDS, to avoid tearing down the
+# cluster before its status can be updated by the job controller.
+_AUTODOWN_MINUTES = 5
+
 
 def terminate_cluster(cluster_name: str, max_retry: int = 3) -> None:
     """Terminate the cluster."""
@@ -302,11 +307,17 @@ def _launch(self,
                 usage_lib.messages.usage.set_internal()
                 # Detach setup, so that the setup failure can be detected
                 # by the controller process (job_status -> FAILED_SETUP).
-                sky.launch(self.dag,
-                           cluster_name=self.cluster_name,
-                           detach_setup=True,
-                           detach_run=True,
-                           _is_launched_by_jobs_controller=True)
+                sky.launch(
+                    self.dag,
+                    cluster_name=self.cluster_name,
+                    # We expect to tear down the cluster as soon as the job is
+                    # finished. However, in case the controller dies, set
+                    # autodown to try and avoid a resource leak.
+                    idle_minutes_to_autostop=_AUTODOWN_MINUTES,
+                    down=True,
+                    detach_setup=True,
+                    detach_run=True,
+                    _is_launched_by_jobs_controller=True)
                 logger.info('Managed job cluster launched.')
             except (exceptions.InvalidClusterNameError,
                     exceptions.NoCloudAccessError,
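
The comment on `_AUTODOWN_MINUTES` asks that the autodown window be "significantly larger" than `managed_job_utils.JOB_STATUS_CHECK_GAP_SECONDS`. A minimal sketch of that arithmetic follows. The helper `autodown_margin` is hypothetical, and the 20-second gap is an assumed value for illustration only, since this commit does not state the real one.

```python
def autodown_margin(autodown_minutes: float, check_gap_seconds: float) -> float:
    """Return how many status-check intervals fit in the autodown window."""
    if autodown_minutes <= 0 or check_gap_seconds <= 0:
        raise ValueError('Both durations must be positive.')
    return autodown_minutes * 60 / check_gap_seconds


_AUTODOWN_MINUTES = 5  # value taken from the diff above
ASSUMED_CHECK_GAP_SECONDS = 20  # assumption; not stated in this commit

margin = autodown_margin(_AUTODOWN_MINUTES, ASSUMED_CHECK_GAP_SECONDS)
print(f'Autodown window covers {margin:.0f} status checks.')
```

Under the assumed gap, the controller would get several polling rounds to record the job's final status before the cluster could tear itself down. A margin close to 1 would risk the race the code comment warns about.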
