-
Notifications
You must be signed in to change notification settings - Fork 633
[GCP] Add retry for transient error during launching GCP clusters #2669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @Michaelvll.
if 'Head node fetch timed out' in stderr: | ||
# Example: click.exceptions.ClickException: Head node fetch | ||
# timed out. Failed to create head node. | ||
# This is a transient error, but we have retried in need_ray_up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit/Q: is this a transient error from the cloud, or due to certain image sizes being too large (which then makes the timeout in ray flaky)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is likely the issue is with the cloud, as ray will get through the waiting loop, once the VM goes into PROVISIONING
state and has the tag set, while the image setup should probably happen in the next state STAGING
. https://cloud.google.com/compute/docs/instances/instance-life-cycle
I have seen multiple times when I sky launch --gpus A100:8
, our program gets the IP address for the VM, while the VM just got deleted by GCP on the console.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
if 'Head node fetch timed out' in stderr: | ||
# Example: click.exceptions.ClickException: Head node fetch | ||
# timed out. Failed to create head node. | ||
# This is a transient error, but we have retried in need_ray_up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG
…ypilot-org#2669) * Add retry for flaky error during launching GCP clusters * handle error * format * Do not log out stderr * Add retry for gcloud crash * fix retry return code
GCP generats some flaky error while launching causing #2666. We should add retry for the
ray up
during launching as well as the blocklist update for errors.Tested (run the relevant ones):
bash format.sh
sky launch --cloud gcp --gpus A100:8 --image-id projects/deeplearning-platform-release/global/images/pytorch-2-0-gpu-v20230822-ubuntu-2004-py310 --down --retry-until-up
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh