[Core] User ray cluster causes SkyPilot cluster in INIT state #2020
Conversation
Thanks @Michaelvll! Some questions.
sky/backends/backend_utils.py (Outdated)
@@ -139,6 +139,10 @@
    ('available_node_types', 'ray.worker.default', 'node_config', 'UserData'),
]

_SET_RAY_ADDRESS = ('RAY_PORT=$(python -c "from sky.skylet import job_lib; '
Should we apply this to the only other `ray status` call site?
sky/backends/cloud_vm_ray_backend.py:1711: 'ray status',
Ahh, good catch! Just fixed. Thanks!
sky/backends/backend_utils.py (Outdated)
@@ -1968,7 +1972,7 @@ def _update_cluster_status_no_lock(
        ssh_credentials = ssh_credential_from_yaml(handle.cluster_yaml)
        runner = command_runner.SSHCommandRunner(external_ips[0],
                                                 **ssh_credentials)
-       rc, output, _ = runner.run('ray status',
+       rc, output, _ = runner.run(f'{_SET_RAY_ADDRESS}ray status',
Curious: Why was there an INIT transition? Is it because originally here we queried the user-level ray cluster, and the _count_healthy_nodes_from_ray() below somehow didn't show 1 node being healthy? What's the buggy output from _count_healthy_nodes_from_ray()?
The problem is caused by `_count_healthy_nodes_from_ray`: the user's `ray status` output is as follows, and it does not contain the SkyPilot node names, i.e. `ray_head_default`/`ray_worker_default`:
2023-06-04 05:48:34,791 WARNING services.py:405 -- Found multiple active Ray instances: {'10.128.0.47:6380', '10.128.0.47:7379'}. Connecting to latest cluster at 10.128.0.47:7379. You can override this by setting the `--address` flag or `RAY_ADDRESS` environment variable.
======== Autoscaler status: 2023-06-04 05:48:32.064004 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_b9d99c9b660a0266a95b423ab4647c52eb3d888299de4f315d4f4375
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/4.0 CPU
0B/8.33GiB memory
0B/4.17GiB object_store_memory
Demands:
(no resource demands)
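To illustrate the failure mode, here is a rough sketch of a name-based healthy-node count (not the actual `_count_healthy_nodes_from_ray` implementation; the node-name patterns are assumptions taken from the discussion above). Against the user-cluster output above it finds zero SkyPilot nodes, because the entries read `node_<hash>` rather than `ray_head_default`/`ray_worker_default`:

```python
import re


def count_healthy_sky_nodes(ray_status_output: str) -> int:
    """Hypothetical sketch: count healthy SkyPilot nodes by node-type name."""
    healthy = 0
    in_healthy = False
    for line in ray_status_output.splitlines():
        line = line.strip()
        if line.startswith('Healthy:'):
            in_healthy = True
            continue
        if line.startswith(('Pending:', 'Recent failures:')):
            in_healthy = False
            continue
        # Expected entries look like "1 ray_head_default" / "2 ray_worker_default".
        m = re.match(r'(\d+)\s+ray[._](head|worker)[._]default$', line)
        if in_healthy and m:
            healthy += int(m.group(1))
    return healthy


# The user-level cluster's output above lists "1 node_b9d99c9b..." instead of
# "1 ray_head_default", so a count keyed on the SkyPilot node names returns 0
# and the cluster looks unhealthy, hence the INIT transition.
```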
@@ -330,6 +330,14 @@ def get_job_submission_port():
    return port


def get_ray_port():
    port_path = os.path.expanduser(constants.SKY_REMOTE_RAY_PORT_FILE)
Nit: Can we document in constants.py what SKY_REMOTE_RAY_PORT_FILE means exactly? Does it always contain the currently active SkyPilot-internal Ray cluster's port (for old clusters, 6379; newer ones, 6380)?
Added the comment.
Yes, we dump the info right after `ray start` is called in the ray YAML file. The file will not exist if our Ray cluster is on the old port 6379, because those clusters were launched before #1790. That is why we have the file-existence check and fall back to the default port. Added a docstring for these two functions as well.
Ah sorry, I only tested running Ray applications, not starting a Ray server.
@infwinston No worries at all! I didn't notice that during the review either. Tested again:
LGTM, with a minor concern about overhead.
    'RAY_ADDRESS=127.0.0.1:$RAY_PORT ')
RAY_STATUS_WITH_SKY_RAY_PORT_CMD = (
    'RAY_PORT=$(python -c "from sky.skylet import job_lib; '
    'print(job_lib.get_ray_port())" 2> /dev/null || echo 6379);'
Does this extra Python invocation add latency to `sky launch` or `status -r`?
The extra Python invocation adds about 0.75s of overhead to the original `ray status` call (1.25s -> 2.0s), comparing the following two commands:
time ray status
time { RAY_PORT=$(python -u -c "from sky.skylet import job_lib; print(job_lib.get_ray_port())" 2> /dev/null || echo 6379);RAY_ADDRESS=127.0.0.1:$RAY_PORT ray status; }
It seems that if we use the `jq` command @infwinston used before, the overhead is negligible, but `jq` may not exist in some images.
$ time { RAY_PORT=$(jq .ray_port ~/.sky/ray_port.json 2> /dev/null || echo 6379);RAY_ADDRESS=127.0.0.1:$RAY_PORT ray status; }
Seems like a lot standalone, but considering `sky launch` or `status -r` take minutes or tens of seconds, 0.75s * x seems OK?
Yes, it might not be that significant compared to the total time. I am going to merge this for now, and we can think of optimization later, if needed. : )
Co-authored-by: Zongheng Yang <[email protected]>
Force-pushed from fa9acc1 to c73c9ae
* Fix ray status
* Add test
* format
* address comments
* format
* Update sky/backends/backend_utils.py

Co-authored-by: Zongheng Yang <[email protected]>

* address comments
* format

---------

Co-authored-by: Zongheng Yang <[email protected]>
Fixes #2019
Tested (run the relevant ones):
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
pytest tests/test_smoke.py::test_user_ray_cluster
bash tests/backward_comaptibility_tests.sh