[Core] User ray cluster causes SkyPilot cluster in INIT state #2020

Merged: @Michaelvll merged 8 commits into master from multi-ray-cluster on Jun 4, 2023

Conversation

@Michaelvll (Collaborator) commented on Jun 3, 2023:

Fixes #2019

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • pytest tests/test_smoke.py::test_user_ray_cluster
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll changed the title from "[Core] User ray cluster cause cluster in INIT state" to "[Core] User ray cluster causes cluster in INIT state" on Jun 3, 2023
@Michaelvll changed the title from "[Core] User ray cluster causes cluster in INIT state" to "[Core] User ray cluster causes SkyPilot cluster in INIT state" on Jun 3, 2023

@concretevitamin (Member) left a comment:

Thanks @Michaelvll! Some questions.

@@ -139,6 +139,10 @@
('available_node_types', 'ray.worker.default', 'node_config', 'UserData'),
]

_SET_RAY_ADDRESS = ('RAY_PORT=$(python -c "from sky.skylet import job_lib; '
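
For context, the full prefix roughly expands to the following; this is a reconstruction pieced together from the snippets in this review (the constant is later renamed to RAY_STATUS_WITH_SKY_RAY_PORT_CMD), not the verbatim source:

# Pins `ray status` to the SkyPilot-internal Ray cluster instead of a
# user-started one; falls back to the legacy port 6379 if job_lib cannot
# be queried on the remote machine.
_SET_RAY_ADDRESS = ('RAY_PORT=$(python -c "from sky.skylet import job_lib; '
                    'print(job_lib.get_ray_port())" 2> /dev/null || echo 6379);'
                    'RAY_ADDRESS=127.0.0.1:$RAY_PORT ')
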
concretevitamin (Member):

Should we apply this to the only other ray status call site?


sky/backends/cloud_vm_ray_backend.py
1711:            'ray status',

Michaelvll (Collaborator, Author):

Ahh, good catch! Just fixed. Thanks!

@@ -1968,7 +1972,7 @@ def _update_cluster_status_no_lock(
ssh_credentials = ssh_credential_from_yaml(handle.cluster_yaml)
runner = command_runner.SSHCommandRunner(external_ips[0],
**ssh_credentials)
-rc, output, _ = runner.run('ray status',
+rc, output, _ = runner.run(f'{_SET_RAY_ADDRESS}ray status',
concretevitamin (Member):

Curious: Why was there an INIT transition? Is it because originally here we queried the user-level ray cluster, and the _count_healthy_nodes_from_ray() below somehow didn't show 1 node being healthy? What's the buggy output from _count_healthy_nodes_from_ray()?

Michaelvll (Collaborator, Author):

The problem is caused by _count_healthy_nodes_from_ray: the user's ray status output is as follows, and it does not contain the SkyPilot node names, i.e. ray_head_default/ray_worker_default.

2023-06-04 05:48:34,791 WARNING services.py:405 -- Found multiple active Ray instances: {'10.128.0.47:6380', '10.128.0.47:7379'}. Connecting to latest cluster at 10.128.0.47:7379. You can override this by setting the `--address` flag or `RAY_ADDRESS` environment variable.
======== Autoscaler status: 2023-06-04 05:48:32.064004 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_b9d99c9b660a0266a95b423ab4647c52eb3d888299de4f315d4f4375
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/4.0 CPU
 0B/8.33GiB memory
 0B/4.17GiB object_store_memory

Demands:
 (no resource demands)
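
To illustrate the failure mode (a hypothetical sketch, not the actual _count_healthy_nodes_from_ray implementation): a parser that counts healthy nodes by matching the SkyPilot node-type names finds zero matches in the output above, because the user-started cluster reports a raw node ID rather than ray_head_default / ray_worker_default.

import re

def count_healthy_skypilot_nodes(ray_status_output: str) -> int:
    # Hypothetical sketch: count only healthy entries that mention the
    # SkyPilot node types. For the user-cluster output above, the healthy
    # line is '1 node_b9d9...', so this returns 0 and the cluster is
    # treated as unhealthy, ending up in INIT.
    healthy = ray_status_output.split('Healthy:')[-1].split('Pending:')[0]
    return len(re.findall(r'ray[._]head[._]default|ray[._]worker[._]default',
                          healthy))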

@@ -330,6 +330,14 @@ def get_job_submission_port():
return port


def get_ray_port():
    port_path = os.path.expanduser(constants.SKY_REMOTE_RAY_PORT_FILE)
concretevitamin (Member):

Nit: Can we document in constants.py what SKY_REMOTE_RAY_PORT_FILE means exactly? Does it always contain the currently active SkyPilot-internal Ray cluster's port (for old clusters, 6379; newer ones, 6380)?

Michaelvll (Collaborator, Author):

Added the comment.

Yes, we dump the info right after ray start is called in the ray YAML file. The file will not exist if our ray cluster is on the old port 6379, because those clusters would have been launched before #1790. That is why we have the file-existence check and fall back to the default port. Added a docstring for these two functions as well.
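
A minimal sketch of the fallback described above, assuming the port file is a small JSON file at ~/.sky/ray_port.json with a ray_port key (as the jq example later in this thread suggests); not necessarily the exact implementation:

import json
import os

# Assumed path of the port file dumped when `ray start` runs; clusters
# launched before #1790 will not have it.
SKY_REMOTE_RAY_PORT_FILE = '~/.sky/ray_port.json'

def get_ray_port() -> int:
    """Return the port of the SkyPilot-internal Ray cluster.

    Falls back to the legacy default 6379 when the port file does not
    exist, i.e. for clusters launched before the port change.
    """
    port_path = os.path.expanduser(SKY_REMOTE_RAY_PORT_FILE)
    if not os.path.exists(port_path):
        return 6379
    with open(port_path, 'r', encoding='utf-8') as f:
        return int(json.load(f)['ray_port'])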

@infwinston (Member):
Ah sorry, I only tested running Ray applications, not starting a Ray server.
@Michaelvll thanks for fixing this!

@Michaelvll (Collaborator, Author):

@infwinston No worries at all! I didn't notice that during the review either.

Tested again:

  • pytest tests/test_smoke.py::test_user_ray_cluster

@concretevitamin (Member) left a comment:

LGTM, with a minor concern about any overheads.

'RAY_ADDRESS=127.0.0.1:$RAY_PORT ')
RAY_STATUS_WITH_SKY_RAY_PORT_CMD = (
'RAY_PORT=$(python -c "from sky.skylet import job_lib; '
'print(job_lib.get_ray_port())" 2> /dev/null || echo 6379);'
concretevitamin (Member):

Does this extra Python invocation add latency to sky launch or status -r?

@Michaelvll (Collaborator, Author) commented on Jun 4, 2023:

The extra Python invocation adds about 0.75s of overhead to the original ray status call (1.25s -> 2.0s), based on comparing the following two commands:

time ray status
time { RAY_PORT=$(python -u -c  "from sky.skylet import job_lib; print(job_lib.get_ray_port())" 2> /dev/null || echo 6379);RAY_ADDRESS=127.0.0.1:$RAY_PORT ray status; }

It seems that if we use the jq command @infwinston used before, the overhead is negligible, but jq may not exist in some images.

$ time { RAY_PORT=$(jq .ray_port ~/.sky/ray_port.json 2> /dev/null || echo 6379);RAY_ADDRESS=127.0.0.1:$RAY_PORT ray status; }

concretevitamin (Member):

It seems like a lot standalone, but considering that sky launch and status -r take minutes and tens of seconds respectively, 0.75s × x seems OK?

Michaelvll (Collaborator, Author):

Yes, it might not be that significant compared to the total time. I am going to merge this for now, and we can think about optimizing it later, if needed. :)

@Michaelvll force-pushed the multi-ray-cluster branch from fa9acc1 to c73c9ae on June 4, 2023 07:04
@Michaelvll merged commit e31c004 into master on Jun 4, 2023
@Michaelvll deleted the multi-ray-cluster branch on June 4, 2023 07:40
concretevitamin added a commit that referenced this pull request Jun 4, 2023
* Fix ray status

* Add test

* format

* address comments

* format

* Update sky/backends/backend_utils.py

Co-authored-by: Zongheng Yang <[email protected]>

* address comments

* format

---------

Co-authored-by: Zongheng Yang <[email protected]>
concretevitamin added another commit that referenced this pull request on Jun 4, 2023 (same squashed commit message as above).