[jobs] multi-user managed jobs #4787

Merged 10 commits on Feb 26, 2025

Collaborator

@cg505 cg505 commented Feb 21, 2025

Changes managed jobs to work like clusters

  • By default, sky jobs queue shows only your jobs; run sky jobs queue -u to show all users' jobs.
  • By default, sky jobs cancel -a cancels only your jobs; run sky jobs cancel -u to cancel all users' jobs.
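
For illustration, the default-versus--u behavior in the list above could be sketched as follows. This is a minimal sketch and not the code in this PR: JobRecord, user_hash, and filter_jobs are hypothetical names, and the real job table has more fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobRecord:
    """Hypothetical, simplified job row; not SkyPilot's actual schema."""
    job_id: int
    user_hash: str
    name: str


def filter_jobs(jobs: List[JobRecord],
                current_user_hash: str,
                all_users: bool = False) -> List[JobRecord]:
    """Return only the caller's jobs unless all_users is set (like -u)."""
    if all_users:
        return list(jobs)
    return [job for job in jobs if job.user_hash == current_user_hash]


jobs = [
    JobRecord(1, 'alice', 'train'),
    JobRecord(2, 'bob', 'eval'),
    JobRecord(3, 'alice', 'sweep'),
]
# Default: only the current user's jobs are visible.
mine = [job.job_id for job in filter_jobs(jobs, 'alice')]
# With the all-users switch: every job is visible.
everyone = [job.job_id for job in filter_jobs(jobs, 'alice', all_users=True)]
```

The same filter would apply before cancelling, so that a cancel-all request touches other users' jobs only when the all-users switch is given.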

Closes #4686.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • New smoke test
  • Managed jobs smoke tests
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cg505 cg505 marked this pull request as ready for review February 21, 2025 02:47
@cg505 cg505 force-pushed the conslidated-controller-user branch from ca5a529 to 79883fa Compare February 21, 2025 02:48
Collaborator

@Michaelvll Michaelvll left a comment

Thanks @cg505! Mostly looks good to me. Left several comments.

@@ -1044,11 +1072,14 @@ def generate_details(failure_reason: Optional[str]) -> str:
if not managed_job_status.is_terminal():
    status_str += f' (task: {current_task_id})'

job_id = job_hash[1] if tasks_have_user else job_hash
user_values = generate_user_values(job_tasks[0])
Collaborator

minor:

Suggested change
- user_values = generate_user_values(job_tasks[0])
+ user_values = get_user_columns(job_tasks[0])

@@ -3,7 +3,7 @@
# API server version, whenever there is a change in API server that requires a
# restart of the local API server or error out when the client does not match
# the server version.
- API_VERSION = '1'
+ API_VERSION = '2'
Collaborator

minor: Is it possible to raise an update error only when the user actually uses -u with sky jobs queue or sky jobs cancel?

Collaborator Author

I'm not sure; it may require some pydantic investigation. Do we think this is valuable enough to pursue?

Collaborator

It's fine for now, but once we have more users on client-server, it would be better to avoid frequent API version upgrades.
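
One way the idea in this thread might look, as a rough sketch only: reject a request on version grounds only when it actually uses the newer option. Everything here is an assumption for illustration (ApiVersionMismatchError, MIN_VERSION_FOR_ALL_USERS, check_request, and is_allowed are hypothetical names); it does not reflect how the API server compares versions today.

```python
class ApiVersionMismatchError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


# Assumption: the first API version whose server understands
# all-users requests.
MIN_VERSION_FOR_ALL_USERS = 2


def check_request(server_api_version: str, all_users: bool) -> None:
    """Raise only when the request needs a feature the server lacks."""
    if all_users and int(server_api_version) < MIN_VERSION_FOR_ALL_USERS:
        raise ApiVersionMismatchError(
            f'Server API version {server_api_version} does not support '
            'all-users requests; please upgrade the API server.')


def is_allowed(server_api_version: str, all_users: bool) -> bool:
    """Convenience wrapper that reports whether the request passes."""
    try:
        check_request(server_api_version, all_users)
    except ApiVersionMismatchError:
        return False
    return True
```

Under this sketch, an older server keeps serving requests that do not set the all-users option, and only requests that set it trigger the upgrade error.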

It was already bumped in an earlier commit in the PR.
@cg505 cg505 requested a review from Michaelvll February 21, 2025 22:37
@cg505
Collaborator Author

cg505 commented Feb 21, 2025

/quicktest-core

Instead of writing to an env file locally, we will just do it as part
of the run commands.
@cg505 cg505 force-pushed the conslidated-controller-user branch from e07011e to 054d020 Compare February 21, 2025 22:38
@cg505
Collaborator Author

cg505 commented Feb 21, 2025

/quicktest-core

Collaborator

@Michaelvll Michaelvll left a comment

@Michaelvll
Collaborator

btw, the quicktest-core's backward compatibility test is broken, see: #4741.

We should run backward compatibility tests manually.

@zpoint
Collaborator

zpoint commented Feb 23, 2025

We should run backward compatibility tests manually

Running them manually is also broken since the client-server merge. Fixed in #4741.

@Michaelvll
Collaborator

/quicktest-core

@cg505 cg505 force-pushed the conslidated-controller-user branch from 9977f3d to 74487b6 Compare February 25, 2025 23:29
@cg505
Collaborator Author

cg505 commented Feb 25, 2025

/quicktest-core

@cg505
Collaborator Author

cg505 commented Feb 25, 2025

/smoke-test -k test_multi_tenant

@cg505
Collaborator Author

cg505 commented Feb 25, 2025

/smoke-test --managed-jobs

@cg505 cg505 merged commit 8b95082 into skypilot-org:master Feb 26, 2025
20 checks passed
Development

Successfully merging this pull request may close these issues.

Consolidate jobs controller across multiple users
3 participants