[k8s] Enable multiple kubernetes contexts for failover #3968

Merged
merged 37 commits into master from multi-k8s-contexts on Sep 26, 2024

Conversation

Collaborator

@Michaelvll Michaelvll commented Sep 22, 2024

This allows users to specify the following in ~/.sky/config.yaml to enable SkyPilot to fail over across different Kubernetes contexts.

kubernetes:
  allowed_contexts:
    - kind-skypilot
    - gke_skypilot-xxx_us-central1-c_test-zhwu
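
For reference, the context names to put under allowed_contexts can be listed with kubectl config get-contexts, or with a short Python snippet like the one below; this is an illustrative helper using the kubernetes client, not part of this PR:

from kubernetes import config

# List every context defined in ~/.kube/config, plus the currently active one.
contexts, active_context = config.list_kube_config_contexts()
print('Available contexts:')
for ctx in contexts:
    print(' -', ctx['name'])
print('Active context:', active_context['name'])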

TODO:

  • Check if we should improve the UX output for region vs context
  • Multiple k8s clusters with different resources
  • show-gpus should show resources from all allowed_contexts (left for future)
  • Add example policy for dynamic k8s context update
  • Add tests for multi-k8s

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test --cloud kubernetes --cpus 4 echo hi with two k8s clusters, one whose nodes have fewer than 4 CPUs and one whose nodes have more than 4 CPUs; it correctly fails over from the first k8s cluster to the second one
    • Remove the larger k8s cluster's context name from allowed_contexts, and run sky exec / sky launch again on the existing SkyPilot cluster
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @Michaelvll! Took a quick look

allowed_contexts = skypilot_config.get_nested(
    ('kubernetes', 'allowed_contexts'), None)
if allowed_contexts is None:
    return cls._regions
Collaborator

[commentary, no action required] I am liking the idea of using regions (instead of clouds) to do multi-kubernetes. In the future, if we want to enable multi-k8s out of the box, we can simply return all contexts here :)
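
For illustration only, a minimal sketch of what "returning all contexts as regions" could look like; the function name and the use of the kubernetes client are assumptions, not the PR's actual implementation:

from typing import List, Optional

from kubernetes import config


def existing_allowed_contexts(
        allowed_contexts: Optional[List[str]]) -> List[str]:
    """Contexts SkyPilot may fail over through, as region-like entries.

    With no allowed_contexts configured, fall back to the single active
    context (the current behavior). Otherwise, keep only the allowed
    contexts that actually exist in the local kubeconfig, preserving the
    user's ordering so failover tries them in that order.
    """
    all_contexts, active_context = config.list_kube_config_contexts()
    existing = {c['name'] for c in all_contexts}
    if allowed_contexts is None:
        return [active_context['name']]
    return [name for name in allowed_contexts if name in existing]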

Collaborator Author

@Michaelvll Michaelvll Sep 23, 2024

Conceptually, I found it clearer to have the following mapping:
k8s contexts -> local cloud config profiles.
Because both of them contain:

  1. the identity to use for accessing the resource pool (k8s: user + namespace; cloud config: account)
  2. the resource pool to look at (k8s: cluster; cloud config: project to use)

I think the current way is a simple workaround for now, but we may need a better design in the future. The main confusion with using region may come from the fact that multiple contexts can map to the same k8s cluster with different namespaces or users.

Collaborator

Agreed, we probably need better solutions. I just realized many properties in the config may need to be updated in the near future to work well for multi-cluster (e.g., some contexts may need ports: ingress, while others may need ports: loadbalancer; same for other fields).

Collaborator Author

Yes, while updating the code to always show the context as the region, I realized that there are more places to be updated, especially the failover code in Kubernetes._get_feasible_launchable_resources. If we have two clusters with different resource sets, our failover will likely disregard all the Kubernetes clusters if the cluster without the resource is the current active context.

Marking this PR as a draft for now to fix this issue.
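
To make the issue concrete, here is a toy sketch (illustrative names only, not Kubernetes._get_feasible_launchable_resources itself) of checking feasibility per context instead of only against the active one:

from typing import Dict, List


def feasible_contexts(requested_cpus: float, allowed_contexts: List[str],
                      max_cpus_per_node: Dict[str, float]) -> List[str]:
    """Return the contexts whose largest node can fit the request.

    max_cpus_per_node maps a context name to the CPU count of its largest
    node and stands in for a real per-context resource query. Checking
    every allowed context avoids disregarding Kubernetes entirely just
    because the currently active context cannot fit the request.
    """
    return [
        ctx for ctx in allowed_contexts
        if max_cpus_per_node.get(ctx, 0) >= requested_cpus
    ]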

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @Michaelvll! Tested it and works nicely. Left some comments.

@Michaelvll Michaelvll marked this pull request as draft September 24, 2024 08:10
@Michaelvll
Collaborator Author

We realized that we need to update the code for checking resource feasibility on a Kubernetes cluster to support different contexts and make failover fully functional. Changed this PR to a draft for now to fix that issue.

Collaborator Author

Michaelvll commented Sep 25, 2024

Fixed the feasible-resources checking, and it seems to be working with multiple Kubernetes clusters containing different resources.

Test setup

sky local up and another GKE cluster with 2 nodes.

  1. Local k8s: labeled with L4 GPUs, with test-namespace as namespace
  2. GKE cluster: 1 node labeled with H100 GPU, with gke-namespace as namespace
  • local k8s as current context, sky launch --gpus h100 echo hi -- correctly goes to GKE and launches a machine
  • GKE as current context, sky launch --gpus h100 echo hi -- correctly goes to GKE and launches a machine
  • no allowed contexts specified, GKE as current context: sky launch --gpus h100 echo hi -- correctly goes to GKE and launches a machine
  • no allowed contexts specified, local k8s as current context: sky launch --gpus h100 echo hi -- does not launch on GKE
  • Two allowed contexts: create a cluster on the local k8s cluster, remove kind-skypilot from allowed_contexts, and exec again on the previous cluster.

TODO:

  • backward compatibility
    • On master, run sky launch -c test --cloud kubernetes echo hi; switch to this PR and run sky exec test echo hi
    • sky status -r test
    • sky launch -c test echo hi again
  • Add a smoke test

@Michaelvll Michaelvll marked this pull request as ready for review September 25, 2024 02:01
Collaborator

@romilbhardwaj romilbhardwaj left a comment

This is awesome, thanks @Michaelvll! Tested failover on GKE + local. Left some comments

@Michaelvll Michaelvll mentioned this pull request Sep 26, 2024
Collaborator Author

Michaelvll commented Sep 26, 2024

This should be ready for another look. @romilbhardwaj : )

Future TODOs:

  • Add a dedicated doc for multiple Kubernetes clusters

test = Test(
    'kubernetes-context-failover',
    [
        'sky show-gpus --cloud kubernetes --region kind-skypilot | grep H100 | grep "1, 2, 3, 4, 5, 6, 7, 8"',
Collaborator

Since this test will fail if the sky local up cluster is not set up, can we add a quick error message at this line for the dev running this test? Something along the lines of: "Unable to find mocked GPUs in the sky local up cluster. Please read the instructions for test_kubernetes_context_failover on how to set it up".

Or better yet, automate the sky local up setup :)
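
A minimal sketch of one way to surface such a message; the command wrapping and wording are illustrative, not the PR's final code:

# Wrap the grep so a failure prints a setup hint; the trailing `exit 1`
# keeps the command failing so the smoke test still reports an error.
check_mocked_gpus = (
    'sky show-gpus --cloud kubernetes --region kind-skypilot | grep H100 | '
    'grep "1, 2, 3, 4, 5, 6, 7, 8" || '
    '(echo "Unable to find mocked GPUs in the sky local up cluster. Please '
    'read the instructions for test_kubernetes_context_failover on how to '
    'set it up."; exit 1)'
)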

Collaborator Author

Added. Thanks!

Collaborator Author

@Michaelvll Michaelvll Sep 26, 2024

Automatic local up is a bit scary: when I tried to do it, I realized that we may have multiple tests in the future using the same local k8s cluster, which can cause issues if every test tries to modify that cluster.

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Awesome work @Michaelvll! LGTM.

@Michaelvll Michaelvll added this pull request to the merge queue Sep 26, 2024
Merged via the queue into master with commit 4e46cf4 Sep 26, 2024
20 checks passed
@Michaelvll Michaelvll deleted the multi-k8s-contexts branch September 26, 2024 23:16