[k8s] Fix incluster auth after multi-context support #4014
Conversation
Shall we add such a smoke test?
This would have been caught by
Thanks for fixing this @romilbhardwaj! Just wondering: when we are running things through the controller, do we need to pop the allowed_contexts from the config.yaml being used for the jobs or services as well?
https://github.com/skypilot-org/skypilot/blob/master/sky/utils/controller_utils.py#L359-L370
sky/authentication.py
Outdated
@@ -380,6 +380,11 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
    secret_field_name = clouds.Kubernetes().ssh_key_secret_field_name
    context = config['provider'].get(
        'context', kubernetes_utils.get_current_kube_config_context_name())
    if context == kubernetes_utils.SINGLETON_REGION:
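The body of this diff hunk is truncated on this page. As a purely hypothetical sketch of how such a sentinel check can be used (the helper name below is illustrative, not from the PR), the in-cluster value could be normalized to None so that downstream helpers fall back to the pod's mounted service account:

```python
# Hypothetical sketch, not the PR's actual code: map the in-cluster
# sentinel to None so callers that accept an optional kubeconfig context
# fall back to in-cluster credentials.
from typing import Optional


def normalize_context(context: Optional[str],
                      in_cluster_sentinel: str = 'in-cluster') -> Optional[str]:
    """Return None for the in-cluster sentinel, else the context unchanged."""
    return None if context == in_cluster_sentinel else context
```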
Can we rename this SINGLETON_REGION to something else, such as IN_CLUSTER_CONTEXT?
Also, should we just make the value of SINGLETON_REGION = '_in-cluster' or something like that?
Good idea - updated to in-cluster.
If running from within a cluster, it now looks like this:
(base) sky@test-2ea4-head:~$ sky launch -c wow --cloud kubernetes -- echo hi
Task from command: echo hi
I 09-28 03:39:33 optimizer.py:719] == Optimizer ==
I 09-28 03:39:33 optimizer.py:742] Estimated cost: $0.0 / hour
I 09-28 03:39:33 optimizer.py:742]
I 09-28 03:39:33 optimizer.py:867] Considered resources (1 node):
I 09-28 03:39:33 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-28 03:39:33 optimizer.py:937] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 09-28 03:39:33 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-28 03:39:33 optimizer.py:937] Kubernetes 2CPU--2GB 2 2 - in-cluster 0.00 ✔
I 09-28 03:39:33 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-28 03:39:33 optimizer.py:937]
Launching a new cluster 'wow'. Proceed? [Y/n]:
Thanks @Michaelvll - updated the PR and ran tests again. Tested:
Thanks for the quick fix @romilbhardwaj! LGTM.
sky/clouds/kubernetes.py
Outdated
all_contexts = kubernetes_utils.get_all_kube_config_context_names()
if all_contexts is None:
    return []
return [None]
Ahh, actually, is it possible to directly return a [None] from get_all_kube_config_context_names() when we detect we are actually within a cluster, and still return an empty list of contexts if it is not in a cluster? Otherwise, it seems to mix the two situations (able to access a Kubernetes cluster or not) when all_contexts is None, which feels unintuitive.
I see your point. I've updated get_all_kube_config_context_names() to return [None] if running in-cluster and [] if no kubeconfig is detected, so the caller can distinguish between these two cases.
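For illustration, here is a minimal sketch of the behavior described above, assuming the standard kubernetes Python client (not necessarily SkyPilot's exact implementation): return [None] when only in-cluster auth is available, and [] when neither a kubeconfig nor in-cluster credentials exist.

```python
# Minimal sketch of the described behavior using the standard `kubernetes`
# Python client (an assumption; SkyPilot's real code may differ).
from typing import List, Optional

from kubernetes import config


def get_all_kube_config_context_names() -> List[Optional[str]]:
    try:
        contexts, _ = config.list_kube_config_contexts()
        return [ctx['name'] for ctx in contexts]
    except config.ConfigException:
        # No kubeconfig available. If we are inside a pod with a mounted
        # service account, report a single unnamed in-cluster "context".
        try:
            config.load_incluster_config()
            return [None]
        except config.ConfigException:
            # Neither a kubeconfig nor in-cluster credentials.
            return []
```

This lets callers treat [None] as "use in-cluster auth" and [] as "Kubernetes is not accessible at all".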
sky/utils/controller_utils.py
Outdated
# Remove allowed_contexts from local_user_config since the controller
# may be running in a Kubernetes cluster with in-cluster auth and may
# not have kubeconfig available to it. This is the typical case since
# remote_identity default for Kubernetes is SERVICE_ACCOUNT.
local_user_config.pop('allowed_contexts', None) |
It seems a bit more complicated, since if the controller is running on a cloud, it can be fine and correct to set the allowed_contexts. How about we add a TODO here?
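A hypothetical sketch of the conditional behavior suggested here (the function and parameter names are illustrative, not from the PR): keep allowed_contexts when the controller runs on a cloud VM, and only strip it when the controller itself runs on Kubernetes.

```python
# Hypothetical sketch of the suggested TODO; names are illustrative.
from typing import Any, Dict


def maybe_strip_allowed_contexts(local_user_config: Dict[str, Any],
                                 controller_cloud: str) -> Dict[str, Any]:
    # TODO: only strip allowed_contexts when the controller runs on
    # Kubernetes, where it may rely on in-cluster auth without a kubeconfig.
    # On a cloud VM the kubeconfig can be synced and the setting stays valid.
    if controller_cloud.lower() == 'kubernetes':
        local_user_config.pop('allowed_contexts', None)
    return local_user_config
```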
Thanks @Michaelvll! Re-running the above tests after the changes to verify correctness; will merge after tests pass.
In-cluster auth broke after our recent multi-context support. This is because in-cluster auth does not have any kubeconfig (and consequently does not have any contexts) and relies solely on mounted service accounts.
As a result, the SkyServe controller (and the sky jobs controller) was not able to launch replicas.
This PR fixes it by making the context optional throughout our codebase and using in-cluster auth when no context is detected (see the sketch below).
Note: we probably need to update our docs/elsewhere to mention that multiple contexts may not work when the controller runs in a Kubernetes cluster.
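As a rough illustration of the pattern described above (a sketch assuming the standard kubernetes Python client, not the PR's exact code), a missing context can be treated as "use in-cluster auth":

```python
# Sketch only: context=None means "use in-cluster auth"; otherwise load the
# named kubeconfig context. Assumes the standard `kubernetes` client.
from typing import Optional

from kubernetes import config


def load_k8s_credentials(context: Optional[str]) -> None:
    if context is None:
        # No kubeconfig context: rely on the pod's mounted service account.
        config.load_incluster_config()
    else:
        config.load_kube_config(context=context)
```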
Tested:
- sky serve up -n http http_server.yaml
- sky launch -c test --cloud kubernetes -- echo hi
- allowed_contexts set in config.yaml