[k8s] Support for switching k8s contexts #3913

Merged: 24 commits into master from k8s_multik8s_state2 on Sep 11, 2024

Conversation

romilbhardwaj (Collaborator) commented Sep 5, 2024:

Previously, if the user switched Kubernetes contexts and tried exec/down or other operations on a cluster, those operations would fail with a ClusterIdentityMismatch error.

This PR changes the behavior to allow the user to operate on clusters even if the current-context in the kubeconfig is different from the context that was used to launch the cluster.

This is implemented by storing kubecontext information as a part of the cluster config and making all our Kubernetes API calls context-sensitive. Since this effectively requires switching "identities", we also add a cloud.get_supported_identities method to get a list of all identities SkyPilot can switch to.

I considered a less invasive approach that would not require making every API call context-sensitive: using a Python contextmanager to temporarily set the context globally in adaptors.kubernetes._load_config(). That was quickly discarded because many SkyPilot operations (e.g., down) run in parallel, so a global context is not thread-safe.
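
To make the thread-safety concern concrete, here is a minimal illustrative sketch (not SkyPilot's actual adaptor code; the names use_context, list_pods, and client_for are hypothetical) contrasting the discarded global-context approach with explicit context passing:

from contextlib import contextmanager

_GLOBAL_CONTEXT = None  # process-wide state shared by all threads


@contextmanager
def use_context(context):
    """Discarded approach: temporarily switch a process-global context.

    If two threads (e.g., parallel `sky down` calls on different
    clusters) enter this with different contexts, one thread's
    Kubernetes API calls can silently run against the other's context.
    """
    global _GLOBAL_CONTEXT
    previous, _GLOBAL_CONTEXT = _GLOBAL_CONTEXT, context
    try:
        yield
    finally:
        _GLOBAL_CONTEXT = previous


def list_pods(context, namespace):
    """Adopted approach: every call takes the context explicitly, so
    parallel operations on clusters in different contexts cannot
    interfere with each other."""
    # e.g., return client_for(context).list_namespaced_pod(namespace)
    ...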

TODO:

  • Extensive testing, incl. all kubernetes smoke tests and backward compatibility

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual tests with
  • New smoke test - test_kubernetes_context_switch
  • Other smoke tests and backward compatibility

Michaelvll (Collaborator) left a comment:

Thanks for adding this @romilbhardwaj! Looks mostly good to me, with minor comments on the interface.

@@ -66,7 +59,7 @@ def wrapped(*args, **kwargs):
     return decorated_api


-def _load_config():
+def _load_config(context: str = None):
Michaelvll (Collaborator):

Suggested change:
-def _load_config(context: str = None):
+def _load_config(context: Optional[str] = None):

@@ -1559,6 +1559,7 @@ def check_owner_identity(cluster_name: str) -> None:

     cloud = handle.launched_resources.cloud
     current_user_identity = cloud.get_current_user_identity()
+    supported_user_identities = cloud.get_supported_identities()
Michaelvll (Collaborator):

Should we directly make get_current_user_identity return a list, instead of adding a new API?

romilbhardwaj (Collaborator, Author) replied Sep 6, 2024:

I was a little confused by our current implementation, which is what led me to add this new method.

Currently, we iterate over zip(owner_identity, current_user_identity) to check identities:

# It is OK if the owner identity is shorter, which will happen when
# the cluster is launched before #1808. In that case, we only check
# the same length (zip will stop at the shorter one).
for i, (owner,
        current) in enumerate(zip(owner_identity,
                                  current_user_identity)):
    # Clean up the owner identity for the backslash and newlines, caused
    # by the cloud CLI output, e.g. gcloud.
    owner = owner.replace('\n', '').replace('\\', '')

Say the user had launched the cluster with just one identity configured in their kubeconfig, and thus owner_identity is ["myid1"].

Say they updated their kubeconfig and now they have 3 identities in current_user_identity, the original one being 3rd in the list: ["myid3", "myid2", "myid1"].

zip(owner_identity, current_user_identity) will now yield [('myid1', 'myid3')], which will fail the check we have.

>>> x=["myid1"]
>>> y=["myid3", "myid2", "myid1"]
>>> list(zip(x,y))
[('myid1', 'myid3')]

For k8s, this logic should be changed to use itertools.product instead of zip, but I think that would affect the AWS implementation, which has a stricter ordering requirement to handle [user_id, AccountId].
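
(As a quick illustration of the itertools.product idea with the example above; this is just a sketch, not code from the PR:)

import itertools

owner_identity = ['myid1']
current_user_identity = ['myid3', 'myid2', 'myid1']

# With product, the owner identity matches if it equals *any* currently
# configured identity, regardless of its position in the list.
matches = any(owner == current
              for owner, current in itertools.product(owner_identity,
                                                       current_user_identity))
print(matches)  # True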

Michaelvll (Collaborator) replied Sep 6, 2024:

The reason we have a list for current_user_identity and owner_identity is that it can be fine for a user with a different cloud account but from the same project to operate a VM. For example, we allow the following to match:

owner_identity = ['myaccount1', 'myproject']
current_identity = ['myaccount2', 'myproject']
for i, (owner,
        current) in enumerate(zip(owner_identity, current_identity)):
    ...

In order to support multiple current identities, we can do something like the following:

owner_identity = ['myaccount1', 'myproject']
current_identities = [['myaccount2', 'myproject']]
for available_identity in current_identities:
    for i, (owner,
            current) in enumerate(zip(owner_identity, available_identity)):
        ...

And for k8s, it can be:

owner_identity = ['mycontext']
current_identities = [['mycontext1'], ['mycontext2']]
for available_identity in current_identities:
    for i, (owner,
            current) in enumerate(zip(owner_identity, available_identity)):
        ...
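
For concreteness, a runnable sketch of this check, assuming a hypothetical owner_matches helper (illustrative only, not the merged code):

from typing import List


def owner_matches(owner_identity: List[str],
                  current_identities: List[List[str]]) -> bool:
    """Hypothetical helper: the stored owner identity matches if, for any
    currently available identity, at least one positionally-aligned
    element agrees (e.g., same project even if the account differs)."""
    for available_identity in current_identities:
        # zip stops at the shorter list, so older (shorter) owner
        # identities are still compared correctly.
        if any(owner == current
               for owner, current in zip(owner_identity, available_identity)):
            return True
    return False


# Cloud VM case: different account, same project -> allowed to operate.
assert owner_matches(['myaccount1', 'myproject'],
                     [['myaccount2', 'myproject']])
# Kubernetes case: the owning context just needs to be among the
# contexts currently available in the kubeconfig.
assert owner_matches(['mycontext'], [['mycontext1'], ['mycontext']])
assert not owner_matches(['mycontext'], [['mycontext1'], ['mycontext2']])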

romilbhardwaj (Collaborator, Author):

Thanks @Michaelvll! I've updated our identity logic:

  • Renamed get_current_user_identity to get_active_user_identity
  • Added get_user_identities to fetch all identities.

We can roll these into one method as well (i.e., just have get_user_identities and replace usage of cloud.get_active_user_identity() with cloud.get_user_identities()[0]), but I figured this is a little cleaner.

romilbhardwaj (Collaborator, Author) commented Sep 7, 2024:

Tested:

  • sky launch on context1, switch to context2, launch on context2, run sky exec/queue/logs/cancel on cluster running in context1
  • sky launch on context1, switch to context2, launch on context2, delete context1 from kubeconfig and make sure IdentityError is thrown
  • New smoke test test_kubernetes_context_switch
  • All kubernetes smoke tests

romilbhardwaj (Collaborator, Author):

This PR now also fixes a bug introduced by #3897: if a namespace was passed to kubernetes-port-forward-proxy-command.sh but no context was set (e.g., as happens inside a controller pod when using in-cluster auth), the namespace arg would be incorrectly interpreted as the context arg and ssh would fail.
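
Schematically (a hypothetical illustration of the failure mode, not the script's real interface; the actual fix is in the shell script's argument handling):

def parse_proxy_args(argv):
    """Hypothetical positional parsing: expects [context, namespace].

    If the caller omits the context (e.g., in-cluster auth inside a
    controller pod) and passes only the namespace, the namespace lands
    in the context slot and the port-forward/ssh setup fails.
    """
    context = argv[0] if len(argv) > 0 else None
    namespace = argv[1] if len(argv) > 1 else None
    return context, namespace


print(parse_proxy_args(['my-context', 'my-namespace']))
# ('my-context', 'my-namespace')
print(parse_proxy_args(['my-namespace']))
# ('my-namespace', None)  <- namespace misread as context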

romilbhardwaj (Collaborator, Author) commented Sep 9, 2024:

Backward compatibility is now fixed.

TODO:

  • sky launch from a different context leaks pods in the original context since the cluster owner changes. We may need to add logic to disallow owner change when cluster already exists in state.db.

Michaelvll (Collaborator) left a comment:

Thanks for the fix @romilbhardwaj! Mostly looks good to me.

Comment on lines 419 to 423:

     # Setup service for SSH jump pod. We create the SSH jump service here
     # because we need to know the service IP address and port to set the
     # ssh_proxy_command in the autoscaler config.
-    kubernetes_utils.setup_ssh_jump_svc(ssh_jump_name, namespace,
+    kubernetes_utils.setup_ssh_jump_svc(ssh_jump_name, namespace, context,
                                         service_type)
Michaelvll (Collaborator):

Do we still need a jump pod for k8s? I thought we had removed it.

romilbhardwaj (Collaborator, Author):

It is still used when kubernetes.networking is nodeport (we will deprecate that soon)
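
(For reference, that mode is selected via the user config, roughly like the following; the ~/.sky/config.yaml path and exact key layout are assumed here.)

# Assumed SkyPilot user config (~/.sky/config.yaml) selecting the
# NodePort-based networking mode that still uses the SSH jump pod.
kubernetes:
  networking: nodeport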

Comment on lines 500 to 526:

@classmethod
def get_user_identities(cls) -> Optional[List[List[str]]]:
    """Returns all available user identities of this cloud.

    See get_active_user_identity for definition of user identity.

    This method returns a list of user identities, with the current active
    identity being the first element. Most clouds have only one identity
    available, so the returned list will only have one element: the current
    active identity.

    However, some clouds (e.g., Kubernetes) can have multiple current
    identities (e.g., multiple contexts configured in kubeconfig) that
    can be dynamically switched, so the list can have multiple elements.

    Example return values:
    - AWS: [[UserId, AccountId]]
    - GCP: [[email address + project ID]]
    - Azure: [[email address + subscription ID]]
    - Kubernetes: [[current active context], [context 2], ...]

    Returns:
        None if the cloud does not have a concept of user identity;
        otherwise all the user identities.
    """
    active_identity = cls.get_active_user_identity()
    return [active_identity] if active_identity is not None else None
Michaelvll (Collaborator):

Calling get_active_user_identity within get_user_identities seems a bit weird to me. Should we instead make get_user_identities always return a list of identities with the active one first, and have get_active_user_identity index into the returned list?

With that, we can also add a TODO in the AWS and GCP implementations: return a list of identities from the profile once we support automatically switching contexts.

romilbhardwaj (Collaborator, Author):

I see what you mean, fixed now!
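
For reference, a minimal sketch of the resulting interface (a paraphrase for illustration, not the merged code):

from typing import List, Optional


class Cloud:
    """Illustrative base-class excerpt; method bodies are simplified."""

    @classmethod
    def get_user_identities(cls) -> Optional[List[List[str]]]:
        """All identities this cloud can act as, active identity first.

        Returns None if the cloud has no concept of user identity.
        Clouds with multiple switchable identities (e.g., Kubernetes
        contexts) return several entries; most clouds return one.
        """
        raise NotImplementedError

    @classmethod
    def get_active_user_identity(cls) -> Optional[List[str]]:
        """The currently active identity, i.e., the first entry if any."""
        identities = cls.get_user_identities()
        return identities[0] if identities else None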

# We add a version suffix to the port-forward proxy command to ensure backward
# compatibility and avoid overwriting the older version.
PORT_FORWARD_PROXY_CMD_PATH = ('~/.sky/'
                               'kubernetes-port-forward-proxy-command-v2.sh')
Michaelvll (Collaborator):

nit: we can save the version of the proxy command in constants if we expect it to change in the future

romilbhardwaj (Collaborator, Author):

Ahh yeah, we don't have a constants file for Kubernetes yet, but I've moved it to its own variable for now.

Michaelvll (Collaborator) left a comment:

Thanks @romilbhardwaj for the great effort! LGTM.

(Merge commit …o k8s_multik8s_state2; conflicts: sky/adaptors/kubernetes.py)
romilbhardwaj (Collaborator, Author):

Verified it works with the job controller as well and sky launch works across contexts. Kubernetes smoke tests pass, merging now.

romilbhardwaj added this pull request to the merge queue on Sep 11, 2024.
Merged via the queue into master with commit bad7dab on Sep 11, 2024; 20 checks passed.
romilbhardwaj deleted the k8s_multik8s_state2 branch on September 11, 2024 at 05:46.