[k8s] Support for switching k8s contexts #3913

Merged: 24 commits into master from k8s_multik8s_state2 on Sep 11, 2024

Conversation

romilbhardwaj (Collaborator) commented Sep 5, 2024:

Previously, if the user switched Kubernetes contexts and tried exec/down or other operations on a cluster, those operations would fail with a ClusterIdentityMismatch error.

This PR changes the behavior to allow the user to operate on clusters even if the current-context in the kubeconfig is different from the context that was used to launch the cluster.

This is implemented by storing kubecontext information as a part of the cluster config and making all our Kubernetes API calls context-sensitive. Since this effectively requires switching "identities", we also add a cloud.get_supported_identities method to get a list of all identities SkyPilot can switch to.

I considered a less invasive approach that would not require making every API call context-sensitive: using a Python contextmanager to temporarily set the context globally in adaptors.kubernetes._load_config(). That was quickly discarded because many SkyPilot operations (e.g., down) run in parallel, so a global context is not thread-safe.
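
To make the thread-safety concern concrete, here is a minimal illustrative sketch (not SkyPilot's actual adaptor code; the names use_context, list_pods, and client_for are hypothetical) contrasting the discarded global-context approach with explicit context passing:

from contextlib import contextmanager

_GLOBAL_CONTEXT = None  # process-wide state shared by all threads


@contextmanager
def use_context(context):
    """Discarded approach: temporarily switch a process-global context.

    If two threads (e.g., parallel `sky down` calls on different
    clusters) enter this with different contexts, one thread's
    Kubernetes API calls can silently run against the other's context.
    """
    global _GLOBAL_CONTEXT
    previous, _GLOBAL_CONTEXT = _GLOBAL_CONTEXT, context
    try:
        yield
    finally:
        _GLOBAL_CONTEXT = previous


def list_pods(context, namespace):
    """Adopted approach: every call takes the context explicitly, so
    parallel operations on clusters in different contexts cannot
    interfere with each other."""
    # e.g., return client_for(context).list_namespaced_pod(namespace)
    ...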

TODO:

  • Extensive testing, incl. all kubernetes smoke tests and backward compatibility

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual tests with
  • New smoke test - test_kubernetes_context_switch
  • Other smoke tests and backward compatibility

Michaelvll (Collaborator) left a comment:

Thanks for adding this @romilbhardwaj! Looks mostly good to me, with minor comments on the interface.

@@ -66,7 +59,7 @@ def wrapped(*args, **kwargs):
     return decorated_api


-def _load_config():
+def _load_config(context: str = None):
Michaelvll (Collaborator):

Suggested change:
-def _load_config(context: str = None):
+def _load_config(context: Optional[str] = None):

@@ -1559,6 +1559,7 @@ def check_owner_identity(cluster_name: str) -> None:

     cloud = handle.launched_resources.cloud
     current_user_identity = cloud.get_current_user_identity()
+    supported_user_identities = cloud.get_supported_identities()
Michaelvll (Collaborator):

Should we directly make get_current_user_identity return a list, instead of adding a new API?

romilbhardwaj (Collaborator, Author) replied Sep 6, 2024:

I was a little confused by our current implementation, which is what led me to add this new method.

Currently, we iterate over zip(owner_identity, current_user_identity) to check identities:

# It is OK if the owner identity is shorter, which will happen when
# the cluster is launched before #1808. In that case, we only check
# the same length (zip will stop at the shorter one).
for i, (owner,
        current) in enumerate(zip(owner_identity,
                                  current_user_identity)):
    # Clean up the owner identity for the backslash and newlines, caused
    # by the cloud CLI output, e.g. gcloud.
    owner = owner.replace('\n', '').replace('\\', '')

Say the user had launched the cluster with just one identity configured in their kubeconfig, and thus owner_identity is ["myid1"].

Say they updated their kubeconfig and now they have 3 identities in current_user_identity, the original one being 3rd in the list: ["myid3", "myid2", "myid1"].

zip(owner_identity, current_user_identity) will now yield [('myid1', 'myid3')], which will fail the check we have.

>>> x=["myid1"]
>>> y=["myid3", "myid2", "myid1"]
>>> list(zip(x,y))
[('myid1', 'myid3')]

For k8s, this logic should be changed to use itertools.product instead of zip, but I think that would affect the AWS implementation, which has a stricter ordering requirement to handle [user_id, AccountId].
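
(As a quick illustration of the itertools.product idea with the example above; this is just a sketch, not code from the PR:)

import itertools

owner_identity = ['myid1']
current_user_identity = ['myid3', 'myid2', 'myid1']

# With product, the owner identity matches if it equals *any* currently
# configured identity, regardless of its position in the list.
matches = any(owner == current
              for owner, current in itertools.product(owner_identity,
                                                       current_user_identity))
print(matches)  # True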

Michaelvll (Collaborator) replied Sep 6, 2024:

The reason we have a list for current_user_identity and owner_identity is that it can be fine for a user with a different cloud account but from the same project to operate a VM. For example, we allow the following to match:

owner_identity = ['myaccount1', 'myproject']
current_identity = ['myaccount2', 'myproject']
for i, (owner,
        current) in enumerate(zip(owner_identity, current_identity)):
    ...

In order to support multiple current identities, we can do something like the following:

owner_identity = ['myaccount1', 'myproject']
current_identities = [['myaccount2', 'myproject']]
for available_identity in current_identities:
    for i, (owner,
            current) in enumerate(zip(owner_identity, available_identity)):
        ...

And for k8s, it can be:

owner_identity = ['mycontext']
current_identities = [['mycontext1'], ['mycontext2']]
for available_identity in current_identities:
    for i, (owner,
            current) in enumerate(zip(owner_identity, available_identity)):
        ...
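
For concreteness, a runnable sketch of this check, assuming a hypothetical owner_matches helper (illustrative only, not the merged code):

from typing import List


def owner_matches(owner_identity: List[str],
                  current_identities: List[List[str]]) -> bool:
    """Hypothetical helper: the stored owner identity matches if, for any
    currently available identity, at least one positionally-aligned
    element agrees (e.g., same project even if the account differs)."""
    for available_identity in current_identities:
        # zip stops at the shorter list, so older (shorter) owner
        # identities are still compared correctly.
        if any(owner == current
               for owner, current in zip(owner_identity, available_identity)):
            return True
    return False


# Cloud VM case: different account, same project -> allowed to operate.
assert owner_matches(['myaccount1', 'myproject'],
                     [['myaccount2', 'myproject']])
# Kubernetes case: the owning context just needs to be among the
# contexts currently available in the kubeconfig.
assert owner_matches(['mycontext'], [['mycontext1'], ['mycontext']])
assert not owner_matches(['mycontext'], [['mycontext1'], ['mycontext2']])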

romilbhardwaj (Collaborator, Author):

Thanks @Michaelvll! I've updated our identity logic:

  • Renamed get_current_user_identity to get_active_user_identity
  • Added get_user_identities to fetch all identities.

We can roll these into one method as well (i.e., just have get_user_identities and replace usage of cloud.get_active_user_identity() with cloud.get_user_identities()[0]), but I figured this is a little cleaner.

romilbhardwaj (Collaborator, Author) commented Sep 7, 2024:

Tested:

  • sky launch on context1, switch to context2, launch on context2, run sky exec/queue/logs/cancel on cluster running in context1
  • sky launch on context1, switch to context2, launch on context2, delete context1 from kubeconfig and make sure IdentityError is thrown
  • New smoke test test_kubernetes_context_switch
  • All kubernetes smoke tests

romilbhardwaj (Collaborator, Author):

This PR now also fixes a bug introduced by #3897: if a namespace was passed to kubernetes-port-forward-proxy-command.sh but no context was set (e.g., as happens inside a controller pod when using in-cluster auth), the namespace arg would be incorrectly interpreted as the context arg and ssh would fail.
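
Schematically (a hypothetical illustration of the failure mode, not the script's real interface; the actual fix is in the shell script's argument handling):

def parse_proxy_args(argv):
    """Hypothetical positional parsing: expects [context, namespace].

    If the caller omits the context (e.g., in-cluster auth inside a
    controller pod) and passes only the namespace, the namespace lands
    in the context slot and the port-forward/ssh setup fails.
    """
    context = argv[0] if len(argv) > 0 else None
    namespace = argv[1] if len(argv) > 1 else None
    return context, namespace


print(parse_proxy_args(['my-context', 'my-namespace']))
# ('my-context', 'my-namespace')
print(parse_proxy_args(['my-namespace']))
# ('my-namespace', None)  <- namespace misread as context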

romilbhardwaj (Collaborator, Author) commented Sep 9, 2024:

Backward compatibility is now fixed.

TODO:

  • sky launch from a different context leaks pods in the original context since the cluster owner changes. We may need to add logic to disallow owner change when cluster already exists in state.db.

Michaelvll (Collaborator) left a comment:

Thanks for the fix @romilbhardwaj! Mostly looks good to me.

Comment on lines 419 to 423:

     # Setup service for SSH jump pod. We create the SSH jump service here
     # because we need to know the service IP address and port to set the
     # ssh_proxy_command in the autoscaler config.
-    kubernetes_utils.setup_ssh_jump_svc(ssh_jump_name, namespace,
+    kubernetes_utils.setup_ssh_jump_svc(ssh_jump_name, namespace, context,
                                         service_type)
Michaelvll (Collaborator):

Do we still need a jump pod for k8s? I thought we had removed it.

romilbhardwaj (Collaborator, Author):

It is still used when kubernetes.networking is nodeport (we will deprecate that soon)
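
(For reference, that mode is selected via the user config, roughly like the following; the ~/.sky/config.yaml path and exact key layout are assumed here.)

# Assumed SkyPilot user config (~/.sky/config.yaml) selecting the
# NodePort-based networking mode that still uses the SSH jump pod.
kubernetes:
  networking: nodeport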

Comment on lines 500 to 526:

@classmethod
def get_user_identities(cls) -> Optional[List[List[str]]]:
    """Returns all available user identities of this cloud.

    See get_active_user_identity for definition of user identity.

    This method returns a list of user identities, with the current active
    identity being the first element. Most clouds have only one identity
    available, so the returned list will only have one element: the current
    active identity.

    However, some clouds (e.g., Kubernetes) can have multiple current
    identities (e.g., multiple contexts configured in kubeconfig) that
    can be dynamically switched, so the list can have multiple elements.

    Example return values:
    - AWS: [[UserId, AccountId]]
    - GCP: [[email address + project ID]]
    - Azure: [[email address + subscription ID]]
    - Kubernetes: [[current active context], [context 2], ...]

    Returns:
        None if the cloud does not have a concept of user identity;
        otherwise all the user identities.
    """
    active_identity = cls.get_active_user_identity()
    return [active_identity] if active_identity is not None else None
Michaelvll (Collaborator):

Calling get_active_user_identity within get_user_identities seems a bit weird to me. Should we instead make get_user_identities always return a list of identities with the active one first, and have get_active_user_identity index into the returned list?

With that, we can also add a TODO in the AWS and GCP implementations: return a list of identities from the profile once we support automatically switching contexts.

romilbhardwaj (Collaborator, Author):

I see what you mean, fixed now!
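
For reference, a minimal sketch of the resulting interface (a paraphrase for illustration, not the merged code):

from typing import List, Optional


class Cloud:
    """Illustrative base-class excerpt; method bodies are simplified."""

    @classmethod
    def get_user_identities(cls) -> Optional[List[List[str]]]:
        """All identities this cloud can act as, active identity first.

        Returns None if the cloud has no concept of user identity.
        Clouds with multiple switchable identities (e.g., Kubernetes
        contexts) return several entries; most clouds return one.
        """
        raise NotImplementedError

    @classmethod
    def get_active_user_identity(cls) -> Optional[List[str]]:
        """The currently active identity, i.e., the first entry if any."""
        identities = cls.get_user_identities()
        return identities[0] if identities else None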

# We add a version suffix to the port-forward proxy command to ensure backward
# compatibility and avoid overwriting the older version.
PORT_FORWARD_PROXY_CMD_PATH = ('~/.sky/'
                               'kubernetes-port-forward-proxy-command-v2.sh')
Michaelvll (Collaborator):

nit: we can save the version of the proxy command in constants if we expect it to change in the future

romilbhardwaj (Collaborator, Author):

Ahh yeah, we don't have a constants file for Kubernetes yet, but I've moved it to its own variable for now.

Michaelvll (Collaborator) left a comment:

Thanks @romilbhardwaj for the great effort! LGTM.

(Merge commit …o k8s_multik8s_state2; conflicts: sky/adaptors/kubernetes.py)
romilbhardwaj (Collaborator, Author):

Verified it works with the job controller as well and sky launch works across contexts. Kubernetes smoke tests pass, merging now.

romilbhardwaj added this pull request to the merge queue on Sep 11, 2024.
Merged via the queue into master with commit bad7dab on Sep 11, 2024; 20 checks passed.
romilbhardwaj deleted the k8s_multik8s_state2 branch on September 11, 2024 at 05:46.